
# 6.3.5 - Deployment of QA Models

This notebook covers QA model deployment strategies for both cloud and edge environments.

Topics covered:
- Exporting a Hugging Face model to ONNX
- Using AWS Lambda for serverless inference
- CPU-efficient deployment with BitNet (mock example)


In [None]:

!pip install transformers torch onnx onnxruntime


## Export Hugging Face QA Model to ONNX

In [None]:

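from pathlib import Path

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
# Note: transformers.onnx is deprecated in recent releases in favor of Optimum.
from transformers.onnx import FeaturesManager, export

# Use a checkpoint fine-tuned on SQuAD; the base checkpoint has an untrained QA head.
model_name = "distilbert-base-uncased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

onnx_path = Path("onnx_qa")
onnx_path.mkdir(exist_ok=True)

# Look up the ONNX export configuration for the question-answering feature.
_, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(model, feature="question-answering")
onnx_config = onnx_config_cls(model.config)

# export() requires the ONNX config and opset in addition to the output path.
export(tokenizer, model, onnx_config, onnx_config.default_onnx_opset, onnx_path / "qa_model.onnx")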


## Run Inference with ONNX Runtime

In [None]:

import onnxruntime as ort

session = ort.InferenceSession("onnx_qa/qa_model.onnx")

question = "What is the capital of Italy?"
context = "Italy is a European country. Rome is the capital of Italy."

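# Tokenize to NumPy arrays; ONNX Runtime expects int64 inputs.
inputs = tokenizer(question, context, return_tensors="np", padding="max_length", truncation=True, max_length=384)
outputs = session.run(None, {k: v.astype("int64") for k, v in inputs.items()})
# outputs[0] and outputs[1] hold the start and end logits, each of shape (1, seq_len).
start, end = int(outputs[0][0].argmax()), int(outputs[1][0].argmax())
# The end index is inclusive, so extend the slice by one token.
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print("ONNX Answer:", answer)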
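
Taking the start and end argmax positions independently can yield an end index that precedes the start index. A common alternative is to score every pair with start <= end and keep the best one. The following is a minimal sketch in plain Python; `best_span` and `max_answer_len` are illustrative names and not part of the notebook's code.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the inclusive (start, end) pair that maximizes
    start_logits[s] + end_logits[e] subject to s <= e < s + max_answer_len."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start position.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy logits: the independent argmaxes would give start=3, end=1 (invalid).
start_logits = [0.1, 0.2, 0.3, 5.0, 0.0]
end_logits = [0.0, 4.0, 0.1, 0.2, 3.0]
print(best_span(start_logits, end_logits))  # (3, 4)
```

With the ONNX Runtime outputs, the helper could be called as `best_span(outputs[0][0].tolist(), outputs[1][0].tolist())`, and the answer tokens sliced with `start:end + 1`.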



## AWS Lambda Deployment Overview

You can package the QA model and deploy it on AWS Lambda as follows:
1. Convert the model to TorchScript or ONNX
2. Bundle the model and the inference script in a zip archive (or a container image if the bundle is too large for a zip deployment)
3. Create a Lambda function with enough memory to hold the model
4. Expose the function through API Gateway
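
The bundling step can be sketched with the standard-library `zipfile` module. This is an illustration only: `build_lambda_zip` is a hypothetical helper, and file names such as `handler.py` are assumptions, not files defined in this notebook.

```python
import zipfile
from pathlib import Path


def build_lambda_zip(files, zip_path):
    """Write the given files to the root of a zip archive and
    return their total uncompressed size in bytes."""
    total = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in map(Path, files):
            # arcname=f.name drops parent directories so files sit at the archive root.
            zf.write(f, arcname=f.name)
            total += f.stat().st_size
    return total
```

A call might look like `build_lambda_zip(["handler.py", "onnx_qa/qa_model.onnx"], "qa_lambda.zip")`. The returned size is useful for checking the bundle against Lambda's unzipped package limit (250 MB at the time of writing). Runtime dependencies such as `onnxruntime` also need to be in the archive or in a layer.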

**Key Python handler snippet:**
```python
def lambda_handler(event, context):
    question = event['question']
    context_str = event['context']
    # load tokenizer/model and return answer span
```
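
A fuller sketch of the handler, assuming API Gateway proxy integration, where the request payload arrives as a JSON string under `body`. The `load_model` function is a hypothetical stub standing in for the tokenizer and ONNX Runtime session setup. Caching the loaded model in a module-level variable lets warm invocations skip the load.

```python
import json

_MODEL = None  # cached across warm invocations of the same Lambda container


def load_model():
    """Hypothetical loader. A real function would build the tokenizer and
    ONNX Runtime session here; this stub returns a fixed answer."""
    def answer(question, context_str):
        return "stub answer"
    return answer


def lambda_handler(event, context):
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()  # load once, reuse on later calls

    # Proxy integration passes the payload as a JSON string in "body";
    # a direct invocation passes the fields at the top level.
    payload = json.loads(event["body"]) if isinstance(event.get("body"), str) else event

    question = payload.get("question")
    context_str = payload.get("context")
    if not question or not context_str:
        return {"statusCode": 400, "body": json.dumps({"error": "question and context are required"})}

    answer = _MODEL(question, context_str)
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```

It can be exercised locally without AWS, for example `lambda_handler({"question": "q", "context": "c"}, None)`.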



## BitNet Deployment on CPU

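BitNet b1.58 models use ternary weights (-1, 0, +1), which works out to about 1.58 bits per weight, and are designed to run efficiently on CPU. The bitnet.cpp runtime loads them from GGUF files.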
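
The "1.58-bit" figure comes from storing each weight as one of three values, since log2(3) ≈ 1.58. Below is a minimal sketch of ternary quantization following the absmean scheme described for BitNet b1.58, written in plain Python for illustration. `absmean_ternary` is a hypothetical helper; real implementations operate on tensors and pack the ternary values into a compact format.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a list of floats to values in {-1, 0, 1} plus one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    quantized = []
    for w in weights:
        q = round(w / (scale + eps))
        quantized.append(max(-1, min(1, q)))  # clip to the ternary range
    return quantized, scale


weights = [0.9, -0.05, -1.4, 0.3, 0.0]
q, scale = absmean_ternary(weights)
print(q)  # [1, 0, -1, 1, 0]
```

Each original weight is then approximated by `q_i * scale`, so matrix multiplications reduce mostly to additions and subtractions.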

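### CLI Steps (illustrative):
```bash
# Script names and flags below are illustrative and may differ from the actual tooling;
# check the BitNet repository for the current commands.
python convert_bitnet_to_gguf.py
quantize --qtype Q1_58
./bitnet -m qa_model.gguf -p "Q: Who wrote Hamlet?"
```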

Because BitNet models need a custom format and runtime, the cell below uses a small GPT-2 checkpoint as a stand-in for demonstration.


In [None]:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Tiny test checkpoint used as a mock for BitNet; its output is not expected to be meaningful.
model_name = "sshleifer/tiny-gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Q: What is the speed of light? Context: Light travels at 299,792,458 meters per second."
inputs = tokenizer(prompt, return_tensors="pt")
# Pass the attention mask and set pad_token_id explicitly, since GPT-2 defines no pad token.
output_ids = model.generate(**inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
print("BitNet-mock Answer:", tokenizer.decode(output_ids[0], skip_special_tokens=True))
