
# Quantizing and Exporting DistilBERT for Mobile and Edge Deployment

This notebook shows how to prepare a Small Language Model (SLM), here a fine-tuned DistilBERT token classifier, for deployment. Dynamic quantization reduces model size and CPU inference latency, and exporting to ONNX format produces a portable model that can run on mobile devices and edge computing platforms.
    

## Setup Environment

In [None]:

!pip install torch transformers onnx onnxruntime
    

## Load and Quantize DistilBERT Model

In [None]:

import torch
from transformers import DistilBertForTokenClassification, DistilBertTokenizer

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load the fine-tuned DistilBERT model
model = DistilBertForTokenClassification.from_pretrained('./results')

# Set model to evaluation mode
model.eval()

# Apply dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

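# Save the quantized weights. save_pretrained() expects plain tensors and can
# fail on the packed int8 parameters, so store the state dict with torch.save.
# To reload: rebuild the float model, re-apply quantize_dynamic, then load_state_dict
torch.save(quantized_model.state_dict(), 'quantized_model_state.pt')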

print("Quantization complete. Model saved.")
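
To make the int8 step concrete, the sketch below quantizes a handful of weights by hand using a symmetric scheme with one shared scale. It is a simplified plain-Python illustration, not PyTorch's actual kernel; the helper names `quantize_int8` and `dequantize` and the sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to int8 codes with one shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights)
    # One int8 step covers `scale` units of the float range; guard all-zero input
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.4f}, max error={max_error:.5f}")
```

Each weight now needs 8 bits instead of 32, which is where the size reduction for the Linear layers comes from; the price is the rounding error printed above.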
    

## Test Quantized Model

In [None]:

# Tokenize sample input
inputs = tokenizer("Apple is looking at buying U.K. startup for $1 billion", return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = quantized_model(**inputs)

print("Model Output:", outputs)
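
The raw output holds per-token logits with shape (batch, sequence length, number of labels). A minimal sketch of turning those scores into tags: pick the highest-scoring label index for each token and look it up in a label map. The logits and the `id2label` mapping below are made-up stand-ins; with the real model they would presumably come from `outputs.logits[0].tolist()` and `quantized_model.config.id2label`.

```python
def decode_tokens(token_logits, id2label):
    """Return the highest-scoring label for each token."""
    labels = []
    for scores in token_logits:
        # Index of the largest logit for this token
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(id2label[best])
    return labels


# Hypothetical label set and logits for three tokens
id2label = {0: "O", 1: "B-ORG", 2: "B-LOC"}
token_logits = [
    [0.1, 2.3, -0.5],
    [1.9, 0.2, 0.4],
    [-0.3, 0.0, 1.1],
]
predicted = decode_tokens(token_logits, id2label)
print(predicted)
```

The tokenizer adds special tokens and splits words into wordpieces, so the labels line up with `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])` rather than with the words of the original sentence.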
    

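## Export Model to ONNX

In [None]:

import torch
from transformers import DistilBertForTokenClassification

# Load the fine-tuned checkpoint for export. from_pretrained() always builds the
# float architecture and cannot restore dynamically quantized Linear modules,
# and torch.onnx.export has limited support for them
model = DistilBertForTokenClassification.from_pretrained('./results')

# Set the model to evaluation mode
model.eval()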

# Create a dummy input of random token IDs for tracing (batch size=1, seq length=10)
dummy_input = torch.randint(0, tokenizer.vocab_size, (1, 10), dtype=torch.long)

# Export to ONNX format
torch.onnx.export(
    model, 
    dummy_input, 
    "distilbert_ner_model.onnx", 
    opset_version=14,  # recent PyTorch/Transformers attention ops may need opset 14 or later
    input_names=['input_ids'], 
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'sequence_length'},
        'logits': {0: 'batch_size', 1: 'sequence_length'},
    },
)

print("Model successfully exported to ONNX format.")
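
The exported file can also be quantized on the ONNX side, since ONNX Runtime ships its own post-export quantizer. The sketch below wraps `onnxruntime.quantization.quantize_dynamic` in a hypothetical helper, `quantize_onnx_model`, and reports the resulting size ratio. The output file name is an assumption, and accuracy should be re-checked after this step.

```python
from pathlib import Path


def quantize_onnx_model(src_path, dst_path):
    """Write an int8-weight copy of an ONNX model; return dst/src size ratio."""
    # Imported here so the helper can be defined even where onnxruntime is absent
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(src_path, dst_path, weight_type=QuantType.QInt8)
    return Path(dst_path).stat().st_size / Path(src_path).stat().st_size


src = "distilbert_ner_model.onnx"
dst = "distilbert_ner_model.int8.onnx"  # hypothetical output name

# Only run when the exported model from the cell above is present
if Path(src).exists():
    ratio = quantize_onnx_model(src, dst)
    print(f"Quantized model is {ratio:.0%} of the original size")
```

Comparing file sizes this way gives a direct measurement of the size reduction instead of relying on a rule of thumb.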
    

## Verify ONNX Model with ONNX Runtime

In [None]:

import onnxruntime as ort

# Load ONNX model with ONNX Runtime, naming the execution provider explicitly
onnx_session = ort.InferenceSession(
    "distilbert_ner_model.onnx", providers=["CPUExecutionProvider"]
)

# Prepare input for ONNX Runtime
onnx_inputs = {'input_ids': dummy_input.numpy()}

# Perform inference
onnx_outputs = onnx_session.run(None, onnx_inputs)

print("ONNX Model Output:", onnx_outputs)
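
Printing the output confirms that the session runs, but not that it agrees with PyTorch. One way to check is to compare the two sets of logits element by element. The helper below is a plain-Python sketch; the name `max_abs_diff`, the stand-in values, and the tolerance are all assumptions.

```python
def flatten(values):
    """Yield the scalars of an arbitrarily nested list."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield v


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two nested lists."""
    flat_a, flat_b = list(flatten(a)), list(flatten(b))
    if len(flat_a) != len(flat_b):
        raise ValueError("outputs have different numbers of elements")
    return max(abs(x - y) for x, y in zip(flat_a, flat_b))


# Stand-in values; in the notebook these would be the PyTorch logits, e.g.
# model(dummy_input).logits.detach().tolist(), and onnx_outputs[0].tolist()
reference = [[[0.12, -1.50], [2.00, 0.33]]]
candidate = [[[0.12, -1.50], [2.00, 0.3301]]]

diff = max_abs_diff(reference, candidate)
tolerance = 1e-3  # assumed threshold; tighten or loosen as needed
print(f"max abs diff = {diff:.6f}, within tolerance: {diff <= tolerance}")
```

Small differences are expected because the two runtimes use different kernels; a large gap usually points to a problem in the export.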
    

## Performance Comparison (Optional Benchmarking)

In [None]:

import time

# Measure inference time for the PyTorch quantized model
# (perf_counter is a monotonic, high-resolution clock suited to short intervals)
start_time = time.perf_counter()
with torch.no_grad():
    _ = quantized_model(dummy_input)
pytorch_inference_time = time.perf_counter() - start_time

# Measure inference time for the ONNX model
start_time = time.perf_counter()
_ = onnx_session.run(None, onnx_inputs)
onnx_inference_time = time.perf_counter() - start_time

print(f"PyTorch Quantized Inference Time: {pytorch_inference_time:.6f} seconds")
print(f"ONNX Inference Time: {onnx_inference_time:.6f} seconds")
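
A single timed call is noisy and includes one-time start-up costs. A more stable measurement uses a few warm-up calls followed by repeated runs, and reports the median. The `benchmark` helper below is a sketch with assumed defaults (3 warm-up calls, 20 timed runs) and a stand-in workload.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Return the median wall-clock time of fn() in seconds."""
    # Warm-up calls absorb one-time costs such as lazy initialization
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; in the notebook this would wrap the model calls, e.g.
# benchmark(lambda: onnx_session.run(None, onnx_inputs))
calls = []
median_s = benchmark(lambda: calls.append(sum(range(1000))))
print(f"median: {median_s * 1e6:.1f} microseconds over 20 runs")
```

The median is less sensitive than the mean to occasional slow runs caused by other processes on the machine.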
    


## Conclusion

This notebook walked through the workflow from quantizing a fine-tuned DistilBERT model to exporting it as an ONNX model for resource-constrained environments such as mobile devices and edge platforms. Dynamic quantization stores the Linear-layer weights in 8 bits instead of 32, which shrinks those layers roughly fourfold and typically speeds up CPU inference, at the cost of a small accuracy loss that should be checked on validation data. That trade-off makes it well suited to real-time Named Entity Recognition (NER) applications.
    