# Week 12 Lab: Model Optimization for Edge Deployment

**CS 203: Software Tools and Techniques for AI**

---

## Lab Overview

In this lab, you will learn to:
1. **Quantize models** to shrink quantized weights by up to 4x (FP32 to INT8)
2. **Prune models** to remove redundant weights
3. **Export to ONNX** for cross-platform deployment
4. **Benchmark** latency and accuracy trade-offs

**Goal**: Optimize a model for deployment on resource-constrained devices.

---

## Setup

In [None]:
# Install required packages
!pip install torch torchvision onnx onnxruntime numpy matplotlib pandas

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models
import numpy as np
import time
import os
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

---

# Part 1: Baseline Measurement

Before optimizing, we need to measure what we're starting with.

```
┌──────────────────────────────────────────────────────────┐
│                  Optimization Pipeline                   │
│                                                          │
│  Original      Quantize     Prune       ONNX Export     │
│   Model   ──►  (4x smaller) ──►  (faster) ──►  (portable)│
│                                                          │
│  45 MB    ──►   11 MB      ──►   8 MB    ──►   Deploy!  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### Question 1.1 (Solved): Load and Measure Baseline Model

In [None]:
# SOLVED EXAMPLE

# Load pre-trained ResNet-18
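# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)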
model.eval()

# Save model to measure size
torch.save(model.state_dict(), 'resnet18_fp32.pth')
size_mb = os.path.getsize('resnet18_fp32.pth') / (1024 * 1024)
print(f"Model size: {size_mb:.2f} MB")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

### Question 1.2 (Solved): Measure Inference Latency

In [None]:
# SOLVED EXAMPLE

def benchmark_model(model, input_shape=(1, 3, 224, 224), num_runs=100):
    """Benchmark model inference latency."""
    input_data = torch.randn(input_shape)
    
    # Warmup
    for _ in range(10):
        with torch.no_grad():
            _ = model(input_data)
    
    # Benchmark
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        with torch.no_grad():
            _ = model(input_data)
        times.append(time.perf_counter() - start)
    
    return {
        'mean_ms': np.mean(times) * 1000,
        'std_ms': np.std(times) * 1000,
        'p95_ms': np.percentile(times, 95) * 1000
    }

# Run benchmark
baseline_metrics = benchmark_model(model)
print(f"Mean latency: {baseline_metrics['mean_ms']:.2f} ms")
print(f"Std latency: {baseline_metrics['std_ms']:.2f} ms")
print(f"P95 latency: {baseline_metrics['p95_ms']:.2f} ms")

### Question 1.3: Record Your Baseline

Fill in your baseline measurements:

In [None]:
# YOUR BASELINE MEASUREMENTS
baseline = {
    'model_size_mb': None,      # Fill in from above
    'total_params': None,       # Fill in from above
    'mean_latency_ms': None,    # Fill in from above
    'p95_latency_ms': None      # Fill in from above
}

print("Baseline metrics:")
for key, value in baseline.items():
    print(f"  {key}: {value}")

---

# Part 2: Quantization

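Quantization stores weights as INT8 instead of FP32, so each quantized weight takes 1 byte instead of 4. The saved model shrinks by up to 4x, depending on how many of its layers are actually quantized.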
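
To make the 4-bytes-to-1-byte arithmetic concrete, the sketch below applies affine INT8 quantization to a random weight matrix in NumPy. It is an illustration only, not how PyTorch implements quantization internally; the helper names `quantize_int8` and `dequantize` and the 512x1000 shape are assumptions chosen for this example.

```python
import numpy as np

def quantize_int8(x):
    """Affine-quantize a float32 array to int8. Returns (q, scale, zero_point)."""
    # Widen the range to include 0 so that zero maps to an exact integer.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / 255.0 or 1.0  # 256 int8 levels span the range
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(512, 1000)).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

size_ratio = weights.nbytes / q.nbytes
max_err = float(np.abs(weights - restored).max())
print(f"FP32 bytes: {weights.nbytes}, INT8 bytes: {q.nbytes}, ratio: {size_ratio:.1f}x")
print(f"scale: {scale:.6f}, max round-trip error: {max_err:.6f}")
```

The round-trip error stays on the order of one quantization step (`scale`), which is why INT8 weights often cost little accuracy.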

## 2.1 Dynamic Quantization

### Question 2.1 (Solved): Apply Dynamic Quantization

In [None]:
# SOLVED EXAMPLE

# Load fresh model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

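# Apply dynamic quantization.
# Note: only nn.Linear layers are quantized here. ResNet-18 has a single
# Linear layer (the final `fc`), so expect a compression ratio well below 4x;
# the Conv2d layers, which hold most of the weights, stay in FP32.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},  # Layer types to quantize
    dtype=torch.qint8
)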

# Save and measure size
torch.save(quantized_model.state_dict(), 'resnet18_int8_dynamic.pth')
quant_size_mb = os.path.getsize('resnet18_int8_dynamic.pth') / (1024 * 1024)

print(f"Quantized model size: {quant_size_mb:.2f} MB")
print(f"Compression ratio: {size_mb / quant_size_mb:.2f}x")

### Question 2.2: Benchmark Quantized Model

Measure the latency of the quantized model and compare to baseline.

In [None]:
# YOUR CODE HERE

# Benchmark the quantized model

# Print comparison with baseline


### Question 2.3: Static Quantization (Bonus)

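Static quantization quantizes activations as well as weights. It runs a small calibration dataset through the model to record activation ranges, then fixes the quantization parameters ahead of time, which usually gives better accuracy and speed than dynamic quantization.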
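
The calibration idea can be sketched without any quantization backend. The NumPy example below records the min/max of activations over a few calibration batches and derives a fixed scale and zero-point from them. `RangeTracker` is a hypothetical helper written for this illustration, not a PyTorch class, and the random ReLU-like batches stand in for real calibration data.

```python
import numpy as np

class RangeTracker:
    """Hypothetical helper: tracks the running min/max of observed activations."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch):
        self.min_val = min(self.min_val, float(batch.min()))
        self.max_val = max(self.max_val, float(batch.max()))

    def qparams(self, qmin=0, qmax=255):
        # Widen the range to include 0 so that zero maps to an exact integer.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = int(round(qmin - lo / scale))
        return scale, zero_point

rng = np.random.default_rng(0)
# Stand-in for real calibration data: 8 batches of ReLU-like activations.
calibration_batches = [np.maximum(rng.normal(size=(32, 64)), 0.0) for _ in range(8)]

tracker = RangeTracker()
for batch in calibration_batches:
    tracker.observe(batch)

scale, zero_point = tracker.qparams()

# At inference time the fixed parameters are reused for every new input.
new_batch = np.maximum(rng.normal(size=(32, 64)), 0.0)
q = np.clip(np.round(new_batch / scale) + zero_point, 0, 255).astype(np.uint8)
print(f"scale={scale:.5f}, zero_point={zero_point}, dtype={q.dtype}")
```

Because the parameters are fixed after calibration, inputs outside the observed range get clipped, so the calibration data should resemble what the model will see in deployment.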

In [None]:
# BONUS: Implement static quantization
# YOUR CODE HERE


---

# Part 3: Pruning

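Pruning sets low-importance weights to zero. Zeroed weights are still stored in a dense tensor, so file size and latency improve only when combined with sparse storage, compression, or kernels and hardware that exploit sparsity.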
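
As a small illustration of that point, the sketch below prunes a toy `nn.Linear` layer with `torch.nn.utils.prune` and shows that the weight tensor keeps its dense size. The 20x10 layer and the 50% amount are arbitrary choices for this example.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(20, 10)  # toy layer with 200 weights

# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# While pruning is active, the layer keeps the original weights plus a mask.
has_mask = hasattr(layer, "weight_mask")

# Make pruning permanent: the mask is applied and the extra tensors removed.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
dense_bytes = layer.weight.numel() * layer.weight.element_size()

print(f"Mask present during pruning: {has_mask}")
print(f"Sparsity: {sparsity:.0%}")
print(f"Dense storage: {dense_bytes} bytes (unchanged by pruning)")
```

Half of the entries are zero, but the tensor still holds 200 FP32 values.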

## 3.1 Unstructured Pruning

### Question 3.1 (Solved): Apply Pruning

In [None]:
# SOLVED EXAMPLE
import torch.nn.utils.prune as prune

# Load fresh model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Prune 30% of weights in all Conv2d layers
for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name='weight', amount=0.3)

# Make pruning permanent
for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        prune.remove(module, 'weight')

# Count zero weights (sparsity)
total_zeros = 0
total_params = 0

for param in model.parameters():
    total_params += param.numel()
    total_zeros += (param == 0).sum().item()

sparsity = 100 * total_zeros / total_params
print(f"Global sparsity: {sparsity:.2f}%")

### Question 3.2: Benchmark Pruned Model

Measure size and latency of the pruned model.

In [None]:
# YOUR CODE HERE


### Question 3.3: Structured Pruning

Try structured pruning (removing entire filters) with 20% pruning.
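
As a starting point, here is a minimal sketch of structured pruning on a single toy `nn.Conv2d` layer (not ResNet-18). It uses `prune.ln_structured` with `dim=0` to zero whole output filters ranked by L2 norm; the layer shape is an assumption chosen so that 20% corresponds to exactly 2 of 10 filters.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
conv = nn.Conv2d(3, 10, kernel_size=3)  # toy layer with 10 output filters

# Zero the 20% of filters (dim=0) with the smallest L2 norm (n=2).
prune.ln_structured(conv, name="weight", amount=0.2, n=2, dim=0)
prune.remove(conv, "weight")

# A filter counts as pruned if every weight in it is zero.
filter_is_zero = conv.weight.abs().sum(dim=(1, 2, 3)) == 0
num_zero_filters = int(filter_is_zero.sum())

print(f"Pruned filters: {num_zero_filters} of {conv.out_channels}")
print(f"Weight shape is unchanged: {tuple(conv.weight.shape)}")
```

The zeroed filters are still present in the tensor. Turning them into a real size or latency gain would require rebuilding the layer with fewer output channels, which this sketch does not do.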

In [None]:
# YOUR CODE HERE


---

# Part 4: ONNX Export

Export models to ONNX format for cross-platform deployment.

## 4.1 Export to ONNX

### Question 4.1 (Solved): Export Model to ONNX

In [None]:
# SOLVED EXAMPLE
import onnx

# Load model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Create dummy input
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",
    export_params=True,
    opset_version=14,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

print("Model exported to resnet18.onnx")

# Verify the model
onnx_model = onnx.load("resnet18.onnx")
onnx.checker.check_model(onnx_model)
print("ONNX model is valid!")

### Question 4.2 (Solved): Run Inference with ONNX Runtime

In [None]:
# SOLVED EXAMPLE
import onnxruntime as ort

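# Load ONNX model (pin the CPU provider so results are comparable to PyTorch on CPU)
session = ort.InferenceSession("resnet18.onnx", providers=["CPUExecutionProvider"])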

# Prepare input
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
outputs = session.run(None, {'input': input_data})

print(f"Output shape: {outputs[0].shape}")
print(f"Predicted class: {np.argmax(outputs[0])}")

### Question 4.3: Benchmark ONNX Runtime

Compare ONNX Runtime performance to PyTorch.
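
A fair comparison uses the same timing procedure for both runtimes. One possible approach is a framework-agnostic helper that times any zero-argument callable and reports the same metrics as `benchmark_model`. The name `time_callable` and the NumPy matrix product used as a stand-in workload are assumptions for this sketch.

```python
import time
import numpy as np

def time_callable(fn, num_runs=100, warmup=10):
    """Time a zero-argument callable; return mean, std, and p95 in milliseconds."""
    for _ in range(warmup):  # warm caches and lazy initialization before timing
        fn()

    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)

    times_ms = np.asarray(times) * 1000.0
    return {
        'mean_ms': float(times_ms.mean()),
        'std_ms': float(times_ms.std()),
        'p95_ms': float(np.percentile(times_ms, 95)),
    }

# Stand-in workload so the sketch runs on its own.
a = np.random.rand(64, 64)
stats = time_callable(lambda: a @ a, num_runs=20, warmup=2)
print(stats)
```

With the objects from this lab, the callable could wrap `session.run(None, {'input': input_data})` for ONNX Runtime, or a `torch.no_grad()` forward pass for PyTorch.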

In [None]:
# YOUR CODE HERE

def benchmark_onnx(session, input_shape=(1, 3, 224, 224), num_runs=100):
    """Benchmark ONNX Runtime inference."""
    pass  # Implement this

# Run benchmark and compare to PyTorch


---

# Part 5: Comprehensive Comparison

Compare all optimization methods.

## 5.1 Create Comparison Table

### Question 5.1: Fill in Comparison Table
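
The `Compression` and `Speedup` columns are ratios against the baseline: baseline size divided by optimized size, and baseline latency divided by optimized latency. A small sketch of that arithmetic follows; `ratio_label` is a hypothetical helper, and the numbers are placeholders, not measurements.

```python
def ratio_label(baseline_value, value):
    """Format baseline_value / value as a label such as '2.50x'."""
    if value is None or value <= 0:
        return None  # leave the cell empty when there is no valid measurement
    return f"{baseline_value / value:.2f}x"

# Placeholder numbers for illustration only; use your own measurements.
baseline_example = {'size_mb': 40.0, 'latency_ms': 30.0}
variant_example = {'size_mb': 10.0, 'latency_ms': 20.0}

compression = ratio_label(baseline_example['size_mb'], variant_example['size_mb'])
speedup = ratio_label(baseline_example['latency_ms'], variant_example['latency_ms'])

print(f"Compression: {compression}, Speedup: {speedup}")
```

A value above 1x means the optimized model is smaller or faster than the baseline; below 1x means it got worse.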

In [None]:
# YOUR CODE HERE
# Create a comparison table with your measurements

import pandas as pd

results = {
    'Model': ['Baseline FP32', 'Dynamic INT8', 'Pruned 30%', 'ONNX FP32', 'ONNX INT8'],
    'Size (MB)': [None, None, None, None, None],  # Fill in
    'Latency (ms)': [None, None, None, None, None],  # Fill in
    'Compression': ['1x', None, None, None, None],  # Fill in
    'Speedup': ['1x', None, None, None, None]  # Fill in
}

df = pd.DataFrame(results)
print(df.to_string(index=False))

### Question 5.2: Create Visualization

In [None]:
# YOUR CODE HERE
# Create bar charts comparing:
# 1. Model sizes
# 2. Inference latencies

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Add your visualization code

plt.tight_layout()
plt.savefig('optimization_comparison.png', dpi=150)
plt.show()

---

# Part 6: Accuracy Evaluation

Optimization should not cost more accuracy than the application can tolerate, so check that the optimized models still agree with the baseline.

## 6.1 Test on Sample Data

### Question 6.1: Compare Predictions
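
The agreement rate is the fraction of inputs for which two models predict the same class. A minimal sketch of the calculation, assuming predictions are available as arrays of class indices (the helper name `agreement_rate` and the sample predictions are made up for illustration):

```python
import numpy as np

def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two prediction arrays match."""
    a = np.asarray(preds_a)
    b = np.asarray(preds_b)
    if a.shape != b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float((a == b).mean())

# Made-up class indices for 10 inputs; two of them differ.
baseline_example = [3, 7, 7, 1, 0, 9, 2, 2, 5, 4]
optimized_example = [3, 7, 6, 1, 0, 9, 2, 2, 5, 8]

rate = agreement_rate(baseline_example, optimized_example)
print(f"Agreement rate: {rate:.0%}")
```

PyTorch predictions can be converted with `.numpy()`, and ONNX Runtime outputs with `np.argmax(outputs[0], axis=1)`, before calling the helper.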

In [None]:
# Test that optimized models give similar predictions

# Load the baseline model
baseline_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
baseline_model.eval()

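# Create test input (random noise: this checks agreement with the baseline,
# not real-world accuracy, which would need labeled images)
test_input = torch.randn(10, 3, 224, 224)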

# Get baseline predictions
with torch.no_grad():
    baseline_preds = baseline_model(test_input).argmax(dim=1)

# YOUR CODE HERE
# Compare predictions from:
# 1. Quantized model
# 2. ONNX model
# Calculate agreement rate


---

# Summary

In this lab, you learned:

1. **Baseline measurement**: Size, latency, parameters
2. **Quantization**: Dynamic and static quantization (up to 4x smaller)
3. **Pruning**: Removing weights for sparsity
4. **ONNX**: Cross-platform model export
5. **Benchmarking**: Fair comparison methods

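## Typical Results

These are rough figures that depend on the model, runtime, and hardware; compare them against your own measurements.

| Optimization | Size Reduction | Speedup | Accuracy Loss |
|--------------|----------------|---------|---------------|
| Quantization | Up to 4x | 2-3x | <1% |
| Pruning | 1.5-2x (with sparse storage) | Variable | 1-3% |
| ONNX | Same | 2-3x | 0% |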

---

## Submission

Submit:
1. This completed notebook
2. Your comparison chart (`optimization_comparison.png`)
3. Brief report (1 page): Which optimization would you use for a mobile app?