# Lesson 6: Quantization

**Module 4: Model Development & Optimization**  
**Estimated Time**: 1-2 hours  
**Difficulty**: Advanced

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand Floating Point (FP32) vs Integer (INT8) representation  
✅ Learn the trade-off: 4x Smaller Size, potential Accuracy drop  
✅ Implement Post-Training Quantization (PTQ) in PyTorch  
✅ Answer interview questions on Edge AI optimizations  

---

## 📚 Table of Contents

1. [The Math: FP32 to INT8](#1-math)
2. [PTQ vs QAT](#2-types)
3. [Hands-On: Quantizing ResNet](#3-hands-on)
4. [Interview Preparation](#4-interview-questions)

---

## 1. The Math: FP32 to INT8

**Standard Model**: Weights are 32-bit floats (FP32).
- Range: $\pm 3.4 \times 10^{38}$
- Size: 4 bytes per weight.

**Quantized Model**: Weights are 8-bit integers (INT8).
- Range: $[-128, 127]$
- Size: 1 byte per weight.

**Benefit**: 
1. **4x Smaller Model Size** (100MB -> 25MB).
2. **Faster Inference**: Integer math is cheap on CPU/DSP.
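The size arithmetic can be checked with a few lines of Python. This is a rough sketch: the 25-million-weight model is hypothetical, chosen only to reproduce the 100MB figure above, and it ignores the small overhead of storing scales and zero-points.

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Hypothetical model with 25 million weights (matches the 100MB example).
NUM_PARAMS = 25_000_000

fp32_mb = model_size_mb(NUM_PARAMS, 32)
int8_mb = model_size_mb(NUM_PARAMS, 8)

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, ratio: {fp32_mb / int8_mb:.0f}x")
# prints "FP32: 100 MB, INT8: 25 MB, ratio: 4x"
```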

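**Mapping**: We map the observed float range $[x_{\min}, x_{\max}]$ onto the integer range $[-128, 127]$ using a scale $S$ and an integer zero-point $Z$, clamping the result to $[-128, 127]$:
$Q(x) = \text{round}(x / S + Z)$, where $S = (x_{\max} - x_{\min}) / 255$ and $Z = \text{round}(-128 - x_{\min} / S)$.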
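To make the mapping concrete, here is a minimal pure-Python sketch of affine quantization for a single value. The helper names (`compute_qparams`, `quantize`, `dequantize`) and the example range $[-1, 3]$ are illustrative assumptions, not part of any library. Because $Z$ is an integer, `round(x / S) + Z` is equivalent to the formula above.

```python
def compute_qparams(x_min, x_max, q_min=-128, q_max=127):
    """Derive scale S and zero-point Z so [x_min, x_max] maps onto [q_min, q_max]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:  # degenerate range (all zeros): avoid dividing by zero
        scale = 1.0
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Map a float to an INT8 code, clamping values outside the range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Map an INT8 code back to an approximate float."""
    return scale * (q - zero_point)


# Illustrative range: floats observed between -1.0 and 3.0.
S, Z = compute_qparams(-1.0, 3.0)
for x in (-1.0, 0.0, 0.5, 3.0, 10.0):
    q = quantize(x, S, Z)
    print(f"x={x:5.2f} -> q={q:4d} -> x'={dequantize(q, S, Z):.4f}")
```

Values inside the range come back with an error of at most $S/2$; values outside it (such as `10.0` here) are clamped to the nearest end of the range.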

## 2. PTQ vs QAT

### Post-Training Quantization (PTQ)
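- Train the model in FP32 as usual.
- Convert to INT8 **after** training.
- Needs a "Calibration" step: run a small, representative set of inputs (e.g. a few images) through the model to record activation min/max.
- **Pros**: Easy, no retraining. **Cons**: Accuracy can drop, sometimes significantly.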
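The calibration step can be pictured as a running min/max tracker attached to a layer's output. The sketch below is a simplified stand-in, assuming plain Python lists as "activations"; `RangeTracker` is a hypothetical class written for this lesson, not PyTorch's observer API.

```python
class RangeTracker:
    """Hypothetical helper: records the min/max of every value it is shown."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        # Update the running range with one batch of activations.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def qparams(self, q_min=-128, q_max=127):
        # Turn the observed range into a scale and zero-point.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (q_max - q_min) or 1.0
        zero_point = round(q_min - lo / scale)
        return scale, zero_point


# Made-up activation batches standing in for a calibration dataset.
calibration_batches = [[0.1, 0.9, -0.3], [2.5, 0.0, -0.7], [1.2, -1.0, 0.4]]

tracker = RangeTracker()
for batch in calibration_batches:
    tracker.observe(batch)

scale, zero_point = tracker.qparams()
print(f"range=[{tracker.min_val}, {tracker.max_val}] scale={scale:.5f} zero_point={zero_point}")
```

The more representative the calibration data, the closer the recorded range is to what the model sees in production.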

### Quantization-Aware Training (QAT)
- Train model while **simulating** quantization errors.
- The model learns to be robust to low precision.
- **Pros**: Best accuracy. **Cons**: Complex setup.
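"Simulating" quantization usually means quantizing a value and immediately dequantizing it, so the forward pass stays in floating point but carries INT8 rounding error. The sketch below shows only that forward-pass idea on a toy dot product; the weights, inputs, and the symmetric scale of `1 / 127` (which assumes weights in $[-1, 1]$ and a zero-point of 0) are all made up, and gradient handling during training is not covered.

```python
def fake_quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Quantize then dequantize: the result is a float snapped to the INT8 grid."""
    q = max(q_min, min(q_max, round(x / scale) + zero_point))
    return scale * (q - zero_point)


# Made-up weights and inputs for a single neuron.
weights = [0.42, -0.17, 0.88, -0.95]
inputs = [1.0, 2.0, -1.0, 0.5]

# Assumption: symmetric quantization of weights in [-1, 1].
scale, zero_point = 1.0 / 127, 0

fq_weights = [fake_quantize(w, scale, zero_point) for w in weights]

fp32_out = sum(w * x for w, x in zip(weights, inputs))
fq_out = sum(w * x for w, x in zip(fq_weights, inputs))

print(f"FP32 output:           {fp32_out:.5f}")
print(f"Fake-quantized output: {fq_out:.5f}")
print(f"Simulated error:       {abs(fq_out - fp32_out):.5f}")
```

Because the loss is computed on the fake-quantized output, training can adjust the weights so that this rounding error hurts less.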

## 3. Hands-On: Quantizing ResNet (PTQ)

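This example uses PyTorch's Eager Mode Quantization with torchvision's quantizable ResNet-18, which provides `fuse_model()` and the quant/dequant stubs that eager mode needs.

```python
import torch
from torchvision.models import ResNet18_Weights
from torchvision.models.quantization import resnet18

# 1. Load the FP32 model
# The quantizable variant adds QuantStub/DeQuantStub and fuse_model();
# the plain torchvision.models.resnet18 has neither.
model_fp32 = resnet18(weights=ResNet18_Weights.DEFAULT, quantize=False)
model_fp32.eval()

# 2. Fuse layers (Conv+BN+ReLU) so each group runs as a single quantized op
model_fp32.fuse_model()

# 3. Configure quantization ('fbgemm' targets x86 CPUs)
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model_fp32, inplace=True)

# 4. Calibration
# Random data only exercises the pipeline; use representative images in practice.
print("Calibrating...")
with torch.no_grad():
    model_fp32(torch.randn(1, 3, 224, 224))

# 5. Convert to INT8 (returns a new model; model_fp32 keeps its observers)
model_int8 = torch.quantization.convert(model_fp32, inplace=False)

print("Quantization Complete.")
print(model_int8.conv1)  # Look for QuantizedConvReLU2d (conv1 was fused with bn1 and relu)
```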

## 4. Interview Preparation

### Common Questions

#### Q1: "Why do we need Calibration?"
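**Answer**: "To map FP32 values to INT8, we need to know the dynamic range (min/max) of activations. Weights are fixed after training, but activation ranges depend on the input, so calibration runs a small representative dataset through the model to observe them. The scale factor ($S$) and zero-point ($Z$) are then chosen to cover that range without wasting INT8 levels."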
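A quick way to illustrate this answer is to compare the round-trip error under a calibrated range with the error under a carelessly wide one. The sketch below uses made-up activation values and a hypothetical helper `roundtrip_error`; the "naive" range of $[-100, 100]$ is an arbitrary stand-in for a range chosen without looking at the data.

```python
def roundtrip_error(values, x_min, x_max, q_min=-128, q_max=127):
    """Worst-case |dequantize(quantize(x)) - x| over the given values."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    worst = 0.0
    for x in values:
        q = max(q_min, min(q_max, round(x / scale) + zero_point))
        worst = max(worst, abs(scale * (q - zero_point) - x))
    return worst


# Made-up activations, e.g. the output of a ReLU layer.
activations = [0.05, 0.4, 1.3, 2.2, 0.9, 0.0]

calibrated = roundtrip_error(activations, min(activations), max(activations))
naive = roundtrip_error(activations, -100.0, 100.0)

print(f"Calibrated range error: {calibrated:.5f}")
print(f"Naive range error:      {naive:.5f}")
```

With the wide range, most of the 256 levels cover values that never occur, so the few levels left for the real data are spaced far apart.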

#### Q2: "When to use QAT?"
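**Answer**: "Use QAT when PTQ degrades accuracy more than the application can tolerate (a common rule of thumb is a drop of more than 1%). This often happens with small models (MobileNet), which have little redundancy, so every bit of precision counts."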