# Project Step 3: Optimization (ONNX + Quantization)

**Objective**: Deploy the model efficiently by converting it to **ONNX** and **Quantizing** it.

**Why?**
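- **ONNX**: An open model format. ONNX Runtime provides bindings for C++, C#, Java, and JavaScript, so production inference does not need Python.
- **Quantization**: FP32 -> INT8. Storing each weight in 1 byte instead of 4 shrinks the model by up to about 75%. It can also speed up CPU inference, although the gain depends on the hardware and on which operators get quantized.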
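
To make the FP32 -> INT8 arithmetic concrete, here is a minimal pure-Python sketch of affine (scale plus zero-point) quantization. The helper names `quantize_uint8` and `dequantize` and the sample weights are illustrative assumptions, not part of ONNX Runtime, which applies a similar mapping per tensor internally.

```python
def quantize_uint8(values):
    """Map floats to uint8 codes with an affine (scale + zero-point) scheme."""
    # The representable range must include 0 so that 0.0 maps to an exact code.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    codes = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from uint8 codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_uint8(weights)
restored = dequantize(codes, scale, zero_point)

# Rounding error per value stays within about one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage: 4 bytes per FP32 value versus 1 byte per uint8 code.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(codes)
reduction = 1 - int8_bytes / fp32_bytes  # ignores the small scale/zero-point overhead

print(f"scale={scale:.5f}, zero_point={zero_point}, max_error={max_error:.5f}")
print(f"size reduction: {reduction:.0%}")
```

The 75% figure is an upper bound: tensors that stay in FP32 and the stored scale and zero-point values make the real reduction somewhat smaller.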

---

## 📚 Steps
1. **Export**: Convert YOLOv8 to ONNX.
2. **Benchmark FP32**: Measure speed of the ONNX model.
3. **Quantize**: Convert to INT8.
4. **Benchmark INT8**: Compare size and speed.

## 1. Export to ONNX

Ultralytics has a built-in export function that simplifies this.

In [None]:
from ultralytics import YOLO
import os

# Load the model we trained in Step 2
# If it doesn't exist, we load the pre-trained n-model
model_path = "../models/yolo_baseline/weights/best.pt"
if not os.path.exists(model_path):
    print("Custom model not found, using standard yolov8n.pt")
    model_path = "yolov8n.pt"

model = YOLO(model_path)

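# Export to ONNX (static shapes, 640x640 input)
print("Exporting to ONNX...")
exported = model.export(format="onnx", dynamic=False, imgsz=640)

# Recent Ultralytics releases return the path of the exported file.
# Otherwise fall back to the convention that the .onnx is written next to the .pt.
if isinstance(exported, (str, os.PathLike)):
    onnx_path = str(exported)
else:
    onnx_path = os.path.splitext(model_path)[0] + ".onnx"
print(f"Exported to: {onnx_path}")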
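
Deriving file names with `str.replace` rewrites every occurrence of `.pt` in the path, not only the extension. A small sketch with `pathlib` avoids that. The helper name `derived_paths` and the example path are hypothetical and serve only to show the difference.

```python
from pathlib import Path


def derived_paths(weights_path):
    """Return (onnx_path, int8_path) located next to the given .pt file."""
    weights = Path(weights_path)
    onnx_path = weights.with_suffix(".onnx")  # swaps only the final extension
    int8_path = onnx_path.with_name(onnx_path.stem + "_int8.onnx")
    return onnx_path, int8_path


# Hypothetical path that contains ".pt" in a directory name.
example = "runs/exp.pt_backup/weights/best.pt"
onnx_file, int8_file = derived_paths(example)

print(onnx_file)                          # extension replaced, directory untouched
print(int8_file)
print(example.replace(".pt", ".onnx"))    # str.replace also mangles the directory
```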

## 2. Benchmark FP32 (ONNX Runtime)

We use `onnxruntime` with the CPU execution provider and time repeated inference on a random input of the export shape.

In [None]:
import onnxruntime as ort
import numpy as np
import time

# Create Session
session = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])

# Get Input Name
input_name = session.get_inputs()[0].name

# Create Dummy Input (Batch 1, 3 Channels, 640x640)
dummy_input = np.random.randn(1, 3, 640, 640).astype(np.float32)

# Warmup
for _ in range(5):
    session.run(None, {input_name: dummy_input})

# Benchmark (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
for _ in range(50):
    session.run(None, {input_name: dummy_input})
end = time.perf_counter()

fps_fp32 = 50 / (end - start)
print(f"FP32 Inference Speed: {fps_fp32:.2f} FPS")
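
A single average hides jitter from other processes, so per-call latencies are often more informative. The sketch below wraps the warmup-and-time pattern in a reusable helper. The name `benchmark` and the stand-in workload are assumptions for illustration. In this notebook the callable would be `lambda: session.run(None, {input_name: dummy_input})`.

```python
import statistics
import time


def benchmark(run, warmup=5, iters=50):
    """Call `run()` repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        run()  # warmup calls are not timed
    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    mean_ms = statistics.fmean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
        # Guard against a timer too coarse to register a very fast call.
        "fps": 1000 / mean_ms if mean_ms > 0 else float("inf"),
    }


# Stand-in workload so the sketch runs without a model file.
stats = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=2, iters=20)
print(f"median {stats['median_ms']:.3f} ms, mean {stats['mean_ms']:.3f} ms")
```

The median is less sensitive to occasional slow calls than the mean, which makes FP32 and INT8 runs easier to compare on a busy machine.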

## 3. Quantize to INT8

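We use `onnxruntime.quantization` to apply dynamic quantization: weights are converted to 8-bit integers ahead of time, while activation scales are computed at runtime, so no calibration dataset is needed. The ONNX Runtime documentation generally recommends static quantization for CNNs such as YOLO, so treat the dynamic result here as a quick baseline.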

In [None]:
from onnxruntime.quantization import quantize_dynamic, QuantType

int8_path = onnx_path.replace(".onnx", "_int8.onnx")

print("Quantizing...")
quantize_dynamic(
    onnx_path,
    int8_path,
    weight_type=QuantType.QUInt8
)
print(f"Quantized Model saved to: {int8_path}")
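
Quantization trades accuracy for size, so it is worth checking how far the INT8 outputs drift from the FP32 ones before comparing speed. The following is a minimal sketch. The helper name `compare_outputs` is an assumption, and the numbers are made up, standing in for flattened model outputs such as `session.run(None, {input_name: x})[0].ravel().tolist()`.

```python
import math


def compare_outputs(reference, candidate):
    """Return (max_abs_diff, cosine_similarity) for two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(
        sum(c * c for c in candidate)
    )
    cosine = dot / norm if norm else 1.0  # treat two all-zero outputs as identical
    return max_abs_diff, cosine


# Made-up values for illustration only, not real model outputs.
fp32_out = [0.10, 0.80, 0.05, 0.60]
int8_out = [0.12, 0.78, 0.05, 0.61]
max_diff, cosine = compare_outputs(fp32_out, int8_out)
print(f"max abs diff: {max_diff:.4f}, cosine similarity: {cosine:.5f}")
```

This is only a sanity check on raw tensors. A proper evaluation would compare detection metrics such as mAP on a validation set with real images.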

## 4. Compare Size & Speed

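Compare the file size and throughput of the INT8 model against the FP32 baseline. A speedup is not guaranteed: dynamically quantized convolutions can run slower than FP32 on some CPUs, so check the measured numbers rather than assuming a gain.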

In [None]:
# Compare Size
size_fp32 = os.path.getsize(onnx_path) / (1024 * 1024)
size_int8 = os.path.getsize(int8_path) / (1024 * 1024)

print(f"FP32 Size: {size_fp32:.2f} MB")
print(f"INT8 Size: {size_int8:.2f} MB")
print(f"Reduction: {(1 - size_int8/size_fp32):.0%}")

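# Benchmark INT8
session_int8 = ort.InferenceSession(int8_path, providers=['CPUExecutionProvider'])

# Read the input name from the quantized session instead of reusing the FP32 one
input_name_int8 = session_int8.get_inputs()[0].name

# Warmup
for _ in range(5):
    session_int8.run(None, {input_name_int8: dummy_input})

# Same timer and iteration count as the FP32 benchmark
start = time.perf_counter()
for _ in range(50):
    session_int8.run(None, {input_name_int8: dummy_input})
end = time.perf_counter()

fps_int8 = 50 / (end - start)
print(f"INT8 Inference Speed: {fps_int8:.2f} FPS")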
print(f"Speedup: {fps_int8/fps_fp32:.2f}x")

## Summary

You have successfully:
1. Downloaded & Versioned data (DVC).
2. Configured & Tracked training (Hydra + W&B).
3. Optimized the model for deployment (ONNX + Quantization).

This represents a complete **MLOps Model Development Lifecycle**.