# Lesson 8: ONNX Runtime

**Module 4: Model Development & Optimization**  
**Estimated Time**: 1-2 hours  
**Difficulty**: Advanced

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Compare PyTorch Inference vs ONNX Runtime (ORT)  
✅ Understand Graph Optimizations (Constant Folding, Fusion)  
✅ Use Execution Providers (CUDA, TensorRT, OpenVINO)  
✅ Answer interview questions on inference acceleration  

---

## 📚 Table of Contents

1. [What is ONNX Runtime?](#1-what-is-onnx-runtime)
2. [The Magic: Graph Optimizations](#2-the-magic-graph-optimizations)
3. [Hands-On: Benchmarking PyTorch vs ORT](#3-hands-on-benchmarking-pytorch-vs-ort)
4. [Interview Preparation](#4-interview-preparation)

---

## 1. What is ONNX Runtime?

ONNX Runtime (ORT) is a high-performance inference engine developed by Microsoft.

**Why use it?**
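- **Faster**: Can be 2x-10x faster than eager PyTorch inference, depending on the model, batch size, and hardware.
- **Hardware Support**: One API for CPUs, NVIDIA GPUs, Intel OpenVINO, and Android NNAPI.
- **Lightweight**: An inference-only runtime, without the training machinery a full framework carries.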

## 2. The Magic: Graph Optimizations

ORT applies compiler optimizations to your model graph:

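1. **Constant Folding**: Precomputing any part of the graph whose inputs are all constants (e.g., `3 + 5` -> `8`), so it is not recalculated on every inference call.
2. **Operator Fusion**: Combining adjacent layers into a single operator (e.g., `Conv + BatchNormalization + ReLU` -> one fused convolution). Intermediate results no longer make a round trip through memory.
3. **Memory Planning**: Reusing memory buffers across operators instead of allocating a new one for every intermediate tensor.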
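
To see why fusion is safe, here is a NumPy sketch of folding batch normalization into the preceding layer's weights. It is an illustration of the algebra, not ORT's implementation: a linear layer stands in for the convolution (the per-output-channel math is the same), and all weights and statistics are random values made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters: a linear layer (4 outputs, 3 inputs) followed by
# batch normalization in inference mode, then ReLU.
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
gamma = rng.standard_normal(4)
beta = rng.standard_normal(4)
mean = rng.standard_normal(4)
var = rng.uniform(0.5, 2.0, 4)
eps = 1e-5


def unfused(x):
    """Three separate steps, each producing an intermediate tensor."""
    y = x @ W.T + b
    y = gamma * (y - mean) / np.sqrt(var + eps) + beta
    return np.maximum(y, 0.0)


# Fold the batch-norm scale and shift into the weights once, ahead of time.
scale = gamma / np.sqrt(var + eps)
W_fused = W * scale[:, None]
b_fused = (b - mean) * scale + beta


def fused(x):
    """One matrix multiply plus ReLU, with no batch-norm step at run time."""
    return np.maximum(x @ W_fused.T + b_fused, 0.0)


x = rng.standard_normal((2, 3))
print(np.allclose(unfused(x), fused(x)))
```

Computing `W_fused` and `b_fused` ahead of time is itself a form of constant folding: every input to that calculation is known before the first inference call.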

## 3. Hands-On: Benchmarking PyTorch vs ORT

Requires `pip install onnxruntime` (the benchmark also imports `torch` and `numpy`).

```python
import time

import numpy as np
import onnxruntime as ort
import torch

N_RUNS = 100  # timed iterations per backend

# 1. Set up data: one sample with 10 features
input_data = np.random.randn(1, 10).astype(np.float32)
input_tensor = torch.from_numpy(input_data)

# 2. Benchmark PyTorch
# Assumes Lesson 7 saved the full model object to model.pt
model = torch.load("model.pt", weights_only=False)
model.eval()

with torch.no_grad():
    model(input_tensor)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(N_RUNS):
        model(input_tensor)
    pt_lat = (time.perf_counter() - start) / N_RUNS

# 3. Benchmark ONNX Runtime
# Assumes the Lesson 7 export named its input "input" and its output "output"
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

session.run(["output"], {"input": input_data})  # warm-up run, excluded from timing
start = time.perf_counter()
for _ in range(N_RUNS):
    outputs = session.run(["output"], {"input": input_data})
ort_lat = (time.perf_counter() - start) / N_RUNS

print(f"PyTorch Time: {pt_lat * 1e3:.3f} ms per run")
print(f"ORT Time: {ort_lat * 1e3:.3f} ms per run")
```

## 4. Interview Preparation

### Common Questions

#### Q1: "How do you deploy a model to an NVIDIA GPU?"
**Answer**: "I would export the model to ONNX, then run it with ONNX Runtime using the `CUDAExecutionProvider` or `TensorrtExecutionProvider`, listing `CPUExecutionProvider` last as a fallback. TensorRT applies aggressive optimizations specific to NVIDIA hardware, such as layer fusion and reduced-precision kernels, and typically gives the highest throughput."

#### Q2: "What is quantization in ONNX Runtime?"
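**Answer**: "Quantization stores weights, and optionally activations, as 8-bit integers instead of 32-bit floats, which makes the weights 4x smaller and lets the CPU use faster integer instructions. ORT runs quantized INT8 models directly on CPUs and can use the AVX-512 VNNI instruction set on Intel CPUs that support it, which can give up to roughly 4x speedups over FP32."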
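
The idea behind INT8 quantization can be shown with a few lines of NumPy. This is a sketch of the affine (scale and zero-point) mapping only, not ORT's quantization tooling; the function names and the random weights are assumptions made for illustration.

```python
import numpy as np


def quantize_int8(x):
    """Map float values to int8 using a scale and zero point."""
    qmin, qmax = -128, 127
    # The represented range must include 0 so that 0.0 maps to an exact integer.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # all-zero input: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
weights = rng.standard_normal(1000).astype(np.float32)  # made-up weights

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

print("size ratio:", weights.nbytes / q.nbytes)  # 4.0
print("max abs error:", float(np.max(np.abs(restored - weights))))
print("scale:", scale)
```

The maximum error stays within about one quantization step (`scale`), which is the accuracy cost paid for the smaller model and the integer arithmetic.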