# PyTorch Tutorial: Model Deployment and Production

Training a model is only half the battle. To use it in the real world (in a mobile app, a web server, or an embedded device), you need to **deploy** it. This often means optimizing it for speed and portability.

## Learning Objectives
- Understand **TorchScript** (Tracing vs Scripting)
- Export models to **ONNX** (Open Neural Network Exchange)
- **Quantization**: Making models smaller and faster
- Measure Inference Speed (Benchmarking)

## 1. Vocabulary First

- **Inference**: Using a trained model to make predictions (no backprop).
- **Latency**: Time taken to process one input (lower is better).
- **Throughput**: Number of inputs processed per second (higher is better).
- **Serialization**: Saving a model to a file so it can be loaded anywhere (even in C++).
- **Quantization**: Reducing the precision of numbers (e.g., 32-bit float -> 8-bit integer) to save memory and compute.
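
Latency and throughput can be estimated with a plain timing loop. The sketch below is one way to do that on CPU; the helper name `benchmark` and the stand-in network `net` are assumptions made for illustration, not part of the tutorial's model. It runs a few untimed warm-up passes, then averages over many timed passes.

```python
import time

import torch
import torch.nn as nn


def benchmark(module, inputs, warmup=10, iters=100):
    """Return (latency in ms per call, throughput in samples per second)."""
    module.eval()
    with torch.no_grad():
        for _ in range(warmup):  # warm-up passes are not timed
            module(inputs)
        start = time.perf_counter()
        for _ in range(iters):
            module(inputs)
        elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000
    throughput = inputs.shape[0] * iters / elapsed
    return latency_ms, throughput


# Stand-in network used only for this illustration
net = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 2))

for batch_size in (1, 64):
    latency_ms, throughput = benchmark(net, torch.randn(batch_size, 10))
    print(f"batch={batch_size:3d}  latency={latency_ms:.3f} ms  "
          f"throughput={throughput:,.0f} samples/s")
```

Larger batches usually raise throughput at the cost of per-call latency. On a GPU, CUDA calls are asynchronous, so a loop like this would also need `torch.cuda.synchronize()` before each clock reading.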

In [None]:
import torch
import torch.nn as nn
import time
import os

# Let's use a simple model for demonstration
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 2)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

model = SimpleModel()
model.eval() # Important! Set to eval mode for deployment
example_input = torch.randn(1, 10)
print("Model created.")

## 2. TorchScript: Tracing vs Scripting

TorchScript allows you to serialize your models and run them in a C++ runtime (no Python needed!).

### Method A: Tracing
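Runs the model once with an example input and records the operations that were executed. Fast and easy, but data-dependent control flow (if/else, loops) is not captured: only the path taken for the example input is recorded, usually with nothing more than a `TracerWarning`.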

```python
traced_model = torch.jit.trace(model, example_input)
traced_model.save("traced_model.pt")
```

### Method B: Scripting
Compiles the model from its Python source code, so data-dependent control flow (if/else branches and loops) is preserved.

```python
scripted_model = torch.jit.script(model)
scripted_model.save("scripted_model.pt")
```
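
To make the difference concrete, here is a small sketch with a toy module whose output depends on a data-dependent branch. The `Gate` class is a hypothetical example, not part of the tutorial's model. Tracing with a positive input records only the "positive" branch, while scripting keeps both.

```python
import warnings

import torch
import torch.nn as nn


class Gate(nn.Module):
    """Toy module (illustrative only) with a data-dependent branch."""

    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        return torch.zeros_like(x)


gate = Gate().eval()
pos = torch.ones(1, 3)
neg = -torch.ones(1, 3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the TracerWarning about the branch
    traced_gate = torch.jit.trace(gate, pos)  # only the x * 2 path is recorded
scripted_gate = torch.jit.script(gate)  # both paths are compiled

print("eager   :", gate(neg))
print("traced  :", traced_gate(neg))    # follows the recorded x * 2 path
print("scripted:", scripted_gate(neg))  # matches eager
```

For the negative input, the traced module disagrees with the eager model while the scripted one agrees with it. A model like `SimpleModel`, which has no branching, behaves the same under both methods.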

In [None]:
# Try Tracing
traced_model = torch.jit.trace(model, example_input)
print(traced_model.code) # Python-like view of the operations the tracer recorded

# Verify it works
output_orig = model(example_input)
output_traced = traced_model(example_input)
assert torch.allclose(output_orig, output_traced)
print("Tracing successful!")
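
A saved TorchScript file can be loaded back without the original Python class definition. The following is a minimal sketch of that round trip, assuming a stand-in `nn.Sequential` network and a temporary file path chosen for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in network used only for this illustration
net = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 2)).eval()
example = torch.randn(1, 10)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "traced_net.pt")
    torch.jit.trace(net, example).save(path)
    # torch.jit.load rebuilds the module from the file alone
    loaded = torch.jit.load(path)

with torch.no_grad():
    same = torch.allclose(net(example), loaded(example))
print("Loaded model matches original:", same)
```

The same file is what a C++ program would read through LibTorch's `torch::jit::load`.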

## 3. ONNX Export

ONNX is a standard format supported by many frameworks (TensorFlow, PyTorch, MATLAB, etc.). It's great for deploying to edge devices or using specialized runtimes like ONNX Runtime.

```python
torch.onnx.export(
    model, 
    example_input, 
    "model.onnx",
    input_names=['input'],
    output_names=['output']
)
```

In [None]:
torch.onnx.export(
    model, 
    example_input, 
    "simple_model.onnx",
    input_names=['input'],
    output_names=['output']
)
print("Exported to simple_model.onnx")

## 4. Quantization (Making it smaller)

We can convert weights from Float32 to Int8. An 8-bit weight takes a quarter of the storage of a 32-bit one, so the quantized layers shrink by roughly 4x, and inference can be faster on hardware with int8 support.

### Dynamic Quantization
Quantizes weights ahead of time, but activations are quantized dynamically at runtime. Good for LSTMs/Transformers.

In [None]:
quantized_model = torch.quantization.quantize_dynamic(
    model, 
    {nn.Linear}, # Layers to quantize
    dtype=torch.qint8
)

# Save the original model first so there is a file to compare against
torch.jit.save(torch.jit.trace(model, example_input), "traced_model.pt")
print(f"Original Size: {os.path.getsize('traced_model.pt')/1024:.2f} KB")
# Note: For this tiny model, overhead might make it larger, but for large models it helps.
torch.jit.save(torch.jit.trace(quantized_model, example_input), "quantized_model.pt")
print(f"Quantized Size: {os.path.getsize('quantized_model.pt')/1024:.2f} KB")
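
The size benefit is easier to see on a model whose weights dominate the file. The sketch below assumes a larger stand-in network (`big_model`) and a hypothetical helper `state_dict_size_kb`; it measures the serialized `state_dict` in memory and also checks how far the quantized outputs drift from the float ones.

```python
import io

import torch
import torch.nn as nn


def state_dict_size_kb(module):
    """Serialize the module's state_dict in memory and return its size in KB."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1024


# Larger stand-in model so the weights dominate the size (illustrative only)
big_model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()
big_quantized = torch.quantization.quantize_dynamic(
    big_model, {nn.Linear}, dtype=torch.qint8
)

fp32_kb = state_dict_size_kb(big_model)
int8_kb = state_dict_size_kb(big_quantized)
print(f"float32: {fp32_kb:.0f} KB, int8: {int8_kb:.0f} KB, "
      f"ratio: {fp32_kb / int8_kb:.1f}x")

# Quantization is lossy, so compare outputs instead of expecting equality
x = torch.randn(8, 512)
with torch.no_grad():
    max_diff = (big_model(x) - big_quantized(x)).abs().max().item()
print(f"largest output difference: {max_diff:.4f}")
```

The ratio should approach 4x as the weights grow relative to the biases and file overhead. Whether the small output difference is acceptable depends on the task, so it is worth re-checking accuracy on real validation data after quantizing.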

## Key Takeaways

1. **Always call `model.eval()`** before exporting, so layers such as dropout and batch norm run in inference mode.
2. Use **TorchScript** for C++ production environments.
3. Use **ONNX** for cross-platform compatibility.
4. Use **Quantization** to reduce size and latency.