# PyTorch to TensorRT Optimization Demo

This notebook demonstrates the complete pipeline for optimizing deep learning models using NVIDIA TensorRT.

## Overview

We'll walk through:
1. Converting a PyTorch model to ONNX format
2. Building optimized TensorRT engines (FP32, FP16, INT8)
3. Running inference and comparing performance
4. Visualizing benchmark results

## Prerequisites

- NVIDIA GPU with CUDA support
- TensorRT 8.6+
- All dependencies from requirements.txt installed

## Setup and Imports

In [None]:
import sys
import os
import json
import time
from pathlib import Path

# Add the project's src directory to the import path
sys.path.append('../src')

import numpy as np
import torch
import torchvision.models as models
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Import our modules
from convert_to_onnx import load_pytorch_model, export_to_onnx, validate_onnx_model
from convert_to_tensorrt import EngineBuilder
from calibration import generate_calibration_data, INT8EntropyCalibrator
from inference import TensorRTInferenceEngine
from benchmark import BenchmarkSuite, PyTorchBenchmark
from visualize_results import BenchmarkVisualizer

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 1: PyTorch to ONNX Conversion

First, we'll load a pretrained ResNet50 model and convert it to ONNX format.

In [None]:
# Configuration
MODEL_NAME = 'resnet50'
BATCH_SIZE = 1
INPUT_SIZE = (224, 224)
CHANNELS = 3

# Create directories
os.makedirs('../models', exist_ok=True)
os.makedirs('../engines', exist_ok=True)
os.makedirs('../results', exist_ok=True)
os.makedirs('../plots', exist_ok=True)

print(f"Loading {MODEL_NAME} model...")

In [None]:
# Load PyTorch model
model = load_pytorch_model(MODEL_NAME, pretrained=True)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

# Model summary
print(f"\nModel architecture: {MODEL_NAME}")
print(f"Input shape: ({BATCH_SIZE}, {CHANNELS}, {INPUT_SIZE[0]}, {INPUT_SIZE[1]})")

In [None]:
# Export to ONNX
onnx_path = f'../models/{MODEL_NAME}.onnx'
input_shape = (BATCH_SIZE, CHANNELS, *INPUT_SIZE)

print("Exporting to ONNX...")
export_to_onnx(
    model=model,
    output_path=onnx_path,
    input_shape=input_shape,
    dynamic_batch=True,
    opset_version=16
)

# Validate ONNX model
print("\nValidating ONNX model...")
is_valid = validate_onnx_model(onnx_path, input_shape)
print(f"ONNX validation: {'PASSED' if is_valid else 'FAILED'}")

# Check file size
onnx_size = os.path.getsize(onnx_path) / (1024 * 1024)
print(f"ONNX model size: {onnx_size:.2f} MB")

## Step 2: Generate Calibration Data for INT8

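For INT8 quantization, TensorRT needs calibration data to choose the value ranges used for quantization. We'll generate synthetic data for this demo; for meaningful INT8 accuracy, calibrate on representative samples from your real dataset instead.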

In [None]:
# Generate calibration images
calibration_dir = '../calibration_images'

if not os.path.exists(calibration_dir):
    print("Generating calibration data...")
    generate_calibration_data(
        output_dir=calibration_dir,
        num_images=100,  # Use 100 images for demo
        image_size=INPUT_SIZE
    )
else:
    num_existing = len(list(Path(calibration_dir).glob('*.jpg')))
    print(f"Using existing calibration data: {num_existing} images")
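
To make the role of calibration more concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. It uses simple max-abs scaling rather than the entropy-based method that `INT8EntropyCalibrator` presumably implements, and the function names and sample values are illustrative assumptions, not part of this project's modules.

```python
def symmetric_int8_scale(values):
    """Return the per-tensor scale that maps the largest |value| to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round each value to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in q_values]


# Illustrative activations (made up for this sketch)
activations = [0.02, -0.5, 1.27, 0.0, -1.0]
scale = symmetric_int8_scale(activations)
q = quantize(activations, scale)
restored = dequantize(q, scale)
max_error = max(abs(a - r) for a, r in zip(activations, restored))

print(f"scale={scale:.4f}")
print(f"quantized={q}")
print(f"max round-trip error={max_error:.6f}")
```

The scale depends entirely on the values seen during calibration, which is why unrepresentative calibration data can lead to poorly chosen ranges.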

## Step 3: Build TensorRT Engines

Now we'll build TensorRT engines with different precision modes.

In [None]:
def build_engine_if_needed(precision, force_rebuild=False):
    """Build TensorRT engine if it doesn't exist."""
    engine_path = f'../engines/{MODEL_NAME}_{precision}.trt'
    
    if os.path.exists(engine_path) and not force_rebuild:
        print(f"{precision.upper()} engine already exists: {engine_path}")
        return engine_path
    
    print(f"\nBuilding {precision.upper()} engine...")
    print("This may take a few minutes...")
    
    builder = EngineBuilder(verbose=False)
    
    # Load ONNX
    if not builder.load_onnx(onnx_path):
        print(f"Failed to load ONNX for {precision}")
        return None
    
    # Create calibrator for INT8
    calibrator = None
    if precision == 'int8':
        calibrator = INT8EntropyCalibrator(
            data_dir=calibration_dir,
            cache_file=f'../calibration_{MODEL_NAME}.cache',
            batch_size=8,
            max_batches=10,
            input_shape=(CHANNELS, *INPUT_SIZE)
        )
    
    # Configure builder
    builder.configure_builder(
        precision=precision,
        max_workspace_size=1024,  # Workspace limit in MB (1 GB)
        max_batch_size=16,
        calibrator=calibrator
    )
    
    # Build engine
    engine = builder.build_engine()
    if engine is None:
        print(f"Failed to build {precision} engine")
        return None
    
    # Save engine
    builder.save_engine(engine, engine_path)
    
    # Clean up
    del engine
    del builder
    
    return engine_path

In [None]:
# Build engines for each precision
precisions = ['fp32', 'fp16', 'int8']
engine_paths = {}

for precision in precisions:
    engine_path = build_engine_if_needed(precision, force_rebuild=False)
    if engine_path:
        engine_paths[precision] = engine_path
        
print("\nEngine building complete!")
print("Available engines:")
for precision, path in engine_paths.items():
    size = os.path.getsize(path) / (1024 * 1024)
    print(f"  {precision.upper()}: {size:.2f} MB")
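
The existence check in `build_engine_if_needed` will reuse an engine even after the ONNX file has been re-exported. One way to guard against that, sketched here under the assumption that file modification times are a good enough signal, is to treat an engine as stale when it is older than its ONNX source. The helper name `engine_is_stale` is hypothetical.

```python
import os
import tempfile


def engine_is_stale(onnx_file, engine_file):
    """Return True if the engine is missing or older than the ONNX file."""
    if not os.path.exists(engine_file):
        return True
    return os.path.getmtime(engine_file) < os.path.getmtime(onnx_file)


# Small demonstration with temporary placeholder files
with tempfile.TemporaryDirectory() as tmp:
    demo_onnx = os.path.join(tmp, "model.onnx")
    demo_engine = os.path.join(tmp, "model_fp16.trt")
    open(demo_onnx, "wb").close()
    print("Engine missing, stale:", engine_is_stale(demo_onnx, demo_engine))
```

In this notebook, the result could be passed as `force_rebuild=engine_is_stale(onnx_path, engine_path)`. Keep in mind that serialized engines are generally tied to the TensorRT version and GPU they were built with, so a timestamp check alone does not cover a change of environment.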

## Step 4: Quick Inference Test

Let's test inference with each engine to verify they work correctly.

In [None]:
# Create dummy input
dummy_input = np.random.randn(1, CHANNELS, *INPUT_SIZE).astype(np.float32)

print("Testing inference with each engine...\n")

for precision, engine_path in engine_paths.items():
    print(f"Testing {precision.upper()} engine:")
    
    with TensorRTInferenceEngine(engine_path, max_batch_size=16) as engine:
        # Warmup
        engine.warmup(num_iterations=5)
        
        # Run inference
        start = time.perf_counter()
        output = engine.infer(dummy_input)
        end = time.perf_counter()
        
        latency = (end - start) * 1000
        
        print(f"  Output shape: {output.shape}")
        print(f"  Latency: {latency:.2f} ms")
        print(f"  Output range: [{output.min():.3f}, {output.max():.3f}]")
        print()
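
The test above only prints shapes and value ranges. To check that a reduced-precision engine still agrees with the FP32 one, we could compare their outputs on the same input. The following is a minimal sketch using plain Python lists; the helper names and the sample logits are made up for illustration.

```python
import math


def top1_index(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_outputs(reference, candidate):
    """Summarize how closely a candidate output matches the reference."""
    return {
        "top1_match": top1_index(reference) == top1_index(candidate),
        "max_abs_diff": max(abs(r - c) for r, c in zip(reference, candidate)),
        "cosine": cosine_similarity(reference, candidate),
    }


# Illustrative values only, not measured outputs
fp32_logits = [0.1, 2.5, -1.0, 0.7]
fp16_logits = [0.1002, 2.4985, -0.9991, 0.7004]
report = compare_outputs(fp32_logits, fp16_logits)
print(report)
```

Assuming `engine.infer` returns a NumPy array, as the `.shape` and `.min()` calls above suggest, the outputs could be passed in as `output.ravel().tolist()`.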

## Step 5: Comprehensive Benchmarking

Now let's run comprehensive benchmarks comparing PyTorch vs TensorRT.

In [None]:
# Quick benchmark for demo (reduced iterations)
batch_sizes = [1, 4, 8]
iterations = 20  # Reduced for demo
warmup = 5

print(f"Running benchmarks for batch sizes: {batch_sizes}")
print(f"Iterations per test: {iterations}")
print("\nThis will take a few minutes...\n")

# Initialize benchmark suite
suite = BenchmarkSuite(
    pytorch_model=MODEL_NAME,
    trt_engines_dir='../engines',
    batch_sizes=batch_sizes,
    input_size=INPUT_SIZE,
    verbose=False
)

# Run benchmarks
results = suite.run_benchmarks(iterations=iterations, warmup=warmup)

# Save results
results_path = '../results/demo_benchmark.json'
suite.save_results(results_path)

print("\n" + "="*60)
suite.print_summary()
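
For readers curious about what a latency benchmark involves, here is a minimal sketch of a timing loop with warmup and summary statistics. It is not the implementation of `BenchmarkSuite`; the function names are assumptions, and only the `mean_latency_ms` and `throughput_fps` keys mirror the result fields used later in this notebook.

```python
import math
import statistics
import time


def measure_latency_ms(fn, iterations=20, warmup=5):
    """Call fn repeatedly and return per-call latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # Warmup runs are executed but not recorded
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples, batch_size=1):
    """Compute mean, median, nearest-rank p95, and throughput."""
    ordered = sorted(samples)
    mean_ms = statistics.fmean(ordered)
    p95_index = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean_latency_ms": mean_ms,
        "median_latency_ms": statistics.median(ordered),
        "p95_latency_ms": ordered[p95_index],
        "throughput_fps": batch_size * 1000.0 / mean_ms,
    }


def cpu_workload():
    """Stand-in for an inference call."""
    return sum(range(10_000))


stats = summarize(measure_latency_ms(cpu_workload), batch_size=1)
print({name: round(value, 4) for name, value in stats.items()})
```

When timing GPU work, `fn` must block until the result is ready (for PyTorch, by calling `torch.cuda.synchronize()` before stopping the clock). Otherwise the loop may measure only the kernel launch rather than the computation.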

## Step 6: Visualize Results

Let's create visualizations to better understand the performance improvements.

In [None]:
# Create inline visualizations
def plot_speedup_comparison(results):
    """Create a simple speedup comparison plot."""
    fig, ax = plt.subplots(figsize=(10, 6))
    
    batch_sizes = results['metadata']['batch_sizes']
    speedups = {'FP32': [], 'FP16': [], 'INT8': []}
    
    for batch_size in batch_sizes:
        batch_results = results['benchmarks'].get(f'batch_{batch_size}', {})
        
        # Get the PyTorch FP32 baseline latency (None if it is missing)
        baseline = batch_results.get('pytorch', {}).get('fp32', {}).get('mean_latency_ms')
        
        for precision in ['fp32', 'fp16', 'int8']:
            latency = batch_results.get(f'tensorrt_{precision}', {}).get('mean_latency_ms')
            # Append 0 when data is missing so each list stays aligned with batch_sizes
            speedup = baseline / latency if baseline and latency else 0
            speedups[precision.upper()].append(speedup)
    
    # Plot bars
    x = np.arange(len(batch_sizes))
    width = 0.25
    
    colors = {'FP32': '#2ecc71', 'FP16': '#27ae60', 'INT8': '#76B900'}
    
    for i, (precision, values) in enumerate(speedups.items()):
        ax.bar(x + i * width, values, width, label=f'TensorRT {precision}',
               color=colors[precision])
    
    ax.axhline(y=1.0, color='red', linestyle='--', label='PyTorch Baseline')
    
    ax.set_xlabel('Batch Size')
    ax.set_ylabel('Speedup Factor')
    ax.set_title('TensorRT Speedup vs PyTorch FP32 Baseline')
    ax.set_xticks(x + width)
    ax.set_xticklabels(batch_sizes)
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()

plot_speedup_comparison(results)

In [None]:
# Create latency comparison
def plot_latency_bars(results):
    """Create latency comparison bar plot."""
    batch_size = results['metadata']['batch_sizes'][0]  # Use first batch size
    batch_key = f'batch_{batch_size}'
    batch_results = results['benchmarks'][batch_key]
    
    fig, ax = plt.subplots(figsize=(10, 6))
    
    frameworks = []
    latencies = []
    colors_list = []
    
    # PyTorch
    if 'pytorch' in batch_results and 'fp32' in batch_results['pytorch']:
        frameworks.append('PyTorch FP32')
        latencies.append(batch_results['pytorch']['fp32']['mean_latency_ms'])
        colors_list.append('#3498db')
    
    # TensorRT
    for precision, color in [('fp32', '#2ecc71'), ('fp16', '#27ae60'), ('int8', '#76B900')]:
        key = f'tensorrt_{precision}'
        if key in batch_results and 'mean_latency_ms' in batch_results[key]:
            frameworks.append(f'TensorRT {precision.upper()}')
            latencies.append(batch_results[key]['mean_latency_ms'])
            colors_list.append(color)
    
    bars = ax.bar(range(len(frameworks)), latencies, color=colors_list)
    
    # Add value labels
    for bar, lat in zip(bars, latencies):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
               f'{lat:.1f} ms', ha='center', va='bottom')
    
    ax.set_xlabel('Framework & Precision')
    ax.set_ylabel('Latency (ms)')
    ax.set_title(f'Inference Latency Comparison (Batch Size = {batch_size})')
    ax.set_xticks(range(len(frameworks)))
    ax.set_xticklabels(frameworks)
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()

plot_latency_bars(results)

## Step 7: Memory Usage Analysis

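Let's compare on-disk model sizes across formats and precisions. File size is only a proxy for runtime memory usage, which also includes activations and workspace buffers.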

In [None]:
# Analyze model sizes
print("Model Size Comparison:")
print("="*50)

# PyTorch model size (parameters only; buffers such as BatchNorm statistics are not counted)
pytorch_size = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 * 1024)
print(f"PyTorch model (FP32): {pytorch_size:.2f} MB")

# ONNX model size
print(f"ONNX model: {onnx_size:.2f} MB")

# TensorRT engine sizes
for precision, path in engine_paths.items():
    size = os.path.getsize(path) / (1024 * 1024)
    reduction = (1 - size/pytorch_size) * 100
    print(f"TensorRT {precision.upper()}: {size:.2f} MB ({reduction:.1f}% reduction)")
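
As a reference point for the numbers above, we can estimate how much storage the weights alone would need at each precision. This is a back-of-the-envelope sketch: it assumes every weight is stored at the target precision, which real engines do not guarantee, and the helper name is hypothetical.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def estimate_weight_mb(num_params, precision):
    """Estimate weight storage in MB if every parameter used this precision."""
    return num_params * BYTES_PER_ELEMENT[precision] / (1024 * 1024)


# Parameter count of torchvision's ResNet50; in this notebook,
# total_params from Step 1 could be used instead.
num_params = 25_557_032

fp32_mb = estimate_weight_mb(num_params, "fp32")
for precision in ("fp32", "fp16", "int8"):
    size_mb = estimate_weight_mb(num_params, precision)
    reduction = (1 - size_mb / fp32_mb) * 100
    print(f"{precision.upper()}: {size_mb:.2f} MB ({reduction:.0f}% smaller than FP32)")
```

Measured engine sizes that differ noticeably from these estimates are expected, since an engine file holds more than the raw weights and some layers may be kept at a higher precision.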

## Step 8: Performance Summary

Let's create a summary table with all key metrics.

In [None]:
# Create summary table
def create_summary_table(results):
    """Create an HTML summary table."""
    batch_size = results['metadata']['batch_sizes'][0]
    batch_key = f'batch_{batch_size}'
    batch_results = results['benchmarks'][batch_key]
    
    # Get PyTorch baseline
    baseline = batch_results['pytorch']['fp32']['mean_latency_ms']
    
    table_html = """
    <table style='width:100%; border-collapse: collapse;'>
    <tr style='background-color: #76B900; color: white;'>
        <th style='padding: 10px; border: 1px solid #ddd;'>Configuration</th>
        <th style='padding: 10px; border: 1px solid #ddd;'>Latency (ms)</th>
        <th style='padding: 10px; border: 1px solid #ddd;'>Speedup</th>
        <th style='padding: 10px; border: 1px solid #ddd;'>Throughput (FPS)</th>
    </tr>
    """
    
    # PyTorch baseline
    table_html += f"""
    <tr>
        <td style='padding: 10px; border: 1px solid #ddd;'><b>PyTorch FP32</b> (Baseline)</td>
        <td style='padding: 10px; border: 1px solid #ddd;'>{baseline:.2f}</td>
        <td style='padding: 10px; border: 1px solid #ddd;'>1.0x</td>
        <td style='padding: 10px; border: 1px solid #ddd;'>{batch_results['pytorch']['fp32']['throughput_fps']:.1f}</td>
    </tr>
    """
    
    # TensorRT results
    for precision in ['fp32', 'fp16', 'int8']:
        key = f'tensorrt_{precision}'
        if key in batch_results and 'mean_latency_ms' in batch_results[key]:
            latency = batch_results[key]['mean_latency_ms']
            speedup = baseline / latency
            throughput = batch_results[key]['throughput_fps']
            
            # Highlight the INT8 row (typically, but not necessarily, the fastest)
            style = 'background-color: #e8f5e9;' if precision == 'int8' else ''
            
            table_html += f"""
            <tr style='{style}'>
                <td style='padding: 10px; border: 1px solid #ddd;'><b>TensorRT {precision.upper()}</b></td>
                <td style='padding: 10px; border: 1px solid #ddd;'>{latency:.2f}</td>
                <td style='padding: 10px; border: 1px solid #ddd;'><b>{speedup:.1f}x</b></td>
                <td style='padding: 10px; border: 1px solid #ddd;'>{throughput:.1f}</td>
            </tr>
            """
    
    table_html += "</table>"
    
    return HTML(table_html)

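# Use the same batch size that create_summary_table reports on
display(HTML(f"<h3>Performance Summary (Batch Size = {results['metadata']['batch_sizes'][0]})</h3>"))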
display(create_summary_table(results))

## Conclusions

### Key Findings

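1. **Performance Improvements**: TensorRT typically provides a significant speedup over PyTorch, though exact figures depend on the GPU, model, and batch size, so check them against the measurements above:
   - FP16: Typically 2-3x faster
   - INT8: Can achieve 3-4x speedup, usually with minimal accuracy loss when calibrated on representative data

2. **Memory Efficiency**: Lower precision shrinks weight storage relative to FP32 (actual engine sizes vary, since some layers may stay at higher precision):
   - FP16: ~50% reduction (2 bytes per weight instead of 4)
   - INT8: ~75% reduction (1 byte per weight instead of 4)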

3. **Optimization Techniques**: TensorRT applies several optimizations:
   - Layer fusion
   - Kernel auto-tuning
   - Precision calibration
   - Memory optimization

### Production Deployment Recommendations

1. **Use FP16 by default**: Best balance of speed and accuracy
2. **Consider INT8 for edge devices**: Maximum speed and minimum memory
3. **Profile your specific hardware**: Performance varies by GPU architecture
4. **Cache engines**: Avoid rebuild overhead in production

### Next Steps

- Test with your own models and data
- Experiment with different batch sizes
- Profile on target deployment hardware
- Integrate into production inference pipeline

## Cleanup

Clean up GPU memory and resources.

In [None]:
# Cleanup
import gc

# Clear PyTorch cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Force garbage collection
gc.collect()

print("Cleanup complete!")