# Week 13 Lab: Profiling & Optimization

**CS 203: Software Tools and Techniques for AI**

---

## Lab Overview

In this lab, you will learn to:
1. **Profile training loops** to find bottlenecks
2. **Optimize data loading** with DataLoader settings
3. **Use mixed precision** training (AMP)
4. **Apply torch.compile** for faster execution

**Goal**: Optimize a ResNet training loop from baseline to 2-3x faster.

---

## Setup

In [None]:
# Install required packages
!pip install torch torchvision matplotlib pandas numpy

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
from torch.profiler import profile, record_function, ProfilerActivity
from torch.cuda.amp import autocast, GradScaler  # Recent PyTorch releases warn that this path is deprecated in favor of torch.amp
import time
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---

# Part 1: Baseline Measurement

Establish a baseline before optimizing.

```
┌─────────────────────────────────────────────────────────┐
│                 Training Pipeline                       │
│                                                         │
│  Data Loading ──► Forward Pass ──► Backward ──► Update  │
│     (CPU)          (GPU)          (GPU)        (GPU)    │
│                                                         │
│  Bottleneck?      Bottleneck?   Bottleneck?  Bottleneck?│
└─────────────────────────────────────────────────────────┘
```
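
A baseline is only useful if the measurement itself is repeatable. The sketch below shows one way to time a step: it runs a few warm-up iterations, then reports the median of several timed runs. The helper name `benchmark` and its parameters are illustrative and not part of the lab. The optional `sync` argument is where you would pass `torch.cuda.synchronize` on a GPU, because CUDA kernels launch asynchronously.

```python
import statistics
import time


def benchmark(fn, warmup=3, repeats=10, sync=None):
    """Time fn() and return summary statistics in seconds.

    `benchmark` is a hypothetical helper for this lab, not a PyTorch API.
    Pass sync=torch.cuda.synchronize when fn launches GPU work.
    """
    for _ in range(warmup):          # Warm-up runs are not recorded
        fn()
    if sync is not None:
        sync()

    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:         # Wait for queued GPU work before stopping the clock
            sync()
        times.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(times),
        "min_s": min(times),
        "max_s": max(times),
    }


# Stand-in workload so the example runs without a GPU
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median: {stats['median_s'] * 1e3:.3f} ms")
```

The median is reported rather than the mean so that a single slow run, such as one interrupted by another process, does not skew the baseline.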

### Question 1.1 (Solved): Setup Training

In [None]:
# SOLVED EXAMPLE

# Hyperparameters
BATCH_SIZE = 32
NUM_EPOCHS = 1
NUM_WORKERS = 0  # Start with 0 for baseline
IMG_SIZE = 224

# Data preprocessing
transform = transforms.Compose([
    transforms.RandomResizedCrop(IMG_SIZE),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Load CIFAR-10
train_dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Baseline DataLoader
train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=NUM_WORKERS,
    pin_memory=False  # Baseline without optimization
)

# Model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # `pretrained=True` is deprecated in torchvision 0.13+
model.fc = nn.Linear(model.fc.in_features, 10)  # CIFAR-10 has 10 classes
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

print(f"Dataset size: {len(train_dataset)}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Batches per epoch: {len(train_loader)}")

### Question 1.2 (Solved): Baseline Training Loop

In [None]:
# SOLVED EXAMPLE

def train_epoch(model, dataloader, criterion, optimizer, device):
    """Train for one epoch and return metrics."""
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    batch_times = []
    
    # Start the clock before the first batch is fetched so that
    # data-loading time counts toward each batch's duration.
    batch_start = time.perf_counter()
    for batch_idx, (images, labels) in enumerate(dataloader):
        images, labels = images.to(device), labels.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        # Metrics (loss.item() also waits for queued GPU work to finish)
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
        
        batch_time = time.perf_counter() - batch_start
        batch_times.append(batch_time)
        
        if batch_idx % 100 == 0:
            print(f"Batch {batch_idx}/{len(dataloader)}, "
                  f"Loss: {loss.item():.4f}, "
                  f"Batch time: {batch_time*1000:.2f}ms")
        
        # Restart the clock so the next batch's time includes its fetch
        batch_start = time.perf_counter()
    
    return {
        'loss': total_loss / len(dataloader),
        'accuracy': 100. * correct / total,
        'throughput': total / sum(batch_times),
        'batch_times': batch_times
    }

In [None]:
# Run baseline training
print("="*50)
print("BASELINE TRAINING")
print("="*50)

start_time = time.perf_counter()  # Monotonic clock, consistent with the per-batch timing
baseline_metrics = train_epoch(model, train_loader, criterion, optimizer, device)
baseline_time = time.perf_counter() - start_time

print(f"\nBaseline Results:")
print(f"  Epoch time: {baseline_time:.2f}s")
print(f"  Throughput: {baseline_metrics['throughput']:.2f} samples/sec")
print(f"  Accuracy: {baseline_metrics['accuracy']:.2f}%")

---

# Part 2: Data Loading Optimization

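Data loading is often the bottleneck: with `num_workers=0`, decoding and augmentation run in the main process, so the GPU sits idle while each batch is prepared.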
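
To check whether data loading really is the bottleneck, it helps to separate the time spent waiting for a batch from the time spent processing it. The sketch below does this for any iterable. `time_loader_vs_step` and `slow_batches` are hypothetical names, and the sleeping generator stands in for a slow `DataLoader` so the example runs without a dataset.

```python
import time


def time_loader_vs_step(batches, step_fn):
    """Return (seconds waiting for data, seconds inside step_fn).

    Hypothetical helper: `batches` is any iterable (for example a DataLoader)
    and `step_fn` processes one batch (for example a training step).
    """
    wait_s = 0.0
    step_s = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)   # Time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)               # Time spent here is compute
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
    return wait_s, step_s


def slow_batches(n, delay_s):
    """Stand-in for a slow data pipeline."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


wait_s, step_s = time_loader_vs_step(slow_batches(5, 0.01), lambda batch: batch * 2)
print(f"waiting on data: {wait_s:.3f}s, compute: {step_s:.3f}s")
```

On a GPU, `step_fn` should end with a synchronizing call such as `loss.item()` or `torch.cuda.synchronize()`. Otherwise unfinished GPU work is attributed to the next data wait.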

## 2.1 num_workers Experiment

### Question 2.1: Test Different num_workers

In [None]:
# YOUR CODE HERE
# Test num_workers = [0, 2, 4, 8] and record throughput

worker_configs = [0, 2, 4, 8]
worker_results = []

for num_workers in worker_configs:
    print(f"\nTesting num_workers={num_workers}")
    
    # Create dataloader with new config
    
    # Measure throughput (just iterate through data, no training)
    
    # Record results
    pass

# Plot results

### Question 2.2: Add pin_memory

In [None]:
# YOUR CODE HERE
# Compare with and without pin_memory=True


---

# Part 3: Mixed Precision Training (AMP)

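Automatic mixed precision (AMP) runs selected operations in FP16 while keeping the weights in FP32, which speeds up training and reduces memory use with minimal accuracy loss.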
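
Gradient scaling is part of AMP because very small gradients can underflow to zero in FP16. The sketch below illustrates the idea with the standard library's half-precision format, so it runs without a GPU. The gradient value and scale factor are illustrative choices, and `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8             # Illustrative small gradient value
scale = 2.0 ** 16       # Illustrative loss scale

unscaled = to_fp16(grad)                   # Too small for FP16: underflows to zero
recovered = to_fp16(grad * scale) / scale  # Scale up, store in FP16, unscale in higher precision

print(f"FP16 without scaling: {unscaled}")
print(f"FP16 with scaling:    {recovered:.3e}")
```

This mirrors, in simplified form, what `scaler.scale(loss).backward()` followed by `scaler.step(optimizer)` achieves: gradients are computed on a scaled loss and unscaled before the weight update.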

## 3.1 Enable AMP

### Question 3.1 (Solved): AMP Training Loop

In [None]:
# SOLVED EXAMPLE

def train_epoch_amp(model, dataloader, criterion, optimizer, device, use_amp=True):
    """Train with optional AMP."""
    model.train()
    scaler = GradScaler() if use_amp else None
    
    total_loss = 0
    correct = 0
    total = 0
    batch_times = []
    
    # Start the clock before the first fetch so data-loading time is included
    batch_start = time.perf_counter()
    for batch_idx, (images, labels) in enumerate(dataloader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        
        # Forward with autocast
        if use_amp:
            with autocast():
                outputs = model(images)
                loss = criterion(outputs, labels)
        else:
            outputs = model(images)
            loss = criterion(outputs, labels)
        
        # Backward with gradient scaling
        if use_amp:
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
        
        # Metrics (loss.item() also waits for queued GPU work to finish)
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
        batch_times.append(time.perf_counter() - batch_start)
        
        # Restart the clock so the next batch's time includes its fetch
        batch_start = time.perf_counter()
    
    return {
        'loss': total_loss / len(dataloader),
        'accuracy': 100. * correct / total,
        'throughput': total / sum(batch_times)
    }

### Question 3.2: Compare FP32 vs AMP

In [None]:
# YOUR CODE HERE
# Compare training with and without AMP
# Measure: throughput, memory usage, accuracy


---

# Part 4: torch.compile (PyTorch 2.0+)

Compile the model for optimized execution.
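
`torch.compile` compiles lazily: the first call pays the compilation cost, and later calls may recompile if input shapes change. When benchmarking, that first call should therefore be timed separately from steady-state iterations. The sketch below separates the two for any callable. `first_vs_steady` is a hypothetical helper, and `FakeCompiled` is a stand-in that simulates a one-time cost so the example runs without PyTorch.

```python
import time


def first_vs_steady(fn, steady_calls=20):
    """Return (first-call seconds, mean seconds of the following calls)."""
    start = time.perf_counter()
    fn()
    first_s = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(steady_calls):
        fn()
    steady_s = (time.perf_counter() - start) / steady_calls
    return first_s, steady_s


class FakeCompiled:
    """Stand-in for a compiled model: slow on the first call, fast afterwards."""

    def __init__(self):
        self.ready = False

    def __call__(self):
        if not self.ready:
            time.sleep(0.05)   # Simulated one-time compilation cost
            self.ready = True
        return sum(range(1000))


first_s, steady_s = first_vs_steady(FakeCompiled())
print(f"first call: {first_s * 1e3:.1f} ms, steady state: {steady_s * 1e3:.3f} ms")
```

With a real model, `fn` would run one forward pass on a fixed batch and, on a GPU, finish with `torch.cuda.synchronize()` so that the timing covers completed work.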

## 4.1 Using torch.compile

### Question 4.1: Apply torch.compile

In [None]:
# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

if hasattr(torch, 'compile'):
    # Create fresh model (any optimizer you benchmark with must be rebuilt for it)
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 10)
    model = model.to(device)
    
    # Compile model
    compiled_model = torch.compile(model, mode='default')
    print("Model compiled successfully!")
    
    # YOUR CODE HERE
    # Benchmark compiled vs non-compiled model
    
else:
    print("torch.compile requires PyTorch 2.0+")

---

# Part 5: Profiling with PyTorch Profiler

Use the profiler to identify bottlenecks.

## 5.1 Profile Training

### Question 5.1 (Solved): Profile Training Loop

In [None]:
# SOLVED EXAMPLE

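model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)
model = model.to(device)

# Rebuild the optimizer: the earlier one still holds the previous model's parameters
optimizer = optim.Adam(model.parameters(), lr=1e-4)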

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for batch_idx, (images, labels) in enumerate(train_loader):
        if batch_idx >= 10:  # Profile first 10 batches
            break
        
        images, labels = images.to(device), labels.to(device)
        
        with record_function("forward"):
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
        
        with record_function("backward"):
            loss.backward()
        
        with record_function("optimizer_step"):
            optimizer.step()

# Print top operations
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))

### Question 5.2: Analyze Profile Results

Based on the profiler output:
1. What is the most time-consuming operation?
2. Is host-to-device (HtoD) data transfer significant?
3. Where would you focus optimization efforts?

In [None]:
# YOUR ANALYSIS HERE
analysis = {
    "most_time_consuming_op": None,  # Fill in
    "data_transfer_significant": None,  # True or False
    "optimization_focus": None  # Describe
}

print(analysis)

---

# Part 6: Comprehensive Comparison

Compare all optimizations.
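
A comparison is easiest to read when every configuration is expressed relative to the baseline. The sketch below is one way to compute that. `add_speedups` is a hypothetical helper, and the throughput values are placeholders, not measurements, so replace them with the numbers you collect.

```python
def add_speedups(results, baseline_name="Baseline"):
    """Attach a 'speedup' field relative to the named baseline configuration."""
    baseline = next(r for r in results if r["name"] == baseline_name)
    return [
        {**r, "speedup": r["throughput"] / baseline["throughput"]}
        for r in results
    ]


# Placeholder throughputs (samples/sec), NOT measurements: replace with your own results
example_results = [
    {"name": "Baseline", "throughput": 100.0},
    {"name": "Optimized DataLoader", "throughput": 150.0},
    {"name": "+ AMP", "throughput": 240.0},
]

for row in add_speedups(example_results):
    print(f"{row['name']:<22} {row['throughput']:>8.1f} samples/s  {row['speedup']:.2f}x")
```

The returned list of dictionaries can be passed directly to `pd.DataFrame(...)` if you prefer to plot from a DataFrame.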

## 6.1 Final Benchmark

### Question 6.1: Create Comparison Dashboard

In [None]:
# YOUR CODE HERE
# Run all configurations and create a comparison

configs = [
    {'name': 'Baseline', 'num_workers': 0, 'pin_memory': False, 'use_amp': False},
    {'name': 'Optimized DataLoader', 'num_workers': 4, 'pin_memory': True, 'use_amp': False},
    {'name': '+ AMP', 'num_workers': 4, 'pin_memory': True, 'use_amp': True},
]

results = []

for config in configs:
    print(f"\nTesting: {config['name']}")
    # Run training and collect metrics
    pass

# Create comparison chart

---

# Summary

In this lab, you learned:

1. **Baseline measurement**: Establish metrics before optimizing
2. **DataLoader optimization**: num_workers, pin_memory
3. **Mixed precision (AMP)**: roughly 1.5-2.5x speedup on supported GPUs, with minimal accuracy loss
4. **torch.compile**: PyTorch 2.0 compilation
5. **Profiling**: Finding bottlenecks with PyTorch Profiler

## Key Takeaways

| Optimization | Typical Speedup (hardware-dependent) | Complexity |
|--------------|------------------|------------|
| num_workers | 1.5-2x | Low |
| pin_memory | 1.1-1.3x | Low |
| AMP | 1.5-2.5x | Medium |
| torch.compile | 1.2-1.5x | Low |

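**Combined**: roughly 2-3x total speedup. The individual gains overlap rather than multiply, and the exact figure depends on your hardware.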
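
The combined figure is smaller than the product of the individual rows because each optimization shifts the bottleneck. The toy model below assumes that background workers let data loading overlap fully with GPU compute, so a step takes as long as the slower of the two. `step_time` is a hypothetical helper, and all numbers are made up for illustration.

```python
def step_time(load_ms, compute_ms, overlapped):
    """Simplified per-batch time: loading and compute either overlap or run back to back."""
    return max(load_ms, compute_ms) if overlapped else load_ms + compute_ms


# Made-up numbers for illustration only, not measurements
load_ms, compute_ms = 30.0, 40.0

baseline = step_time(load_ms, compute_ms, overlapped=False)          # 70 ms: load, then compute
with_workers = step_time(load_ms, compute_ms, overlapped=True)       # 40 ms: compute dominates
with_amp_too = step_time(load_ms, compute_ms / 2, overlapped=True)   # 30 ms: loading now dominates

print(f"workers only:  {baseline / with_workers:.2f}x")
print(f"workers + AMP: {baseline / with_amp_too:.2f}x")
```

In this toy case, halving the compute time adds far less than a further 2x, because data loading becomes the limiting stage once compute is faster.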

---

## Submission

Submit:
1. This completed notebook
2. Comparison chart showing speedups
3. Brief report: Which optimizations gave the best results on your hardware?