# Lab 1.3.5: Profiling Workshop

**Module:** 1.3 - CUDA Python & GPU Programming  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Profile GPU code using PyTorch Profiler
- [ ] Understand Nsight Systems timeline analysis
- [ ] Identify common bottlenecks (data loading, CPU↔GPU sync)
- [ ] Apply optimizations based on profiling results

---

## 📚 Prerequisites

- Completed: Labs 1.3.1-1.3.4
- Knowledge of: Basic PyTorch training loops

---

## 🌍 Real-World Context

**"You can't improve what you can't measure."** - a maxim often attributed to Peter Drucker

Professional ML engineers spend significant time profiling and optimizing:

| Company | Optimization Impact |
|---------|--------------------|
| OpenAI | Reduced GPT-4 training costs by 3x through optimizations |
| Google | TensorFlow optimization team saves millions in compute |
| Meta | PyTorch 2.0 compile gives 2x speedup on many workloads |

**Common bottlenecks discovered through profiling:**
- Data loading: as much as 40% of training time wasted waiting for data in an untuned pipeline
- CPU↔GPU transfers: Unnecessary synchronization points
- Memory copies: Data not staying on GPU between operations
- Kernel launch overhead: Too many small kernels
- Suboptimal algorithms: Wrong approach for the hardware

**NVIDIA provides world-class profiling tools:**
- **Nsight Systems**: Timeline view, CPU↔GPU interactions
- **Nsight Compute**: Deep kernel analysis
- **PyTorch Profiler**: Operator-level profiling from Python; collects GPU kernel timings through NVIDIA's CUPTI

---

## 🧒 ELI5: What is Profiling?

> **Imagine you're a detective** investigating why dinner takes 2 hours to cook when it should take 30 minutes.
>
> You set up cameras in the kitchen and watch the recording:
> - "Hmm, you spent 45 minutes looking for the spatula..."
> - "You waited 30 minutes for the oven to preheat... while doing nothing!"
> - "The rice cooker was done, but you didn't notice for 20 minutes."
>
> **Profiling** is like having cameras in your code. It shows:
> - Which parts take the most time (the "hot spots")
> - Where you're waiting unnecessarily (synchronization)
> - What resources are being used (memory, compute)
>
> Once you SEE the problem, you can FIX it!

### Profiling Tool Hierarchy

```
Level 1: High-Level Overview (Nsight Systems)
├── What is CPU doing? What is GPU doing?
├── Are they working in parallel or waiting for each other?
├── Where are the gaps (idle time)?
└── Timeline view: see everything at a glance

Level 2: Kernel Deep Dive (Nsight Compute)
├── How efficient is this specific kernel?
├── Is it memory-bound or compute-bound?
├── Are there bank conflicts or uncoalesced access?
└── What's the occupancy?

Level 3: Code-Level (PyTorch Profiler)
├── Which PyTorch operations are slow?
├── How much time in forward vs backward?
├── Memory usage over time
└── Easy to integrate in Python code
```
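
Before reaching for any of these tools, the "cameras in your code" idea can be sketched with plain Python timers. The `SectionTimer` class below is a hypothetical helper written for this illustration (it is not part of PyTorch or Nsight), and the `time.sleep` calls stand in for real work. Wall-clock timers like this only see the CPU side, which is one reason the dedicated profilers above exist.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time per named section (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds per section
        self.calls = defaultdict(int)     # number of times each section ran

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def hot_spot(self):
        """Name of the section with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


timer = SectionTimer()
for _ in range(3):
    with timer.section("find_spatula"):
        time.sleep(0.02)    # stand-in for slow work
    with timer.section("cook"):
        time.sleep(0.005)   # stand-in for fast work

for name, total in sorted(timer.totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:<15} {timer.calls[name]} calls  {total * 1e3:.1f} ms")
print("Hot spot:", timer.hot_spot())
```

The section with the largest accumulated time is the first place to look, which is the same reading strategy used for the profiler tables later in this lab.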

---

## Part 0: Environment Setup

In [None]:
import numpy as np
import time
import os
from typing import List, Tuple
import warnings
warnings.filterwarnings('ignore')

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

print(f"✅ PyTorch {torch.__version__}")
print(f"   CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   Device: {torch.cuda.get_device_name()}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# PyTorch Profiler - with fallback for older PyTorch versions
try:
    from torch.profiler import profile, record_function, ProfilerActivity
    HAS_PROFILER = True
    print(f"\n✅ PyTorch Profiler available")
except ImportError:
    HAS_PROFILER = False
    print(f"\n⚠️ PyTorch Profiler not available (requires PyTorch >= 1.8)")
    print("   Some profiling examples will use basic timing instead.")
    print("   For full profiler support, use NGC container: nvcr.io/nvidia/pytorch:25.11-py3")
    
    # Create dummy classes for compatibility
    class ProfilerActivity:
        CPU = "cpu"
        CUDA = "cuda"
    
    from contextlib import contextmanager
    @contextmanager
    def profile(*args, **kwargs):
        yield None
    
    @contextmanager  
    def record_function(name):
        yield

# Check for Nsight tools on PATH (shutil.which avoids spawning a shell)
import shutil
nsight_sys = shutil.which('nsys') is not None
nsight_compute = shutil.which('ncu') is not None
print(f"\n   Nsight Systems available: {'✅' if nsight_sys else '❌'}")
print(f"   Nsight Compute available: {'✅' if nsight_compute else '❌'}")

---

## Part 1: Creating a Training Pipeline to Profile

Let's create a realistic but simple training pipeline with intentional bottlenecks.

In [None]:
class SimpleNet(nn.Module):
    """Simple neural network for profiling demo."""
    def __init__(self, input_dim: int = 784, hidden_dim: int = 256, output_dim: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )
    
    def forward(self, x):
        return self.layers(x)


# Create synthetic dataset (like MNIST but simpler)
def create_synthetic_data(n_samples: int = 60000, input_dim: int = 784, n_classes: int = 10):
    """Create synthetic classification data."""
    X = torch.randn(n_samples, input_dim)
    y = torch.randint(0, n_classes, (n_samples,))
    return TensorDataset(X, y)


# Training function with profiling hooks
def train_epoch_slow(model, dataloader, criterion, optimizer, device):
    """
    Training loop with INTENTIONAL BOTTLENECKS for demonstration.
    
    Bottlenecks:
    1. Blocking data transfers from pageable (unpinned) memory
    2. Unnecessary synchronization
    3. CPU operations mixed with GPU
    """
    model.train()
    total_loss = 0.0
    
    for batch_idx, (data, target) in enumerate(dataloader):
        # BOTTLENECK 1: Blocking transfer from pageable memory (fix: pin_memory + non_blocking=True)
        data = data.to(device)
        target = target.to(device)
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        
        # BOTTLENECK 2: Unnecessary .item() causes CPU-GPU sync
        batch_loss = loss.item()  # Forces synchronization!
        total_loss += batch_loss
        
        loss.backward()
        optimizer.step()
        
        # BOTTLENECK 3: CPU computation in hot loop
        if batch_idx % 100 == 0:
            # This forces another sync and wastes time
            accuracy = (output.argmax(1) == target).float().mean().item()
            print(f"Batch {batch_idx}, Loss: {batch_loss:.4f}, Acc: {accuracy:.2%}", end='\r')
    
    return total_loss / len(dataloader)


print("✅ Training components defined")

---

## Part 2: PyTorch Profiler Basics

PyTorch's built-in profiler is the easiest way to start profiling.

In [None]:
# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Small dataset for quick profiling
dataset = create_synthetic_data(n_samples=10000)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)

print(f"Dataset: {len(dataset)} samples")
print(f"Batches per epoch: {len(dataloader)}")
print(f"Device: {device}")

In [None]:
# Profile the slow training function
print("📊 Profiling SLOW training loop...")
print("="*60)

if HAS_PROFILER:
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        # Run one epoch
        loss = train_epoch_slow(model, dataloader, criterion, optimizer, device)

    print(f"\n\nEpoch complete. Average loss: {loss:.4f}")

    # Print profiling results
    print("\n" + "="*60)
    print("📊 Top 20 Operations by CUDA Time")
    print("="*60)
    print(prof.key_averages().table(
        sort_by="cuda_time_total", 
        row_limit=20
    ))
else:
    # Fallback: basic timing
    print("(Using basic timing - install PyTorch >= 1.8 for full profiler)")
    start = time.perf_counter()
    loss = train_epoch_slow(model, dataloader, criterion, optimizer, device)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"\n\nEpoch complete. Average loss: {loss:.4f}")
    print(f"Total time: {elapsed:.3f} seconds")

In [None]:
# Also show CPU time
if HAS_PROFILER:
    print("\n" + "="*60)
    print("📊 Top 20 Operations by CPU Time")
    print("="*60)
    print(prof.key_averages().table(
        sort_by="cpu_time_total", 
        row_limit=20
    ))
else:
    print("(Skipped - PyTorch Profiler not available)")

### 🔍 Reading the Profiler Output

**Key columns:**
- **Name**: The operation (e.g., `aten::linear`, `aten::to`)
- **Self CPU / Self CUDA**: Time spent in this operation only
- **CPU / CUDA total**: Time including child operations
- **# Calls**: How many times it was called

**What to look for:**
1. **`aten::to` / `aten::copy_`**: Data transfers (and dtype casts) - should be minimized
2. **`cudaStreamSynchronize`**: Blocking synchronizations
3. **High call counts**: Potential for batching
4. **CPU-heavy operations**: Move to GPU if possible
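
The same columns can also be read programmatically instead of from the printed table. The sketch below profiles a small matrix multiply with CPU activity only, so it also runs on a machine without a GPU; the tensor sizes and the loop count are arbitrary choices made for illustration.

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(256, 256)
b = torch.randn(256, 256)

# CPU-only activity so the sketch does not require CUDA
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        torch.mm(a, b)

# key_averages() groups recorded events by operator name
rows = {evt.key: evt for evt in prof.key_averages()}

# Print the five operators with the largest self CPU time (values are in microseconds)
top = sorted(rows.items(), key=lambda kv: -kv[1].self_cpu_time_total)[:5]
for name, evt in top:
    print(f"{name:<30} calls={evt.count:<4} "
          f"self_cpu={evt.self_cpu_time_total:.0f}us "
          f"cpu_total={evt.cpu_time_total:.0f}us")
```

Here `count` corresponds to the **# Calls** column, `self_cpu_time_total` to **Self CPU**, and `cpu_time_total` to **CPU total**.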

---

## Part 3: Using record_function for Custom Labels

Add custom labels to understand what each part of your code does.

In [None]:
def train_epoch_labeled(model, dataloader, criterion, optimizer, device):
    """
    Training loop with labeled sections for better profiling.
    """
    model.train()
    total_loss = 0.0
    
    for batch_idx, (data, target) in enumerate(dataloader):
        with record_function("DATA_TRANSFER"):
            data = data.to(device)
            target = target.to(device)
        
        with record_function("FORWARD_PASS"):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
        
        with record_function("SYNC_AND_LOG"):
            batch_loss = loss.item()
            total_loss += batch_loss
        
        with record_function("BACKWARD_PASS"):
            loss.backward()
        
        with record_function("OPTIMIZER_STEP"):
            optimizer.step()
    
    return total_loss / len(dataloader)


# Profile with labels
print("📊 Profiling with LABELED sections...")
print("="*60)

if HAS_PROFILER:
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        loss = train_epoch_labeled(model, dataloader, criterion, optimizer, device)

    print(f"\nEpoch complete. Average loss: {loss:.4f}")

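    # Show only our custom labels by filtering the aggregated events
    custom_labels = {"DATA_TRANSFER", "FORWARD_PASS", "SYNC_AND_LOG",
                     "BACKWARD_PASS", "OPTIMIZER_STEP"}
    print("\n" + "="*60)
    print("📊 Custom Section Timings")
    print("="*60)
    # Note: these are CPU-side times. GPU work is asynchronous, so SYNC_AND_LOG
    # also absorbs the wait for kernels queued during FORWARD_PASS.
    print(f"{'Section':<18} {'Calls':>6} {'CPU total (ms)':>15}")
    for evt in prof.key_averages():
        if evt.key in custom_labels:
            # cpu_time_total is reported in microseconds
            print(f"{evt.key:<18} {evt.count:>6} {evt.cpu_time_total / 1e3:>15.1f}")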
else:
    # Fallback
    print("(Using basic timing - record_function labels require PyTorch Profiler)")
    start = time.perf_counter()
    loss = train_epoch_labeled(model, dataloader, criterion, optimizer, device)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"\nEpoch complete. Average loss: {loss:.4f}")
    print(f"Total time: {elapsed:.3f} seconds")
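
Profiling a whole epoch records every step, which makes traces large and slows the run. One alternative is to record only a few steps with `torch.profiler.schedule`. The sketch below uses a tiny `nn.Linear` model and CPU activity only so that it runs anywhere; the step counts and the `on_ready` callback name are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, schedule, ProfilerActivity

model = nn.Linear(64, 64)
summaries = []


def on_ready(prof):
    # Called each time an "active" window finishes recording
    summaries.append(
        prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
    )


with profile(
    activities=[ProfilerActivity.CPU],
    # Skip 1 step, warm up the profiler for 1 step, record 2 steps, do this once
    schedule=schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=on_ready,
) as prof:
    for step in range(6):
        model(torch.randn(32, 64)).sum().backward()
        prof.step()  # tell the profiler that one training step has finished

print(summaries[0])
```

In a real training loop, `prof.step()` would go at the end of each batch iteration, and `ProfilerActivity.CUDA` would be added to `activities` on a GPU machine.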

---

## Part 4: Identifying and Fixing Bottlenecks

Now let's create an optimized version and compare.

In [None]:
def train_epoch_optimized(model, dataloader, criterion, optimizer, device):
    """
    OPTIMIZED training loop.
    
    Fixes:
    1. Use non_blocking transfers (async)
    2. Avoid unnecessary .item() calls
    3. Accumulate loss on GPU, sync only at end
    """
    model.train()
    total_loss = torch.tensor(0.0, device=device)  # Keep on GPU!
    
    for batch_idx, (data, target) in enumerate(dataloader):
        # FIX 1: non_blocking=True allows async transfer
        data = data.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)
        
        optimizer.zero_grad(set_to_none=True)  # Frees grads instead of zeroing them (default since PyTorch 2.0)
        output = model(data)
        loss = criterion(output, target)
        
        # FIX 2: Accumulate on GPU, no sync!
        total_loss += loss.detach()  # detach to avoid keeping graph
        
        loss.backward()
        optimizer.step()
    
    # FIX 3: Only sync at the very end
    return (total_loss / len(dataloader)).item()


# Create optimized dataloader with pin_memory
dataloader_optimized = DataLoader(
    dataset, 
    batch_size=64, 
    shuffle=True, 
    num_workers=2,      # Parallel data loading
    pin_memory=True,    # Faster CPU->GPU transfer
    persistent_workers=True  # Keep workers alive
)

print("📊 Profiling OPTIMIZED training loop...")
print("="*60)

# Warm up
_ = train_epoch_optimized(model, dataloader_optimized, criterion, optimizer, device)

if HAS_PROFILER:
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof_opt:
        loss = train_epoch_optimized(model, dataloader_optimized, criterion, optimizer, device)

    print(f"\nEpoch complete. Average loss: {loss:.4f}")

    print("\n" + "="*60)
    print("📊 Optimized - Top Operations")
    print("="*60)
    print(prof_opt.key_averages().table(
        sort_by="cuda_time_total", 
        row_limit=15
    ))
else:
    start = time.perf_counter()
    loss = train_epoch_optimized(model, dataloader_optimized, criterion, optimizer, device)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"\nEpoch complete. Average loss: {loss:.4f}")
    print(f"Optimized time: {elapsed:.3f} seconds")

### Comparing Slow vs Optimized

In [None]:
# Benchmark both versions
print("⏱️  Benchmarking: Slow vs Optimized")
print("="*60)

# Reset model
model = SimpleNet().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Time slow version
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(3):
    _ = train_epoch_slow(model, dataloader, criterion, optimizer, device)
torch.cuda.synchronize()
time_slow = (time.perf_counter() - start) / 3

# Reset model
model = SimpleNet().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Time optimized version
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(3):
    _ = train_epoch_optimized(model, dataloader_optimized, criterion, optimizer, device)
torch.cuda.synchronize()
time_optimized = (time.perf_counter() - start) / 3

print(f"\n\nSlow version:      {time_slow:.3f} seconds/epoch")
print(f"Optimized version: {time_optimized:.3f} seconds/epoch")
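# A ratio below 1.0 means the "optimized" loop was slower on this machine
# (worker-process overhead can dominate for a small in-memory dataset)
print(f"\n🚀 Speedup: {time_slow/time_optimized:.2f}x")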
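
Wall-clock timing with `time.perf_counter()` needs explicit `torch.cuda.synchronize()` calls around the measured region. An alternative for timing GPU work is a pair of CUDA events, which are timestamped on the GPU stream itself. The helper below, `time_per_call_ms`, is a hypothetical name used for this sketch; it falls back to `perf_counter` when no GPU is present, and the matrix size and iteration counts are arbitrary.

```python
import time
import torch


def time_per_call_ms(fn, iters=20, warmup=3):
    """Average milliseconds per call of fn (hypothetical helper)."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()        # drain any pending work first
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()        # wait until the `end` event has completed
        return start.elapsed_time(end) / iters  # elapsed_time() returns milliseconds
    # CPU fallback: plain wall-clock timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1e3 / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(512, 512, device=device)
ms = time_per_call_ms(lambda: x @ x)
print(f"matmul: {ms:.3f} ms per call on {device}")
```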

---

## Part 5: Memory Profiling

Understanding memory usage is crucial for fitting large models.

In [None]:
def profile_memory():
    """Profile memory usage during training."""
    # Reset
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    
    # Create larger model for more interesting memory profile
    model_large = nn.Sequential(
        nn.Linear(784, 2048),
        nn.ReLU(),
        nn.Linear(2048, 2048),
        nn.ReLU(),
        nn.Linear(2048, 2048),
        nn.ReLU(),
        nn.Linear(2048, 10)
    ).to(device)
    
    optimizer_large = optim.Adam(model_large.parameters(), lr=0.001)
    
    # Large batch for memory pressure
    dataset_large = create_synthetic_data(n_samples=10000)
    dataloader_large = DataLoader(dataset_large, batch_size=512, shuffle=True)
    
    print("📊 Memory Usage During Training")
    print("="*60)
    
    # Track memory at each stage
    stages = []
    
    # After model creation
    stages.append(("After model creation", torch.cuda.memory_allocated() / 1e6))
    
    # Get one batch
    data, target = next(iter(dataloader_large))
    data, target = data.to(device), target.to(device)
    stages.append(("After data transfer", torch.cuda.memory_allocated() / 1e6))
    
    # Forward pass
    output = model_large(data)
    stages.append(("After forward pass", torch.cuda.memory_allocated() / 1e6))
    
    # Compute loss
    loss = nn.CrossEntropyLoss()(output, target)
    stages.append(("After loss computation", torch.cuda.memory_allocated() / 1e6))
    
    # Backward pass
    loss.backward()
    stages.append(("After backward pass", torch.cuda.memory_allocated() / 1e6))
    
    # Optimizer step
    optimizer_large.step()
    stages.append(("After optimizer step", torch.cuda.memory_allocated() / 1e6))
    
    # Clear gradients
    optimizer_large.zero_grad(set_to_none=True)
    stages.append(("After zero_grad", torch.cuda.memory_allocated() / 1e6))
    
    # Print
    print(f"\n{'Stage':<30} {'Memory (MB)':<15} {'Delta (MB)':<15}")
    print("-"*60)
    prev = 0
    for stage, mem in stages:
        delta = mem - prev
        print(f"{stage:<30} {mem:<15.1f} {delta:>+14.1f}")
        prev = mem
    
    print(f"\n📊 Peak memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
    print(f"   Current memory: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    
    # Cleanup
    del model_large, optimizer_large, data, target, output, loss
    torch.cuda.empty_cache()

profile_memory()

### 🔍 Memory Profile Insights

**Memory breakdown:**
1. **Model parameters**: Constant overhead (weights and biases)
2. **Activations**: Forward pass saves intermediate results for backward
3. **Gradients**: Same size as parameters
4. **Optimizer states**: Adam uses 2x parameters (momentum + variance)
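
The parameter, gradient, and optimizer-state items in this breakdown can be checked by hand by counting parameter bytes. The sketch below rebuilds the same layer sizes as `model_large` from the memory-profiling cell, on the CPU so that no GPU is needed. The figure reported by `torch.cuda.memory_allocated()` may be slightly higher, because the caching allocator rounds allocation sizes up.

```python
import torch.nn as nn

# Same layer sizes as model_large in the memory-profiling cell
model_large = nn.Sequential(
    nn.Linear(784, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 10),
)

n_params = sum(p.numel() for p in model_large.parameters())
# element_size() is the number of bytes per value (4 for the default FP32)
param_bytes = sum(p.numel() * p.element_size() for p in model_large.parameters())

print(f"Parameters:        {n_params:,}")
print(f"Weights (FP32):    {param_bytes / 1e6:.1f} MB")
print(f"Gradients (FP32):  {param_bytes / 1e6:.1f} MB")      # same size as parameters
print(f"Adam states (2x):  {2 * param_bytes / 1e6:.1f} MB")  # momentum + variance
```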

**Memory estimation formula:**
```
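Training memory ≈ Parameters × 16-20 bytes (mixed precision with FP32 master weights and Adam)
                  or
                  Parameters × 16 bytes (pure FP32 with Adam: 4 weights + 4 gradients + 8 optimizer states)
                  (activations not included)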
```

> **🚀 DGX Spark Advantage:** With 128GB of **unified memory**, the CPU and GPU share the same physical memory pool. This eliminates explicit memory copies for large tensors that exceed traditional GPU VRAM. The system handles page migration transparently, though explicit `.to(device)` calls are still recommended for optimal performance.

For a 7B parameter model trained entirely in BF16:
- Parameters: 7B × 2 bytes (BF16) = 14 GB  ← Use BF16 on Blackwell for native Tensor Core support!
- Gradients: 7B × 2 bytes = 14 GB
- Adam states: 2 × 14 GB = 28 GB (56 GB if the optimizer keeps them in FP32)
- Activations: Variable (depends on batch size, sequence length)
- **Total: 56+ GB** before counting activations!
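
The arithmetic above can be wrapped in a small calculator to try other model sizes and precisions. `training_memory_gb` is a hypothetical helper written for this sketch; it counts weights, gradients, and two Adam states per parameter, uses 1 GB = 1e9 bytes, and ignores activations entirely.

```python
def training_memory_gb(n_params, param_bytes=2, grad_bytes=2, adam_state_bytes=2):
    """Rough weights + gradients + Adam-state footprint in GB (hypothetical helper).

    Adam keeps two states per parameter (momentum and variance).
    Activations are not included.
    """
    per_param = param_bytes + grad_bytes + 2 * adam_state_bytes
    return n_params * per_param / 1e9


# 7B parameters, everything in BF16: 2 + 2 + 2*2 = 8 bytes per parameter
bf16_gb = training_memory_gb(7e9)

# 7B parameters, everything in FP32: 4 + 4 + 2*4 = 16 bytes per parameter
fp32_gb = training_memory_gb(7e9, param_bytes=4, grad_bytes=4, adam_state_bytes=4)

print(f"7B, all BF16: {bf16_gb:.0f} GB")
print(f"7B, all FP32: {fp32_gb:.0f} GB")
```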

---

## Part 6: Nsight Systems Command Line

For deeper analysis, use NVIDIA's Nsight Systems from the command line.

In [None]:
# Create a standalone script for Nsight profiling
nsight_script = '''#!/usr/bin/env python3
"""Script to profile with Nsight Systems."""

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Same model as before
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )
    
    def forward(self, x):
        return self.layers(x)

def main():
    device = torch.device('cuda')
    model = SimpleNet().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())
    
    # Synthetic data
    X = torch.randn(10000, 784)
    y = torch.randint(0, 10, (10000,))
    dataset = TensorDataset(X, y)
    dataloader = DataLoader(dataset, batch_size=64, shuffle=True, 
                           num_workers=2, pin_memory=True)
    
    # Train 3 epochs
    for epoch in range(3):
        model.train()
        for data, target in dataloader:
            data = data.to(device, non_blocking=True)
            target = target.to(device, non_blocking=True)
            
            optimizer.zero_grad(set_to_none=True)
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch+1} complete")

if __name__ == '__main__':
    main()
'''

# Save the script
script_path = '/tmp/nsight_demo.py'
with open(script_path, 'w') as f:
    f.write(nsight_script)

print(f"Script saved to: {script_path}")
print("\n📊 Nsight Systems Usage")
print("="*60)
print()
print("To profile this script, run from terminal:")
print()
print(f"  nsys profile -o training_report python {script_path}")
print()
print("Useful nsys options:")
print("  --trace=cuda,nvtx,osrt       # What to trace")
print("  --cuda-memory-usage=true     # Track memory")
print("  --python-backtrace=cuda      # Python stack traces")
print("  --sample=cpu                 # CPU sampling")
print()
print("View the report:")
print("  nsys-ui training_report.nsys-rep  # GUI viewer")
print("  nsys stats training_report.nsys-rep  # CLI stats")
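
The `nsys` command printed above can also be assembled and launched from Python. The sketch below only uses options already listed in this section; `build_nsys_command` and the `RUN_NSYS` switch are hypothetical names introduced for illustration, and the profiling run is left disabled by default because it needs Nsight Systems and a CUDA GPU.

```python
import shlex
import shutil
import subprocess


def build_nsys_command(script, output="training_report",
                       trace=("cuda", "nvtx", "osrt")):
    """Assemble the nsys invocation as an argument list (hypothetical helper)."""
    return [
        "nsys", "profile",
        "-o", output,
        f"--trace={','.join(trace)}",
        "python", script,
    ]


cmd = build_nsys_command("/tmp/nsight_demo.py")
print(shlex.join(cmd))  # the command as it would be typed in a terminal

RUN_NSYS = False  # set to True on a machine with Nsight Systems and a GPU
if RUN_NSYS and shutil.which("nsys"):
    # check=True raises CalledProcessError if nsys exits with an error
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems if the script path contains spaces.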

---

## Part 7: Common Bottlenecks Checklist

Use this checklist when profiling your own code.

In [None]:
print("📋 GPU Performance Optimization Checklist")
print("="*60)

checklist = [
    ("Data Loading", [
        "✓ Using num_workers > 0 in DataLoader?",
        "✓ Using pin_memory=True for CUDA?",
        "✓ Using persistent_workers=True?",
        "✓ Data preprocessing on GPU (CuPy) where possible?",
    ]),
    ("Data Transfer", [
        "✓ Using non_blocking=True for .to(device)?",
        "✓ Minimizing CPU↔GPU transfers?",
        "✓ Keeping tensors on GPU between operations?",
        "✓ Using DLPack for zero-copy framework interop?",
    ]),
    ("Synchronization", [
        "✓ Avoiding unnecessary .item() or .cpu() in training loop?",
        "✓ Accumulating metrics on GPU?",
        "✓ Only syncing for logging/checkpointing?",
        "✓ Using async operations where possible?",
    ]),
    ("Memory", [
        "✓ Using set_to_none=True in zero_grad()?",
        "✓ Using gradient checkpointing for large models?",
        "✓ Using mixed precision (AMP)?",
        "✓ Clearing cache between experiments?",
    ]),
    ("Compute", [
        "✓ Using torch.compile() for PyTorch 2.0+?",
        "✓ Batch size large enough for GPU utilization?",
        "✓ Using fused optimizers (e.g., Adam(fused=True) or Apex FusedAdam)?",
        "✓ Enabling TensorFloat-32 on Ampere+?",
    ]),
]

for section, items in checklist:
    print(f"\n🔹 {section}")
    for item in items:
        print(f"   {item}")
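
Two of the checklist items, mixed precision (AMP) and TensorFloat-32, take only a line or two each. The sketch below is a minimal illustration under stated assumptions: a single `nn.Linear` layer stands in for a real model, BF16 is chosen as the autocast dtype, and the device falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 64).to(device_type)
x = torch.randn(32, 128, device=device_type)

# Lets FP32 matmuls use TF32 on hardware that supports it
torch.set_float32_matmul_precision("high")

# Mixed precision: eligible ops inside the context run in BF16
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = model(x)

print("Output dtype:   ", out.dtype)
print("Parameter dtype:", model.weight.dtype)  # parameters are not converted
```

Autocast changes the precision of individual operations, not of the stored parameters, which is why the optimizer still updates FP32 weights.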

---

## ⚠️ Common Mistakes

### Mistake 1: Profiling in Debug Mode

In [None]:
print("💡 Always profile in release/production mode!")
print()
print("   ❌ WRONG:")
print("      CUDA_LAUNCH_BLOCKING=1 python train.py  # Disables async!")
print("      torch.autograd.set_detect_anomaly(True)  # Huge slowdown!")
print()
print("   ✅ CORRECT:")
print("      python train.py  # Normal execution")
print("      # Or with profiling:")
print("      nsys profile python train.py")
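
A debug setting left over in the environment is easy to miss, so a quick check before profiling can help. `profiling_warnings` below is a hypothetical helper that inspects only the `CUDA_LAUNCH_BLOCKING` variable mentioned above; anomaly detection is a PyTorch runtime setting and would need a separate check.

```python
import os


def profiling_warnings(env=None):
    """Return warnings for settings that distort timings (hypothetical helper)."""
    env = os.environ if env is None else env
    found = []
    if env.get("CUDA_LAUNCH_BLOCKING", "0") == "1":
        found.append("CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous")
    return found


# Check the real environment of this process
for msg in profiling_warnings():
    print("WARNING:", msg)

# The same check against an explicit mapping, to show what a hit looks like
print(profiling_warnings({"CUDA_LAUNCH_BLOCKING": "1"}))
```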

### Mistake 2: Not Warming Up Before Benchmarking

In [None]:
print("💡 GPU kernels need warm-up!")
print()
print("   First iterations: CUDA context creation, library initialization, memory allocation")
print("   Subsequent iterations: cached and faster")
print()
print("   ✅ Always do 1-3 warm-up iterations before timing!")
print()
print("   Example:")
print("   # Warm up")
print("   for _ in range(3):")
print("       _ = model(dummy_input)")
print("   torch.cuda.synchronize()")
print("   ")
print("   # Now benchmark")
print("   start = time.perf_counter()")
print("   ...")

### Mistake 3: Measuring Only One Run

In [None]:
print("💡 Always measure multiple runs and report statistics!")
print()
print("   ❌ WRONG: 'It took 5.2 seconds'")
print()
print("   ✅ CORRECT: 'Mean: 5.2s ± 0.3s (n=10)'")
print()
print("   Example:")
print("   times = []")
print("   for _ in range(10):")
print("       torch.cuda.synchronize()  # drain pending GPU work first")
print("       start = time.perf_counter()")
print("       # ... operation ...")
print("       torch.cuda.synchronize()")
print("       times.append(time.perf_counter() - start)")
print("   print(f'Mean: {np.mean(times):.2f}s ± {np.std(times):.2f}s')")
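
The warm-up and multiple-run advice can be combined into one reusable function. `benchmark` below is a hypothetical helper written for this sketch; its `sync` argument is an optional callable, for example `torch.cuda.synchronize` when timing GPU code, and the summed-squares workload is only a CPU stand-in so the example runs anywhere.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=10, sync=None):
    """Time fn over several runs after warm-up (hypothetical helper).

    sync: optional callable that blocks until pending work is done,
          e.g. torch.cuda.synchronize for GPU code.
    """
    sync = sync or (lambda: None)
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        sync()                      # make sure nothing is still in flight
        start = time.perf_counter()
        fn()
        sync()                      # wait for the timed work to finish
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times), times


mean_s, std_s, times = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"Mean: {mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms (n={len(times)})")
```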

---

## ✋ Try It Yourself: Profile Your Own Code

**Challenge:** Create and profile a CNN training loop.

1. Create a simple CNN for image classification
2. Profile it using PyTorch Profiler
3. Identify the top 3 bottlenecks
4. Apply optimizations and measure improvement

In [None]:
# TODO: Create and profile a CNN

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # TODO: Define CNN layers
        # Hint: Use Conv2d, MaxPool2d, Linear
        pass
    
    def forward(self, x):
        # TODO: Implement forward pass
        pass


# Create synthetic image data
# Images: (batch, channels=3, height=32, width=32)
# Labels: (batch,) integers 0-9

# TODO: Create DataLoader with and without optimizations

# TODO: Profile and compare

---

## 🎉 Checkpoint

Congratulations! You've learned:

- ✅ **PyTorch Profiler** - Easy Python-native profiling
- ✅ **Custom labels** - `record_function` for clarity
- ✅ **Memory profiling** - Track GPU memory usage
- ✅ **Common bottlenecks** - Data loading, sync, transfers
- ✅ **Nsight Systems** - Command-line deep profiling
- ✅ **Optimization checklist** - Systematic approach

You can now identify and fix performance bottlenecks in GPU code!

---

## 📖 Further Reading

- [PyTorch Profiler Documentation](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
- [Nsight Systems Documentation](https://docs.nvidia.com/nsight-systems/)
- [Nsight Compute Documentation](https://docs.nvidia.com/nsight-compute/)
- [NVIDIA Deep Learning Performance Guide](https://docs.nvidia.com/deeplearning/performance/index.html)
- [PyTorch Performance Tuning Guide](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html)

---

## 🧹 Cleanup

In [None]:
import gc

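# Clean up (deleting dataloader_optimized also releases its persistent worker processes)
del model, optimizer, dataloader, dataloader_optimized, dataset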
gc.collect()
torch.cuda.empty_cache()

print("✅ GPU memory cleared!")
print("\n🎓 Module 1.3: CUDA Python & GPU Programming - COMPLETE!")
print("\n➡️ Next: Module 1.4: Mathematics for Deep Learning")