# Lab 2.1.4: Mixed Precision Training - Speed Up with AMP

**Module:** 2.1 - Deep Learning with PyTorch  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate-Advanced)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the difference between FP32, FP16, and BF16 precision
- [ ] Implement Automatic Mixed Precision (AMP) training
- [ ] Compare memory usage between precision modes
- [ ] Measure training speedup from mixed precision
- [ ] Handle gradient scaling to prevent underflow

---

## Prerequisites

- Completed: Tasks 6.1-6.3
- Knowledge of: Neural network training, floating-point representation

---

## Real-World Context

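Training large models is expensive. On GPUs with Tensor Cores, mixed precision training can deliver:
- **2-3x faster training** on modern GPUs
- **Up to 50% less memory** for activations, which are stored in 16-bit (under AMP the weights and their gradients stay in FP32)
- **The same accuracy** when done correctly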

All major AI labs use mixed precision:
- **OpenAI**: GPT models trained with FP16/BF16
- **Google**: BERT, T5 trained with BFloat16
- **Meta**: LLaMA models use mixed precision throughout

Your DGX Spark's Blackwell GPU has specialized Tensor Cores that make mixed precision even faster!

---

## ELI5: What is Mixed Precision?

> **Imagine you're doing math homework...** 📝
>
> For most problems, you round to 2 decimal places (3.14). That's fast and usually good enough.
>
> But sometimes you need more precision (3.14159265...) for the final answer to be correct.
>
> **Mixed precision is exactly this:**
> - Do most calculations with "rough" numbers (FP16/BF16) - it's faster!
> - Keep important things (like weight updates) in "precise" numbers (FP32)
> - Get the speed of rough math with the accuracy of precise math!
>
> **In AI terms:**
> - FP32 (32-bit): High precision, more memory, slower
> - FP16 (16-bit): Lower precision, less memory, faster (but can overflow!)
> - BF16 (16-bit): Same range as FP32, less precision, fast and stable

---

## Part 1: Understanding Floating-Point Formats

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.amp import autocast, GradScaler  # torch.amp.GradScaler needs a recent PyTorch release; older ones expose it as torch.cuda.amp.GradScaler
import torchvision
import torchvision.transforms as transforms

import time
import matplotlib.pyplot as plt
import numpy as np
from typing import Tuple, Dict, List

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    # Check for Tensor Core support
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute Capability: {major}.{minor}")
    print(f"Tensor Cores: {'Yes' if major >= 7 else 'No'}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# Compare floating-point formats
print("=== Floating-Point Format Comparison ===")
print()

formats = {
    'float32': torch.float32,
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
}

for name, dtype in formats.items():
    info = torch.finfo(dtype)
    print(f"{name:10s}: bits={info.bits:2d}, "
          f"range=[{info.min:.2e}, {info.max:.2e}], "
          f"tiny={info.tiny:.2e}, eps={info.eps:.2e}")

print("\n=== Memory Usage ===")
x = torch.randn(1000, 1000)
for name, dtype in formats.items():
    tensor = x.to(dtype)
    size_mb = tensor.element_size() * tensor.nelement() / 1e6
    print(f"{name:10s}: {size_mb:.2f} MB for 1M elements")

### Key Differences:

| Format | Bits | Exponent | Mantissa | Range | Use Case |
|--------|------|----------|----------|-------|----------|
| FP32 | 32 | 8 | 23 | ±3.4e38 | Default, master weights |
| FP16 | 16 | 5 | 10 | ±65504 | Fast math, but overflow risk |
| BF16 | 16 | 8 | 7 | ±3.4e38 | Same range as FP32, less precision |

**BF16 is preferred for training** because it has the same range as FP32, avoiding overflow issues!
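
The mantissa column is where the two 16-bit formats trade places: FP16 keeps more digits, BF16 keeps more range. The following is a minimal sketch, runnable on CPU, that rounds a value just above 1.0 into each format. The variable names are illustrative only.

```python
import torch

# 1 + 2**-10 is the smallest step above 1.0 that FP16 (10 mantissa bits) can hold.
x = torch.tensor(1.0 + 2**-10, dtype=torch.float32)

as_fp16 = x.to(torch.float16).item()
as_bf16 = x.to(torch.bfloat16).item()

print(f"FP32: {x.item():.10f}")
print(f"FP16: {as_fp16:.10f}")   # keeps the increment
print(f"BF16: {as_bf16:.10f}")   # 7 mantissa bits: rounds back to 1.0

# Machine epsilon: the gap between 1.0 and the next representable value
for dtype in (torch.float16, torch.bfloat16):
    print(dtype, torch.finfo(dtype).eps)
```

The increment survives in FP16 but is rounded away in BF16, which is the precision cost paid for the FP32-sized exponent.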

In [None]:
# Demonstrate precision differences
print("=== Precision Demonstration ===")

# A number that's fine in FP32 but problematic in FP16
large_value = torch.tensor(70000.0)
print(f"\nOriginal (FP32): {large_value}")
print(f"As FP16: {large_value.half()} (overflow to inf!)")
print(f"As BF16: {large_value.bfloat16()} (works fine)")

# Small gradients that might underflow
small_value = torch.tensor(1e-6)
print(f"\nSmall value (FP32): {small_value}")
print(f"As FP16: {small_value.half()} (loses precision)")
print(f"As BF16: {small_value.bfloat16()}")

---

## Part 2: Setting Up the Experiment

Let's create a model and dataset to compare FP32 vs mixed precision training.

In [None]:
# Simple ResNet-like model for benchmarking
class SimpleResNet(nn.Module):
    """
    A simplified ResNet-style CNN for CIFAR-10 (plain conv blocks, no skip connections).
    
    Large enough to benefit from mixed precision,
    small enough to train quickly.
    """
    
    def __init__(self, num_classes: int = 10):
        super().__init__()
        
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        
        # Layer 1: 64 channels
        self.layer1 = self._make_layer(64, 64, 2)
        # Layer 2: 128 channels, downsample
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        # Layer 3: 256 channels, downsample
        self.layer3 = self._make_layer(128, 256, 2, stride=2)
        # Layer 4: 512 channels, downsample
        self.layer4 = self._make_layer(256, 512, 2, stride=2)
        
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)
    
    def _make_layer(self, in_ch, out_ch, num_blocks, stride=1):
        layers = []
        # First block may downsample
        layers.append(self._block(in_ch, out_ch, stride))
        # Remaining blocks
        for _ in range(1, num_blocks):
            layers.append(self._block(out_ch, out_ch, 1))
        return nn.Sequential(*layers)
    
    def _block(self, in_ch, out_ch, stride):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
    
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# Test the model
model = SimpleResNet(10)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

x = torch.randn(2, 3, 32, 32)
y = model(x)
print(f"Input: {x.shape} -> Output: {y.shape}")

In [None]:
# Load CIFAR-10
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform
)
testset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform_test
)

# Adaptive batch size based on available GPU memory
if torch.cuda.is_available():
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    if gpu_mem > 100:  # DGX Spark with 128GB
        BATCH_SIZE = 256
    elif gpu_mem > 16:
        BATCH_SIZE = 128
    else:
        BATCH_SIZE = 64
else:
    BATCH_SIZE = 64

trainloader = DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True, 
                         num_workers=4, pin_memory=True)
testloader = DataLoader(testset, batch_size=BATCH_SIZE * 2, shuffle=False,
                        num_workers=4, pin_memory=True)

print(f"Training samples: {len(trainset):,}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Batches per epoch: {len(trainloader)}")

---

## Part 3: Baseline FP32 Training

First, let's establish a baseline with standard FP32 training.

In [None]:
def train_epoch_fp32(model, trainloader, criterion, optimizer, device):
    """
    Standard FP32 training for one epoch.
    
    Returns:
        Tuple of (avg_loss, accuracy, time_seconds)
    """
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    start_time = time.time()
    
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    
    epoch_time = time.time() - start_time
    
    return running_loss / len(trainloader), 100. * correct / total, epoch_time


def evaluate(model, testloader, criterion, device):
    """Evaluate model accuracy."""
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for inputs, labels in testloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            
            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    return running_loss / len(testloader), 100. * correct / total

In [None]:
# Train FP32 baseline
NUM_EPOCHS = 5

model_fp32 = SimpleResNet(10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer_fp32 = optim.SGD(model_fp32.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Track memory
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()

print("=== FP32 Training ===")
fp32_results = {'loss': [], 'acc': [], 'time': []}

for epoch in range(NUM_EPOCHS):
    loss, acc, epoch_time = train_epoch_fp32(
        model_fp32, trainloader, criterion, optimizer_fp32, device
    )
    test_loss, test_acc = evaluate(model_fp32, testloader, criterion, device)
    
    fp32_results['loss'].append(loss)
    fp32_results['acc'].append(test_acc)
    fp32_results['time'].append(epoch_time)
    
    print(f"Epoch {epoch+1}/{NUM_EPOCHS} | "
          f"Loss: {loss:.4f} | Train Acc: {acc:.2f}% | "
          f"Test Acc: {test_acc:.2f}% | Time: {epoch_time:.1f}s")

fp32_memory = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0
fp32_total_time = sum(fp32_results['time'])

print(f"\nFP32 Peak Memory: {fp32_memory:.2f} GB")
print(f"FP32 Total Time: {fp32_total_time:.1f}s")
print(f"FP32 Final Accuracy: {fp32_results['acc'][-1]:.2f}%")

---

## Part 4: Mixed Precision Training with AMP

PyTorch's Automatic Mixed Precision (AMP) makes this easy!

### Key Components:
1. **`autocast`**: Automatically chooses precision for each operation
2. **`GradScaler`**: Scales gradients to prevent underflow in FP16
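
To make the first component concrete, here is a small sketch of what `autocast` does to a single layer. It assumes CPU autocast with BF16 so that it runs without a GPU; on a CUDA machine the same pattern applies with `device_type='cuda'`. The layer and tensor names are made up for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 4)           # parameters are created in FP32
x = torch.randn(2, 8)             # FP32 input

# Inside the context, eligible ops such as linear run in the autocast dtype.
with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    y = layer(x)

print(layer.weight.dtype)  # the stored weights are untouched
print(y.dtype)             # the layer output is in the lower precision
```

The weights stay in FP32 while the output comes back in BF16, which is the "mixed" part of mixed precision.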

In [None]:
def train_epoch_amp(model, trainloader, criterion, optimizer, scaler, device, dtype=torch.float16):
    """
    Mixed precision training for one epoch using AMP.
    
    Args:
        model: Neural network
        trainloader: DataLoader
        criterion: Loss function
        optimizer: Optimizer
        scaler: GradScaler for gradient scaling
        device: Device to train on
        dtype: Precision for autocast (float16 or bfloat16)
    
    Returns:
        Tuple of (avg_loss, accuracy, time_seconds)
    """
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    start_time = time.time()
    
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        
        # Forward pass under autocast: eligible ops run in `dtype`, the rest stay in FP32
        with autocast(device_type='cuda', dtype=dtype):
            outputs = model(inputs)
            loss = criterion(outputs, labels)
        
        # Backward pass with gradient scaling
        scaler.scale(loss).backward()
        
        # Optimizer step with unscaling
        scaler.step(optimizer)
        scaler.update()
        
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    
    epoch_time = time.time() - start_time
    
    return running_loss / len(trainloader), 100. * correct / total, epoch_time

In [None]:
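# Train with FP16 mixed precision
# Note: model_fp32 and its optimizer state are still resident on the GPU,
# so they count toward the peak memory measured for this run.
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()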

model_fp16 = SimpleResNet(10).to(device)
optimizer_fp16 = optim.SGD(model_fp16.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# FP16 requires gradient scaling to prevent underflow
scaler_fp16 = GradScaler('cuda')

print("=== FP16 Mixed Precision Training ===")
fp16_results = {'loss': [], 'acc': [], 'time': []}

for epoch in range(NUM_EPOCHS):
    loss, acc, epoch_time = train_epoch_amp(
        model_fp16, trainloader, criterion, optimizer_fp16, scaler_fp16, device,
        dtype=torch.float16
    )
    test_loss, test_acc = evaluate(model_fp16, testloader, criterion, device)
    
    fp16_results['loss'].append(loss)
    fp16_results['acc'].append(test_acc)
    fp16_results['time'].append(epoch_time)
    
    print(f"Epoch {epoch+1}/{NUM_EPOCHS} | "
          f"Loss: {loss:.4f} | Train Acc: {acc:.2f}% | "
          f"Test Acc: {test_acc:.2f}% | Time: {epoch_time:.1f}s")

fp16_memory = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0
fp16_total_time = sum(fp16_results['time'])

print(f"\nFP16 Peak Memory: {fp16_memory:.2f} GB")
print(f"FP16 Total Time: {fp16_total_time:.1f}s")
print(f"FP16 Final Accuracy: {fp16_results['acc'][-1]:.2f}%")

In [None]:
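# Train with BF16 mixed precision (preferred on modern hardware)
# Note: the FP32 and FP16 models from earlier cells are still resident on the GPU,
# so they count toward the peak memory measured for this run.
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()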

model_bf16 = SimpleResNet(10).to(device)
optimizer_bf16 = optim.SGD(model_bf16.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# BF16 typically doesn't need gradient scaling due to larger dynamic range
scaler_bf16 = GradScaler('cuda', enabled=False)  # Disable for BF16

print("=== BF16 Mixed Precision Training ===")
bf16_results = {'loss': [], 'acc': [], 'time': []}

for epoch in range(NUM_EPOCHS):
    loss, acc, epoch_time = train_epoch_amp(
        model_bf16, trainloader, criterion, optimizer_bf16, scaler_bf16, device,
        dtype=torch.bfloat16
    )
    test_loss, test_acc = evaluate(model_bf16, testloader, criterion, device)
    
    bf16_results['loss'].append(loss)
    bf16_results['acc'].append(test_acc)
    bf16_results['time'].append(epoch_time)
    
    print(f"Epoch {epoch+1}/{NUM_EPOCHS} | "
          f"Loss: {loss:.4f} | Train Acc: {acc:.2f}% | "
          f"Test Acc: {test_acc:.2f}% | Time: {epoch_time:.1f}s")

bf16_memory = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0
bf16_total_time = sum(bf16_results['time'])

print(f"\nBF16 Peak Memory: {bf16_memory:.2f} GB")
print(f"BF16 Total Time: {bf16_total_time:.1f}s")
print(f"BF16 Final Accuracy: {bf16_results['acc'][-1]:.2f}%")

---

## Part 5: Results Comparison

In [None]:
# Summary comparison
print("=" * 70)
print("RESULTS SUMMARY")
print("=" * 70)
print(f"{'Metric':<25} {'FP32':>12} {'FP16':>12} {'BF16':>12}")
print("-" * 70)
print(f"{'Peak Memory (GB)':<25} {fp32_memory:>12.2f} {fp16_memory:>12.2f} {bf16_memory:>12.2f}")
print(f"{'Training Time (s)':<25} {fp32_total_time:>12.1f} {fp16_total_time:>12.1f} {bf16_total_time:>12.1f}")
print(f"{'Final Test Accuracy (%)':<25} {fp32_results['acc'][-1]:>12.2f} {fp16_results['acc'][-1]:>12.2f} {bf16_results['acc'][-1]:>12.2f}")
print("-" * 70)
print(f"{'Memory Savings vs FP32':<25} {'-':>12} {(1-fp16_memory/fp32_memory)*100:>11.1f}% {(1-bf16_memory/fp32_memory)*100:>11.1f}%")
print(f"{'Speedup vs FP32':<25} {'-':>12} {fp32_total_time/fp16_total_time:>11.2f}x {fp32_total_time/bf16_total_time:>11.2f}x")
print("=" * 70)

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

epochs = range(1, NUM_EPOCHS + 1)

# Loss comparison
ax1 = axes[0]
ax1.plot(epochs, fp32_results['loss'], 'b-o', label='FP32')
ax1.plot(epochs, fp16_results['loss'], 'r-s', label='FP16')
ax1.plot(epochs, bf16_results['loss'], 'g-^', label='BF16')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Training Loss')
ax1.set_title('Training Loss Comparison')
ax1.legend()
ax1.grid(True)

# Accuracy comparison
ax2 = axes[1]
ax2.plot(epochs, fp32_results['acc'], 'b-o', label='FP32')
ax2.plot(epochs, fp16_results['acc'], 'r-s', label='FP16')
ax2.plot(epochs, bf16_results['acc'], 'g-^', label='BF16')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Test Accuracy (%)')
ax2.set_title('Test Accuracy Comparison')
ax2.legend()
ax2.grid(True)

# Time per epoch
ax3 = axes[2]
ax3.bar(['FP32', 'FP16', 'BF16'], 
        [np.mean(fp32_results['time']), np.mean(fp16_results['time']), np.mean(bf16_results['time'])],
        color=['blue', 'red', 'green'])
ax3.set_ylabel('Time per Epoch (s)')
ax3.set_title('Training Speed Comparison')

plt.tight_layout()
plt.show()

---

## Part 6: Understanding GradScaler

The `GradScaler` is crucial for FP16 training. Let's understand why.

In [None]:
# Demonstrate gradient underflow problem
print("=== Gradient Underflow Demonstration ===")

# Simulate small gradients (common in deep networks)
small_grad = torch.tensor(1e-5)
print(f"Original gradient (FP32): {small_grad}")
print(f"In FP16: {small_grad.half()}")

# After several multiplications (backprop through many layers)
tiny_grad = small_grad ** 3
print(f"\nAfter multiplications (FP32): {tiny_grad}")
print(f"In FP16: {tiny_grad.half()} (underflow to 0!)")

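# Solution: scale up before the FP16 math, then unscale in FP32
scale = 65536.0  # 2^16
scaled_grad = small_grad * scale
# Cast back to FP32 before dividing; unscaling in FP16 would underflow to 0 again
scaled_result = (scaled_grad.half() ** 3).float() / (scale ** 3)
print(f"\nWith scaling: {scaled_result} (preserved!)")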

In [None]:
# How GradScaler works internally
print("=== GradScaler Internal Mechanics ===")

scaler = GradScaler(init_scale=65536.0, growth_interval=2000)

print(f"Initial scale: {scaler.get_scale()}")
print(f"Growth factor: {scaler.get_growth_factor()}")
print(f"Backoff factor: {scaler.get_backoff_factor()}")

# The scaler:
# 1. Multiplies loss by scale before backward()
# 2. Divides gradients by scale before optimizer.step()
# 3. Increases scale if no inf/nan for growth_interval steps
# 4. Decreases scale if inf/nan detected (and skips that step)
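
The grow/back-off schedule in steps 3 and 4 can be illustrated without a GPU. The sketch below is a simplified pure-Python model of that policy, not PyTorch's implementation; `ToyLossScale` and its fields are hypothetical names, and the defaults mirror the values printed above.

```python
from dataclasses import dataclass


@dataclass
class ToyLossScale:
    """Simplified model of a dynamic loss-scale schedule (illustrative only)."""
    scale: float = 65536.0
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000
    _clean_steps: int = 0

    def update(self, found_inf: bool) -> bool:
        """Adjust the scale after one step; return True if the step is skipped."""
        if found_inf:
            # Overflow: shrink the scale and restart the clean-step counter
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return True
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # Enough consecutive clean steps: try a larger scale
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return False


toy = ToyLossScale(growth_interval=3)
history = []
for found_inf in [False, False, False, True, False]:
    skipped = toy.update(found_inf)
    history.append((toy.scale, skipped))
print(history)
```

With `growth_interval=3`, the scale doubles after the third clean step and halves again on the simulated overflow, which is also the step that gets skipped.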

---

## ✋ Try It Yourself: Exercise

Implement a training loop that **automatically falls back to FP32** if too many inf/nan gradients are detected.

**Requirements:**
1. Track the number of skipped steps (inf/nan)
2. If >10% of steps are skipped, switch to FP32
3. Log when the fallback occurs

<details>
<summary>💡 Hint</summary>

Use `scaler.get_scale()` before and after `scaler.update()` - if the scale decreased, a step was skipped due to inf/nan.
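
One possible shape for the bookkeeping is sketched below, leaving the training loop itself to you. `SkipTracker`, `min_steps`, and the fake scale values are hypothetical and only illustrate the counting logic.

```python
class SkipTracker:
    """Counts steps whose loss scale dropped, i.e. steps the scaler skipped."""

    def __init__(self, max_skip_ratio: float = 0.1, min_steps: int = 20):
        self.max_skip_ratio = max_skip_ratio
        self.min_steps = min_steps   # hypothetical warm-up before judging the ratio
        self.steps = 0
        self.skipped = 0

    def record(self, scale_before: float, scale_after: float) -> None:
        """Call once per step with the scale read before and after update()."""
        self.steps += 1
        if scale_after < scale_before:
            self.skipped += 1

    def should_fall_back(self) -> bool:
        if self.steps < self.min_steps:
            return False
        return self.skipped / self.steps > self.max_skip_ratio


tracker = SkipTracker(max_skip_ratio=0.1, min_steps=10)
for i in range(10):
    before = 1024.0
    after = 512.0 if i % 4 == 0 else 1024.0   # pretend steps 0, 4, 8 overflowed
    tracker.record(before, after)
print(tracker.skipped, tracker.should_fall_back())
```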

</details>

In [None]:
# YOUR CODE HERE: Implement adaptive precision training
def train_with_fallback(model, trainloader, criterion, optimizer, device, max_skip_ratio=0.1):
    """
    Train with automatic fallback from FP16 to FP32 if unstable.
    """
    # TODO: Implement
    pass

# Test your implementation
# train_with_fallback(model, trainloader, criterion, optimizer, device)

---

## Common Mistakes

### Mistake 1: Using autocast in the wrong scope

```python
# ❌ Wrong - backward() runs inside the autocast context
with autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)
    loss.backward()  # This should be OUTSIDE autocast!

# ✅ Right - backward outside autocast
with autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()  # Outside autocast
```

### Mistake 2: Forgetting to use scaler for FP16

```python
# ❌ Wrong - gradients may underflow
with autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)
loss.backward()
optimizer.step()

# ✅ Right - use GradScaler
scaler = GradScaler()
with autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

### Mistake 3: Not disabling GradScaler for BF16

```python
# ❌ Unnecessary - BF16 has same range as FP32
scaler = GradScaler()  # Enabled by default

# ✅ Better - disable for BF16
scaler = GradScaler(enabled=False)  # BF16 doesn't need scaling
```
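
One way to avoid hard-coding this choice is to pick the dtype and the scaler flag together. The helper below is a hedged sketch: `pick_amp_config` is a hypothetical name, and the FP16 fallback for machines without BF16 support is an assumption, not something the lab prescribes.

```python
import torch


def pick_amp_config():
    """Return (autocast dtype, whether to enable GradScaler) for this machine."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16, False   # wide exponent range: scaling not needed
    return torch.float16, True         # narrow range: keep loss scaling on


amp_dtype, use_scaler = pick_amp_config()
print(amp_dtype, use_scaler)
# The flag can then be passed on, e.g. GradScaler('cuda', enabled=use_scaler)
```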

### Mistake 4: Using the deprecated import path (recent PyTorch releases)

```python
# ❌ Deprecated - will show warning
from torch.cuda.amp import autocast, GradScaler

# ✅ Correct - device-agnostic torch.amp path
from torch.amp import autocast, GradScaler
```

---

## Checkpoint

You've learned:
- ✅ The differences between FP32, FP16, and BF16
- ✅ How to use `autocast` and `GradScaler`
- ✅ Memory and speed benefits of mixed precision
- ✅ Why gradient scaling is needed for FP16
- ✅ BF16 is often the best choice (same range as FP32, no scaling needed)

---

## DGX Spark Recommendations

On your DGX Spark with Blackwell GPU:

1. **Use BF16 by default** - Blackwell has excellent BF16 support
2. **Try NVFP4 for inference** - Blackwell supports native 4-bit compute!
3. **Increase batch size** - With 128GB unified memory, you can go much larger
4. **Monitor memory** - Use the saved memory for larger models
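
For the memory-monitoring point, a small context manager can wrap the peak-memory calls used earlier in this lab. This is a sketch under assumptions: `track_peak_memory` and the toy matmul workload are hypothetical, and the helper records 0.0 when no GPU is present.

```python
import contextlib
import torch


@contextlib.contextmanager
def track_peak_memory(label: str, results: dict):
    """Record peak CUDA memory (GB) allocated inside the block; 0.0 without a GPU."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
    try:
        yield
    finally:
        peak = torch.cuda.max_memory_allocated() / 1e9 if use_cuda else 0.0
        results[label] = peak


peaks = {}
with track_peak_memory("toy_matmul", peaks):
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(256, 256, device=dev)
    b = a @ a
print(peaks)
```

Tensors allocated before the block still count toward the peak, so free earlier models first if you want an isolated measurement.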

---

## Further Reading

- [PyTorch AMP Documentation](https://pytorch.org/docs/stable/amp.html)
- [Mixed Precision Training (NVIDIA)](https://developer.nvidia.com/automatic-mixed-precision)
- [The bfloat16 numerical format (Google Cloud TPU documentation)](https://cloud.google.com/tpu/docs/bfloat16)

In [None]:
# Cleanup
import gc

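del model_fp32, model_fp16, model_bf16
del optimizer_fp32, optimizer_fp16, optimizer_bf16  # optimizers hold parameter references
gc.collect()  # drop Python references before releasing cached GPU blocks

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU Memory after cleanup: {torch.cuda.memory_allocated()/1e9:.2f} GB")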