# Lab 1.5.6: GPU Acceleration on DGX Spark

**Module:** 1.5 - Neural Network Fundamentals  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Port your NumPy MLP to PyTorch
- [ ] Measure CPU vs GPU training times
- [ ] Understand the speedup from GPU acceleration
- [ ] Find optimal batch sizes for DGX Spark's 128GB unified memory
- [ ] Use mixed precision training with torch.cuda.amp
- [ ] Appreciate why GPUs revolutionized deep learning

---

## 📚 Prerequisites

- Completed: Labs 1.5.1-1.5.5
- Environment: DGX Spark with PyTorch NGC container

---

## 🌍 Real-World Context

**Why GPUs transformed AI:**

Training a large network on ImageNet with CPUs alone can take weeks; on modern GPUs the same job takes hours. Speedups of this order (often cited as 100x or more, depending on the model and hardware) are what made the deep learning revolution practical.

**Your DGX Spark advantage:**
- **128GB unified memory**: No CPU↔GPU transfer bottleneck
- **192 Tensor Cores**: Hardware-accelerated matrix operations
- **1 PFLOP NVFP4**: Native low-precision inference

---

## 🧒 ELI5: Why GPUs Are Faster

> **Imagine you need to add up 1000 numbers.**
>
> **CPU approach (like one really smart mathematician):**
> - Add numbers one by one: 1+2=3, 3+3=6, 6+4=10...
> - Very fast at each addition, but does them sequentially
> - 1000 operations total
>
> **GPU approach (like 500 average calculators working together):**
> - Split into pairs: (1+2), (3+4), (5+6)...
> - Each pair adds simultaneously
> - 500 results → pair those → 250 results → ... → 1 result
> - Only ~10 rounds total!
>
> Neural networks are mostly matrix multiplications, which are **embarrassingly parallel** - perfect for GPUs!
>
> **DGX Spark's special power:** Unified memory means the CPU and GPU share the same 128GB RAM - no time wasted copying data back and forth!
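
The pairing scheme in the analogy can be sketched in a few lines of plain Python. This is only an illustration of the idea: `tree_sum` is a hypothetical helper, it runs sequentially on the CPU, and it simply counts how many rounds a machine with enough parallel adders would need.

```python
def tree_sum(values):
    """Sum by repeatedly adding neighbouring pairs; returns (total, rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Every pair in a round is independent, so parallel hardware
        # could compute all of these additions at the same time.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element carries over unchanged
        level = paired
        rounds += 1
    return level[0], rounds


total, rounds = tree_sum(range(1, 1001))
print(f"sum = {total}, rounds = {rounds}")
```

For the numbers 1 to 1000 this needs 10 rounds, compared with 999 sequential additions.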

---

## Setup

**Important:** This notebook requires PyTorch. On DGX Spark, use the NGC container:

```bash
docker run --gpus all -it --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    -p 8888:8888 \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    jupyter lab --ip=0.0.0.0 --allow-root --no-browser
```

### Understanding the Docker Flags

| Flag | Purpose |
|------|---------|
| `--gpus all` | **Required!** Enables GPU access inside the container |
| `-it` | Interactive terminal with TTY |
| `--rm` | Clean up container on exit |
| `-v $HOME/workspace:/workspace` | Mounts your workspace directory |
| `-v $HOME/.cache/huggingface:/root/.cache/huggingface` | Mounts Hugging Face cache for model downloads |
| `--ipc=host` | **Required for DataLoader workers!** PyTorch DataLoader uses shared memory for inter-process communication. Without this flag, you'll get errors when using `num_workers > 0` |
| `-p 8888:8888` | **Required for Jupyter!** Maps container port 8888 to host port 8888 so you can access Jupyter Lab in your browser |
| `nvcr.io/nvidia/pytorch:25.11-py3` | NGC container with PyTorch optimized for ARM64 |

**Why `--ipc=host` matters:** PyTorch DataLoader creates worker processes that share data through shared memory (`/dev/shm`). By default, Docker containers have a small shared memory limit (64MB). The `--ipc=host` flag shares the host's IPC namespace, giving you access to the full shared memory space.
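
To see how much shared memory the container actually has, a small standard-library check like the one below can be run inside it. This is a sketch, not part of the lab code: `shm_size_mb` is a hypothetical helper name, and the `/dev/shm` path assumes a Linux container.

```python
import os
import shutil


def shm_size_mb(path="/dev/shm"):
    """Return the total size of the shared-memory mount in MiB, or None if absent."""
    if not os.path.isdir(path):
        return None  # e.g. not running on Linux
    return shutil.disk_usage(path).total / (1024 ** 2)


size = shm_size_mb()
if size is None:
    print("/dev/shm not found")
else:
    print(f"/dev/shm size: {size:.0f} MiB")
```

A value close to 64 MiB suggests the container is still running with Docker's default limit rather than the host's IPC namespace.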

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import time
import sys
import os

# Check if PyTorch is available
try:
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, TensorDataset
    PYTORCH_AVAILABLE = True
    print(f"✅ PyTorch version: {torch.__version__}")
except ImportError:
    PYTORCH_AVAILABLE = False
    print("❌ PyTorch not available. Please use NGC container.")

# Check GPU availability
if PYTORCH_AVAILABLE:
    if torch.cuda.is_available():
        print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("⚠️ CUDA not available. GPU comparisons will be skipped.")

np.random.seed(42)
if PYTORCH_AVAILABLE:
    torch.manual_seed(42)
    
%matplotlib inline

---

## Part 1: Loading Data

In [None]:
# Load MNIST
import gzip
import urllib.request

def load_mnist(path='../data'):
    os.makedirs(path, exist_ok=True)
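    # Note: yann.lecun.com often refuses automated downloads. If this fails,
    # a commonly used mirror is 'https://ossci-datasets.s3.amazonaws.com/mnist/'.
    base_url = 'http://yann.lecun.com/exdb/mnist/'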
    files = {
        'train_images': 'train-images-idx3-ubyte.gz',
        'train_labels': 'train-labels-idx1-ubyte.gz',
        'test_images': 't10k-images-idx3-ubyte.gz',
        'test_labels': 't10k-labels-idx1-ubyte.gz'
    }
    
    def download(filename):
        filepath = os.path.join(path, filename)
        if not os.path.exists(filepath):
            print(f"Downloading {filename}...")
            urllib.request.urlretrieve(base_url + filename, filepath)
        return filepath
    
    def load_images(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(16)
            return np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, 784).astype(np.float32) / 255.0
    
    def load_labels(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(8)
            return np.frombuffer(f.read(), dtype=np.uint8)
    
    return (load_images(download(files['train_images'])),
            load_labels(download(files['train_labels'])),
            load_images(download(files['test_images'])),
            load_labels(download(files['test_labels'])))

X_train_np, y_train_np, X_test_np, y_test_np = load_mnist()
print(f"Loaded {len(X_train_np)} training samples")

---

## Part 2: NumPy Implementation (CPU Baseline)

First, let's establish our CPU baseline using the NumPy implementation from earlier.
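
The NumPy class in this part relies on two standard results: subtracting the maximum logit before `exp` keeps softmax from overflowing, and the gradient of softmax followed by cross-entropy with respect to the logits is `probs - one_hot(target)`. The following pure-Python sketch checks both on a single sample; the helper names and the example logits are illustrative assumptions, not part of the lab code.

```python
import math


def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


logits, target = [2.0, -1.0, 0.5], 0

# Analytic gradient w.r.t. the logits: probs - one_hot(target)
analytic = softmax(logits)
analytic[target] -= 1.0

# Central finite differences as an independent check
eps = 1e-6
numeric = []
for i in range(len(logits)):
    plus = list(logits)
    minus = list(logits)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((cross_entropy(plus, target) - cross_entropy(minus, target)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_err:.2e}")

# Without the max shift, math.exp(1000.0) would raise OverflowError.
big = softmax([1000.0, 1001.0, 1002.0])
print(f"softmax of large logits sums to {sum(big):.6f}")
```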

In [None]:
class NumPyMLP:
    """
    Our NumPy MLP from Notebook 01 - runs on CPU.
    """
    
    def __init__(self, layer_sizes):
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            W = np.random.randn(layer_sizes[i], layer_sizes[i + 1]).astype(np.float32) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros(layer_sizes[i + 1], dtype=np.float32)
            self.layers.append({'W': W, 'b': b, 'cache': {}})
    
    def forward(self, X):
        out = X
        for i, layer in enumerate(self.layers[:-1]):
            layer['cache']['X'] = out
            out = out @ layer['W'] + layer['b']
            layer['cache']['Z'] = out
            out = np.maximum(0, out)  # ReLU
        
        self.layers[-1]['cache']['X'] = out
        out = out @ self.layers[-1]['W'] + self.layers[-1]['b']
        
        # Softmax
        out_shifted = out - np.max(out, axis=1, keepdims=True)
        exp_out = np.exp(out_shifted)
        self.probs = exp_out / np.sum(exp_out, axis=1, keepdims=True)
        return self.probs
    
    def backward(self, targets, lr):
        batch_size = len(targets)
        grad = self.probs.copy()
        grad[np.arange(batch_size), targets] -= 1
        
        for i in range(len(self.layers) - 1, -1, -1):
            layer = self.layers[i]
            X = layer['cache']['X']
            
            dW = X.T @ grad / batch_size
            db = np.mean(grad, axis=0)
            grad = grad @ layer['W'].T
            
            if i > 0:
                Z = self.layers[i - 1]['cache']['Z']
                grad = grad * (Z > 0)
            
            layer['W'] -= lr * dW
            layer['b'] -= lr * db
    
    def predict(self, X):
        return np.argmax(self.forward(X), axis=1)

In [None]:
def train_numpy(model, X_train, y_train, epochs, batch_size, lr):
    """Train NumPy model and return time."""
    start_time = time.time()
    
    for epoch in range(epochs):
        indices = np.random.permutation(len(X_train))
        for start in range(0, len(X_train), batch_size):
            batch_idx = indices[start:start + batch_size]
            X_batch = X_train[batch_idx]
            y_batch = y_train[batch_idx]
            
            model.forward(X_batch)
            model.backward(y_batch, lr)
    
    elapsed = time.time() - start_time
    return elapsed

---

## Part 3: PyTorch Implementation

Now let's create the same model in PyTorch, which can run on CPU or GPU.
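
A quick way to confirm that two implementations describe the same network is to compare parameter counts. The sketch below computes the count by hand for the `[784, 256, 128, 10]` architecture used later in this lab; `count_parameters` is a hypothetical helper, not part of the lab code.

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


n_params = count_parameters([784, 256, 128, 10])
print(f"{n_params:,} parameters, {n_params * 4 / 1e6:.2f} MB in float32")
```

Once the PyTorch model exists, `sum(p.numel() for p in model.parameters())` should report the same number.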

In [None]:
if PYTORCH_AVAILABLE:
    class PyTorchMLP(nn.Module):
        """
        Same architecture as NumPyMLP, but in PyTorch.
        
        PyTorch handles:
        - Automatic differentiation (no manual backward!)
        - GPU acceleration (just move to device)
        - Optimized kernels (cuBLAS/cuDNN)
        """
        
        def __init__(self, layer_sizes):
            super().__init__()
            
            layers = []
            for i in range(len(layer_sizes) - 1):
                layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
                if i < len(layer_sizes) - 2:
                    layers.append(nn.ReLU())
            
            self.model = nn.Sequential(*layers)
            
            # Initialize weights like NumPy version (He initialization)
            for m in self.modules():
                if isinstance(m, nn.Linear):
                    nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                    nn.init.zeros_(m.bias)
        
        def forward(self, x):
            return self.model(x)
    
    print("✅ PyTorch MLP class defined")

In [None]:
if PYTORCH_AVAILABLE:
    def train_pytorch(model, train_loader, epochs, lr, device):
        """Train PyTorch model and return time."""
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)
        
        # Warmup (important for GPU timing!)
        if device.type == 'cuda':
            for X_batch, y_batch in train_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                optimizer.zero_grad()
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()
                break
            torch.cuda.synchronize()
        
        start_time = time.time()
        
        for epoch in range(epochs):
            for X_batch, y_batch in train_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                
                optimizer.zero_grad()
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()
        
        # Ensure GPU operations are complete
        if device.type == 'cuda':
            torch.cuda.synchronize()
        
        elapsed = time.time() - start_time
        return elapsed
    
    print("✅ Training function defined")

---

## Part 4: CPU vs GPU Comparison

Let's measure the speedup!
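
GPU kernels are launched asynchronously, so a fair timing needs a warmup run and a synchronization point before each clock read. The helper below is a generic sketch of that pattern using only the standard library; `time_fn` and its `sync` parameter are hypothetical names, and on a GPU you would pass `torch.cuda.synchronize` as `sync`.

```python
import statistics
import time


def time_fn(fn, warmup=1, repeats=5, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    sync: optional zero-argument callable run before each clock read,
          e.g. torch.cuda.synchronize when fn launches GPU work.
    """
    for _ in range(warmup):
        fn()  # warmup runs are not timed
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


elapsed = time_fn(lambda: sum(i * i for i in range(100_000)))
print(f"median: {elapsed * 1e3:.2f} ms")
```

Taking the median of several repeats makes the result less sensitive to one-off stalls than a single measurement.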

In [None]:
# Parameters
EPOCHS = 3
BATCH_SIZE = 64
LR = 0.1
ARCHITECTURE = [784, 256, 128, 10]

print("🏎️ CPU vs GPU Comparison")
print("=" * 60)
print(f"Architecture: {ARCHITECTURE}")
print(f"Epochs: {EPOCHS}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Training samples: {len(X_train_np)}")
print("=" * 60)

In [None]:
# 1. NumPy (CPU)
print("\n1️⃣ NumPy (CPU)...")
np.random.seed(42)
model_numpy = NumPyMLP(ARCHITECTURE)
time_numpy = train_numpy(model_numpy, X_train_np, y_train_np, EPOCHS, BATCH_SIZE, LR)
acc_numpy = np.mean(model_numpy.predict(X_test_np) == y_test_np)
print(f"   Time: {time_numpy:.2f}s")
print(f"   Accuracy: {acc_numpy:.2%}")

In [None]:
if PYTORCH_AVAILABLE:
    # Prepare PyTorch data
    X_train_torch = torch.FloatTensor(X_train_np)
    y_train_torch = torch.LongTensor(y_train_np)
    X_test_torch = torch.FloatTensor(X_test_np)
    y_test_torch = torch.LongTensor(y_test_np)
    
    train_dataset = TensorDataset(X_train_torch, y_train_torch)
    
    # DataLoader options explained:
    # - num_workers: Number of subprocesses for data loading (requires --ipc=host in Docker)
    #   Use 0 for simple datasets, 2-4 for larger datasets with transforms
    # - pin_memory: Copies data to CUDA pinned memory for faster CPU→GPU transfer
    #   Less important on DGX Spark due to unified memory architecture
    train_loader = DataLoader(
        train_dataset, 
        batch_size=BATCH_SIZE, 
        shuffle=True,
        # num_workers=2,    # Uncomment for faster loading (requires --ipc=host)
        # pin_memory=True,  # Uncomment for faster GPU transfer (less important on DGX Spark)
    )
    
    # 2. PyTorch (CPU)
    print("\n2️⃣ PyTorch (CPU)...")
    torch.manual_seed(42)
    model_cpu = PyTorchMLP(ARCHITECTURE)
    device_cpu = torch.device('cpu')
    time_pytorch_cpu = train_pytorch(model_cpu, train_loader, EPOCHS, LR, device_cpu)
    
    model_cpu.eval()
    with torch.no_grad():
        preds_cpu = model_cpu(X_test_torch).argmax(dim=1)
        acc_pytorch_cpu = (preds_cpu == y_test_torch).float().mean().item()
    
    print(f"   Time: {time_pytorch_cpu:.2f}s")
    print(f"   Accuracy: {acc_pytorch_cpu:.2%}")

In [None]:
if PYTORCH_AVAILABLE and torch.cuda.is_available():
    # 3. PyTorch (GPU)
    print("\n3️⃣ PyTorch (GPU)...")
    torch.manual_seed(42)
    model_gpu = PyTorchMLP(ARCHITECTURE)
    device_gpu = torch.device('cuda')
    time_pytorch_gpu = train_pytorch(model_gpu, train_loader, EPOCHS, LR, device_gpu)
    
    model_gpu.eval()
    with torch.no_grad():
        preds_gpu = model_gpu(X_test_torch.to(device_gpu)).argmax(dim=1).cpu()
        acc_pytorch_gpu = (preds_gpu == y_test_torch).float().mean().item()
    
    print(f"   Time: {time_pytorch_gpu:.2f}s")
    print(f"   Accuracy: {acc_pytorch_gpu:.2%}")
else:
    print("\n3️⃣ PyTorch (GPU)... SKIPPED (no GPU available)")
    time_pytorch_gpu = None
    acc_pytorch_gpu = None

In [None]:
# Summary
print("\n" + "=" * 60)
print("                         RESULTS SUMMARY")
print("=" * 60)

# The speedup column is 18 wide so that '1.0x (baseline)' does not break alignment
print(f"\n{'Method':<20} {'Time (s)':<12} {'Speedup':<18} {'Accuracy'}")
print("-" * 60)
print(f"{'NumPy (CPU)':<20} {time_numpy:<12.2f} {'1.0x (baseline)':<18} {acc_numpy:.2%}")

if PYTORCH_AVAILABLE:
    speedup_cpu = time_numpy / time_pytorch_cpu
    speedup_cpu_str = f"{speedup_cpu:.1f}x"
    print(f"{'PyTorch (CPU)':<20} {time_pytorch_cpu:<12.2f} {speedup_cpu_str:<18} {acc_pytorch_cpu:.2%}")
    
    if time_pytorch_gpu is not None:
        speedup_gpu = time_numpy / time_pytorch_gpu
        speedup_gpu_str = f"{speedup_gpu:.1f}x"
        print(f"{'PyTorch (GPU)':<20} {time_pytorch_gpu:<12.2f} {speedup_gpu_str:<18} {acc_pytorch_gpu:.2%}")

print("=" * 60)

In [None]:
# Visualize
if PYTORCH_AVAILABLE:
    methods = ['NumPy (CPU)', 'PyTorch (CPU)']
    times = [time_numpy, time_pytorch_cpu]
    colors = ['#FF6B6B', '#4ECDC4']
    
    if time_pytorch_gpu:
        methods.append('PyTorch (GPU)')
        times.append(time_pytorch_gpu)
        colors.append('#45B7D1')
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Time comparison
    bars = axes[0].bar(methods, times, color=colors, edgecolor='black', linewidth=1.5)
    axes[0].set_ylabel('Time (seconds)', fontsize=12)
    axes[0].set_title('Training Time Comparison', fontsize=14)
    for bar, t in zip(bars, times):
        axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
                     f'{t:.2f}s', ha='center', fontsize=11)
    axes[0].grid(True, alpha=0.3, axis='y')
    
    # Speedup comparison
    speedups = [1.0, time_numpy/time_pytorch_cpu]
    if time_pytorch_gpu:
        speedups.append(time_numpy/time_pytorch_gpu)
    
    bars = axes[1].bar(methods, speedups, color=colors, edgecolor='black', linewidth=1.5)
    axes[1].set_ylabel('Speedup (vs NumPy)', fontsize=12)
    axes[1].set_title('Speedup Factor', fontsize=14)
    axes[1].axhline(y=1, color='red', linestyle='--', label='Baseline')
    for bar, s in zip(bars, speedups):
        axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
                     f'{s:.1f}x', ha='center', fontsize=11)
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()

---

## Part 5: Finding Optimal Batch Size for DGX Spark

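Batch size strongly affects throughput: larger batches mean fewer optimizer steps per epoch and less per-batch overhead. The sweep below measures time per epoch only, so the fastest batch size is not necessarily the one that gives the best accuracy.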
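
Part of the effect is simple arithmetic: each optimizer step carries fixed overhead (kernel launches, Python loop, data loading), and the number of steps per epoch shrinks as the batch grows. The short sketch below tabulates this for the 60,000 MNIST training samples; the variable names are illustrative.

```python
import math

n_samples = 60_000  # MNIST training set size
batch_sizes = [32, 64, 128, 256, 512, 1024, 2048]

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = {bs: math.ceil(n_samples / bs) for bs in batch_sizes}

for bs, steps in steps_per_epoch.items():
    print(f"batch size {bs:4d}: {steps:4d} optimizer steps per epoch")
```

Going from a batch size of 32 to 2048 cuts the step count from 1875 to 30, which is why small batches tend to leave the GPU underutilized on a model this small.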

In [None]:
if PYTORCH_AVAILABLE and torch.cuda.is_available():
    print("🔬 Finding Optimal Batch Size for DGX Spark")
    print("=" * 60)
    
    batch_sizes = [32, 64, 128, 256, 512, 1024, 2048]
    gpu_times = []
    
    for batch_size in batch_sizes:
        try:
            torch.cuda.empty_cache()
            torch.manual_seed(42)
            
            # Use a separate name so the train_loader from Part 4 is not overwritten
            sweep_loader = DataLoader(
                TensorDataset(X_train_torch, y_train_torch), 
                batch_size=batch_size, 
                shuffle=True
            )
            
            model = PyTorchMLP(ARCHITECTURE)
            t = train_pytorch(model, sweep_loader, epochs=1, lr=0.1, device=torch.device('cuda'))
            gpu_times.append(t)
            
            print(f"Batch size {batch_size:4d}: {t:.3f}s")
        except RuntimeError as e:
            if 'out of memory' in str(e):
                print(f"Batch size {batch_size:4d}: OUT OF MEMORY")
                gpu_times.append(None)
            else:
                raise
    
    # Find optimal
    valid_times = [(bs, t) for bs, t in zip(batch_sizes, gpu_times) if t is not None]
    if valid_times:
        optimal_bs, optimal_time = min(valid_times, key=lambda x: x[1])
        print(f"\n✅ Optimal batch size for speed: {optimal_bs}")
else:
    print("⚠️ GPU not available. Batch size optimization skipped.")
    batch_sizes = []
    gpu_times = []

In [None]:
if batch_sizes and any(t is not None for t in gpu_times):
    # Visualize batch size impact
    fig, ax = plt.subplots(figsize=(10, 5))
    
    valid_bs = [bs for bs, t in zip(batch_sizes, gpu_times) if t is not None]
    valid_times = [t for t in gpu_times if t is not None]
    
    ax.plot(valid_bs, valid_times, 'bo-', linewidth=2, markersize=10)
    ax.set_xlabel('Batch Size', fontsize=12)
    ax.set_ylabel('Time per Epoch (seconds)', fontsize=12)
    ax.set_title('Batch Size vs Training Speed (GPU)', fontsize=14)
    ax.set_xscale('log', base=2)
    ax.grid(True, alpha=0.3)
    
    # Mark optimal
    min_idx = valid_times.index(min(valid_times))
    ax.scatter([valid_bs[min_idx]], [valid_times[min_idx]], color='red', s=200, zorder=5, label='Optimal')
    ax.legend()
    
    plt.tight_layout()
    plt.show()

---

## Part 6: DGX Spark's Unified Memory Advantage

One of DGX Spark's key features is its **unified memory architecture**. Let's understand why this matters.
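
Whether a model fits in the 128GB pool comes down to parameter count times bytes per parameter. The sketch below does that arithmetic for a 70B-parameter model. It counts weights only, so activations, KV cache, and the operating system's own memory use are ignored; `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


UNIFIED_MEMORY_GB = 128
N_PARAMS = 70e9  # a 70B-parameter model

bytes_per_format = {"FP32": 4, "FP16/BF16": 2, "FP8": 1, "4-bit": 0.5}
for name, nbytes in bytes_per_format.items():
    gb = weight_footprint_gb(N_PARAMS, nbytes)
    verdict = "fits" if gb < UNIFIED_MEMORY_GB else "does not fit"
    print(f"{name:<10} {gb:6.1f} GB  -> {verdict} in {UNIFIED_MEMORY_GB} GB")
```

By this estimate a 70B model exceeds 128GB at 16-bit precision and fits only once it is quantized to 8 bits or fewer.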

In [None]:
print("\n" + "=" * 80)
print("                    DGX SPARK'S UNIFIED MEMORY ADVANTAGE")
print("=" * 80)

print("""
Traditional GPU System:
┌─────────────┐                    ┌─────────────┐
│    CPU      │  ──── PCIe ────>   │    GPU      │
│   32GB RAM  │  <── Transfer ──   │   8GB VRAM  │
└─────────────┘     (SLOW!)        └─────────────┘

Problem: Moving data between CPU and GPU memory is slow!
- PCIe bandwidth: ~16 GB/s (PCIe 3.0 x16)
- A 70B model needs ~140GB in FP16, but the GPU only has 8GB!
- Constant swapping = very slow inference

════════════════════════════════════════════════════════════════════════════════

DGX Spark (Unified Memory):
┌───────────────────────────────────────────────────────────────────────────────┐
│                      128GB UNIFIED MEMORY (LPDDR5X)                           │
│              (CPU and GPU share the same memory pool!)                        │
│                                                                               │
│     ┌─────────────────┐                    ┌─────────────────┐                │
│     │ Grace CPU       │<── 273 GB/s ───>   │ Blackwell GPU   │                │
│     │ 20 ARM v9.2     │    unified         │ 6,144 CUDA cores│                │
│     │ cores           │    bandwidth       │ 192 Tensor cores│                │
│     └─────────────────┘                    └─────────────────┘                │
└───────────────────────────────────────────────────────────────────────────────┘

DGX Spark Hardware Specifications:
┌────────────────────────┬─────────────────────────────────────────────────────┐
│ Component              │ Specification                                       │
├────────────────────────┼─────────────────────────────────────────────────────┤
│ GPU                    │ NVIDIA Blackwell GB10 Superchip                     │
│ CPU                    │ 20 ARM v9.2 cores (10 Cortex-X925 + 10 Cortex-A725) │
│ Memory                 │ 128GB unified memory (LPDDR5X)                      │
│ Memory Bandwidth       │ 273 GB/s                                            │
│ CUDA Cores             │ 6,144                                               │
│ Tensor Cores           │ 192 (5th generation)                                │
│ NVFP4 Performance      │ 1 PFLOP                                             │
│ FP8 Performance        │ ~209 TFLOPS                                         │
│ BF16 Performance       │ ~100 TFLOPS                                         │
│ Architecture           │ ARM64/aarch64                                       │
└────────────────────────┴─────────────────────────────────────────────────────┘

Advantage:
- No PCIe hop: CPU and GPU address the same physical memory
- A 70B model fits once quantized (~70GB at 8-bit, ~35GB at 4-bit)
- Memory bandwidth: 273 GB/s (LPDDR5X)
- Well suited to large language models

════════════════════════════════════════════════════════════════════════════════

What this means for you:
1. Load models once, no swapping
2. Larger batch sizes possible
3. Run models that wouldn't fit on traditional GPUs
4. Faster iteration during development
""")

---

## Part 7: Best Practices Summary

In [None]:
print("\n" + "=" * 80)
print("                    GPU TRAINING BEST PRACTICES")
print("=" * 80)

print("""
1. USE APPROPRIATE BATCH SIZES
   - Too small: GPU underutilized, slower training
   - Too large: May hurt generalization, memory issues
   - Sweet spot: Usually 64-512 for most tasks
   - DGX Spark: Can go larger (1024+) thanks to unified memory

2. USE PROPER DATA TYPES
   - float32: Default, good balance
   - float16/bfloat16: 2x memory savings, often same accuracy
   - DGX Spark: Native bfloat16 support in Blackwell

3. PIN MEMORY FOR DATA LOADING
   ```python
   DataLoader(..., pin_memory=True)  # Faster CPU→GPU transfer
   ```
   Note: Less important on DGX Spark due to unified memory!

4. OVERLAP DATA LOADING WITH COMPUTATION
   - DataLoader workers (num_workers > 0) prefetch batches on the CPU while the GPU computes
   - With pinned memory, .to(device, non_blocking=True) lets copies overlap with kernels

5. PROFILE YOUR CODE
   ```python
   with torch.profiler.profile() as prof:
       model(x)
   print(prof.key_averages().table())
   ```

6. FREE THE LINUX PAGE CACHE BEFORE LOADING LARGE MODELS
   (the page cache shares the unified memory pool with the GPU)
   ```bash
   sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
   ```
""")

---

## Part 8: Mixed Precision Training with torch.cuda.amp

### What is Mixed Precision?

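Mixed precision training uses a combination of float16 and float32 data types:
- **Forward pass**: Matrix multiplies run in float16 (faster, less memory)
- **Gradients**: Activation gradients flow in float16 during the backward pass
- **Weight updates**: Done in float32 on float32 master weights (maintains precision)

This can give you:
- **Up to ~2x faster training** on Tensor Cores for large, matmul-bound models
- **Roughly half the activation memory**
- **Comparable accuracy** to float32 training in most cases

For a small MLP on MNIST the speedup may be modest or absent, because kernel launch and data loading overhead dominate.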

### How torch.cuda.amp Works

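PyTorch's Automatic Mixed Precision (AMP) provides two key components. In recent PyTorch releases both live under `torch.amp`; the older `torch.cuda.amp` names are deprecated.

1. **`torch.autocast`**: Automatically casts operations to float16 where safe
2. **`GradScaler`**: Scales the loss (and therefore the gradients) to prevent underflow in float16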
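
To see why scaling helps, the sketch below rounds a small gradient value to IEEE half precision with the standard `struct` module, with and without a scale factor. It only illustrates the number format and is not how `GradScaler` is implemented; `to_float16` is a hypothetical helper, and 65536 (2**16) is `GradScaler`'s default initial scale.

```python
import struct


def to_float16(x):
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8        # a small gradient value
scale = 65536.0    # 2**16

unscaled = to_float16(grad)
recovered = to_float16(grad * scale) / scale  # unscale in full precision

print(f"stored directly in float16: {unscaled!r}")
print(f"scaled, stored, unscaled:   {recovered!r}")
```

The unscaled value underflows to zero, while the scaled value survives the round trip with a small relative error.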

In [None]:
if PYTORCH_AVAILABLE and torch.cuda.is_available():
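    # torch.cuda.amp.autocast does not accept device_type; the device-agnostic
    # torch.amp API does (torch.amp.GradScaler requires PyTorch 2.3+)
    from torch.amp import autocast, GradScaler

    print("🔬 Mixed Precision Training Demo")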
    print("=" * 60)

    def train_mixed_precision(model, train_loader, epochs, lr, device):
        """Train with automatic mixed precision."""
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

        # GradScaler helps prevent gradient underflow in float16
        scaler = GradScaler()

        # Warmup
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            # autocast automatically uses float16 where appropriate
            with autocast(device_type='cuda', dtype=torch.float16):
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)

            # Scale loss and backward
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            break
        torch.cuda.synchronize()

        start_time = time.time()

        for epoch in range(epochs):
            for X_batch, y_batch in train_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)

                optimizer.zero_grad()

                # Forward pass with autocast
                with autocast(device_type='cuda', dtype=torch.float16):
                    outputs = model(X_batch)
                    loss = criterion(outputs, y_batch)

                # Backward pass with gradient scaling
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()

        torch.cuda.synchronize()
        return time.time() - start_time

    # Compare FP32 vs Mixed Precision
    torch.manual_seed(42)
    model_fp32 = PyTorchMLP(ARCHITECTURE)
    train_loader_amp = DataLoader(
        TensorDataset(X_train_torch, y_train_torch),
        batch_size=BATCH_SIZE,
        shuffle=True
    )
    time_fp32 = train_pytorch(model_fp32, train_loader_amp, EPOCHS, LR, torch.device('cuda'))

    torch.manual_seed(42)
    model_amp = PyTorchMLP(ARCHITECTURE)
    time_amp = train_mixed_precision(model_amp, train_loader_amp, EPOCHS, LR, torch.device('cuda'))

    print(f"\nResults ({EPOCHS} epochs):")
    print(f"   FP32:            {time_fp32:.3f}s")
    print(f"   Mixed Precision: {time_amp:.3f}s")
    print(f"   Speedup:         {time_fp32/time_amp:.2f}x")

    # Check accuracy
    model_amp.eval()
    with torch.no_grad():
        preds = model_amp(X_test_torch.to('cuda')).argmax(dim=1).cpu()
        acc = (preds == y_test_torch).float().mean().item()
    print(f"   Accuracy:        {acc:.2%}")
else:
    print("⚠️ GPU not available. Mixed precision demo skipped.")

---

## ✋ Try It Yourself

### Exercise 1: Train a Larger Model

Train a model with architecture `[784, 1024, 512, 256, 128, 10]` and compare CPU vs GPU times.

In [None]:
# Your code here: Train larger model

### Exercise 2: Try bfloat16 Mixed Precision

Now that you've seen float16 mixed precision in Part 8, try using **bfloat16** instead.

**Why bfloat16?**
- DGX Spark's Blackwell GPU has native bfloat16 support
- bfloat16 has better numerical stability than float16
- **Bonus:** bfloat16 doesn't require GradScaler!
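
The reason bfloat16 can skip loss scaling is that it keeps float32's 8 exponent bits, so it covers the same range, and gives up mantissa bits instead. The sketch below emulates both formats with the standard `struct` module. It is an approximation under a stated assumption: `to_bfloat16` simply truncates the float32 encoding, whereas real hardware typically rounds to nearest. Both helper names are hypothetical.

```python
import struct


def to_float16(x):
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Assumption: plain truncation; hardware usually rounds to nearest.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


for value in [1e-8, 0.1]:
    print(f"{value!r}: float16 -> {to_float16(value)!r}, "
          f"bfloat16 -> {to_bfloat16(value)!r}")
```

The tiny value survives in bfloat16 but underflows to zero in float16, while 0.1 is represented more accurately in float16: range versus precision.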

<details>
<summary>üí° Hint: bfloat16 simplifies training</summary>

With bfloat16, you can skip the GradScaler entirely:

```python
# No scaler needed with bfloat16!
for X_batch, y_batch in train_loader:
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    
    optimizer.zero_grad()
    
    with autocast(device_type='cuda', dtype=torch.bfloat16):
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
    
    loss.backward()  # No scaler.scale() needed!
    optimizer.step()
```
</details>

**Your task:** Modify the `train_mixed_precision` function to use bfloat16 and compare speed/accuracy with float16.

In [None]:
# Your code here: Implement mixed precision training

---

## 🎉 Checkpoint

You've learned:

- ✅ How to port NumPy code to PyTorch
- ✅ The dramatic speedup from GPU acceleration
- ✅ How to find optimal batch sizes
- ✅ Why DGX Spark's unified memory is special
- ✅ Mixed precision training with torch.cuda.amp (autocast + GradScaler)
- ✅ Best practices for GPU training

---

## 📖 Further Reading

- [PyTorch Performance Tuning Guide](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html)
- [NVIDIA DGX Spark Documentation](https://docs.nvidia.com/dgx/)
- [Mixed Precision Training](https://pytorch.org/docs/stable/amp.html)

---

## 🧹 Cleanup

In [None]:
import gc

if PYTORCH_AVAILABLE and torch.cuda.is_available():
    torch.cuda.empty_cache()

gc.collect()

print("✅ Cleanup complete!")
print("\n🎉 Congratulations! You've completed Module 1.5: Neural Network Fundamentals!")
print("\n🎯 Next: Proceed to Module 1.6: Classical ML Foundations")