# PyTorch Tutorial: Optimization and Tuning

Building a model is just the start. Making it train fast and generalize well requires optimization and tuning. This notebook covers essential techniques for improving model performance.

## Learning Objectives
- Use Learning Rate Schedulers
- Apply Regularization (Dropout, Weight Decay)
- Implement Batch Normalization
- Understand Early Stopping


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

torch.manual_seed(42)

## 1. Learning Rate Schedulers

A constant learning rate is rarely optimal. We often want to start high (to learn fast) and decrease it (to fine-tune).

Common schedulers:
- `StepLR`: Decays LR by gamma every step_size epochs
- `ReduceLROnPlateau`: Decays LR when validation loss stops improving
- `CosineAnnealingLR`: Follows a cosine curve

In [None]:
# Create a dummy model and optimizer
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Setup scheduler: Multiply LR by 0.1 every 5 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

lrs = []
for epoch in range(20):
    optimizer.step()  # Simulate training step
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()  # Update LR

plt.plot(lrs, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('StepLR Scheduler')
plt.grid(True)
plt.show()
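
`ReduceLROnPlateau` differs from the other schedulers listed above in that its `step()` call takes the monitored metric. The sketch below assumes a made-up list of validation losses (`fake_val_losses` is illustrative, not real training output) and shows the LR dropping once the loss stops improving for more than `patience` epochs.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Multiply LR by 0.1 once the metric has not improved for more than 2 epochs
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2
)

# Hypothetical validation losses: improvement, then a plateau
fake_val_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]

lrs = []
for val_loss in fake_val_losses:
    optimizer.step()          # stands in for a real training epoch
    scheduler.step(val_loss)  # pass the monitored metric here
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

In a real loop, `val_loss` would come from your validation pass rather than a fixed list.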

---

## 1.5. Learning Rate Warmup (CRITICAL for Transformers!)

### The Problem
Starting with a high learning rate can cause:
- Unstable training in early epochs
- Divergence (NaN losses)
- Poor final performance

**Especially problematic for:**
- Large models (Transformers, LLMs)
- Large batch sizes
- Adam/AdamW optimizers

### The Solution: Warmup
Start with a very small LR and gradually increase it over the first N steps.

```
Warmup phase (0-1000 steps): LR goes from 0 → target_lr
Main phase (1000+ steps): Use normal schedule (cosine, constant, etc.)
```

### Why It Works
1. **Prevents instability**: Model starts with small updates
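2. **Adaptive optimizers need warmup**: Adam's running estimates of the gradient moments (especially the second moment) are based on very few samples early on, so the per-parameter step sizes are unreliable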
3. **Standard in Transformer training**: BERT, GPT, all modern LLMs use warmup

### FAANG Interview Question
**"Why do we need learning rate warmup for Transformers?"** ← Asked at Google, OpenAI

**Answer:**
1. Adam/AdamW have poor estimates of gradient statistics early in training
2. Large learning rates can cause exploding gradients in early stages
3. Warmup stabilizes training, especially for large models/batches
4. Standard practice: warmup for 5-10% of total training steps

In [None]:
# Learning Rate Warmup Implementation

# Method 1: Manual Warmup Function
def get_lr_with_warmup(step, warmup_steps, base_lr, max_lr):
    """
    Linear warmup from base_lr to max_lr over warmup_steps.
    This is the STANDARD approach in Transformer training!
    """
    if step < warmup_steps:
        # Linear warmup
        return base_lr + (max_lr - base_lr) * step / warmup_steps
    else:
        # After warmup, use max_lr (or apply decay)
        return max_lr

# Visualize warmup schedule
warmup_steps = 1000
total_steps = 10000
base_lr = 0.0
max_lr = 1e-3

lrs = [get_lr_with_warmup(step, warmup_steps, base_lr, max_lr) 
       for step in range(total_steps)]

plt.figure(figsize=(10, 4))
plt.plot(lrs)
plt.axvline(x=warmup_steps, color='r', linestyle='--', label='End of warmup')
plt.xlabel('Training Step')
plt.ylabel('Learning Rate')
plt.title('Linear Warmup Schedule')
plt.legend()
plt.grid(True)
plt.show()

print(f"✓ LR starts at {base_lr} and warms up to {max_lr} over {warmup_steps} steps")
print(f"Then stays constant at {max_lr}")

In [None]:
# Method 2: Warmup + Cosine Decay (The FAANG Standard!)

import math

def get_cosine_schedule_with_warmup(step, warmup_steps, total_steps, max_lr, min_lr=0):
    """
    Warmup + Cosine Annealing.
    A schedule of this shape is reported for GPT-3 and Llama.
    (BERT used linear warmup followed by linear decay instead.)
    """
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    else:
        # Cosine decay
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# Visualize warmup + cosine
warmup_steps = 1000
total_steps = 10000
max_lr = 1e-3
min_lr = 1e-5

lrs = [get_cosine_schedule_with_warmup(step, warmup_steps, total_steps, max_lr, min_lr) 
       for step in range(total_steps)]

plt.figure(figsize=(12, 4))
plt.plot(lrs, linewidth=2)
plt.axvline(x=warmup_steps, color='r', linestyle='--', alpha=0.7, label='End of warmup')
plt.xlabel('Training Step')
plt.ylabel('Learning Rate')
plt.title('Warmup + Cosine Annealing (GPT-3, Llama)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("✓ This is a widely used LR schedule for LLM training!")
print(f"• Warmup: {warmup_steps} steps ({warmup_steps / total_steps:.0%} of training)")
print(f"• Peak LR: {max_lr}")
print(f"• Final LR: {min_lr}")
print(f"• Total steps: {total_steps}")
print("\nReported for: GPT-3, Llama (BERT and RoBERTa use linear decay after warmup)")
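
The functions above only compute LR values. One way to attach such a schedule to an optimizer is `LambdaLR`, which multiplies the optimizer's base LR by a factor you return for each step. This is a sketch under assumptions: the step counts are shortened for speed, and `lr_factor` is a hypothetical helper, not a PyTorch API.

```python
import math
import torch.nn as nn
import torch.optim as optim

warmup_steps, total_steps = 100, 1000   # shortened for illustration
max_lr, min_lr = 1e-3, 1e-5

def lr_factor(step):
    """Hypothetical helper: multiplier applied to the base LR (max_lr)."""
    if step < warmup_steps:
        return step / warmup_steps                      # linear warmup
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1 + math.cos(math.pi * progress))   # goes 1 -> 0
    return (min_lr + (max_lr - min_lr) * cosine) / max_lr

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=max_lr)  # base LR = peak LR
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # a real loop would run forward/backward first
    scheduler.step()   # called once per training step, not per epoch

print(f"start={lrs[0]:.2e}, peak={max(lrs):.2e}, end={lrs[-1]:.2e}")
```

Note that the scheduler is stepped per batch here, which is the usual granularity for warmup.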

## 2. Regularization: Dropout and Weight Decay

**Overfitting** happens when a model memorizes training data but fails on new data. Regularization reduces this risk by constraining what the model can learn.

### Dropout
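Randomly zeros out activations during training (each with probability `p`) and rescales the survivors by `1 / (1 - p)`. At evaluation time (`model.eval()`) dropout is disabled. This forces the network to learn robust features that do not depend on any single neuron.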

In [None]:
class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.dropout = nn.Dropout(p=0.5)  # 50% probability
        self.fc2 = nn.Linear(64, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # Apply dropout
        x = self.fc2(x)
        return x

model = RegularizedNet()
print(model)
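
Dropout behaves differently in training and evaluation mode, which is easy to overlook. A small sketch on a tensor of ones (chosen only so the effect is visible) shows both modes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # each entry is either 0 or 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)    # dropout is a no-op in eval mode

print("train-mode values:", train_out.unique().tolist())
print("fraction dropped: ", (train_out == 0).float().mean().item())
print("eval output unchanged:", torch.equal(eval_out, x))
```

Forgetting to call `model.eval()` before validation is a common source of noisy metrics.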

### Weight Decay (L2 Regularization)
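Penalizes large weights. In PyTorch, this is set through the optimizer's `weight_decay` argument. For plain SGD it is equivalent to adding an L2 penalty to the loss; for Adam the penalty is added to the gradient before the adaptive scaling, which is the motivation for AdamW (see Part 2).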

In [None]:
# Add weight_decay parameter to optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
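
To make the L2 connection concrete, the sketch below takes one plain-SGD step two ways: with `weight_decay` on the optimizer, and with an explicit `0.5 * wd * ||w||^2` term in the loss. The toy loss and the values of `lr` and `wd` are arbitrary choices for illustration.

```python
import torch

lr, wd = 0.1, 0.01
w_init = torch.tensor([1.0, -2.0, 3.0])
x = torch.tensor([0.5, 0.5, 0.5])

# (a) Weight decay handled by the optimizer
w_a = w_init.clone().requires_grad_(True)
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=wd)
loss_a = (w_a * x).sum() ** 2
loss_a.backward()
opt_a.step()

# (b) Explicit L2 penalty added to the loss, no weight_decay
w_b = w_init.clone().requires_grad_(True)
opt_b = torch.optim.SGD([w_b], lr=lr)
loss_b = (w_b * x).sum() ** 2 + 0.5 * wd * (w_b ** 2).sum()
loss_b.backward()
opt_b.step()

print("same update:", torch.allclose(w_a, w_b, atol=1e-6))
```

This equivalence holds for plain SGD; it does not carry over to Adam, where the two approaches give different updates.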

## 3. Batch Normalization

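Normalizes each feature across the batch to have mean 0 and variance 1, then applies a learned scale and shift. During training it uses the current batch's statistics; in evaluation mode it uses running averages collected during training. This stabilizes training and allows higher learning rates.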

In [None]:
class BatchNormNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.bn1 = nn.BatchNorm1d(64)  # Batch Norm for 1D data
        self.fc2 = nn.Linear(64, 1)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)  # Apply BN before activation
        x = torch.relu(x)
        x = self.fc2(x)
        return x

model = BatchNormNet()
print(model)
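
To see what the layer computes in training mode, the sketch below compares `nn.BatchNorm1d` against a manual normalization using the batch mean and the biased batch variance. The input shape and scaling are arbitrary; the learned scale and shift are at their initial values (1 and 0), so they do not change the result here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 3 + 5      # batch of 8 samples, 4 features

bn.train()
out = bn(x).detach()

# Manual version: per-feature statistics over the batch dimension
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)  # BN normalizes with the biased variance
manual = (x - mean) / torch.sqrt(var + bn.eps)

print("matches manual computation:", torch.allclose(out, manual, atol=1e-5))
print("per-feature mean after BN:", out.mean(dim=0))
```

After `bn.eval()`, the same layer would use `bn.running_mean` and `bn.running_var` instead of the batch statistics.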

## 4. Early Stopping

Stop training when validation loss stops improving. This saves time and prevents overfitting.

*(Concept only - usually implemented as a loop check)*

```python
best_loss = float('inf')
patience = 5
counter = 0

for epoch in range(100):
    train(...)
    val_loss = validate(...)
    
    if val_loss < best_loss:
        best_loss = val_loss
        counter = 0
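        # Save only the weights; pickling the whole model ties the file to the class definition
        torch.save(model.state_dict(), 'best_model.pth')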
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping!")
            break
```
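
The loop check above can be wrapped in a small helper so it runs without a real model. This is a sketch: `EarlyStopping` and `fake_val_losses` are hypothetical names, and the losses are made up to show the counter in action.

```python
class EarlyStopping:
    """Hypothetical helper: signals a stop after `patience` epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss   # improvement: reset the counter
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience


# Made-up validation losses: improvement, then a plateau
fake_val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69, 0.68]

stopper = EarlyStopping(patience=3)
stopped_epoch = None
for epoch, val_loss in enumerate(fake_val_losses):
    if stopper.step(val_loss):
        stopped_epoch = epoch
        print(f"Early stopping at epoch {epoch}, best loss {stopper.best_loss}")
        break
```

With `patience=3`, the later improvement to 0.69 is never reached, which illustrates the trade-off: a small patience saves time but can stop too soon.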

## Summary of Part 1

1. **Schedulers**: Adjust learning rate dynamically.
2. **Dropout**: Randomly disable neurons to improve robustness.
3. **Weight Decay**: Penalize large weights to prevent overfitting.
4. **Batch Norm**: Normalize inputs for stable, faster training.
5. **Early Stopping**: Stop when you stop improving.

---

# PART 2: ADVANCED OPTIMIZATION (FAANG-Level)

These techniques are used in production at all major AI companies.

## 5. Advanced Optimizers: Beyond Adam

### The Optimizer Hierarchy

**Basic (classical):**
- SGD: Slow but reliable
- Momentum: Accelerates SGD

**Adaptive (2010s):**
- Adam: Adaptive learning rates per parameter (most popular)
- RMSprop: Similar to Adam, used by DeepMind

**Modern (late 2010s onward):**
- **AdamW**: Adam with decoupled weight decay (used in BERT, GPT)
- **LAMB**: Large batch training (used for BERT pretraining)
- **Lion**: New optimizer from Google (2023)

### AdamW vs Adam: The Critical Difference

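**Adam**: `weight_decay` is added to the gradient (an L2 penalty), so the decay term passes through Adam's adaptive per-parameter scaling. Parameters with large gradient history are decayed less than intended.  
**AdamW**: Weight decay is applied directly to the weights, decoupled from the adaptive gradient update, so every weight shrinks by the same factor `lr * weight_decay`.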

**Impact:** AdamW generalizes better, especially for Transformers.
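
The difference is easiest to see with a zero gradient, where any change to the weight comes from weight decay alone. The sketch below takes a single step with each optimizer on one scalar weight; the setup is contrived for illustration and the helper `one_step` is hypothetical.

```python
import torch

lr, wd = 1e-3, 0.01

def one_step(opt_cls):
    """Hypothetical helper: one optimizer step on w = 1 with a zero gradient."""
    w = torch.ones(1, requires_grad=True)
    opt = opt_cls([w], lr=lr, weight_decay=wd)
    w.grad = torch.zeros_like(w)   # no task gradient; only weight decay acts
    opt.step()
    return w.item()

w_adam = one_step(torch.optim.Adam)
w_adamw = one_step(torch.optim.AdamW)

# Adam: the decay term wd * w becomes the gradient and is normalized by
# Adam's scaling, so the step is roughly lr regardless of wd.
print(f"Adam : w moved by {1 - w_adam:.6f}")
# AdamW: the weight is shrunk by lr * wd, independent of the adaptive scaling.
print(f"AdamW: w moved by {1 - w_adamw:.6f}")
```

In this contrived case Adam moves the weight about 100 times further than AdamW for the same `weight_decay` value, which is why the hyperparameter does not transfer between the two.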

### FAANG Interview Question
**"What's the difference between Adam and AdamW?"** ← Asked at Google, OpenAI

In [None]:
# Comparing Optimizers

model = nn.Linear(10, 1)

# 1. Classic Adam (PyTorch default)
optimizer_adam = optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# 2. AdamW (Decoupled weight decay - BETTER for deep learning!)
optimizer_adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# 3. SGD with Momentum (Still used for CNNs like ResNet)
optimizer_sgd = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

print("Optimizer Comparison:")
print(f"Adam: {optimizer_adam}")
print(f"AdamW: {optimizer_adamw}")
print(f"SGD+Momentum: {optimizer_sgd}")

print("\\nüìä When to use each:")
print("‚Ä¢ AdamW: Transformers, LLMs (GPT, BERT, Llama)")
print("‚Ä¢ SGD+Momentum: CNNs (ResNet, EfficientNet)")
print("‚Ä¢ Adam: General purpose (but prefer AdamW for research)")

---

## 6. Mixed Precision Training (CRITICAL for Nvidia/FAANG)

### The Problem
- Training in Float32 (32-bit) is slow and memory-intensive
- Large models (LLMs) don't fit in GPU memory

### The Solution: Mixed Precision (FP16 + FP32)
Use 16-bit floating point (FP16) for most operations, 32-bit (FP32) for critical parts.

**Benefits:**
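- **Up to ~2x faster training**, sometimes more (on modern GPUs with Tensor Cores; the gain depends on the model)
- **Lower memory use** (FP16 activations take half the space, so bigger models or batches fit)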
- **Minimal accuracy loss** (with proper techniques)

### How It Works
1. Store weights in FP32 (master copy)
2. Convert to FP16 for forward/backward pass
3. Use FP32 for weight updates (precision matters here!)
4. Apply **loss scaling** to prevent gradient underflow
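
Step 4 exists because FP16 cannot represent very small numbers. The sketch below, which runs on CPU, shows a small gradient value vanishing in FP16 and surviving when scaled first. The value `1e-8` and the scale `65536` are illustrative choices, not what any particular scaler uses at a given step.

```python
import torch

tiny_grad = torch.tensor(1e-8, dtype=torch.float32)
scale = 65536.0  # illustrative loss scale (2**16)

# Without scaling: the value underflows to zero in FP16
unscaled_fp16 = tiny_grad.half()

# With scaling: multiply first, cast to FP16, then unscale back in FP32
scaled_fp16 = (tiny_grad * scale).half()
recovered = scaled_fp16.float() / scale

print("FP16 without scaling:", unscaled_fp16.item())
print("FP16 with scaling:   ", scaled_fp16.item())
print("Recovered in FP32:   ", recovered.item())
```

Because gradients are linear in the loss, scaling the loss before `backward()` scales every gradient by the same factor, and dividing before the weight update restores the true values.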

### Automatic Mixed Precision (AMP)
PyTorch provides `torch.cuda.amp` (also exposed as `torch.amp` in recent releases), which handles the casting and loss scaling for you.

### FAANG Interview Question
**"How does mixed precision training work?"** ← Asked at Nvidia, Google, Meta

**Answer:**
1. Model weights stored in FP32
2. Forward/backward in FP16 (faster)
3. Gradient scaling to prevent underflow
4. Weight updates in FP32 (accuracy)
5. Result: a substantial speedup (often around 2x on Tensor Core GPUs) with comparable accuracy

In [None]:
# Mixed Precision Training Example (Production Code!)

# Setup
model = nn.Linear(1000, 1000).cuda() if torch.cuda.is_available() else nn.Linear(1000, 1000)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Create GradScaler for automatic mixed precision
scaler = torch.cuda.amp.GradScaler() if torch.cuda.is_available() else None

# Training loop with AMP
def train_with_amp(model, data, labels, optimizer, scaler):
    """
    This is the STANDARD training loop at FAANG companies!
    """
    optimizer.zero_grad()
    
    # Autocast: automatically uses FP16 where safe
    with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
        outputs = model(data)
        loss = nn.functional.mse_loss(outputs, labels)
    
    # Backward pass with gradient scaling
    if scaler:
        scaler.scale(loss).backward()  # Scale loss to prevent underflow
        scaler.step(optimizer)         # Unscale gradients and update weights
        scaler.update()                # Update scale for next iteration
    else:
        loss.backward()
        optimizer.step()
    
    return loss.item()

# Simulate training
device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(32, 1000).to(device)
labels = torch.randn(32, 1000).to(device)

loss = train_with_amp(model, data, labels, optimizer, scaler)

print(f"✓ Mixed precision training completed! Loss: {loss:.4f}")
print("\n💡 Key benefits:")
print("• Up to ~2x faster on GPUs with Tensor Cores (V100, A100, H100)")
print("• Lower memory usage (roughly half for activations)")
print("• Enable with just 3 lines: autocast + GradScaler")
print("\n⚠️ Must-know for Nvidia interviews!")

---

## 7. Gradient Accumulation (Train Huge Models on Small GPUs)

### The Problem
- You want batch_size=256, but your GPU only fits batch_size=32
- Large batches → less noisy gradient estimates, often more stable convergence

### The Solution: Gradient Accumulation
Accumulate gradients over multiple forward/backward passes before updating weights.

```
Effective batch size = micro_batch_size × accumulation_steps
Example: 32 × 8 = 256
```

### How It Works
1. Forward + backward pass (batch 1) → compute gradients (don't update!)
2. Forward + backward pass (batch 2) → add gradients (don't update!)
3. ...
4. Forward + backward pass (batch 8) → add gradients → **NOW update!**

### FAANG Interview Question
**"How do you train with a large batch size on limited GPU memory?"** ← Asked at Meta, Google

In [None]:
# Gradient Accumulation Example (Used in GPT training!)

model = nn.Linear(100, 10)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Configuration
accumulation_steps = 4  # Simulate 4x larger batch
micro_batch_size = 8
effective_batch_size = micro_batch_size * accumulation_steps

print(f"Training with effective batch size: {effective_batch_size}")
print(f"(Micro-batch: {micro_batch_size} × Accumulation: {accumulation_steps})")

# Training loop with gradient accumulation
optimizer.zero_grad()

for step in range(accumulation_steps):
    # Get micro-batch
    data = torch.randn(micro_batch_size, 100)
    labels = torch.randn(micro_batch_size, 10)
    
    # Forward
    outputs = model(data)
    loss = nn.functional.mse_loss(outputs, labels)
    
    # Scale loss by accumulation steps (important!)
    loss = loss / accumulation_steps
    
    # Backward (gradients accumulate)
    loss.backward()
    
    print(f"  Step {step+1}/{accumulation_steps}: loss={loss.item():.4f}")

# Now update weights (after all accumulation steps)
optimizer.step()
optimizer.zero_grad()

print("\n✓ Weights updated after accumulating gradients from all steps!")
print("This technique powers training of GPT-3, Llama, etc.")
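
The reason for dividing the loss by `accumulation_steps` can be checked directly. The sketch below compares the gradient from one full batch of 32 against the gradient accumulated from 4 micro-batches of 8, assuming a mean-reduced loss and equally sized micro-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(100, 10)
data = torch.randn(32, 100)
labels = torch.randn(32, 10)
accumulation_steps = 4

# Reference: one backward pass over the full batch
model.zero_grad()
nn.functional.mse_loss(model(data), labels).backward()
full_grad = model.weight.grad.clone()

# Accumulated: four micro-batches of 8, each loss divided by accumulation_steps
model.zero_grad()
for micro_data, micro_labels in zip(data.chunk(accumulation_steps),
                                    labels.chunk(accumulation_steps)):
    loss = nn.functional.mse_loss(model(micro_data), micro_labels)
    (loss / accumulation_steps).backward()   # gradients add up in .grad
accum_grad = model.weight.grad.clone()

print("gradients match:", torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Without the division, the accumulated gradient would be `accumulation_steps` times too large. Layers that depend on batch statistics, such as batch norm, still see only the micro-batch, so the equivalence is not exact for every model.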

---

## 8. Gradient Clipping (Preventing Exploding Gradients)

### The Problem
Gradients can explode (become very large) in:
- RNNs/LSTMs (long sequences)
- Very deep networks
- Transformers (sometimes)

**Result:** NaN losses, training divergence

### The Solution: Gradient Clipping
Cap gradients to a maximum value.

**Two methods:**
1. **Clip by value**: `grad = min(max(grad, -threshold), threshold)`
2. **Clip by norm**: If ||grad|| > threshold, scale down: `grad = grad * (threshold / ||grad||)`

**Clip by norm is standard!**
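
A small worked example makes the two methods easy to compare. The sketch below uses a hand-set gradient of `[3, 4]` (norm 5) and a threshold of 1; the helper `make_param` is hypothetical.

```python
import torch

def make_param():
    """Hypothetical helper: a parameter with a hand-set gradient of [3, 4]."""
    p = torch.zeros(2, requires_grad=True)
    p.grad = torch.tensor([3.0, 4.0])
    return p

# Clip by value: each component is clamped to [-1, 1] independently
p_value = make_param()
torch.nn.utils.clip_grad_value_([p_value], clip_value=1.0)

# Clip by norm: the whole vector is rescaled so its norm is at most 1
p_norm = make_param()
norm_before = torch.nn.utils.clip_grad_norm_([p_norm], max_norm=1.0)

print("clip by value:", p_value.grad)        # direction changes
print("clip by norm: ", p_norm.grad)         # direction preserved
print("norm before clipping:", norm_before.item())
```

Clipping by value gives `[1, 1]`, which points in a different direction from the original gradient, while clipping by norm gives approximately `[0.6, 0.8]`, the same direction at a smaller length. `clip_grad_norm_` also returns the pre-clipping norm, which is convenient for logging.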

### FAANG Interview Question
**"What causes NaN losses and how do you fix them?"** ← Asked at all FAANG

**Answer:**
1. Exploding gradients → Gradient clipping
2. Learning rate too high → Lower LR or use warmup
3. Numerical instability (e.g. FP16 overflow or underflow) → Use loss scaling and keep sensitive operations in FP32
4. Bad initialization → Use proper init (Kaiming, Xavier)

In [None]:
# Gradient Clipping Example (Standard in Transformer training)

model = nn.Linear(100, 10)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Simulate training
data = torch.randn(32, 100)
labels = torch.randn(32, 10)

optimizer.zero_grad()
outputs = model(data)
loss = nn.functional.mse_loss(outputs, labels)
loss.backward()

# Check gradient norm BEFORE clipping
total_norm_before = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm_before += param_norm.item() ** 2
total_norm_before = total_norm_before ** 0.5

# Gradient Clipping (THIS IS THE KEY LINE!)
max_norm = 1.0  # Common values: 0.5, 1.0, 5.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Check gradient norm AFTER clipping
total_norm_after = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm_after += param_norm.item() ** 2
total_norm_after = total_norm_after ** 0.5

print(f"Gradient norm before clipping: {total_norm_before:.4f}")
print(f"Gradient norm after clipping:  {total_norm_after:.4f}")
print(f"Max allowed norm: {max_norm}")

optimizer.step()

print("\n✓ Gradients clipped successfully!")
print("\n💡 Best practices:")
print("• Always clip gradients for RNNs/Transformers")
print("• Typical max_norm: 0.5-5.0")
print("• Monitor gradient norms during training")

---

## Final Summary: Production ML Training Stack

### The Complete Training Recipe (FAANG Standard)

```python
# 1. Model
model = YourModel().cuda()

# 2. Optimizer (AdamW for Transformers, SGD+Momentum for CNNs)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# 3. Learning Rate Schedule (cosine decay, stepped per epoch; warmup is not
#    included here and must be added separately, e.g. as in Section 1.5)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# 4. Mixed Precision
scaler = torch.cuda.amp.GradScaler()

# 5. Training Loop
accumulation_steps = 4
for epoch in range(epochs):
    for i, (data, labels) in enumerate(dataloader):
        # Move the batch to the same device as the model
        data, labels = data.cuda(), labels.cuda()

        # Mixed precision forward pass
        with torch.cuda.amp.autocast():
            outputs = model(data)
            loss = criterion(outputs, labels) / accumulation_steps
        
        # Backward
        scaler.scale(loss).backward()
        
        # Update every N steps (gradient accumulation)
        if (i + 1) % accumulation_steps == 0:
            # Gradient clipping
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            # Optimizer step
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    
    # LR schedule
    scheduler.step()
```

---

### FAANG Interview Cheat Sheet

| Topic | Key Point | Where Asked |
|-------|-----------|------------|
| **AdamW vs Adam** | Decoupled weight decay | Google, OpenAI |
| **Mixed Precision** | FP16 compute + FP32 weights | Nvidia, Meta |
| **Gradient Accumulation** | Simulate large batches | Meta, Google |
| **Gradient Clipping** | Prevent exploding gradients | All FAANG |
| **Learning Rate Warmup** | Start small, ramp up | OpenAI, Google |
| **Batch Normalization** | Normalize layer inputs | Basic question |

---

### Optimizer Decision Tree

```
Training Transformers/LLMs?
├─ YES → Use AdamW (lr=1e-4 to 1e-3)
└─ NO → Training CNNs?
    ├─ YES → Use SGD + Momentum (lr=0.1)
    └─ NO → Use AdamW (safe default)
```

---

### Common Mistakes to Avoid

1. ❌ **Using Adam instead of AdamW for Transformers**
   - ✅ Always use AdamW for modern deep learning

2. ❌ **Not using mixed precision on modern GPUs**
   - ✅ Always enable AMP on V100/A100/H100

3. ❌ **Forgetting to clip gradients for RNNs/Transformers**
   - ✅ Always clip with `max_norm=1.0`

4. ❌ **Not scaling loss with gradient accumulation**
   - ✅ `loss = loss / accumulation_steps`

5. ❌ **Constant learning rate**
   - ✅ Use warmup + cosine decay

---

### What We Covered (Enhanced)

**Part 1: Basics**
- ✅ Learning rate schedulers
- ✅ Dropout, weight decay
- ✅ Batch normalization
- ✅ Early stopping

**Part 2: Advanced (FAANG-Level)**
- ✅ AdamW vs Adam
- ✅ Mixed precision training (AMP)
- ✅ Gradient accumulation
- ✅ Gradient clipping

---

## Next Steps for FAANG Prep

1. **Practice:** Implement training loop with all techniques
2. **Understand:** Why each technique works (not just how)
3. **Memorize:** Common hyperparameters (lr, weight_decay, max_norm)
4. **Debug:** Practice fixing NaN losses, slow convergence

---

**You are now ready for Optimization questions at FAANG/Nvidia! 🚀**