# Topic 5: Training Loop & Optimization

## Learning Objectives

By the end of this notebook, you will:
- Understand the complete training loop workflow
- Know WHY each step in the training loop is necessary
- Use PyTorch optimizers (SGD, Adam, and variants)
- Implement proper train/validation splits
- Track and visualize training metrics
- Understand learning rates and scheduling
- Build a complete end-to-end classification system
- Avoid common training pitfalls

---

## 1. The Big Picture: The Training Loop

### Why Do We Need a Training Loop?

You've learned all the pieces:
- **Tensors**: Data representation
- **Autograd**: Gradient computation
- **nn.Module**: Model architecture
- **Loss functions**: Error measurement

Now we put it all together into a **training loop** - the heart of deep learning!

### The Standard PyTorch Training Loop

```python
for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:
        # 1. Forward pass: compute predictions
        predictions = model(batch_x)
        
        # 2. Compute loss
        loss = loss_function(predictions, batch_y)
        
        # 3. Zero gradients (they accumulate!)
        optimizer.zero_grad()
        
        # 4. Backward pass: compute gradients
        loss.backward()
        
        # 5. Update weights
        optimizer.step()
```

### Why Each Step?

1. **Forward pass**: Get predictions from current model
2. **Compute loss**: Measure how wrong we are
3. **Zero gradients**: Clear old gradients (they accumulate by default)
4. **Backward pass**: Compute gradients using autograd
5. **Update weights**: Move parameters in direction of lower loss

**Key insight**: This is gradient descent, but with mini-batches and automation!
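
To see why step 3 matters, the short sketch below calls `backward()` repeatedly on the same tiny computation, first without and then with clearing the gradient. The single parameter `w` and the toy loss `w ** 2` are illustrative assumptions, not part of the notebook's model.

```python
import torch

# A single trainable parameter (hypothetical toy example)
w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w
(w ** 2).backward()
first_grad = w.grad.item()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one
(w ** 2).backward()
accumulated_grad = w.grad.item()

# Zero the gradient, then run backward again: only the fresh gradient remains
w.grad.zero_()
(w ** 2).backward()
fresh_grad = w.grad.item()

print(f"first: {first_grad}, accumulated: {accumulated_grad}, after zeroing: {fresh_grad}")
```

`optimizer.zero_grad()` does the same clearing for every parameter the optimizer manages.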

### Epochs vs Iterations

- **Iteration**: One forward + backward pass on one batch
- **Epoch**: One complete pass through entire dataset
- If dataset has 1000 samples and batch size is 100:
  - 1 epoch = 10 iterations
  - 5 epochs = 50 iterations
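
The arithmetic above can be checked with a small helper. `iterations_per_epoch` is a hypothetical function written for this illustration; its `drop_last` flag mirrors the `DataLoader` option of the same name, which discards a final incomplete batch.

```python
import math

def iterations_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches (iterations) in one pass over the data."""
    if drop_last:
        # An incomplete final batch is discarded
        return num_samples // batch_size
    # An incomplete final batch still counts as one iteration
    return math.ceil(num_samples / batch_size)

per_epoch = iterations_per_epoch(1000, 100)
print(f"1 epoch  = {per_epoch} iterations")
print(f"5 epochs = {5 * per_epoch} iterations")

# When the batch size does not divide the dataset evenly,
# the last batch is smaller than the others.
print(f"Batch size 32: {iterations_per_epoch(1000, 32)} iterations per epoch")
```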

In [None]:
# Setup
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print(f"PyTorch version: {torch.__version__}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

---

## 2. Optimizers: The Engine of Learning

### What Is an Optimizer?

An optimizer implements the **weight update rule**:
$$\theta_{\text{new}} = \theta_{\text{old}} - \text{update}$$

Different optimizers compute the update differently.

### Why Not Manual Updates?

You could do:
```python
# Update parameters in place without recording the update in the autograd graph
with torch.no_grad():
    for param in model.parameters():
        param -= learning_rate * param.grad
```

But optimizers provide:
- Sophisticated update rules (momentum, adaptive learning rates)
- Handling of multiple parameter groups
- A common interface that learning rate schedulers plug into
- Numerical-stability safeguards (for example, the `eps` term in Adam)
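
As a sanity check, the sketch below applies one manual update and one `torch.optim.SGD` step to two identical copies of a small model and compares the results. The model size, the random data, and names such as `model_manual` are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
learning_rate = 0.1

# Two identical copies of a small linear model
model_manual = nn.Linear(3, 1)
model_optim = nn.Linear(3, 1)
model_optim.load_state_dict(model_manual.state_dict())

x = torch.randn(8, 3)
y = torch.randn(8, 1)

# Manual update: theta = theta - lr * grad
loss = nn.functional.mse_loss(model_manual(x), y)
loss.backward()
with torch.no_grad():
    for param in model_manual.parameters():
        param -= learning_rate * param.grad

# The same update through the optimizer
optimizer = torch.optim.SGD(model_optim.parameters(), lr=learning_rate)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model_optim(x), y)
loss.backward()
optimizer.step()

same = all(
    torch.allclose(p1, p2)
    for p1, p2 in zip(model_manual.parameters(), model_optim.parameters())
)
print(f"Parameters match: {same}")
```

For plain SGD the two are equivalent; the optimizer object pays off once momentum, weight decay, or adaptive rates are involved.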

### 2.1 Stochastic Gradient Descent (SGD)

**Formula**: $\theta = \theta - \eta \nabla L(\theta)$

**When to use**:
- Simple problems
- When you want full control
- Computer vision (with momentum)

**Pros**:
- Simple, well-understood
- Can generalize well with proper tuning
- Memory efficient

**Cons**:
- Requires careful learning rate tuning
- Can be slow to converge
- Sensitive to initialization

In [None]:
# SGD example
model = nn.Linear(10, 5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

print(f"Optimizer: {optimizer}")
print(f"\nParameter groups: {len(optimizer.param_groups)}")
print(f"Learning rate: {optimizer.param_groups[0]['lr']}")

### 2.2 SGD with Momentum

**Formula**: 
$$v_t = \beta v_{t-1} + \nabla L(\theta)$$
$$\theta = \theta - \eta v_t$$

**Why momentum?** Accelerates in consistent directions, dampens oscillations.

Think of a ball rolling down a hill - it builds up speed!

**When to use**:
- Computer vision (ResNet, VGG, etc.)
- When loss surface has valleys

**Typical momentum**: 0.9
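
The two-line formula can be traced by hand on a one-dimensional toy problem. This is a minimal pure-Python sketch, assuming the loss $L(\theta) = \theta^2$; `gradient` and `run_sgd` are hypothetical helpers, not PyTorch functions.

```python
def gradient(theta):
    """Gradient of the toy loss L(theta) = theta**2."""
    return 2.0 * theta

def run_sgd(theta, lr, momentum, steps):
    """Apply v_t = momentum * v_{t-1} + grad, then theta = theta - lr * v_t."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + gradient(theta)
        theta = theta - lr * velocity
    return theta

# Same start, same learning rate, same number of steps
theta_plain = run_sgd(theta=5.0, lr=0.01, momentum=0.0, steps=50)
theta_momentum = run_sgd(theta=5.0, lr=0.01, momentum=0.9, steps=50)

print(f"Plain SGD after 50 steps:    theta = {theta_plain:.4f}")
print(f"With momentum after 50 steps: theta = {theta_momentum:.4f}")
```

With `momentum=0.0` the loop reduces to plain SGD. On this toy problem the momentum run ends closer to the minimum at 0 because the velocity keeps growing while the gradient direction stays consistent.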

In [None]:
# SGD with momentum
optimizer_momentum = torch.optim.SGD(
    model.parameters(), 
    lr=0.01, 
    momentum=0.9
)

print(f"SGD with momentum: {optimizer_momentum}")
print("Momentum speeds up convergence along consistent gradient directions and damps oscillations!")

### 2.3 Adam (Adaptive Moment Estimation)

**Idea**: Combines momentum (a running average of gradients) with per-parameter adaptive learning rates (a running average of squared gradients)

**When to use**:
- **Default choice for most problems**
- NLP (Transformers, LLMs)
- When you don't want to tune learning rate carefully
- Works well out-of-the-box

**Pros**:
- Adaptive learning rates (different per parameter)
- Combines best of momentum and RMSProp
- Robust to hyperparameter choices
- Usually converges fast

**Cons**:
- Can sometimes generalize worse than SGD
- Uses more memory (maintains extra state)

**Typical hyperparameters**: lr=0.001, betas=(0.9, 0.999)
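
The combination of momentum and adaptive learning rates can be written out as update equations. The form below follows the Adam paper listed under Further Reading, using $g_t = \nabla L(\theta)$ for the gradient at step $t$ and the same $\theta$ and $\eta$ as in the SGD formulas; $\epsilon$ is the small stability constant (`eps` in PyTorch).

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta = \theta - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Here $m_t$ plays the role of momentum, $v_t$ tracks the typical gradient magnitude for each parameter, and the hatted terms correct for both averages starting at zero.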

In [None]:
# Adam optimizer
optimizer_adam = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),  # decay rates for the first and second moment estimates
    eps=1e-8  # added to the denominator for numerical stability
)

print(f"Adam: {optimizer_adam}")
print("\nAdam is the 'safe default' - works well for most problems!")

### 2.4 AdamW (Adam with Weight Decay)

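**What's different?** AdamW decouples weight decay from the gradient-based update: it shrinks the weights directly instead of adding an L2 term to the gradient, as Adam's `weight_decay` option does.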

**When to use**:
- **Prefer this over Adam!**
- Modern best practice (used in GPT, BERT, etc.)
- When you want weight decay as a regularizer

**Typical hyperparameters**: lr=0.001, weight_decay=0.01
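
The difference between the two weight decay implementations is easiest to see when the loss gradient is exactly zero, so that only weight decay moves the weight. The sketch below sets up that artificial situation for a single weight; `one_step` is a hypothetical helper written for this illustration.

```python
import torch

def one_step(optimizer_cls):
    """Take one optimizer step on a single weight whose loss gradient is zero."""
    weight = torch.nn.Parameter(torch.tensor([1.0]))
    optimizer = optimizer_cls([weight], lr=0.001, weight_decay=0.01)
    weight.grad = torch.zeros_like(weight)  # pretend the loss gradient is exactly zero
    optimizer.step()
    return weight.item()

adam_weight = one_step(torch.optim.Adam)    # decay is added to the gradient, then rescaled
adamw_weight = one_step(torch.optim.AdamW)  # weight is multiplied by (1 - lr * weight_decay)

print(f"Adam  (weight_decay=0.01): {adam_weight:.6f}")
print(f"AdamW (weight_decay=0.01): {adamw_weight:.6f}")
```

In Adam the decay term passes through the adaptive normalization, so the first step has size close to `lr` regardless of how small `weight_decay` is. In AdamW the shrinkage is `lr * weight_decay * weight`, independent of the gradient statistics.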

In [None]:
# AdamW optimizer (recommended over Adam)
optimizer_adamw = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # decoupled weight decay (not an L2 term added to the gradient)
)

print(f"AdamW: {optimizer_adamw}")
print("\nAdamW is the modern default - use this instead of Adam!")

### Quick Guide: Which Optimizer?

| Task | Recommended | Why |
|------|-------------|-----|
| **General/NLP** | AdamW | Robust, adaptive, standard for Transformers |
| **Computer Vision** | SGD + Momentum | Better generalization for CNNs |
| **Research/New Problems** | Try both | Adam for fast iteration, SGD for final model |
| **Small datasets** | AdamW | Less sensitive to hyperparameters |
| **Large datasets** | SGD + Momentum | Often better final performance |

---

## 3. Simple Training Loop Example

In [None]:
# Generate toy dataset: binary classification
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X = torch.FloatTensor(X)
y = torch.LongTensor(y)

print(f"X shape: {X.shape}")  # (1000, 2)
print(f"y shape: {y.shape}")  # (1000,)
print(f"Classes: {y.unique()}")

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', alpha=0.6)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', alpha=0.6)
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Two Moons Dataset', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Define model
class SimpleClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create model, loss, optimizer
model = SimpleClassifier(input_size=2, hidden_size=16, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

print(f"Model:\n{model}")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Training loop
num_epochs = 100
losses = []
accuracies = []

print("Training...")
for epoch in range(num_epochs):
    # 1. Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    
    # 2. Backward pass
    optimizer.zero_grad()  # Clear old gradients
    loss.backward()        # Compute new gradients
    optimizer.step()       # Update weights
    
    # 3. Track metrics
    losses.append(loss.item())
    
    with torch.no_grad():
        predicted = outputs.argmax(dim=1)
        accuracy = (predicted == y).float().mean()
        accuracies.append(accuracy.item())
    
    # 4. Print progress
    if (epoch + 1) % 20 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Accuracy: {accuracy.item():.4f}")

print("\nTraining complete!")

In [None]:
# Visualize training progress
plt.figure(figsize=(14, 5))

# Plot loss
plt.subplot(1, 2, 1)
plt.plot(losses, 'b-', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training Loss', fontsize=14)
plt.grid(True, alpha=0.3)

# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(accuracies, 'g-', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Training Accuracy', fontsize=14)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final loss: {losses[-1]:.4f}")
print(f"Final accuracy: {accuracies[-1]:.4f}")

---

## 4. Train/Validation Split: Why It Matters

### The Overfitting Problem

**Problem**: Model memorizes training data but fails on new data.

**Solution**: Split data into:
- **Training set** (70-80%): Used to update weights
- **Validation set** (10-15%): Used to evaluate during training
- **Test set** (10-15%): Used ONLY at the very end

**Why validation?**
- Monitor overfitting (train accuracy ↑, val accuracy ↓)
- Early stopping (stop when val performance degrades)
- Hyperparameter tuning (choose best based on val performance)

**Golden rule**: NEVER use test set during training!

In [None]:
# Proper train/val/test split
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

# Split: 70% train, 15% val, 15% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15/0.85, random_state=42
)

# Convert to tensors
X_train = torch.FloatTensor(X_train)
y_train = torch.LongTensor(y_train)
X_val = torch.FloatTensor(X_val)
y_val = torch.LongTensor(y_val)
X_test = torch.FloatTensor(X_test)
y_test = torch.LongTensor(y_test)

print(f"Training set:   {len(X_train)} samples")
print(f"Validation set: {len(X_val)} samples")
print(f"Test set:       {len(X_test)} samples")

### Training Loop with Validation

In [None]:
# Create fresh model
model = SimpleClassifier(input_size=2, hidden_size=16, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

num_epochs = 150
train_losses = []
val_losses = []
train_accs = []
val_accs = []

print("Training with validation...\n")

for epoch in range(num_epochs):
    # ===== TRAINING MODE =====
    model.train()  # Enable dropout, batchnorm in training mode
    
    # Forward pass
    train_outputs = model(X_train)
    train_loss = criterion(train_outputs, y_train)
    
    # Backward pass
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()
    
    # ===== VALIDATION MODE =====
    model.eval()  # Disable dropout, batchnorm in eval mode
    
    with torch.no_grad():  # Don't compute gradients for validation
        # Validation loss
        val_outputs = model(X_val)
        val_loss = criterion(val_outputs, y_val)
        
        # Training accuracy
        train_pred = train_outputs.argmax(dim=1)
        train_acc = (train_pred == y_train).float().mean()
        
        # Validation accuracy
        val_pred = val_outputs.argmax(dim=1)
        val_acc = (val_pred == y_val).float().mean()
    
    # Track metrics
    train_losses.append(train_loss.item())
    val_losses.append(val_loss.item())
    train_accs.append(train_acc.item())
    val_accs.append(val_acc.item())
    
    # Print progress
    if (epoch + 1) % 30 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}]")
        print(f"  Train Loss: {train_loss.item():.4f}, Train Acc: {train_acc.item():.4f}")
        print(f"  Val Loss:   {val_loss.item():.4f}, Val Acc:   {val_acc.item():.4f}")

print("\nTraining complete!")

In [None]:
# Visualize train vs validation
plt.figure(figsize=(14, 5))

# Plot losses
plt.subplot(1, 2, 1)
plt.plot(train_losses, 'b-', linewidth=2, label='Train Loss')
plt.plot(val_losses, 'r-', linewidth=2, label='Val Loss')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Loss: Train vs Validation', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)

# Plot accuracies
plt.subplot(1, 2, 2)
plt.plot(train_accs, 'b-', linewidth=2, label='Train Acc')
plt.plot(val_accs, 'r-', linewidth=2, label='Val Acc')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Accuracy: Train vs Validation', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key observations:")
print("1. If train acc >> val acc: OVERFITTING")
print("2. If both low: UNDERFITTING (need bigger model or more training)")
print("3. If val loss starts increasing: STOP TRAINING (early stopping)")

### Final Test Evaluation

In [None]:
# Evaluate on test set (ONLY ONCE!)
model.eval()

with torch.no_grad():
    test_outputs = model(X_test)
    test_loss = criterion(test_outputs, y_test)
    test_pred = test_outputs.argmax(dim=1)
    test_acc = (test_pred == y_test).float().mean()

print(f"Test Loss: {test_loss.item():.4f}")
print(f"Test Accuracy: {test_acc.item():.4f}")
print(f"\nThis is the final performance estimate!")

---

## 5. Mini-Batch Training with DataLoader

### Why Mini-Batches?

**Options**:
1. **Batch Gradient Descent**: Use entire dataset
   - Pros: Stable gradients
   - Cons: Slow, requires lots of memory

2. **Stochastic Gradient Descent**: Use one sample
   - Pros: Fast updates
   - Cons: Noisy gradients, unstable

3. **Mini-Batch Gradient Descent**: Use small batches (16-256)
   - Pros: Balance of speed and stability
   - Cons: Batch size becomes one more hyperparameter to tune, but this is the standard choice

**Typical batch sizes**: 32, 64, 128, 256

### PyTorch DataLoader

`DataLoader` handles:
- Batching
- Shuffling
- Parallel data loading
- Optional pinned-memory staging for faster transfer to the GPU (`pin_memory=True`)

In [None]:
# Create datasets and dataloaders
from torch.utils.data import DataLoader, TensorDataset

# Create TensorDatasets
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)
test_dataset = TensorDataset(X_test, y_test)

# Create DataLoaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
print(f"Test batches: {len(test_loader)}")
print()

# Peek at one batch
batch_x, batch_y = next(iter(train_loader))
print(f"Batch shape: {batch_x.shape}")
print(f"Labels shape: {batch_y.shape}")

### Complete Training Loop with DataLoader

In [None]:
# Create model
model = SimpleClassifier(input_size=2, hidden_size=32, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training parameters
num_epochs = 50
train_losses = []
val_losses = []

print("Training with DataLoader...\n")

for epoch in range(num_epochs):
    # ===== TRAINING =====
    model.train()
    train_loss = 0
    train_correct = 0
    train_total = 0
    
    for batch_x, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Track metrics
        train_loss += loss.item() * batch_x.size(0)
        train_correct += (outputs.argmax(dim=1) == batch_y).sum().item()
        train_total += batch_y.size(0)
    
    # Average over epoch
    train_loss /= train_total
    train_acc = train_correct / train_total
    
    # ===== VALIDATION =====
    model.eval()
    val_loss = 0
    val_correct = 0
    val_total = 0
    
    with torch.no_grad():
        for batch_x, batch_y in val_loader:
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            
            val_loss += loss.item() * batch_x.size(0)
            val_correct += (outputs.argmax(dim=1) == batch_y).sum().item()
            val_total += batch_y.size(0)
    
    val_loss /= val_total
    val_acc = val_correct / val_total
    
    # Track
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    
    # Print
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}]")
        print(f"  Train: Loss={train_loss:.4f}, Acc={train_acc:.4f}")
        print(f"  Val:   Loss={val_loss:.4f}, Acc={val_acc:.4f}")

print("\nTraining complete!")

---

## 6. Learning Rate Scheduling

### Why Schedule Learning Rate?

**Problem**: A single fixed learning rate is often suboptimal
- Too high: Never converge (bouncing around)
- Too low: Slow convergence

**Solution**: Start high, then gradually decrease

### Common Schedules

1. **StepLR**: Decrease by factor every N epochs
2. **ExponentialLR**: Multiply by factor each epoch
3. **CosineAnnealingLR**: Cosine curve (smooth decrease)
4. **ReduceLROnPlateau**: Decrease when metric stops improving
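
The first three schedules have simple closed forms when the scheduler is stepped once per epoch. The pure-Python helpers below are hand-written for illustration (they are not PyTorch APIs) and assume the learning rate is not modified by anything else during training.

```python
import math

def step_lr(base_lr, epoch, step_size, gamma):
    """StepLR: multiply by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def exponential_lr(base_lr, epoch, gamma):
    """ExponentialLR: multiply by gamma every epoch."""
    return base_lr * gamma ** epoch

def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """CosineAnnealingLR: follow half a cosine wave from base_lr down to eta_min."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 20, 40):
    print(
        f"epoch {epoch:2d} | "
        f"step: {step_lr(0.1, epoch, step_size=20, gamma=0.1):.5f} | "
        f"exp: {exponential_lr(0.1, epoch, gamma=0.95):.5f} | "
        f"cosine: {cosine_annealing_lr(0.1, epoch, t_max=40):.5f}"
    )
```

`ReduceLROnPlateau` has no closed form because it reacts to the monitored metric rather than to the epoch count.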

In [None]:
# Example: StepLR scheduler
model = SimpleClassifier(2, 32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Multiply LR by 0.1 every 20 epochs
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, 
    step_size=20,  # Every 20 epochs
    gamma=0.1      # Multiply by 0.1
)

# Simulate learning rate schedule
lrs = []
for epoch in range(100):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()  # No gradients here; called only to keep the step order PyTorch expects
    scheduler.step()  # Update learning rate

# Visualize
plt.figure(figsize=(10, 5))
plt.plot(lrs, 'b-', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Learning Rate', fontsize=12)
plt.title('StepLR Schedule', fontsize=14)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("Learning rate is reduced every 20 epochs!")

In [None]:
# Example: CosineAnnealingLR (smooth decay)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=50,  # Number of epochs
    eta_min=0.0001  # Minimum learning rate
)

# Simulate
lrs = []
for epoch in range(50):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()  # No gradients here; called only to keep the step order PyTorch expects
    scheduler.step()

# Visualize
plt.figure(figsize=(10, 5))
plt.plot(lrs, 'g-', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Learning Rate', fontsize=12)
plt.title('CosineAnnealingLR Schedule', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

print("Smooth decay following cosine curve - popular for transformers!")
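
`ReduceLROnPlateau` from the list of common schedules works differently from the two schedulers shown so far: its `step()` call takes the monitored metric as an argument. The sketch below feeds it a made-up sequence of validation losses; the values, `factor=0.5`, and `patience=2` are assumptions chosen to make the behavior visible.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',   # lower metric is better (a loss)
    factor=0.5,   # multiply LR by 0.5 when triggered
    patience=2    # tolerate 2 epochs without improvement
)

# Made-up validation losses: one improvement, then a plateau
simulated_val_losses = [1.0, 0.9, 0.9, 0.9, 0.9]

lrs = []
for val_loss in simulated_val_losses:
    optimizer.step()          # no gradients exist here, so no weights change
    scheduler.step(val_loss)  # this scheduler needs the monitored metric
    lrs.append(optimizer.param_groups[0]['lr'])

print(f"Learning rate after each epoch: {lrs}")
```

In a real training loop, `val_loss` would be the validation loss computed at the end of each epoch.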

---

## Mini Exercises

### Exercise 1: Fix the Broken Training Loop

The following training loop has 2 bugs. Find and fix them!

In [None]:
# BROKEN CODE (DO NOT RUN)
model = nn.Linear(10, 5)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(100, 10)
y = torch.randn(100, 5)

for epoch in range(10):
    outputs = model(X)
    loss = criterion(outputs, y)
    
    loss.backward()
    optimizer.step()
    # Missing something?
    
    print(f"Loss: {loss}")

In [None]:
# Your fixed code here


In [None]:
# Solution
model = nn.Linear(10, 5)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(100, 10)
y = torch.randn(100, 5)

for epoch in range(10):
    outputs = model(X)
    loss = criterion(outputs, y)
    
    optimizer.zero_grad()  # BUG 1: Must zero gradients!
    loss.backward()
    optimizer.step()
    
    print(f"Loss: {loss.item()}")  # BUG 2: Use .item() to get scalar

print("\nFixed bugs:")
print("1. Missing optimizer.zero_grad() - gradients accumulate!")
print("2. Printing loss tensor instead of scalar (use .item())")

### Exercise 2: Implement Early Stopping

Add early stopping to the training loop:
- Stop training if validation loss doesn't improve for 5 epochs
- Save the best model (lowest val loss)

In [None]:
# Your code here


In [None]:
# Solution
import copy

model = SimpleClassifier(2, 32, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Early stopping parameters
patience = 5
best_val_loss = float('inf')
epochs_without_improvement = 0
best_model = None

num_epochs = 100

for epoch in range(num_epochs):
    # Training
    model.train()
    train_outputs = model(X_train)
    train_loss = criterion(train_outputs, y_train)
    
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()
    
    # Validation
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_val)
        val_loss = criterion(val_outputs, y_val).item()  # Store a Python float, not a tensor
    
    # Early stopping check
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
        print(f"Epoch {epoch+1}: New best! Val Loss = {val_loss:.4f}")
    else:
        epochs_without_improvement += 1
    
    # Stop if no improvement
    if epochs_without_improvement >= patience:
        print(f"\nEarly stopping at epoch {epoch+1}")
        print(f"Best val loss: {best_val_loss:.4f}")
        break

# Restore best model
model.load_state_dict(best_model)
print("\nRestored best model!")

### Exercise 3: Compare Optimizers

Train the same model with SGD, SGD+Momentum, Adam, and AdamW.
Plot training curves to compare convergence speed.

In [None]:
# Your code here


In [None]:
# Solution
def train_with_optimizer(optimizer_name, lr=0.01):
    """Train model with specified optimizer"""
    model = SimpleClassifier(2, 32, 2)
    criterion = nn.CrossEntropyLoss()
    
    # Create optimizer
    if optimizer_name == 'SGD':
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    elif optimizer_name == 'SGD+Momentum':
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    elif optimizer_name == 'Adam':
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == 'AdamW':
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    
    losses = []
    
    for epoch in range(50):
        model.train()
        outputs = model(X_train)
        loss = criterion(outputs, y_train)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        losses.append(loss.item())
    
    return losses

# Train with each optimizer
results = {}
for opt in ['SGD', 'SGD+Momentum', 'Adam', 'AdamW']:
    print(f"Training with {opt}...")
    results[opt] = train_with_optimizer(opt)

# Plot comparison
plt.figure(figsize=(12, 6))
for opt, losses in results.items():
    plt.plot(losses, linewidth=2, label=opt)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Optimizer Comparison', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print("\nTypical observations (exact curves depend on learning rate and random seed):")
print("- SGD: Usually the slowest convergence")
print("- SGD+Momentum: Usually faster than vanilla SGD")
print("- Adam/AdamW: Often the fastest convergence (adaptive learning rates)")

---

## Comprehensive Exercise: Complete Classification Pipeline

Build a complete classification system from scratch:

**Task**: Binary classification on make_moons dataset

**Requirements**:
1. Train/val/test split (70/15/15)
2. DataLoader with batch_size=32
3. Model: 2 hidden layers (64, 32 neurons), ReLU, dropout=0.2
4. Optimizer: AdamW with weight_decay=0.01
5. Learning rate scheduler: CosineAnnealingLR
6. Train for 100 epochs
7. Track train/val loss and accuracy
8. Implement early stopping (patience=10)
9. Evaluate on test set
10. Visualize decision boundary

**Bonus**: Add weight initialization and gradient clipping
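
As a hint for the bonus, here is a minimal sketch of both techniques on a throwaway network. The choice of Kaiming initialization, `max_norm=1.0`, and names such as `init_weights` and `net` are assumptions for illustration; other choices are equally valid.

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Kaiming-normal weights and zero biases for every Linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        nn.init.zeros_(module.bias)

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
net.apply(init_weights)  # apply() visits every submodule

# One forward/backward pass on random data to produce gradients
inputs = torch.randn(16, 2)
targets = torch.randint(0, 2, (16,))
loss = nn.CrossEntropyLoss()(net(inputs), targets)
loss.backward()

# Rescale all gradients so their combined L2 norm is at most max_norm
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in net.parameters()))

print(f"Gradient norm before clipping: {norm_before.item():.4f}")
print(f"Gradient norm after clipping:  {norm_after.item():.4f}")
```

In a training loop, the clipping call goes between `loss.backward()` and `optimizer.step()`.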

In [None]:
# Your complete solution here


In [None]:
# Solution (will be quite long - this is comprehensive!)
# Complete this based on all concepts learned
# This would be a great capstone exercise for students

print("This is left as a comprehensive exercise for the learner!")
print("Combine all concepts from topics 1-5 to build a complete pipeline.")

---

## Key Takeaways

1. **Training loop structure**: Forward → Loss → Zero grad → Backward → Step
2. **Always zero gradients**: They accumulate by default!
3. **Train/val/test split**: Essential to detect overfitting
4. **model.train() vs model.eval()**: Switch modes appropriately
5. **torch.no_grad()**: Disable gradients during evaluation
6. **Optimizers**:
   - AdamW: Default choice (robust, fast)
   - SGD+Momentum: Computer vision
7. **DataLoader**: Handles batching, shuffling, parallel loading
8. **Learning rate scheduling**: Improves convergence
9. **Early stopping**: Limits overfitting by stopping once validation loss stops improving
10. **Monitoring**: Track metrics to diagnose issues

### Common Mistakes

1. **Forgetting `optimizer.zero_grad()`** → Gradients accumulate!
2. **Not switching to eval mode** → Dropout/BatchNorm behave wrong
3. **Computing gradients during validation** → Wastes memory
4. **Using test set for hyperparameter tuning** → Overfitting to test set
5. **Too high learning rate** → Divergence
6. **Too low learning rate** → Slow convergence
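
Mistake 2 is easy to demonstrate with a standalone dropout layer. In this sketch the input of ones and `p=0.5` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: random elements are zeroed, survivors are scaled by 1 / (1 - p)
dropout.train()
train_out = dropout(x)

# Eval mode: dropout is the identity function
dropout.eval()
eval_out = dropout(x)

zero_fraction = (train_out == 0).float().mean().item()
print(f"Train mode: {zero_fraction:.1%} of values zeroed, survivors = {train_out.max().item()}")
print(f"Eval mode:  output equals input: {torch.equal(eval_out, x)}")
```

Evaluating in training mode would therefore make validation metrics noisy and not representative of inference behavior.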

### Training Loop Checklist

```python
for epoch in range(num_epochs):
    # TRAINING
    model.train()                          # ✓
    for inputs, target in train_loader:
        optimizer.zero_grad()              # ✓
        output = model(inputs)             # ✓
        loss = criterion(output, target)   # ✓
        loss.backward()                    # ✓
        optimizer.step()                   # ✓

    # VALIDATION
    model.eval()                           # ✓
    with torch.no_grad():                  # ✓
        pass  # Evaluate on validation set
```

---

## Next Steps

Congratulations! You've completed the PyTorch fundamentals. You can now:
- Build neural networks
- Choose appropriate loss functions
- Train models properly
- Evaluate performance

**Next topics** (Intermediate level):
- Custom datasets and data augmentation
- Convolutional Neural Networks (CNNs)
- Transfer learning
- Advanced optimization techniques

---

## Further Reading

- [PyTorch Optimization Tutorial](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html)
- [Adam Paper](https://arxiv.org/abs/1412.6980)
- [AdamW Paper](https://arxiv.org/abs/1711.05101)
- [CS231n: Training Neural Networks](http://cs231n.github.io/neural-networks-3/)
- [Learning Rate Scheduling](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)