# Module 06: Optimizers (SGD, Adam, RMSprop)

**Difficulty**: ⭐⭐⭐ (Advanced)

**Estimated Time**: 60-75 minutes

**Prerequisites**: 
- [Module 05: Feed-Forward Neural Networks with Keras](05_feedforward_neural_networks_keras.ipynb)
- [Module 02: Backpropagation and Gradient Descent](02_backpropagation_and_gradient_descent.ipynb)
- [Module 04: Introduction to TensorFlow and Keras](04_introduction_to_tensorflow_keras.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand the differences between gradient descent variants (Batch, Mini-batch, Stochastic)
2. Explain and implement momentum and Nesterov momentum for faster convergence
3. Apply adaptive learning rate methods (AdaGrad, RMSprop, Adam)
4. Compare optimizer performance on different problems
5. Implement learning rate schedules to improve training
6. Choose the appropriate optimizer for your specific task

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

# TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.datasets import mnist, fashion_mnist
from tensorflow.keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

# Reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print(f"TensorFlow version: {tf.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Understanding Gradient Descent Variants

### The Three Main Variants:

#### 1. **Batch Gradient Descent (BGD)**
- Uses **entire** training dataset to compute gradients
- Update rule: $\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta; x^{(1:n)}, y^{(1:n)})$
- **Pros**: Stable convergence, exact gradient
- **Cons**: Slow for large datasets, memory intensive

#### 2. **Stochastic Gradient Descent (SGD)**
- Uses **single** random sample to compute gradients
- Update rule: $\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
- **Pros**: Fast updates, can escape local minima
- **Cons**: Noisy updates, high variance

#### 3. **Mini-batch Gradient Descent (MBGD)** ⭐ Most Common
- Uses **small batch** of samples (typically 32, 64, 128, 256)
- Update rule: $\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta; x^{(i:i+b)}, y^{(i:i+b)})$
- **Pros**: Balance of speed and stability, GPU efficient
- **Cons**: Requires batch size tuning
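
The three variants differ only in how many samples feed each gradient estimate, so one loop can express all of them. Below is a minimal NumPy sketch on a synthetic linear-regression problem; the function name `minibatch_gd`, the data, and the hyperparameters are illustrative assumptions, not part of the course code.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w by minimizing mean squared error.

    batch_size == len(X) gives batch GD, batch_size == 1 gives SGD,
    anything in between is mini-batch GD.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                      # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # MSE gradient on this batch
            w -= lr * grad
    return w

# Synthetic data (illustrative only): y = 3*x0 - 2*x1 + small noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(256, 2))
true_w = np.array([3.0, -2.0])
y_demo = X_demo @ true_w + 0.01 * rng.normal(size=256)

for name, b in [("Batch GD", 256), ("Mini-batch (32)", 32), ("SGD", 1)]:
    w_hat = minibatch_gd(X_demo, y_demo, batch_size=b)
    print(f"{name:<16} w = {np.round(w_hat, 3)}")
```

With the same number of epochs, smaller batches perform many more parameter updates per pass over the data, which is where the speed/noise trade-off comes from.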

### Visual Comparison:

In [None]:
# Simulate gradient descent paths on a simple 2D function
def rosenbrock(x, y):
    """Rosenbrock function - classic optimization test function."""
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_gradient(x, y):
    """Gradient of Rosenbrock function."""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

# Create contour plot
x = np.linspace(-2, 2, 400)
y = np.linspace(-1, 3, 400)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

fig, ax = plt.subplots(figsize=(12, 8))
contour = ax.contour(X, Y, Z, levels=np.logspace(-1, 3.5, 20), cmap='viridis', alpha=0.6)
ax.clabel(contour, inline=True, fontsize=8)

# Simulate different GD variants (simplified visualization)
# Starting point
start = np.array([-1.5, 2.5])

# Batch GD: smooth path
path_batch = [start]
point = start.copy()
lr = 0.001
for _ in range(100):
    grad = rosenbrock_gradient(point[0], point[1])
    point = point - lr * grad
    path_batch.append(point.copy())
path_batch = np.array(path_batch)

# SGD: noisy path (add random noise to gradients)
path_sgd = [start]
point = start.copy()
for _ in range(100):
    grad = rosenbrock_gradient(point[0], point[1])
    # Add noise to simulate stochastic behavior
    noisy_grad = grad + np.random.randn(2) * 50
    point = point - lr * noisy_grad
    path_sgd.append(point.copy())
path_sgd = np.array(path_sgd)

# Mini-batch: moderate noise
path_minibatch = [start]
point = start.copy()
for _ in range(100):
    grad = rosenbrock_gradient(point[0], point[1])
    # Less noise than SGD
    noisy_grad = grad + np.random.randn(2) * 20
    point = point - lr * noisy_grad
    path_minibatch.append(point.copy())
path_minibatch = np.array(path_minibatch)

# Plot paths
ax.plot(path_batch[:, 0], path_batch[:, 1], 'b-', linewidth=2, label='Batch GD (smooth)', alpha=0.7)
ax.plot(path_sgd[:, 0], path_sgd[:, 1], 'r-', linewidth=1, label='SGD (noisy)', alpha=0.6)
ax.plot(path_minibatch[:, 0], path_minibatch[:, 1], 'g-', linewidth=1.5, 
        label='Mini-batch GD (balanced)', alpha=0.7)

# Mark start and optimum
ax.plot(start[0], start[1], 'ko', markersize=10, label='Start')
ax.plot(1, 1, 'r*', markersize=20, label='Optimum (1, 1)')

ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Gradient Descent Variants Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)
plt.tight_layout()
plt.show()

print("Observation:")
print("- Batch GD: Smooth but slow convergence")
print("- SGD: Fast but very noisy updates")
print("- Mini-batch GD: Good balance between speed and stability")

## 3. Momentum: Accelerating Gradient Descent

### Problem with Standard GD:
- Oscillates in steep dimensions
- Slow progress in gentle dimensions
- Can stall near saddle points, where gradients are close to zero

### Momentum Solution:

Accumulates a **velocity vector** in directions of persistent reduction:

$$
\begin{align}
v_t &= \beta v_{t-1} + \nabla_\theta J(\theta) \\
\theta_{t+1} &= \theta_t - \eta v_t
\end{align}
$$

Where:
- $v_t$ is the velocity (an exponentially decaying sum of past gradients)
- $\beta$ is the momentum coefficient (typically 0.9)
- $\eta$ is the learning rate

**Physical Analogy**: Ball rolling down a hill - builds momentum in consistent directions
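
One consequence of the update rule is worth checking numerically: if the gradient stays constant at $g$, the velocity converges to $g / (1 - \beta)$, so the effective step size becomes $\eta / (1 - \beta)$ (10x the nominal learning rate at $\beta = 0.9$). The helper name below is hypothetical and the snippet is only a sketch of that arithmetic.

```python
def steady_state_velocity(grad=1.0, beta=0.9, steps=2000):
    """Iterate v_t = beta * v_{t-1} + grad with a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad
    return v

for beta in (0.0, 0.5, 0.9, 0.99):
    v = steady_state_velocity(beta=beta)
    print(f"beta={beta:<5} velocity={v:8.3f}   1/(1-beta)={1 / (1 - beta):8.3f}")
```

This is one reason a learning rate that works for plain GD can be too large once momentum is switched on.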

### Nesterov Momentum (NAG):

**Lookahead** gradient - evaluates gradient at the anticipated position:

$$
\begin{align}
v_t &= \beta v_{t-1} + \nabla_\theta J(\theta - \eta \beta v_{t-1}) \\
\theta_{t+1} &= \theta_t - \eta v_t
\end{align}
$$

**Benefit**: The gradient is measured where the momentum step is about to land, so the update can correct course earlier and overshoots less

In [None]:
# Demonstrate momentum on a simple optimization problem
def compare_momentum_methods():
    """
    Compare standard GD, momentum, and Nesterov momentum.
    """
    # Simple quadratic function with different curvatures
    def f(x, y):
        return 0.5 * x**2 + 4.5 * y**2
    
    def grad_f(x, y):
        return np.array([x, 9*y])
    
    # Parameters
    start = np.array([10.0, 10.0])
    lr = 0.1
    beta = 0.9
    steps = 50
    
    # Standard GD
    path_gd = [start]
    point = start.copy()
    for _ in range(steps):
        grad = grad_f(point[0], point[1])
        point = point - lr * grad
        path_gd.append(point.copy())
    
    # Momentum GD
    path_momentum = [start]
    point = start.copy()
    velocity = np.zeros(2)
    for _ in range(steps):
        grad = grad_f(point[0], point[1])
        velocity = beta * velocity + grad
        point = point - lr * velocity
        path_momentum.append(point.copy())
    
    # Nesterov Momentum
    path_nesterov = [start]
    point = start.copy()
    velocity = np.zeros(2)
    for _ in range(steps):
        # Look ahead
        lookahead = point - lr * beta * velocity
        grad = grad_f(lookahead[0], lookahead[1])
        velocity = beta * velocity + grad
        point = point - lr * velocity
        path_nesterov.append(point.copy())
    
    return np.array(path_gd), np.array(path_momentum), np.array(path_nesterov), f

# Compare methods
path_gd, path_momentum, path_nesterov, func = compare_momentum_methods()

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Contour plot
x = np.linspace(-12, 12, 200)
y = np.linspace(-12, 12, 200)
X, Y = np.meshgrid(x, y)
Z = func(X, Y)

contour = ax1.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
ax1.plot(path_gd[:, 0], path_gd[:, 1], 'b-o', linewidth=2, markersize=4, 
         label='Standard GD', alpha=0.7)
ax1.plot(path_momentum[:, 0], path_momentum[:, 1], 'r-o', linewidth=2, markersize=4,
         label='Momentum', alpha=0.7)
ax1.plot(path_nesterov[:, 0], path_nesterov[:, 1], 'g-o', linewidth=2, markersize=4,
         label='Nesterov', alpha=0.7)
ax1.plot(0, 0, 'r*', markersize=20, label='Optimum')
ax1.set_xlabel('x', fontsize=12)
ax1.set_ylabel('y', fontsize=12)
ax1.set_title('Optimization Paths', fontsize=12, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Loss over iterations
losses_gd = [func(p[0], p[1]) for p in path_gd]
losses_momentum = [func(p[0], p[1]) for p in path_momentum]
losses_nesterov = [func(p[0], p[1]) for p in path_nesterov]

ax2.semilogy(losses_gd, 'b-', linewidth=2, label='Standard GD')
ax2.semilogy(losses_momentum, 'r-', linewidth=2, label='Momentum')
ax2.semilogy(losses_nesterov, 'g-', linewidth=2, label='Nesterov')
ax2.set_xlabel('Iteration', fontsize=12)
ax2.set_ylabel('Loss (log scale)', fontsize=12)
ax2.set_title('Convergence Speed Comparison', fontsize=12, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

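def first_step_below(losses, threshold=1.0):
    """Return the first iteration with loss < threshold, or None if never reached."""
    below = np.array(losses) < threshold
    # np.argmax alone would return 0 when no entry is True, so check first
    return int(np.argmax(below)) if below.any() else None

print("Iterations to reach loss < 1.0:")
print(f"  Standard GD: {first_step_below(losses_gd)}")
print(f"  Momentum:    {first_step_below(losses_momentum)}")
print(f"  Nesterov:    {first_step_below(losses_nesterov)}")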

## 4. Adaptive Learning Rate Methods

### Problem with Fixed Learning Rates:
- Too large: Overshooting, divergence
- Too small: Slow convergence
- Different parameters may need different learning rates

### Solution: Adaptive Methods

Automatically adjust learning rates for each parameter based on historical gradients.

### 4.1 AdaGrad (Adaptive Gradient)

**Key Idea**: Scale learning rate inversely to the square root of sum of squared gradients

$$
\begin{align}
G_t &= G_{t-1} + (\nabla_\theta J(\theta))^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta J(\theta)
\end{align}
$$

Where:
- $G_t$ accumulates squared gradients (element-wise)
- $\epsilon$ is a small constant for numerical stability (e.g., $10^{-8}$)

**Pros**: 
- Good for sparse gradients
- Automatically adjusts learning rate per parameter

**Cons**: 
- Accumulation of squared gradients grows monotonically
- Learning rate may become infinitesimally small
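
To make the shrinking step size concrete, here is a minimal NumPy sketch of the AdaGrad update applied to the quadratic from Section 3, $f(x, y) = 0.5x^2 + 4.5y^2$. The function names and the learning rate of 0.5 are illustrative assumptions.

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.5, eps=1e-8, steps=100):
    """AdaGrad: per-parameter step size lr / sqrt(G + eps)."""
    theta = np.array(theta0, dtype=float)
    G = np.zeros_like(theta)              # running sum of squared gradients
    eff_lr_history = []
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2
        eff_lr = lr / np.sqrt(G + eps)    # can only shrink, because G only grows
        theta -= eff_lr * g
        eff_lr_history.append(eff_lr.copy())
    return theta, np.array(eff_lr_history)

def grad_quadratic(theta):
    """Gradient of f(x, y) = 0.5*x^2 + 4.5*y^2."""
    return np.array([theta[0], 9.0 * theta[1]])

theta_final, eff_lr = adagrad(grad_quadratic, [10.0, 10.0])
print("Final point:", theta_final)
print("Effective LR at step 1:  ", eff_lr[0])
print("Effective LR at step 100:", eff_lr[-1])
```

The steeper $y$ direction receives a smaller effective learning rate than $x$, which is the per-parameter adaptation, while both effective rates keep falling for as long as training runs.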

### 4.2 RMSprop (Root Mean Square Propagation)

**Key Idea**: Fix AdaGrad's diminishing learning rate by using exponentially weighted average

$$
\begin{align}
E[g^2]_t &= \beta E[g^2]_{t-1} + (1-\beta) (\nabla_\theta J(\theta))^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla_\theta J(\theta)
\end{align}
$$

Where:
- $E[g^2]_t$ is the exponentially weighted average of squared gradients
- $\beta$ is the decay rate (typically 0.9)

**Pros**: 
- Fixes AdaGrad's aggressive learning rate decay
- Works well on non-stationary problems

**Recommended for**: RNNs, time series
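
The difference from AdaGrad shows up clearly when the gradient magnitude stays constant: AdaGrad's effective learning rate keeps decaying like $1/\sqrt{t}$, while RMSprop's settles near $\eta / |g|$. The sketch below feeds both accumulators the same made-up gradient stream; the function name and values are illustrative assumptions.

```python
import numpy as np

def effective_lrs(grads, lr=0.01, beta=0.9, eps=1e-8):
    """Track the effective step size of AdaGrad and RMSprop for one parameter."""
    G = 0.0        # AdaGrad: running sum of squared gradients
    avg_sq = 0.0   # RMSprop: exponentially weighted average of squared gradients
    ada, rms = [], []
    for g in grads:
        G += g ** 2
        avg_sq = beta * avg_sq + (1 - beta) * g ** 2
        ada.append(lr / np.sqrt(G + eps))
        rms.append(lr / np.sqrt(avg_sq + eps))
    return np.array(ada), np.array(rms)

grads = np.full(500, 2.0)   # constant gradient of magnitude 2 (illustrative)
ada_lr, rms_lr = effective_lrs(grads)
print(f"AdaGrad effective LR after 500 steps: {ada_lr[-1]:.6f}")
print(f"RMSprop effective LR after 500 steps: {rms_lr[-1]:.6f}")
```

Note that the RMSprop average starts at zero, so its first few effective learning rates are larger than the steady-state value.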

### 4.3 Adam (Adaptive Moment Estimation) ⭐ Most Popular

**Key Idea**: Combine momentum (first moment) with RMSprop (second moment)

$$
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) \nabla_\theta J(\theta) \quad \text{(momentum)} \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) (\nabla_\theta J(\theta))^2 \quad \text{(RMSprop)} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \quad \text{(bias correction)} \\
\hat{v}_t &= \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{align}
$$

**Default hyperparameters**:
- $\eta = 0.001$ (learning rate)
- $\beta_1 = 0.9$ (momentum decay)
- $\beta_2 = 0.999$ (RMSprop decay)
- $\epsilon = 10^{-8}$ (numerical stability; this is the value from the Adam paper, while Keras's `Adam` defaults to $10^{-7}$)

**Pros**: 
- Works well on most problems
- Bias correction handles initial timesteps
- Relatively robust to hyperparameter choices; the defaults often work without tuning

**When to use**: Default choice for deep learning
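
The five update equations translate almost line for line into NumPy. This is a minimal sketch for intuition, not the Keras implementation; the function names and the learning rate of 0.1 are assumptions chosen for the toy quadratic from Section 3.

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam with bias correction; returns the full optimization path."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # first moment (momentum term)
    v = np.zeros_like(theta)   # second moment (RMSprop term)
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # undo the bias toward zero
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)

def grad_quadratic(theta):
    """Gradient of f(x, y) = 0.5*x^2 + 4.5*y^2."""
    return np.array([theta[0], 9.0 * theta[1]])

path_adam = adam(grad_quadratic, [10.0, 10.0], lr=0.1)
print("First step taken:", path_adam[1] - path_adam[0])
print("Final point:     ", path_adam[-1])
```

Thanks to bias correction, the very first step has magnitude close to the learning rate in every coordinate, regardless of how large the raw gradient is.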

### 4.4 AdamW (Adam with Weight Decay)

**Key Idea**: Decouple weight decay from gradient-based update

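With standard Adam, L2 regularization is added to the loss, so the penalty gradient $\lambda \theta_t$ is divided by $\sqrt{\hat{v}_t}$ along with everything else; weights with large gradient histories are therefore regularized less than weights with small ones. AdamW applies the decay directly to the weights, outside the adaptive scaling: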

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)
$$

Where $\lambda$ is the weight decay coefficient.

**When to use**: Training transformers, large models
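
A toy experiment can illustrate the difference. In the sketch below, two weights start at 1.0 and receive a made-up loss gradient that alternates in sign, so the loss itself pushes them nowhere on average and any net shrinkage comes from the regularizer. The function `run_adam`, the gradient scales, and $\lambda = 0.1$ are all illustrative assumptions.

```python
import numpy as np

def run_adam(loss_grad_scale, l2=0.0, weight_decay=0.0, lr=1e-3,
             beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam on weights starting at 1.0 with a sign-alternating toy gradient.

    l2 > 0 adds the penalty gradient to the loss gradient (Adam + L2);
    weight_decay > 0 shrinks the weights outside the adaptive scaling (AdamW).
    """
    scale = np.asarray(loss_grad_scale, dtype=float)
    theta = np.ones_like(scale)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        sign = 1.0 if t % 2 == 1 else -1.0
        g = sign * scale + l2 * theta            # L2 penalty enters via the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        adaptive = m_hat / (np.sqrt(v_hat) + eps)
        theta = theta - lr * (adaptive + weight_decay * theta)  # decoupled decay
    return theta

scales = [0.1, 10.0]   # small vs. large gradient magnitude (illustrative)
w_l2 = run_adam(scales, l2=0.1)
w_adamw = run_adam(scales, weight_decay=0.1)
print("Adam + L2:", w_l2)
print("AdamW:    ", w_adamw)
```

In this setup, Adam with L2 shrinks the small-gradient weight far more than the large-gradient one, while AdamW shrinks both by nearly the same factor.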

## 5. Optimizer Comparison on Fashion-MNIST

Let's compare different optimizers empirically on a real task.

In [None]:
# Load and preprocess Fashion-MNIST
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Normalize and flatten
X_train_full = X_train_full.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

X_train_full_flat = X_train_full.reshape(-1, 784)
X_test_flat = X_test.reshape(-1, 784)

# Create validation split
X_train = X_train_full_flat[:50000]
X_valid = X_train_full_flat[50000:]
y_train = y_train_full[:50000]
y_valid = y_train_full[50000:]

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_valid.shape}")
print(f"Test set: {X_test_flat.shape}")

In [None]:
# Define a standard architecture for fair comparison
def create_model():
    """
    Create a standard neural network for optimizer comparison.
    Using same architecture ensures fair comparison.
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])
    return model

# Test the model creation
test_model = create_model()
test_model.summary()

In [None]:
# Compare different optimizers
optimizers_to_compare = {
    'SGD': optimizers.SGD(learning_rate=0.01),
    'SGD + Momentum': optimizers.SGD(learning_rate=0.01, momentum=0.9),
    'SGD + Nesterov': optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    'RMSprop': optimizers.RMSprop(learning_rate=0.001),
    'Adam': optimizers.Adam(learning_rate=0.001),
    'AdamW': optimizers.AdamW(learning_rate=0.001, weight_decay=0.0001)
}

# Train each optimizer
histories = {}
epochs = 15

for name, optimizer in optimizers_to_compare.items():
    print(f"\nTraining with {name}...")
    
    # Create fresh model
    model = create_model()
    
    # Compile with specific optimizer
    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    # Train
    history = model.fit(
        X_train, y_train,
        epochs=epochs,
        batch_size=128,
        validation_data=(X_valid, y_valid),
        verbose=0
    )
    
    histories[name] = history
    
    # Print final metrics
    final_val_acc = history.history['val_accuracy'][-1]
    final_val_loss = history.history['val_loss'][-1]
    print(f"  Final Validation Accuracy: {final_val_acc:.4f}")
    print(f"  Final Validation Loss:     {final_val_loss:.4f}")

print("\nAll optimizers trained!")

In [None]:
# Visualize comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot validation loss
for name, history in histories.items():
    ax1.plot(history.history['val_loss'], linewidth=2, marker='o', label=name)

ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Validation Loss', fontsize=12)
ax1.set_title('Optimizer Comparison: Validation Loss', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot validation accuracy
for name, history in histories.items():
    ax2.plot(history.history['val_accuracy'], linewidth=2, marker='o', label=name)

ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Validation Accuracy', fontsize=12)
ax2.set_title('Optimizer Comparison: Validation Accuracy', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Summary statistics
print("\n" + "="*80)
print("OPTIMIZER PERFORMANCE SUMMARY")
print("="*80)
print(f"{'Optimizer':<20} {'Best Val Acc':<15} {'Final Val Acc':<15} {'Epochs to 85% Val Acc'}")
print("-"*80)

for name, history in histories.items():
    acc_array = np.array(history.history['val_accuracy'])
    best_acc = acc_array.max()
    final_acc = acc_array[-1]
    # First epoch (1-indexed) at which validation accuracy reached 85%, if it ever did
    reached = acc_array >= 0.85
    speed = f"{np.argmax(reached) + 1}/{epochs}" if reached.any() else "not reached"
    
    print(f"{name:<20} {best_acc:<15.4f} {final_acc:<15.4f} {speed}")

print("="*80)

## 6. Learning Rate Schedules

### Why Use Learning Rate Schedules?

- **Early training**: Need higher learning rate for fast progress
- **Late training**: Need lower learning rate for fine-tuning
- **Solution**: Gradually decrease learning rate over time

### Common Schedules:

#### 1. **Step Decay**
$$\eta_t = \eta_0 \times \gamma^{\lfloor t/k \rfloor}$$

Reduce learning rate by factor $\gamma$ every $k$ epochs.

#### 2. **Exponential Decay**
$$\eta_t = \eta_0 \times e^{-\lambda t}$$

Smooth exponential reduction.

#### 3. **Cosine Annealing**
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$

Gradually decrease following a cosine curve.

#### 4. **ReduceLROnPlateau**
Reduce learning rate when validation metric stops improving.
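
The plateau rule can be sketched in a few lines of plain Python. The argument names mirror those of the Keras `ReduceLROnPlateau` callback, but the logic here is deliberately simplified (no `min_delta` or `cooldown`), and the validation losses are made up for illustration.

```python
def reduce_on_plateau(val_losses, initial_lr=0.01, factor=0.5, patience=2, min_lr=1e-5):
    """Return the learning rate used in each epoch under a simplified plateau rule."""
    lr = initial_lr
    best = float("inf")
    wait = 0                 # epochs since the last improvement
    lrs = []
    for loss in val_losses:
        lrs.append(lr)       # learning rate in effect during this epoch
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # cut the rate, but not below min_lr
                wait = 0
    return lrs

# Hypothetical validation losses: improvement, then two plateaus
fake_val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70, 0.70, 0.70]
for epoch, lr in enumerate(reduce_on_plateau(fake_val_losses), start=1):
    print(f"Epoch {epoch:2d}: lr = {lr}")
```

Unlike the fixed schedules above, the reduction points depend on how training actually goes, so the same settings can produce different learning rate curves on different runs.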

In [None]:
# Visualize different learning rate schedules
def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_drop=5):
    """Step decay schedule."""
    return initial_lr * (drop ** np.floor(epoch / epochs_drop))

def exponential_decay(epoch, initial_lr=0.1, k=0.1):
    """Exponential decay schedule."""
    return initial_lr * np.exp(-k * epoch)

def cosine_annealing(epoch, initial_lr=0.1, min_lr=0.001, T_max=30):
    """Cosine annealing schedule."""
    return min_lr + (initial_lr - min_lr) * 0.5 * (1 + np.cos(np.pi * epoch / T_max))

# Generate schedules
epochs_range = np.arange(0, 30)
lr_step = [step_decay(e) for e in epochs_range]
lr_exp = [exponential_decay(e) for e in epochs_range]
lr_cosine = [cosine_annealing(e) for e in epochs_range]

# Plot
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(epochs_range, lr_step, linewidth=2, marker='o', label='Step Decay')
ax.plot(epochs_range, lr_exp, linewidth=2, marker='s', label='Exponential Decay')
ax.plot(epochs_range, lr_cosine, linewidth=2, marker='^', label='Cosine Annealing')

ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Learning Rate', fontsize=12)
ax.set_title('Learning Rate Schedules Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Example: Train with cosine annealing schedule
print("Training with Cosine Annealing Learning Rate Schedule...\n")

# Create model
model_scheduled = create_model()

# Define cosine annealing callback
def cosine_schedule(epoch):
    """Cosine annealing schedule for Keras callback."""
    initial_lr = 0.01
    min_lr = 0.0001
    T_max = 20
    return min_lr + (initial_lr - min_lr) * 0.5 * (1 + np.cos(np.pi * epoch / T_max))

lr_scheduler = LearningRateScheduler(cosine_schedule, verbose=0)

# Compile
model_scheduled.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train with scheduler
history_scheduled = model_scheduled.fit(
    X_train, y_train,
    epochs=20,
    batch_size=128,
    validation_data=(X_valid, y_valid),
    callbacks=[lr_scheduler],
    verbose=0
)

print("Training completed!")
print(f"Final Validation Accuracy: {history_scheduled.history['val_accuracy'][-1]:.4f}")

In [None]:
# Compare with constant learning rate
print("\nTraining with Constant Learning Rate (for comparison)...\n")

model_constant = create_model()
model_constant.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history_constant = model_constant.fit(
    X_train, y_train,
    epochs=20,
    batch_size=128,
    validation_data=(X_valid, y_valid),
    verbose=0
)

print("Training completed!")
print(f"Final Validation Accuracy: {history_constant.history['val_accuracy'][-1]:.4f}")

In [None]:
# Compare scheduled vs constant learning rate
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Validation accuracy comparison
ax1.plot(history_constant.history['val_accuracy'], linewidth=2, 
         marker='o', label='Constant LR (0.01)')
ax1.plot(history_scheduled.history['val_accuracy'], linewidth=2, 
         marker='s', label='Cosine Annealing LR')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Validation Accuracy', fontsize=12)
ax1.set_title('Learning Rate Schedule Impact', fontsize=12, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Show actual learning rate schedule
lr_values = [cosine_schedule(e) for e in range(20)]
ax2.plot(lr_values, linewidth=2, marker='o', color='green')
ax2.axhline(y=0.01, color='red', linestyle='--', linewidth=2, label='Constant LR')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Learning Rate', fontsize=12)
ax2.set_title('Learning Rate Schedule', fontsize=12, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nBest Validation Accuracy:")
print(f"  Constant LR:        {max(history_constant.history['val_accuracy']):.4f}")
print(f"  Cosine Annealing:   {max(history_scheduled.history['val_accuracy']):.4f}")

## 7. Optimizer Selection Guide

### Decision Tree for Choosing Optimizers:

```
Start Here
    |
    |-- Need quick baseline? --> Adam (default choice)
    |
    |-- Training RNN/LSTM? --> RMSprop or Adam
    |
    |-- Training Transformer? --> AdamW
    |
    |-- Need best generalization? --> SGD + Momentum + LR Schedule
    |
    |-- Sparse gradients? --> Adam or AdaGrad
    |
    |-- Limited memory? --> SGD
```

### Hyperparameter Recommendations:

| Optimizer | Learning Rate | Other Params | When to Use |
|-----------|---------------|--------------|-------------|
| **SGD** | 0.01 - 0.1 | momentum=0.9 | Best final performance with tuning |
| **Adam** | 0.001 - 0.0001 | defaults are good | Default choice, fast convergence |
| **RMSprop** | 0.001 | rho=0.9 | RNNs, non-stationary problems |
| **AdamW** | 0.001 | weight_decay=0.01 | Transformers, large models |

### Common Pitfalls:

1. **Using same LR for different optimizers**: Adam typically works with a learning rate about 10x smaller than SGD (e.g., 0.001 vs 0.01, as in Section 5)
2. **Not tuning momentum**: Default 0.9 is good, but try 0.95 or 0.99 for some tasks
3. **Forgetting LR schedules**: Can significantly improve SGD performance
4. **Not monitoring gradients**: Check for vanishing/exploding gradients

## 8. Summary

### Key Concepts:

1. **Gradient Descent Variants**:
   - Batch GD: Stable but slow
   - SGD: Fast but noisy
   - Mini-batch: Best balance (most common)

2. **Momentum Methods**:
   - Accelerate learning in consistent directions
   - Dampen oscillations
   - Nesterov: Look-ahead variant

3. **Adaptive Learning Rates**:
   - AdaGrad: Good for sparse data, but learning rate decays
   - RMSprop: Fixes AdaGrad, good for RNNs
   - Adam: Best of both worlds, default choice
   - AdamW: Adam with proper weight decay

4. **Learning Rate Schedules**:
   - Step decay: Periodic drops
   - Exponential: Smooth decay
   - Cosine annealing: Gentle curve
   - Adaptive: Based on validation metrics

5. **Practical Guidelines**:
   - Start with Adam for quick results
   - Use SGD + momentum + schedule for best final performance
   - Adjust learning rates appropriately for each optimizer
   - Monitor training curves to diagnose issues

### What's Next?

- **Module 07**: Regularization Techniques (Dropout, Batch Normalization)
- **Module 08**: Loss Functions and Metrics
- **Module 09**: Hyperparameter Tuning for Deep Learning

### Additional Resources:

- [Keras Optimizers Documentation](https://keras.io/api/optimizers/)
- [An Overview of Gradient Descent Optimization Algorithms](https://arxiv.org/abs/1609.04747)
- [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
- [CS231n: Optimization](https://cs231n.github.io/neural-networks-3/)

## 9. Exercises

### Exercise 1: Implement Basic SGD with Momentum

**Task**: Implement SGD with momentum from scratch (without using Keras optimizers).

**Requirements**:
- Create a simple 2D optimization problem (e.g., minimize $f(x,y) = x^2 + 10y^2$)
- Implement the momentum update rule
- Compare convergence with standard GD
- Visualize the optimization path

**Questions**:
1. How does changing momentum ($\beta$) affect convergence?
2. What happens with momentum=0 vs momentum=0.9?

```python
# Your code here
```

In [None]:
# Exercise 1 Solution
# Uncomment to reveal

# def sgd_momentum(grad_fn, start, lr=0.1, beta=0.9, iterations=50):
#     """SGD with momentum implementation."""
#     path = [start]
#     point = np.array(start, dtype=float)
#     velocity = np.zeros_like(point)
#     
#     for _ in range(iterations):
#         grad = grad_fn(point[0], point[1])
#         velocity = beta * velocity + grad
#         point = point - lr * velocity
#         path.append(point.copy())
#     
#     return np.array(path)
# 
# # Test function and gradient
# def f(x, y):
#     return x**2 + 10 * y**2
# 
# def grad_f(x, y):
#     return np.array([2*x, 20*y])
# 
# # Compare different momentum values
# start = np.array([5.0, 5.0])
# path_no_momentum = sgd_momentum(grad_f, start, beta=0.0)
# path_momentum = sgd_momentum(grad_f, start, beta=0.9)
# 
# # Visualize
# # (Add visualization code here)

### Exercise 2: Learning Rate Sensitivity Analysis

**Task**: Investigate how different learning rates affect training with different optimizers.

**Requirements**:
- Train the same model with Adam using learning rates: [0.1, 0.01, 0.001, 0.0001]
- Train the same model with SGD using learning rates: [1.0, 0.1, 0.01, 0.001]
- Plot convergence curves for each
- Identify optimal learning rate ranges

**Questions**:
1. Which optimizer is more sensitive to learning rate?
2. What happens with too high learning rates?
3. What are the signs of learning rate being too small?

```python
# Your code here
```

In [None]:
# Exercise 2 Solution
# Uncomment to reveal

# adam_lrs = [0.1, 0.01, 0.001, 0.0001]
# sgd_lrs = [1.0, 0.1, 0.01, 0.001]
# 
# # Train with different Adam learning rates
# adam_histories = {}
# for lr in adam_lrs:
#     model = create_model()
#     model.compile(optimizer=optimizers.Adam(learning_rate=lr),
#                   loss='sparse_categorical_crossentropy',
#                   metrics=['accuracy'])
#     history = model.fit(X_train, y_train, epochs=10, 
#                        validation_data=(X_valid, y_valid), verbose=0)
#     adam_histories[lr] = history
# 
# # Repeat for SGD
# # (Add SGD training code here)
# 
# # Plot comparison
# # (Add plotting code here)

### Exercise 3: Custom Learning Rate Schedule

**Task**: Implement a custom "warm-up" learning rate schedule.

**Warm-up Schedule**:
- Linearly increase LR from 0 to max_lr over first `warmup_epochs`
- Then apply cosine annealing for remaining epochs

This is commonly used in transformer training.

**Requirements**:
- Implement the warm-up schedule function
- Train a model using this schedule
- Compare with standard cosine annealing
- Visualize the learning rate curve

```python
# Your code here
```

In [None]:
# Exercise 3 Solution
# Uncomment to reveal

# def warmup_cosine_schedule(epoch, warmup_epochs=5, max_lr=0.01, min_lr=0.0001, total_epochs=30):
#     """Warm-up followed by cosine annealing."""
#     if epoch < warmup_epochs:
#         # Linear warm-up
#         return max_lr * (epoch + 1) / warmup_epochs
#     else:
#         # Cosine annealing
#         progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
#         return min_lr + (max_lr - min_lr) * 0.5 * (1 + np.cos(np.pi * progress))
# 
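# # Create callback. Keras calls the schedule as schedule(epoch, lr), so wrap the
# # function to keep the current lr from being bound to warmup_epochs.
# warmup_scheduler = LearningRateScheduler(
#     lambda epoch, lr=None: warmup_cosine_schedule(epoch), verbose=0)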
# 
# # Train model
# model_warmup = create_model()
# model_warmup.compile(optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),
#                      loss='sparse_categorical_crossentropy',
#                      metrics=['accuracy'])
# 
# history_warmup = model_warmup.fit(X_train, y_train, epochs=30,
#                                   validation_data=(X_valid, y_valid),
#                                   callbacks=[warmup_scheduler], verbose=0)
# 
# # Visualize schedule
# lrs = [warmup_cosine_schedule(e) for e in range(30)]
# plt.plot(lrs)
# plt.xlabel('Epoch')
# plt.ylabel('Learning Rate')
# plt.title('Warm-up + Cosine Annealing Schedule')
# plt.show()

### Exercise 4: Optimizer Robustness Test

**Task**: Test optimizer robustness to different batch sizes.

**Requirements**:
- Train with batch sizes: [16, 32, 64, 128, 256, 512]
- Use both Adam and SGD+momentum
- Keep other hyperparameters constant
- Measure: final accuracy, training time, convergence speed

**Questions**:
1. Which optimizer is more sensitive to batch size?
2. What's the trade-off between batch size and convergence?
3. Is there an optimal batch size?

```python
# Your code here
```

In [None]:
# Exercise 4 Solution
# Uncomment to reveal

# import time
# 
# batch_sizes = [16, 32, 64, 128, 256, 512]
# results = {'Adam': {}, 'SGD': {}}
# 
# for batch_size in batch_sizes:
#     # Test Adam
#     model = create_model()
#     model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
#                   loss='sparse_categorical_crossentropy',
#                   metrics=['accuracy'])
#     
#     start_time = time.time()
#     history = model.fit(X_train, y_train, epochs=10, batch_size=batch_size,
#                        validation_data=(X_valid, y_valid), verbose=0)
#     training_time = time.time() - start_time
#     
#     results['Adam'][batch_size] = {
#         'accuracy': history.history['val_accuracy'][-1],
#         'time': training_time
#     }
#     
#     # Repeat for SGD
#     # (Add SGD training code here)
# 
# # Analyze and visualize results
# # (Add analysis code here)

---

**Congratulations!** You've completed Module 06. You now understand:
- Different gradient descent variants and their trade-offs
- How momentum accelerates convergence
- Adaptive learning rate methods (AdaGrad, RMSprop, Adam)
- Learning rate schedules for better training
- How to choose the right optimizer for your task

Continue to **Module 07: Regularization Techniques** to learn how to prevent overfitting!