# Lab 1.4.2: Optimizer Implementation

**Module:** 1.4 - Mathematics for Deep Learning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Implement vanilla SGD from scratch
- [ ] Implement SGD with momentum
- [ ] Implement the Adam optimizer
- [ ] Understand why adaptive learning rates help
- [ ] Compare convergence behavior of different optimizers

---

## 📚 Prerequisites

- Completed: Lab 1.4.1 (Manual Backpropagation)
- Knowledge of: Gradients, basic calculus

---

## 🌍 Real-World Context

**Why do optimizers matter?**

Choosing the right optimizer can mean the difference between:
- A model that trains in 1 hour vs 100 hours
- A model that converges vs one that diverges
- Finding a good solution vs getting stuck in a bad one

**Real examples:**
- GPT models use AdamW (Adam with weight decay)
- Vision Transformers often use AdamW with specific β values
- BERT was trained with Adam (β1=0.9, β2=0.999)

Understanding these algorithms helps you debug training and choose the right tool!

---

## 🧒 ELI5: What is an Optimizer?

> **Imagine you're blindfolded on a mountain, trying to find the lowest valley...**
>
> **Vanilla SGD (the basic approach):**
> - Feel which way is downhill with your foot
> - Take one step in that direction
> - Repeat
> - Problem: You might zigzag a lot and waste energy!
>
> **SGD with Momentum (like a ball rolling downhill):**
> - You're a ball now! You build up speed as you roll
> - When going downhill consistently, you go faster
> - Small bumps don't stop you (you have momentum!)
> - Problem: Sometimes you overshoot the valley
>
> **Adam (the smart explorer):**
> - You have momentum (like the ball)
> - BUT you also adapt your step size!
> - Steep areas? Take smaller steps to not overshoot
> - Flat areas? Take bigger steps to make progress
> - This is like having a smart GPS that adjusts your walking speed!

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("🚀 Optimizer Implementation Lab")
print("=" * 50)

---

## Part 1: The Test Function

We'll optimize the **Rosenbrock function** - a classic optimization test:

$$f(x, y) = (a - x)^2 + b(y - x^2)^2$$

Where typically $a=1$, $b=100$.

**Why this function?**
- Has a global minimum at $(a, a^2) = (1, 1)$
- Has a curved valley that's hard to navigate
- Exposes weaknesses in different optimizers

It looks like a banana-shaped valley - easy to find the valley, hard to find the bottom!
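
Every optimizer in this lab consumes an analytic gradient, so it can be worth checking that gradient numerically before relying on it. The sketch below compares the closed-form Rosenbrock gradient with a central finite-difference estimate. The names `rosen`, `rosen_grad`, and `numerical_grad`, the test points, and the step size `h` are illustrative assumptions, not part of the lab's own code.

```python
import numpy as np

def rosen(x, y, a=1.0, b=100.0):
    """Rosenbrock function f(x, y) = (a - x)^2 + b * (y - x^2)^2."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosen_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient [df/dx, df/dy] of the Rosenbrock function."""
    df_dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    df_dy = 2.0 * b * (y - x ** 2)
    return np.array([df_dx, df_dy])

def numerical_grad(f, x, y, h=1e-5):
    """Central-difference gradient estimate (h is an assumed step size)."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return np.array([df_dx, df_dy])

# Arbitrary test points, chosen only for illustration
check_points = [(-1.0, 1.0), (0.5, 0.25), (1.5, -0.5)]
max_abs_error = 0.0
for px, py in check_points:
    gap = np.abs(rosen_grad(px, py) - numerical_grad(rosen, px, py))
    max_abs_error = max(max_abs_error, float(np.max(gap)))

print(f"Largest analytic-vs-numeric gap: {max_abs_error:.2e}")
```

If the two disagree by much more than rounding noise, the analytic gradient (not the optimizer) is the first thing to debug.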

In [None]:
def rosenbrock(x, y, a=1, b=100):
    """
    Rosenbrock function - a classic optimization test case.
    
    Global minimum: f(a, a²) = 0 → at (1, 1) with default params
    """
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_gradient(x, y, a=1, b=100):
    """
    Gradient of Rosenbrock function.
    
    ∂f/∂x = -2(a-x) - 4bx(y-x²)
    ∂f/∂y = 2b(y-x²)
    """
    df_dx = -2*(a - x) - 4*b*x*(y - x**2)
    df_dy = 2*b*(y - x**2)
    return np.array([df_dx, df_dy])

# Visualize the function
fig = plt.figure(figsize=(14, 5))

# Create mesh
x_range = np.linspace(-2, 2, 200)
y_range = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x_range, y_range)
Z = rosenbrock(X, Y)

# Contour plot
ax1 = fig.add_subplot(121)
levels = np.logspace(0, 3, 20)  # Log-spaced levels
contour = ax1.contour(X, Y, Z, levels=levels, cmap='viridis')
ax1.scatter([1], [1], color='red', s=100, marker='*', 
           label='Global minimum (1, 1)', zorder=5)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Rosenbrock Function (Contour)')
ax1.legend()
plt.colorbar(contour, ax=ax1, label='f(x,y)')

# 3D surface
ax2 = fig.add_subplot(122, projection='3d')
# Use log scale for better visualization
Z_log = np.log10(Z + 1)
surf = ax2.plot_surface(X, Y, Z_log, cmap='viridis', alpha=0.8)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('log₁₀(f + 1)')
ax2.set_title('Rosenbrock Function (3D)')

plt.tight_layout()
plt.show()

print("\n📊 The Rosenbrock function:")
print("  - Has a curved 'banana' valley")
print("  - Global minimum at (1, 1) with value 0")
print("  - Difficult because the valley is flat but curved")

---

## Part 2: Vanilla SGD

The simplest optimizer:

$$\theta_{t+1} = \theta_t - \eta \cdot \nabla L(\theta_t)$$

Where:
- $\theta$ = parameters
- $\eta$ = learning rate
- $\nabla L$ = gradient of loss

**That's it!** Just move opposite to the gradient.
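
As a worked instance of the update rule, the sketch below takes a single SGD step on the Rosenbrock function from the point $(-1, 1)$ with $\eta = 0.001$. The helper name `rosen_grad` and the chosen point and learning rate are assumptions for illustration. At that point the gradient is $(-4, 0)$, so the step moves $x$ from $-1$ to $-0.996$ and leaves $y$ unchanged.

```python
import numpy as np

def rosen_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient of the Rosenbrock function."""
    df_dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    df_dy = 2.0 * b * (y - x ** 2)
    return np.array([df_dx, df_dy])

theta = np.array([-1.0, 1.0])   # current parameters (illustrative start)
lr = 0.001                      # learning rate eta

grad = rosen_grad(theta[0], theta[1])
theta_next = theta - lr * grad  # theta_{t+1} = theta_t - eta * grad

print("gradient:            ", grad)
print("theta after one step:", theta_next)
```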

In [None]:
class SGD:
    """
    Vanilla Stochastic Gradient Descent optimizer.
    
    The simplest optimizer: just follow the negative gradient.
    
    Update rule:
        θ = θ - lr * gradient
    
    Args:
        lr: Learning rate (step size)
    
    Example:
        >>> optimizer = SGD(lr=0.01)
        >>> params = np.array([0.0, 0.0])
        >>> for _ in range(100):
        ...     grad = compute_gradient(params)
        ...     params = optimizer.step(params, grad)
    """
    
    def __init__(self, lr=0.01):
        self.lr = lr
        self.name = f"SGD (lr={lr})"
    
    def step(self, params, grads):
        """
        Perform one optimization step.
        
        Args:
            params: Current parameters (numpy array)
            grads: Gradients at current parameters
            
        Returns:
            Updated parameters
        """
        return params - self.lr * grads
    
    def reset(self):
        """Reset optimizer state (nothing to reset for vanilla SGD)"""
        pass

print("SGD Implementation:")
print("  θ_new = θ - lr × gradient")
print("\nSimple, but can be slow in curved valleys!")

---

## Part 3: SGD with Momentum

Momentum helps overcome oscillations by accumulating past gradients:

$$v_t = \beta \cdot v_{t-1} + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta \cdot v_t$$

Where:
- $v$ = velocity (accumulated gradient)
- $\beta$ = momentum coefficient (typically 0.9)
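
One consequence of this update rule is worth seeing numerically: if the gradient stays constant at $g$, the velocity converges to $g / (1 - \beta)$, so momentum with $\beta = 0.9$ behaves like plain SGD with a roughly 10 times larger learning rate along consistent directions. The sketch below assumes a constant scalar gradient purely for illustration.

```python
beta = 0.9
g = 1.0   # constant gradient (an assumption for this illustration)
v = 0.0   # velocity starts at zero

for _ in range(200):
    v = beta * v + g   # v_t = beta * v_{t-1} + g

terminal_velocity = g / (1.0 - beta)  # limit of the geometric series

print(f"velocity after 200 steps: {v:.6f}")
print(f"g / (1 - beta):           {terminal_velocity:.6f}")
```

This is one reason a learning rate that is safe for vanilla SGD may need to be reduced when momentum is switched on.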

### 🧒 ELI5: Why Momentum?

> Imagine pushing a shopping cart down a bumpy path:
> - Without momentum: Every bump makes you change direction
> - With momentum: The cart's weight carries it through small bumps
>
> The cart naturally smooths out the path!

In [None]:
class SGDMomentum:
    """
    SGD with Momentum optimizer.
    
    Accumulates past gradients to build "velocity" and smooth out oscillations.
    
    Update rules:
        v = β × v + gradient           (accumulate velocity)
        θ = θ - lr × v                 (update parameters)
    
    Args:
        lr: Learning rate
        momentum: Momentum coefficient β (typically 0.9)
    
    Example:
        >>> optimizer = SGDMomentum(lr=0.01, momentum=0.9)
        >>> optimizer.reset()  # Initialize velocity
        >>> params = np.array([0.0, 0.0])
        >>> for _ in range(100):
        ...     grad = compute_gradient(params)
        ...     params = optimizer.step(params, grad)
    """
    
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None
        self.name = f"SGD+Momentum (lr={lr}, β={momentum})"
    
    def step(self, params, grads):
        """
        Perform one optimization step with momentum.
        
        Args:
            params: Current parameters
            grads: Gradients at current parameters
            
        Returns:
            Updated parameters
        """
        # Initialize velocity on first call
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        
        # Update velocity: v = β*v + grad
        self.velocity = self.momentum * self.velocity + grads
        
        # Update parameters: θ = θ - lr*v
        return params - self.lr * self.velocity
    
    def reset(self):
        """Reset velocity to zero"""
        self.velocity = None

print("SGD+Momentum Implementation:")
print("  v = β × v + gradient")
print("  θ_new = θ - lr × v")
print("\nThe velocity builds up in consistent directions!")

---

## Part 4: Adam Optimizer

**Adam** = **Ada**ptive **M**oment estimation

Adam combines:
1. **Momentum** (first moment = mean of gradients)
2. **RMSprop** (second moment = running mean of squared gradients, i.e. the uncentered variance)

### The Algorithm:

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
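
Plugging $t = 1$ into these equations gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the first update is $\eta \cdot g_1 / (|g_1| + \epsilon)$: its size is close to $\eta$ no matter how large or small the gradient is. The sketch below checks this for scalar gradients spanning six orders of magnitude. The function name `adam_first_step` is a hypothetical helper, and the hyperparameters are the usual defaults.

```python
import numpy as np

def adam_first_step(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter change produced by the very first Adam update (t = 1)."""
    m = (1 - beta1) * g              # m_1, starting from m_0 = 0
    v = (1 - beta2) * g ** 2         # v_1, starting from v_0 = 0
    m_hat = m / (1 - beta1 ** 1)     # bias correction recovers g
    v_hat = v / (1 - beta2 ** 1)     # bias correction recovers g^2
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

for g in [1e-3, 1.0, 1e3]:
    print(f"gradient = {g:8.3f}  ->  first step = {adam_first_step(g):+.6f}")
```

Vanilla SGD with the same learning rate would take steps differing by a factor of a million across these three gradients.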

### 🧒 ELI5: Why Adam Works

> Imagine you're navigating a fog-covered valley:
>
> **First moment (m):** "On average, I've been going LEFT, so I should keep going left"
> (This is momentum!)
>
> **Second moment (v):** "The ground has been VERY bumpy in the left-right direction"
> (This tells you how much the gradient varies)
>
> **Combining them:** "I should go left (m says so), but take SMALL steps because it's bumpy there (v says so)"
>
> Adam automatically takes smaller steps in bumpy directions and bigger steps in smooth directions!

In [None]:
class Adam:
    """
    Adam (Adaptive Moment Estimation) optimizer.
    
    Combines momentum with adaptive learning rates per parameter.
    
    Update rules:
        m = β1 × m + (1-β1) × g         (first moment / momentum)
        v = β2 × v + (1-β2) × g²        (second moment / variance)
        m_hat = m / (1 - β1^t)          (bias correction)
        v_hat = v / (1 - β2^t)          (bias correction)
        θ = θ - lr × m_hat / (√v_hat + ε)
    
    Args:
        lr: Learning rate (default 0.001)
        beta1: Momentum coefficient (default 0.9)
        beta2: Variance coefficient (default 0.999)
        epsilon: Numerical stability term (default 1e-8)
    
    Example:
        >>> optimizer = Adam(lr=0.001)
        >>> optimizer.reset()
        >>> params = np.array([0.0, 0.0])
        >>> for _ in range(100):
        ...     grad = compute_gradient(params)
        ...     params = optimizer.step(params, grad)
    """
    
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        
        # Moving averages (will be initialized on first step)
        self.m = None  # First moment (mean of gradients)
        self.v = None  # Second moment (running mean of squared gradients)
        self.t = 0     # Timestep (for bias correction)
        
        self.name = f"Adam (lr={lr})"
    
    def step(self, params, grads):
        """
        Perform one Adam optimization step.
        
        Args:
            params: Current parameters
            grads: Gradients at current parameters
            
        Returns:
            Updated parameters
        """
        # Initialize on first call
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        # Increment timestep
        self.t += 1
        
        # Update biased first moment estimate (momentum)
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        
        # Update biased second moment estimate (variance)
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grads ** 2)
        
        # Compute bias-corrected estimates
        # (Important early in training when m and v are biased toward 0)
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        
        # Update parameters
        params = params - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        
        return params
    
    def reset(self):
        """Reset optimizer state"""
        self.m = None
        self.v = None
        self.t = 0

print("Adam Implementation:")
print("  m = β1×m + (1-β1)×g        (momentum)")
print("  v = β2×v + (1-β2)×g²       (variance tracking)")
print("  θ = θ - lr × m̂ / (√v̂ + ε)  (adaptive update)")
print("\nAdapts step size per-parameter based on gradient history!")

---

## Part 5: Comparing Optimizers

Let's run all three optimizers on the Rosenbrock function and see how they perform!

In [None]:
def optimize(optimizer, start_point, gradient_fn, n_steps=1000):
    """
    Run optimization and track history.
    
    Args:
        optimizer: Optimizer object with step() method
        start_point: Initial parameters
        gradient_fn: Function to compute gradients
        n_steps: Number of optimization steps
    
    Returns:
        Dictionary with trajectory and loss history
    """
    optimizer.reset()
    
    params = start_point.copy()
    history = {'params': [params.copy()], 'loss': [rosenbrock(params[0], params[1])]}
    
    for _ in range(n_steps):
        grads = gradient_fn(params[0], params[1])
        params = optimizer.step(params, grads)
        
        loss = rosenbrock(params[0], params[1])
        history['params'].append(params.copy())
        history['loss'].append(loss)
        
        # Early stopping if converged
        if loss < 1e-10:
            break
    
    return history

# Starting point (in the "hard" region of the valley)
start_point = np.array([-1.0, 1.0])

# Create optimizers with tuned learning rates
optimizers = [
    SGD(lr=0.001),                    # Small lr to avoid divergence
    SGDMomentum(lr=0.001, momentum=0.9),
    Adam(lr=0.1),                      # Adam can handle larger lr
]

# Run optimization
n_steps = 5000
histories = {}

print("Running optimizations...")
print(f"Start: ({start_point[0]}, {start_point[1]})")
print(f"Goal:  (1.0, 1.0)")
print(f"Steps: {n_steps}")
print("=" * 50)

for opt in optimizers:
    histories[opt.name] = optimize(opt, start_point, rosenbrock_gradient, n_steps)
    final_loss = histories[opt.name]['loss'][-1]
    final_params = histories[opt.name]['params'][-1]
    steps_taken = len(histories[opt.name]['loss']) - 1
    
    print(f"\n{opt.name}:")
    print(f"  Final position: ({final_params[0]:.6f}, {final_params[1]:.6f})")
    print(f"  Final loss: {final_loss:.6e}")
    print(f"  Steps taken: {steps_taken}")

In [None]:
# Visualize the results!

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Colors for each optimizer
colors = ['red', 'blue', 'green']

# --- Left plot: Trajectories on contour ---
levels = np.logspace(-1, 3, 30)
axes[0].contour(X, Y, Z, levels=levels, cmap='Greys', alpha=0.5)

for (name, hist), color in zip(histories.items(), colors):
    path = np.array(hist['params'])
    # Plot every nth point to avoid clutter
    step = max(1, len(path) // 200)
    axes[0].plot(path[::step, 0], path[::step, 1], '-', color=color, 
                linewidth=1.5, alpha=0.7, label=name)
    axes[0].scatter(path[0, 0], path[0, 1], color=color, s=100, 
                   marker='o', edgecolors='black', zorder=5)
    axes[0].scatter(path[-1, 0], path[-1, 1], color=color, s=100, 
                   marker='*', edgecolors='black', zorder=5)

axes[0].scatter([1], [1], color='gold', s=200, marker='*', 
               edgecolors='black', linewidth=2, label='Optimum', zorder=10)
axes[0].set_xlabel('x', fontsize=12)
axes[0].set_ylabel('y', fontsize=12)
axes[0].set_title('Optimization Trajectories', fontsize=14)
axes[0].legend(loc='upper left')
axes[0].set_xlim(-1.5, 1.5)
axes[0].set_ylim(-0.5, 2)

# --- Right plot: Loss curves ---
for (name, hist), color in zip(histories.items(), colors):
    losses = hist['loss']
    axes[1].semilogy(losses, color=color, linewidth=2, label=name, alpha=0.8)

axes[1].axhline(y=1e-6, color='gray', linestyle='--', label='Target: 1e-6')
axes[1].set_xlabel('Step', fontsize=12)
axes[1].set_ylabel('Loss (log scale)', fontsize=12)
axes[1].set_title('Convergence Comparison', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Key Observations:")
print("  - SGD: Slow, struggles in the curved valley")
print("  - Momentum: Faster, but may overshoot")
print("  - Adam: Adapts to the landscape, converges reliably")

### 🔍 What Just Happened?

**Left plot (Trajectories):**
- Shows the path each optimizer takes through parameter space
- Notice how SGD zigzags, while Adam takes a more direct path
- Circles mark each optimizer's start point; stars mark its end point (the gold star is the optimum)

**Right plot (Loss curves):**
- Y-axis is log scale - lower is better
- Adam typically converges fastest
- Momentum helps SGD avoid getting stuck

**Key insight:** Adam adapts its step size, which helps in the curved valley!
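
To turn the loss curves into a single number per optimizer, one option is to report the first step at which the loss drops below the target line. The helper below is a minimal sketch: `steps_to_target` is a hypothetical name, and `toy_losses` is a synthetic curve that halves every step, standing in for a real `hist['loss']` list.

```python
def steps_to_target(losses, target=1e-6):
    """Return the first step index whose loss is below target, or None."""
    for step, loss in enumerate(losses):
        if loss < target:
            return step
    return None

# Synthetic loss curve (halves every step), used only for illustration
toy_losses = [0.5 ** k for k in range(40)]

print("steps to reach 1e-6:", steps_to_target(toy_losses))
print("never reached:      ", steps_to_target([1.0, 0.5]))
```

In the lab itself, the same function could be applied to each entry of `histories`, for example `steps_to_target(histories[name]['loss'])`.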

---

## Part 6: Why Bias Correction Matters in Adam

Both `m` and `v` start at 0, so early in training the moving averages underestimate the true gradient statistics (they are biased toward 0).
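
The size of this bias can be computed directly. Assuming a constant gradient $g$ (a simplification for illustration) and ignoring $\epsilon$, the uncorrected averages are $m_t = (1 - \beta_1^t)\,g$ and $v_t = (1 - \beta_2^t)\,g^2$. Because $\beta_2$ is much closer to 1 than $\beta_1$, `v` is shrunk far more than `m`, and the uncorrected ratio $m_t / \sqrt{v_t}$ ends up larger than the corrected value of 1. The sketch below tabulates these factors.

```python
beta1, beta2 = 0.9, 0.999
ratios = {}

print(" t    m_t/g    v_t/g^2   uncorrected step / corrected step")
for t in [1, 10, 100, 1000]:
    m_factor = 1 - beta1 ** t            # how much of g has reached m
    v_factor = 1 - beta2 ** t            # how much of g^2 has reached v
    ratios[t] = m_factor / v_factor ** 0.5
    print(f"{t:4d}  {m_factor:7.4f}  {v_factor:8.5f}   {ratios[t]:6.3f}")
```

Under these assumptions the very first uncorrected step is about 3.16 times larger than the bias-corrected one, and the gap only closes after on the order of $1 / (1 - \beta_2)$ steps.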

Let's see what happens without bias correction:

In [None]:
class AdamNoBiasCorrection:
    """Adam WITHOUT bias correction (for demonstration)"""
    
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.name = "Adam (no bias correction)"
    
    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grads ** 2)
        
        # No bias correction!
        return params - self.lr * self.m / (np.sqrt(self.v) + self.epsilon)
    
    def reset(self):
        self.m = None
        self.v = None

# Compare with and without bias correction
adam_with = Adam(lr=0.1)
adam_without = AdamNoBiasCorrection(lr=0.1)

hist_with = optimize(adam_with, start_point, rosenbrock_gradient, 1000)
hist_without = optimize(adam_without, start_point, rosenbrock_gradient, 1000)

# Plot comparison
plt.figure(figsize=(10, 5))
plt.semilogy(hist_with['loss'][:200], 'b-', linewidth=2, label='Adam (with bias correction)')
plt.semilogy(hist_without['loss'][:200], 'r--', linewidth=2, label='Adam (without bias correction)')
plt.xlabel('Step', fontsize=12)
plt.ylabel('Loss (log scale)', fontsize=12)
plt.title('Effect of Bias Correction in Adam', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

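print("\n📊 Without bias correction:")
print("  - m and v are both underestimated early in training")
print("  - v is shrunk more than m, so the first steps are too large (about 3x at t=1)")
print("  - Bias correction rescales both, so the first step has size close to lr")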

---

## ⚠️ Common Mistakes

### Mistake 1: Learning Rate Too High

```python
# ❌ Wrong: Learning rate causes divergence
optimizer = SGD(lr=1.0)  # Rosenbrock has large gradients!

# ✅ Right: Start small and tune
optimizer = SGD(lr=0.001)  # Safe starting point
```

**Why:** Large learning rates can overshoot the minimum and cause the loss to explode.
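
The threshold can be made concrete on a one-dimensional quadratic $f(\theta) = \tfrac{1}{2} c\,\theta^2$, where each SGD step multiplies $\theta$ by $(1 - \eta c)$. The iterates shrink only if $|1 - \eta c| < 1$, that is, if $\eta < 2 / c$. The sketch below uses a hypothetical curvature `c = 100` (so the limit is $\eta = 0.02$); `run_sgd` is an illustrative helper, not part of the lab code.

```python
curvature = 100.0  # hypothetical curvature c; stability limit is lr < 2 / c = 0.02

def run_sgd(lr, steps=50, theta0=1.0):
    """Run plain gradient descent on f(theta) = 0.5 * c * theta^2."""
    theta = theta0
    for _ in range(steps):
        grad = curvature * theta    # f'(theta) = c * theta
        theta = theta - lr * grad   # theta <- (1 - lr * c) * theta
    return theta

stable = run_sgd(lr=0.001)    # factor 0.9 per step: converges
unstable = run_sgd(lr=0.03)   # factor -2 per step: blows up

print(f"lr=0.001 -> theta after 50 steps: {stable:.6f}")
print(f"lr=0.03  -> theta after 50 steps: {unstable:.3e}")
```

On the Rosenbrock function the curvature varies from place to place, so the safe learning rate is set by the steepest region the trajectory visits.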

### Mistake 2: Forgetting to Reset State

```python
# ❌ Wrong: Reusing optimizer without reset
opt = Adam(lr=0.001)
train_model_1(opt)  # First training
train_model_2(opt)  # Oops! Still has state from model 1!

# ✅ Right: Reset between uses
opt = Adam(lr=0.001)
train_model_1(opt)
opt.reset()  # Clear momentum/variance
train_model_2(opt)
```
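
To see why stale state matters, the sketch below applies one momentum update twice: once from a fresh velocity of zero, and once from a leftover velocity. All values (`stale_velocity`, the gradient, the hyperparameters) are hypothetical and chosen so the effect is easy to read: the leftover velocity outweighs the new gradient and pushes the parameter in the wrong direction.

```python
lr, beta = 0.01, 0.9
stale_velocity = 5.0   # hypothetical leftover from a previous training run
grad = -1.0            # the new problem's gradient says "increase theta"

def momentum_update(velocity, grad):
    """One SGD+momentum update; returns (new velocity, parameter change)."""
    velocity = beta * velocity + grad
    return velocity, -lr * velocity

_, delta_fresh = momentum_update(0.0, grad)
_, delta_stale = momentum_update(stale_velocity, grad)

print(f"fresh state: theta changes by {delta_fresh:+.4f}")
print(f"stale state: theta changes by {delta_stale:+.4f}")
```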

### Mistake 3: Wrong Epsilon Value

```python
# ❌ Wrong: epsilon too large changes behavior
optimizer = Adam(lr=0.001, epsilon=1.0)  # Defeats adaptive learning!

# ✅ Right: Keep epsilon small
optimizer = Adam(lr=0.001, epsilon=1e-8)  # Standard value
```

**Why:** epsilon is just for numerical stability; making it large changes the algorithm's behavior.
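
A quick calculation shows the effect. On the first step Adam's normalized direction reduces to $g / (|g| + \epsilon)$. With a tiny $\epsilon$ this is close to $\pm 1$ for any gradient; with $\epsilon = 1$ and small gradients it is close to $g$ itself, so the update degenerates into an ordinary SGD-style step scaled by the gradient. The helper name `normalized_step` and the sample gradients are assumptions for illustration.

```python
import numpy as np

def normalized_step(g, eps):
    """First-step Adam direction m_hat / (sqrt(v_hat) + eps) = g / (|g| + eps)."""
    return g / (np.abs(g) + eps)

for g in [1e-3, 1e-1]:
    small = normalized_step(g, eps=1e-8)
    large = normalized_step(g, eps=1.0)
    print(f"g = {g:.0e}:  eps=1e-8 -> {small:.4f}   eps=1.0 -> {large:.6f}")
```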

---

## ✋ Try It Yourself

### Exercise 1: Implement RMSprop

RMSprop is like Adam but without the momentum term:

$$v_t = \beta \cdot v_{t-1} + (1-\beta) \cdot g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot g_t$$

<details>
<summary>💡 Hint</summary>

Start from the Adam implementation and:
1. Remove the first moment (`m`) calculations
2. Remove bias correction (RMSprop doesn't use it)
3. Divide the gradient by sqrt(v) + epsilon
</details>

In [None]:
# YOUR CODE HERE: Implement RMSprop

class RMSprop:
    """
    RMSprop optimizer.
    
    Update rules:
        v = β × v + (1-β) × g²
        θ = θ - lr × g / (√v + ε)
    """
    
    def __init__(self, lr=0.01, beta=0.9, epsilon=1e-8):
        self.lr = lr
        self.beta = beta
        self.epsilon = epsilon
        self.v = None
        self.name = f"RMSprop (lr={lr})"
    
    def step(self, params, grads):
        # TODO: Implement RMSprop step
        # 1. Initialize v if None (use np.zeros_like)
        # 2. Update v: v = beta * v + (1-beta) * grads**2
        # 3. Update params: params = params - lr * grads / (sqrt(v) + epsilon)
        raise NotImplementedError("Implement the RMSprop step method")
    
    def reset(self):
        self.v = None

# Test your implementation (uncomment after implementing)
# rmsprop = RMSprop(lr=0.01)
# hist_rmsprop = optimize(rmsprop, start_point, rosenbrock_gradient, 5000)
# print(f"RMSprop final loss: {hist_rmsprop['loss'][-1]:.6e}")

### Exercise 2: AdamW (Adam with Weight Decay)

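AdamW applies weight decay directly to the parameters, SEPARATELY from the gradient-based update (unlike L2 regularization, where the penalty is added to the gradient and then rescaled by Adam):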

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \cdot \lambda \cdot \theta_t$$

Where $\lambda$ is the weight decay coefficient.
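
The difference from L2 regularization shows up already on the first step. The sketch below compares the two for a single scalar parameter, using the fact that Adam's first normalized direction is $g / (|g| + \epsilon)$. The values of `theta`, `g`, `lr`, and `wd` are hypothetical, and `first_adam_direction` is an illustrative helper rather than part of the lab code.

```python
import numpy as np

def first_adam_direction(g, eps=1e-8):
    """Bias-corrected Adam direction at t = 1: g / (|g| + eps)."""
    return g / (np.abs(g) + eps)

theta, g = 10.0, 0.1   # large weight, small loss gradient (illustrative)
lr, wd = 0.1, 0.01     # learning rate eta and weight decay lambda

# L2 regularization: the penalty gradient wd * theta is added to g,
# then the sum is normalized, so the size of the penalty is lost.
theta_l2 = theta - lr * first_adam_direction(g + wd * theta)

# Decoupled weight decay (AdamW): the decay term bypasses the normalization.
theta_adamw = theta - lr * first_adam_direction(g) - lr * wd * theta

print(f"Adam + L2 penalty: theta = {theta_l2:.4f}")
print(f"AdamW:             theta = {theta_adamw:.4f}")
```

With the L2 penalty the step is still just `lr` in size; with decoupled decay the weight shrinks by an additional `lr * wd * theta`, proportional to the weight itself.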

<details>
<summary>💡 Hint</summary>

Subtract one extra term in the Adam step, computed from the current (pre-update) `params` so that it matches the formula above:
```python
decay = self.lr * self.weight_decay * params  # Add this! Uses θ_t
params = params - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon) - decay
```
</details>

In [None]:
# YOUR CODE HERE: Implement AdamW

class AdamW:
    """Adam with decoupled weight decay (AdamW)"""
    
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.weight_decay = weight_decay
        self.m = None
        self.v = None
        self.t = 0
        self.name = f"AdamW (lr={lr}, wd={weight_decay})"
    
    def step(self, params, grads):
        # TODO: Implement AdamW step
        # 1. Initialize m and v if None
        # 2. Increment timestep t
        # 3. Update m and v (same as Adam)
        # 4. Compute bias-corrected m_hat and v_hat
        # 5. Compute the weight decay term from the current params: lr * weight_decay * params
        # 6. Update params: subtract both the Adam step and the weight decay term
        raise NotImplementedError("Implement the AdamW step method")
    
    def reset(self):
        self.m = None
        self.v = None
        self.t = 0

# Test AdamW (uncomment after implementing)
# adamw = AdamW(lr=0.1, weight_decay=0.01)
# hist_adamw = optimize(adamw, start_point, rosenbrock_gradient, 5000)
# print(f"AdamW final loss: {hist_adamw['loss'][-1]:.6e}")

---

## 🎉 Checkpoint

You've learned:

- ✅ **SGD**: Simple gradient descent - just follow the negative gradient
- ✅ **Momentum**: Accumulate velocity to overcome oscillations
- ✅ **Adam**: Adaptive learning rates + momentum for robust training
- ✅ **Bias correction**: Why it matters early in training
- ✅ How to compare optimizers visually

**Key insight:** Modern optimizers like Adam work well because they adapt to the loss landscape!

---

## 📖 Further Reading

- [Why Momentum Really Works](https://distill.pub/2017/momentum/) - Beautiful visualizations
- [Adam Paper](https://arxiv.org/abs/1412.6980) - Original Adam paper
- [AdamW Paper](https://arxiv.org/abs/1711.05101) - Decoupled weight decay
- [An overview of gradient descent optimization](https://ruder.io/optimizing-gradient-descent/) - Comprehensive survey

---

## 🧹 Cleanup

In [None]:
import gc
gc.collect()

print("✅ Cleanup complete!")
print("\n➡️  Next: Lab 1.4.3 - Loss Landscape Visualization")