# Adam: Adaptive Moment Estimation

## 🎯 What This Notebook Covers

**Adam** (Adaptive Moment Estimation) combines the best of Momentum and RMSProp. In this notebook, we explore:

1. ✅ **Motivation** - Why combine Momentum + RMSProp?
2. ✅ **Mathematical Formulation** - How Adam works
3. ✅ **Implementation** - Adam from scratch
4. ✅ **Bias Correction** - Why it's crucial for Adam
5. ✅ **Performance Comparison** - Adam vs all other optimizers

### Why This Matters

- **Widely Used**: The default optimizer in many deep learning projects ⭐
- **Best of Both Worlds**: Momentum + adaptive learning rates 🎯
- **Robust**: Works well with minimal tuning 🔧

Let's master Adam! 🚀

---

## 1. Setup and Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from IPython.display import display, Markdown

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

## 2. Motivation: Best of Both Worlds

### Recap: What We've Learned

**Momentum** (from 7.4):
- Accumulates velocity: $v_t = \beta_1 v_{t-1} + (1-\beta_1) g_t$
- Smooths updates, accelerates convergence
- ✅ Good: Fast convergence
- ❌ Problem: Same learning rate for all parameters

**RMSProp** (from 7.6):
- Adapts learning rate: $E[g^2]_t = \beta_2 E[g^2]_{t-1} + (1-\beta_2) g_t^2$
- Different LR per parameter
- ✅ Good: Adaptive learning rates
- ❌ Problem: No momentum

### Adam's Idea: Combine Both!

```
Adam = Momentum + RMSProp

From Momentum:
    • First moment (mean) of gradients
    • Provides direction and acceleration
    
From RMSProp:
    • Second moment (uncentered variance) of gradients
    • Provides adaptive learning rates
    
Result:
    • Fast convergence (momentum)
    • Adaptive per-parameter LR (RMSProp)
    • Best of both worlds! ⭐
```
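
To make the recap concrete, here is a minimal sketch of the two building blocks applied to a toy gradient whose components differ by a factor of 100. The function names `momentum_step` and `rmsprop_step` and the toy values are illustrative assumptions, not code from the earlier notebooks.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One momentum update: smooth the gradient, then take a step."""
    v = beta * v + (1 - beta) * grad
    return theta - lr * v, v

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: scale each coordinate by its running RMS gradient."""
    s = beta * s + (1 - beta) * grad**2
    return theta - lr * grad / np.sqrt(s + eps), s

# Toy gradient: the first coordinate is 100x larger than the second.
grad = np.array([10.0, 0.1])
theta0 = np.zeros(2)

theta_mom, _ = momentum_step(theta0, np.zeros(2), grad)
theta_rms, _ = rmsprop_step(theta0, np.zeros(2), grad)

print("Momentum step:", theta_mom)  # step sizes differ by ~100x
print("RMSProp step: ", theta_rms)  # step sizes are nearly equal
```

Momentum keeps the 100x imbalance between coordinates, while RMSProp's per-parameter scaling removes it. Adam applies both ideas in the same update.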

---

## 3. Mathematical Formulation

### Adam Update Rule

$$
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(first moment: momentum)} \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(second moment: RMSProp)} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \quad \text{(bias correction for first moment)} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction for second moment)} \\
\theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t \quad \text{(parameter update)}
\end{align}
$$

Where:
- $g_t = \nabla L(\theta_t)$ = gradient at time $t$
- $m_t$ = first moment estimate (momentum)
- $v_t$ = second moment estimate (RMSProp)
- $\beta_1$ = decay rate for first moment (typically 0.9)
- $\beta_2$ = decay rate for second moment (typically 0.999)
- $\alpha$ = learning rate (typically 0.001)
- $\epsilon$ = small constant for numerical stability (typically $10^{-8}$)
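
The five equations above map almost line for line onto code. The following is a minimal sketch for a single parameter array, assuming a toy quadratic loss. The function name `adam_step` and the loss are illustrative choices, separate from the network class defined later in this notebook.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment
    v = beta2 * v + (1 - beta2) * grad**2     # second moment
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, grad=theta, t=t, lr=0.01)

print(theta)  # both coordinates should end up close to 0
```

The state `m` and `v` is returned rather than mutated, so the same function can be applied independently to each weight matrix and bias vector.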

### Key Components

1. **First Moment ($m_t$)**: Like momentum, provides direction
2. **Second Moment ($v_t$)**: Like RMSProp, adapts learning rate
3. **Bias Correction**: Offsets the pull toward zero caused by initializing $m_0 = v_0 = 0$; it matters most in early iterations (implemented in Section 5)
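
A small worked example shows why the correction matters. Assume, purely for illustration, a constant gradient $g = 1$, so the true first and second moments are both 1. The raw estimates start far below that value, and the corrected ones recover it exactly.

```python
beta1, beta2 = 0.9, 0.999
g = 1.0          # constant gradient (illustrative assumption)
m, v = 0.0, 0.0  # zero initialization, as in Adam

for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    print(f"t={t}: m={m:.4f}  m_hat={m_hat:.4f}  v={v:.6f}  v_hat={v_hat:.4f}")
```

At $t = 1$ the raw ratio $m_1 / \sqrt{v_1}$ is $0.1 / \sqrt{0.001} \approx 3.16$ instead of 1, so skipping the correction would make the first steps roughly three times larger than intended.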

### Default Hyperparameters

The original Adam paper recommends:
- $\alpha = 0.001$ (learning rate)
- $\beta_1 = 0.9$ (first moment decay)
- $\beta_2 = 0.999$ (second moment decay)
- $\epsilon = 10^{-8}$ (numerical stability)

These defaults are a reasonable starting point for most problems. 🎯
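
One way to read the two decay rates is as memory lengths. An exponential moving average with decay $\beta$ puts weight $(1-\beta)\beta^k$ on the gradient from $k$ steps ago, so roughly the last $1/(1-\beta)$ steps dominate. The sketch below computes this rule-of-thumb window for the defaults; it ignores bias correction and is meant as intuition only.

```python
beta1, beta2 = 0.9, 0.999

for name, beta in [("beta1", beta1), ("beta2", beta2)]:
    horizon = 1 / (1 - beta)  # rule-of-thumb averaging window
    # Total EMA weight on the most recent `horizon` gradients: 1 - beta**horizon
    recent_weight = 1 - beta ** round(horizon)
    print(f"{name}={beta}: horizon ~ {horizon:.0f} steps, "
          f"weight on that window = {recent_weight:.3f}")
```

With the defaults, the first moment tracks roughly the last 10 gradients while the second moment averages over roughly the last 1000, which is why the step scale changes much more slowly than the step direction.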

---

## 4. Generate Dataset

In [None]:
def generate_spiral_data(n_samples=300, noise=0.1):
    """
    Generate spiral dataset for binary classification.
    
    Returns:
    - X: Features (n_x, m)
    - Y: Labels (1, m)
    """
    np.random.seed(42)
    m = n_samples
    
    # Create spiral
    theta = np.linspace(0, 4*np.pi, m//2)
    r = np.linspace(0.5, 2, m//2)
    
    # Class 0: spiral
    X_class0 = np.vstack([r * np.cos(theta), r * np.sin(theta)])
    Y_class0 = np.zeros((1, m//2))
    
    # Class 1: spiral (rotated)
    X_class1 = np.vstack([r * np.cos(theta + np.pi), r * np.sin(theta + np.pi)])
    Y_class1 = np.ones((1, m//2))
    
    # Combine
    X = np.hstack([X_class0, X_class1])
    Y = np.hstack([Y_class0, Y_class1])
    
    # Add noise
    X += np.random.randn(*X.shape) * noise
    
    # Shuffle
    indices = np.random.permutation(m)
    X = X[:, indices]
    Y = Y[:, indices]
    
    return X, Y

# Generate data
X, Y = generate_spiral_data(n_samples=300, noise=0.1)

print(f"Dataset shape: X={X.shape}, Y={Y.shape}")
print(f"Number of samples: {X.shape[1]}")
print(f"Number of features: {X.shape[0]}")

## 5. Neural Network with Adam

In [None]:
def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def relu(z):
    """ReLU activation function."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU."""
    return (z > 0).astype(float)

print("✅ Activation functions defined!")

In [None]:
class Adam:
    """
    Neural network with Adam optimizer.
    
    Architecture: Input (2) → Hidden (10, ReLU) → Output (1, Sigmoid)
    """
    
    def __init__(self, n_x=2, n_h=10, n_y=1, learning_rate=0.001, 
                 beta1=0.9, beta2=0.999, epsilon=1e-8, random_seed=42):
        """
        Initialize neural network with Adam optimizer.
        
        Parameters:
        - learning_rate: Learning rate (α), typically 0.001
        - beta1: Decay rate for first moment (momentum), typically 0.9
        - beta2: Decay rate for second moment (RMSProp), typically 0.999
        - epsilon: Small constant for numerical stability
        """
        np.random.seed(random_seed)
        
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        
        # Initialize parameters
        self.W1 = np.random.randn(n_h, n_x) * 0.1
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(n_y, n_h) * 0.1
        self.b2 = np.zeros((n_y, 1))
        
        # Initialize first moments (momentum)
        self.m_dW1 = np.zeros_like(self.W1)
        self.m_db1 = np.zeros_like(self.b1)
        self.m_dW2 = np.zeros_like(self.W2)
        self.m_db2 = np.zeros_like(self.b2)
        
        # Initialize second moments (RMSProp)
        self.v_dW1 = np.zeros_like(self.W1)
        self.v_db1 = np.zeros_like(self.b1)
        self.v_dW2 = np.zeros_like(self.W2)
        self.v_db2 = np.zeros_like(self.b2)
        
        # Training history
        self.losses = []
        self.accuracies = []
        self.t = 0  # Time step for bias correction
    
    def forward_propagation(self, X):
        """Forward propagation."""
        Z1 = self.W1 @ X + self.b1
        A1 = relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = sigmoid(Z2)
        
        cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
        return A2, cache
    
    def compute_loss(self, Y, A2):
        """Compute binary cross-entropy loss (1e-8 guards against log(0))."""
        loss = -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
        return loss
    
    def backward_propagation(self, X, Y, cache):
        """Backward propagation."""
        m = X.shape[1]
        Z1, A1, Z2, A2 = cache['Z1'], cache['A1'], cache['Z2'], cache['A2']
        
        # Backprop
        dZ2 = A2 - Y
        dW2 = (1/m) * (dZ2 @ A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        
        dA1 = self.W2.T @ dZ2
        dZ1 = dA1 * relu_derivative(Z1)
        dW1 = (1/m) * (dZ1 @ X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        
        return dW1, db1, dW2, db2
    
    def update_parameters_adam(self, dW1, db1, dW2, db2):
        """
        Update parameters using Adam optimizer.
        
        Combines momentum (first moment) and RMSProp (second moment)
        with bias correction.
        """
        self.t += 1  # Increment time step
        
        # Update first moments (momentum)
        self.m_dW1 = self.beta1 * self.m_dW1 + (1 - self.beta1) * dW1
        self.m_db1 = self.beta1 * self.m_db1 + (1 - self.beta1) * db1
        self.m_dW2 = self.beta1 * self.m_dW2 + (1 - self.beta1) * dW2
        self.m_db2 = self.beta1 * self.m_db2 + (1 - self.beta1) * db2
        
        # Update second moments (RMSProp)
        self.v_dW1 = self.beta2 * self.v_dW1 + (1 - self.beta2) * dW1**2
        self.v_db1 = self.beta2 * self.v_db1 + (1 - self.beta2) * db1**2
        self.v_dW2 = self.beta2 * self.v_dW2 + (1 - self.beta2) * dW2**2
        self.v_db2 = self.beta2 * self.v_db2 + (1 - self.beta2) * db2**2
        
        # Bias correction
        m_dW1_corrected = self.m_dW1 / (1 - self.beta1**self.t)
        m_db1_corrected = self.m_db1 / (1 - self.beta1**self.t)
        m_dW2_corrected = self.m_dW2 / (1 - self.beta1**self.t)
        m_db2_corrected = self.m_db2 / (1 - self.beta1**self.t)
        
        v_dW1_corrected = self.v_dW1 / (1 - self.beta2**self.t)
        v_db1_corrected = self.v_db1 / (1 - self.beta2**self.t)
        v_dW2_corrected = self.v_dW2 / (1 - self.beta2**self.t)
        v_db2_corrected = self.v_db2 / (1 - self.beta2**self.t)
        
        # Update parameters
        self.W1 -= self.lr * m_dW1_corrected / (np.sqrt(v_dW1_corrected) + self.epsilon)
        self.b1 -= self.lr * m_db1_corrected / (np.sqrt(v_db1_corrected) + self.epsilon)
        self.W2 -= self.lr * m_dW2_corrected / (np.sqrt(v_dW2_corrected) + self.epsilon)
        self.b2 -= self.lr * m_db2_corrected / (np.sqrt(v_db2_corrected) + self.epsilon)
    
    def compute_accuracy(self, X, Y):
        """Compute accuracy."""
        A2, _ = self.forward_propagation(X)
        predictions = (A2 > 0.5).astype(int)
        accuracy = np.mean(predictions == Y)
        return accuracy
    
    def fit(self, X, Y, epochs=1000, verbose=False):
        """Train the network with Adam optimizer."""
        for epoch in range(epochs):
            # Forward propagation
            A2, cache = self.forward_propagation(X)
            
            # Compute loss
            loss = self.compute_loss(Y, A2)
            self.losses.append(loss)
            
            # Compute accuracy
            accuracy = self.compute_accuracy(X, Y)
            self.accuracies.append(accuracy)
            
            # Backward propagation
            dW1, db1, dW2, db2 = self.backward_propagation(X, Y, cache)
            
            # Update parameters with Adam
            self.update_parameters_adam(dW1, db1, dW2, db2)
            
            # Print progress
            if verbose and (epoch + 1) % 200 == 0:
                print(f"Epoch {epoch+1:4d}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")
        
        if verbose:
            print("\n✅ Training Complete!")
            print(f"   Final Loss: {self.losses[-1]:.4f}")
            print(f"   Final Accuracy: {self.accuracies[-1]:.4f}")
        
        return self

print("✅ Adam class defined!")

## 6. Comparison: Adam vs All Optimizers

In [None]:
# We need all previous optimizers for comparison
class VanillaSGD:
    """Vanilla SGD."""
    def __init__(self, n_x=2, n_h=10, n_y=1, learning_rate=0.01, random_seed=42):
        np.random.seed(random_seed)
        self.lr = learning_rate
        self.W1 = np.random.randn(n_h, n_x) * 0.1
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(n_y, n_h) * 0.1
        self.b2 = np.zeros((n_y, 1))
        self.losses = []
        self.accuracies = []
    
    def forward_propagation(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = sigmoid(Z2)
        return A2, {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    def compute_loss(self, Y, A2):
        return -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
    
    def backward_propagation(self, X, Y, cache):
        m = X.shape[1]
        Z1, A1, A2 = cache['Z1'], cache['A1'], cache['A2']
        dZ2 = A2 - Y
        dW2 = (1/m) * (dZ2 @ A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        dA1 = self.W2.T @ dZ2
        dZ1 = dA1 * relu_derivative(Z1)
        dW1 = (1/m) * (dZ1 @ X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        return dW1, db1, dW2, db2
    
    def compute_accuracy(self, X, Y):
        A2, _ = self.forward_propagation(X)
        return np.mean((A2 > 0.5).astype(int) == Y)
    
    def fit(self, X, Y, epochs=1000, verbose=False):
        for epoch in range(epochs):
            A2, cache = self.forward_propagation(X)
            self.losses.append(self.compute_loss(Y, A2))
            self.accuracies.append(self.compute_accuracy(X, Y))
            dW1, db1, dW2, db2 = self.backward_propagation(X, Y, cache)
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * db1
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * db2
        return self

class Momentum:
    """SGD with Momentum."""
    def __init__(self, n_x=2, n_h=10, n_y=1, learning_rate=0.01, beta=0.9, random_seed=42):
        np.random.seed(random_seed)
        self.lr = learning_rate
        self.beta = beta
        self.W1 = np.random.randn(n_h, n_x) * 0.1
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(n_y, n_h) * 0.1
        self.b2 = np.zeros((n_y, 1))
        self.v_dW1 = np.zeros_like(self.W1)
        self.v_db1 = np.zeros_like(self.b1)
        self.v_dW2 = np.zeros_like(self.W2)
        self.v_db2 = np.zeros_like(self.b2)
        self.losses = []
        self.accuracies = []
    
    def forward_propagation(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = sigmoid(Z2)
        return A2, {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    def compute_loss(self, Y, A2):
        return -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
    
    def backward_propagation(self, X, Y, cache):
        m = X.shape[1]
        Z1, A1, A2 = cache['Z1'], cache['A1'], cache['A2']
        dZ2 = A2 - Y
        dW2 = (1/m) * (dZ2 @ A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        dA1 = self.W2.T @ dZ2
        dZ1 = dA1 * relu_derivative(Z1)
        dW1 = (1/m) * (dZ1 @ X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        return dW1, db1, dW2, db2
    
    def compute_accuracy(self, X, Y):
        A2, _ = self.forward_propagation(X)
        return np.mean((A2 > 0.5).astype(int) == Y)
    
    def fit(self, X, Y, epochs=1000, verbose=False):
        for epoch in range(epochs):
            A2, cache = self.forward_propagation(X)
            self.losses.append(self.compute_loss(Y, A2))
            self.accuracies.append(self.compute_accuracy(X, Y))
            dW1, db1, dW2, db2 = self.backward_propagation(X, Y, cache)
            self.v_dW1 = self.beta * self.v_dW1 + (1 - self.beta) * dW1
            self.v_db1 = self.beta * self.v_db1 + (1 - self.beta) * db1
            self.v_dW2 = self.beta * self.v_dW2 + (1 - self.beta) * dW2
            self.v_db2 = self.beta * self.v_db2 + (1 - self.beta) * db2
            self.W1 -= self.lr * self.v_dW1
            self.b1 -= self.lr * self.v_db1
            self.W2 -= self.lr * self.v_dW2
            self.b2 -= self.lr * self.v_db2
        return self

class RMSProp:
    """RMSProp optimizer."""
    def __init__(self, n_x=2, n_h=10, n_y=1, learning_rate=0.001, beta=0.9, epsilon=1e-8, random_seed=42):
        np.random.seed(random_seed)
        self.lr = learning_rate
        self.beta = beta
        self.epsilon = epsilon
        self.W1 = np.random.randn(n_h, n_x) * 0.1
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(n_y, n_h) * 0.1
        self.b2 = np.zeros((n_y, 1))
        self.E_dW1 = np.zeros_like(self.W1)
        self.E_db1 = np.zeros_like(self.b1)
        self.E_dW2 = np.zeros_like(self.W2)
        self.E_db2 = np.zeros_like(self.b2)
        self.losses = []
        self.accuracies = []
    
    def forward_propagation(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = sigmoid(Z2)
        return A2, {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    def compute_loss(self, Y, A2):
        return -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
    
    def backward_propagation(self, X, Y, cache):
        m = X.shape[1]
        Z1, A1, A2 = cache['Z1'], cache['A1'], cache['A2']
        dZ2 = A2 - Y
        dW2 = (1/m) * (dZ2 @ A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        dA1 = self.W2.T @ dZ2
        dZ1 = dA1 * relu_derivative(Z1)
        dW1 = (1/m) * (dZ1 @ X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        return dW1, db1, dW2, db2
    
    def compute_accuracy(self, X, Y):
        A2, _ = self.forward_propagation(X)
        return np.mean((A2 > 0.5).astype(int) == Y)
    
    def fit(self, X, Y, epochs=1000, verbose=False):
        for epoch in range(epochs):
            A2, cache = self.forward_propagation(X)
            self.losses.append(self.compute_loss(Y, A2))
            self.accuracies.append(self.compute_accuracy(X, Y))
            dW1, db1, dW2, db2 = self.backward_propagation(X, Y, cache)
            self.E_dW1 = self.beta * self.E_dW1 + (1 - self.beta) * dW1**2
            self.E_db1 = self.beta * self.E_db1 + (1 - self.beta) * db1**2
            self.E_dW2 = self.beta * self.E_dW2 + (1 - self.beta) * dW2**2
            self.E_db2 = self.beta * self.E_db2 + (1 - self.beta) * db2**2
            self.W1 -= (self.lr / np.sqrt(self.E_dW1 + self.epsilon)) * dW1
            self.b1 -= (self.lr / np.sqrt(self.E_db1 + self.epsilon)) * db1
            self.W2 -= (self.lr / np.sqrt(self.E_dW2 + self.epsilon)) * dW2
            self.b2 -= (self.lr / np.sqrt(self.E_db2 + self.epsilon)) * db2
        return self

print("✅ Comparison classes defined!")

In [None]:
# Training parameters
epochs = 2000

print("🔬 Training All Optimizers...\n")

# 1. Vanilla SGD
print("1️⃣  Training Vanilla SGD...")
model_sgd = VanillaSGD(learning_rate=0.01, random_seed=42)
model_sgd.fit(X, Y, epochs=epochs)
print(f"   Final Loss: {model_sgd.losses[-1]:.4f}")

# 2. Momentum
print("\n2️⃣  Training Momentum...")
model_momentum = Momentum(learning_rate=0.01, beta=0.9, random_seed=42)
model_momentum.fit(X, Y, epochs=epochs)
print(f"   Final Loss: {model_momentum.losses[-1]:.4f}")

# 3. RMSProp
print("\n3️⃣  Training RMSProp...")
model_rmsprop = RMSProp(learning_rate=0.001, beta=0.9, random_seed=42)
model_rmsprop.fit(X, Y, epochs=epochs)
print(f"   Final Loss: {model_rmsprop.losses[-1]:.4f}")

# 4. Adam
print("\n4️⃣  Training Adam...")
model_adam = Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, random_seed=42)
model_adam.fit(X, Y, epochs=epochs)
print(f"   Final Loss: {model_adam.losses[-1]:.4f}")

print("\n✅ All experiments complete!")

## 7. Visualize Results

In [None]:
# Plot loss curves
plt.figure(figsize=(16, 10))

plt.plot(model_sgd.losses, linewidth=2.5, label='Vanilla SGD', 
        color='#FF6B6B', alpha=0.8)
plt.plot(model_momentum.losses, linewidth=2.5, label='Momentum', 
        color='#4ECDC4', alpha=0.8)
plt.plot(model_rmsprop.losses, linewidth=2.5, label='RMSProp', 
        color='#95E1D3', alpha=0.8)
plt.plot(model_adam.losses, linewidth=3.5, label='Adam', 
        color='#F38181', alpha=0.9, linestyle='-')

plt.xlabel('Epoch', fontsize=13, fontweight='bold')
plt.ylabel('Loss', fontsize=13, fontweight='bold')
plt.title('Loss Curves: Adam vs All Optimizers', fontsize=15, fontweight='bold')
plt.legend(fontsize=12, loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 What to look for:")
print("  • Adam typically converges fastest here; confirm against the curves above")
print("  • Adam combines the benefits of Momentum + RMSProp")
print("  • The optimizers use different learning rates (0.01 for SGD/Momentum, 0.001 for RMSProp/Adam)")

## 8. Performance Comparison Table

In [None]:
# Create comparison table
import pandas as pd

comparison_data = {
    'Optimizer': ['Vanilla SGD', 'Momentum', 'RMSProp', 'Adam'],
    'Final Loss': [
        f"{model_sgd.losses[-1]:.4f}",
        f"{model_momentum.losses[-1]:.4f}",
        f"{model_rmsprop.losses[-1]:.4f}",
        f"{model_adam.losses[-1]:.4f}"
    ],
    'Final Accuracy': [
        f"{model_sgd.accuracies[-1]:.4f}",
        f"{model_momentum.accuracies[-1]:.4f}",
        f"{model_rmsprop.accuracies[-1]:.4f}",
        f"{model_adam.accuracies[-1]:.4f}"
    ],
    'Key Feature': [
        'Simple baseline',
        'Velocity accumulation',
        'Adaptive LR per parameter',
        'Momentum + Adaptive LR'
    ]
}

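df = pd.DataFrame(comparison_data)
print("\n📊 Optimizer Comparison:\n")
print(df.to_string(index=False))

# Pick the winner from the measured losses instead of hard-coding it
models = {'Vanilla SGD': model_sgd, 'Momentum': model_momentum,
          'RMSProp': model_rmsprop, 'Adam': model_adam}
winner = min(models, key=lambda name: models[name].losses[-1])
print(f"\n\n🏆 Lowest final loss: {winner}")
print(f"   Final Loss: {models[winner].losses[-1]:.4f}")
print(f"   Final Accuracy: {models[winner].accuracies[-1]:.4f}")
print("\n   Why Adam usually does well:")
print("   • Uses momentum (fast convergence)")
print("   • Uses RMSProp-style scaling (adaptive learning rates)")
print("   • Includes bias correction")
print("   • Robust with default hyperparameters")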

## 9. Summary and Key Takeaways

### What We Learned

✅ **Adam = Momentum + RMSProp**
- First moment (momentum): Direction and acceleration
- Second moment (RMSProp): Adaptive learning rates
- Bias correction: Crucial for early iterations

✅ **Mathematical Foundation**
- First moment: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
- Second moment: $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
- Bias correction: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$
- Update: $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

✅ **Advantages**
- Fast convergence (from momentum)
- Adaptive per-parameter learning rates (from RMSProp)
- Robust to hyperparameter choices
- Works well out-of-the-box
- Industry standard!

✅ **Default Hyperparameters**
- α = 0.001 (learning rate)
- β₁ = 0.9 (first moment decay)
- β₂ = 0.999 (second moment decay)
- ε = 10⁻⁸ (numerical stability)

### When to Use Adam?

**Almost Always!** Adam is the default choice for:
- Deep neural networks
- Computer vision (CNNs)
- Natural language processing (Transformers)
- Reinforcement learning
- Any complex optimization problem

**Exceptions:**
- Sometimes SGD with momentum generalizes better (requires careful tuning)
- Some research suggests SGD for final fine-tuning

### Variants of Adam

- **AdamW**: Adam with weight decay (better regularization)
- **Nadam**: Adam + Nesterov momentum
- **RAdam**: Rectified Adam (better early training)
- **AdaBelief**: Adapts to gradient predictability
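
As a rough illustration of how a variant differs from plain Adam, the sketch below shows an AdamW-style step with decoupled weight decay. The function name `adamw_step` and the default `weight_decay=0.01` are assumptions chosen for this example; treat it as a sketch of the idea, not a reference implementation.

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update with decoupled weight decay (illustrative)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Decay is applied directly to the weights, not added to the gradient,
    # so it is not rescaled by the adaptive denominator.
    theta = theta - lr * weight_decay * theta
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
theta, m, v = adamw_step(theta, np.zeros(2), np.zeros(2),
                         grad=np.zeros(2), t=1)
print(theta)  # with a zero gradient, only the decay term moves the weights
```

The only change from the plain Adam update is the extra shrink step, which pulls every weight toward zero by the same relative amount regardless of its gradient history.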

### Connection to Other Notebooks

This notebook completes the optimizer series:
- **`7_1`**: SGD basics
- **`7_2`**: Learning rate
- **`7_3`**: Learning rate decay
- **`7_4`**: Momentum
- **`7_5`**: AdaGrad
- **`7_6`**: RMSProp
- **`7_7`**: Adam (this notebook)

---

**🎓 Congratulations!** You've completed the Optimizers series and mastered Adam!

**Key Insight:** Adam combines the best of momentum and adaptive learning rates, making it the go-to optimizer for deep learning!