{ "cells": [  {   "cell_type": "markdown",   "source": [    "# Gradient Descent Variants: Momentum, Adam, and Beyond",    "",    "**Welcome back, St. Mark!** Today we explore the optimization algorithms that make modern machine learning possible. Think of gradient descent variants as different \"learning strategies\" - from the steady plodding of basic GD to the adaptive intelligence of Adam.",    "",    "We'll explore:",    "",    "1. **Vanilla Gradient Descent** - The foundation of all optimization",    "2. **Momentum** - Accelerating convergence with physics-inspired motion",    "3. **RMSProp** - Adaptive learning rates for each parameter (covered here as the second-moment term inside Adam)",    "4. **Adam** - The \"best of both worlds\" optimizer",    "5. **Comparison** - When to use each approach",    "",    "By the end, you'll understand why optimization algorithms are the \"engines\" of machine learning.",    "",    "## The Big Picture",    "",    "**Optimization Challenge:**",    "- **High-dimensional landscapes:** Loss functions with thousands of parameters",    "- **Local minima and saddle points:** Complex optimization surfaces",    "- **Computational constraints:** Need efficient convergence",    "- **Generalization:** The choice of optimizer can change how well the trained model performs on unseen data",    "",    "**Evolution of Optimizers:**",    "- **Vanilla GD:** Simple but slow, zigzags across ravines instead of moving along them",    "- **Momentum:** Adds velocity, which dampens oscillations and can carry the iterate past shallow local minima",    "- **Adaptive methods:** Adjust learning rates per parameter",    "- **Modern approaches:** Combine momentum with adaptivity",    "",    "**Key Question:** How can we efficiently navigate the complex loss landscapes of medical AI models?",    "",    "## Data Preparation: Optimization Benchmark Dataset",    "",    "We'll use a synthetic dataset to compare optimizer performance.",    "import numpy as np",    "import matplotlib.pyplot as plt",    "from sklearn.datasets import make_classification",    "from sklearn.model_selection import train_test_split",    "from 
sklearn.preprocessing import StandardScaler",    "from sklearn.metrics import accuracy_score, log_loss",    "import warnings",    "warnings.filterwarnings('ignore')",    "",    "# Create binary classification dataset",    "np.random.seed(42)",    "X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,",    "                          n_redundant=10, n_clusters_per_class=1, random_state=42)",    "",    "# Split data",    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)",    "",    "# Scale features",    "scaler = StandardScaler()",    "X_train_scaled = scaler.fit_transform(X_train)",    "X_test_scaled = scaler.transform(X_test)",    "",    "print(f\"Training set: {X_train_scaled.shape}\")",    "print(f\"Test set: {X_test_scaled.shape}\")",    "print(f\"Class distribution: {np.bincount(y_train)}\")",    "",    "# Convert to neural network format",    "def to_nn_format(X, y):",    "    \"\"\"Convert to format suitable for neural network training.\"\"\"",    "    # Add bias term as additional feature",    "    X_nn = np.c_[np.ones((X.shape[0], 1)), X]  # Add bias column",    "    y_nn = y.reshape(-1, 1)  # Make column vector",    "    return X_nn, y_nn",    "",    "X_train_nn, y_train_nn = to_nn_format(X_train_scaled, y_train)",    "X_test_nn, y_test_nn = to_nn_format(X_test_scaled, y_test)",    "",    "print(f\"Neural network format: X={X_train_nn.shape}, y={y_train_nn.shape}\")"   ],   "metadata": {}  },  {   "cell_type": "markdown",   "source": [    "**Cell Analysis:** We've prepared our optimization benchmark dataset.",    "",    "- **Binary classification:** Clean evaluation of optimizer performance",    "- **Feature scaling:** Important for stable optimization",    "- **Neural network format:** Ready for logistic regression training",    "",    "**Reflection Question:** Why is feature scaling crucial for optimization algorithms?",    "",    "## Method 1: Vanilla Gradient Descent",    "",    "**Core 
idea:** Take small steps downhill in the direction of steepest descent.",    "",    "**Mathematical foundation:** θ = θ - η∇J(θ)",    "",    "**Limitations:** Slow convergence, gets stuck in ravines, sensitive to learning rate.",    "def sigmoid(z):",    "    \"\"\"Sigmoid activation function.\"\"\"",    "    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))",    "",    "def logistic_loss(y_true, y_pred):",    "    \"\"\"Binary cross-entropy loss.\"\"\"",    "    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)",    "    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))",    "",    "def logistic_gradients(X, y, w):",    "    \"\"\"Compute gradients for logistic regression.\"\"\"",    "    n_samples = X.shape[0]",    "    y_pred = sigmoid(X @ w)",    "    errors = y_pred - y",    "    gradients = (1/n_samples) * X.T @ errors",    "    return gradients",    "",    "class VanillaGD:",    "    \"\"\"Vanilla Gradient Descent optimizer.\"\"\"",    "",    "    def __init__(self, learning_rate=0.01):",    "        self.learning_rate = learning_rate",    "",    "    def update(self, w, gradients):",    "        \"\"\"Update parameters using vanilla GD.\"\"\"",    "        return w - self.learning_rate * gradients",    "",    "# Test vanilla GD",    "print(\"Testing Vanilla Gradient Descent:\")",    "w_init = np.random.randn(X_train_nn.shape[1], 1) * 0.01",    "optimizer_vanilla = VanillaGD(learning_rate=0.1)",    "",    "losses_vanilla = []",    "accuracies_vanilla = []",    "",    "for epoch in range(100):",    "    # Forward pass",    "    y_pred_train = sigmoid(X_train_nn @ w_init)",    "",    "    # Compute loss and accuracy",    "    loss = logistic_loss(y_train_nn, y_pred_train)",    "    predictions = (y_pred_train > 0.5).astype(int)",    "    accuracy = accuracy_score(y_train_nn, predictions)",    "",    "    losses_vanilla.append(loss)",    "    accuracies_vanilla.append(accuracy)",    "",    "    # Backward pass",    "    gradients = 
logistic_gradients(X_train_nn, y_train_nn, w_init)",    "",    "    # Update parameters",    "    w_init = optimizer_vanilla.update(w_init, gradients)",    "",    "    if epoch % 20 == 0:",    "        print(f\"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}\")",    "",    "print(f\"Final: Loss = {losses_vanilla[-1]:.4f}, Accuracy = {accuracies_vanilla[-1]:.4f}\")"   ],   "metadata": {}  },  {   "cell_type": "markdown",   "source": [    "**Cell Analysis:** Vanilla GD implementation complete.",    "",    "- **Steady improvement:** Loss decreases consistently over epochs",    "- **Convergence:** Approaches optimal solution with sufficient iterations",    "- **Limitations:** May be slow for complex problems",    "",    "**Reflection Question:** Why might vanilla GD struggle with ravine-like loss surfaces?",    "",    "## Method 2: Momentum - Physics-Inspired Optimization",    "",    "**Core idea:** Add velocity to accumulate gradient direction over time.",    "",    "**Mathematical foundation:**",    "v = γv + η∇J(θ)",    "θ = θ - v",    "",    "**Benefits:** Accelerates in consistent directions, dampens oscillations.",    "class MomentumGD:",    "    \"\"\"Momentum Gradient Descent optimizer.\"\"\"",    "",    "    def __init__(self, learning_rate=0.01, momentum=0.9):",    "        self.learning_rate = learning_rate",    "        self.momentum = momentum",    "        self.velocity = None",    "",    "    def update(self, w, gradients):",    "        \"\"\"Update parameters using momentum.\"\"\"",    "        if self.velocity is None:",    "            self.velocity = np.zeros_like(w)",    "",    "        # Update velocity",    "        self.velocity = self.momentum * self.velocity + self.learning_rate * gradients",    "",    "        # Update parameters",    "        return w - self.velocity",    "",    "# Test momentum GD",    "print(\"\\nTesting Momentum Gradient Descent:\")",    "w_momentum = np.random.randn(X_train_nn.shape[1], 1) * 0.01",    
"optimizer_momentum = MomentumGD(learning_rate=0.1, momentum=0.9)",    "",    "losses_momentum = []",    "accuracies_momentum = []",    "",    "for epoch in range(100):",    "    # Forward pass",    "    y_pred_train = sigmoid(X_train_nn @ w_momentum)",    "",    "    # Compute loss and accuracy",    "    loss = logistic_loss(y_train_nn, y_pred_train)",    "    predictions = (y_pred_train > 0.5).astype(int)",    "    accuracy = accuracy_score(y_train_nn, predictions)",    "",    "    losses_momentum.append(loss)",    "    accuracies_momentum.append(accuracy)",    "",    "    # Backward pass",    "    gradients = logistic_gradients(X_train_nn, y_train_nn, w_momentum)",    "",    "    # Update parameters",    "    w_momentum = optimizer_momentum.update(w_momentum, gradients)",    "",    "    if epoch % 20 == 0:",    "        print(f\"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}\")",    "",    "print(f\"Final: Loss = {losses_momentum[-1]:.4f}, Accuracy = {accuracies_momentum[-1]:.4f}\")"   ],   "metadata": {}  },  {   "cell_type": "markdown",   "source": [    "**Cell Analysis:** Momentum GD implementation complete.",    "",    "- **Faster convergence:** Often reaches better solutions in fewer iterations",    "- **Smoother optimization:** Velocity accumulation reduces oscillations",    "- **Medical analogy:** Like momentum in learning - builds expertise over time",    "",    "**Reflection Question:** How does momentum help escape local minima?",    "",    "## Method 3: Adam - Adaptive Moment Estimation",    "",    "**Core idea:** Combine momentum with adaptive learning rates.",    "",    "**Mathematical foundation:**",    "m_t = β₁m_{t-1} + (1-β₁)∇J(θ)",    "v_t = β₂v_{t-1} + (1-β₂)(∇J(θ))²",    "m_hat = m_t / (1-β₁^t)",    "v_hat = v_t / (1-β₂^t)",    "θ = θ - η * m_hat / (√v_hat + ε)",    "",    "**Benefits:** Best of both worlds - momentum and adaptivity.",    "class Adam:",    "    \"\"\"Adam optimizer.\"\"\"",    "",    "    def __init__(self, learning_rate=0.001, beta1=0.9, 
beta2=0.999, epsilon=1e-8):",    "        self.learning_rate = learning_rate",    "        self.beta1 = beta1",    "        self.beta2 = beta2",    "        self.epsilon = epsilon",    "        self.m = None  # First moment (momentum)",    "        self.v = None  # Second moment (RMSProp)",    "        self.t = 0     # Time step",    "",    "    def update(self, w, gradients):",    "        \"\"\"Update parameters using Adam.\"\"\"",    "        if self.m is None:",    "            self.m = np.zeros_like(w)",    "            self.v = np.zeros_like(w)",    "",    "        self.t += 1",    "",    "        # Update biased first moment estimate",    "        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients",    "",    "        # Update biased second moment estimate",    "        self.v = self.beta2 * self.v + (1 - self.beta2) * (gradients ** 2)",    "",    "        # Compute bias-corrected moments",    "        m_hat = self.m / (1 - self.beta1 ** self.t)",    "        v_hat = self.v / (1 - self.beta2 ** self.t)",    "",    "        # Update parameters",    "        return w - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)",    "",    "# Test Adam optimizer",    "print(\"\\nTesting Adam Optimizer:\")",    "w_adam = np.random.randn(X_train_nn.shape[1], 1) * 0.01",    "optimizer_adam = Adam(learning_rate=0.01, beta1=0.9, beta2=0.999)",    "",    "losses_adam = []",    "accuracies_adam = []",    "",    "for epoch in range(100):",    "    # Forward pass",    "    y_pred_train = sigmoid(X_train_nn @ w_adam)",    "",    "    # Compute loss and accuracy",    "    loss = logistic_loss(y_train_nn, y_pred_train)",    "    predictions = (y_pred_train > 0.5).astype(int)",    "    accuracy = accuracy_score(y_train_nn, predictions)",    "",    "    losses_adam.append(loss)",    "    accuracies_adam.append(accuracy)",    "",    "    # Backward pass",    "    gradients = logistic_gradients(X_train_nn, y_train_nn, w_adam)",    "",    "    # Update parameters",   
 "    w_adam = optimizer_adam.update(w_adam, gradients)",    "",    "    if epoch % 20 == 0:",    "        print(f\"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}\")",    "",    "print(f\"Final: Loss = {losses_adam[-1]:.4f}, Accuracy = {accuracies_adam[-1]:.4f}\")"   ],   "metadata": {}  },  {   "cell_type": "markdown",   "source": [    "**Cell Analysis:** Adam optimizer implementation complete.",    "",    "- **Adaptive learning:** Different learning rates for different parameters",    "- **Bias correction:** Accounts for initialization bias in moment estimates",    "- **Robust convergence:** Works well across different problem types",    "",    "**Reflection Question:** Why does Adam perform well on most deep learning problems?",    "",    "## Comparative Analysis: Optimizer Performance Comparison",    "",    "Let's visualize and compare all optimizers side by side.",    "# Create comparison plots",    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))",    "",    "# Loss comparison",    "epochs = range(1, 101)",    "ax1.plot(epochs, losses_vanilla, 'b-', label='Vanilla GD', linewidth=2)",    "ax1.plot(epochs, losses_momentum, 'g-', label='Momentum GD', linewidth=2)",    "ax1.plot(epochs, losses_adam, 'r-', label='Adam', linewidth=2)",    "ax1.set_xlabel('Epoch')",    "ax1.set_ylabel('Training Loss')",    "ax1.set_title('Optimizer Loss Comparison')",    "ax1.legend()",    "ax1.grid(True, alpha=0.3)",    "ax1.set_yscale('log')",    "",    "# Accuracy comparison",    "ax2.plot(epochs, accuracies_vanilla, 'b-', label='Vanilla GD', linewidth=2)",    "ax2.plot(epochs, accuracies_momentum, 'g-', label='Momentum GD', linewidth=2)",    "ax2.plot(epochs, accuracies_adam, 'r-', label='Adam', linewidth=2)",    "ax2.set_xlabel('Epoch')",    "ax2.set_ylabel('Training Accuracy')",    "ax2.set_title('Optimizer Accuracy Comparison')",    "ax2.legend()",    "ax2.grid(True, alpha=0.3)",    "",    "plt.tight_layout()",    "plt.show()",    "",    "# Final 
performance summary",    "print(\"\\n🎯 Final Performance Summary:\")",    "print(\"=\" * 50)",    "optimizers = ['Vanilla GD', 'Momentum GD', 'Adam']",    "final_losses = [losses_vanilla[-1], losses_momentum[-1], losses_adam[-1]]",    "final_accuracies = [accuracies_vanilla[-1], accuracies_momentum[-1], accuracies_adam[-1]]",    "",    "for opt, loss, acc in zip(optimizers, final_losses, final_accuracies):",    "    print(f\"{opt:15} | Loss = {loss:.4f} | Accuracy = {acc:.4f}\")",    "",    "print(\"\\n📊 Key Insights:\")",    "print(\"- Adam often converges quickly with little learning-rate tuning\")",    "print(\"- Momentum helps with ravine-like loss surfaces\")",    "print(\"- Vanilla GD is simplest but may need careful tuning\")",    "print(\"- Different optimizers may work better for different problems\")"   ],   "metadata": {}  },  {   "cell_type": "markdown",   "source": [    "**Cell Analysis:** Comparative analysis complete.",    "",    "- **Performance hierarchy:** Adam and Momentum often converge in fewer epochs than Vanilla GD, though the ranking depends on the problem and the learning rates",    "- **Convergence patterns:** Adaptive methods show more stable optimization",    "- **Problem dependence:** No single optimizer works best for all scenarios",    "",    "**Healthcare Translation:** Like choosing treatment protocols - Adam works for most cases, but specialized approaches are needed for specific conditions.",    "",    "## 🎯 Key Takeaways and Nigerian Healthcare Applications",    "",    "**Algorithm Summary:**",    "",    "- **Vanilla GD:** Simple foundation, slow but reliable for convex problems",    "- **Momentum:** Physics-inspired acceleration, can help escape shallow local minima",    "- **Adam:** Widely used default optimizer combining momentum with per-parameter adaptive learning rates",    "- **Selection criteria:** Problem complexity, computational resources, convergence requirements",    "",    "**Healthcare Translation - Mark:**",    "",    "Imagine training AI for Nigerian hospitals:",    "",    "- **Adam optimizer:** Default choice for most deep learning medical models",    "- **Momentum:** Good for simpler 
models with clear loss landscapes",    "- **Vanilla GD:** Useful for understanding optimization fundamentals",    "- **Adaptive learning:** Critical for handling variable patient data patterns",    "",    "**Performance achieved:** All optimizers successfully trained logistic regression models with high accuracy!",    "",    "**Reflection Questions:**",    "",    "1. Why has Adam become the default optimizer in deep learning?",    "",    "2. How might different optimizers affect medical AI reliability?",    "",    "3. Compare optimization to how doctors refine their diagnostic approaches.",    "",    "**Next Steps:**",    "",    "- Apply these optimizers to neural network training",    "- Explore learning rate scheduling techniques",    "- Investigate second-order optimization methods",    "",    "**🏆 Excellent progress, my student! You've mastered the optimization algorithms that power modern AI.**"   ],   "metadata": {}  } ], "metadata": {  "kernelspec": {   "display_name": "Python 3",   "language": "python",   "name": "python3"  },  "language_info": {   "name": "python",   "version": "3.8.0"  } }, "nbformat": 4, "nbformat_minor": 4}