# Complete Guide to Gradient Descent and Optimization

## Learning Objectives
By the end of this notebook, you will understand:
1. What gradient descent is and why it's fundamental to ML
2. How gradient descent works step-by-step
3. Learning rate and its critical importance
4. Variants: Batch, Mini-batch, and Stochastic GD
5. Advanced optimizers: Momentum, RMSprop, Adam
6. Convergence, local minima, and practical considerations

---

## 1. What is Gradient Descent?

### The Core Idea

**Gradient Descent** is an iterative optimization algorithm for finding the minimum of a function.

**Intuition:** Imagine you're hiking down a mountain in thick fog. You can only see your immediate surroundings. How do you get to the bottom?
- **Step 1:** Look around you (compute gradient)
- **Step 2:** Take a step in the steepest downhill direction (negative gradient)
- **Step 3:** Repeat until you reach the bottom

This is exactly what gradient descent does!

### Mathematical Formulation

Update rule:
$$\theta_{new} = \theta_{old} - \alpha \nabla f(\theta_{old})$$

Where:
- $\theta$ = parameters we're optimizing
- $\alpha$ = learning rate (step size)
- $\nabla f$ = gradient of the loss function
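
To make the update rule concrete, here is a minimal sketch of a single step on the quadratic $f(\theta) = \theta^2$, whose gradient is $2\theta$. The names `grad_f`, `theta_old`, and `alpha` are illustrative choices for this example only.

```python
# One gradient descent step on f(theta) = theta**2 (illustrative example).
def grad_f(theta):
    """Gradient of f(theta) = theta**2."""
    return 2.0 * theta

theta_old = 4.0   # current parameter value
alpha = 0.25      # learning rate (step size)

# theta_new = theta_old - alpha * grad f(theta_old)
theta_new = theta_old - alpha * grad_f(theta_old)
print(theta_new)  # 4.0 - 0.25 * 8.0 = 2.0
```

The step moves $\theta$ from 4.0 to 2.0, halving the distance to the minimum at 0. Repeating the same line is all that "iterating" means.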

### Why It Matters in ML

**Most ML models are trained with gradient descent or one of its variants:**
- Linear regression → minimize MSE
- Logistic regression → minimize cross-entropy
- Neural networks → minimize loss, with gradients computed by backpropagation
- Deep learning → virtually all modern architectures

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

# Simple 1D example
def f(x):
    """Simple quadratic function"""
    return x**2 + 2*x + 1

def df_dx(x):
    """Derivative of f"""
    return 2*x + 2

# Gradient descent implementation
def gradient_descent_1d(start_x, learning_rate, num_iterations):
    """Perform gradient descent on 1D function"""
    x = start_x
    history = [x]
    
    for i in range(num_iterations):
        gradient = df_dx(x)
        x = x - learning_rate * gradient
        history.append(x)
    
    return np.array(history)

# Run gradient descent
start = 5.0
lr = 0.1
iterations = 20
path = gradient_descent_1d(start, lr, iterations)

# Visualize
x = np.linspace(-6, 6, 200)
y = f(x)

plt.figure(figsize=(12, 5))

# Left: Function with path
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', linewidth=2, label='f(x) = x² + 2x + 1')
plt.plot(path, f(path), 'ro-', markersize=8, linewidth=2, label='GD path')
plt.scatter([-1], [0], color='green', s=200, marker='*', 
           edgecolors='black', linewidths=2, label='Minimum', zorder=5)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent Path')
plt.legend()
plt.grid(True, alpha=0.3)

# Right: Convergence
plt.subplot(1, 2, 2)
plt.plot(range(len(path)), f(path), 'b-', linewidth=2, marker='o')
plt.axhline(y=0, color='green', linestyle='--', linewidth=2, label='True minimum')
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.title('Convergence: Loss vs Iteration')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Starting point: x = {start}, f(x) = {f(start):.4f}")
print(f"After {iterations} iterations: x = {path[-1]:.4f}, f(x) = {f(path[-1]):.4f}")
print(f"True minimum: x = -1, f(x) = 0")
print("\n✓ Gradient descent successfully found the minimum!")

---
## 2. The Learning Rate

### What is the Learning Rate?

The **learning rate** $\alpha$ controls how big our steps are.

**Critical balance:**
- **Too small** → slow convergence, many iterations needed
- **Too large** → overshooting, divergence, instability
- **Just right** → fast and stable convergence ✓

This is often the single most important hyperparameter to tune.
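
For the quadratic from Section 1, $f(x) = x^2 + 2x + 1 = (x + 1)^2$, the effect of the learning rate can be worked out exactly: each update multiplies the distance to the minimum by $1 - 2\alpha$. The sketch below tabulates that multiplier. The helper name `error_multiplier` is an assumption made for this illustration, and the result holds only for this particular function.

```python
# Per-step error multiplier for gradient descent on f(x) = (x + 1)**2.
# Derivation: x_new + 1 = (x + 1) - alpha * 2 * (x + 1) = (1 - 2*alpha) * (x + 1)
def error_multiplier(alpha):
    return 1.0 - 2.0 * alpha

for alpha in [0.01, 0.1, 0.5, 1.0, 1.5]:
    m = error_multiplier(alpha)
    if abs(m) < 1:
        verdict = "converges"
    elif abs(m) == 1:
        verdict = "oscillates forever"
    else:
        verdict = "diverges"
    print(f"alpha = {alpha:4.2f}: multiplier = {m:+.2f} -> {verdict}")
```

A multiplier close to 1 means slow progress, 0 means the minimum is reached in one step, and a magnitude above 1 means each step lands farther away than the last.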

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.0, 1.5]
start = 5.0
iterations = 30

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

x_plot = np.linspace(-8, 8, 200)
y_plot = f(x_plot)

for idx, lr in enumerate(learning_rates):
    path = gradient_descent_1d(start, lr, iterations)
    
    axes[idx].plot(x_plot, y_plot, 'b-', linewidth=2, alpha=0.3)
    axes[idx].plot(path, f(path), 'ro-', markersize=6, linewidth=2)
    axes[idx].scatter([-1], [0], color='green', s=200, marker='*', 
                     edgecolors='black', linewidths=2, zorder=5)
    axes[idx].set_xlabel('x')
    axes[idx].set_ylabel('f(x)')
    axes[idx].set_title(f'Learning Rate = {lr}')
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_ylim(-2, 50)
    
    # Add annotation
    final_loss = f(path[-1])
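    if lr <= 0.5:
        status = "✓ Converged" if final_loss < 1 else "Slow"
        color = 'green' if final_loss < 1 else 'orange'
    else:
        # Loss above its starting value means divergence; otherwise the
        # iterates are bouncing around the minimum without settling.
        diverged = final_loss > f(start)
        status = "✗ Diverged" if diverged else "Oscillating"
        color = 'red' if diverged else 'orange'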
    
    axes[idx].text(0.05, 0.95, status, transform=axes[idx].transAxes,
                  fontsize=12, verticalalignment='top', 
                  bbox=dict(boxstyle='round', facecolor=color, alpha=0.5))

# Summary plot
axes[5].axis('off')
summary_text = (
    "Learning Rate Guidelines:\n\n"
    "α = 0.01: Too slow\n"
    "  • Many iterations needed\n"
    "  • Safe but inefficient\n\n"
    "α = 0.1-0.5: Good! ✓\n"
    "  • Fast convergence\n"
    "  • Stable\n\n"
    "α ≥ 1.0: Too large\n"
    "  • Overshooting\n"
    "  • May diverge"
)
axes[5].text(0.1, 0.9, summary_text, transform=axes[5].transAxes,
            fontsize=11, verticalalignment='top', family='monospace',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("=== LEARNING RATE COMPARISON ===")
for lr in learning_rates:
    path = gradient_descent_1d(start, lr, iterations)
    final_loss = f(path[-1])
    print(f"α = {lr:4.2f} → Final loss: {final_loss:8.4f}")

### Learning Rate in Practice

In [None]:
# Convergence speed comparison
learning_rates = [0.01, 0.1, 0.3]
colors = ['blue', 'green', 'red']
labels = ['Slow (α=0.01)', 'Good (α=0.1)', 'Fast (α=0.3)']

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for lr, color, label in zip(learning_rates, colors, labels):
    path = gradient_descent_1d(5.0, lr, 50)
    plt.plot(range(len(path)), f(path), color=color, linewidth=2, 
            marker='o', markersize=3, label=label)

plt.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Convergence Speed vs Learning Rate')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

plt.subplot(1, 2, 2)
for lr, color, label in zip(learning_rates, colors, labels):
    path = gradient_descent_1d(5.0, lr, 50)
    plt.plot(path, color=color, linewidth=2, marker='o', markersize=3, label=label)

plt.axhline(y=-1, color='black', linestyle='--', linewidth=1, alpha=0.5, label='True minimum')
plt.xlabel('Iteration')
plt.ylabel('Parameter Value (x)')
plt.title('Parameter Evolution')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Insights:")
print("• Smaller learning rate = slower but safer")
print("• Larger learning rate = faster but may overshoot")
print("• On a log scale the loss falls along a straight line: it shrinks by a constant factor per step")
print("• In practice: start with α ≈ 0.001-0.1, tune from there")

---
## 3. Multivariable Gradient Descent

### Extension to Multiple Dimensions

For functions with multiple parameters:

$$\vec{\theta}_{new} = \vec{\theta}_{old} - \alpha \nabla f(\vec{\theta}_{old})$$

Where $\nabla f$ is the gradient vector:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial \theta_1} \\ \frac{\partial f}{\partial \theta_2} \\ \vdots \end{bmatrix}$$
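
A hand-derived gradient is easy to get wrong, so it helps to compare it against a finite-difference estimate. The following is a minimal sketch of such a check. The helper `numerical_gradient`, the test function `bowl`, and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def numerical_gradient(func, theta, h=1e-6):
    """Central-difference estimate of the gradient of func at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h   # perturb one coordinate at a time
        grad[i] = (func(theta + step) - func(theta - step)) / (2 * h)
    return grad

# Example: f(x, y) = x**2 + y**2 has analytic gradient [2x, 2y].
def bowl(theta):
    return theta[0]**2 + theta[1]**2

point = np.array([3.0, -2.0])
approx = numerical_gradient(bowl, point)
exact = np.array([2 * point[0], 2 * point[1]])
print(np.allclose(approx, exact, atol=1e-4))
```

Each partial derivative costs two function evaluations, so this is useful for checking a gradient but too slow to use for training.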

In [None]:
# 2D example: f(x, y) = x² + y²
def f_2d(params):
    """2D quadratic function"""
    x, y = params
    return x**2 + y**2

def gradient_2d(params):
    """Gradient of f(x,y) = x² + y²"""
    x, y = params
    return np.array([2*x, 2*y])

def gradient_descent_2d(start, learning_rate, num_iterations):
    """Gradient descent in 2D"""
    params = np.array(start)
    history = [params.copy()]
    
    for i in range(num_iterations):
        grad = gradient_2d(params)
        params = params - learning_rate * grad
        history.append(params.copy())
    
    return np.array(history)

# Run gradient descent from different starting points
start_points = [[3, 3], [-3, 2], [2, -3], [-2, -2]]
lr = 0.1
iterations = 20

# Create mesh for contour plot
x = np.linspace(-4, 4, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

# Plot
fig = plt.figure(figsize=(16, 6))

# 2D contour plot
ax1 = plt.subplot(1, 2, 1)
contour = ax1.contour(X, Y, Z, levels=20, cmap='viridis')
ax1.clabel(contour, inline=True, fontsize=8)

colors = ['red', 'blue', 'green', 'orange']
for start, color in zip(start_points, colors):
    path = gradient_descent_2d(start, lr, iterations)
    ax1.plot(path[:, 0], path[:, 1], 'o-', color=color, linewidth=2, 
            markersize=6, label=f'Start: {start}')

ax1.scatter([0], [0], color='yellow', s=300, marker='*', 
           edgecolors='black', linewidths=2, label='Minimum', zorder=5)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('2D Gradient Descent: All Paths Lead to Minimum')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_aspect('equal')

# 3D surface plot
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot_surface(X, Y, Z, cmap='viridis', alpha=0.6)

for start, color in zip(start_points, colors):
    path = gradient_descent_2d(start, lr, iterations)
    z_path = np.array([f_2d(p) for p in path])
    ax2.plot(path[:, 0], path[:, 1], z_path, 'o-', color=color, 
            linewidth=2, markersize=4)

ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x,y)')
ax2.set_title('3D View: Descending the Surface')

plt.tight_layout()
plt.show()

print("Observation:")
print("• All paths converge to the global minimum (0, 0)")
print("• Paths cross the contour lines at right angles")
print("• The negative gradient points toward the minimum (the gradient itself points uphill)")

---
## 4. Linear Regression with Gradient Descent

### Real ML Example

Model: $\hat{y} = wx + b$

Loss (MSE): $L = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$

Gradients:
$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) \cdot x_i$$
$$\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i)$$
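
One way to sanity-check these two formulas is to evaluate them at the least-squares solution, where both should vanish. The sketch below does that on a small synthetic dataset. The data, the seed, and the helper name `mse_gradients` are assumptions made for this illustration, and `np.polyfit` stands in as an independent reference solver.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for the sketch
x = np.linspace(0, 10, 50)
y = 3 * x + 7 + rng.normal(0, 2, size=x.size)

def mse_gradients(x, y, w, b):
    """Gradients of L = mean((w*x + b - y)**2) with respect to w and b."""
    error = w * x + b - y
    return 2 * np.mean(error * x), 2 * np.mean(error)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
w_star, b_star = np.polyfit(x, y, deg=1)
dw, db = mse_gradients(x, y, w_star, b_star)
print(f"w* = {w_star:.3f}, b* = {b_star:.3f}")
print(f"dL/dw = {dw:.2e}, dL/db = {db:.2e}")  # both should be close to zero
```

Gradient descent is heading for exactly this point: the parameters at which both partial derivatives are zero.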

In [None]:
# Generate synthetic data
np.random.seed(42)
X_data = np.linspace(0, 10, 100)
y_data = 3 * X_data + 7 + np.random.randn(100) * 2  # True: y = 3x + 7 + noise

def predict(X, w, b):
    return w * X + b

def mse_loss(X, y, w, b):
    y_pred = predict(X, w, b)
    return np.mean((y_pred - y)**2)

def compute_gradients(X, y, w, b):
    n = len(X)
    y_pred = predict(X, w, b)
    error = y_pred - y
    
    dL_dw = (2/n) * np.sum(error * X)
    dL_db = (2/n) * np.sum(error)
    
    return dL_dw, dL_db

def gradient_descent_linear_regression(X, y, learning_rate, num_iterations):
    """Train linear regression using gradient descent"""
    # Initialize parameters
    w = 0.0
    b = 0.0
    
    history = {
        'w': [w],
        'b': [b],
        'loss': [mse_loss(X, y, w, b)]
    }
    
    for i in range(num_iterations):
        # Compute gradients
        dL_dw, dL_db = compute_gradients(X, y, w, b)
        
        # Update parameters
        w = w - learning_rate * dL_dw
        b = b - learning_rate * dL_db
        
        # Record history
        history['w'].append(w)
        history['b'].append(b)
        history['loss'].append(mse_loss(X, y, w, b))
    
    return w, b, history

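# Train the model. With unscaled x in [0, 10] the bias converges much more
# slowly than the weight, so a few thousand iterations are needed at this rate.
w_final, b_final, history = gradient_descent_linear_regression(
    X_data, y_data, learning_rate=0.01, num_iterations=2000
)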

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Data and fitted line
axes[0, 0].scatter(X_data, y_data, alpha=0.5, label='Data')
axes[0, 0].plot(X_data, predict(X_data, 0, 0), 'r--', linewidth=2, 
               label=f'Initial: y = 0x + 0', alpha=0.5)
axes[0, 0].plot(X_data, predict(X_data, w_final, b_final), 'g-', linewidth=2,
               label=f'Final: y = {w_final:.2f}x + {b_final:.2f}')
axes[0, 0].plot(X_data, 3*X_data + 7, 'b:', linewidth=2, label='True: y = 3x + 7')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('y')
axes[0, 0].set_title('Linear Regression Result')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Loss over time
axes[0, 1].plot(history['loss'], 'b-', linewidth=2)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('MSE Loss')
axes[0, 1].set_title('Loss Decreases Over Time')
axes[0, 1].grid(True, alpha=0.3)

# Parameter evolution: w
axes[1, 0].plot(history['w'], 'r-', linewidth=2)
axes[1, 0].axhline(y=3, color='green', linestyle='--', linewidth=2, label='True value')
axes[1, 0].set_xlabel('Iteration')
axes[1, 0].set_ylabel('Weight (w)')
axes[1, 0].set_title('Weight Converges to True Value')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Parameter evolution: b
axes[1, 1].plot(history['b'], 'b-', linewidth=2)
axes[1, 1].axhline(y=7, color='green', linestyle='--', linewidth=2, label='True value')
axes[1, 1].set_xlabel('Iteration')
axes[1, 1].set_ylabel('Bias (b)')
axes[1, 1].set_title('Bias Converges to True Value')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== TRAINING RESULTS ===")
print(f"True parameters: w = 3.0, b = 7.0")
print(f"Learned parameters: w = {w_final:.4f}, b = {b_final:.4f}")
print(f"Initial loss: {history['loss'][0]:.4f}")
print(f"Final loss: {history['loss'][-1]:.4f}")
print("\n✓ Gradient descent successfully learned the model!")

---
## 5. Variants of Gradient Descent

### Three Main Types

1. **Batch Gradient Descent**: Use ALL data to compute gradient
   - Accurate but slow for large datasets
   
2. **Stochastic Gradient Descent (SGD)**: Use ONE sample at a time
   - Fast but noisy updates
   
3. **Mini-batch Gradient Descent**: Use small batches (e.g., 32, 64 samples)
   - Best of both worlds ✓ (most commonly used)
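
The three variants differ only in how many samples feed each gradient estimate. A minimal sketch of epoch-style batching follows, in which every sample is visited once per pass in shuffled order. The helper `iterate_minibatches` is a hypothetical name introduced here, not a library function.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover every sample once, in shuffled order."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(iterate_minibatches(n_samples=10, batch_size=4, rng=rng))
print([len(b) for b in batches])  # [4, 4, 2]: the last batch may be smaller
```

Setting `batch_size` to the dataset size recovers batch gradient descent, and setting it to 1 recovers SGD. Sampling a fresh random batch at every step is an equally valid alternative to shuffling once per pass.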

In [None]:
# Generate larger dataset
np.random.seed(42)
n_samples = 1000
X_large = np.random.randn(n_samples)
y_large = 2 * X_large + 1 + np.random.randn(n_samples) * 0.5

def sgd_update(X, y, w, b, learning_rate):
    """Single SGD update using one random sample"""
    idx = np.random.randint(len(X))
    x_i, y_i = X[idx], y[idx]
    
    y_pred = w * x_i + b
    error = y_pred - y_i
    
    dL_dw = 2 * error * x_i
    dL_db = 2 * error
    
    w = w - learning_rate * dL_dw
    b = b - learning_rate * dL_db
    
    return w, b

def minibatch_update(X, y, w, b, learning_rate, batch_size):
    """Mini-batch update"""
    indices = np.random.choice(len(X), batch_size, replace=False)
    X_batch = X[indices]
    y_batch = y[indices]
    
    y_pred = w * X_batch + b
    error = y_pred - y_batch
    
    dL_dw = (2/batch_size) * np.sum(error * X_batch)
    dL_db = (2/batch_size) * np.sum(error)
    
    w = w - learning_rate * dL_dw
    b = b - learning_rate * dL_db
    
    return w, b

# Compare variants
def compare_gd_variants(X, y, lr, iterations):
    results = {}
    
    # Batch GD
    w, b = 0.0, 0.0
    losses_batch = []
    for i in range(iterations):
        dL_dw, dL_db = compute_gradients(X, y, w, b)
        w = w - lr * dL_dw
        b = b - lr * dL_db
        losses_batch.append(mse_loss(X, y, w, b))
    results['Batch GD'] = losses_batch
    
    # SGD
    w, b = 0.0, 0.0
    losses_sgd = []
    for i in range(iterations):
        w, b = sgd_update(X, y, w, b, lr)
        losses_sgd.append(mse_loss(X, y, w, b))
    results['SGD'] = losses_sgd
    
    # Mini-batch GD
    w, b = 0.0, 0.0
    losses_minibatch = []
    for i in range(iterations):
        w, b = minibatch_update(X, y, w, b, lr, batch_size=32)
        losses_minibatch.append(mse_loss(X, y, w, b))
    results['Mini-batch GD (32)'] = losses_minibatch
    
    return results

# Run comparison
results = compare_gd_variants(X_large, y_large, lr=0.01, iterations=200)

# Plot
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
for name, losses in results.items():
    plt.plot(losses, linewidth=2, label=name, alpha=0.8)
plt.xlabel('Iteration')
plt.ylabel('MSE Loss')
plt.title('Convergence: Batch vs SGD vs Mini-batch')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0, max([max(l) for l in results.values()]) * 0.3)

plt.subplot(1, 2, 2)
for name, losses in results.items():
    # Use moving average for smoothing
    window = 10
    smoothed = np.convolve(losses, np.ones(window)/window, mode='valid')
    plt.plot(smoothed, linewidth=2, label=name, alpha=0.8)
plt.xlabel('Iteration')
plt.ylabel('MSE Loss (smoothed)')
plt.title('Smoothed Convergence')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== COMPARISON ===")
print("\nBatch GD:")
print("  ✓ Smooth, deterministic convergence")
print("  ✗ Slow for large datasets (uses all data per update)")
print("\nSGD:")
print("  ✓ Fast updates (one sample at a time)")
print("  ✗ Noisy, can jump around")
print("  ✓ Noise can help it escape shallow local minima")
print("\nMini-batch GD:")
print("  ✓ Balance of speed and stability")
print("  ✓ Efficient for GPU computation")
print("  ✓ Most commonly used in practice!")

---
## 6. Advanced Optimizers

### Beyond Vanilla Gradient Descent

Modern deep learning uses enhanced versions of GD:

1. **Momentum**: Add velocity term (like rolling ball)
2. **RMSprop**: Adaptive learning rates
3. **Adam**: Combines momentum + RMSprop (most popular!)
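
To see what "adaptive" means in practice, the sketch below computes the very first Adam step by hand for two parameters whose gradients differ by several orders of magnitude, and compares it with a plain gradient step. The gradient values are made up for illustration, and the hyperparameters are commonly used defaults rather than a recommendation.

```python
import numpy as np

lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
grad = np.array([250.0, -0.004])   # two gradients of very different scale

# First Adam step (t = 1), starting from zero moment estimates.
m = (1 - beta1) * grad             # first moment (running mean of gradients)
v = (1 - beta2) * grad**2          # second moment (running mean of squares)
m_hat = m / (1 - beta1**1)         # bias correction undoes the zero start
v_hat = v / (1 - beta2**1)
adam_step = lr * m_hat / (np.sqrt(v_hat) + eps)

sgd_step = lr * grad               # plain gradient descent step for comparison
print("SGD step: ", sgd_step)
print("Adam step:", adam_step)
```

The plain steps differ in size by the same factor as the gradients, while both Adam steps have magnitude close to `lr`. Dividing by the running root-mean-square of the gradient is what gives each parameter its own effective step size.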

In [None]:
# Implement optimizers
class Optimizer:
    """Base optimizer class"""
    def __init__(self, learning_rate=0.01):
        self.lr = learning_rate
    
    def step(self, params, grads):
        raise NotImplementedError

class SGD(Optimizer):
    """Standard SGD"""
    def step(self, params, grads):
        return params - self.lr * grads

class Momentum(Optimizer):
    """SGD with Momentum"""
    def __init__(self, learning_rate=0.01, beta=0.9):
        super().__init__(learning_rate)
        self.beta = beta
        self.velocity = None
    
    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        
        self.velocity = self.beta * self.velocity + (1 - self.beta) * grads
        return params - self.lr * self.velocity

class RMSprop(Optimizer):
    """RMSprop optimizer"""
    def __init__(self, learning_rate=0.01, beta=0.9, epsilon=1e-8):
        super().__init__(learning_rate)
        self.beta = beta
        self.epsilon = epsilon
        self.squared_grads = None
    
    def step(self, params, grads):
        if self.squared_grads is None:
            self.squared_grads = np.zeros_like(params)
        
        self.squared_grads = self.beta * self.squared_grads + (1 - self.beta) * grads**2
        return params - self.lr * grads / (np.sqrt(self.squared_grads) + self.epsilon)

class Adam(Optimizer):
    """Adam optimizer (Adaptive Moment Estimation)"""
    def __init__(self, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
        super().__init__(learning_rate)
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment (momentum)
        self.v = None  # Second moment (RMSprop)
        self.t = 0
    
    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.t += 1
        
        # Update moments
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads**2
        
        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

# Test on challenging function (Rosenbrock)
def rosenbrock(params):
    x, y = params
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(params):
    x, y = params
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

def optimize(optimizer, start, num_iterations):
    params = np.array(start)
    history = [params.copy()]
    
    for i in range(num_iterations):
        grads = rosenbrock_grad(params)
        params = optimizer.step(params, grads)
        history.append(params.copy())
    
    return np.array(history)

# Compare optimizers
start_point = [-1.5, 2.5]
iterations = 500
lr = 0.001

optimizers = {
    'SGD': SGD(lr),
    'Momentum': Momentum(lr, beta=0.9),
    'RMSprop': RMSprop(lr * 10, beta=0.9),  # Higher LR for RMSprop
    'Adam': Adam(lr * 10, beta1=0.9, beta2=0.999)
}

paths = {}
for name, opt in optimizers.items():
    paths[name] = optimize(opt, start_point, iterations)

# Visualize
x = np.linspace(-2, 2, 100)
y = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x, y)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

plt.figure(figsize=(14, 10))

# Contour plot with paths
plt.subplot(2, 2, 1)
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis')
colors = ['blue', 'red', 'green', 'orange']
for (name, path), color in zip(paths.items(), colors):
    plt.plot(path[:, 0], path[:, 1], '-', color=color, linewidth=2, 
            label=name, alpha=0.7)
plt.scatter([1], [1], color='yellow', s=300, marker='*', 
           edgecolors='black', linewidths=2, label='Minimum', zorder=5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Optimization Paths (Rosenbrock Function)')
plt.legend()
plt.grid(True, alpha=0.3)

# Loss over time
plt.subplot(2, 2, 2)
for name, path in paths.items():
    losses = [rosenbrock(p) for p in path]
    plt.plot(losses, linewidth=2, label=name)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Convergence Speed')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Distance to minimum
plt.subplot(2, 2, 3)
for name, path in paths.items():
    distances = np.linalg.norm(path - np.array([1, 1]), axis=1)
    plt.plot(distances, linewidth=2, label=name)
plt.xlabel('Iteration')
plt.ylabel('Distance to Minimum')
plt.title('Distance from Optimal Point')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Final comparison
plt.subplot(2, 2, 4)
final_losses = [rosenbrock(paths[name][-1]) for name in optimizers.keys()]
plt.bar(optimizers.keys(), final_losses, color=colors)
plt.ylabel('Final Loss')
plt.title('Final Loss Comparison')
plt.yscale('log')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("=== OPTIMIZER COMPARISON ===")
for name in optimizers.keys():
    final_loss = rosenbrock(paths[name][-1])
    print(f"{name:10s}: Final loss = {final_loss:.6f}")

print("\n=== KEY INSIGHTS ===")
print("• Adam often converges fastest, but results depend on the problem and the learning rates chosen")
print("• Momentum helps with oscillations")
print("• RMSprop adapts learning rate per parameter")
print("• In practice: Start with Adam, learning_rate ≈ 0.001")

---
## 7. Practical Considerations

### Common Issues and Solutions

In [None]:
print("=== PRACTICAL TIPS ===")
print("\n1. LEARNING RATE:")
print("   • Start: 0.001 - 0.01")
print("   • Too high → divergence, NaN")
print("   • Too low → slow training")
print("   • Solution: Learning rate schedule (decay over time)")

print("\n2. FEATURE SCALING:")
print("   • Normalize features (mean=0, std=1) before training")
print("   • Prevents different scales from dominating")
print("   • Makes optimization landscape more spherical")

print("\n3. INITIALIZATION:")
print("   • For neural networks, don't initialize all weights to zero (symmetry is never broken)")
print("   • Use small random values")
print("   • Xavier/He initialization for deep networks")

print("\n4. CONVERGENCE:")
print("   • Monitor loss over time")
print("   • Stop when validation loss plateaus (early stopping)")
print("   • Watch for oscillations (reduce learning rate)")

print("\n5. OPTIMIZER CHOICE:")
print("   • Default: Adam (learning_rate=0.001)")
print("   • For RNNs: Sometimes SGD with momentum")
print("   • For fine-tuning: Lower learning rate")

print("\n6. DEBUGGING:")
print("   • Loss increasing → learning rate too high")
print("   • Loss not decreasing → learning rate too low or bad initialization")
print("   • Loss = NaN → numerical instability, reduce LR")
print("   • Slow convergence → try different optimizer or increase LR")

### Learning Rate Schedules

In [None]:
# Different LR schedules
def constant_lr(epoch, initial_lr):
    return initial_lr

def step_decay(epoch, initial_lr, drop_rate=0.5, epochs_drop=10):
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(epoch, initial_lr, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

def cosine_annealing(epoch, initial_lr, total_epochs):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))

# Visualize schedules
epochs = np.arange(0, 100)
initial_lr = 0.1

schedules = {
    'Constant': [constant_lr(e, initial_lr) for e in epochs],
    'Step Decay': [step_decay(e, initial_lr) for e in epochs],
    'Exponential': [exponential_decay(e, initial_lr) for e in epochs],
    'Cosine Annealing': [cosine_annealing(e, initial_lr, 100) for e in epochs]
}

plt.figure(figsize=(12, 5))

for name, lrs in schedules.items():
    plt.plot(epochs, lrs, linewidth=2, label=name)

plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("Why use LR schedules?")
print("• Start high: Make quick progress early")
print("• Decay over time: Fine-tune as we approach minimum")
print("• Helps convergence and final performance")
print("\nCommon choices: step decay, exponential decay, and cosine annealing")

---
## 8. Practice Exercises

### Exercise 1: Implement Gradient Descent
Implement gradient descent for $f(x) = x^3 - 3x^2 + 2$

In [None]:
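# Exercise 1 - Your code here
# Note: this cell redefines f and df_dx from Section 1; re-run that cell before revisiting earlier sections.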
def f(x):
    return x**3 - 3*x**2 + 2

def df_dx(x):
    # YOUR CODE: compute derivative
    pass

# YOUR CODE: implement gradient descent


### Exercise 2: Logistic Regression
Implement gradient descent for logistic regression on the iris dataset (binary classification)

In [None]:
# Exercise 2 - Your code here
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X = iris.data[:100, :2]  # First two features, first two classes
y = iris.target[:100]

# YOUR CODE: implement logistic regression with gradient descent


### Exercise 3: Compare Learning Rates
Test learning rates [0.001, 0.01, 0.1, 0.5] and plot convergence

In [None]:
# Exercise 3 - Your code here


### Exercise 4: Implement Adam
Implement the Adam optimizer from scratch and test on a simple function

In [None]:
# Exercise 4 - Your code here


---
## 9. Solutions
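
For Exercise 1, one possible solution is sketched below. The function names `f_cubic` and `df_cubic`, the starting point, and the learning rate are choices made for this sketch. Note that $f(x) = x^3 - 3x^2 + 2$ is unbounded below as $x \to -\infty$, so gradient descent only finds the local minimum at $x = 2$ when started at some $x > 0$.

```python
def f_cubic(x):
    return x**3 - 3 * x**2 + 2

def df_cubic(x):
    """Derivative: f'(x) = 3x^2 - 6x."""
    return 3 * x**2 - 6 * x

x = 3.0                # start to the right of the local maximum at x = 0
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * df_cubic(x)

print(f"x = {x:.4f}, f(x) = {f_cubic(x):.4f}")  # x = 2.0000, f(x) = -2.0000
```

Starting from a negative `x` instead sends the iterates toward $-\infty$, which is a useful reminder that gradient descent only follows the local slope.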

In [None]:
print("=== SOLUTIONS ===")

# Exercise 1
print("\nExercise 1: f(x) = x³ - 3x² + 2")
print("f'(x) = 3x² - 6x")
print("(See implementation in Section 1)")

# Exercise 2
print("\nExercise 2: Logistic Regression")
print("Gradients: ∂L/∂w = (1/n)Σ(p-y)·x")
print("           ∂L/∂b = (1/n)Σ(p-y)")
print("(See implementation in derivatives notebook)")

# Exercise 3
print("\nExercise 3: Learning Rate Comparison")
print("(See Section 2 for detailed comparison)")

# Exercise 4
print("\nExercise 4: Adam Optimizer")
print("(See Section 6 for full implementation)")

---
## 10. Key Takeaways

### Core Concepts:
1. ✅ **Gradient descent** = iterative optimization following negative gradient
2. ✅ **Learning rate** = controls step size (often the most important hyperparameter)
3. ✅ **Variants**: Batch, Mini-batch (most common), Stochastic
4. ✅ **Advanced**: Momentum, RMSprop, Adam (Adam is a reasonable default)

### Update Rule:
$$\theta_{new} = \theta_{old} - \alpha \nabla f(\theta_{old})$$

### Practical Guidelines:
- **Start with Adam**, learning_rate = 0.001
- **Normalize features** before training
- **Use mini-batches** (32-256 samples)
- **Monitor loss** for convergence
- **Learning rate schedule** for better convergence
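
The "normalize features" guideline above can be sketched in a few lines of NumPy. The helper name `standardize` and the example matrix are assumptions made for this illustration.

```python
import numpy as np

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # leave constant columns unscaled
    return (X - mean) / std, mean, std

# Two hypothetical features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
X_scaled, mean, std = standardize(X)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

The returned `mean` and `std` should be reused to scale validation and test data, so that all splits share the same transformation.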

### When to Use What:
- **Simple problems**: Batch GD
- **Large datasets**: Mini-batch GD with Adam
- **Fine-tuning**: Lower learning rate (0.0001)
- **RNNs**: Sometimes SGD with momentum

### Debugging Tips:
- Loss increasing → LR too high
- Loss flat → LR too low or stuck
- Loss = NaN → Numerical instability
- Slow convergence → Try different optimizer

---

**Congratulations! You understand gradient descent! 🎉**

**Next: Chain Rule and Backpropagation - how neural networks compute these gradients!**