# Module 02: Backpropagation and Gradient Descent

**Difficulty**: ⭐⭐⭐ (Advanced)
**Estimated Time**: 60-75 minutes
**Prerequisites**: 
- Module 00: Introduction to Neural Networks
- Module 01: Perceptrons and Activation Functions
- Solid understanding of calculus (derivatives, chain rule)
- Matrix multiplication and linear algebra
- Python and NumPy

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Execute** forward propagation step-by-step through a neural network
2. **Calculate** loss using different cost functions (MSE, Binary Cross-Entropy)
3. **Explain** the gradient descent optimization algorithm and its variants
4. **Apply** the chain rule to compute gradients in backpropagation
5. **Implement** backpropagation manually for a simple neural network
6. **Visualize** optimization landscapes and learning rate effects
7. **Compute** gradients by hand for multi-layer networks

## 1. Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# Set random seeds for reproducibility
np.random.seed(42)

# Configure plotting
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("Setup complete!")
print(f"NumPy version: {np.__version__}")

## 2. Forward Propagation

### What is Forward Propagation?

Forward propagation is the process of passing input data through the network to generate predictions. Data flows from input → hidden layers → output.

### Single Neuron Forward Pass

For a single neuron:

1. **Weighted sum (pre-activation)**:
   $$z = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$

2. **Activation**:
   $$a = f(z)$$

Where:
- $\mathbf{x}$ = input vector
- $\mathbf{w}$ = weight vector
- $b$ = bias
- $z$ = pre-activation (linear combination)
- $f$ = activation function
- $a$ = activation (output)
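
As a quick numeric check of the two steps above, the sketch below evaluates one neuron by hand. The input, weights, bias, and the choice of sigmoid for $f$ are illustrative assumptions picked for easy arithmetic; they are not values used elsewhere in this notebook.

```python
import numpy as np

# Illustrative values (assumptions, chosen for easy arithmetic)
x = np.array([1.0, 2.0, 3.0])    # input vector
w = np.array([0.2, -0.1, 0.4])   # weight vector
b = 0.5                          # bias

# Step 1: weighted sum z = w^T x + b
z = np.dot(w, x) + b             # 0.2*1 - 0.1*2 + 0.4*3 + 0.5 = 1.7

# Step 2: activation a = f(z), with f assumed to be the sigmoid
a = 1.0 / (1.0 + np.exp(-z))

print(f"z = {z:.4f}")
print(f"a = {a:.4f}")
```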

### Multi-Layer Network

For a network with $L$ layers:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = f^{[l]}(z^{[l]})$$

Where:
- $l$ = layer index (1 to $L$)
- $a^{[0]} = \mathbf{x}$ (input)
- $a^{[L]} = \hat{y}$ (output/prediction)
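
The two layer equations translate almost directly into a loop. The following is a minimal sketch, assuming the parameters are stored as a list of `(W, b)` pairs; the helper names `forward`, `relu_fn`, and `sigmoid_fn` and the random weights are hypothetical and only serve the illustration.

```python
import numpy as np

def forward(x, params, activations):
    """Forward pass; params is a list of (W, b) pairs, one per layer."""
    a = x                          # a^[0] = x
    cache = [a]                    # keep every activation for later use
    for (W, b), f in zip(params, activations):
        z = W @ a + b              # z^[l] = W^[l] a^[l-1] + b^[l]
        a = f(z)                   # a^[l] = f^[l](z^[l])
        cache.append(a)
    return a, cache

# Hypothetical activation helpers for this sketch
relu_fn = lambda z: np.maximum(0, z)
sigmoid_fn = lambda z: 1 / (1 + np.exp(-z))

# Assumed architecture: 3 inputs -> 4 hidden (ReLU) -> 1 output (sigmoid)
rng = np.random.default_rng(0)
layer_sizes = [3, 4, 1]
params = [(rng.standard_normal((n_out, n_in)) * 0.5, np.zeros((n_out, 1)))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

x_demo = np.array([[2.0], [3.0], [-1.0]])
y_hat, cache = forward(x_demo, params, [relu_fn, sigmoid_fn])
print("Prediction shape:", y_hat.shape)
print("Activation shapes:", [a.shape for a in cache])
```

Keeping the activations in `cache` is a design choice that pays off during backpropagation, which needs them to compute the weight gradients.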

In [None]:
def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # Clip to prevent overflow

def relu(z):
    """ReLU activation function."""
    return np.maximum(0, z)

def forward_propagation_example():
    """
    Demonstrate forward propagation through a 2-layer network.
    
    Architecture:
    - Input: 3 features
    - Hidden: 4 neurons (ReLU)
    - Output: 1 neuron (Sigmoid)
    """
    print("FORWARD PROPAGATION EXAMPLE")
    print("=" * 70)
    
    # Input
    x = np.array([[2.0, 3.0, -1.0]]).T  # Shape: (3, 1)
    print(f"\nInput x (3 features):\n{x.T}")
    
    # Layer 1: Input (3) -> Hidden (4)
    W1 = np.random.randn(4, 3) * 0.5  # Shape: (4, 3)
    b1 = np.zeros((4, 1))              # Shape: (4, 1)
    
    print(f"\nLayer 1 weights W1 shape: {W1.shape}")
    print(f"Layer 1 biases b1 shape: {b1.shape}")
    
    # Forward through layer 1
    z1 = np.dot(W1, x) + b1
    a1 = relu(z1)
    
    print(f"\nLayer 1 pre-activation z1:\n{z1.T}")
    print(f"Layer 1 activation a1 (after ReLU):\n{a1.T}")
    
    # Layer 2: Hidden (4) -> Output (1)
    W2 = np.random.randn(1, 4) * 0.5   # Shape: (1, 4)
    b2 = np.zeros((1, 1))               # Shape: (1, 1)
    
    print(f"\nLayer 2 weights W2 shape: {W2.shape}")
    print(f"Layer 2 biases b2 shape: {b2.shape}")
    
    # Forward through layer 2
    z2 = np.dot(W2, a1) + b2
    a2 = sigmoid(z2)
    
    print(f"\nLayer 2 pre-activation z2: {z2[0, 0]:.4f}")
    print(f"Layer 2 activation a2 (after Sigmoid): {a2[0, 0]:.4f}")
    
    print(f"\n{'=' * 70}")
    print(f"FINAL PREDICTION: {a2[0, 0]:.4f}")
    print(f"{'=' * 70}")
    
    return {'x': x, 'W1': W1, 'b1': b1, 'z1': z1, 'a1': a1,
            'W2': W2, 'b2': b2, 'z2': z2, 'a2': a2}

# Run example
forward_cache = forward_propagation_example()

In [None]:
def visualize_forward_propagation():
    """
    Visualize the flow of information in forward propagation.
    """
    fig, ax = plt.subplots(figsize=(14, 6))
    
    # Define layer positions
    layer_x = [0.15, 0.5, 0.85]
    layer_sizes = [3, 4, 1]
    layer_names = ['Input\n(3 features)', 'Hidden\n(4 neurons, ReLU)', 
                   'Output\n(1 neuron, Sigmoid)']
    
    # Draw neurons
    for layer_idx, (x_pos, size, name) in enumerate(zip(layer_x, layer_sizes, layer_names)):
        y_positions = np.linspace(0.2, 0.8, size)
        
        # Choose color
        if layer_idx == 0:
            color = 'lightblue'
        elif layer_idx == len(layer_x) - 1:
            color = 'lightgreen'
        else:
            color = 'coral'
        
        for y_pos in y_positions:
            circle = plt.Circle((x_pos, y_pos), 0.04, color=color, 
                              ec='black', linewidth=2, zorder=3)
            ax.add_patch(circle)
            
            # Draw connections to next layer
            if layer_idx < len(layer_x) - 1:
                next_x = layer_x[layer_idx + 1]
                next_y_positions = np.linspace(0.2, 0.8, layer_sizes[layer_idx + 1])
                
                for next_y in next_y_positions:
                    ax.plot([x_pos + 0.04, next_x - 0.04], [y_pos, next_y],
                           'gray', alpha=0.3, linewidth=1, zorder=1)
        
        # Add layer label
        ax.text(x_pos, 0.05, name, ha='center', va='top', 
               fontsize=11, weight='bold')
    
    # Add annotations
    ax.annotate('', xy=(0.32, 0.5), xytext=(0.24, 0.5),
               arrowprops=dict(arrowstyle='->', lw=2, color='blue'))
    ax.text(0.28, 0.55, r'$z^{[1]} = W^{[1]}x + b^{[1]}$', 
           ha='center', fontsize=10, style='italic')
    ax.text(0.28, 0.45, r'$a^{[1]} = ReLU(z^{[1]})$', 
           ha='center', fontsize=10, style='italic')
    
    ax.annotate('', xy=(0.67, 0.5), xytext=(0.59, 0.5),
               arrowprops=dict(arrowstyle='->', lw=2, color='blue'))
    ax.text(0.63, 0.55, r'$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$', 
           ha='center', fontsize=10, style='italic')
    ax.text(0.63, 0.45, r'$a^{[2]} = \sigma(z^{[2]}) = \hat{y}$', 
           ha='center', fontsize=10, style='italic')
    
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')
    ax.set_title('Forward Propagation Flow', fontsize=15, weight='bold', pad=20)
    
    plt.tight_layout()
    plt.show()

visualize_forward_propagation()

## 3. Loss Functions

The **loss function** (or cost function) measures how well the network's predictions match the true labels. Our goal is to **minimize** this loss.

### 3.1 Mean Squared Error (MSE)

Used for **regression** problems:

$$L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$$

For $m$ samples:

$$J = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(y^{(i)} - \hat{y}^{(i)})^2$$

**Derivative** (needed for backpropagation):

$$\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}) = \hat{y} - y$$
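
One way to convince yourself of this derivative is to compare it with a central finite difference. The sketch below does that for a single sample; the values of `y`, `y_hat`, and the step `h` are arbitrary assumptions.

```python
def mse_single(y, y_hat):
    """Per-sample loss L = 0.5 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2

y, y_hat, h = 1.0, 0.3, 1e-6     # arbitrary illustrative values

analytic = y_hat - y             # the derivative stated above
numeric = (mse_single(y, y_hat + h) - mse_single(y, y_hat - h)) / (2 * h)

print(f"analytic: {analytic:.6f}")
print(f"numeric:  {numeric:.6f}")
```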

### 3.2 Binary Cross-Entropy Loss

Used for **binary classification**:

$$L(y, \hat{y}) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$

For $m$ samples:

$$J = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})]$$

**Derivative** with respect to the pre-activation $z$ (assuming a sigmoid output, $\hat{y} = \sigma(z)$):

$$\frac{\partial L}{\partial z} = \hat{y} - y$$

(This simplification is why sigmoid + binary cross-entropy work so well together!)
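
The result follows from the chain rule. Here is a short derivation, writing $\hat{y} = \sigma(z)$ and using the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})$.

First, differentiate the loss with respect to the prediction:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}$$

Then multiply by the derivative of the sigmoid:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \cdot \hat{y}(1-\hat{y}) = \hat{y} - y$$

The factor $\hat{y}(1-\hat{y})$ cancels, so the gradient does not shrink when the sigmoid saturates.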

### 3.3 Categorical Cross-Entropy Loss

Used for **multi-class classification**:

$$L(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

Where $C$ is the number of classes, and $\mathbf{y}$ is one-hot encoded.
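
A small numeric example may help here. The sketch below assumes the class probabilities $\hat{\mathbf{y}}$ come from a softmax over raw scores (a common choice, though not something this notebook has introduced); the function names and the logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)    # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def categorical_cross_entropy(y_onehot, y_prob, epsilon=1e-15):
    """L = -sum_c y_c * log(y_hat_c) for a single sample."""
    y_prob = np.clip(y_prob, epsilon, 1.0)   # avoid log(0)
    return -np.sum(y_onehot * np.log(y_prob))

logits = np.array([2.0, 1.0, 0.1])       # assumed raw scores for C = 3 classes
y_onehot = np.array([1.0, 0.0, 0.0])     # true class is class 0

probs = softmax(logits)
cce = categorical_cross_entropy(y_onehot, probs)

print("Probabilities:", np.round(probs, 4))
print(f"Loss: {cce:.4f}")
```

Because $\mathbf{y}$ is one-hot, only the term for the true class survives, so the loss equals $-\log$ of the probability assigned to the correct class.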

In [None]:
class LossFunctions:
    """
    Common loss functions and their derivatives.
    """
    
    @staticmethod
    def mse(y_true, y_pred):
        """
        Mean Squared Error loss.
        
        Parameters:
        -----------
        y_true : array-like
            True labels
        y_pred : array-like
            Predicted values
        
        Returns:
        --------
        loss : float
            Average MSE across samples
        """
        return np.mean(0.5 * (y_true - y_pred) ** 2)
    
    @staticmethod
    def mse_derivative(y_true, y_pred):
        """
        Derivative of MSE with respect to predictions.
        """
        return y_pred - y_true
    
    @staticmethod
    def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
        """
        Binary Cross-Entropy loss.
        
        Parameters:
        -----------
        y_true : array-like
            True binary labels (0 or 1)
        y_pred : array-like
            Predicted probabilities (0 to 1)
        epsilon : float
            Small value to prevent log(0)
        
        Returns:
        --------
        loss : float
            Average BCE across samples
        """
        # Clip predictions to prevent log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred) + 
                       (1 - y_true) * np.log(1 - y_pred))
    
    @staticmethod
    def binary_cross_entropy_derivative(y_true, y_pred, epsilon=1e-15):
        """
        Derivative of BCE with respect to pre-activation (z) when using sigmoid.
        This simplifies to: y_pred - y_true
        """
        return y_pred - y_true

# Demonstrate loss functions
loss = LossFunctions()

# Example predictions and true values
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0.1, 0.9, 0.8, 0.2, 0.7])

print("LOSS FUNCTION EXAMPLES")
print("=" * 70)
print(f"\nTrue labels: {y_true}")
print(f"Predictions: {y_pred}")
print(f"\nBinary Cross-Entropy Loss: {loss.binary_cross_entropy(y_true, y_pred):.4f}")
print(f"Mean Squared Error Loss: {loss.mse(y_true, y_pred):.4f}")

# Show why BCE is better for classification
print("\n" + "=" * 70)
print("Why Binary Cross-Entropy for Classification?")
print("=" * 70)

# Case 1: Confident and correct
y_true_1 = np.array([1])
y_pred_1 = np.array([0.99])
print(f"\nCase 1: True=1, Pred=0.99 (confident, correct)")
print(f"  BCE: {loss.binary_cross_entropy(y_true_1, y_pred_1):.4f} (low penalty)")
print(f"  MSE: {loss.mse(y_true_1, y_pred_1):.4f}")

# Case 2: Confident but wrong
y_true_2 = np.array([1])
y_pred_2 = np.array([0.01])
print(f"\nCase 2: True=1, Pred=0.01 (confident, wrong)")
print(f"  BCE: {loss.binary_cross_entropy(y_true_2, y_pred_2):.4f} (high penalty!)")
print(f"  MSE: {loss.mse(y_true_2, y_pred_2):.4f}")

print("\nBCE penalizes confident wrong predictions more heavily!")

In [None]:
# Visualize loss functions

def plot_loss_functions():
    """
    Visualize how different loss functions behave.
    """
    y_true = 1  # True label = 1
    y_pred_range = np.linspace(0.01, 0.99, 1000)
    
    # Calculate losses for different predictions
    bce_losses = [loss.binary_cross_entropy(np.array([y_true]), 
                                            np.array([pred])) 
                 for pred in y_pred_range]
    mse_losses = [loss.mse(np.array([y_true]), np.array([pred])) 
                 for pred in y_pred_range]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Loss comparison
    ax1.plot(y_pred_range, bce_losses, label='Binary Cross-Entropy', 
            linewidth=2.5, color='blue')
    ax1.plot(y_pred_range, mse_losses, label='Mean Squared Error', 
            linewidth=2.5, color='red', linestyle='--')
    ax1.axvline(x=1.0, color='green', linestyle=':', linewidth=2, 
               alpha=0.7, label='True Value')
    ax1.set_xlabel('Predicted Value', fontsize=12, weight='bold')
    ax1.set_ylabel('Loss', fontsize=12, weight='bold')
    ax1.set_title('Loss Functions (True Label = 1)', fontsize=13, weight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(True, alpha=0.3)
    
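    # Plot 2: Loss gradients with respect to the pre-activation z.
    # Both are taken w.r.t. z (sigmoid output assumed) so the curves are comparable:
    # BCE's dL/dz and MSE's dL/dy_pred are both y_pred - y_true and would overlap.
    bce_grads = [loss.binary_cross_entropy_derivative(np.array([y_true]), 
                                                      np.array([pred]))[0] 
                for pred in y_pred_range]
    # Chain rule for MSE: dL/dz = (y_pred - y_true) * y_pred * (1 - y_pred)
    mse_grads = [loss.mse_derivative(np.array([y_true]), np.array([pred]))[0] 
                 * pred * (1 - pred)
                for pred in y_pred_range]
    
    ax2.plot(y_pred_range, bce_grads, label='BCE Gradient (dL/dz)', 
            linewidth=2.5, color='blue')
    ax2.plot(y_pred_range, mse_grads, label='MSE Gradient (dL/dz)', 
            linewidth=2.5, color='red', linestyle='--')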
    ax2.axhline(y=0, color='black', linewidth=1, alpha=0.3)
    ax2.axvline(x=1.0, color='green', linestyle=':', linewidth=2, 
               alpha=0.7, label='True Value')
    ax2.set_xlabel('Predicted Value', fontsize=12, weight='bold')
    ax2.set_ylabel('Gradient', fontsize=12, weight='bold')
    ax2.set_title('Loss Gradients', fontsize=13, weight='bold')
    ax2.legend(fontsize=11)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_loss_functions()

## 4. Gradient Descent

### What is Gradient Descent?

Gradient descent is an **optimization algorithm** that iteratively adjusts parameters to minimize the loss function.

### The Intuition

Imagine you're on a mountain (loss landscape) and want to reach the lowest valley (minimum loss). You:
1. Look around to find the steepest downward direction (gradient)
2. Take a step in that direction (parameter update)
3. Repeat until you reach the bottom (convergence)

### Mathematical Update Rule

For each parameter $\theta$ (weight or bias):

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{\partial J}{\partial \theta}$$

Where:
- $\alpha$ = learning rate (step size)
- $\frac{\partial J}{\partial \theta}$ = gradient (direction of steepest ascent)
- We subtract because we want to go downhill (minimize)
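
To make the update rule concrete, here is a single gradient descent step on a toy objective. The function $J(\theta) = \theta_1^2 + 2\theta_2^2$, the starting point, and the learning rate are assumptions chosen so the arithmetic can be checked by hand.

```python
import numpy as np

def J(theta):
    """Hypothetical toy objective: J = theta_1^2 + 2 * theta_2^2."""
    return theta[0] ** 2 + 2 * theta[1] ** 2

def grad_J(theta):
    """Gradient of the toy objective: [2 * theta_1, 4 * theta_2]."""
    return np.array([2 * theta[0], 4 * theta[1]])

alpha = 0.1                               # learning rate
theta_old = np.array([1.0, 1.0])          # starting point

# theta_new = theta_old - alpha * dJ/dtheta
theta_new = theta_old - alpha * grad_J(theta_old)

print("theta_new:", theta_new)            # [1 - 0.1*2, 1 - 0.1*4]
print(f"J before: {J(theta_old):.2f}, J after: {J(theta_new):.2f}")
```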

### Learning Rate Effects

- **Too small**: Slow convergence; training may stall on plateaus or in shallow local minima
- **Too large**: Overshooting, divergence, oscillations
- **Just right**: Fast, stable convergence
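
For a quadratic, these three regimes can be predicted exactly. Assuming the toy objective $f(x) = (x - 3)^2 + 1$, each update multiplies the distance to the minimum by $1 - 2\alpha$, so the iterates converge only when $|1 - 2\alpha| < 1$, that is, $0 < \alpha < 1$. The sketch below compares this closed-form prediction with the iterated updates; the objective, starting point, and step count are illustrative assumptions.

```python
# Assumed toy objective: f(x) = (x - 3)^2 + 1, so f'(x) = 2 * (x - 3)
x0, x_min, n_steps = 0.0, 3.0, 10

summary = []
for lr in (0.1, 0.5, 1.1):
    factor = 1 - 2 * lr                  # each step multiplies (x - x_min) by this
    x = x0
    for _ in range(n_steps):
        x = x - lr * 2 * (x - x_min)     # gradient descent update
    predicted = x_min + (x0 - x_min) * factor ** n_steps
    summary.append((lr, factor, x, predicted))
    print(f"lr={lr}: factor={factor:+.1f}, x after {n_steps} steps = {x:.4f}")
```

A factor of magnitude below 1 shrinks the error, a factor of exactly 0 lands on the minimum in one step, and a magnitude above 1 makes the error grow with alternating sign.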

### Variants

1. **Batch Gradient Descent**: Use all training samples for each update
   - Pro: Stable updates; with a suitable learning rate, converges to a minimum (the global one for convex losses)
   - Con: Slow for large datasets

2. **Stochastic Gradient Descent (SGD)**: Use one sample at a time
   - Pro: Fast updates, can escape local minima
   - Con: Noisy, unstable

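3. **Mini-batch Gradient Descent**: Use small batches (e.g., 32, 64, 128)
   - Pro: Balances the stability of batch updates with the speed of SGD; the variant most commonly used in practice
   - Con: Batch size becomes an additional hyperparameter to tune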
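
The three variants differ only in how many samples contribute to each update. The sketch below makes that explicit with a single `batch_size` argument on a noise-free linear regression problem; the function `run_gd`, the synthetic data, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np

def run_gd(X, y, batch_size, lr=0.1, epochs=100, seed=0):
    """Fit y ≈ X @ w by gradient descent; batch_size selects the variant."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                   # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)    # gradient of the half-MSE on this batch
            w -= lr * grad
    return w

# Synthetic, noise-free data (an assumption that keeps the comparison simple)
rng = np.random.default_rng(42)
X_demo = rng.standard_normal((200, 2))
w_true = np.array([1.5, -2.0])
y_demo = X_demo @ w_true

w_batch = run_gd(X_demo, y_demo, batch_size=200)   # batch: all samples per update
w_sgd = run_gd(X_demo, y_demo, batch_size=1)       # SGD: one sample per update
w_mini = run_gd(X_demo, y_demo, batch_size=32)     # mini-batch

print("batch:     ", np.round(w_batch, 3))
print("SGD:       ", np.round(w_sgd, 3))
print("mini-batch:", np.round(w_mini, 3))
```

With noisy targets the SGD estimate would typically fluctuate around the solution rather than settle on it, which is the instability noted in the list above.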

In [None]:
def visualize_gradient_descent_1d():
    """
    Visualize gradient descent on a simple 1D function.
    Function: f(x) = (x - 3)^2 + 1
    """
    def f(x):
        """Simple quadratic function."""
        return (x - 3) ** 2 + 1
    
    def df(x):
        """Derivative of f."""
        return 2 * (x - 3)
    
    # Test different learning rates
    learning_rates = [0.1, 0.5, 1.1]
    colors = ['blue', 'green', 'red']
    
    fig, axes = plt.subplots(1, 3, figsize=(16, 4))
    
    for idx, (lr, color) in enumerate(zip(learning_rates, colors)):
        ax = axes[idx]
        
        # Plot function
        x_range = np.linspace(-2, 8, 1000)
        ax.plot(x_range, f(x_range), 'k-', linewidth=2, label='f(x)')
        
        # Gradient descent
        x = 0.0  # Starting point
        trajectory_x = [x]
        trajectory_y = [f(x)]
        
        for iteration in range(10):
            gradient = df(x)
            x = x - lr * gradient
            trajectory_x.append(x)
            trajectory_y.append(f(x))
            
            # Stop if diverging
            if abs(x) > 10:
                break
        
        # Plot trajectory
        ax.plot(trajectory_x, trajectory_y, 'o-', color=color, 
               markersize=8, linewidth=2, label=f'Trajectory (lr={lr})')
        
        # Mark start and end
        ax.scatter(trajectory_x[0], trajectory_y[0], s=200, c='green', 
                  marker='*', edgecolors='black', linewidth=2, 
                  zorder=5, label='Start')
        if len(trajectory_x) > 1:
            ax.scatter(trajectory_x[-1], trajectory_y[-1], s=200, c='red', 
                      marker='X', edgecolors='black', linewidth=2, 
                      zorder=5, label='End')
        
        ax.set_xlabel('x', fontsize=12, weight='bold')
        ax.set_ylabel('f(x)', fontsize=12, weight='bold')
        ax.set_title(f'Learning Rate = {lr}', fontsize=13, weight='bold')
        ax.legend(fontsize=9, loc='upper right')
        ax.grid(True, alpha=0.3)
        ax.set_ylim(0, 30)
    
    plt.suptitle('Effect of Learning Rate on Gradient Descent', 
                fontsize=15, weight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    print("Observations:")
    print("  lr=0.1 (blue): Slow but stable convergence")
    print("  lr=0.5 (green): Fast convergence to minimum")
    print("  lr=1.1 (red): Overshooting, diverges!")

visualize_gradient_descent_1d()

## 5. Backpropagation

### What is Backpropagation?

**Backpropagation** (backward propagation of errors) is the algorithm for computing gradients in neural networks. It uses the **chain rule** from calculus to efficiently compute $\frac{\partial J}{\partial \theta}$ for all parameters.

### The Chain Rule

If we have $z = f(y)$ and $y = g(x)$, then:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
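
The chain rule can be checked numerically on any composition. The sketch below uses $y = x^2$ and $z = \sin(y)$, both arbitrary choices for illustration, and compares the chain-rule product with a central finite difference.

```python
import numpy as np

inner = lambda x: x ** 2         # y = g(x), hypothetical choice
outer = lambda y: np.sin(y)      # z = f(y), hypothetical choice

x, h = 1.5, 1e-6

# Chain rule: dz/dx = dz/dy * dy/dx = cos(x^2) * 2x
dz_dy = np.cos(inner(x))
dy_dx = 2 * x
analytic = dz_dy * dy_dx

# Central finite difference on the composed function
numeric = (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)

print(f"chain rule: {analytic:.6f}")
print(f"numeric:    {numeric:.6f}")
```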

### Backpropagation Steps

For a 2-layer network:

**Forward Pass** (already computed):
1. $z^{[1]} = W^{[1]} x + b^{[1]}$
2. $a^{[1]} = f^{[1]}(z^{[1]})$
3. $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
4. $a^{[2]} = f^{[2]}(z^{[2]}) = \hat{y}$
5. $J = L(y, \hat{y})$

**Backward Pass** (compute gradients):

Starting from the output:

1. **Output layer gradient**:
   $$\frac{\partial J}{\partial z^{[2]}} = \frac{\partial J}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}}$$

2. **Output layer parameter gradients**:
   $$\frac{\partial J}{\partial W^{[2]}} = \frac{\partial J}{\partial z^{[2]}} \cdot (a^{[1]})^T$$
   $$\frac{\partial J}{\partial b^{[2]}} = \frac{\partial J}{\partial z^{[2]}}$$

3. **Hidden layer gradient** (chain rule!):
   $$\frac{\partial J}{\partial a^{[1]}} = (W^{[2]})^T \cdot \frac{\partial J}{\partial z^{[2]}}$$
   $$\frac{\partial J}{\partial z^{[1]}} = \frac{\partial J}{\partial a^{[1]}} \odot f^{[1]\prime}(z^{[1]})$$
   where $\odot$ denotes the element-wise product, since $f^{[1]}$ acts on each component of $z^{[1]}$ separately.

4. **Hidden layer parameter gradients**:
   $$\frac{\partial J}{\partial W^{[1]}} = \frac{\partial J}{\partial z^{[1]}} \cdot x^T$$
   $$\frac{\partial J}{\partial b^{[1]}} = \frac{\partial J}{\partial z^{[1]}}$$

### Key Insights

- Gradients flow **backwards** through the network
- Each layer's gradient depends on the **next layer's gradient** (chain rule)
- We **reuse** computations from the forward pass
- This is why it's called "back" propagation!
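
A standard way to validate a backpropagation implementation is a gradient check: perturb each weight slightly and compare the resulting change in loss with the analytic gradient. The sketch below does this for $W^{[1]}$ in a small 2 → 2 → 1 sigmoid network with binary cross-entropy loss. The helper names and parameter values are assumptions made for this illustration.

```python
import numpy as np

def sigmoid_fn(z):
    return 1 / (1 + np.exp(-z))

def forward_loss(W1, b1, W2, b2, x, y):
    """Forward pass; returns the BCE loss and both activations."""
    a1 = sigmoid_fn(W1 @ x + b1)
    a2 = sigmoid_fn(W2 @ a1 + b2)
    loss_val = -(y * np.log(a2) + (1 - y) * np.log(1 - a2)).item()
    return loss_val, a1, a2

def backward(W2, x, y, a1, a2):
    """Backward pass following the four steps listed above."""
    dz2 = a2 - y                    # output gradient (BCE + sigmoid)
    dW2, db2 = dz2 @ a1.T, dz2      # output layer parameters
    da1 = W2.T @ dz2                # propagate to the hidden activations
    dz1 = da1 * a1 * (1 - a1)       # element-wise sigmoid derivative
    dW1, db1 = dz1 @ x.T, dz1       # hidden layer parameters
    return dW1, db1, dW2, db2

# Illustrative parameters (assumed values)
x = np.array([[1.0], [2.0]])
y = np.array([[1.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])
b1 = np.array([[0.1], [0.0]])
W2 = np.array([[0.6, -0.4]])
b2 = np.array([[0.2]])

_, a1, a2 = forward_loss(W1, b1, W2, b2, x, y)
dW1, db1, dW2, db2 = backward(W2, x, y, a1, a2)

# Numerical gradient of the loss with respect to each entry of W1
h = 1e-6
dW1_num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += h
        W_minus[i, j] -= h
        loss_plus = forward_loss(W_plus, b1, W2, b2, x, y)[0]
        loss_minus = forward_loss(W_minus, b1, W2, b2, x, y)[0]
        dW1_num[i, j] = (loss_plus - loss_minus) / (2 * h)

print("Backprop dW1:\n", dW1)
print("Numerical dW1:\n", dW1_num)
print("Max abs difference:", np.max(np.abs(dW1 - dW1_num)))
```

The numerical version needs two forward passes per weight, while backpropagation obtains every gradient from one forward and one backward pass. That gap is the efficiency argument in practice.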

In [None]:
def manual_backpropagation_example():
    """
    Detailed backpropagation example with manual gradient calculations.
    
    Network: 2 inputs -> 2 hidden (sigmoid) -> 1 output (sigmoid)
    Loss: Binary cross-entropy
    """
    print("MANUAL BACKPROPAGATION EXAMPLE")
    print("=" * 70)
    
    # Simple network setup
    x = np.array([[1.0], [2.0]])    # Input (2, 1)
    y = np.array([[1.0]])            # True label (1, 1)
    
    # Layer 1: 2 inputs -> 2 hidden
    W1 = np.array([[0.5, -0.3],
                   [0.2, 0.8]])      # (2, 2)
    b1 = np.array([[0.1], [0.0]])    # (2, 1)
    
    # Layer 2: 2 hidden -> 1 output
    W2 = np.array([[0.6, -0.4]])     # (1, 2)
    b2 = np.array([[0.2]])           # (1, 1)
    
    print("\n--- FORWARD PASS ---")
    print(f"Input x:\n{x.T}")
    print(f"True label y: {y[0, 0]}")
    
    # Forward pass - Layer 1
    z1 = np.dot(W1, x) + b1
    a1 = sigmoid(z1)
    print(f"\nLayer 1:")
    print(f"  z1 = W1 @ x + b1 = \n{z1.T}")
    print(f"  a1 = sigmoid(z1) = \n{a1.T}")
    
    # Forward pass - Layer 2
    z2 = np.dot(W2, a1) + b2
    a2 = sigmoid(z2)
    print(f"\nLayer 2:")
    print(f"  z2 = W2 @ a1 + b2 = {z2[0, 0]:.4f}")
    print(f"  a2 = sigmoid(z2) = {a2[0, 0]:.4f} (prediction)")
    
    # Calculate loss
    loss_value = loss.binary_cross_entropy(y, a2)
    print(f"\nLoss: {loss_value:.4f}")
    
    print("\n--- BACKWARD PASS ---")
    
    # Backward pass - Output layer
    # For BCE + sigmoid: dL/dz2 = a2 - y
    dz2 = a2 - y
    print(f"\nOutput layer gradients:")
    print(f"  dL/dz2 = a2 - y = {dz2[0, 0]:.4f}")
    
    # Gradients for W2 and b2
    dW2 = np.dot(dz2, a1.T)
    db2 = dz2
    print(f"  dL/dW2 = dz2 @ a1.T =\n{dW2}")
    print(f"  dL/db2 = dz2 = {db2[0, 0]:.4f}")
    
    # Backward pass - Hidden layer
    # Chain rule: dL/da1 = W2.T @ dz2
    da1 = np.dot(W2.T, dz2)
    print(f"\nHidden layer gradients:")
    print(f"  dL/da1 = W2.T @ dz2 =\n{da1.T}")
    
    # dL/dz1 = dL/da1 * sigmoid'(z1)
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) = a1 * (1 - a1)
    dz1 = da1 * a1 * (1 - a1)
    print(f"  dL/dz1 = da1 * a1 * (1-a1) =\n{dz1.T}")
    
    # Gradients for W1 and b1
    dW1 = np.dot(dz1, x.T)
    db1 = dz1
    print(f"  dL/dW1 = dz1 @ x.T =\n{dW1}")
    print(f"  dL/db1 = dz1 =\n{db1.T}")
    
    print("\n--- PARAMETER UPDATES (lr=0.1) ---")
    lr = 0.1
    
    W2_new = W2 - lr * dW2
    b2_new = b2 - lr * db2
    W1_new = W1 - lr * dW1
    b1_new = b1 - lr * db1
    
    print(f"\nW2: {W2} -> {W2_new}")
    print(f"b2: {b2[0, 0]:.4f} -> {b2_new[0, 0]:.4f}")
    print(f"W1:\n{W1}\n->\n{W1_new}")
    print(f"b1: {b1.T} -> {b1_new.T}")
    
    print("\n" + "=" * 70)
    print("Gradients computed successfully via backpropagation!")
    print("=" * 70)

manual_backpropagation_example()

## 6. Exercises

### Exercise 1: Manual Gradient Calculation

Given a simple network with:
- 1 input: $x = 2.0$
- 1 weight: $w = 0.5$
- 1 bias: $b = 0.3$
- Activation: Sigmoid
- True label: $y = 1$
- Loss: Binary Cross-Entropy

**Calculate by hand**:
1. Forward pass: $z$, $a$ (prediction), and loss $L$
2. Backward pass: $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$
3. New parameters after one gradient descent step with $\alpha = 0.1$

Show your work step by step!

In [None]:
# Exercise 1: Your solution here

# Given values
x = 2.0
w = 0.5
b = 0.3
y = 1.0
lr = 0.1

print("EXERCISE 1: MANUAL GRADIENT CALCULATION")
print("=" * 70)
print(f"Given: x={x}, w={w}, b={b}, y={y}, learning_rate={lr}")
print("\n--- YOUR CALCULATIONS ---\n")

# TODO: Step 1 - Forward pass
# Calculate z = w*x + b
z = None  # Your calculation

# Calculate a = sigmoid(z)
a = None  # Your calculation

# Calculate loss L = -[y*log(a) + (1-y)*log(1-a)]
L = None  # Your calculation

print("Step 1 - Forward Pass:")
print(f"  z = w*x + b = {w}*{x} + {b} = {z}")
print(f"  a = sigmoid(z) = ?")
print(f"  L = BCE(y, a) = ?")

# TODO: Step 2 - Backward pass
# For sigmoid + BCE, dL/dz = a - y
dL_dz = None  # Your calculation

# dL/dw = dL/dz * dz/dw = dL/dz * x
dL_dw = None  # Your calculation

# dL/db = dL/dz * dz/db = dL/dz * 1 = dL/dz
dL_db = None  # Your calculation

print("\nStep 2 - Backward Pass:")
print(f"  dL/dz = a - y = ?")
print(f"  dL/dw = dL/dz * x = ?")
print(f"  dL/db = dL/dz = ?")

# TODO: Step 3 - Parameter updates
# w_new = w - lr * dL/dw
w_new = None  # Your calculation

# b_new = b - lr * dL/db
b_new = None  # Your calculation

print("\nStep 3 - Parameter Updates:")
print(f"  w_new = w - lr*dL/dw = {w} - {lr}*? = ?")
print(f"  b_new = b - lr*dL/db = {b} - {lr}*? = ?")

print("\n" + "=" * 70)

In [None]:
# Solution to Exercise 1

x = 2.0
w = 0.5
b = 0.3
y = 1.0
lr = 0.1

print("EXERCISE 1: SOLUTION")
print("=" * 70)
print(f"Given: x={x}, w={w}, b={b}, y={y}, learning_rate={lr}")

# Step 1: Forward pass
z = w * x + b
a = sigmoid(np.array([z]))[0]
L = loss.binary_cross_entropy(np.array([y]), np.array([a]))

print("\nStep 1 - Forward Pass:")
print(f"  z = w*x + b = {w}*{x} + {b} = {z}")
print(f"  a = sigmoid({z}) = 1/(1+e^(-{z})) = {a:.6f}")
print(f"  L = -[{y}*log({a:.6f}) + {1-y}*log({1-a:.6f})]")
print(f"  L = {L:.6f}")

# Step 2: Backward pass
dL_dz = a - y
dL_dw = dL_dz * x
dL_db = dL_dz

print("\nStep 2 - Backward Pass:")
print(f"  dL/dz = a - y = {a:.6f} - {y} = {dL_dz:.6f}")
print(f"  dL/dw = dL/dz * x = {dL_dz:.6f} * {x} = {dL_dw:.6f}")
print(f"  dL/db = dL/dz = {dL_db:.6f}")

# Step 3: Parameter updates
w_new = w - lr * dL_dw
b_new = b - lr * dL_db

print("\nStep 3 - Parameter Updates (lr=0.1):")
print(f"  w_new = {w} - {lr}*{dL_dw:.6f} = {w_new:.6f}")
print(f"  b_new = {b} - {lr}*{dL_db:.6f} = {b_new:.6f}")

# Verify with new forward pass
z_new = w_new * x + b_new
a_new = sigmoid(np.array([z_new]))[0]
L_new = loss.binary_cross_entropy(np.array([y]), np.array([a_new]))

print("\nVerification - Forward pass with new parameters:")
print(f"  New prediction: {a_new:.6f} (closer to {y}!)")
print(f"  New loss: {L_new:.6f} (lower than {L:.6f}!)")
print(f"  Loss decreased by: {L - L_new:.6f}")

print("\n" + "=" * 70)
print("✓ Gradients correctly computed and parameters updated!")
print("=" * 70)

### Exercise 2: Implement Simple Gradient Descent

Implement gradient descent to find the minimum of the function:

$$f(x) = x^2 - 4x + 7$$

**Tasks**:
1. Implement the function and its derivative
2. Run gradient descent for 20 iterations with learning rate 0.1
3. Start from $x_0 = 0$
4. Plot the trajectory
5. Find the minimum value

**Hint**: The derivative is $f'(x) = 2x - 4$

In [None]:
# Exercise 2: Your solution here

def f(x):
    """Function: f(x) = x^2 - 4x + 7"""
    # TODO: Implement
    pass

def df(x):
    """Derivative: f'(x) = 2x - 4"""
    # TODO: Implement
    pass

# TODO: Initialize
x = 0.0  # Starting point
lr = 0.1  # Learning rate
n_iterations = 20

# TODO: Track trajectory
trajectory_x = []
trajectory_y = []

# TODO: Run gradient descent
for i in range(n_iterations):
    # Calculate gradient
    # Update x
    # Record trajectory
    pass

# TODO: Plot results
# - Plot the function f(x)
# - Plot the trajectory
# - Mark the minimum

In [None]:
# Solution to Exercise 2

def f(x):
    """Function: f(x) = x^2 - 4x + 7"""
    return x**2 - 4*x + 7

def df(x):
    """Derivative: f'(x) = 2x - 4"""
    return 2*x - 4

# Initialize
x = 0.0
lr = 0.1
n_iterations = 20

trajectory_x = [x]
trajectory_y = [f(x)]

print("GRADIENT DESCENT ON f(x) = x^2 - 4x + 7")
print("=" * 70)
print(f"Starting point: x = {x}")
print(f"Learning rate: {lr}")
print(f"\nIteration | x | f(x) | gradient")
print("-" * 70)

# Run gradient descent
for i in range(n_iterations):
    gradient = df(x)
    x = x - lr * gradient
    
    trajectory_x.append(x)
    trajectory_y.append(f(x))
    
    if i < 5 or i == n_iterations - 1:  # Print first 5 and last
        print(f"{i+1:9d} | {x:8.4f} | {f(x):8.4f} | {gradient:8.4f}")
    elif i == 5:
        print("      ... | ... | ... | ...")

print("=" * 70)
print(f"\nFinal x: {x:.6f}")
print(f"Minimum value f(x): {f(x):.6f}")
print(f"\nAnalytical minimum: x = 2, f(2) = 3")
print(f"Our solution: x = {x:.6f}, f(x) = {f(x):.6f}")
print(f"Error: {abs(x - 2):.6f}")

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Function and trajectory
x_range = np.linspace(-1, 5, 1000)
ax1.plot(x_range, f(x_range), 'k-', linewidth=2, label='f(x)')
ax1.plot(trajectory_x, trajectory_y, 'ro-', markersize=6, 
        linewidth=2, alpha=0.7, label='Gradient Descent')
ax1.scatter(trajectory_x[0], trajectory_y[0], s=200, c='green', 
           marker='*', edgecolors='black', linewidth=2, zorder=5, label='Start')
ax1.scatter(trajectory_x[-1], trajectory_y[-1], s=200, c='red', 
           marker='X', edgecolors='black', linewidth=2, zorder=5, label='End')
ax1.scatter(2, 3, s=200, c='yellow', marker='D', 
           edgecolors='black', linewidth=2, zorder=5, label='True Minimum')
ax1.set_xlabel('x', fontsize=12, weight='bold')
ax1.set_ylabel('f(x)', fontsize=12, weight='bold')
ax1.set_title('Gradient Descent Trajectory', fontsize=13, weight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Convergence
ax2.plot(range(len(trajectory_y)), trajectory_y, 'b-o', 
        linewidth=2, markersize=6)
ax2.axhline(y=3, color='green', linestyle='--', linewidth=2, 
           label='True Minimum (f=3)')
ax2.set_xlabel('Iteration', fontsize=12, weight='bold')
ax2.set_ylabel('f(x)', fontsize=12, weight='bold')
ax2.set_title('Convergence Over Time', fontsize=13, weight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Exercise 3: Analyze Learning Rate Impact

Use the same function from Exercise 2 and test different learning rates:
- Try: [0.01, 0.1, 0.5, 0.9, 1.1]
- Run for 30 iterations each
- Plot all trajectories on the same graph
- Determine which learning rate converges fastest
- Identify which rates diverge or oscillate

In [None]:
# Exercise 3: Your solution here

# TODO: Test multiple learning rates
learning_rates = [0.01, 0.1, 0.5, 0.9, 1.1]

# TODO: For each learning rate:
#   - Run gradient descent
#   - Track trajectory
#   - Count iterations to convergence

# TODO: Plot all trajectories
# TODO: Create comparison table

In [None]:
# Solution to Exercise 3

learning_rates = [0.01, 0.1, 0.5, 0.9, 1.1]
colors = ['blue', 'green', 'orange', 'red', 'purple']
n_iterations = 30

plt.figure(figsize=(14, 8))

# Plot function
x_range = np.linspace(-2, 6, 1000)
plt.plot(x_range, f(x_range), 'k-', linewidth=2.5, label='f(x)', alpha=0.5)

results = []

for lr, color in zip(learning_rates, colors):
    x = 0.0  # Reset starting point
    trajectory_x = [x]
    trajectory_y = [f(x)]
    
    converged = False
    converged_iter = 0  # 0 = tolerance not reached within n_iterations
    
    for i in range(n_iterations):
        gradient = df(x)
        x = x - lr * gradient
        trajectory_x.append(x)
        trajectory_y.append(f(x))
        
        # Check convergence (within 0.01 of minimum)
        if not converged and abs(f(x) - 3.0) < 0.01:
            converged = True
            converged_iter = i + 1
        
        # Check divergence
        if abs(x) > 20:
            converged_iter = -1  # Diverged
            break
    
    # Plot trajectory
    plt.plot(trajectory_x, trajectory_y, 'o-', color=color, 
            markersize=4, linewidth=2, alpha=0.7, label=f'lr={lr}')
    
    # Classify the run: a slow run that never reaches the tolerance
    # must not be reported as converged
    if converged_iter > 0:
        status = 'Converged'
    elif converged_iter < 0:
        status = 'Diverged'
    else:
        status = 'Too slow'
    
    # Store results
    final_x = trajectory_x[-1]
    final_fx = trajectory_y[-1]
    results.append({
        'lr': lr,
        'final_x': final_x,
        'final_f': final_fx,
        'converged_iter': converged_iter,
        'status': status
    })

# Mark true minimum
plt.scatter(2, 3, s=300, c='yellow', marker='*', 
           edgecolors='black', linewidth=2, zorder=5, label='True Minimum')

plt.xlabel('x', fontsize=13, weight='bold')
plt.ylabel('f(x)', fontsize=13, weight='bold')
plt.title('Impact of Learning Rate on Gradient Descent', 
         fontsize=15, weight='bold')
plt.legend(fontsize=11, loc='upper right')
plt.grid(True, alpha=0.3)
plt.xlim(-2, 6)
plt.ylim(0, 30)
plt.tight_layout()
plt.show()

# Print summary table
print("\nLEARNING RATE COMPARISON")
print("=" * 80)
print(f"{'LR':<6} | {'Final x':<10} | {'Final f(x)':<12} | {'Converged':<10} | {'Status':<10}")
print("-" * 80)

for r in results:
    conv_str = f"{r['converged_iter']} iters" if r['converged_iter'] > 0 else "No"
    print(f"{r['lr']:<6.2f} | {r['final_x']:<10.4f} | {r['final_f']:<12.4f} | "
          f"{conv_str:<10} | {r['status']:<10}")

print("=" * 80)
print("\nObservations:")
print("  - lr=0.01: Very slow convergence")
print("  - lr=0.1: Good balance - fast and stable")
print("  - lr=0.5: Very fast convergence")
print("  - lr=0.9: Converges, but oscillates around the minimum (close to the lr=1.0 stability limit)")
print("  - lr=1.1: Too large - diverges!")

## 7. Summary

Congratulations! You've mastered the core algorithms that power neural network training. Let's recap:

### Key Concepts

1. **Forward Propagation**
   - Data flows input → hidden → output
   - Each layer: $z = Wx + b$, then $a = f(z)$
   - Final output is the prediction $\hat{y}$

2. **Loss Functions**
   - **MSE**: For regression, $L = \frac{1}{2}(y - \hat{y})^2$
   - **Binary Cross-Entropy**: For binary classification, $L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$
   - Measures prediction error

3. **Gradient Descent**
   - Optimization algorithm: $\theta \leftarrow \theta - \alpha \nabla J$
   - Learning rate $\alpha$ controls step size
   - Too small = slow, too large = divergence

4. **Backpropagation**
   - Efficiently computes gradients using chain rule
   - Flows backward through network
   - Each layer's gradient depends on next layer
   - Enables training of deep networks

5. **The Chain Rule**
   - Foundation of backpropagation
   - $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$
   - Allows gradient propagation through layers

### Important Insights

- **Why Backprop is Efficient**: Reuses forward pass computations, avoids redundant calculations
- **Gradient Flow**: Gradients can vanish (saturating activations such as sigmoid/tanh) or explode (repeated multiplication by large weights in deep networks)
- **Learning Rate**: Critical hyperparameter affecting convergence speed and stability
- **BCE + Sigmoid**: Perfect pair - simple gradient $\hat{y} - y$
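
The vanishing-gradient remark can be made concrete with a rough estimate. The sigmoid derivative never exceeds 0.25, so the activation-derivative factors alone shrink a gradient by at least a factor of 4 per sigmoid layer. The sketch below computes that bound for a few depths; it deliberately ignores the weight matrices, which can shrink or enlarge the gradient further, so treat it as an illustration rather than a prediction for any real network.

```python
import numpy as np

def sigmoid_prime(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

max_slope = sigmoid_prime(0.0)       # the derivative peaks at z = 0

# Best-case product of activation derivatives across `depth` sigmoid layers
for depth in (1, 5, 10):
    print(f"depth {depth:2d}: at most {max_slope ** depth:.2e}")

bound_10 = max_slope ** 10
```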

### What's Next?

In **Module 03: Building Neural Networks with NumPy**, we'll:
- Implement a complete multi-layer perceptron from scratch
- Build forward and backward propagation classes
- Train on real datasets (XOR, circles, MNIST)
- Compare with scikit-learn's MLPClassifier
- Understand every detail of how neural networks work

### Additional Resources

**Classic Papers**:
- Rumelhart et al. (1986): "Learning representations by back-propagating errors"
- LeCun et al. (1998): "Efficient BackProp"

**Tutorials**:
- 3Blue1Brown: "Backpropagation calculus" (YouTube)
- Andrej Karpathy: "micrograd" (minimal backprop implementation)

**Interactive**:
- TensorFlow Playground: See gradient descent in action
- Distill.pub: "Momentum" and optimization visualizations

---

**Ready to build your own neural network from scratch?** Continue to **Module 03: Building Neural Networks with NumPy**!