# Lab 1.4.1: Manual Backpropagation

**Module:** 1.4 - Mathematics for Deep Learning  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the forward pass as a series of function compositions
- [ ] Derive gradients using the chain rule step-by-step
- [ ] Implement backward pass manually without autograd
- [ ] Verify your gradients match PyTorch autograd (within 1e-6)
- [ ] Build intuition for how errors flow backward through networks

---

## 📚 Prerequisites

- Completed: Module 1.2 (Python for AI/ML), Module 1.3 (CUDA Python)
- Knowledge of: Basic calculus (derivatives), matrix multiplication

---

## 🌍 Real-World Context

**Why does this matter?**

Every time you call `loss.backward()` in PyTorch, the magic of backpropagation happens automatically. But when:
- Your model isn't learning (gradients vanishing/exploding?)
- You need to implement a custom layer
- You're debugging NaN losses
- You want to understand why certain architectures work

...you need to understand what's happening under the hood.

**Real example:** When OpenAI trained GPT-3, they discovered gradient issues at scale that required understanding backprop deeply to fix. Engineers who understand backprop can debug what others can't.

---

## 🧒 ELI5: What is Backpropagation?

> **Imagine you're playing a telephone game with your friends...**
>
> You whisper "apple" to Friend 1, who whispers to Friend 2, who whispers to Friend 3, and at the end, Friend 3 says "purple" out loud.
>
> The teacher (loss function) says: "That's wrong! The answer should be 'apple', not 'purple'."
>
> Now, how do you fix this?
>
> **You work BACKWARDS:**
> 1. Ask Friend 3: "What did you hear?" → "I heard 'turtle'"
> 2. Ask Friend 2: "What did you say?" → "I said 'turtle'" → OK, Friend 3 heard correctly
> 3. Ask Friend 2: "What did YOU hear?" → "I heard 'snapple'"
> 4. Ask Friend 1: "What did you say?" → "I said 'snapple'" → Aha! Friend 1 made a mistake!
>
> **In neural network terms:**
> - Each friend = a layer in the network
> - The whisper = the activation passed forward
> - Working backwards = backpropagation
> - Figuring out who messed up = computing gradients
> - Telling each friend to speak more clearly = updating weights

---

## Part 1: The Math Behind Backpropagation

### 1.1 The Chain Rule - The Hero of Deep Learning

The chain rule is THE fundamental concept behind backpropagation.

**Simple version:** If `y = f(g(x))`, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

**Think of it like this:** How much does `y` change when `x` changes?
- First, how much does `g` change when `x` changes?
- Then, how much does `y` change when `g` changes?
- Multiply them together!

### Visual Example

```
x = 3
   ↓ g(x) = x²
g = 9
   ↓ f(g) = 2g + 1
y = 19
```

- dg/dx = 2x = 6 (at x=3)
- dy/dg = 2
- dy/dx = dy/dg × dg/dx = 2 × 6 = 12

Let's verify this numerically!

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("🚀 Mathematics for Deep Learning - Manual Backpropagation")
print("=" * 60)
print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Let's verify the chain rule numerically

def g(x):
    return x ** 2

def f(g_val):
    return 2 * g_val + 1

def y(x):
    return f(g(x))

# Analytical derivatives
def dg_dx(x):
    return 2 * x

def df_dg(g_val):
    return 2  # Constant

def dy_dx_analytical(x):
    """Chain rule: dy/dx = df/dg * dg/dx"""
    return df_dg(g(x)) * dg_dx(x)

# Numerical derivative (for verification)
def numerical_derivative(func, x, eps=1e-7):
    """Compute derivative numerically using finite differences"""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

# Test at x = 3
x_test = 3.0

analytical = dy_dx_analytical(x_test)
numerical = numerical_derivative(y, x_test)

print("Chain Rule Verification")
print("=" * 40)
print(f"At x = {x_test}:")
print(f"  g(x) = x² = {g(x_test)}")
print(f"  y = f(g(x)) = 2g + 1 = {y(x_test)}")
print()
print("Gradients:")
print(f"  dg/dx = 2x = {dg_dx(x_test)}")
print("  df/dg = 2")
print(f"  dy/dx = df/dg × dg/dx = 2 × {dg_dx(x_test)} = {analytical}")
print()
print(f"Verification:")
print(f"  Analytical dy/dx: {analytical}")
print(f"  Numerical dy/dx:  {numerical}")
print(f"  Difference:       {abs(analytical - numerical):.2e}")
print()
print("✅ Chain rule verified!" if abs(analytical - numerical) < 1e-5 else "❌ Something's wrong!")

### 🔍 What Just Happened?

We computed the derivative of a **composite function** two ways:
1. **Analytically** using the chain rule (multiply derivatives along the chain)
2. **Numerically** using finite differences (tiny changes in input/output)

They match! This is exactly what `loss.backward()` does, but for millions of parameters.
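
How close the numerical estimate gets depends on the finite-difference formula and the step size. The sketch below is illustrative only: it uses `math.sin` as a stand-in function (not part of this lab) and two hypothetical helpers, `forward_difference` and `central_difference`, to compare the one-sided and two-sided estimates.

```python
import math

def forward_difference(func, x, eps):
    """One-sided estimate: (f(x + eps) - f(x)) / eps."""
    return (func(x + eps) - func(x)) / eps

def central_difference(func, x, eps):
    """Two-sided estimate: (f(x + eps) - f(x - eps)) / (2 * eps)."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x0 = 1.0
exact = math.cos(x0)  # d/dx sin(x) = cos(x)

for eps in (1e-2, 1e-4, 1e-6):
    fwd_err = abs(forward_difference(math.sin, x0, eps) - exact)
    ctr_err = abs(central_difference(math.sin, x0, eps) - exact)
    print(f"eps={eps:.0e}  forward error={fwd_err:.2e}  central error={ctr_err:.2e}")
```

The two-sided form cancels the first-order error term, which is why the verification cell in this lab uses it.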

---

## Part 2: A Single Neuron - Forward and Backward

Let's start simple: one neuron with one input.

### The Forward Pass

```
Input (x) → [× weight (w)] → [+ bias (b)] → [σ activation] → Output (ŷ)
```

Mathematically:
1. `z = w*x + b` (linear transformation)
2. `ŷ = σ(z)` (activation function)
3. `L = (ŷ - y)²` (loss: squared error for this single sample)

### The Backward Pass (What we need to find)

We need: `∂L/∂w` and `∂L/∂b` (how does the loss change when we change each parameter?)

Using chain rule:
- `∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w`
- `∂L/∂b = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂b`
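
Substituting each local derivative makes the products explicit. This worked form uses only the three forward-pass definitions above:

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y), \qquad \frac{\partial \hat{y}}{\partial z} = \sigma(z)\,(1 - \sigma(z)) = \hat{y}(1 - \hat{y}), \qquad \frac{\partial z}{\partial w} = x, \qquad \frac{\partial z}{\partial b} = 1$$

$$\frac{\partial L}{\partial w} = 2(\hat{y} - y)\,\hat{y}(1 - \hat{y})\,x, \qquad \frac{\partial L}{\partial b} = 2(\hat{y} - y)\,\hat{y}(1 - \hat{y})$$

The two gradients differ only in the last factor, so `∂L/∂w` is always `x` times `∂L/∂b` for this neuron.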

In [None]:
# Single Neuron: Manual Forward and Backward Pass

def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ'(z) = σ(z) × (1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

# Initialize
x = 2.0       # Input
y = 1.0       # Target (ground truth)
w = 0.5       # Weight
b = 0.1       # Bias

print("=" * 60)
print("FORWARD PASS")
print("=" * 60)

# Forward pass (step by step)
z = w * x + b
print(f"Step 1: z = w×x + b = {w}×{x} + {b} = {z}")

y_hat = sigmoid(z)
print(f"Step 2: ŷ = σ(z) = σ({z}) = {y_hat:.6f}")

loss = (y_hat - y) ** 2
print(f"Step 3: L = (ŷ - y)² = ({y_hat:.6f} - {y})² = {loss:.6f}")

print()
print("=" * 60)
print("BACKWARD PASS (Chain Rule)")
print("=" * 60)

# Backward pass (computing gradients)
# ∂L/∂ŷ = 2(ŷ - y)
dL_dy_hat = 2 * (y_hat - y)
print(f"∂L/∂ŷ = 2(ŷ - y) = 2({y_hat:.6f} - {y}) = {dL_dy_hat:.6f}")

# ∂ŷ/∂z = σ'(z) = σ(z)(1 - σ(z))
dy_hat_dz = sigmoid_derivative(z)
print(f"∂ŷ/∂z = σ'(z) = σ(z)(1-σ(z)) = {y_hat:.6f}×{1-y_hat:.6f} = {dy_hat_dz:.6f}")

# ∂z/∂w = x
dz_dw = x
print(f"∂z/∂w = x = {dz_dw}")

# ∂z/∂b = 1
dz_db = 1
print("∂z/∂b = 1")

# Chain rule for final gradients
print()
print("Applying Chain Rule:")
dL_dw = dL_dy_hat * dy_hat_dz * dz_dw
print(f"∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w = {dL_dy_hat:.6f} × {dy_hat_dz:.6f} × {dz_dw} = {dL_dw:.6f}")

dL_db = dL_dy_hat * dy_hat_dz * dz_db
print(f"∂L/∂b = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂b = {dL_dy_hat:.6f} × {dy_hat_dz:.6f} × {dz_db} = {dL_db:.6f}")

In [None]:
# Verify with PyTorch autograd

# Create tensors with gradient tracking
x_t = torch.tensor(x, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)
w_t = torch.tensor(w, dtype=torch.float32, requires_grad=True)
b_t = torch.tensor(b, dtype=torch.float32, requires_grad=True)

# Forward pass
z_t = w_t * x_t + b_t
y_hat_t = torch.sigmoid(z_t)
loss_t = (y_hat_t - y_t) ** 2

# Backward pass (PyTorch does this automatically!)
loss_t.backward()

print("=" * 60)
print("VERIFICATION WITH PYTORCH AUTOGRAD")
print("=" * 60)
print(f"Manual ∂L/∂w:   {dL_dw:.6f}")
print(f"PyTorch ∂L/∂w:  {w_t.grad.item():.6f}")
print(f"Difference:     {abs(dL_dw - w_t.grad.item()):.2e}")
print()
print(f"Manual ∂L/∂b:   {dL_db:.6f}")
print(f"PyTorch ∂L/∂b:  {b_t.grad.item():.6f}")
print(f"Difference:     {abs(dL_db - b_t.grad.item()):.2e}")
print()

if abs(dL_dw - w_t.grad.item()) < 1e-6 and abs(dL_db - b_t.grad.item()) < 1e-6:
    print("🎉 SUCCESS! Manual gradients match PyTorch autograd!")
else:
    print("❌ Gradients don't match. Check your calculations!")

### ✋ Try It Yourself #1

Change the activation function from sigmoid to ReLU and compute the gradients manually.

**ReLU Definition:**
- `ReLU(z) = max(0, z)`
- `ReLU'(z) = 1 if z > 0, else 0`

<details>
<summary>💡 Hint</summary>

The only thing that changes is `∂ŷ/∂z`. For ReLU:
- If z > 0: derivative is 1
- If z ≤ 0: derivative is 0

Everything else (∂L/∂ŷ, ∂z/∂w, ∂z/∂b) stays the same!
</details>

In [None]:
# YOUR CODE HERE: Implement ReLU activation and backward pass

def relu(z):
    """ReLU activation"""
    # TODO: Implement ReLU
    raise NotImplementedError("Implement: return np.maximum(0, z)")

def relu_derivative(z):
    """Derivative of ReLU"""
    # TODO: Implement ReLU derivative
    raise NotImplementedError("Implement: return (z > 0).astype(float)")

# Test with the same inputs
# z = w * x + b  (already computed above)
# TODO: Compute y_hat_relu, loss_relu, and gradients

print("Your manual ReLU gradients:")
# print(f"∂L/∂w = {your_dL_dw}")
# print(f"∂L/∂b = {your_dL_db}")

---

## Part 3: Multi-Layer Perceptron - The Real Challenge

Now let's scale up to a **3-layer network** (input → hidden1 → hidden2 → output).

### Network Architecture

```
Input (2) → Hidden1 (3) → Hidden2 (2) → Output (1)
   x          h1            h2           ŷ
```

### 🧒 ELI5: Why Do We Need Multiple Layers?

> **Imagine you're teaching a child to recognize a dog...**
>
> **One friend (single layer):** "Is it furry? Then it's a dog!" ❌ (Cats are furry too!)
>
> **Three friends working together (multiple layers):**
> - Friend 1: "Does it have fur? Four legs? A tail?"
> - Friend 2: "Combining those... it looks like a mammal pet!"
> - Friend 3: "Does it bark? Chase balls? Then... DOG!" ✅
>
> Each layer extracts more complex features from the previous layer's output!

In [None]:
class ManualMLP:
    """
    A 3-layer MLP implemented from scratch.
    
    Architecture: Input(2) → Hidden1(3) → Hidden2(2) → Output(1)
    Activation: ReLU for hidden layers, Sigmoid for output
    """
    
    def __init__(self, input_size=2, hidden1_size=3, hidden2_size=2, output_size=1):
        # Initialize weights with small random values
        # Using He initialization (suited to ReLU): scale by sqrt(2/fan_in)
        self.W1 = np.random.randn(input_size, hidden1_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((1, hidden1_size))
        
        self.W2 = np.random.randn(hidden1_size, hidden2_size) * np.sqrt(2.0 / hidden1_size)
        self.b2 = np.zeros((1, hidden2_size))
        
        self.W3 = np.random.randn(hidden2_size, output_size) * np.sqrt(2.0 / hidden2_size)
        self.b3 = np.zeros((1, output_size))
        
        # Store intermediate values for backprop
        self.cache = {}
        
    def relu(self, z):
        return np.maximum(0, z)
    
    def relu_derivative(self, z):
        return (z > 0).astype(float)
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # Clip for numerical stability
    
    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def forward(self, X):
        """
        Forward pass through all layers.
        
        Args:
            X: Input data of shape (batch_size, input_size)
            
        Returns:
            Output predictions of shape (batch_size, output_size)
        """
        # Layer 1: Input → Hidden1
        self.cache['X'] = X
        self.cache['z1'] = X @ self.W1 + self.b1          # Linear transformation
        self.cache['h1'] = self.relu(self.cache['z1'])    # ReLU activation
        
        # Layer 2: Hidden1 → Hidden2
        self.cache['z2'] = self.cache['h1'] @ self.W2 + self.b2
        self.cache['h2'] = self.relu(self.cache['z2'])
        
        # Layer 3: Hidden2 → Output
        self.cache['z3'] = self.cache['h2'] @ self.W3 + self.b3
        self.cache['y_hat'] = self.sigmoid(self.cache['z3'])  # Sigmoid for probability output
        
        return self.cache['y_hat']
    
    def compute_loss(self, y_hat, y):
        """Mean Squared Error loss"""
        return np.mean((y_hat - y) ** 2)
    
    def backward(self, y):
        """
        Backward pass - compute gradients for all parameters.
        
        This is where the magic happens! We apply the chain rule
        layer by layer, from output back to input.
        
        Args:
            y: Ground truth labels of shape (batch_size, output_size)
            
        Returns:
            Dictionary containing all gradients
        """
        batch_size = y.shape[0]
        
        # ===== OUTPUT LAYER (Layer 3) =====
        # Loss gradient: ∂L/∂ŷ = 2(ŷ - y) / batch_size
        # (np.mean divides by every element of ŷ; that count equals
        #  batch_size here only because output_size is 1)
        dL_dy_hat = 2 * (self.cache['y_hat'] - y) / batch_size
        
        # Through sigmoid: ∂L/∂z3 = ∂L/∂ŷ × ∂ŷ/∂z3
        dy_hat_dz3 = self.sigmoid_derivative(self.cache['z3'])
        dL_dz3 = dL_dy_hat * dy_hat_dz3
        
        # Parameter gradients for layer 3
        # ∂L/∂W3 = h2ᵀ × ∂L/∂z3
        dL_dW3 = self.cache['h2'].T @ dL_dz3
        # ∂L/∂b3 = sum(∂L/∂z3) over batch
        dL_db3 = np.sum(dL_dz3, axis=0, keepdims=True)
        
        # ===== HIDDEN LAYER 2 =====
        # Propagate gradient back through W3: ∂L/∂h2 = ∂L/∂z3 × W3ᵀ
        dL_dh2 = dL_dz3 @ self.W3.T
        
        # Through ReLU: ∂L/∂z2 = ∂L/∂h2 × ∂h2/∂z2
        dh2_dz2 = self.relu_derivative(self.cache['z2'])
        dL_dz2 = dL_dh2 * dh2_dz2
        
        # Parameter gradients for layer 2
        dL_dW2 = self.cache['h1'].T @ dL_dz2
        dL_db2 = np.sum(dL_dz2, axis=0, keepdims=True)
        
        # ===== HIDDEN LAYER 1 =====
        # Propagate gradient back through W2
        dL_dh1 = dL_dz2 @ self.W2.T
        
        # Through ReLU
        dh1_dz1 = self.relu_derivative(self.cache['z1'])
        dL_dz1 = dL_dh1 * dh1_dz1
        
        # Parameter gradients for layer 1
        dL_dW1 = self.cache['X'].T @ dL_dz1
        dL_db1 = np.sum(dL_dz1, axis=0, keepdims=True)
        
        return {
            'dW1': dL_dW1, 'db1': dL_db1,
            'dW2': dL_dW2, 'db2': dL_db2,
            'dW3': dL_dW3, 'db3': dL_db3
        }
    
    def update_params(self, grads, learning_rate=0.01):
        """Update parameters using gradient descent"""
        self.W1 -= learning_rate * grads['dW1']
        self.b1 -= learning_rate * grads['db1']
        self.W2 -= learning_rate * grads['dW2']
        self.b2 -= learning_rate * grads['db2']
        self.W3 -= learning_rate * grads['dW3']
        self.b3 -= learning_rate * grads['db3']

print("ManualMLP class defined successfully!")
print("Architecture: Input(2) → Hidden1(3) → Hidden2(2) → Output(1)")

In [None]:
# Test the manual implementation

# Create sample data (XOR problem - classic neural network test)
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=np.float64)

y = np.array([[0],
              [1],
              [1],
              [0]], dtype=np.float64)

print("XOR Problem:")
print("Input (X)     Target (y)")
for i in range(len(X)):
    print(f"  {X[i]}     →    {y[i][0]}")
print()

# Initialize network
np.random.seed(42)  # For reproducibility
mlp = ManualMLP()

# Forward pass
y_hat = mlp.forward(X)
loss = mlp.compute_loss(y_hat, y)

print("Initial Forward Pass:")
print(f"Predictions: {y_hat.flatten().round(4)}")
print(f"Targets:     {y.flatten()}")
print(f"Initial Loss: {loss:.6f}")
print()

# Backward pass
grads = mlp.backward(y)

print("Manual Gradients Computed:")
for name, grad in grads.items():
    print(f"  {name}: shape {grad.shape}, mean {grad.mean():.6f}")
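
Finite differences offer a framework-free way to check a backward pass like the one above. The sketch below is a minimal, self-contained illustration on a smaller hypothetical 2-3-1 network; the names `loss_fn`, `analytic_grads`, and `numerical_grads` are assumptions for this example, not part of `ManualMLP`. Biases are drawn randomly so that no pre-activation sits exactly on the ReLU kink at 0, where finite differences and the derivative-is-0 convention disagree.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(params, X, y):
    """MSE of a hypothetical 2-3-1 network: ReLU hidden layer, sigmoid output."""
    h = np.maximum(0, X @ params["W1"] + params["b1"])
    y_hat = sigmoid(h @ params["W2"] + params["b2"])
    return np.mean((y_hat - y) ** 2)

def analytic_grads(params, X, y):
    """Backprop for the same network, following the layer-by-layer pattern."""
    z1 = X @ params["W1"] + params["b1"]
    h = np.maximum(0, z1)
    y_hat = sigmoid(h @ params["W2"] + params["b2"])
    dz2 = 2 * (y_hat - y) / y.size * y_hat * (1 - y_hat)  # MSE, then sigmoid
    dz1 = (dz2 @ params["W2"].T) * (z1 > 0)               # back through W2, then ReLU
    return {
        "W1": X.T @ dz1, "b1": dz1.sum(axis=0, keepdims=True),
        "W2": h.T @ dz2, "b2": dz2.sum(axis=0, keepdims=True),
    }

def numerical_grads(params, X, y, eps=1e-6):
    """Central differences: nudge one entry at a time and re-evaluate the loss."""
    grads = {}
    for name, value in params.items():
        grad = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + eps
            loss_plus = loss_fn(params, X, y)
            value[idx] = original - eps
            loss_minus = loss_fn(params, X, y)
            value[idx] = original  # restore the parameter
            grad[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads[name] = grad
    return grads

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(2, 3)), "b1": rng.normal(size=(1, 3)),
    "W2": rng.normal(size=(3, 1)), "b2": rng.normal(size=(1, 1)),
}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
y = np.array([[0], [1], [1], [0]], dtype=np.float64)

analytic = analytic_grads(params, X, y)
numeric = numerical_grads(params, X, y)
max_diff = max(np.abs(analytic[k] - numeric[k]).max() for k in params)
print(f"Largest analytic-vs-numerical gap: {max_diff:.2e}")
```

This check costs two forward passes per parameter, so it suits small test networks rather than training.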

---

## Part 4: Verifying Against PyTorch Autograd

The moment of truth! Let's create an equivalent PyTorch model and verify our manual gradients match.

In [None]:
import torch
import torch.nn as nn

class PyTorchMLP(nn.Module):
    """Equivalent PyTorch implementation for verification"""
    
    def __init__(self, W1, b1, W2, b2, W3, b3):
        super().__init__()
        
        # Create linear layers with pre-defined weights
        self.fc1 = nn.Linear(2, 3)
        self.fc2 = nn.Linear(3, 2)
        self.fc3 = nn.Linear(2, 1)
        
        # Copy weights from our manual implementation
        self.fc1.weight.data = torch.tensor(W1.T, dtype=torch.float64)
        self.fc1.bias.data = torch.tensor(b1.flatten(), dtype=torch.float64)
        
        self.fc2.weight.data = torch.tensor(W2.T, dtype=torch.float64)
        self.fc2.bias.data = torch.tensor(b2.flatten(), dtype=torch.float64)
        
        self.fc3.weight.data = torch.tensor(W3.T, dtype=torch.float64)
        self.fc3.bias.data = torch.tensor(b3.flatten(), dtype=torch.float64)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

# Create PyTorch model with same weights as our manual MLP
torch_mlp = PyTorchMLP(
    mlp.W1, mlp.b1,
    mlp.W2, mlp.b2,
    mlp.W3, mlp.b3
).double()  # Use float64 for accurate comparison

# Convert data to tensors
X_t = torch.tensor(X, dtype=torch.float64)
y_t = torch.tensor(y, dtype=torch.float64)

# Forward pass
y_hat_t = torch_mlp(X_t)
loss_t = torch.mean((y_hat_t - y_t) ** 2)

# Backward pass
loss_t.backward()

print("=" * 70)
print("GRADIENT VERIFICATION: Manual vs PyTorch Autograd")
print("=" * 70)
print()

# Compare each gradient
comparisons = [
    ('W1', grads['dW1'], torch_mlp.fc1.weight.grad.numpy().T),
    ('b1', grads['db1'], torch_mlp.fc1.bias.grad.numpy().reshape(1, -1)),
    ('W2', grads['dW2'], torch_mlp.fc2.weight.grad.numpy().T),
    ('b2', grads['db2'], torch_mlp.fc2.bias.grad.numpy().reshape(1, -1)),
    ('W3', grads['dW3'], torch_mlp.fc3.weight.grad.numpy().T),
    ('b3', grads['db3'], torch_mlp.fc3.bias.grad.numpy().reshape(1, -1)),
]

all_match = True
for name, manual, pytorch in comparisons:
    max_diff = np.abs(manual - pytorch).max()
    match = max_diff < 1e-6
    all_match = all_match and match
    
    status = "✅" if match else "❌"
    print(f"{status} {name}:")
    print(f"   Manual:  {manual.flatten()[:4]}..." if manual.size > 4 else f"   Manual:  {manual.flatten()}")
    print(f"   PyTorch: {pytorch.flatten()[:4]}..." if pytorch.size > 4 else f"   PyTorch: {pytorch.flatten()}")
    print(f"   Max difference: {max_diff:.2e}")
    print()

print("=" * 70)
if all_match:
    print("🎉 ALL GRADIENTS MATCH! Your manual backprop is correct!")
    print("   You've just implemented deep learning from scratch!")
else:
    print("❌ Some gradients don't match. Review your chain rule derivations.")

---

## Part 5: Training the Network

Let's use our manual implementation to actually train on the XOR problem!

In [None]:
# Train the network from scratch

# Re-initialize with fresh weights
np.random.seed(123)
mlp = ManualMLP()

# Training hyperparameters
learning_rate = 1.0  # XOR needs aggressive learning
epochs = 1000

# Track losses for plotting
losses = []

print("Training Manual MLP on XOR Problem")
print("=" * 50)

for epoch in range(epochs):
    # Forward pass
    y_hat = mlp.forward(X)
    loss = mlp.compute_loss(y_hat, y)
    losses.append(loss)
    
    # Backward pass
    grads = mlp.backward(y)
    
    # Update parameters
    mlp.update_params(grads, learning_rate)
    
    # Print progress
    if epoch % 200 == 0 or epoch == epochs - 1:
        predictions = (y_hat > 0.5).astype(int)
        accuracy = np.mean(predictions == y) * 100
        print(f"Epoch {epoch:4d}: Loss = {loss:.6f}, Accuracy = {accuracy:.1f}%")

print()
print("Final Predictions:")
print("Input          Predicted    Target")
for i in range(len(X)):
    pred = y_hat[i, 0]
    pred_class = 1 if pred > 0.5 else 0
    correct = "✅" if pred_class == y[i, 0] else "❌"
    print(f"{X[i]}  →  {pred:.4f} ({pred_class})    {int(y[i, 0])}  {correct}")

In [None]:
# Visualize the training

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(losses, 'b-', linewidth=1)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss (MSE)', fontsize=12)
axes[0].set_title('Training Loss Over Time', fontsize=14)
axes[0].set_yscale('log')  # Log scale shows learning better
axes[0].grid(True, alpha=0.3)
axes[0].axhline(y=0.01, color='r', linestyle='--', label='Target threshold')
axes[0].legend()

# Decision boundary visualization
# Create a grid to visualize what the network learned
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100),
                     np.linspace(-0.5, 1.5, 100))
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = mlp.forward(grid_points).reshape(xx.shape)

# Plot decision regions
contour = axes[1].contourf(xx, yy, Z, levels=np.linspace(0, 1, 11), 
                          cmap='RdYlBu_r', alpha=0.8)
plt.colorbar(contour, ax=axes[1], label='Prediction')

# Plot training points
for i, (xi, yi) in enumerate(zip(X, y)):
    color = 'red' if yi[0] == 0 else 'blue'
    marker = 'o' if yi[0] == 0 else 's'
    axes[1].scatter(xi[0], xi[1], c=color, s=200, marker=marker, 
                   edgecolors='black', linewidth=2,
                   label=f'Class {int(yi[0])}' if i < 2 else None)

axes[1].set_xlabel('Input 1', fontsize=12)
axes[1].set_ylabel('Input 2', fontsize=12)
axes[1].set_title('Learned XOR Decision Boundary', fontsize=14)
axes[1].legend(loc='upper right')
axes[1].set_xlim(-0.5, 1.5)
axes[1].set_ylim(-0.5, 1.5)

plt.tight_layout()
plt.show()

print("\n📊 Training complete!")

### 🔍 What Just Happened?

We trained a neural network **entirely from scratch** using our manual backpropagation!

- **Left plot:** Loss decreases over training (note the log scale)
- **Right plot:** The network learned the XOR function!
  - Blue regions: Network predicts class 1
  - Red regions: Network predicts class 0
  - The boundary is non-linear (impossible with a single layer!)
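
The claim that a single layer cannot produce this boundary can be tried directly. The sketch below trains a hypothetical single sigmoid neuron (no hidden layer) on XOR, with the same MSE loss and gradient descent; the learning rate and step count are arbitrary choices for illustration.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
y = np.array([[0], [1], [1], [0]], dtype=np.float64)

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 1))
b = np.zeros((1, 1))

for _ in range(5000):
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    dz = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)  # MSE, then sigmoid
    w -= 1.0 * (X.T @ dz)
    b -= 1.0 * dz.sum(axis=0, keepdims=True)

y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = np.mean((y_hat > 0.5) == y)
print(f"Single-layer accuracy on XOR: {accuracy:.2f}")
```

A single neuron draws one straight line through the input plane, and no straight line separates the four XOR points, so at most three of them can be classified correctly.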

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting the Chain Rule Order

```python
# ❌ Wrong: element-wise product copied from the scalar chain rule
dL_dW = dz_dW * dL_dz  # This doesn't work for matrices!

# ✅ Right: proper matrix multiplication
dL_dW = h_prev.T @ dL_dz  # Previous layer output transposed × gradient
```

**Why:** Matrix dimensions must align. `dL_dW` should have same shape as `W`.
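
A quick shape check makes the point concrete. The sizes below are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 4, 3, 2

h_prev = rng.normal(size=(batch_size, n_in))   # previous layer output
W = rng.normal(size=(n_in, n_out))             # layer weights
dL_dz = rng.normal(size=(batch_size, n_out))   # upstream gradient

dL_dW = h_prev.T @ dL_dz                       # (n_in, batch) @ (batch, n_out)
print(dL_dW.shape == W.shape)

try:
    h_prev * dL_dz                             # element-wise: (4, 3) vs (4, 2)
    elementwise_failed = False
except ValueError:
    elementwise_failed = True
print(elementwise_failed)
```

The matrix product also sums over the batch dimension, which is how every sample's contribution ends up in a single `(n_in, n_out)` gradient.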

### Mistake 2: Missing Batch Dimension

```python
# ❌ Wrong: bias gradient without summing over batch
dL_db = dL_dz  # Shape: (batch_size, hidden_size) - wrong!

# ✅ Right: sum over batch dimension
dL_db = np.sum(dL_dz, axis=0, keepdims=True)  # Shape: (1, hidden_size)
```

**Why:** Each sample contributes to the gradient; we sum their contributions.
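
The sum follows from broadcasting: the same bias row is added to every sample, so its gradient collects one term per sample. A minimal check against finite differences, using a toy loss `0.5 * sum(z**2)` chosen here only because its gradient with respect to `z` is `z` itself:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 2))

def loss_fn(bias):
    z = h @ W + bias          # bias is broadcast to every row of the batch
    return 0.5 * np.sum(z ** 2)

dL_dz = h @ W + b             # for this toy loss, dL/dz = z
dL_db = np.sum(dL_dz, axis=0, keepdims=True)

eps = 1e-6
numeric = np.zeros_like(b)
for j in range(b.shape[1]):
    step = np.zeros_like(b)
    step[0, j] = eps
    numeric[0, j] = (loss_fn(b + step) - loss_fn(b - step)) / (2 * eps)

print(np.allclose(dL_db, numeric, atol=1e-6))
```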

### Mistake 3: ReLU Derivative at Zero

```python
# ❌ Scalar-only: raises on NumPy arrays (ambiguous truth value)
def relu_deriv(z):
    return 1 if z > 0 else 0  # Gives 0 at z == 0, but only works for scalars

# ✅ Vectorized, with the standard convention of 0 at z=0
def relu_deriv(z):
    return (z > 0).astype(float)  # z=0 → derivative=0
```

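**Why:** ReLU has no derivative at exactly z=0, so a convention is needed, and using 0 is standard. Exact zeros are rare for random floating-point inputs, but they do occur in this lab: with zero biases, the XOR input `[0, 0]` gives `z1 = 0`.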
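
The two styles can be compared on a small array. The function names below are hypothetical, renamed so both versions can coexist in one cell:

```python
import numpy as np

def relu_deriv_scalar(z):
    return 1 if z > 0 else 0

def relu_deriv_vectorized(z):
    return (np.asarray(z) > 0).astype(float)

z = np.array([-1.5, 0.0, 2.0])
print(relu_deriv_vectorized(z))

try:
    relu_deriv_scalar(z)          # `if` on a multi-element array is ambiguous
    scalar_failed = False
except ValueError:
    scalar_failed = True
print(scalar_failed)
```

Both versions return 0 at `z == 0`; the practical difference is that only the vectorized one accepts a batch of pre-activations.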

### Mistake 4: Not Normalizing Loss by Batch Size

```python
# ❌ Wrong: gradient grows with batch size
loss = np.sum((y_hat - y) ** 2)
dL_dy_hat = 2 * (y_hat - y)

# ✅ Right: normalize by batch size
loss = np.mean((y_hat - y) ** 2)
dL_dy_hat = 2 * (y_hat - y) / batch_size
```

**Why:** Without normalization, larger batches = larger gradients, which breaks learning rate tuning.
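
The effect is easy to measure by repeating the same data ten times. This sketch uses a hypothetical linear model and helper `weight_grad`, chosen only to keep the gradient formula short:

```python
import numpy as np

def weight_grad(X, y, w, reduce):
    """Gradient of a linear model's squared error w.r.t. w, for 'sum' or 'mean' reduction."""
    residual = X @ w - y
    scale = 1.0 if reduce == "sum" else 1.0 / len(X)
    return X.T @ (2 * residual * scale)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
w = rng.normal(size=(3, 1))

# Same data, 10x the batch size
X_big, y_big = np.tile(X, (10, 1)), np.tile(y, (10, 1))

sum_ratio = (np.linalg.norm(weight_grad(X_big, y_big, w, "sum"))
             / np.linalg.norm(weight_grad(X, y, w, "sum")))
mean_ratio = (np.linalg.norm(weight_grad(X_big, y_big, w, "mean"))
              / np.linalg.norm(weight_grad(X, y, w, "mean")))
print(f"sum reduction:  gradient norm grows {sum_ratio:.1f}x")
print(f"mean reduction: gradient norm grows {mean_ratio:.1f}x")
```

With the sum, a learning rate tuned for one batch size would need to be rescaled for another; with the mean, it carries over.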

---

## ✋ Try It Yourself #2: Add a Layer

Modify the `ManualMLP` class to add a fourth layer:

```
Input(2) → Hidden1(4) → Hidden2(3) → Hidden3(2) → Output(1)
```

Steps:
1. Add `W4`, `b4` to `__init__`
2. Add layer 4 in `forward()`
3. Add layer 4 gradients in `backward()`
4. Verify against PyTorch

<details>
<summary>💡 Hint</summary>

The pattern is consistent for each layer:
```python
# In backward():
# 1. Get gradient from next layer: dL_dh_current = dL_dz_next @ W_next.T
# 2. Through activation: dL_dz_current = dL_dh_current * activation_derivative(z_current)
# 3. Weight gradient: dL_dW_current = h_prev.T @ dL_dz_current
# 4. Bias gradient: dL_db_current = sum(dL_dz_current, axis=0)
```
</details>

In [None]:
# YOUR CODE HERE: Implement 4-Layer MLP

class ManualMLP4Layer:
    """4-layer MLP for practice"""
    
    def __init__(self):
        # TODO: Initialize W1, b1, W2, b2, W3, b3, W4, b4
        raise NotImplementedError("Complete this exercise")
    
    def forward(self, X):
        # TODO: Implement 4-layer forward pass
        raise NotImplementedError("Complete this exercise")
    
    def backward(self, y):
        # TODO: Implement 4-layer backward pass
        raise NotImplementedError("Complete this exercise")

# Test your implementation
# mlp4 = ManualMLP4Layer()
# y_hat = mlp4.forward(X)
# grads = mlp4.backward(y)
# Verify against PyTorch...

---

## 🚀 Challenge: Implement with a Different Loss Function

Modify the network to use **Binary Cross-Entropy** loss instead of MSE:

$$L = -\frac{1}{n}\sum_{i}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$

The gradient with respect to each prediction is:
$$\frac{\partial L}{\partial \hat{y}} = \frac{1}{n}\left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)$$

Can you derive why this simplifies to:
$$\frac{\partial L}{\partial z} = \frac{1}{n}(\hat{y} - y)$$

when combined with sigmoid activation?

<details>
<summary>💡 Mathematical hint</summary>

Use the fact that:
- $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$ for sigmoid
- Multiply: $\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}$
- After algebra: $\frac{\partial L}{\partial z} = \frac{1}{n}(\hat{y} - y)$

This is beautiful! Cross-entropy + sigmoid gives a simple gradient, which is why the two are so often used together.
</details>
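
Once you have derived the result, a numerical check can confirm it. The sketch below is an illustration with made-up logits and a hypothetical helper `bce_from_logits`; it compares `(ŷ - y) / n` against central differences of the mean binary cross-entropy taken with respect to `z`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(z, y):
    """Mean binary cross-entropy of sigmoid(z) against targets y."""
    y_hat = sigmoid(z)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([[-1.2], [0.3], [2.0], [-0.4]])   # made-up pre-activations
y = np.array([[0.0], [1.0], [1.0], [0.0]])
n = y.shape[0]

analytic = (sigmoid(z) - y) / n

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(n):
    step = np.zeros_like(z)
    step[i, 0] = eps
    numeric[i, 0] = (bce_from_logits(z + step, y) - bce_from_logits(z - step, y)) / (2 * eps)

max_diff = np.abs(analytic - numeric).max()
print(f"Largest gap: {max_diff:.2e}")
```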

---

## 🎉 Checkpoint

Congratulations! You've learned:

- ✅ **Chain rule** is the foundation of backpropagation
- ✅ **Forward pass** computes activations layer by layer
- ✅ **Backward pass** computes gradients in reverse order
- ✅ Each layer's gradients depend on the gradient passed back from the layer after it (chain rule!)
- ✅ Your manual gradients can match autograd to within floating-point error

**Key insight:** Backpropagation isn't magic: it's just the chain rule applied systematically!

---

## 📖 Further Reading

- [3Blue1Brown: Backpropagation Calculus](https://www.youtube.com/watch?v=tIeHLnjs5U8) - Visual intuition
- [CS231n: Backprop Notes](http://cs231n.github.io/optimization-2/) - Stanford's excellent notes
- [Andrej Karpathy: micrograd](https://github.com/karpathy/micrograd) - Tiny autograd engine
- [The Matrix Calculus You Need for Deep Learning](https://explained.ai/matrix-calculus/) - Deep dive

---

## 🧹 Cleanup

---

## 📦 Using Production-Ready Implementations

This module includes production-ready implementations in the `scripts/` folder:

- **`math_utils.py`**: Activation functions, loss functions, optimizers (SGD, Adam, AdamW)
- **`visualization_utils.py`**: Loss landscape plotting, training curves, SVD analysis

You can import these for your own projects or to verify your implementations.

In [None]:
# Example: Using production-ready implementations from scripts/
import sys
sys.path.insert(0, '..')  # Add parent directory to path

# Import from scripts
from scripts.math_utils import (
    sigmoid, sigmoid_derivative,
    relu, relu_derivative,
    softmax,
    mse_loss, cross_entropy_loss,
    SGD, SGDMomentum, Adam, AdamW,
    numerical_gradient, check_gradient
)

# Test the implementations
test_x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Testing script implementations:")
print(f"  sigmoid({test_x}) = {sigmoid(test_x).round(4)}")
print(f"  relu({test_x}) = {relu(test_x)}")
print(f"  softmax({test_x}) = {softmax(test_x).round(4)}")

# Test optimizer
opt = Adam(lr=0.1)
params = np.array([5.0, 5.0])
print(f"\nAdam optimizer test:")
print(f"  Initial params: {params}")
for i in range(3):
    grads = 2 * params  # Gradient of x^2 + y^2
    params = opt.step(params, grads)
print(f"  After 3 steps: {params.round(4)}")

print("\n✅ Script implementations work correctly!")

In [None]:
# Clear memory
import gc

if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()

print("✅ Cleanup complete!")
print("\n➡️  Next: Lab 1.4.2 - Optimizer Implementation")