# Task 4.1: NumPy Neural Network from Scratch

**Module:** 4 - Neural Network Fundamentals  
**Time:** 4 hours  
**Difficulty:** ⭐⭐⭐⭐ (Challenging but rewarding!)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how neural networks learn through forward and backward passes
- [ ] Implement a fully-connected (Linear) layer from scratch
- [ ] Implement ReLU activation with proper gradient computation
- [ ] Implement Softmax + Cross-Entropy for classification
- [ ] Build a complete training loop with SGD optimizer
- [ ] Train a network on MNIST to >95% accuracy

---

## 📚 Prerequisites

- Completed: Module 2 (NumPy proficiency)
- Completed: Module 3 (Matrix calculus, chain rule)
- Knowledge of: Matrix multiplication, derivatives

---

## 🌍 Real-World Context

**Why build from scratch when PyTorch exists?**

Understanding the internals helps you:
1. **Debug effectively** - When training fails, you know where to look
2. **Optimize performance** - Knowing bottlenecks helps you speed things up
3. **Design new architectures** - Innovation requires understanding primitives
4. **Interview prep** - Top AI companies ask about these fundamentals

Implementing a small neural network from scratch is a common way for ML engineers to show that they understand these fundamentals. It's a rite of passage!

---

## 🧒 ELI5: How Neural Networks Learn

> **Imagine you're playing a game of "Hot and Cold" with a friend who's hiding.**
>
> You take a step in some direction. Your friend says "warmer" or "colder." Based on that feedback, you adjust your direction and step size. Over many steps, you find your friend!
>
> **A neural network works the same way:**
> 1. **Forward Pass**: Make a prediction (take a step)
> 2. **Loss Calculation**: Check how wrong we were ("warmer" or "colder")
> 3. **Backward Pass**: Figure out which direction to adjust (where did I go wrong?)
> 4. **Weight Update**: Take a small step in the right direction
>
> **The magic of backpropagation** is that it can trace blame back through many layers. If the output is wrong, it figures out how much each neuron in each layer contributed to that mistake.

---

## Setup and Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Dict, Optional
import time
import sys
import os

# Add scripts directory to path
sys.path.insert(0, os.path.join(os.path.dirname(os.getcwd()), 'scripts'))

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib
plt.style.use('default')
%matplotlib inline

print("✅ Setup complete!")
print(f"NumPy version: {np.__version__}")

---

## Part 1: Understanding the Forward Pass

### What happens in a neural network layer?

A single neuron computes:
$$y = \sigma(w_1 x_1 + w_2 x_2 + ... + w_n x_n + b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

Where:
- $\mathbf{x}$ = input vector
- $\mathbf{w}$ = weight vector (learnable)
- $b$ = bias (learnable)
- $\sigma$ = activation function (adds non-linearity)

For a whole layer with many neurons, we use matrix multiplication:
$$\mathbf{Y} = \sigma(\mathbf{X} \mathbf{W} + \mathbf{b})$$

Let's trace through this with concrete numbers:

In [None]:
# Let's trace through a simple example
print("🔍 Tracing a Forward Pass")
print("=" * 50)

# Input: batch of 2 samples, each with 3 features
X = np.array([
    [1.0, 2.0, 3.0],   # Sample 1
    [4.0, 5.0, 6.0]    # Sample 2
])
print(f"Input X (2 samples, 3 features):")
print(X)

# Weights: 3 input features -> 4 output neurons
W = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2]
])
print(f"\nWeights W (3 inputs -> 4 neurons):")
print(W)

# Bias: one per output neuron
b = np.array([0.1, 0.2, 0.3, 0.4])
print(f"\nBias b (4 neurons): {b}")

# Linear transformation
Z = X @ W + b
print(f"\nLinear output Z = X @ W + b:")
print(Z)
print(f"Shape: {Z.shape}")

# ReLU activation
A = np.maximum(0, Z)
print(f"\nAfter ReLU activation:")
print(A)

### 🔍 What Just Happened?

1. We took a batch of 2 input samples (each with 3 features)
2. Matrix multiplication `X @ W` transformed 3 features → 4 neurons
3. Adding bias shifted each neuron's output
4. ReLU removed negative values (keeping only positive "activations")

**Key insight:** The weights and biases are the "learnable parameters" that we'll adjust during training!
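
To connect the single-neuron formula with the matrix form, the sketch below recomputes every entry of `Z` with an explicit per-neuron loop and compares the result with `X @ W + b`. It reuses the numbers from the trace above; the helper name `neuron_output` is hypothetical and exists only for this illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
W = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]])
b = np.array([0.1, 0.2, 0.3, 0.4])


def neuron_output(x: np.ndarray, w: np.ndarray, bias: float) -> float:
    """Weighted sum for ONE neuron and ONE sample: w_1*x_1 + ... + w_n*x_n + b."""
    total = bias
    for x_i, w_i in zip(x, w):
        total += x_i * w_i
    return float(total)


# Matrix form: b has shape (4,) and is broadcast across both rows of X @ W
Z = X @ W + b

# Loop form: column j of W holds the weights of neuron j
z_manual = np.array([[neuron_output(X[i], W[:, j], b[j]) for j in range(4)]
                     for i in range(2)])

print("Matrix form and loop form agree:", np.allclose(Z, z_manual))
```

The matrix form does exactly the same arithmetic, but NumPy runs it in optimized compiled code instead of a Python loop.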

---

## Part 2: Implementing the Linear Layer

Now let's create a proper `Linear` layer class with both forward and backward passes.

### The Math Behind Backpropagation

Given: $Z = XW + b$ and gradient from above: $\frac{\partial L}{\partial Z}$

We need to compute:
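- $\frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Z}$ (to update weights)
- $\frac{\partial L}{\partial b} = \sum_i \left(\frac{\partial L}{\partial Z}\right)_i$, summed over the batch index $i$ (to update bias)
- $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W^T$ (to pass to previous layer)

The `Linear` class in this notebook receives per-sample gradients as $\frac{\partial L}{\partial Z}$ and divides `dW` and `db` by the batch size, so the stored gradients are batch averages.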
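
One way to build trust in these three formulas is a numerical gradient check. The sketch below is an illustration under stated assumptions, not part of the notebook's classes: it uses a toy loss `L = sum(Z * G)` (chosen so that the gradient with respect to `Z` is exactly `G`), and the names `G`, `loss`, and `numerical_grad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # batch of 5 samples, 3 features
W = rng.standard_normal((3, 4))   # 3 inputs -> 4 neurons
b = rng.standard_normal(4)
G = rng.standard_normal((5, 4))   # stands in for dL/dZ coming from the layer above


def loss(X, W, b):
    """Toy scalar loss L = sum(Z * G), chosen so that dL/dZ is exactly G."""
    return float(np.sum((X @ W + b) * G))


# Analytic gradients from the three formulas (no batch averaging here)
dW = X.T @ G
db = G.sum(axis=0)
dX = G @ W.T


def numerical_grad(f, arr, eps=1e-6):
    """Central finite differences, perturbing one entry of `arr` at a time."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = f()
        arr[idx] = old - eps
        minus = f()
        arr[idx] = old               # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


f = lambda: loss(X, W, b)
print("dW matches:", np.allclose(dW, numerical_grad(f, W), atol=1e-5))
print("db matches:", np.allclose(db, numerical_grad(f, b), atol=1e-5))
print("dX matches:", np.allclose(dX, numerical_grad(f, X), atol=1e-5))
```

The same finite-difference idea works for any layer: nudge one number, watch the loss change, and compare with what backpropagation reports.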

In [None]:
class Linear:
    """
    Fully-connected linear layer.
    
    ELI5: A Linear layer is like a voting committee. Each input feature
    is a voter, and each weight represents how much we trust that voter.
    The bias is a starting opinion before any votes come in.
    """
    
    def __init__(self, in_features: int, out_features: int, init: str = 'he'):
        """
        Initialize the linear layer.
        
        Args:
            in_features: Number of input features
            out_features: Number of output features (neurons)
            init: Initialization method ('he' for ReLU, 'xavier' for tanh/sigmoid)
        """
        self.in_features = in_features
        self.out_features = out_features
        
        # Choose the weight scale: He (suited to ReLU) or Xavier/Glorot (tanh/sigmoid)
        if init == 'he':
            std = np.sqrt(2.0 / in_features)
        elif init == 'xavier':
            std = np.sqrt(2.0 / (in_features + out_features))
        else:
            raise ValueError(f"Unknown init method: {init!r}")
            
        self.W = np.random.randn(in_features, out_features) * std
        self.b = np.zeros(out_features)
        
        # Cache for backward pass
        self.cache = {}
        
        # Gradients
        self.dW = None
        self.db = None
    
    def forward(self, X: np.ndarray) -> np.ndarray:
        """
        Forward pass: Z = X @ W + b
        
        Args:
            X: Input of shape (batch_size, in_features)
            
        Returns:
            Output of shape (batch_size, out_features)
        """
        # Save input for backward pass
        self.cache['X'] = X
        
        # Linear transformation
        Z = X @ self.W + self.b
        return Z
    
    def backward(self, dZ: np.ndarray) -> np.ndarray:
        """
        Backward pass: compute gradients.
        
        Args:
            dZ: Gradient from above, shape (batch_size, out_features)
            
        Returns:
            Gradient to pass to previous layer, shape (batch_size, in_features)
        """
        X = self.cache['X']
        batch_size = X.shape[0]
        
        # Gradient for weights: X^T @ dZ, averaged over batch
        self.dW = X.T @ dZ / batch_size
        
        # Gradient for bias: mean over the batch dimension (same averaging as dW)
        self.db = np.mean(dZ, axis=0)
        
        # Gradient for input (to pass to previous layer)
        dX = dZ @ self.W.T
        
        return dX
    
    def __call__(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X)

In [None]:
# Test the Linear layer
print("🧪 Testing Linear Layer")
print("=" * 50)

# Create layer: 784 inputs -> 256 outputs (like first layer for MNIST)
layer = Linear(784, 256)
print(f"Layer: {layer.in_features} -> {layer.out_features}")
print(f"Weight shape: {layer.W.shape}")
print(f"Bias shape: {layer.b.shape}")

# Forward pass
X = np.random.randn(32, 784)  # Batch of 32 images
Z = layer(X)
print(f"\nForward pass: {X.shape} -> {Z.shape}")

# Backward pass
dZ = np.random.randn(32, 256)  # Gradient from above
dX = layer.backward(dZ)
print(f"Backward pass: gradient shape = {dX.shape}")
print(f"Weight gradient shape: {layer.dW.shape}")
print(f"Bias gradient shape: {layer.db.shape}")

print("\n✅ Linear layer works correctly!")
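
To see why the `Linear` layer defaults to He initialization for ReLU networks, the sketch below pushes random inputs through a stack of Linear + ReLU layers and reports how large the final activations are. This is a rough illustration, not a benchmark: the width of 256, the depth of 10, the fixed `0.01` baseline, and the helper `final_activation_std` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def final_activation_std(weight_std_fn, width=256, depth=10):
    """Push random inputs through `depth` Linear+ReLU layers (no bias).

    Returns the standard deviation of the activations after the last layer.
    """
    a = rng.standard_normal((512, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(0, a @ W)
    return float(a.std())


he_std = final_activation_std(lambda n: np.sqrt(2.0 / n))   # He initialization
tiny_std = final_activation_std(lambda n: 0.01)             # fixed small std

print(f"He init:  activation std after 10 layers = {he_std:.3f}")
print(f"std=0.01: activation std after 10 layers = {tiny_std:.3e}")
```

With the He scale the signal keeps roughly the same magnitude from layer to layer, while the fixed small scale shrinks it toward zero, which in turn shrinks the gradients that flow back.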

---

## Part 3: Implementing ReLU Activation

### Why do we need activation functions?

Without activations, stacking linear layers would just be one big linear transformation:
$$XW_1W_2 = X(W_1W_2) = XW_{combined}$$

Activations add **non-linearity**, allowing the network to learn complex patterns.
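
A quick numerical check of this claim, as a minimal sketch with arbitrary random shapes (biases are left out, as in the equation above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))
W1 = rng.standard_normal((5, 8))
W2 = rng.standard_normal((8, 3))

two_layers = (X @ W1) @ W2               # two stacked linear layers, no activation
one_layer = X @ (W1 @ W2)                # one layer with the combined weights
print("Without activation, same result:", np.allclose(two_layers, one_layer))

with_relu = np.maximum(0, X @ W1) @ W2   # ReLU between the layers
print("With ReLU, same result:", np.allclose(with_relu, one_layer))
```

The first comparison agrees and the second does not: once ReLU sits between the layers, no single weight matrix can reproduce the network.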

### ReLU: The Most Popular Activation

$$\text{ReLU}(x) = \max(0, x)$$

**Derivative:**
$$\frac{d}{dx}\text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$

In [None]:
class ReLU:
    """
    Rectified Linear Unit activation.
    
    ELI5: ReLU is like a one-way valve. Positive values flow through
    unchanged, but negative values get blocked (become zero).
    This creates "sparsity" - many neurons output zero, which is efficient!
    """
    
    def __init__(self):
        self.cache = {}
    
    def forward(self, Z: np.ndarray) -> np.ndarray:
        """
        Forward pass: A = max(0, Z)
        """
        self.cache['Z'] = Z
        return np.maximum(0, Z)
    
    def backward(self, dA: np.ndarray) -> np.ndarray:
        """
        Backward pass: gradient is 1 where Z > 0, else 0.
        """
        Z = self.cache['Z']
        # Gradient flows through where Z was positive
        dZ = dA * (Z > 0).astype(float)
        return dZ
    
    def __call__(self, Z: np.ndarray) -> np.ndarray:
        return self.forward(Z)

In [None]:
# Visualize ReLU
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# ReLU function
x = np.linspace(-3, 3, 100)
relu = ReLU()
y = relu(x)

axes[0].plot(x, y, 'b-', linewidth=2)
axes[0].axhline(y=0, color='k', linewidth=0.5)
axes[0].axvline(x=0, color='k', linewidth=0.5)
axes[0].set_title('ReLU(x) = max(0, x)', fontsize=12)
axes[0].set_xlabel('x')
axes[0].set_ylabel('ReLU(x)')
axes[0].grid(True, alpha=0.3)

# ReLU gradient
grad = np.ones_like(x)
dZ = relu.backward(grad)

axes[1].plot(x, dZ, 'r-', linewidth=2)
axes[1].axhline(y=0, color='k', linewidth=0.5)
axes[1].axvline(x=0, color='k', linewidth=0.5)
axes[1].set_title("ReLU'(x) = 1 if x > 0, else 0", fontsize=12)
axes[1].set_xlabel('x')
axes[1].set_ylabel("Gradient")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 Key insight: ReLU passes gradient through only for positive inputs!")
print("   This is why 'dead ReLU' can happen - neurons that always output 0 never learn.")
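
The "dead ReLU" effect can be reproduced in a few lines. The sketch below is contrived on purpose: one neuron is given a bias of `-100` (an assumption made only to force the effect) so that its pre-activation is negative for every sample in the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))   # 100 samples, 3 features
W = rng.standard_normal((3, 2))     # 2 neurons
b = np.array([0.0, -100.0])         # neuron 1 gets a huge negative bias

Z = X @ W + b
A = np.maximum(0, Z)                # ReLU forward

dA = np.ones_like(A)                # pretend every output receives gradient 1
dZ = dA * (Z > 0)                   # ReLU backward: blocked wherever Z <= 0
dW = X.T @ dZ / X.shape[0]          # weight gradient, averaged over the batch

print("Fraction of samples where each neuron is active:", (Z > 0).mean(axis=0))
print("Weight gradient for neuron 0:", dW[:, 0])
print("Weight gradient for neuron 1:", dW[:, 1])
```

Neuron 1 never activates, so its weight gradient is exactly zero and gradient descent has no signal to bring it back.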

---

## Part 4: Implementing Softmax + Cross-Entropy Loss

For classification, we need:
1. **Softmax**: Convert raw scores (logits) to probabilities
2. **Cross-Entropy**: Measure how different our predictions are from true labels

### Softmax
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

### Cross-Entropy Loss
$$L = -\sum_i y_i \log(\hat{y}_i)$$

Where $y$ is one-hot encoded true labels and $\hat{y}$ is softmax output.

### The Beautiful Gradient
When combined, the gradient is simply:
$$\frac{\partial L}{\partial z} = \hat{y} - y$$
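
Here is a short derivation of that result, sketched for a single sample whose true class is $k$ (so $y_k = 1$ and every other $y_i = 0$). Substituting the softmax definition into the loss gives:

$$L = -\log \hat{y}_k = -z_k + \log \sum_j e^{z_j}$$

Differentiating with respect to each logit $z_i$:

$$\frac{\partial L}{\partial z_i} = -y_i + \frac{e^{z_i}}{\sum_j e^{z_j}} = \hat{y}_i - y_i$$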

In [None]:
class Softmax:
    """
    Softmax activation for classification.
    
    ELI5: Softmax is like converting exam scores to class rankings.
    It makes sure all probabilities sum to 1 (100%), with higher
    scores getting higher probabilities.
    """
    
    def __init__(self):
        self.cache = {}
    
    def forward(self, Z: np.ndarray) -> np.ndarray:
        """
        Compute softmax probabilities.
        
        Note: We subtract max for numerical stability (prevents overflow).
        """
        # Subtract max for numerical stability
        Z_shifted = Z - np.max(Z, axis=1, keepdims=True)
        exp_Z = np.exp(Z_shifted)
        probs = exp_Z / np.sum(exp_Z, axis=1, keepdims=True)
        self.cache['probs'] = probs
        return probs
    
    def __call__(self, Z: np.ndarray) -> np.ndarray:
        return self.forward(Z)


class CrossEntropyLoss:
    """
    Cross-entropy loss for classification.
    
    ELI5: Cross-entropy measures how "surprised" we are by the predictions.
    - Confident correct prediction: Low surprise, low loss
    - Confident WRONG prediction: High surprise, high loss
    """
    
    def __init__(self, epsilon: float = 1e-10):
        self.epsilon = epsilon  # Prevent log(0)
        self.cache = {}
    
    def forward(self, probs: np.ndarray, targets: np.ndarray) -> float:
        """
        Compute cross-entropy loss.
        
        Args:
            probs: Softmax probabilities, shape (batch_size, num_classes)
            targets: True labels as indices, shape (batch_size,)
            
        Returns:
            Scalar loss value
        """
        batch_size = probs.shape[0]
        
        # Clip probabilities for numerical stability
        probs_clipped = np.clip(probs, self.epsilon, 1 - self.epsilon)
        
        # Get probability of correct class for each sample
        correct_probs = probs_clipped[np.arange(batch_size), targets]
        
        # Negative log likelihood
        loss = -np.mean(np.log(correct_probs))
        
        # Save for backward
        self.cache['probs'] = probs
        self.cache['targets'] = targets
        
        return loss
    
    def backward(self) -> np.ndarray:
        """
        Compute gradient of loss with respect to logits (pre-softmax).
        
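        The beautiful result: gradient = softmax_output - one_hot_targets

        Note: this is the per-sample gradient (not divided by batch_size);
        Linear.backward averages over the batch when computing dW and db.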
        """
        probs = self.cache['probs']
        targets = self.cache['targets']
        batch_size = probs.shape[0]
        
        # Start with softmax output
        grad = probs.copy()
        
        # Subtract 1 from the correct class probability
        grad[np.arange(batch_size), targets] -= 1
        
        return grad
    
    def __call__(self, probs: np.ndarray, targets: np.ndarray) -> float:
        return self.forward(probs, targets)

In [None]:
# Test Softmax + CrossEntropy
print("🧪 Testing Softmax + Cross-Entropy")
print("=" * 50)

# Logits for 3 samples, 4 classes
logits = np.array([
    [2.0, 1.0, 0.1, 0.5],   # Confident about class 0
    [0.1, 0.2, 3.0, 0.1],   # Confident about class 2
    [1.0, 1.0, 1.0, 1.0]    # Uncertain (uniform)
])
targets = np.array([0, 2, 1])  # True classes

print("Logits:")
print(logits)
print(f"\nTrue targets: {targets}")

# Forward pass
softmax = Softmax()
probs = softmax(logits)
print(f"\nSoftmax probabilities:")
print(probs)
print(f"Sum per sample: {probs.sum(axis=1)}")

# Loss
loss_fn = CrossEntropyLoss()
loss = loss_fn(probs, targets)
print(f"\nCross-entropy loss: {loss:.4f}")

# Gradient
grad = loss_fn.backward()
print(f"\nGradient (probs - one_hot):")
print(grad)

print("\n💡 Notice: The gradient is small for confident correct predictions, large for uncertain or wrong ones!")
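
The `probs - one_hot` gradient can also be checked numerically. The sketch below is self-contained and uses standalone helper functions (`softmax` and `summed_loss` are hypothetical names, not the notebook's classes). It compares against the loss *summed* over the batch, because `CrossEntropyLoss.backward` returns per-sample gradients and leaves the division by the batch size to `Linear.backward`.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    shifted = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def summed_loss(z: np.ndarray, targets: np.ndarray) -> float:
    """Cross-entropy summed (not averaged) over the batch."""
    p = softmax(z)
    return float(-np.log(p[np.arange(len(targets)), targets]).sum())


logits = np.array([[2.0, 1.0, 0.1, 0.5],
                   [0.1, 0.2, 3.0, 0.1],
                   [1.0, 1.0, 1.0, 1.0]])
targets = np.array([0, 2, 1])

# Analytic gradient: softmax output minus one-hot targets
analytic = softmax(logits)
analytic[np.arange(3), targets] -= 1

# Numerical gradient via central differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for idx in np.ndindex(logits.shape):
    z_plus, z_minus = logits.copy(), logits.copy()
    z_plus[idx] += eps
    z_minus[idx] -= eps
    numeric[idx] = (summed_loss(z_plus, targets) - summed_loss(z_minus, targets)) / (2 * eps)

print("Max abs difference:", np.max(np.abs(analytic - numeric)))
```

Each row of the analytic gradient sums to zero, because the probabilities sum to 1 and exactly 1 is subtracted from the correct class.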

---

## Part 5: Building the Complete Network

Now let's put it all together into a Multi-Layer Perceptron (MLP).

In [None]:
class MLP:
    """
    Multi-Layer Perceptron for classification.
    
    Architecture: Input -> Linear -> ReLU -> Linear -> ReLU -> Linear -> Softmax
    """
    
    def __init__(self, layer_sizes: List[int]):
        """
        Initialize the MLP.
        
        Args:
            layer_sizes: List of layer sizes, e.g., [784, 256, 128, 10]
        """
        self.layers = []
        self.activations = []
        
        # Create layers
        for i in range(len(layer_sizes) - 1):
            self.layers.append(Linear(layer_sizes[i], layer_sizes[i + 1]))
            # Add ReLU for all layers except the last
            if i < len(layer_sizes) - 2:
                self.activations.append(ReLU())
        
        # Softmax for output
        self.softmax = Softmax()
        
        print(f"Created MLP: {' -> '.join(map(str, layer_sizes))}")
        print(f"Total parameters: {self.count_parameters():,}")
    
    def count_parameters(self) -> int:
        """Count total trainable parameters."""
        total = 0
        for layer in self.layers:
            total += layer.W.size + layer.b.size
        return total
    
    def forward(self, X: np.ndarray) -> np.ndarray:
        """
        Forward pass through the network.
        """
        # First n-1 layers: Linear + ReLU
        out = X
        for i in range(len(self.layers) - 1):
            out = self.layers[i](out)
            out = self.activations[i](out)
        
        # Last layer: Linear + Softmax
        out = self.layers[-1](out)
        out = self.softmax(out)
        
        return out
    
    def backward(self, grad: np.ndarray) -> None:
        """
        Backward pass through the network.
        """
        # Last layer (no ReLU backward needed, softmax+CE gradient already computed)
        grad = self.layers[-1].backward(grad)
        
        # Hidden layers in reverse order
        for i in range(len(self.layers) - 2, -1, -1):
            grad = self.activations[i].backward(grad)
            grad = self.layers[i].backward(grad)
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Get class predictions (argmax of probabilities)."""
        probs = self.forward(X)
        return np.argmax(probs, axis=1)
    
    def __call__(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X)

In [None]:
# Test the MLP
print("🧪 Testing MLP")
print("=" * 50)

# Create network: 784 -> 256 -> 128 -> 10
model = MLP([784, 256, 128, 10])

# Test forward pass
X_test = np.random.randn(32, 784)
output = model(X_test)
print(f"\nInput shape: {X_test.shape}")
print(f"Output shape: {output.shape}")
print(f"Output sums to 1: {np.allclose(output.sum(axis=1), 1)}")

# Test backward pass
targets = np.random.randint(0, 10, 32)
loss_fn = CrossEntropyLoss()
loss = loss_fn(output, targets)
grad = loss_fn.backward()
model.backward(grad)

print(f"\nLoss: {loss:.4f}")
print(f"Gradients computed for all layers: {all(l.dW is not None for l in model.layers)}")

print("\n✅ MLP works correctly!")
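
The parameter count printed by the `MLP` constructor can be reproduced by hand. A small sketch of the arithmetic for the `[784, 256, 128, 10]` architecture, following the same weights-plus-biases rule as `count_parameters`:

```python
layer_sizes = [784, 256, 128, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = n_in * n_out + n_out      # weight matrix + bias vector
    print(f"{n_in:>4} -> {n_out:<4}: {n_params:>7,} parameters")
    total += n_params

print(f"Total: {total:,}")
```

The total is 235,146, and the first layer alone accounts for 200,960 of those parameters.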

---

## Part 6: Implementing SGD Optimizer

The optimizer updates weights based on gradients:
$$W = W - \eta \cdot \frac{\partial L}{\partial W}$$

In [None]:
class SGD:
    """
    Stochastic Gradient Descent optimizer.
    
    ELI5: SGD is like taking small steps downhill while blindfolded.
    You feel which way is down (gradient) and take a small step.
    The learning rate controls how big your steps are.
    """
    
    def __init__(self, learning_rate: float = 0.01):
        self.lr = learning_rate
    
    def step(self, model: MLP) -> None:
        """
        Update all weights in the model.
        """
        for layer in model.layers:
            layer.W -= self.lr * layer.dW
            layer.b -= self.lr * layer.db
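
To build intuition for the learning rate, here is the same update rule applied to a toy one-dimensional problem. This is a sketch under an assumed loss, `f(w) = w**2`, whose gradient is `2*w`; the function `run_sgd` is hypothetical and independent of the `SGD` class above.

```python
def run_sgd(lr: float, steps: int = 20, w0: float = 1.0) -> float:
    """Minimise f(w) = w**2 with plain gradient descent; the gradient is 2*w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad          # the update rule W = W - eta * dL/dW
    return w


for lr in (0.01, 0.1, 0.5, 1.1):
    print(f"lr={lr:<4}: w after 20 steps = {run_sgd(lr):.4g}")
```

On this toy problem a small learning rate creeps toward the minimum at 0, a larger one gets there faster, and a rate above 1.0 overshoots by more each step and diverges. Real networks behave less cleanly, but the same trade-off applies.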

---

## Part 7: Loading MNIST Dataset

MNIST is the "Hello World" of deep learning - handwritten digits 0-9.

In [None]:
def load_mnist(path: str = '../data') -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Load MNIST dataset.
    
    Returns:
        X_train, y_train, X_test, y_test
    """
    import gzip
    import urllib.request
    
    os.makedirs(path, exist_ok=True)
    
    base_url = 'http://yann.lecun.com/exdb/mnist/'
    files = {
        'train_images': 'train-images-idx3-ubyte.gz',
        'train_labels': 'train-labels-idx1-ubyte.gz',
        'test_images': 't10k-images-idx3-ubyte.gz',
        'test_labels': 't10k-labels-idx1-ubyte.gz'
    }
    
    def download_if_needed(filename: str) -> str:
        filepath = os.path.join(path, filename)
        if not os.path.exists(filepath):
            print(f"Downloading {filename}...")
            urllib.request.urlretrieve(base_url + filename, filepath)
        return filepath
    
    def load_images(filepath: str) -> np.ndarray:
        with gzip.open(filepath, 'rb') as f:
            f.read(16)  # Skip header
            data = np.frombuffer(f.read(), dtype=np.uint8)
            return data.reshape(-1, 784).astype(np.float32) / 255.0
    
    def load_labels(filepath: str) -> np.ndarray:
        with gzip.open(filepath, 'rb') as f:
            f.read(8)  # Skip header
            return np.frombuffer(f.read(), dtype=np.uint8)
    
    X_train = load_images(download_if_needed(files['train_images']))
    y_train = load_labels(download_if_needed(files['train_labels']))
    X_test = load_images(download_if_needed(files['test_images']))
    y_test = load_labels(download_if_needed(files['test_labels']))
    
    return X_train, y_train, X_test, y_test
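
The `f.read(16)` and `f.read(8)` calls above skip the IDX file headers. In the IDX format used by MNIST, an image file starts with four big-endian 32-bit integers (magic number 2051, image count, rows, columns) and a label file starts with two (magic number 2049, label count). The sketch below parses the image header instead of skipping it, which catches corrupted or mislabeled downloads early. The helper `parse_idx_images` is hypothetical, and the bytes fed to it are synthetic rather than real MNIST data.

```python
import struct
import numpy as np


def parse_idx_images(raw: bytes) -> np.ndarray:
    """Parse an (already decompressed) IDX3 image buffer into shape (n, rows*cols)."""
    magic, n_images, n_rows, n_cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"Unexpected magic number {magic}; expected 2051 for images")
    pixels = np.frombuffer(raw, dtype=np.uint8, offset=16)
    if pixels.size != n_images * n_rows * n_cols:
        raise ValueError("File is truncated or the header is inconsistent")
    # Same normalization as load_images: float32 values in [0, 1]
    return pixels.reshape(n_images, n_rows * n_cols).astype(np.float32) / 255.0


# Synthetic stand-in for a real file: 2 images of 28x28 pixels, all set to 255
fake = struct.pack(">IIII", 2051, 2, 28, 28) + bytes([255]) * (2 * 28 * 28)
images = parse_idx_images(fake)
print(images.shape, images.dtype)
```

With a real file, the `raw` bytes would come from `gzip.open(filepath, 'rb').read()`.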

In [None]:
# Load MNIST
print("📂 Loading MNIST dataset...")
X_train, y_train, X_test, y_test = load_mnist()

print(f"\nTraining set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"Input dimensions: {X_train.shape[1]} (28x28 flattened)")
print(f"Classes: {np.unique(y_train)}")

# Visualize some samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i].reshape(28, 28), cmap='gray')
    ax.set_title(f'Label: {y_train[i]}')
    ax.axis('off')
plt.suptitle('Sample MNIST Digits', fontsize=14)
plt.tight_layout()
plt.show()

---

## Part 8: Training Loop

Now we'll put everything together and train our network!

In [None]:
def create_batches(X: np.ndarray, y: np.ndarray, batch_size: int, shuffle: bool = True):
    """
    Create mini-batches from data.
    """
    n_samples = X.shape[0]
    indices = np.arange(n_samples)
    
    if shuffle:
        np.random.shuffle(indices)
    
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        batch_idx = indices[start:end]
        yield X[batch_idx], y[batch_idx]


def compute_accuracy(model: MLP, X: np.ndarray, y: np.ndarray, batch_size: int = 256) -> float:
    """
    Compute accuracy on a dataset.
    """
    correct = 0
    total = 0
    
    for X_batch, y_batch in create_batches(X, y, batch_size, shuffle=False):
        predictions = model.predict(X_batch)
        correct += np.sum(predictions == y_batch)
        total += len(y_batch)
    
    return correct / total
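
Because `create_batches` clips the last slice with `min(start + batch_size, n_samples)`, the final batch of an epoch can be smaller than the rest. A quick arithmetic sketch, assuming the standard MNIST training set size of 60,000 and the batch size of 64 used later in this notebook:

```python
import math

n_samples = 60_000     # standard MNIST training set size
batch_size = 64

n_batches = math.ceil(n_samples / batch_size)
last_batch = n_samples - (n_batches - 1) * batch_size

print(f"Batches per epoch: {n_batches}")
print(f"Size of final batch: {last_batch}")
```

That gives 938 weight updates per epoch, the last one on only 32 samples. Since `Linear.backward` divides by the actual batch size, the smaller final batch needs no special handling.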

In [None]:
def train_model(
    model: MLP,
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
    epochs: int = 10,
    batch_size: int = 64,
    learning_rate: float = 0.01
) -> Dict:
    """
    Train the model and return training history.
    """
    optimizer = SGD(learning_rate)
    loss_fn = CrossEntropyLoss()
    
    history = {
        'train_loss': [],
        'train_acc': [],
        'test_acc': [],
        'epoch_time': []
    }
    
    print(f"\n{'='*60}")
    print(f"Training Configuration:")
    print(f"  - Epochs: {epochs}")
    print(f"  - Batch size: {batch_size}")
    print(f"  - Learning rate: {learning_rate}")
    print(f"  - Training samples: {X_train.shape[0]:,}")
    print(f"{'='*60}\n")
    
    for epoch in range(epochs):
        start_time = time.time()
        epoch_loss = 0
        n_batches = 0
        
        # Training
        for X_batch, y_batch in create_batches(X_train, y_train, batch_size):
            # Forward pass
            output = model(X_batch)
            loss = loss_fn(output, y_batch)
            epoch_loss += loss
            n_batches += 1
            
            # Backward pass
            grad = loss_fn.backward()
            model.backward(grad)
            
            # Update weights
            optimizer.step(model)
        
        # Compute metrics
        avg_loss = epoch_loss / n_batches
        train_acc = compute_accuracy(model, X_train, y_train)
        test_acc = compute_accuracy(model, X_test, y_test)
        epoch_time = time.time() - start_time
        
        # Record history
        history['train_loss'].append(avg_loss)
        history['train_acc'].append(train_acc)
        history['test_acc'].append(test_acc)
        history['epoch_time'].append(epoch_time)
        
        # Print progress
        print(f"Epoch {epoch + 1:2d}/{epochs} | "
              f"Loss: {avg_loss:.4f} | "
              f"Train Acc: {train_acc:.2%} | "
              f"Test Acc: {test_acc:.2%} | "
              f"Time: {epoch_time:.2f}s")
    
    print(f"\n{'='*60}")
    print(f"🎉 Training complete!")
    print(f"   Final Test Accuracy: {history['test_acc'][-1]:.2%}")
    print(f"   Total Training Time: {sum(history['epoch_time']):.2f}s")
    print(f"{'='*60}")
    
    return history

In [None]:
# Create and train the model!
print("🚀 Building and Training Neural Network from Scratch!")
print("=" * 60)

# Set seed for reproducibility
np.random.seed(42)

# Create model: 784 -> 256 -> 128 -> 10
model = MLP([784, 256, 128, 10])

# Train!
history = train_model(
    model, 
    X_train, y_train,
    X_test, y_test,
    epochs=10,
    batch_size=64,
    learning_rate=0.1  # Higher LR for faster convergence
)

---

## Part 9: Visualizing Training Progress

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(history['train_loss'], 'b-', linewidth=2, label='Training Loss')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Training Loss Over Time', fontsize=14)
axes[0].grid(True, alpha=0.3)
axes[0].legend()

# Accuracy curves
axes[1].plot(history['train_acc'], 'b-', linewidth=2, label='Train Accuracy')
axes[1].plot(history['test_acc'], 'r--', linewidth=2, label='Test Accuracy')
axes[1].axhline(y=0.95, color='g', linestyle=':', linewidth=2, label='95% Target')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Accuracy Over Time', fontsize=14)
axes[1].grid(True, alpha=0.3)
axes[1].legend()

plt.tight_layout()
plt.show()

# Final report
target_achieved = history['test_acc'][-1] >= 0.95
if target_achieved:
    print("\n🎉 SUCCESS! You reached the 95% accuracy target on MNIST!")
else:
    print(f"\n📈 Current accuracy: {history['test_acc'][-1]:.2%}")
    print("   Try: more epochs, a different learning rate, or larger hidden layers")

In [None]:
# Visualize some predictions
fig, axes = plt.subplots(2, 5, figsize=(14, 6))

# Get random test samples
indices = np.random.choice(len(X_test), 10, replace=False)

for i, ax in enumerate(axes.flat):
    idx = indices[i]
    image = X_test[idx].reshape(28, 28)
    true_label = y_test[idx]
    
    # Get prediction
    pred_probs = model(X_test[idx:idx+1])[0]
    pred_label = np.argmax(pred_probs)
    confidence = pred_probs[pred_label]
    
    # Display
    ax.imshow(image, cmap='gray')
    color = 'green' if pred_label == true_label else 'red'
    ax.set_title(f'Pred: {pred_label} ({confidence:.0%})\nTrue: {true_label}', 
                 color=color, fontsize=10)
    ax.axis('off')

plt.suptitle('Model Predictions on Test Set', fontsize=14)
plt.tight_layout()
plt.show()

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting to normalize input data

```python
# ❌ Wrong - raw pixel values (0-255)
X_train = data.reshape(-1, 784).astype(np.float32)

# ✅ Right - normalized to [0, 1]
X_train = data.reshape(-1, 784).astype(np.float32) / 255.0
```
**Why:** Large input values produce large activations and gradients, which can make training unstable.

---

### Mistake 2: Wrong gradient scale

```python
# ❌ Wrong - forgetting to average over batch
self.dW = X.T @ dZ  # Shape correct, but values too large

# ✅ Right - average over batch
self.dW = X.T @ dZ / batch_size
```
**Why:** Without averaging, the gradient magnitude grows with the batch size, so the effective learning rate changes whenever the batch size does.

---

### Mistake 3: Numerical instability in softmax

```python
# ❌ Wrong - can cause overflow
exp_Z = np.exp(Z)
probs = exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

# ✅ Right - subtract max for stability
Z_shifted = Z - np.max(Z, axis=1, keepdims=True)
exp_Z = np.exp(Z_shifted)
probs = exp_Z / np.sum(exp_Z, axis=1, keepdims=True)
```
**Why:** `np.exp(1000)` overflows to `inf`, but after subtracting the max the largest term is `exp(1000 - 1000) = exp(0) = 1`.
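
The failure is easy to reproduce. In the sketch below the logits around 1000 are contrived to trigger the overflow, and `np.errstate` is used only to silence the warnings NumPy would otherwise print.

```python
import numpy as np

Z = np.array([[1000.0, 1001.0, 1002.0]])   # large logits (contrived)

# Naive version: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(Z) / np.sum(np.exp(Z), axis=1, keepdims=True)

# Stable version: the largest exponent is exp(0) = 1
shifted = Z - np.max(Z, axis=1, keepdims=True)
stable = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)

print("naive: ", naive)
print("stable:", stable)

# Shifting all logits by a constant leaves the softmax unchanged
small = np.array([[0.0, 1.0, 2.0]])
reference = np.exp(small) / np.exp(small).sum(axis=1, keepdims=True)
print("Same as softmax of [0, 1, 2]:", np.allclose(stable, reference))
```

The naive version returns `nan` for every class, while the stable version gives the same probabilities as the small logits `[0, 1, 2]`.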

---

## ✋ Try It Yourself

### Exercise 1: Experiment with Architecture

Try different network architectures and see how they affect accuracy:
- Deeper: `[784, 512, 256, 128, 64, 10]`
- Wider: `[784, 512, 512, 10]`
- Simpler: `[784, 128, 10]`

<details>
<summary>💡 Hint</summary>
Just change the layer_sizes list when creating the MLP.
Deeper networks may need lower learning rates!
</details>

In [None]:
# Your code here: Try a different architecture
# model_v2 = MLP([784, ???, 10])
# history_v2 = train_model(model_v2, X_train, y_train, X_test, y_test, ...)

### Exercise 2: Learning Rate Exploration

Try learning rates: 0.001, 0.01, 0.1, 1.0

Which one works best? Which ones fail?

In [None]:
# Your code here: Compare learning rates

---

## 🎉 Checkpoint

You've accomplished something remarkable! You:

- ✅ Built a neural network from scratch using only NumPy
- ✅ Implemented forward and backward passes
- ✅ Created Linear, ReLU, Softmax, and CrossEntropy components
- ✅ Trained on real data and achieved >95% accuracy
- ✅ Visualized training progress and predictions

**Key Insights:**
1. Neural networks are just matrix multiplication + non-linearity
2. Backpropagation traces gradients through the chain rule
3. Proper initialization and normalization are crucial
4. The softmax + cross-entropy gradient is beautifully simple

---

## 🚀 Challenge (Optional)

**Advanced Challenge:** Implement momentum in the SGD optimizer.

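Momentum accumulates a decaying sum of past gradients, which often speeds up training and can help the optimizer move through flat regions and shallow local minima: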
$$v = \beta \cdot v - \eta \cdot \nabla L$$
$$W = W + v$$

Where $\beta$ is typically 0.9.
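
To get a feel for the update rule before wiring it into the `SGD` class, here is a minimal sketch on a toy two-dimensional quadratic. The loss, the `curvature` values, the learning rate, and the helper `descend` are all assumptions made for illustration; integrating momentum with the `MLP` layers is still left as the challenge.

```python
import numpy as np

curvature = np.array([1.0, 50.0])   # toy loss: f(w) = 0.5 * sum(curvature * w**2)


def loss(w: np.ndarray) -> float:
    return float(0.5 * np.sum(curvature * w ** 2))


def grad(w: np.ndarray) -> np.ndarray:
    return curvature * w


def descend(beta: float, lr: float = 0.02, steps: int = 100) -> float:
    """Run the momentum update; beta = 0 reduces to plain gradient descent."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)    # v = beta * v - eta * grad L
        w = w + v                      # W = W + v
    return loss(w)


loss_plain = descend(beta=0.0)
loss_momentum = descend(beta=0.9)
print(f"plain GD: loss after 100 steps = {loss_plain:.2e}")
print(f"momentum: loss after 100 steps = {loss_momentum:.2e}")
```

On this toy problem the flat direction (curvature 1) is what slows plain gradient descent down, and the accumulated velocity lets momentum cover it faster. A design hint for the challenge: each layer needs its own velocity arrays with the same shapes as `W` and `b`.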

---

## 📖 Further Reading

- [Neural Networks and Deep Learning (Nielsen)](http://neuralnetworksanddeeplearning.com/) - Free online book
- [CS231n: Backpropagation](https://cs231n.github.io/optimization-2/) - Stanford course notes
- [3Blue1Brown: Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) - Excellent visualizations

---

## 🧹 Cleanup

In [None]:
# Clean up memory
import gc

# Clear large variables if needed
# del X_train, y_train, X_test, y_test, model
gc.collect()

print("✅ Cleanup complete!")
print("\n🎯 Next: Proceed to notebook 02-activation-function-study.ipynb")