<a target="_blank" href="https://colab.research.google.com/github/FranQuant/the_ai_engineer_capstones/blob/main/capstones/Week02_backprop/01_numpy_manual.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>

# Week 2 Capstone — Manual NumPy Backprop (1-Hidden-Layer MLP)

This notebook implements:

1. A tiny synthetic dataset  
2. Manual forward pass (NumPy)  
3. Manual backward pass via chain rule  
4. Gradient check (finite differences)

It follows the *Deep Learning Basics: Chain Rule → Backprop → nn.Module* handout (TAE Program, Week 2).

## 1. Imports & Deterministic Seeds

In [1]:
import numpy as np

# ------------------------------
# Deterministic seed
# ------------------------------
SEED = 42
rng = np.random.default_rng(SEED)

def set_seed(seed=42):
    global rng
    rng = np.random.default_rng(seed)

set_seed(SEED)

print("NumPy RNG initialized with seed =", SEED)

NumPy RNG initialized with seed = 42
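
As a quick illustration of why the generator is seeded, the sketch below (not part of the original notebook) builds two independent generators from the same seed and checks that they produce identical draws. The names `rng_a`, `rng_b`, and `rng_c` are illustrative only.

```python
import numpy as np

SEED = 42

# Two independent generators built from the same seed (illustrative names)
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)

draws_a = rng_a.uniform(-1, 1, size=(3, 2))
draws_b = rng_b.uniform(-1, 1, size=(3, 2))

# Same seed -> same stream of numbers
same_stream = np.array_equal(draws_a, draws_b)
print("Identical draws:", same_stream)

# A different seed gives a different stream
rng_c = np.random.default_rng(SEED + 1)
draws_c = rng_c.uniform(-1, 1, size=(3, 2))
different_stream = not np.array_equal(draws_a, draws_c)
print("Different seed differs:", different_stream)
```

Because every later cell draws from the shared `rng`, re-running cells out of order changes the values drawn; calling `set_seed(SEED)` first restores the original sequence.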


## 2. Notebook Outline

1. Synthetic dataset  
2. Model definition (parameters, activation)  
3. Forward pass  
4. Loss function  
5. Backward pass (manual gradients)  
6. Gradient check (finite differences)  
7. Sanity checks  

## 3. (Tiny) Synthetic Dataset
We implement the XOR-style pattern:

$$
x \sim \mathcal{U}([-1, 1]^2), \qquad y = \mathbf{1}[x_1 x_2 < 0]
$$


In [2]:
# ---------------------------------------
# Synthetic XOR-like dataset
# ---------------------------------------
# According to Section 6.1 of the Capstone:
#   x ~ Uniform([-1, 1]^2)
#   y = 1[x1 * x2 < 0]

def generate_toy_data(n_samples=64):
    X = rng.uniform(-1, 1, size=(n_samples, 2))
    y = (X[:, 0] * X[:, 1] < 0).astype(float)
    return X, y

X, y = generate_toy_data(4)
print("X sample:\n", X)
print("y sample:\n", y)

X sample:
 [[ 0.5479121  -0.12224312]
 [ 0.71719584  0.39473606]
 [-0.8116453   0.9512447 ]
 [ 0.5222794   0.57212861]]
y sample:
 [1. 0. 1. 0.]
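
The labelling rule assigns $y = 1$ exactly when the two coordinates have opposite signs, which is what makes the pattern XOR-like. A minimal sketch with one hand-picked point per quadrant (the values in `X_demo` are illustrative, not drawn from the notebook's generator) makes this explicit:

```python
import numpy as np

# One hand-picked point per quadrant (illustrative values)
X_demo = np.array([
    [ 0.5,  0.5],   # quadrant I:   same signs
    [-0.5,  0.5],   # quadrant II:  opposite signs
    [-0.5, -0.5],   # quadrant III: same signs
    [ 0.5, -0.5],   # quadrant IV:  opposite signs
])

# Same labelling rule as generate_toy_data: y = 1[x1 * x2 < 0]
y_demo = (X_demo[:, 0] * X_demo[:, 1] < 0).astype(float)

print(y_demo)  # [0. 1. 0. 1.]
```

This agrees with the printed sample above, where the first and third rows have coordinates of opposite sign and are labelled 1.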


## 4. Model Parameters (NumPy)

In [3]:
# ---------------------------------------
# Model parameters: d -> h -> 1
# ---------------------------------------

d = 2     # input dimension
h = 4     # hidden width
out = 1   # scalar output

# small Gaussian init keeps initial pre-activations and outputs near zero
W1 = rng.normal(0, 0.1, size=(h, d))
b1 = np.zeros((h,))

W2 = rng.normal(0, 0.1, size=(1, h))
b2 = np.zeros((1,))

def get_params():
    return W1, b1, W2, b2

print("W1:", W1.shape)
print("b1:", b1.shape)
print("W2:", W2.shape)
print("b2:", b2.shape)

W1: (4, 2)
b1: (4,)
W2: (1, 4)
b2: (1,)


## 5. Activation Functions (ReLU + ReLU')
We use ReLU:

$$
\text{ReLU}(u) = \max(0, u)
$$

Derivative:

$$
\text{ReLU}'(u) = 
\begin{cases}
1 & u > 0 \\
0 & u \le 0
\end{cases}
$$

In [4]:
def relu(u):
    return np.maximum(0, u)

def relu_prime(u):
    return (u > 0).astype(float)
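
To see both functions at work, the sketch below evaluates them on a small illustrative vector `u_demo` (an assumed input, not from the notebook). Note that the derivative at exactly zero is taken to be 0, matching the case definition above.

```python
import numpy as np

def relu(u):
    return np.maximum(0, u)

def relu_prime(u):
    return (u > 0).astype(float)

u_demo = np.array([-2.0, 0.0, 3.0])  # illustrative inputs

print(relu(u_demo))        # [0. 0. 3.]
print(relu_prime(u_demo))  # [0. 0. 1.]
```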

## 6. Forward Pass (Single Sample)

For a single sample $x \in \mathbb{R}^d$, with $\phi = \text{ReLU}$:

$$
a_1 = W_1 x + b_1
$$

$$
h = \phi(a_1)
$$

$$
f = W_2 h + b_2
$$


In [5]:
def forward_single(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1      # (h,)
    h  = relu(a1)         # (h,)
    f  = W2 @ h + b2      # (1,)
    return a1, h, f[0]    # return scalar f
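
The notebook works one sample at a time, but the same equations extend to a batch by storing one sample per row. The sketch below is one possible way to do that, not part of the original notebook: `forward_batch` is a hypothetical helper, and the parameters are freshly drawn demo values. It checks the batched result against a loop over `forward_single`.

```python
import numpy as np

def relu(u):
    return np.maximum(0, u)

def forward_single(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1
    h = relu(a1)
    f = W2 @ h + b2
    return a1, h, f[0]

def forward_batch(X, W1, b1, W2, b2):
    """Hypothetical batched variant: X has shape (n, d), one sample per row."""
    A1 = X @ W1.T + b1          # (n, h)
    H = relu(A1)                # (n, h)
    F = (H @ W2.T + b2)[:, 0]   # (n,)
    return A1, H, F

# Demo parameters and inputs (illustrative, separate from the notebook's rng)
rng_demo = np.random.default_rng(0)
W1 = rng_demo.normal(0, 0.1, size=(4, 2))
b1 = np.zeros(4)
W2 = rng_demo.normal(0, 0.1, size=(1, 4))
b2 = np.zeros(1)
X_demo = rng_demo.uniform(-1, 1, size=(5, 2))

_, _, F = forward_batch(X_demo, W1, b1, W2, b2)
f_loop = np.array([forward_single(x, W1, b1, W2, b2)[2] for x in X_demo])

batch_matches_loop = np.allclose(F, f_loop)
print("Batched forward matches per-sample loop:", batch_matches_loop)
```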

## 7. Loss Function
We use the halved squared error for a single sample (named `mse_loss` in the code):

$$
L(f, y) = \frac{1}{2}(f - y)^2
$$


In [6]:
def mse_loss(f, y):
    return 0.5 * (f - y)**2
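
The factor $\frac{1}{2}$ is a convenience: it cancels the exponent when differentiating, so the derivative of the loss with respect to the output is simply the residual. As a short worked step:

$$
\frac{\partial L}{\partial f}
= \frac{\partial}{\partial f}\left[\frac{1}{2}(f - y)^2\right]
= (f - y)\,\frac{\partial (f - y)}{\partial f}
= f - y
$$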

## 8. Manual Backward Pass (Chain Rule)

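Applying the chain rule to $L = \frac{1}{2}(f - y)^2$ gives, layer by layer:

- error: $\delta_f = \frac{\partial L}{\partial f} = f - y$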
- output gradient:
  $$
  \frac{\partial L}{\partial W_2} = \delta_f \, h^{\top},
  \qquad
  \frac{\partial L}{\partial b_2} = \delta_f
  $$
- hidden layer:
  $$
  \delta_h = W_2^\top \delta_f,
  \qquad
  \delta_{a_1} = \delta_h \odot \phi'(a_1)
  $$
- input layer:
  $$
  \frac{\partial L}{\partial W_1} = \delta_{a_1} x^\top,
  \qquad
  \frac{\partial L}{\partial b_1} = \delta_{a_1}
  $$


In [7]:
def backward_single(x, y, a1, h, f, W1, b1, W2, b2):
    # ----- output layer -----
    df = f - y                # scalar

    dW2 = df * h              # (h,): same entries as the (1, h) gradient of W2
    db2 = df                  # scalar

    # ----- hidden layer -----
    # W2 comes as shape (1, h), so flatten it
    dh = W2.flatten() * df           # (h,)
    da1 = dh * relu_prime(a1)        # (h,)

    # ----- input layer -----
    dW1 = da1[:, None] @ x[None, :]  # (h, d)
    db1 = da1

    return dW1, db1, dW2, db2
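
The expression `da1[:, None] @ x[None, :]` is a column-times-row product, that is, the outer product $\delta_{a_1} x^\top$. The sketch below uses illustrative values (`da1_demo`, `x_demo`) to show that it matches `np.outer`, and that a hidden unit with zero delta (an inactive ReLU) receives a zero row in the weight gradient.

```python
import numpy as np

da1_demo = np.array([0.5, 0.0, -1.0, 2.0])  # illustrative delta, shape (h,)
x_demo = np.array([3.0, -2.0])              # illustrative input, shape (d,)

# Column-times-row product, as written in backward_single
dW1_matmul = da1_demo[:, None] @ x_demo[None, :]   # (h, d)

# Equivalent outer product
dW1_outer = np.outer(da1_demo, x_demo)             # (h, d)

print(dW1_matmul.shape)                        # (4, 2)
print(np.array_equal(dW1_matmul, dW1_outer))   # True
print(np.all(dW1_matmul[1] == 0))              # True: zero delta -> zero row
```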


## 9. Single-Sample Sanity Check

In [8]:
# pick first sample
x0, y0 = X[0], y[0]

a1, h, f = forward_single(x0, W1, b1, W2, b2)
L = mse_loss(f, y0)

dW1, db1, dW2, db2 = backward_single(x0, y0, a1, h, f, W1, b1, W2, b2)

print("Forward output f =", f)
print("Loss L =", L)
print("dW1 shape:", dW1.shape)
print("db1 shape:", db1.shape)
print("dW2 shape:", dW2.shape)
print("db2 scalar:", db2)

Forward output f = -0.0035382554995001636
Loss L = 0.5035445151254901
dW1 shape: (4, 2)
db1 shape: (4,)
dW2 shape: (4,)
db2 scalar: -1.0035382554995003


## 10. Gradient Check (Finite Differences)

### 10.1 Flatten / Unflatten Helpers

In [9]:
# ---------------------------------------
# Utilities to flatten and unflatten params
# ---------------------------------------

def flatten_params(W1, b1, W2, b2):
    """Flatten all parameters into a single 1D vector."""
    return np.concatenate([
        W1.reshape(-1),
        b1.reshape(-1),
        W2.reshape(-1),
        b2.reshape(-1),
    ])

def unflatten_params(theta, d=2, h=4):
    """Inverse of flatten_params."""
    # W1: (h, d)
    size_W1 = h * d
    W1 = theta[:size_W1].reshape(h, d)

    # b1: (h,)
    b1 = theta[size_W1:size_W1 + h]

    # W2: (1, h)
    start = size_W1 + h
    size_W2 = h
    W2 = theta[start:start + size_W2].reshape(1, h)

    # b2: scalar
    b2 = theta[start + size_W2]

    return W1, b1, W2, np.array([b2])
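
Since the gradient check relies on these helpers being exact inverses, a round-trip test is a cheap safeguard. The sketch below repeats the two helpers so it runs on its own, then uses freshly drawn demo parameters (an assumption, not the notebook's `W1`, `b1`, `W2`, `b2`) to confirm that unflattening a flattened vector returns the original arrays.

```python
import numpy as np

def flatten_params(W1, b1, W2, b2):
    return np.concatenate([
        W1.reshape(-1), b1.reshape(-1), W2.reshape(-1), b2.reshape(-1),
    ])

def unflatten_params(theta, d=2, h=4):
    size_W1 = h * d
    W1 = theta[:size_W1].reshape(h, d)
    b1 = theta[size_W1:size_W1 + h]
    start = size_W1 + h
    W2 = theta[start:start + h].reshape(1, h)
    b2 = theta[start + h]
    return W1, b1, W2, np.array([b2])

d, h = 2, 4
rng_demo = np.random.default_rng(0)   # illustrative generator
params = (
    rng_demo.normal(size=(h, d)),     # W1
    rng_demo.normal(size=(h,)),       # b1
    rng_demo.normal(size=(1, h)),     # W2
    rng_demo.normal(size=(1,)),       # b2
)

theta = flatten_params(*params)
restored = unflatten_params(theta, d=d, h=h)

print(theta.shape)  # (17,) = h*d + h + h + 1
round_trip_ok = all(np.array_equal(p, r) for p, r in zip(params, restored))
print(round_trip_ok)  # True
```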

### 10.2 Loss Wrapper for a Single Sample
We define a function `loss_from_theta(theta, x, y)` so we can evaluate numeric derivatives.

In [10]:
# ---------------------------------------
# Loss wrapper for flattened parameter vector
# ---------------------------------------

def loss_from_theta(theta, x, y):
    W1, b1, W2, b2 = unflatten_params(theta)
    a1, h, f = forward_single(x, W1, b1, W2, b2)
    return mse_loss(f, y)


### 10.3 Analytic Gradient Flattening

In [11]:
def analytic_grad(theta, x, y):
    W1, b1, W2, b2 = unflatten_params(theta)
    a1, h, f = forward_single(x, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward_single(x, y, a1, h, f, W1, b1, W2, b2)
    return flatten_params(dW1, db1, dW2, np.array([db2]))

### 10.4 Numeric Finite-Difference Gradient
This implements:

$$
\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta_i+\epsilon)-L(\theta_i-\epsilon)}{2\epsilon}
$$

In [12]:
def numeric_grad(theta, x, y, eps=1e-5):
    num_grads = np.zeros_like(theta)
    for i in range(len(theta)):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i]  += eps
        theta_minus[i] -= eps
        
        f_plus  = loss_from_theta(theta_plus, x, y)
        f_minus = loss_from_theta(theta_minus, x, y)
        
        num_grads[i] = (f_plus - f_minus) / (2 * eps)
    return num_grads
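
Before trusting a finite-difference routine on the network, it can help to try the same formula on a function whose gradient is known in closed form. The sketch below is an assumed stand-alone variant, not the notebook's `numeric_grad`: `central_diff_grad` and `quadratic` are hypothetical names. For $g(\theta) = \sum_i \theta_i^2$ the exact gradient is $2\theta$, and the central difference has no truncation error on a quadratic, so only floating-point round-off remains.

```python
import numpy as np

def central_diff_grad(func, theta, eps=1e-5):
    """Generic central-difference gradient (hypothetical helper)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps                      # perturb one coordinate at a time
        grad[i] = (func(theta + step) - func(theta - step)) / (2 * eps)
    return grad

def quadratic(theta):
    return np.sum(theta ** 2)              # exact gradient is 2 * theta

theta_demo = np.array([1.0, -2.0, 0.5])    # illustrative point
g_num = central_diff_grad(quadratic, theta_demo)
g_exact = 2 * theta_demo

max_err = np.max(np.abs(g_num - g_exact))
print("Max abs error:", max_err)
```

The choice of `eps` is a trade-off: larger values increase truncation error on non-quadratic functions, while much smaller values amplify round-off in the subtraction.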

### 10.5 Gradient Check Comparison

In [13]:
# pick one sample for checking
x0, y0 = X[0], y[0]

theta = flatten_params(W1, b1, W2, b2)

g_analytic = analytic_grad(theta, x0, y0)
g_numeric  = numeric_grad(theta, x0, y0)

abs_diff = np.abs(g_analytic - g_numeric)
rel_diff = abs_diff / (np.abs(g_numeric) + 1e-8)

print("Max absolute diff: ", abs_diff.max())
print("Max relative diff: ", rel_diff.max())

# Fail loudly if wrong
assert abs_diff.max() < 1e-5, "Gradient check FAILED!"
print("\nGradient check PASSED.")

Max absolute diff:  9.519570028093671e-12
Max relative diff:  8.721610611055463e-09

Gradient check PASSED.
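
One caveat worth keeping in mind: ReLU is not differentiable at zero, so if a pre-activation lies within `eps` of zero, the numeric and analytic derivatives can disagree even when the backward pass is right. The sketch below illustrates this on the activation alone, using assumed demo inputs `u_kink` and `u_safe` and a hypothetical `central_diff` helper.

```python
import numpy as np

def relu(u):
    return np.maximum(0, u)

def relu_prime(u):
    return (u > 0).astype(float)

eps = 1e-5
u_kink = np.array([0.0])   # exactly at the kink
u_safe = np.array([0.3])   # well away from the kink

def central_diff(u):
    # Central difference applied directly to the activation
    return (relu(u + eps) - relu(u - eps)) / (2 * eps)

# At the kink the numeric slope averages the two one-sided slopes (0 and 1)
print(central_diff(u_kink), relu_prime(u_kink))

# Away from the kink the two agree
print(central_diff(u_safe), relu_prime(u_safe))
```

If a gradient check fails only for a few entries, inspecting whether any entry of `a1` is close to zero is a reasonable first diagnostic.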


## 11. Conclusion — What We Achieved

In this notebook we implemented and verified the forward and backward passes of a 1-hidden-layer MLP (two weight layers) **from scratch in NumPy**, with **no autograd** and **no frameworks**.  
The core objective was to expose and verify the mathematical mechanics underlying backpropagation.

**Key takeaways:**

- We implemented all model components manually:
  - Parameters $(W_1, b_1, W_2, b_2)$
  - ReLU activation
  - Forward computation  
  - Squared-error loss $\frac{1}{2}(f - y)^2$ (`mse_loss`)

- We derived **analytical gradients** for every parameter by hand, using:
  $$
  \frac{\partial L}{\partial W_2}, \quad 
  \frac{\partial L}{\partial b_2}, \quad
  \frac{\partial L}{\partial W_1}, \quad
  \frac{\partial L}{\partial b_1}.
  $$

- We implemented a **finite-difference gradient checker**, using:
  $$
  \frac{\partial L}{\partial \theta_i}
  \approx 
  \frac{L(\theta_i + \varepsilon) - L(\theta_i - \varepsilon)}
       {2\varepsilon}.
  $$

- The analytical gradients matched the numerical ones to within finite-difference error:
  - Max absolute diff ≈ $9.5 \times 10^{-12}$  
  - Max relative diff ≈ $8.7 \times 10^{-9}$

On the tested sample, this confirms that our backward-pass derivations and implementation are **correct**.
