<a target="_blank" href="https://colab.research.google.com/github/FranQuant/the_ai_engineer_capstones/blob/main/capstones/week02_backprop/02_pytorch_no_autograd.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>

# 02 — PyTorch Implementation (No Autograd)

## 1. Imports & Deterministic Seeds

In [None]:
# ============================================
# 1. Imports & Deterministic Seeds (match NB01)
# ============================================
import torch
import numpy as np

torch.manual_seed(42)
rng = np.random.default_rng(42)

## 2. Synthetic Dataset (same as Notebook 01)

In [None]:
# ============================================
# 2. Synthetic Dataset (match NB01 exactly)
# ============================================
N = 500  # same as Notebook 01

X_np = rng.uniform(-1, 1, size=(N, 2)).astype(np.float32)
y_np = (X_np[:, 0] * X_np[:, 1] < 0).astype(np.float32)

# Convert to PyTorch (float32 everywhere)
X = torch.tensor(X_np, dtype=torch.float32)
y = torch.tensor(y_np, dtype=torch.float32)

## 3. NumPy Reference Forward Pass (from Notebook 01)
To ensure numerical parity between the NumPy and PyTorch implementations,
we replicate the minimal forward-pass functions from Notebook 01. These
are used for direct comparison in Section 9.

In [None]:
# ============================================
# 3. NumPy Reference Forward Pass (from NB01)
# ============================================

def relu_np(x):
    return np.maximum(0, x)

def forward_single(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1        # (h,)
    h  = relu_np(a1)        # (h,)
    f  = W2 @ h + b2        # (1,)
    return a1, h, float(f[0])  # explicit scalar extract

## 4. NumPy Model Parameters (for comparison)

Notebook 02 needs standalone NumPy parameters to reproduce the exact
forward pass used in Notebook 01. These are synchronized with the
PyTorch parameters in Section 5 so both implementations produce
identical outputs.

In [None]:
# ============================================
# 4. NumPy Model Parameters (match NB01 exactly)
# ============================================

d, h = 2, 4

# Weight initialization: small Gaussian (std = 0.1), cast to float32
W1 = rng.normal(0.0, 0.1, size=(h, d)).astype(np.float32)
W2 = rng.normal(0.0, 0.1, size=(1, h)).astype(np.float32)

# Biases as float32
b1 = np.zeros((h,), dtype=np.float32)
b2 = np.zeros((1,), dtype=np.float32)

## 5. Model Parameters in PyTorch (No Autograd)

In [None]:
# ============================================
# 5. PyTorch Parameters (synced with NumPy)
# ============================================

# Create tensors with the same shapes and dtypes as NumPy params
W1_t = torch.empty((h, d), dtype=torch.float32)
b1_t = torch.empty((h,),   dtype=torch.float32)
W2_t = torch.empty((1, h), dtype=torch.float32)
b2_t = torch.empty((1,),   dtype=torch.float32)

# Tensors from torch.empty already have requires_grad=False;
# set it explicitly so the no-autograd intent is visible
for t in [W1_t, b1_t, W2_t, b2_t]:
    t.requires_grad_(False)

# Sync PyTorch parameters with NumPy (safe in-place copy)
W1_t.copy_(torch.tensor(W1, dtype=torch.float32))
b1_t.copy_(torch.tensor(b1, dtype=torch.float32))
W2_t.copy_(torch.tensor(W2, dtype=torch.float32))
b2_t.copy_(torch.tensor(b2, dtype=torch.float32))
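
Copying matters here because Section 11 updates the Torch tensors in place. The following is a minimal sketch, using a hypothetical stand-in array rather than the notebook's parameters, of how `torch.tensor` (which copies) differs from `torch.from_numpy` (which shares memory with the NumPy array):

```python
import numpy as np
import torch

def demo_copy_vs_shared():
    # Hypothetical stand-in for a NumPy parameter such as W1.
    W_np = np.ones((4, 2), dtype=np.float32)

    W_copy = torch.tensor(W_np, dtype=torch.float32)  # independent copy
    W_copy -= 0.5                                     # in-place update on the copy
    copy_leaves_numpy_intact = bool(np.all(W_np == 1.0))

    W_shared = torch.from_numpy(W_np)                 # shares memory with W_np
    W_shared -= 0.5                                   # in-place update reaches W_np
    shared_mutates_numpy = bool(np.all(W_np == 0.5))

    return copy_leaves_numpy_intact, shared_mutates_numpy

copy_ok, shared_changed = demo_copy_vs_shared()
print(copy_ok, shared_changed)
```

With the copying approach used above, the NumPy arrays stay available as an untouched baseline for re-initialization.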

## 6. Activation Function
#### Match NumPy ReLU:

In [None]:
def relu_t(x):
    return torch.clamp(x, min=0.0)

## 7. Forward Pass (PyTorch)

In [None]:
# ============================================
# 7. Forward Pass (PyTorch) — single-sample only
# ============================================

def forward_torch(x, W1, b1, W2, b2):
    """
    Single-sample forward pass matching NumPy forward_single.
    x:  (d,)
    W1: (h, d)
    b1: (h,)
    W2: (1, h)
    b2: (1,)
    Returns:
        a1: pre-activation (h,)
        h:  hidden layer (h,)
        f:  scalar output (torch scalar tensor)
    """
    a1 = x @ W1.T + b1        # (h,)
    h  = relu_t(a1)
    f  = W2 @ h + b2          # (1,)
    return a1, h, f.squeeze()  # scalar tensor
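
The Torch forward pass writes the first layer as `x @ W1.T + b1`, while the NumPy reference uses `W1 @ x + b1`. For a 1-D input these give the same vector. A small sketch with hypothetical random values (the seed and the function name are illustrative only):

```python
import torch

def check_layer_forms(seed=0, d=2, hidden=4):
    """Compare the row-vector and column-vector forms of the first layer."""
    torch.manual_seed(seed)       # hypothetical seed, for this sketch only
    x = torch.randn(d)
    W1 = torch.randn(hidden, d)
    b1 = torch.randn(hidden)

    a1_row = x @ W1.T + b1        # form used in forward_torch
    a1_col = W1 @ x + b1          # form used in forward_single

    same_shape = a1_row.shape == a1_col.shape == (hidden,)
    same_values = torch.allclose(a1_row, a1_col, atol=1e-6)
    return same_shape, same_values

same_shape, same_values = check_layer_forms()
print(same_shape, same_values)
```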

## 8. Loss Function
#### Match NumPy MSE:

In [None]:
# ============================================
# 8. Loss Function (match NumPy exactly)
# ============================================

def mse_loss_t(f, y):
    """
    Half squared-error loss for a single sample (MSE over one sample, scaled by 1/2).
    Matches the NumPy definition L = 0.5 * (f - y)**2.
    f: torch scalar tensor
    y: torch scalar tensor (float32)
    Returns: torch scalar tensor
    """
    return 0.5 * (f - y)**2

## 9. Numerical Consistency Test (NumPy vs Torch)
We compare the output of the NumPy reference forward pass (Section 3) with the output of the PyTorch forward pass (Section 7) on the same input and parameters.

Pick a single sample:

In [None]:
# ============================================
# 9. Numerical Consistency Test (NumPy vs Torch)
# ============================================

i = 0  # pick sample
x_i_np = X_np[i]              # NumPy input
y_i    = y[i]                 # Torch scalar tensor target

# NumPy forward pass
a1_np, h_np, f_np = forward_single(x_i_np, W1, b1, W2, b2)   # scalar f_np (Python float)

# PyTorch forward pass
x_i_t = X[i]                  # torch.float32
a1_t, h_t, f_t = forward_torch(x_i_t, W1_t, b1_t, W2_t, b2_t)  # f_t is torch scalar tensor

f_t_val = f_t.item()

# Print comparison
print("NumPy output f_np =", f_np)
print("Torch output f_t =", f_t_val)
print("Absolute difference =", abs(f_np - f_t_val))

# Assertion — ensures consistency
assert abs(f_np - f_t_val) < 1e-6, "NumPy and PyTorch outputs diverge!"
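
The test above checks only the final output. A hedged sketch of how the intermediate values (`a1` and the hidden activations) could be compared as well, using `np.allclose`. It is self-contained and draws its own hypothetical parameters, so `compare_intermediates`, its seed, and its tolerance are assumptions rather than part of Notebook 01:

```python
import numpy as np
import torch

def compare_intermediates(seed=0, d=2, hidden=4, atol=1e-6):
    """Compare a1, hidden activations, and output between NumPy and Torch."""
    rng = np.random.default_rng(seed)  # hypothetical seed
    x = rng.uniform(-1, 1, size=(d,)).astype(np.float32)
    W1 = rng.normal(0.0, 0.1, size=(hidden, d)).astype(np.float32)
    W2 = rng.normal(0.0, 0.1, size=(1, hidden)).astype(np.float32)
    b1 = np.zeros((hidden,), dtype=np.float32)
    b2 = np.zeros((1,), dtype=np.float32)

    # NumPy forward pass
    a1_np = W1 @ x + b1
    h_np = np.maximum(0, a1_np)
    f_np = W2 @ h_np + b2

    # Torch forward pass on copies of the same arrays
    x_t, W1_t, W2_t, b1_t, b2_t = (torch.tensor(a) for a in (x, W1, W2, b1, b2))
    a1_t = x_t @ W1_t.T + b1_t
    h_t = torch.clamp(a1_t, min=0.0)
    f_t = W2_t @ h_t + b2_t

    return {
        "a1": bool(np.allclose(a1_np, a1_t.numpy(), atol=atol)),
        "h": bool(np.allclose(h_np, h_t.numpy(), atol=atol)),
        "f": bool(np.allclose(f_np, f_t.numpy(), atol=atol)),
    }

report = compare_intermediates()
print(report)
```

Checking each stage separately makes it easier to see which layer is responsible if the two implementations ever disagree.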


## 10. Manual Backward (No Autograd)

We now re-implement the **chain-rule gradients** from Notebook 01, but using
PyTorch tensors with `requires_grad=False`. This keeps autograd disabled while
showing how backprop works numerically in Torch.

For a single sample $(x, y)$ and forward pass
$f = W_2 \, \text{ReLU}(W_1 x + b_1) + b_2$ with loss
$L = \tfrac12 (f - y)^2$, the gradients are:

- $dL/df = f - y$
- $dL/dW_2 = (f - y)\,h^\top$
- $dL/db_2 = f - y$
- $dL/dh = W_2^\top (f - y)$
- $dL/da_1 = dL/dh \odot \mathbf{1}_{a_1 > 0}$
- $dL/dW_1 = (dL/da_1)\,x^\top$
- $dL/db_1 = dL/da_1$

In [None]:
# ============================================
# 10. Manual Backward (no autograd)
# ============================================

def backward_torch(x, y, a1, h, f, W1, W2):
    """
    Manual chain-rule gradients in PyTorch (no autograd).
    All inputs are torch tensors with requires_grad=False.
    Shapes:
        x:  (d,)
        a1: (h,)
        h:  (h,)
        f:  scalar tensor
        W1: (h, d)
        W2: (1, h)
        y:  scalar tensor
    Returns:
        dW1: (h, d)
        db1: (h,)
        dW2: (1, h)
        db2: (1,)
    """
    # dL/df
    df = f - y                # scalar tensor

    # Output layer
    dW2 = df * h[None, :]     # (1, h)
    db2 = df.unsqueeze(0)     # (1,)

    # Hidden layer
    dh  = W2[0] * df          # (h,)
    da1 = dh * (a1 > 0).float()  # ReLU'

    # Input layer
    dW1 = da1[:, None] @ x[None, :]  # (h, d)
    db1 = da1                        # (h,)

    return dW1, db1, dW2, db2
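
One way to gain confidence in hand-derived gradients without autograd is a central finite-difference check. The sketch below is self-contained and restates the forward and backward equations from this notebook. It assumes float64 (the notebook itself uses float32) so that the finite differences are less noisy, and the helper names `numeric_grad` and `run_gradient_check` are hypothetical. The check can be unreliable if a pre-activation lies very close to the ReLU kink at zero.

```python
import torch

def forward(x, W1, b1, W2, b2):
    a1 = x @ W1.T + b1
    h = torch.clamp(a1, min=0.0)
    f = (W2 @ h + b2).squeeze()
    return a1, h, f

def manual_grads(x, y, W1, b1, W2, b2):
    """Chain-rule gradients, in the order (dW1, db1, dW2, db2)."""
    a1, h, f = forward(x, W1, b1, W2, b2)
    df = f - y
    dW2 = df * h[None, :]
    db2 = df.unsqueeze(0)
    da1 = W2[0] * df * (a1 > 0).to(a1.dtype)
    dW1 = da1[:, None] @ x[None, :]
    return dW1, da1, dW2, db2

def numeric_grad(param, loss_fn, eps=1e-6):
    """Central differences, perturbing one entry of `param` at a time."""
    grad = torch.zeros_like(param)
    flat, gflat = param.view(-1), grad.view(-1)
    for k in range(flat.numel()):
        old = flat[k].item()
        flat[k] = old + eps
        loss_plus = loss_fn()
        flat[k] = old - eps
        loss_minus = loss_fn()
        flat[k] = old                     # restore the original value
        gflat[k] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def run_gradient_check(seed=0, d=2, hidden=4):
    torch.manual_seed(seed)               # hypothetical seed
    dt = torch.float64                    # assumption: float64 for this check only
    x = torch.rand(d, dtype=dt) * 2 - 1
    y = torch.tensor(1.0, dtype=dt)
    W1 = torch.randn(hidden, d, dtype=dt)
    b1 = torch.randn(hidden, dtype=dt) * 0.1
    W2 = torch.randn(1, hidden, dtype=dt)
    b2 = torch.zeros(1, dtype=dt)
    params = (W1, b1, W2, b2)

    def loss_fn():
        return (0.5 * (forward(x, *params)[2] - y) ** 2).item()

    analytic = manual_grads(x, y, *params)
    # Largest absolute gap between analytic and numeric gradient, per parameter
    return [(a - numeric_grad(p, loss_fn)).abs().max().item()
            for a, p in zip(analytic, params)]

max_errors = run_gradient_check()
print(max_errors)
```

Each entry of `max_errors` corresponds to one of `W1, b1, W2, b2`; small values suggest the formulas in Section 10 and the code agree.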

## 11. Tiny Training Loop (Manual SGD, No Autograd)

We now run a small training loop using **manual gradients** only. Autograd
remains disabled (`requires_grad=False`), and we update `W1_t, b1_t, W2_t, b2_t`
in place with per-sample SGD, sweeping the full dataset once per epoch.

In [None]:
# ============================================
# 11. Tiny Training Loop (manual SGD, no autograd)
# ============================================

# Re-initialize Torch parameters from the NumPy baseline
W1_t.copy_(torch.tensor(W1, dtype=torch.float32))
b1_t.copy_(torch.tensor(b1, dtype=torch.float32))
W2_t.copy_(torch.tensor(W2, dtype=torch.float32))
b2_t.copy_(torch.tensor(b2, dtype=torch.float32))

learning_rate = 0.1
num_epochs = 50

loss_history = []

for epoch in range(1, num_epochs + 1):
    epoch_loss = 0.0

    for i in range(N):
        x_i = X[i]       # (d,)
        y_i = y[i]       # scalar tensor

        # Forward (h_act avoids overwriting the hidden size `h` from Section 4)
        a1, h_act, f = forward_torch(x_i, W1_t, b1_t, W2_t, b2_t)  # f is scalar tensor
        loss = mse_loss_t(f, y_i)

        # Backward (manual)
        dW1, db1, dW2, db2 = backward_torch(x_i, y_i, a1, h_act, f, W1_t, W2_t)

        # SGD update (in-place)
        W1_t -= learning_rate * dW1
        b1_t -= learning_rate * db1
        W2_t -= learning_rate * dW2
        b2_t -= learning_rate * db2

        epoch_loss += loss.item()

    epoch_loss /= N
    loss_history.append(epoch_loss)

    if epoch % 10 == 0:
        print(f"epoch {epoch:03d} | mean loss {epoch_loss:.4f}")

## 12. Accuracy After Manual-Gradient Training

In [None]:
# ============================================
# 12. Accuracy after training
# ============================================

correct = 0
with torch.no_grad():
    for i in range(N):
        x_i = X[i]
        y_i = y[i]
        _, _, f = forward_torch(x_i, W1_t, b1_t, W2_t, b2_t)
        y_hat = (f >= 0.5).float()
        correct += (y_hat == y_i).item()

accuracy = correct / N
print(f"Final training accuracy (manual gradients): {accuracy:.4f}")
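
The per-sample loop above mirrors the single-sample forward pass. As a sketch only, the same accuracy could be computed in one batched pass. `forward_batch`, `accuracy_batch`, and `accuracy_loop` are hypothetical helpers, and the data and parameters below are random stand-ins, not the trained values from Section 11:

```python
import torch

def forward_batch(X, W1, b1, W2, b2):
    """Batched forward pass. X: (N, d). Returns outputs of shape (N,)."""
    A1 = X @ W1.T + b1                  # (N, h)
    H = torch.clamp(A1, min=0.0)        # (N, h)
    return (H @ W2.T + b2).squeeze(1)   # (N,)

def accuracy_batch(X, y, W1, b1, W2, b2, threshold=0.5):
    """Fraction of samples whose thresholded output equals the label."""
    preds = (forward_batch(X, W1, b1, W2, b2) >= threshold).float()
    return (preds == y).float().mean().item()

def accuracy_loop(X, y, W1, b1, W2, b2, threshold=0.5):
    """Per-sample loop, mirroring the cell above, for comparison."""
    correct = 0
    for i in range(X.shape[0]):
        a1 = X[i] @ W1.T + b1
        f = (W2 @ torch.clamp(a1, min=0.0) + b2).squeeze()
        correct += int((f >= threshold).float() == y[i])
    return correct / X.shape[0]

# Hypothetical random data and parameters, only to exercise the functions.
torch.manual_seed(0)
X_demo = torch.rand(8, 2) * 2 - 1
y_demo = (X_demo[:, 0] * X_demo[:, 1] < 0).float()
demo_params = (torch.randn(4, 2), torch.zeros(4), torch.randn(1, 4), torch.zeros(1))

acc_batched = accuracy_batch(X_demo, y_demo, *demo_params)
acc_looped = accuracy_loop(X_demo, y_demo, *demo_params)
print(acc_batched, acc_looped)
```

The batched form relies on broadcasting `b1` and `b2` across rows, so it expects `X` with shape `(N, d)` rather than a single `(d,)` vector.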

## 13. Conclusion

In this notebook we re-implemented the 2-layer neural network forward pass in
PyTorch **without autograd**, matching the NumPy reference model from Notebook 01
exactly. To guarantee full numerical parity, we replicated:

- the same dataset generation (same RNG, same sampling),
- the same parameter initialization (Gaussian weights, zero biases),
- the same forward equations (linear → ReLU → linear),
- the same loss definition $L = \tfrac12 (f - y)^2$,
- the same float32 dtype end-to-end.

After synchronization, the NumPy and PyTorch forward passes agreed to within
the $10^{-6}$ tolerance asserted in Section 9 on the tested sample.

We then extended this baseline with **manual backpropagation** implemented
directly in PyTorch tensors (`requires_grad=False`) and ran a small **SGD
training loop** using only our hand-derived gradients. Section 11 logs the mean
loss every 10 epochs, and Section 12 reports the final training accuracy on the
quadrant-classification task.

This notebook establishes the foundation for Notebook 03, where we introduce
**PyTorch autograd**, compare automatic gradients to our manual derivatives, and
validate that the computational graph reproduces the exact algebra from
Notebook 01.