# Day 13: Becoming a Backprop Ninja — Manual Tensor Backprop

**Building LLMs from Scratch** — Following Andrej Karpathy's makemore lectures.

---

## 1. Introduction

Today we **manually compute gradients** for tensor operations: matrix multiply, batch norm, and cross-entropy. We derive every gradient by hand and use autograd only to check the results against PyTorch. This is the "backprop ninja" exercise: understanding exactly how gradients flow through each operation.

We'll build a small network with:
- Linear layer: \( h = XW + b \)
- Batch normalization: normalize, then scale \( \gamma \) and shift \( \beta \)
- Output layer: \( \text{logits} = h_{bn} W_2 + b_2 \)
- Cross-entropy loss

For each step, we'll compute the gradients manually and compare with `torch.autograd`.

## 2. Setup

Import torch, set seed, create a small dataset.

In [None]:
import torch
import torch.nn.functional as F

torch.manual_seed(42)

# Small dataset: X (4, 10), Y (4,) with targets 0-4
X = torch.randn(4, 10)
Y = torch.randint(0, 5, (4,))  # 5 classes: 0, 1, 2, 3, 4

print(f"X shape: {X.shape}")
print(f"Y shape: {Y.shape}")
print(f"Y values: {Y.tolist()}")

## 3. Forward Pass

Build the full forward pass: linear → batch norm → output layer → cross-entropy.

We use `unbiased=False` for variance to match typical BatchNorm (divide by N, not N-1).
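
To make the difference between the two variance conventions concrete, here is a minimal sketch. The tensor `h_demo` is a hypothetical stand-in for the hidden activations and is not part of the notebook's network.

```python
import torch

torch.manual_seed(0)
h_demo = torch.randn(4, 20)  # hypothetical stand-in for a (batch, features) activation

n = h_demo.shape[0]
var_biased = h_demo.var(0, unbiased=False)   # divides by N
var_unbiased = h_demo.var(0, unbiased=True)  # divides by N - 1

# The biased variance is the mean squared deviation from the batch mean.
var_by_hand = ((h_demo - h_demo.mean(0)) ** 2).mean(0)

print(torch.allclose(var_biased, var_by_hand, atol=1e-6))
# The two conventions differ only by the factor N / (N - 1).
print(torch.allclose(var_biased * n / (n - 1), var_unbiased, atol=1e-6))
```

With a batch of only 4 rows the factor is 4/3, so picking the wrong convention would visibly change the normalized activations and the gradients derived from them.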

In [None]:
torch.manual_seed(42)

# Parameters (all set requires_grad=True for later autograd verification)
W = torch.randn(10, 20, requires_grad=True)
b = torch.randn(20, requires_grad=True)
gamma = torch.ones(20, requires_grad=True)
beta = torch.zeros(20, requires_grad=True)
W2 = torch.randn(20, 5, requires_grad=True)
b2 = torch.randn(5, requires_grad=True)

eps = 1e-5

# Forward pass
h = X @ W + b                                    # (4, 20)
h_mean = h.mean(0)                               # (20,)
h_var = h.var(0, unbiased=False)                 # (20,) - BatchNorm uses N, not N-1
h_norm = (h - h_mean) / (h_var + eps).sqrt()     # (4, 20)
h_bn = gamma * h_norm + beta                     # (4, 20)
logits = h_bn @ W2 + b2                          # (4, 5)
loss = F.cross_entropy(logits, Y)

# Retain gradients on intermediates for manual backprop verification
h.retain_grad()
h_norm.retain_grad()
h_bn.retain_grad()
logits.retain_grad()

print(f"h shape: {h.shape}")
print(f"h_bn shape: {h_bn.shape}")
print(f"logits shape: {logits.shape}")
print(f"loss: {loss.item():.4f}")

## 4. Manual Gradients for Matrix Multiply

For \( h = XW + b \), the chain rule gives:
- \( \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial h} \)
- \( \frac{\partial L}{\partial X} = \frac{\partial L}{\partial h} W^T \)
- \( \frac{\partial L}{\partial b} = \sum_i \frac{\partial L}{\partial h_i} \) (sum over the batch index \( i \))

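The upstream gradient `dh` flows in from BatchNorm. Given `dh`, we derive `dW`, `dX`, and `db` by hand and verify `dW` and `db` against autograd (`X` is input data without `requires_grad`, so it has no `.grad` to compare against).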
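
The three matrix-multiply rules can be tried in isolation on a toy problem. The sketch below uses hypothetical names (`X_demo`, `W_demo`, `b_demo`) and an arbitrary loss, the sum of squares of `h`, chosen only because its upstream gradient is easy to write down. Here the input is given `requires_grad=True` so that `dX` can be checked as well.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy shapes: batch of 3, 4 inputs, 2 outputs.
X_demo = torch.randn(3, 4, requires_grad=True)
W_demo = torch.randn(4, 2, requires_grad=True)
b_demo = torch.randn(2, requires_grad=True)

h_demo = X_demo @ W_demo + b_demo   # (3, 2); b_demo is broadcast over the batch
loss_demo = (h_demo ** 2).sum()     # arbitrary scalar loss, so dL/dh = 2 * h
loss_demo.backward()

dh_demo = 2 * h_demo.detach()       # upstream gradient dL/dh, shape (3, 2)

dW_demo = X_demo.detach().T @ dh_demo   # (4, 3) @ (3, 2) -> (4, 2), shape of W_demo
dX_demo = dh_demo @ W_demo.detach().T   # (3, 2) @ (2, 4) -> (3, 4), shape of X_demo
db_demo = dh_demo.sum(0)                # broadcast in forward -> sum over batch in backward

print(torch.allclose(dW_demo, W_demo.grad, atol=1e-5))
print(torch.allclose(dX_demo, X_demo.grad, atol=1e-5))
print(torch.allclose(db_demo, b_demo.grad, atol=1e-5))
```

A useful habit is to check shapes first: each gradient must have the shape of the tensor it belongs to, which leaves only one way to arrange the transposes.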

## 5. Manual Gradients for BatchNorm

For \( h_{bn} = \gamma \cdot h_{norm} + \beta \), with sums running over the batch index \( i \) (`.sum(0)` in code):
- \( d\gamma = \sum_i dh_{bn,i} \cdot h_{norm,i} \)
- \( d\beta = \sum_i dh_{bn,i} \)
- \( dh_{norm} = dh_{bn} \cdot \gamma \)

For the normalization \( h_{norm} = (h - \mu) / \sqrt{\sigma^2 + \epsilon} \), both the batch mean \( \mu \) and the batch variance \( \sigma^2 \) depend on \( h \). Backpropagating through them gives:
\( dh = \frac{1}{N \sqrt{\sigma^2 + \epsilon}} \left( N \cdot dh_{norm} - \sum_i dh_{norm,i} - h_{norm} \cdot \sum_i dh_{norm,i} \cdot h_{norm,i} \right) \)
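
One way to sanity-check the normalization backward formula on its own is the sketch below. It assumes a hypothetical batch `h_demo` of shape (8, 3) in float64 and an arbitrary upstream gradient `dh_norm_demo`. None of these names come from the notebook, and `std` denotes the square root of variance plus epsilon, as in the forward pass.

```python
import torch

torch.manual_seed(0)
eps = 1e-5

# Hypothetical batch: 8 examples, 3 features, float64 for a tight comparison.
h_demo = torch.randn(8, 3, dtype=torch.float64, requires_grad=True)
mu = h_demo.mean(0)
var = h_demo.var(0, unbiased=False)   # biased variance, as in the forward pass
std = (var + eps).sqrt()
h_norm_demo = (h_demo - mu) / std

# Arbitrary upstream gradient flowing into h_norm_demo.
dh_norm_demo = torch.randn(8, 3, dtype=torch.float64)
h_norm_demo.backward(dh_norm_demo)    # autograd reference lands in h_demo.grad

N = h_demo.shape[0]
with torch.no_grad():
    dh_demo = (1.0 / (N * std)) * (
        N * dh_norm_demo
        - dh_norm_demo.sum(0)
        - h_norm_demo * (dh_norm_demo * h_norm_demo).sum(0)
    )

print(torch.allclose(dh_demo, h_demo.grad))
# Each column of dh sums to zero: shifting every input in a feature by the
# same constant leaves h_norm unchanged.
print(torch.allclose(dh_demo.sum(0), torch.zeros(3, dtype=torch.float64), atol=1e-12))
```

The two subtracted terms are the contributions that reach `h` through the mean and through the variance. Dropping either one makes the comparison fail.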

## 6. Manual Gradients for Cross-Entropy

For softmax and cross-entropy combined, the gradient with respect to the logits has a simple closed form:
$$ \frac{\partial L}{\partial \text{logits}} = \frac{\text{softmax(logits)} - \text{one\_hot}(Y)}{N} $$

where \( N \) is the batch size (since `F.cross_entropy` uses `reduction='mean'` by default).
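
As a check that does not rely on autograd at all, the closed form can be compared with central finite differences. This is a sketch under stated assumptions: the logits `logits_demo` and targets `targets_demo` are hypothetical toy values, and float64 is used so that the numerical derivative is accurate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy problem: 3 examples, 5 classes.
logits_demo = torch.randn(3, 5, dtype=torch.float64)
targets_demo = torch.tensor([0, 2, 4])
n = logits_demo.shape[0]

# Closed-form gradient of the mean cross-entropy loss.
analytic = (F.softmax(logits_demo, dim=1) - F.one_hot(targets_demo, num_classes=5)) / n

# Central finite differences: nudge one logit at a time.
step = 1e-6
numeric = torch.zeros_like(logits_demo)
for i in range(logits_demo.shape[0]):
    for j in range(logits_demo.shape[1]):
        plus = logits_demo.clone()
        minus = logits_demo.clone()
        plus[i, j] += step
        minus[i, j] -= step
        loss_plus = F.cross_entropy(plus, targets_demo)
        loss_minus = F.cross_entropy(minus, targets_demo)
        numeric[i, j] = (loss_plus - loss_minus) / (2 * step)

print(torch.allclose(analytic, numeric, atol=1e-7))
# Each row sums to zero because softmax and one-hot rows both sum to 1.
print(torch.allclose(analytic.sum(1), torch.zeros(3, dtype=torch.float64), atol=1e-12))
```

The gradient pushes the correct class's logit up (its entry is negative) and every other logit down in proportion to its predicted probability.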

In [None]:
# Run autograd to get reference gradients (retain_grad was set in forward pass)
loss.backward()

# --- Manual: Matrix Multiply (h = X @ W + b) ---
# Upstream gradient comes from BatchNorm: dh = gradient flowing into h
dh = h.grad  # from autograd (we'll verify our manual dh later)

dW_manual = X.T @ dh
dX_manual = dh @ W.T  # not verified below: X is plain data (no requires_grad), so X.grad is None
db_manual = dh.sum(0)

print("=== Matrix Multiply Gradients ===")
print(f"dW match: {torch.allclose(dW_manual, W.grad, atol=1e-5)}")
print(f"db match: {torch.allclose(db_manual, b.grad, atol=1e-5)}")

In [None]:
# --- Manual: BatchNorm (h_bn = gamma * h_norm + beta) ---
dh_bn = h_bn.grad  # from autograd

dgamma_manual = (dh_bn * h_norm).sum(0)
dbeta_manual = dh_bn.sum(0)
dh_norm_manual = dh_bn * gamma

# Backward through normalization: h_norm = (h - h_mean) / std
N = h.shape[0]
std = (h_var + eps).sqrt()
dh_manual_bn = (1 / (N * std)) * (
    N * dh_norm_manual
    - dh_norm_manual.sum(0)
    - h_norm * (dh_norm_manual * h_norm).sum(0)
)

print("=== BatchNorm Gradients ===")
print(f"dgamma match: {torch.allclose(dgamma_manual, gamma.grad, atol=1e-5)}")
print(f"dbeta match: {torch.allclose(dbeta_manual, beta.grad, atol=1e-5)}")
print(f"dh (from BN) match: {torch.allclose(dh_manual_bn, h.grad, atol=1e-5)}")

In [None]:
# --- Manual: Cross-Entropy ---
# dlogits = (softmax(logits) - one_hot(Y)) / N
probs = F.softmax(logits, dim=1)
one_hot = F.one_hot(Y, num_classes=5).float()
dlogits_manual = (probs - one_hot) / X.shape[0]

print("=== Cross-Entropy Gradient ===")
print(f"dlogits match: {torch.allclose(dlogits_manual, logits.grad, atol=1e-5)}")
print("\nElegant formula: dlogits = (softmax(logits) - one_hot(Y)) / batch_size")

## 7. Full Verification

Reusing the tensors from the forward pass in Section 3, compute *all* manual gradients from scratch, starting at the loss and using no `.grad` from intermediates. Compare every parameter gradient against autograd with `torch.allclose(manual, auto, atol=1e-5)`.

In [None]:
# Full verification: compute ALL gradients manually from dlogits downward
# We need dlogits first (from cross-entropy), then propagate backward

# 1. Cross-entropy: dlogits
probs = F.softmax(logits, dim=1)
one_hot = F.one_hot(Y, num_classes=5).float()
dlogits = (probs - one_hot) / X.shape[0]

# 2. Output layer: logits = h_bn @ W2 + b2
dh_bn = dlogits @ W2.T
dW2_manual = h_bn.T @ dlogits
db2_manual = dlogits.sum(0)

# 3. BatchNorm: h_bn = gamma * h_norm + beta
dgamma = (dh_bn * h_norm).sum(0)
dbeta = dh_bn.sum(0)
dh_norm = dh_bn * gamma

# 4. Normalization: h_norm = (h - mean) / std
N = h.shape[0]
std = (h_var + eps).sqrt()
dh = (1 / (N * std)) * (
    N * dh_norm - dh_norm.sum(0) - h_norm * (dh_norm * h_norm).sum(0)
)

# 5. Linear: h = X @ W + b
dW_manual = X.T @ dh
db_manual = dh.sum(0)

# Verify all (autograd already ran in previous cells)
checks = [
    ("W", dW_manual, W.grad),
    ("b", db_manual, b.grad),
    ("gamma", dgamma, gamma.grad),
    ("beta", dbeta, beta.grad),
    ("W2", dW2_manual, W2.grad),
    ("b2", db2_manual, b2.grad),
]
print("=== Full Verification ===")
all_ok = True
for name, manual, auto in checks:
    ok = torch.allclose(manual, auto, atol=1e-5)
    all_ok = all_ok and ok
    print(f"  {name}: {ok}")
print(f"\nAll gradients match: {all_ok}")
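
A boolean from `torch.allclose` hides how close the match is. The sketch below shows one possible reporting helper. The name `report_match` and the toy tensors are hypothetical and not part of the notebook.

```python
import torch


def report_match(name, manual, auto, atol=1e-5):
    """Hypothetical helper: compare a manual gradient with the autograd one."""
    exact = torch.equal(manual, auto)                # bit-for-bit identical
    approx = torch.allclose(manual, auto, atol=atol)
    max_diff = (manual - auto).abs().max().item()    # largest elementwise gap
    print(f"{name:8s} | exact: {str(exact):5s} | approx: {str(approx):5s} | max diff: {max_diff:.2e}")
    return approx


# Toy usage: out = x @ w with a sum loss, so dL/dout is all ones.
torch.manual_seed(0)
w_demo = torch.randn(5, 3, requires_grad=True)
x_demo = torch.randn(4, 5)
(x_demo @ w_demo).sum().backward()

dw_demo = x_demo.T @ torch.ones(4, 3)   # (5, 4) @ (4, 3) -> (5, 3)
ok = report_match("w_demo", dw_demo, w_demo.grad)
```

Reporting the maximum difference helps separate ordinary floating-point noise, which is typically far below the tolerance, from a formula that is actually wrong.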

---

**Building LLMs from Scratch** — [Day 13: Manual Backprop](https://omkarray.com/llm-day13.html) | [← Prev](llm_day12_batchnorm.ipynb) | [Next →](llm_day14.ipynb)