# 2-Layer MLP Backpropagation (ReLU + Softmax-CE)

We implement a 2-layer Multi-Layer Perceptron (MLP) and its **backward pass (backpropagation)**.

---

### Forward pass

- Input: $ X \in \mathbb{R}^{N \times d} $ 
- First affine:  
  $$
  Z_1 = X W_1 + b_1, \quad W_1 \in \mathbb{R}^{d \times H}, \, b_1 \in \mathbb{R}^H
  $$
- ReLU activation:  
  $$
  H_1 = \max(0, Z_1)
  $$
- Second affine (logits):  
  $$
  S = H_1 W_2 + b_2, \quad W_2 \in \mathbb{R}^{H \times K}, \, b_2 \in \mathbb{R}^K
  $$
- Softmax probabilities:  
  $$
  P = \mathrm{softmax}(S)
  $$
- Cross-entropy loss (with one-hot labels $Y$):  
  $$
  \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} Y_{ik}\,\log P_{ik}
  $$
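
As a quick sanity check on the shapes in the forward pass, here is a minimal sketch. The sizes are hypothetical and chosen only for illustration; the variable names mirror the symbols above. It also checks that the double sum over one-hot labels equals the mean negative log-probability of the true class.

```python
import torch

torch.manual_seed(0)

# Hypothetical small sizes, chosen only for illustration
N, d, H, K = 4, 3, 5, 2
X = torch.randn(N, d)
y = torch.randint(0, K, (N,))
Y = torch.nn.functional.one_hot(y, num_classes=K).float()  # (N, K)

W1, b1 = torch.randn(d, H) * 0.1, torch.zeros(H)
W2, b2 = torch.randn(H, K) * 0.1, torch.zeros(K)

Z1 = X @ W1 + b1                # (N, H); b1 is broadcast across the N rows
H1 = torch.clamp(Z1, min=0.0)   # ReLU, applied elementwise
S = H1 @ W2 + b2                # (N, K) logits
P = torch.softmax(S, dim=1)     # each row sums to 1

# Double sum over one-hot labels ...
loss_onehot = -(Y * P.log()).sum() / N
# ... equals the mean negative log-probability of the true class
loss_gather = -P[torch.arange(N), y].log().mean()

print(Z1.shape, S.shape, loss_onehot.item(), loss_gather.item())
```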

---

### Backward pass

Define the gradient of the loss with respect to the logits:
$$
\frac{\partial \mathcal{L}}{\partial S} = \frac{P - Y}{N}.
$$
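
For readers who want the intermediate step, here is a short derivation sketch for a single sample $i$ with logits $s = S_{i,:}$, probabilities $p = P_{i,:}$, and one-hot label $y = Y_{i,:}$. The per-sample loss symbol $\ell_i$ and the Kronecker delta $\delta_{km}$ are notation introduced here, not in the original text.

$$
\begin{aligned}
\log p_k &= s_k - \log \sum_{j=1}^{K} e^{s_j}
\quad\Longrightarrow\quad
\frac{\partial \log p_k}{\partial s_m} = \delta_{km} - p_m, \\
\ell_i &= -\sum_{k=1}^{K} y_k \log p_k
\quad\Longrightarrow\quad
\frac{\partial \ell_i}{\partial s_m}
= -\sum_{k=1}^{K} y_k \,(\delta_{km} - p_m)
= p_m \sum_{k=1}^{K} y_k - y_m
= p_m - y_m,
\end{aligned}
$$

where the last step uses $\sum_k y_k = 1$ for a one-hot label. Since $\mathcal{L} = \frac{1}{N}\sum_i \ell_i$, stacking the rows gives $\frac{\partial \mathcal{L}}{\partial S} = \frac{P - Y}{N}$.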

Then:
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial W_2} &= H_1^\top \frac{\partial \mathcal{L}}{\partial S}, \quad &
\frac{\partial \mathcal{L}}{\partial b_2} &= \mathbf{1}^\top \frac{\partial \mathcal{L}}{\partial S}, \\
\frac{\partial \mathcal{L}}{\partial H_1} &= \left(\frac{\partial \mathcal{L}}{\partial S}\right) W_2^\top, \quad &
\frac{\partial \mathcal{L}}{\partial Z_1} &= \frac{\partial \mathcal{L}}{\partial H_1} \odot \mathbf{1}[Z_1 > 0], \\
\frac{\partial \mathcal{L}}{\partial W_1} &= X^\top \frac{\partial \mathcal{L}}{\partial Z_1}, \quad &
\frac{\partial \mathcal{L}}{\partial b_1} &= \mathbf{1}^\top \frac{\partial \mathcal{L}}{\partial Z_1}.
\end{aligned}
$$
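
One way to gain confidence in these formulas is to compare them against autograd on a small random problem. The following is a minimal sketch under assumed sizes; the `checks` dictionary is a hypothetical helper, and `float64` is used only to keep rounding error small.

```python
import torch

torch.manual_seed(0)

# Hypothetical small problem, float64 to keep rounding error low
N, d, H, K = 8, 4, 6, 3
X = torch.randn(N, d, dtype=torch.float64)
y = torch.randint(0, K, (N,))
Y = torch.nn.functional.one_hot(y, num_classes=K).to(torch.float64)

# Leaf tensors with requires_grad=True so autograd produces reference gradients
W1 = (torch.randn(d, H, dtype=torch.float64) * 0.5).requires_grad_()
b1 = torch.zeros(H, dtype=torch.float64, requires_grad=True)
W2 = (torch.randn(H, K, dtype=torch.float64) * 0.5).requires_grad_()
b2 = torch.zeros(K, dtype=torch.float64, requires_grad=True)

# Forward pass, tracked by autograd
Z1 = X @ W1 + b1
H1 = torch.relu(Z1)
S = H1 @ W2 + b2
P = torch.softmax(S, dim=1)
loss = -(Y * torch.log(P)).sum() / N
loss.backward()

# Manual gradients from the formulas above (no graph needed)
with torch.no_grad():
    dS = (P - Y) / N
    dW2 = H1.t() @ dS
    db2 = dS.sum(dim=0)
    dH1 = dS @ W2.t()
    dZ1 = dH1 * (Z1 > 0).to(dH1.dtype)
    dW1 = X.t() @ dZ1
    db1 = dZ1.sum(dim=0)

checks = {
    "W1": torch.allclose(dW1, W1.grad),
    "b1": torch.allclose(db1, b1.grad),
    "W2": torch.allclose(dW2, W2.grad),
    "b2": torch.allclose(db2, b2.grad),
}
print(checks)
```

If any entry is `False`, the usual suspects are a missing transpose, a forgotten $1/N$ factor, or a ReLU mask computed from $H_1$ after an in-place modification.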

---

### Parameter update (SGD)

$$
\theta \leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}.
$$

---

Together, these formulas give the complete **manual backpropagation** for a 2-layer MLP.


### 🔧 Manual Backprop for a 2-Layer MLP

This block implements a **2-layer MLP with ReLU + Softmax-CrossEntropy** using **manual backpropagation** (no autograd).  

Steps:
1. **Forward pass**  
   - Compute hidden layer: $Z_1 = XW_1 + b_1$, apply ReLU.  
   - Compute logits: $S = H_1 W_2 + b_2$.  
   - Apply softmax for class probabilities.  
   - Compute cross-entropy loss.  

2. **Backward pass**  
   - Start from $\frac{\partial \mathcal{L}}{\partial S} = (P-Y)/N$.  
   - Propagate gradients backward through $W_2, b_2$, then through ReLU, then $W_1, b_1$.  

3. **Update step**  
   - Parameters updated with vanilla SGD:  
     $$
     \theta \leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}.
     $$

This shows the math of backprop in code, without relying on PyTorch’s `autograd`.


In [1]:
import torch
torch.manual_seed(0)

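# Tiny toy dataset: random inputs (N, d) with random labels in K classes,
# so the reported accuracy is training accuracy (the model memorizes the data)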
N, d, H, K = 128, 20, 64, 5
X = torch.randn(N, d)
y = torch.randint(0, K, (N,))

# One-hot labels
Y = torch.zeros(N, K)
Y[torch.arange(N), y] = 1.0

# Parameters
W1 = torch.randn(d, H) * 0.02
b1 = torch.zeros(H)
W2 = torch.randn(H, K) * 0.02
b2 = torch.zeros(K)

lr = 0.5
epochs = 200

def softmax(logits):
    # stable softmax
    z = logits - logits.max(dim=1, keepdim=True).values
    expz = torch.exp(z)
    return expz / expz.sum(dim=1, keepdim=True)

for epoch in range(epochs):
    # Forward
    Z1 = X @ W1 + b1
    H1 = torch.clamp(Z1, min=0.0)
    S  = H1 @ W2 + b2
    P  = softmax(S)

    # Cross-entropy
    loss = -(Y * (P+1e-12).log()).sum() / N

    # Backward (manual)
    dS  = (P - Y) / N
    dW2 = H1.t() @ dS
    db2 = dS.sum(dim=0)

    dH1 = dS @ W2.t()
    dZ1 = dH1 * (Z1 > 0).float()

    dW1 = X.t() @ dZ1
    db1 = dZ1.sum(dim=0)

    # SGD step
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if (epoch+1) % 50 == 0:
        pred = P.argmax(dim=1)
        acc = (pred == y).float().mean().item()
        print(f"epoch {epoch+1:3d} | loss {loss.item():.4f} | acc {acc:.3f}")


epoch  50 | loss 0.9842 | acc 0.625
epoch 100 | loss 0.3256 | acc 0.969
epoch 150 | loss 0.1020 | acc 1.000
epoch 200 | loss 0.0477 | acc 1.000


### ⚡ 2-Layer MLP with Autograd (`nn.Sequential`)

In this version, we let **PyTorch’s autograd** handle the backward pass.  

- The model is defined with `nn.Sequential`:  
  1. Linear layer ($d \to H$)  
  2. ReLU activation  
  3. Linear layer ($H \to K$)  

- Loss: `nn.CrossEntropyLoss`, which internally applies **log-softmax + NLL**.  
- Optimizer: `torch.optim.SGD`, whose `step()` applies the SGD update using the gradients that `loss.backward()` has just computed.

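This is the same architecture as before, but with **automatic differentiation** instead of manual gradient formulas. Note that `nn.Linear` uses its own default initialization rather than the `0.02`-scaled normal weights above, so the loss values differ from the manual run.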
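
To make the "log-softmax + NLL" statement concrete, here is a small sketch on random logits (the shapes and names are illustrative assumptions). It also shows why the `nn.Sequential` model ends with a plain `nn.Linear` and no softmax layer: the loss expects raw logits.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw logits for 6 samples and 5 classes
logits = torch.randn(6, 5)
targets = torch.randint(0, 5, (6,))

# One call on raw logits ...
loss_ce = nn.CrossEntropyLoss()(logits, targets)
# ... matches log-softmax followed by negative log-likelihood
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_ce.item(), loss_two_step.item())
```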


In [2]:
import torch
from torch import nn

torch.manual_seed(0)

# Same synthetic data
N, d, H, K = 128, 20, 64, 5
X = torch.randn(N, d)
y = torch.randint(0, K, (N,))

model = nn.Sequential(
    nn.Linear(d, H),
    nn.ReLU(),
    nn.Linear(H, K)
)

criterion = nn.CrossEntropyLoss()
optim = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(200):
    optim.zero_grad()
    logits = model(X)
    loss = criterion(logits, y)
    loss.backward()                        # autograd handles backprop
    optim.step()

    if (epoch+1) % 50 == 0:
        acc = (logits.argmax(1) == y).float().mean().item()
        print(f"epoch {epoch+1:3d} | loss {loss.item():.4f} | acc {acc:.3f}")


epoch  50 | loss 0.4812 | acc 0.922
epoch 100 | loss 0.1280 | acc 1.000
epoch 150 | loss 0.0550 | acc 1.000
epoch 200 | loss 0.0318 | acc 1.000
