## 2-Layer MLP Forward Pass (with ReLU)

**Goal.** Implement the forward pass of a 2-layer multilayer perceptron (MLP) for a mini-batch.

**Notation & Shapes**
- $X \in \mathbb{R}^{N \times d_{\text{in}}}$: input batch (N = batch size).
- $W_1 \in \mathbb{R}^{d_{\text{in}} \times h}$, $b_1 \in \mathbb{R}^{h}$: first layer params.
- $W_2 \in \mathbb{R}^{h \times d_{\text{out}}}$, $b_2 \in \mathbb{R}^{d_{\text{out}}}$: second layer params.

**Forward equations**
$$
Z_1 = X W_1 + b_1,\qquad
A_1 = \mathrm{ReLU}(Z_1) = \max(0, Z_1),\qquad
Z_2 = A_1 W_2 + b_2.
$$

- For classification, the rows of $Z_2$ are logits. Applying softmax row-wise turns them into class probabilities (optional):
$$
\hat{Y} = \mathrm{softmax}(Z_2).
$$

**Key ideas**
- Bias terms are broadcast across the batch dimension.
- ReLU keeps positive values and zeros out negatives.
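- With ReLU, He/Kaiming initialization (weight variance $2/n_{\text{in}}$, where $n_{\text{in}}$ is the layer's input width) keeps the activation scale roughly stable from layer to layer.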
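
The initialization claim can be checked empirically. The sketch below pushes standard-normal inputs through a stack of ReLU layers and compares the mean squared activation under He scaling (variance $2/n_{\text{in}}$) with a plain $1/n_{\text{in}}$ scaling. The depth, width, and the helper name `second_moment_after` are illustrative assumptions, not part of the notebook.

```python
import torch

torch.manual_seed(0)

def second_moment_after(depth: int, width: int, gain: float) -> float:
    """Mean squared activation after `depth` bias-free ReLU layers
    whose weights have std = sqrt(gain / fan_in)."""
    x = torch.randn(1024, width)
    for _ in range(depth):
        W = torch.randn(width, width) * (gain / width) ** 0.5
        x = torch.clamp(x @ W, min=0.0)  # affine map followed by ReLU
    return x.pow(2).mean().item()

he = second_moment_after(depth=10, width=512, gain=2.0)     # He/Kaiming scaling
naive = second_moment_after(depth=10, width=512, gain=1.0)  # no ReLU correction
print(f"He: {he:.3f}   naive: {naive:.5f}")
```

With gain 2 the activation scale should stay of order 1, while with gain 1 it roughly halves at every layer, so the signal shrinks quickly in deeper stacks.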

**Complexity**
- Dominated by matrix multiplies: $O(N\,d_{\text{in}}\,h) + O(N\,h\,d_{\text{out}})$.
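
To make the cost concrete, the sketch below counts multiply-accumulate operations for the two matrix multiplies, using the sizes from this notebook (`N=8`, `d_in=16`, `h=32`, `d_out=10`). The helper `mlp_forward_macs` is a hypothetical name introduced only for this illustration; bias additions and the ReLU are ignored because they are lower-order terms.

```python
def mlp_forward_macs(N: int, d_in: int, h: int, d_out: int) -> dict:
    """Multiply-accumulate counts for X @ W1 and A1 @ W2."""
    layer1 = N * d_in * h    # (N x d_in) @ (d_in x h)
    layer2 = N * h * d_out   # (N x h) @ (h x d_out)
    return {"layer1": layer1, "layer2": layer2, "total": layer1 + layer2}

macs = mlp_forward_macs(N=8, d_in=16, h=32, d_out=10)
print(macs)  # {'layer1': 4096, 'layer2': 2560, 'total': 6656}
```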


In [1]:
#imports
import torch
import torch.nn as nn
import math
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split

In [2]:
torch.manual_seed(0)

# Hyperparameters
N = 8             # batch size
d_in = 16         # input features
h = 32            # hidden size
d_out = 10        # output size

# Fake input batch
X = torch.randn(N, d_in)

# He/Kaiming init for ReLU layers
# W1 ~ N(0, 2/d_in), W2 ~ N(0, 2/h), i.e. std = sqrt(2 / fan_in)
W1 = torch.randn(d_in, h) * (2.0 / d_in) ** 0.5
b1 = torch.zeros(h)

W2 = torch.randn(h, d_out) * (2.0 / h) ** 0.5
b2 = torch.zeros(d_out)

# Forward pass
# 1) First affine
Z1 = X @ W1 + b1 

# 2) ReLU
A1 = torch.clamp(Z1, min=0.0) 

# 3) Second affine (logits)
Z2 = A1 @ W2 + b2

print("Z1 shape:", Z1.shape)
print("A1 shape:", A1.shape)
print("Z2 (logits) shape:", Z2.shape)

probs = torch.softmax(Z2, dim=1)
print("probs rows sum to ~1:", probs[0].sum().item())


Z1 shape: torch.Size([8, 32])
A1 shape: torch.Size([8, 32])
Z2 (logits) shape: torch.Size([8, 10])
probs rows sum to ~1: 1.0


### MLP with `nn.Module` / `nn.Sequential`

We can build the same two-layer network with high-level building blocks:

- `nn.Linear(d_in, h)` → `nn.ReLU()` → `nn.Linear(h, d_out)`

By default, `nn.Linear` includes a learnable bias. For ReLU networks, it’s common to use
Kaiming/He initialization:
```python
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)
```


In [3]:
d_in, h, d_out = 16, 32, 10

# Option A: nn.Sequential
mlp = nn.Sequential(
    nn.Linear(d_in, h),
    nn.ReLU(),
    nn.Linear(h, d_out)   # logits
)

# (Optional) Kaiming init for both Linear layers
for m in mlp:
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

X = torch.randn(8, d_in)
logits = mlp(X)                 # forward
probs = torch.softmax(logits, dim=1)

print("logits shape:", logits.shape)
print("probs row sum:", probs[0].sum().item())


logits shape: torch.Size([8, 10])
probs row sum: 1.0


# Full Version
## 2-Layer MLP (ReLU): class, loss/opt, and a full training loop

**What the cells below do**
- Defines a small 2-layer MLP: `Linear(d_in → h) → ReLU → Linear(h → d_out)`  
- Builds a *learnable* synthetic multi-class dataset (labels made by a fixed “teacher” MLP)  
- Sets up `CrossEntropyLoss` (raw logits + integer labels) and the `Adam` optimizer  
- Trains with mini-batches, prints train/val loss and accuracy each epoch

---

### Shapes & forward equations
Let a batch $X \in \mathbb{R}^{N \times d_{\text{in}}}$.

- $W_1 \in \mathbb{R}^{d_{\text{in}} \times h},\; b_1 \in \mathbb{R}^{h}$  
- $W_2 \in \mathbb{R}^{h \times d_{\text{out}}},\; b_2 \in \mathbb{R}^{d_{\text{out}}}$

**Forward pass**
$$
Z_1 = X W_1 + b_1, \qquad
A_1 = \mathrm{ReLU}(Z_1) = \max(0, Z_1), \qquad
Z_2 = A_1 W_2 + b_2 \quad (\text{logits})
$$

For classification, probabilities are:
$$
p_{i,k} = \frac{\exp(Z_{2,i,k})}{\sum_{j=1}^{d_{\text{out}}} \exp(Z_{2,i,j})}
$$
(Softmax is **implicit** inside `CrossEntropyLoss`, so we pass raw logits.)

**Cross-entropy loss (averaged over the batch)**
$$
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(Z_{2,i,y_i})}{\sum_{k=1}^{d_{\text{out}}} \exp(Z_{2,i,k})}
$$
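
As a check on the formula above, this sketch computes the batch-averaged loss by hand with `logsumexp` and compares it with `F.cross_entropy`, which applies log-softmax internally. The random logits and labels are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d_out = 8, 10
logits = torch.randn(N, d_out)       # stand-in for Z_2
y = torch.randint(0, d_out, (N,))    # integer class labels in [0, d_out)

# log p_{i, y_i} = Z_{2,i,y_i} - log sum_k exp(Z_{2,i,k})
log_p_true = logits[torch.arange(N), y] - torch.logsumexp(logits, dim=1)
manual_loss = -log_p_true.mean()

builtin_loss = F.cross_entropy(logits, y)  # default reduction="mean"
print(manual_loss.item(), builtin_loss.item())
```

The two values should agree up to floating-point error, which is why the model returns raw logits and no explicit softmax layer is needed before the loss.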

**Accuracy**
$$
\text{acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\arg\max_k Z_{2,i,k} = y_i\right]
$$
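
The formula averages over samples, not over batches. When accuracy is accumulated across mini-batches of unequal size, counting correct predictions and dividing by the number of samples seen reproduces it exactly, whereas a plain mean of per-batch accuracies gives a smaller final batch extra weight. A minimal sketch with made-up logits and a batch size of 4 (both are assumptions for illustration):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(10, 3)
targets = torch.randint(0, 3, (10,))

def accuracy(logits, targets):
    return (logits.argmax(dim=1) == targets).float().mean().item()

full = accuracy(logits, targets)  # the formula applied to all 10 samples

correct, seen, batch_accs = 0, 0, []
for lb, tb in zip(logits.split(4), targets.split(4)):  # batches of 4, 4, 2
    correct += (lb.argmax(dim=1) == tb).sum().item()
    seen += tb.numel()
    batch_accs.append(accuracy(lb, tb))

weighted = correct / seen                       # matches `full`
mean_of_means = sum(batch_accs) / len(batch_accs)  # may differ slightly
print(full, weighted, mean_of_means)
```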

---

### Initialization & choices
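- ReLU networks benefit from **He/Kaiming** initialization. We use:
  - `kaiming_normal_` for `fc1.weight` (`nonlinearity="relu"`) and for `fc2.weight` (`nonlinearity="linear"`, since its outputs are raw logits rather than ReLU inputs)  
  - zeros for all biases  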
- `CrossEntropyLoss` expects:
  - `logits` of shape `(N, d_out)`  
  - integer labels of shape `(N,)` in `[0, d_out)`  
- Optimizer: `Adam(lr=1e-3)` (tweak `lr`, `weight_decay` as needed)
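
To see what the two `nonlinearity` settings do, the sketch below initializes one `nn.Linear` weight both ways and compares the empirical standard deviation with $\sqrt{2/n_{\text{in}}}$ (gain $\sqrt{2}$ for `"relu"`) and $\sqrt{1/n_{\text{in}}}$ (gain 1 for `"linear"`). The layer sizes are arbitrary and chosen only so the estimate is stable.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 256, 512
layer = nn.Linear(fan_in, fan_out)  # weight has shape (fan_out, fan_in)

nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")    # gain = sqrt(2)
std_relu = layer.weight.std().item()

nn.init.kaiming_normal_(layer.weight, nonlinearity="linear")  # gain = 1
std_linear = layer.weight.std().item()

print(f"relu:   {std_relu:.4f} vs sqrt(2/fan_in) = {math.sqrt(2 / fan_in):.4f}")
print(f"linear: {std_linear:.4f} vs sqrt(1/fan_in) = {math.sqrt(1 / fan_in):.4f}")
```

Both calls use the default `mode="fan_in"`, so the scale depends on the number of input features of the layer.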

---

### Hyperparameters to tweak quickly
- `hidden` (width of the hidden layer)  
- `epochs`, `batch_size`, learning rate `lr`  
- `weight_decay` (L2 regularization)  
- Replace the synthetic dataset with your real `TensorDataset`/`DataLoader`

---

### Sanity checks
- Shapes: `Z1, A1 ∈ ℝ^{N×h}`, `logits ∈ ℝ^{N×d_out}`  
- `A1` should be **non-negative** (ReLU)  
- If you compute `softmax(logits, dim=1)`, each row should sum to ~1
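
These checks translate directly into assertions. The sketch below rebuilds the two layers with `nn.Linear` at the sizes used earlier in the notebook and asserts each property. The default `nn.Linear` initialization is used here for brevity, which does not affect shapes or the sign of ReLU outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, d_in, h, d_out = 8, 16, 32, 10
fc1, fc2 = nn.Linear(d_in, h), nn.Linear(h, d_out)

with torch.no_grad():
    X = torch.randn(N, d_in)
    Z1 = fc1(X)
    A1 = torch.relu(Z1)
    logits = fc2(A1)
    probs = torch.softmax(logits, dim=1)

assert Z1.shape == (N, h) and A1.shape == (N, h)   # hidden shapes
assert logits.shape == (N, d_out)                  # logit shape
assert (A1 >= 0).all()                             # ReLU output is non-negative
assert torch.allclose(probs.sum(dim=1), torch.ones(N), atol=1e-6)  # rows sum to ~1
print("all sanity checks passed")
```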

---

### Extensions (next days)
- Add `Dropout` or `BatchNorm1d` between layers  
- Try a deeper MLP or different activations (GELU, LeakyReLU)  
- Add learning-rate schedules or gradient clipping
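
As a rough sketch of how these extensions fit together, the block below adds `BatchNorm1d`, `LeakyReLU`, and `Dropout` to the network and runs a single training step with gradient clipping and a step learning-rate schedule. The specific values (`p=0.1`, `max_norm=1.0`, `step_size=10`, `gamma=0.5`) and the random batch are illustrative assumptions, not tuned recommendations.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
d_in, h, d_out = 16, 64, 5

model = nn.Sequential(
    nn.Linear(d_in, h),
    nn.BatchNorm1d(h),      # normalizes hidden features over the batch
    nn.LeakyReLU(0.01),     # alternative activation
    nn.Dropout(p=0.1),      # active only in model.train() mode
    nn.Linear(h, d_out),    # logits
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

xb = torch.randn(32, d_in)               # made-up batch
yb = torch.randint(0, d_out, (32,))

model.train()
optimizer.zero_grad()
loss = criterion(model(xb), yb)
loss.backward()
# Rescale gradients so their global L2 norm is at most max_norm
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()  # in a full loop this is usually called once per epoch
print(f"loss {loss.item():.4f}, lr {scheduler.get_last_lr()[0]:.1e}")
```

Remember to call `model.eval()` before validation so that dropout is disabled and batch norm uses its running statistics.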


In [4]:
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class MLP(nn.Module):
    def __init__(self, d_in: int, h: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, h)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(h, d_out)
        self._init_weights()

    def _init_weights(self):
        # He/Kaiming init: gain sqrt(2) for fc1 (feeds ReLU), gain 1 for fc2 (outputs logits)
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
        nn.init.zeros_(self.fc1.bias)
        nn.init.kaiming_normal_(self.fc2.weight, nonlinearity="linear")
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)   # logits
        return x

In [5]:
N = 3000        # total samples
d_in = 16       # input features
h_true = 32     # hidden (teacher)
num_classes = 5

X = torch.randn(N, d_in)

with torch.no_grad():
    # teacher network (fixed random weights)
    W1_t = torch.randn(d_in, h_true) * (2.0/d_in) ** 0.5
    b1_t = torch.zeros(h_true)
    W2_t = torch.randn(h_true, num_classes) * (2.0/h_true) ** 0.5
    b2_t = torch.zeros(num_classes)

    Z1_t = X @ W1_t + b1_t
    A1_t = torch.clamp(Z1_t, min=0.0)
    logits_t = A1_t @ W2_t + b2_t
    y = torch.argmax(logits_t, dim=1)   # integer labels in [0, num_classes)

# Train/val split
train_ratio = 0.8
n_train = int(N * train_ratio)
n_val = N - n_train
dataset = TensorDataset(X, y)
train_ds, val_ds = random_split(dataset, [n_train, n_val], generator=torch.Generator().manual_seed(0))

batch_size = 128
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=batch_size, shuffle=False)

In [6]:
# Instantiate model, loss, optimizer
hidden = 64
model = MLP(d_in=d_in, h=hidden, d_out=num_classes).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)

In [7]:
# Training loop
def accuracy(logits, targets):
    preds = torch.argmax(logits, dim=1)
    return (preds == targets).float().mean().item()

epochs = 25
for epoch in range(1, epochs + 1):
    # ---- train ----
    model.train()
    running_loss, running_acc, n_batches = 0.0, 0.0, 0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)

        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        running_acc  += accuracy(logits.detach(), yb)
        n_batches    += 1

    # Mean of per-batch means; a smaller final batch is slightly over-weighted.
    train_loss = running_loss / n_batches
    train_acc  = running_acc  / n_batches

    # ---- validate ----
    model.eval()
    with torch.no_grad():
        val_loss, val_acc, n_batches = 0.0, 0.0, 0
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            loss = criterion(logits, yb)
            val_loss += loss.item()
            val_acc  += accuracy(logits, yb)
            n_batches += 1

    val_loss /= n_batches
    val_acc  /= n_batches

    print(f"Epoch {epoch:02d} | "
          f"train loss {train_loss:.4f} acc {train_acc:.3f} | "
          f"val loss {val_loss:.4f} acc {val_acc:.3f}")


Epoch 01 | train loss 1.6169 acc 0.327 | val loss 1.3342 acc 0.463
Epoch 02 | train loss 1.2023 acc 0.536 | val loss 1.1265 acc 0.579
Epoch 03 | train loss 1.0246 acc 0.624 | val loss 1.0110 acc 0.637
Epoch 04 | train loss 0.9139 acc 0.676 | val loss 0.9250 acc 0.683
Epoch 05 | train loss 0.8335 acc 0.708 | val loss 0.8629 acc 0.706
Epoch 06 | train loss 0.7758 acc 0.725 | val loss 0.8144 acc 0.726
Epoch 07 | train loss 0.7297 acc 0.741 | val loss 0.7771 acc 0.728
Epoch 08 | train loss 0.6918 acc 0.756 | val loss 0.7442 acc 0.735
Epoch 09 | train loss 0.6611 acc 0.765 | val loss 0.7171 acc 0.748
Epoch 10 | train loss 0.6361 acc 0.777 | val loss 0.6940 acc 0.755
Epoch 11 | train loss 0.6145 acc 0.784 | val loss 0.6762 acc 0.763
Epoch 12 | train loss 0.5927 acc 0.791 | val loss 0.6578 acc 0.774
Epoch 13 | train loss 0.5755 acc 0.794 | val loss 0.6432 acc 0.773
Epoch 14 | train loss 0.5605 acc 0.799 | val loss 0.6293 acc 0.781
Epoch 15 | train loss 0.5471 acc 0.801 | val loss 0.6165 acc 0