
# Week 5 — Neural Networks (All-in-One Notebook)

**Dataset:** MNIST (handwritten digits 28×28)

This single notebook consolidates the full week's plan (Days 29–35):
- **Day 29:** 2-Layer NN from scratch (NumPy) – forward/backprop + gradient check  
- **Day 30:** Activations + Testing (ReLU/Sigmoid), debugging, simple visualization  
- **Day 31:** PyTorch MLP – build & quick train  
- **Day 32:** Train on MNIST – baseline results  
- **Day 33:** Optimizers – compare SGD vs Adam (loss curves)  
- **Day 34:** TensorBoard – log and visualize training  
- **Day 35:** Weekly Summary – cleanup & README

> Tip: Use Runtime → Restart & Run All to confirm that everything works from a cold start.



## 0) Environment & Requirements (Optional)
If your environment does not already have the required libraries, install them first (uncomment to run):

```bash
# !pip install numpy matplotlib torch torchvision tensorboard scikit-learn
```



## 1) Setup & Utilities
Utilities shared across the week: seeding, device selection, plotting helpers.
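
A quick way to confirm that seeding behaves as intended is to reseed and compare draws. The sketch below is illustrative only: it covers Python's and NumPy's global generators (not PyTorch), and `draw_after_seed` is a hypothetical helper that is not used elsewhere in the notebook.

```python
import random
import numpy as np

def draw_after_seed(seed):
    """Seed Python's and NumPy's global RNGs, then draw a few values (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3)

py_a, np_a = draw_after_seed(42)
py_b, np_b = draw_after_seed(42)
py_c, np_c = draw_after_seed(7)

# Same seed should reproduce the draws; a different seed should not
same_seed_matches = (py_a == py_b) and np.array_equal(np_a, np_b)
different_seed_differs = not np.array_equal(np_a, np_c)
print(same_seed_matches, different_seed_differs)
```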


In [None]:

import os, math, time, random
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE


In [None]:

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def plot_series(y_values, title="Metric over epochs", xlabel="Epoch", ylabel="Value"):
    plt.figure()
    plt.plot(y_values)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid(True)
    plt.show()

def accuracy(pred_logits, targets):
    preds = pred_logits.argmax(dim=1)
    return (preds == targets).float().mean().item()



---
## Day 29) 2-Layer Neural Network (NumPy): Forward, Backprop, Gradient Check
We'll implement a simple **2-layer MLP** (Input → Hidden → Output) with Softmax cross-entropy loss.
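
For reference, the loss and the gradient that backprop starts from can be written as below. The symbols are introduced here for illustration: $s_{ij}$ is the score of class $j$ for example $i$, $p_{ij}$ the softmax probability, $y_i$ the true label, $N$ the batch size, and $C$ the number of classes.

$$
p_{ij} = \frac{e^{s_{ij}}}{\sum_{k=1}^{C} e^{s_{ik}}}, \qquad
L = -\frac{1}{N}\sum_{i=1}^{N} \log p_{i,y_i}, \qquad
\frac{\partial L}{\partial s_{ij}} = \frac{1}{N}\left(p_{ij} - \mathbb{1}[j = y_i]\right)
$$

The last expression is what the `backward` method computes as `dscores`: copy the probabilities, subtract 1 at the true class, and divide by $N$.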

**Plan**
1. Build synthetic toy data (small, 2D) for easy debugging and visualization.
2. Implement forward pass.
3. Implement backward pass (manual gradients).
4. Add **gradient checking** using finite differences on a tiny batch to validate backprop.
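
The idea behind step 4 can be sketched on a function whose gradient is known in closed form. This is a minimal illustration, assuming a scalar-valued function of a NumPy array; `numerical_grad` is a hypothetical helper and is separate from the `grad_check` function used for the MLP.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for a scalar-valued f (hypothetical helper)."""
    grad = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_pos = f(x)
        x[idx] = old - eps
        f_neg = f(x)
        x[idx] = old  # restore the original value
        grad[idx] = (f_pos - f_neg) / (2 * eps)
    return grad

# Toy function with a known gradient: f(x) = sum(x**3), so df/dx = 3 * x**2
x = np.array([0.5, -1.0, 2.0])
analytic = 3 * x**2
numeric = numerical_grad(lambda v: np.sum(v**3), x)

# Relative error between the two estimates; a small value suggests they agree
rel_err = np.abs(numeric - analytic).max() / (np.abs(numeric) + np.abs(analytic)).max()
print(rel_err)
```

If the analytic gradient were wrong (say, `2 * x**2`), the relative error would be large instead of tiny, which is the signal a gradient check looks for.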


In [None]:

# Synthetic toy data (2D) — two Gaussian clusters
def make_toy_data(n_per_class=100):
    mean0, mean1 = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
    cov = np.array([[0.3, 0.0],[0.0, 0.3]])
    X0 = np.random.multivariate_normal(mean0, cov, size=n_per_class)
    X1 = np.random.multivariate_normal(mean1, cov, size=n_per_class)
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)])
    return X, y

X_toy, y_toy = make_toy_data(80)
X_toy.shape, y_toy.shape, np.bincount(y_toy)


In [None]:

# 2-layer MLP implemented in NumPy
class TwoLayerMLP:
    def __init__(self, input_dim, hidden_dim, output_dim, activation="relu", weight_scale=0.01, seed=42):
        rng = np.random.default_rng(seed)
        self.W1 = weight_scale * rng.standard_normal((input_dim, hidden_dim))
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = weight_scale * rng.standard_normal((hidden_dim, output_dim))
        self.b2 = np.zeros((1, output_dim))
        self.activation = activation

    def _act(self, z):
        if self.activation == "relu":
            return np.maximum(0, z)
        elif self.activation == "sigmoid":
            return 1.0 / (1.0 + np.exp(-z))
        else:
            raise ValueError("Unknown activation")  

    def _act_grad(self, z):
        if self.activation == "relu":
            return (z > 0).astype(z.dtype)
        elif self.activation == "sigmoid":
            s = 1.0 / (1.0 + np.exp(-z))
            return s * (1 - s)
        else:
            raise ValueError("Unknown activation")

    def forward(self, X):
        z1 = X @ self.W1 + self.b1      # (N, H)
        a1 = self._act(z1)              # (N, H)
        scores = a1 @ self.W2 + self.b2 # (N, C)
        cache = (X, z1, a1, scores)
        return scores, cache

    @staticmethod
    def softmax(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

    @staticmethod
    def cross_entropy(probs, y):
        N = y.shape[0]
        return -np.log(probs[np.arange(N), y] + 1e-12).mean()

    def loss(self, X, y):
        scores, cache = self.forward(X)
        probs = self.softmax(scores)
        L = self.cross_entropy(probs, y)
        return L, probs, cache

    def backward(self, probs, cache, y):
        X, z1, a1, scores = cache
        N = X.shape[0]

        # Gradient wrt scores
        dscores = probs.copy()
        dscores[np.arange(N), y] -= 1.0
        dscores /= N

        # Backprop into W2, b2, a1
        dW2 = a1.T @ dscores              # (H, C)
        db2 = dscores.sum(axis=0, keepdims=True)  # (1, C)
        da1 = dscores @ self.W2.T         # (N, H)

        # Backprop through activation
        dz1 = da1 * self._act_grad(z1)    # (N, H)

        # Backprop into W1, b1
        dW1 = X.T @ dz1                   # (D, H)
        db1 = dz1.sum(axis=0, keepdims=True)  # (1, H)
        return dW1, db1, dW2, db2


In [None]:

# Gradient check (finite differences) for a tiny batch
def grad_check(model, X, y, eps=1e-5):
    # Forward & backprop to get analytic grads
    L, probs, cache = model.loss(X, y)
    dW1, db1, dW2, db2 = model.backward(probs, cache, y)

    def rel_error(a, b):
        return np.abs(a - b).max() / (np.maximum(1e-8, np.abs(a) + np.abs(b))).max()

    # Check a few random elements per parameter
    rng = np.random.default_rng(0)
    checks = []

    # Helper to compute loss with current params
    def loss_with_params():
        L, _, _ = model.loss(X, y)
        return L

    for name in ["W1", "b1", "W2", "b2"]:
        param = getattr(model, name)
        grad = {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}[name]
        flat_idx = rng.choice(param.size, size=min(10, param.size), replace=False)
        idxs = np.array(np.unravel_index(flat_idx, param.shape)).T

        # Every parameter (biases included) is stored as a 2-D array, so (i, j) indexing is always valid
        for i, j in idxs:
            old_val = param[i, j]
            param[i, j] = old_val + eps   # positive perturbation
            L_pos = loss_with_params()
            param[i, j] = old_val - eps   # negative perturbation
            L_neg = loss_with_params()
            param[i, j] = old_val         # restore the original value
            num_grad = (L_pos - L_neg) / (2 * eps)
            ana_grad = grad[i, j]
            checks.append(rel_error(num_grad, ana_grad))
    return max(checks), np.mean(checks)

# run check on a tiny subset
X_small, y_small = X_toy[:5], y_toy[:5]
model_np = TwoLayerMLP(input_dim=2, hidden_dim=5, output_dim=2, activation="relu", weight_scale=1e-2)
max_err, mean_err = grad_check(model_np, X_small, y_small)
print({"max_rel_error": max_err, "mean_rel_error": mean_err})


In [None]:

# Train the NumPy model on toy data (for sanity)
def train_numpy(model, X, y, lr=1.0, epochs=200):
    losses = []
    for epoch in range(epochs):
        L, probs, cache = model.loss(X, y)
        dW1, db1, dW2, db2 = model.backward(probs, cache, y)
        # SGD step
        model.W1 -= lr * dW1
        model.b1 -= lr * db1
        model.W2 -= lr * dW2
        model.b2 -= lr * db2
        losses.append(L)
        # max(1, ...) avoids a modulo-by-zero error when epochs < 4
        if (epoch+1) % max(1, epochs//4) == 0:
            print(f"Epoch {epoch+1}/{epochs} | Loss: {L:.4f}")
    return losses

model_np = TwoLayerMLP(input_dim=2, hidden_dim=8, output_dim=2, activation="relu", weight_scale=1e-1)
loss_curve = train_numpy(model_np, X_toy, y_toy, lr=0.5, epochs=200)

# Plot the training loss curve
plot_series(loss_curve, title="NumPy 2-Layer MLP - Training Loss (Toy Data)", xlabel="Epoch", ylabel="Loss")



---
## Day 30) Activations + Testing
Try **ReLU** and **Sigmoid**, compare training loss on the toy dataset. This helps confirm your backprop math.
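
One property worth keeping in mind when reading the two curves is how each activation scales the gradient flowing backward. The sketch below evaluates both derivatives at a few sample pre-activations (the values in `z` are arbitrary, chosen for illustration), using the same formulas as `TwoLayerMLP._act_grad`.

```python
import numpy as np

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])  # arbitrary sample pre-activations

# ReLU derivative: 0 for z <= 0, 1 for z > 0
relu_grad = (z > 0).astype(float)

# Sigmoid derivative: s * (1 - s), which peaks at 0.25 when z = 0
s = 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = s * (1.0 - s)

print("relu'   :", relu_grad)
print("sigmoid':", np.round(sigmoid_grad, 4))
```

The sigmoid derivative never exceeds 0.25 and shrinks toward zero for large |z|, while ReLU passes the gradient through unchanged for positive inputs. This may partly explain differences in how quickly the two losses fall, although the toy problem is simple enough that both should train.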


In [None]:

loss_curves = {}
for act in ["relu", "sigmoid"]:
    m = TwoLayerMLP(2, 16, 2, activation=act, weight_scale=0.1, seed=SEED)
    lc = train_numpy(m, X_toy, y_toy, lr=0.3, epochs=150)
    loss_curves[act] = lc

# Plot each activation's loss curve in its own figure
plot_series(loss_curves["relu"], title="Activation: ReLU (Toy Data)", xlabel="Epoch", ylabel="Loss")
plot_series(loss_curves["sigmoid"], title="Activation: Sigmoid (Toy Data)", xlabel="Epoch", ylabel="Loss")



---
## Day 31–32) PyTorch MLP for MNIST
We now switch to **MNIST** and build a simple feedforward network in PyTorch, then train & evaluate.
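
The data cell that follows normalizes each image with a mean of 0.1307 and a standard deviation of 0.3081. As a rough illustration of what that does to individual pixel values, the sketch below applies the same arithmetic with NumPy; the three sample intensities are arbitrary.

```python
import numpy as np

MEAN, STD = 0.1307, 0.3081  # the values passed to transforms.Normalize in this notebook

pixels = np.array([0.0, 0.5, 1.0])   # example intensities after scaling to [0, 1]
normalized = (pixels - MEAN) / STD   # per-pixel operation: subtract mean, divide by std

print(np.round(normalized, 4))
```

Background pixels (0.0) end up slightly negative and fully lit pixels (1.0) end up near 2.8, so the inputs are roughly centered around zero.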


In [None]:

transform = transforms.Compose([
    transforms.ToTensor(),           # [0, 1]
    transforms.Normalize((0.1307,), (0.3081,))  # standard MNIST stats
])

root = "./data"
train_dataset = datasets.MNIST(root=root, train=True, download=True, transform=transform)
test_dataset  = datasets.MNIST(root=root, train=False, download=True, transform=transform)

len(train_dataset), len(test_dataset)


In [None]:

# Train/val split from training set
val_ratio = 0.1
n_val = int(len(train_dataset) * val_ratio)
n_train = len(train_dataset) - n_val
train_ds, val_ds = random_split(train_dataset, [n_train, n_val], generator=torch.Generator().manual_seed(SEED))

BATCH_SIZE = 128
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False)
test_loader  = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

n_train, n_val, len(test_loader.dataset)


In [None]:

class MNISTMLP(nn.Module):
    def __init__(self, in_dim=28*28, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes)
        )
    def forward(self, x):
        return self.net(x)

model = MNISTMLP().to(DEVICE)
sum(p.numel() for p in model.parameters())
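
The parameter count printed above can be checked by hand. The sketch below assumes the default sizes of `MNISTMLP` (784 inputs, two hidden layers of 256, 10 outputs); `linear_params` is a hypothetical helper introduced only for this arithmetic.

```python
def linear_params(n_in, n_out):
    """Parameters in a fully connected layer: a weight matrix plus one bias per output."""
    return n_in * n_out + n_out

# Layer sizes taken from the MNISTMLP defaults: 784 -> 256 -> 256 -> 10
layer_sizes = [28 * 28, 256, 256, 10]
per_layer = [linear_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)  # [200960, 65792, 2570] 269322
```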


In [None]:

def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss, total_correct, total_count = 0.0, 0, 0
    for xb, yb in loader:
        xb, yb = xb.to(DEVICE), yb.to(DEVICE)
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * xb.size(0)
        total_correct += (logits.argmax(1) == yb).sum().item()
        total_count += xb.size(0)
    return total_loss / total_count, total_correct / total_count

@torch.no_grad()
def evaluate(model, loader, criterion):
    model.eval()
    total_loss, total_correct, total_count = 0.0, 0, 0
    for xb, yb in loader:
        xb, yb = xb.to(DEVICE), yb.to(DEVICE)
        logits = model(xb)
        loss = criterion(logits, yb)
        total_loss += loss.item() * xb.size(0)
        total_correct += (logits.argmax(1) == yb).sum().item()
        total_count += xb.size(0)
    return total_loss / total_count, total_correct / total_count


In [None]:

# Baseline training
EPOCHS = 5
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

train_losses, val_losses, val_accs = [], [], []
for epoch in range(1, EPOCHS+1):
    tr_loss, tr_acc = train_one_epoch(model, train_loader, optimizer, criterion)
    va_loss, va_acc = evaluate(model, val_loader, criterion)
    train_losses.append(tr_loss)
    val_losses.append(va_loss)
    val_accs.append(va_acc)
    print(f"Epoch {epoch}/{EPOCHS} | Train Loss {tr_loss:.4f} | Val Loss {va_loss:.4f} | Val Acc {va_acc:.4f}")

plot_series(train_losses, title="MNIST MLP — Training Loss", xlabel="Epoch", ylabel="Loss")
plot_series(val_losses, title="MNIST MLP — Validation Loss", xlabel="Epoch", ylabel="Loss")
plot_series(val_accs, title="MNIST MLP — Validation Accuracy", xlabel="Epoch", ylabel="Accuracy")


In [None]:

# Test set evaluation
test_loss, test_acc = evaluate(model, test_loader, criterion)
print({"test_loss": round(test_loss, 4), "test_acc": round(test_acc, 4)})


In [None]:

# Save model
save_path = Path("./mnist_mlp.pth")
torch.save(model.state_dict(), save_path)
save_path



---
## Day 33) Optimizers — SGD vs Adam
We train **two identical models** with different optimizers and plot the loss/accuracy curves for comparison.
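
To make the difference between the two optimizers concrete, here is a minimal NumPy sketch of their update rules on a toy 1-D objective. It is an illustration, not PyTorch's implementation: the helper names `sgd_step` and `adam_step` are hypothetical, the objective and learning rate are arbitrary, and the Adam constants are set to the defaults of `torch.optim.Adam` (`betas=(0.9, 0.999)`, `eps=1e-8`).

```python
import numpy as np

def sgd_step(w, grad, lr):
    """Plain SGD: move against the gradient."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # running mean of squared gradients
    m_hat = m / (1 - beta1**t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective f(w) = 0.5 * w**2, whose gradient is simply w
w_sgd = 5.0
w_adam, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    w_sgd = sgd_step(w_sgd, grad=w_sgd, lr=0.1)
    w_adam, m, v = adam_step(w_adam, grad=w_adam, m=m, v=v, t=t, lr=0.1)

print(f"SGD:  w = {w_sgd:.6f}")
print(f"Adam: w = {w_adam:.6f}")
```

SGD's step is proportional to the raw gradient, whereas Adam rescales each step by a running estimate of the gradient's magnitude. That rescaling is one reason the two optimizers are usually run with different learning rates, as in the comparison in this section.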


In [None]:

def train_model_with_optimizer(optimizer_name="SGD", epochs=5, lr=0.001):
    set_seed(SEED)  # reseed so both optimizers start from the same initial weights
    m = MNISTMLP().to(DEVICE)
    criterion = nn.CrossEntropyLoss()
    if optimizer_name.lower() == "sgd":
        opt = torch.optim.SGD(m.parameters(), lr=lr)
    elif optimizer_name.lower() == "adam":
        opt = torch.optim.Adam(m.parameters(), lr=lr)
    else:
        raise ValueError("Unsupported optimizer")

    tr_losses, va_losses, va_accs = [], [], []
    for ep in range(1, epochs+1):
        tr_loss, _ = train_one_epoch(m, train_loader, opt, criterion)
        va_loss, va_acc = evaluate(m, val_loader, criterion)
        tr_losses.append(tr_loss); va_losses.append(va_loss); va_accs.append(va_acc)
    return tr_losses, va_losses, va_accs

sgd_tr, sgd_va, sgd_acc = train_model_with_optimizer("SGD", epochs=5, lr=0.1)
adam_tr, adam_va, adam_acc = train_model_with_optimizer("Adam", epochs=5, lr=0.001)

# Plot each optimizer's metrics in separate figures
plot_series(sgd_va, title="Validation Loss — SGD", xlabel="Epoch", ylabel="Loss")
plot_series(adam_va, title="Validation Loss — Adam", xlabel="Epoch", ylabel="Loss")
plot_series(sgd_acc, title="Validation Accuracy — SGD", xlabel="Epoch", ylabel="Accuracy")
plot_series(adam_acc, title="Validation Accuracy — Adam", xlabel="Epoch", ylabel="Accuracy")



---
## Day 34) TensorBoard Logging
We'll log training/validation metrics to **TensorBoard** using `SummaryWriter`.

**How to launch locally (in a terminal):**
```bash
tensorboard --logdir runs --port 6006
# then open http://localhost:6006 in your browser
```


In [None]:

from torch.utils.tensorboard import SummaryWriter

log_dir = "runs/mnist_mlp"
writer = SummaryWriter(log_dir=log_dir)

model_tb = MNISTMLP().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_tb.parameters(), lr=1e-3)

EPOCHS_TB = 3
global_step = 0
for epoch in range(1, EPOCHS_TB+1):
    model_tb.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(DEVICE), yb.to(DEVICE)
        optimizer.zero_grad()
        logits = model_tb(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()

        # Log training loss
        writer.add_scalar("Loss/train", loss.item(), global_step)
        global_step += 1

    # Log validation metrics per epoch
    val_loss, val_acc = evaluate(model_tb, val_loader, criterion)
    writer.add_scalar("Loss/val", val_loss, epoch)
    writer.add_scalar("Accuracy/val", val_acc, epoch)

# Optionally, add the computational graph
example_input = torch.randn(1, 1, 28, 28).to(DEVICE)
writer.add_graph(model_tb, example_input)
writer.flush()
writer.close()

print(f"TensorBoard logs written to: {log_dir}")



---
## Day 35) Weekly Summary & README

Use this section to summarize your results and lessons. Fill in the placeholders after running the experiments.



### Results Snapshot (Fill me in)
- **MNIST Test Accuracy (baseline MLP):** `__%`
- **SGD vs Adam (val acc after 5 epochs):** `SGD: __%`, `Adam: __%`
- **Key learnings:**
  - Backprop math validated with gradient check (max relative error ~1e-__)
  - ReLU vs Sigmoid behavior on toy data
  - Optimizer impact on convergence speed and stability
  - TensorBoard helped inspect loss & accuracy

### Next ideas
- Try BatchNorm / Dropout
- CNN (LeNet-like) on MNIST for higher accuracy
- Learning rate schedules / OneCycle
- Data augmentation


In [None]:

print("Notebook generated on: 2025-11-06T17:04:25.621227Z")
