# Day 11: Activations & Gradients — The Fragility of Deep Nets

**Building LLMs from Scratch** — Following Andrej Karpathy's makemore lectures.

---

## 1. Introduction

Why are deep networks fragile? As we stack more layers, three problems emerge:

- **Vanishing gradients**: Gradients shrink exponentially as they flow backward through many layers. Early layers receive almost no signal and barely update.
- **Exploding gradients**: Sometimes gradients grow exponentially instead, causing numerical instability and NaNs.
- **Dead neurons**: With activations like tanh, neurons can "saturate" (outputs near ±1). In the saturated region, the gradient is ≈ 0, so the neuron stops learning.
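
To make the saturation bullet concrete, here is a minimal sketch in plain Python (standard library only, so it is separate from the notebook's PyTorch code). The helper names `tanh_grad` and `chained_grad` are illustrative, and `chained_grad` deliberately ignores the weight matrices, so it shows only the contribution of the tanh factors.

```python
import math

def tanh_grad(x: float) -> float:
    """Local derivative of tanh at pre-activation x: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

def chained_grad(x: float, depth: int) -> float:
    """Product of `depth` identical tanh derivatives (weights ignored)."""
    return tanh_grad(x) ** depth

# Walk from the linear region (x = 0) out into saturation.
for x in [0.0, 1.0, 2.0, 3.0, 5.0]:
    print(f"x={x:3.1f}  tanh={math.tanh(x):.4f}  local grad={tanh_grad(x):.6f}")

# Three layers that all sit at x = 2 already shrink the signal by a large factor.
print(f"3 layers at x=2.0: {chained_grad(2.0, 3):.2e}")
```

The local gradient is exactly 1 at `x = 0` and falls below 0.01 by `x = 3`, which is why a unit that sits near ±1 contributes almost nothing to the backward pass.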

Today we build a deeper MLP, inspect activation statistics, and fix the problem with **proper weight initialization** (Kaiming).

## 2. Dataset & Model Setup

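Same names dataset as Day 6; the cell below hardcodes 20 names. On top of it we build a deeper MLP: an embedding table `C` of shape (27, 10), three hidden layers of 100 units with tanh activations, and an output layer over 27 classes. The 20 names contain only 20 distinct letters, so `stoi` ends up with 21 entries; the model is still sized for the full 27-symbol vocabulary, which is harmless because every index stays below 27. All parameters use `requires_grad=True`.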
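
As a quick arithmetic check on the architecture described above, the following sketch counts its trainable parameters. It uses only the shapes stated in the text; the variable and function names are illustrative and not part of the notebook.

```python
# Sizes taken from the description above.
vocab_size, emb_dim, hidden, n_hidden_layers = 27, 10, 100, 3

def linear_params(fan_in: int, fan_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return fan_in * fan_out + fan_out

embedding = vocab_size * emb_dim                                       # C
first_hidden = linear_params(emb_dim, hidden)                          # W1, b1
other_hidden = (n_hidden_layers - 1) * linear_params(hidden, hidden)   # W2, b2, W3, b3
output = linear_params(hidden, vocab_size)                             # W4, b4

total = embedding + first_hidden + other_hidden + output
print(f"total trainable parameters: {total}")
```

Most of the parameters sit in the two 100x100 hidden layers, which are also the layers whose `fan_in` of 100 makes unscaled initialization most harmful.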

In [None]:
import torch
import torch.nn.functional as F

words = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', 'charlotte', 'amelia', 'harper', 'evelyn',
         'abigail', 'emily', 'ella', 'elizabeth', 'camila', 'luna', 'sofia', 'avery', 'mila', 'aria']

chars = sorted(list(set(''.join(words))))
stoi = {'.': 0, **{c: i + 1 for i, c in enumerate(chars)}}
itos = {i: c for c, i in stoi.items()}

# Build bigram pairs
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs[:-1], chs[1:]):
        xs.append(stoi[c1])
        ys.append(stoi[c2])

xs = torch.tensor(xs)
ys = torch.tensor(ys)

print(f"Dataset: {len(words)} names, {len(xs)} bigram pairs")
print(f"Vocabulary: {len(stoi)} chars")

In [None]:
def build_mlp(use_kaiming=False):
    """Build MLP: embed(27,10) -> 3x hidden(100, tanh) -> out(27)."""
    C = torch.randn(27, 10, requires_grad=True)
    W1 = torch.randn(10, 100, requires_grad=True)
    b1 = torch.randn(100, requires_grad=True)
    W2 = torch.randn(100, 100, requires_grad=True)
    b2 = torch.randn(100, requires_grad=True)
    W3 = torch.randn(100, 100, requires_grad=True)
    b3 = torch.randn(100, requires_grad=True)
    W4 = torch.randn(100, 27, requires_grad=True)
    b4 = torch.randn(27, requires_grad=True)

    if use_kaiming:
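        # Kaiming-style scaling: W *= sqrt(2 / fan_in). The sqrt(2) gain is
        # derived for ReLU; for tanh the usual recommendation is 5/3.
        # C is a lookup table, not a matmul, so using 27 as its fan_in is
        # just a convenient way to shrink the embeddings.
        C.data *= (2.0 / 27) ** 0.5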
        W1.data *= (2.0 / 10) ** 0.5
        W2.data *= (2.0 / 100) ** 0.5
        W3.data *= (2.0 / 100) ** 0.5
        W4.data *= (2.0 / 100) ** 0.5
        # Biases: keep small (often 0)
        b1.data.zero_()
        b2.data.zero_()
        b3.data.zero_()
        b4.data.zero_()

    return {'C': C, 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2, 'W3': W3, 'b3': b3, 'W4': W4, 'b4': b4}


def forward(params, xs, return_activations=False):
    """Forward pass. Returns logits and, optionally, the tanh outputs of each hidden layer."""
    emb = params['C'][xs]  # (N, 10)
    h1_pre = emb @ params['W1'] + params['b1']
    h1 = torch.tanh(h1_pre)
    h2_pre = h1 @ params['W2'] + params['b2']
    h2 = torch.tanh(h2_pre)
    h3_pre = h2 @ params['W3'] + params['b3']
    h3 = torch.tanh(h3_pre)
    logits = h3 @ params['W4'] + params['b4']

    if return_activations:
        return logits, [h1, h2, h3]
    return logits

## 3. The Saturation Problem

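Run a forward pass with **random init** (unscaled `randn` weights and biases). For each hidden layer, print the mean, the std, and the fraction of activations above 0.99 or below -0.99, which we treat as saturated tanh units. With this init the pre-activations have a standard deviation far above 1, so a large share of activations land in that saturated band.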

In [None]:
torch.manual_seed(42)
params_bad = build_mlp(use_kaiming=False)
logits, activations = forward(params_bad, xs, return_activations=True)

print("Activation statistics (random init — BAD):")
print("-" * 55)
for i, h in enumerate(activations):
    mean = h.mean().item()
    std = h.std().item()
    saturated = ((h > 0.99) | (h < -0.99)).float().mean().item()
    print(f"Layer {i+1}: mean={mean:+.4f}, std={std:.4f}, saturated (|x|>0.99): {saturated*100:.1f}%")

print("\n→ Most activations are saturated! tanh gradient ≈ 0 in saturated region.")
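
A rough back-of-the-envelope check explains why this happens. The sketch below assumes the inputs, weights, and biases are independent and zero-mean, and that the pre-activations are approximately Gaussian. Both are simplifications, so the numbers are estimates rather than a prediction of the printout above. `preact_std` and `saturated_fraction` are hypothetical helper names.

```python
import math

# |tanh(z)| > 0.99 exactly when |z| > atanh(0.99), about 2.65.
SAT_THRESHOLD = math.atanh(0.99)

def preact_std(fan_in, input_second_moment, weight_std, bias_std=0.0):
    """Std of z = sum_i x_i * w_i + b for independent, zero-mean w and b."""
    return math.sqrt(fan_in * input_second_moment * weight_std**2 + bias_std**2)

def saturated_fraction(std):
    """P(|z| > SAT_THRESHOLD) if z were exactly N(0, std^2)."""
    return math.erfc(SAT_THRESHOLD / (std * math.sqrt(2.0)))

# Layer 1 with unscaled randn: 10 unit-variance inputs, unit-variance weights and bias.
std_random = preact_std(fan_in=10, input_second_moment=1.0, weight_std=1.0, bias_std=1.0)

# Same layer with weights scaled by sqrt(2 / fan_in) and zero bias
# (inputs left at unit variance for a like-for-like comparison).
std_scaled = preact_std(fan_in=10, input_second_moment=1.0, weight_std=(2.0 / 10) ** 0.5)

for name, std in [("unscaled", std_random), ("scaled", std_scaled)]:
    print(f"{name:9s} pre-activation std={std:.2f}  est. saturated={saturated_fraction(std):.1%}")
```

The hidden layers with `fan_in = 100` are worse off than this first layer: their inputs are tanh outputs with second moment close to 1, so the unscaled pre-activation std is on the order of 10.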

## 4. Visualizing Activation Distributions

Plot a histogram of the activations in each hidden layer, first with the bad init and then with the Kaiming fix. The dotted orange lines mark the ±0.99 saturation threshold.

In [None]:
import matplotlib.pyplot as plt

def plot_activation_histograms(params, title, axs):
    _, activations = forward(params, xs, return_activations=True)
    for i, h in enumerate(activations):
        axs[i].hist(h.detach().flatten().numpy(), bins=50, alpha=0.8, edgecolor='black', linewidth=0.5)
        axs[i].set_title(f"Layer {i+1}")
        axs[i].set_xlabel("Activation value")
        axs[i].axvline(0, color='red', linestyle='--', alpha=0.7)
        axs[i].axvline(0.99, color='orange', linestyle=':', alpha=0.7)
        axs[i].axvline(-0.99, color='orange', linestyle=':', alpha=0.7)

fig, axs = plt.subplots(1, 3, figsize=(12, 4))
plot_activation_histograms(params_bad, "Random init (bad)", axs)
fig.suptitle("Activation distributions — Random init (saturated)", fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
torch.manual_seed(42)
params_kaiming = build_mlp(use_kaiming=True)

fig, axs = plt.subplots(1, 3, figsize=(12, 4))
plot_activation_histograms(params_kaiming, "Kaiming init (good)", axs)
fig.suptitle("Activation distributions — Kaiming init (well-distributed)", fontsize=12)
plt.tight_layout()
plt.show()

## 5. Kaiming Initialization

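Kaiming init rescales each `randn` weight matrix by `gain / sqrt(fan_in)`. The code above uses `W *= (2.0 / fan_in)**0.5`, i.e. a gain of `sqrt(2)`. Strictly, that gain is derived for ReLU; for tanh the commonly recommended gain is `5/3` (the value returned by `torch.nn.init.calculate_gain('tanh')`). Either way, the scaling keeps the variance of activations roughly constant across layers instead of letting it grow with `fan_in`. The statistics below check this: the std should stay in a moderate range at every layer, with almost no activations beyond the ±0.99 saturation threshold.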
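
The variance argument behind this scaling can be sketched as follows, under the simplifying assumptions that the inputs $x_i$ and weights $w_{ij}$ are independent, the weights are zero-mean with standard deviation $\sigma_w$, and the bias is zero. The symbols $n$ (for `fan_in`) and $g$ (for the gain) are notation introduced here.

$$
z_j = \sum_{i=1}^{n} x_i\, w_{ij}
\quad\Longrightarrow\quad
\operatorname{Var}(z_j) = n\,\sigma_w^2\,\mathbb{E}[x^2]
$$

$$
\sigma_w = \frac{g}{\sqrt{n}}
\quad\Longrightarrow\quad
\operatorname{Var}(z_j) = g^2\,\mathbb{E}[x^2]
$$

With unscaled `randn` weights ($\sigma_w = 1$) the pre-activation variance is $n$ times the second moment of the input, which is what pushes tanh into saturation. With $\sigma_w = g/\sqrt{n}$ it no longer depends on the layer width.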

In [None]:
print("Activation statistics (Kaiming init — GOOD):")
print("-" * 55)
_, activations = forward(params_kaiming, xs, return_activations=True)
for i, h in enumerate(activations):
    mean = h.mean().item()
    std = h.std().item()
    saturated = ((h > 0.99) | (h < -0.99)).float().mean().item()
    print(f"Layer {i+1}: mean={mean:+.4f}, std={std:.4f}, saturated (|x|>0.99): {saturated*100:.1f}%")

print("\n→ Little or no saturation: activations are well-distributed, so gradients can flow.")

## 6. Gradient Flow

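After a backward pass, plot the mean absolute gradient of each weight matrix (`W1` to `W4`) for both inits. Saturated tanh units pass almost no gradient, because the local derivative `1 - h**2` is close to zero there, so the bad init is expected to starve the early layers. Large unscaled weights can partly mask this in a per-layer mean, so compare the printed values as well as the bars.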

In [None]:
def get_grad_stats(params):
    """Run forward+backward, return mean |grad| per weight layer."""
    logits = forward(params, xs)
    loss = F.cross_entropy(logits, ys)
    for p in params.values():
        p.grad = None
    loss.backward()

    grad_stats = []
    for name in ['W1', 'W2', 'W3', 'W4']:
        g = params[name].grad
        grad_stats.append(g.abs().mean().item())
    return grad_stats


torch.manual_seed(42)
params_bad = build_mlp(use_kaiming=False)
grads_bad = get_grad_stats(params_bad)

torch.manual_seed(42)
params_kaiming = build_mlp(use_kaiming=True)
grads_kaiming = get_grad_stats(params_kaiming)

fig, ax = plt.subplots(figsize=(8, 4))
x = ['W1', 'W2', 'W3', 'W4']
ax.bar([i - 0.2 for i in range(4)], grads_bad, width=0.4, label='Random init', color='coral', alpha=0.8)
ax.bar([i + 0.2 for i in range(4)], grads_kaiming, width=0.4, label='Kaiming init', color='steelblue', alpha=0.8)
ax.set_xticks(range(4))
ax.set_xticklabels(x)
ax.set_ylabel("Mean |gradient|")
ax.set_title("Gradient magnitude per layer: random vs Kaiming init")
ax.legend()
ax.set_yscale('log')
plt.tight_layout()
plt.show()

print("Random init:", [f"{g:.2e}" for g in grads_bad])
print("Kaiming init:", [f"{g:.2e}" for g in grads_kaiming])
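
Vanishing gradients can also be reproduced in isolation. The sketch below is not the notebook's model: it assumes a bias-free stack of tanh layers, synthetic Gaussian inputs, and a loss that is simply the sum of the final activations, so that the result reflects the backward pass alone. `first_layer_grad` is a hypothetical helper name.

```python
import torch

def first_layer_grad(gain: float, depth: int = 6, width: int = 100,
                     batch: int = 256, seed: int = 0) -> float:
    """Mean |dL/dW| for the first layer of a bias-free tanh stack.

    Weights are randn * gain / sqrt(width); the loss is the sum of the
    final activations (an assumption made to keep the example small).
    """
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(batch, width, generator=g)
    weights = [
        (torch.randn(width, width, generator=g) * gain / width**0.5).requires_grad_()
        for _ in range(depth)
    ]
    h = x
    for W in weights:
        h = torch.tanh(h @ W)
    h.sum().backward()
    return weights[0].grad.abs().mean().item()

# A gain well below 1 shrinks the signal at every layer, forward and backward.
for gain in [0.1, 1.0, 5 / 3]:
    print(f"gain={gain:.2f}  mean |grad W1| = {first_layer_grad(gain):.2e}")
```

With a gain of 0.1 the first-layer gradient comes out orders of magnitude smaller than with a gain of 1, because each of the layers above it multiplies the backward signal by roughly the same small factor.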

## 7. Xavier vs Kaiming

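**Xavier (Glorot)**: scale `randn` weights by `1/sqrt(fan_in)`. This is the simplified fan-in form used in this notebook; the original Glorot formula uses `sqrt(2 / (fan_in + fan_out))`. It is derived for activations that behave roughly linearly around zero, which fits tanh and sigmoid as long as the units stay out of saturation.

**Kaiming (He)**: scale by `sqrt(2/fan_in)`. The factor 2 is derived for ReLU, which zeroes about half of its inputs and so halves the second moment of the signal. tanh does not zero anything, but it does squash its input, so a gain above 1 is still useful; the commonly recommended tanh gain is `5/3` rather than `sqrt(2)`. This notebook reuses the `sqrt(2)` gain for simplicity.

**When to use which:**
- **Xavier**: tanh, sigmoid, linear, where the activation is roughly linear near zero.
- **Kaiming**: ReLU, LeakyReLU, where about half of the activations are zeroed. The factor 2 compensates for that zeroed half.

For our tanh MLP, both keep activations out of saturation. The larger Kaiming gain gives somewhat larger activations, which counteracts tanh's squashing as depth grows.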

In [None]:
def build_mlp_xavier():
    C = torch.randn(27, 10, requires_grad=True)
    W1 = torch.randn(10, 100, requires_grad=True)
    b1 = torch.randn(100, requires_grad=True)
    W2 = torch.randn(100, 100, requires_grad=True)
    b2 = torch.randn(100, requires_grad=True)
    W3 = torch.randn(100, 100, requires_grad=True)
    b3 = torch.randn(100, requires_grad=True)
    W4 = torch.randn(100, 27, requires_grad=True)
    b4 = torch.randn(27, requires_grad=True)

    # Simplified Xavier: scale by 1/sqrt(fan_in).
    # (The original Glorot formula uses sqrt(2 / (fan_in + fan_out)).)
    C.data *= (1.0 / 27) ** 0.5
    W1.data *= (1.0 / 10) ** 0.5
    W2.data *= (1.0 / 100) ** 0.5
    W3.data *= (1.0 / 100) ** 0.5
    W4.data *= (1.0 / 100) ** 0.5
    b1.data.zero_()
    b2.data.zero_()
    b3.data.zero_()
    b4.data.zero_()

    return {'C': C, 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2, 'W3': W3, 'b3': b3, 'W4': W4, 'b4': b4}


torch.manual_seed(42)
params_xavier = build_mlp_xavier()
_, acts_xavier = forward(params_xavier, xs, return_activations=True)

torch.manual_seed(42)
params_k = build_mlp(use_kaiming=True)
_, acts_kaiming = forward(params_k, xs, return_activations=True)

print("Xavier vs Kaiming — activation std per layer:")
print("-" * 45)
for i in range(3):
    sx = acts_xavier[i].std().item()
    sk = acts_kaiming[i].std().item()
    print(f"Layer {i+1}: Xavier std={sx:.4f}, Kaiming std={sk:.4f}")

print("\nBoth keep activations out of saturation. With the same seed, the larger Kaiming gain should give the larger std.")
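
The effect of the gain becomes clearer on a deeper stack than the three-layer MLP above. The following sketch assumes a bias-free stack of five 100-unit tanh layers fed with synthetic Gaussian inputs, so it is separate from the notebook's model; `tanh_stack_stds` is a hypothetical helper name, while `torch.nn.init.calculate_gain` is PyTorch's own function.

```python
import torch

def tanh_stack_stds(gain: float, depth: int = 5, width: int = 100,
                    batch: int = 1000, seed: int = 0) -> list:
    """Std of the tanh outputs at each layer for weights randn * gain / sqrt(width)."""
    g = torch.Generator().manual_seed(seed)
    h = torch.randn(batch, width, generator=g)
    stds = []
    for _ in range(depth):
        W = torch.randn(width, width, generator=g) * gain / width**0.5
        h = torch.tanh(h @ W)
        stds.append(h.std().item())
    return stds

gains = {
    "1/sqrt(fan_in)": 1.0,                                        # simplified Xavier
    "sqrt(2/fan_in)": 2.0 ** 0.5,                                 # Kaiming gain for ReLU
    "(5/3)/sqrt(fan_in)": torch.nn.init.calculate_gain("tanh"),   # PyTorch's tanh gain
}
for name, gain in gains.items():
    print(f"{name:20s}", " ".join(f"{s:.3f}" for s in tanh_stack_stds(gain)))
```

With a gain of 1 the std shrinks layer after layer, because tanh squashes the signal and nothing compensates for it. A gain above 1 offsets that squashing, which is the motivation for PyTorch's 5/3 value.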

---

**Blog:** [Day 11 — Activations & Gradients](https://omkarray.com/llm-day11.html)

**Prev:** [Day 10 — Embeddings & LR](llm_day10_embeddings_lr.ipynb) · **Next:** [Day 12 — BatchNorm](llm_day12_batchnorm.ipynb)