# Day 9: MLP Language Model (Bengio et al. 2003)

**Building LLMs from Scratch** — Following Andrej Karpathy's makemore lectures.

---

## 1. Introduction

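We move from **bigram** models to a more powerful **MLP (Multi-Layer Perceptron) language model**, following the seminal work of Bengio et al. (2003). Their model predicts words; here the same idea is applied at the character level. Key improvements over the bigram model: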

- **Context windows**: Instead of conditioning on a single previous character, we use a fixed context of multiple characters (e.g., 3 chars → predict next)
- **Learned embeddings**: Each character is mapped to a dense vector via an embedding lookup table, allowing the model to learn meaningful representations
- **Non-linear hidden layer**: A tanh-activated hidden layer captures complex interactions between context characters

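This architecture is a direct precursor of modern neural language models: token embeddings feed learned non-linear layers that output a probability distribution over the next token.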

## 2. Dataset

We use a small hardcoded list of 20 names. Build `stoi` (string → index) and `itos` (index → string), with `'.'` as token 0, which serves as both the start and the end marker.

In [None]:
import torch

words = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', 'charlotte', 'amelia', 'harper', 'evelyn',
         'abigail', 'emily', 'ella', 'elizabeth', 'camila', 'luna', 'sofia', 'avery', 'mila', 'aria']

chars = sorted(set(''.join(words)))
stoi = {'.': 0, **{c: i + 1 for i, c in enumerate(chars)}}
itos = {i: c for c, i in stoi.items()}

print(f"Dataset: {len(words)} names")
print(f"Vocabulary size: {len(stoi)}")
print(f"stoi: {stoi}")
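
As a quick check on the two mappings, the sketch below encodes a name to indices and decodes it back. The helpers `encode` and `decode` are hypothetical names introduced here for illustration, and the word list is shortened to keep the example self-contained.

```python
# Minimal sketch: round-trip a name through stoi / itos.
# `encode` and `decode` are hypothetical helpers, not part of the notebook.
words = ['emma', 'olivia', 'ava']  # shortened list, for illustration only

chars = sorted(set(''.join(words)))
stoi = {'.': 0, **{c: i + 1 for i, c in enumerate(chars)}}
itos = {i: c for c, i in stoi.items()}

def encode(s):
    """Map each character of s to its integer index."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer indices back to a string."""
    return ''.join(itos[i] for i in ids)

ids = encode('ava.')
print(ids)          # [1, 7, 1, 0]
print(decode(ids))  # ava.
```

Because `'.'` is pinned to index 0 and the letters are sorted, the mapping is deterministic for a given word list.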

## 3. Building the Dataset

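Create training examples with `block_size=3` (the context window). Each word is padded with `block_size` leading `'.'` tokens and one trailing `'.'`, so the first example predicts the first letter from an all-`'.'` context, which is the same context that sampling starts from. For each word, create (context, target) pairs where the context is 3 consecutive characters and the target is the character that follows. Store them as tensors `X` of shape `(n, block_size)` and `Y` of shape `(n,)`.

In [None]:
block_size = 3

X, Y = [], []
for w in words:
    # Pad with block_size start markers so the first letter is also a target
    chs = ['.'] * block_size + list(w) + ['.']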
    for i in range(len(chs) - block_size):
        context = chs[i:i + block_size]
        target = chs[i + block_size]
        X.append([stoi[c] for c in context])
        Y.append(stoi[target])

X = torch.tensor(X)
Y = torch.tensor(Y)

print(f"Training examples: {len(X)}")
print(f"X shape: {X.shape}")  # (n, block_size)
print(f"Y shape: {Y.shape}")  # (n,)
print(f"Example: context {[itos[x.item()] for x in X[0]]} -> target {itos[Y[0].item()]}")
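
To make the pairing concrete, here is a small plain-Python sketch that lists the (context, target) pairs for a single word. It assumes the word is padded with `block_size` leading `'.'` tokens, so the first letter is predicted from an all-`'.'` context; `build_pairs` is a hypothetical helper used only for this illustration.

```python
def build_pairs(word, block_size=3):
    """Return (context, target) string pairs for one word.

    Assumption: the word is padded with block_size leading '.' tokens
    and terminated with a single '.'.
    """
    chs = ['.'] * block_size + list(word) + ['.']
    pairs = []
    for i in range(len(chs) - block_size):
        context = ''.join(chs[i:i + block_size])
        target = chs[i + block_size]
        pairs.append((context, target))
    return pairs

pairs = build_pairs('ava')
for context, target in pairs:
    print(f"{context} -> {target}")
# ... -> a
# ..a -> v
# .av -> a
# ava -> .
```

Under this padding scheme, a word of length k contributes k + 1 examples: one per letter, plus one for the end marker.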

## 4. The MLP Architecture

Parameters:
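- `C`: embedding lookup table (27 rows, one 10-dim embedding per token). 27 is 26 letters plus `'.'`; the 20-name list above contains only 21 distinct tokens, so the remaining rows of `C` are never indexed and never change during training.
- `W1`, `b1`: hidden layer (3×10 = 30 inputs → 200 hidden units)
- `W2`, `b2`: output layer (200 hidden units → 27 logits)

All parameters are created with `requires_grad=True` so that autograd tracks gradients for them during training.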

In [None]:
torch.manual_seed(42)

C = torch.randn((27, 10), requires_grad=True)   # embedding lookup table
W1 = torch.randn((30, 200), requires_grad=True)  # 3*10=30 input, 200 hidden
b1 = torch.randn(200, requires_grad=True)
W2 = torch.randn((200, 27), requires_grad=True)  # output layer
b2 = torch.randn(27, requires_grad=True)

parameters = [C, W1, b1, W2, b2]
print(f"C: {C.shape}, W1: {W1.shape}, b1: {b1.shape}, W2: {W2.shape}, b2: {b2.shape}")
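
For a sense of scale, the sketch below counts the parameters implied by these shapes using only the standard library. The `shapes` dictionary is a hand-copied stand-in for the tensors, introduced here for illustration.

```python
import math

# Shapes copied by hand from the parameter list above.
shapes = {
    'C':  (27, 10),   # embedding table
    'W1': (30, 200),  # hidden layer weights
    'b1': (200,),     # hidden layer bias
    'W2': (200, 27),  # output layer weights
    'b2': (27,),      # output layer bias
}

counts = {name: math.prod(shape) for name, shape in shapes.items()}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count}")
print(f"total: {total}")  # total: 11897
```

With the actual tensors, the same total should come out of `sum(p.nelement() for p in parameters)`.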

## 5. Forward Pass

1. **Embedding lookup**: `emb = C[X]` gives each context character a 10-dim vector, for a shape of `(n, 3, 10)`
2. **Flatten**: `emb.view(-1, 30)` concatenates the 3 context embeddings into one 30-dim input per example
3. **Hidden layer**: `h = tanh(emb.view(-1, 30) @ W1 + b1)`
4. **Output layer**: `logits = h @ W2 + b2`
5. **Loss**: `F.cross_entropy(logits, Y)`
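
Written as equations, the five steps amount to the following for one example. The notation is introduced here for illustration: $c_1, c_2, c_3$ are the context token indices, $[\cdot\,;\cdot]$ denotes concatenation, and $y_i$ is the target index of example $i$.

$$
\begin{aligned}
x &= [\,C[c_1];\; C[c_2];\; C[c_3]\,] \in \mathbb{R}^{30} \\
h &= \tanh(x W_1 + b_1) \in \mathbb{R}^{200} \\
\text{logits} &= h W_2 + b_2 \in \mathbb{R}^{27} \\
\mathcal{L} &= -\frac{1}{n} \sum_{i=1}^{n} \log \operatorname{softmax}(\text{logits}_i)_{y_i}
\end{aligned}
$$

The average over the $n$ examples corresponds to the default `reduction='mean'` of `F.cross_entropy`.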

In [None]:
import torch.nn.functional as F

def forward(X, Y):
    emb = C[X]                    # (n, block_size, 10)
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)  # (n, 200)
    logits = h @ W2 + b2          # (n, 27)
    loss = F.cross_entropy(logits, Y)
    return loss

# Sanity check
loss = forward(X, Y)
print(f"Initial loss: {loss.item():.4f}")
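
A useful reference point for this sanity check: a model that assigns equal probability to all 27 outputs has a cross-entropy of ln(27). The plain-Python sketch below computes that baseline by hand. The helper `cross_entropy` is a hypothetical per-example stand-in written for this illustration, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

num_classes = 27
uniform_logits = [0.0] * num_classes  # equal logits -> uniform probabilities

baseline = cross_entropy(uniform_logits, target=5)
print(f"Uniform baseline: {baseline:.4f}")  # Uniform baseline: 3.2958
```

With unscaled `torch.randn` initialization, the initial loss is typically far above this baseline, because large random logits make the model confidently wrong on many examples.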

## 6. Training Loop

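Mini-batch training: at each step, sample 32 random indices (with replacement), run the forward pass, backpropagate, and update the parameters with a step size of 0.1. Run 10,000 steps and print the mini-batch loss every 1,000 steps. The parameters are re-initialized at the top of the cell, so training starts from the same seeded state on every run.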

In [None]:
import torch.nn.functional as F

torch.manual_seed(42)
C = torch.randn((27, 10), requires_grad=True)
W1 = torch.randn((30, 200), requires_grad=True)
b1 = torch.randn(200, requires_grad=True)
W2 = torch.randn((200, 27), requires_grad=True)
b2 = torch.randn(27, requires_grad=True)

parameters = [C, W1, b1, W2, b2]
step_size = 0.1
n = X.shape[0]
batch_size = 32

losses = []
for step in range(10000):
    # Mini-batch
    ix = torch.randint(0, n, (batch_size,))
    Xb, Yb = X[ix], Y[ix]
    
    # Forward
    emb = C[Xb]
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Yb)
    losses.append(loss.item())
    
    # Backward
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # Update
    for p in parameters:
        p.data -= step_size * p.grad
    
    if step % 1000 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")

## 7. Sampling

Generate 10 names from the trained model. Start from a context of three `'.'` tokens, sample the next character from the predicted distribution, shift it into the context, and repeat until `'.'` is sampled.

In [None]:
import torch.nn.functional as F

torch.manual_seed(123)
generated = []
for _ in range(10):
    out = []
    context = [0, 0, 0]  # ['.', '.', '.']
    while True:
        emb = C[torch.tensor([context])]  # (1, 3, 10)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1).item()
        context = context[1:] + [ix]
        if ix == 0:
            break
        out.append(itos[ix])
    generated.append(''.join(out))

print("Generated names:")
for name in generated:
    print(f"  {name}")
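
The sampling loop does not depend on how the probabilities are produced. The sketch below isolates the sliding-context logic in plain Python, with a hypothetical lookup table `toy_probs` standing in for the trained network and `random.choices` standing in for `torch.multinomial`. The tiny vocabulary is also invented for the example.

```python
import random

itos = {0: '.', 1: 'a', 2: 'v'}  # toy vocabulary, for illustration only

# Hypothetical stand-in for the model: context tuple -> next-token probabilities.
toy_probs = {
    (0, 0, 0): [0.0, 1.0, 0.0],  # '...' -> 'a'
    (0, 0, 1): [0.0, 0.0, 1.0],  # '..a' -> 'v'
    (0, 1, 2): [0.0, 1.0, 0.0],  # '.av' -> 'a'
    (1, 2, 1): [1.0, 0.0, 0.0],  # 'ava' -> '.'
}

def sample_name(probs_fn, block_size=3, rng=random):
    """Sample tokens until the end marker (index 0) is drawn."""
    context = [0] * block_size
    out = []
    while True:
        probs = probs_fn(tuple(context))
        ix = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        context = context[1:] + [ix]  # slide the window by one token
        if ix == 0:
            break
        out.append(itos[ix])
    return ''.join(out)

name = sample_name(toy_probs.__getitem__)
print(name)  # ava
```

The toy table is deterministic, so the output is always the same; with a real model the probabilities are spread over several tokens and each call can produce a different name.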

## 8. Loss Curve

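Plot the per-step mini-batch training loss recorded in `losses`. Because each value comes from a batch of only 32 examples, the curve is noisy.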

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot(losses, alpha=0.8, linewidth=0.5)
plt.xlabel('Training step')
plt.ylabel('Loss')
plt.title('MLP Language Model — Training Loss')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
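
One common way to see the trend through the mini-batch noise is to average the losses over fixed-size windows before plotting. The sketch below does this in plain Python; `window_means` and the window size are illustrative choices, and the `losses` list here is synthetic rather than taken from the training run.

```python
def window_means(values, window):
    """Average consecutive, non-overlapping windows of `values`.

    A trailing partial window is dropped.
    """
    n_windows = len(values) // window
    return [
        sum(values[i * window:(i + 1) * window]) / window
        for i in range(n_windows)
    ]

# Synthetic stand-in for the recorded mini-batch losses.
losses = [4.0, 2.0, 3.0, 1.0, 2.0, 2.0, 1.0]

smoothed = window_means(losses, window=2)
print(smoothed)  # [3.0, 2.0, 2.0]
```

Applied to the real run, `plt.plot(window_means(losses, 100))` would draw one point per 100 training steps.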
