
# makemore – Part 2: MLP Character-Level Language Model (Tutorial + Exercises)

In this notebook we build a **multi-layer perceptron (MLP)** character-level language model
for baby names, following the second makemore video.

This notebook is designed so that you can follow the **entire video and original notebook**
just by working through it. It is structured as follows:

- Every new idea is introduced in **plain language**, with **small examples**.
- Then there is an **Exercise** cell (`### YOUR CODE HERE` + `NotImplementedError(...)`).
- Then a **Solution** cell with a detailed, commented implementation.
- Many solutions also have a short **solution discussion** in markdown.

We assume you have already seen the **bigram** notebook (Part 1), but we briefly recap the key ideas.



## What we are going to build

We will:

1. **Load** a list of names from `names.txt`.
2. Turn it into a dataset of many small examples `(context, next_char)`:
   - context is a window of the previous `block_size` characters,
   - next_char is the character that actually follows in the training name.
3. Build a **character embedding** lookup table:
   - each character gets a small learned vector.
4. Build a **2-layer MLP**:
   - input: concatenated embeddings for the context,
   - hidden layer with `tanh` nonlinearity,
   - output layer that predicts a distribution over the next character.
5. Train the network with **mini-batch gradient descent** and `F.cross_entropy`.
6. Measure **train / dev / test** loss and discuss under/overfitting.
7. Visualise character embeddings and **sample new names**.

Along the way we’ll explain:

- why a pure **count table** explodes when we use longer context,
- how embeddings help generalisation (like in Bengio et al. 2003),
- how `view`, broadcasting, and advanced indexing in PyTorch work,
- why `F.cross_entropy` is preferred over a manual softmax.



## 1. Setup – imports and dataset

We’ll use:

- `torch` for tensors and autograd,
- `torch.nn.functional` (imported as `F`) for things like `one_hot`, `cross_entropy`, `softmax`, …
- `matplotlib` just for a few plots.

We also load `names.txt`, which should contain **one name per line**:


In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

%matplotlib inline

# Load the dataset: one name per line
with open('names.txt', 'r') as f:  # close the file handle once the names are read
    words = f.read().splitlines()
len(words), words[:10]


## Exercise 1 – Exploring the names dataset (recap)

Before diving into the MLP, let’s warm up (and confirm that the file looks sane).

Using the list `words`:

1. Print the **number of names**.
2. Print the **first 10 names**.
3. Compute the **minimum** and **maximum** name length (in characters).

This is similar to the Part 1 notebook, but it’s good practice.


In [None]:
# Exercise 1 – your turn: basic dataset stats

### YOUR CODE HERE
raise NotImplementedError("Exercise 1: print dataset stats (count, examples, min/max length)")

# Hints:
# - len(words) -> number of names
# - lengths = [len(w) for w in words]
# - min(lengths), max(lengths)

In [None]:
# Solution 1 – dataset stats

print("Number of names:", len(words))
print("First 10 names:", words[:10])

lengths = [len(w) for w in words]
print("Shortest name length:", min(lengths))
print("Longest  name length:", max(lengths))


**Solution discussion:**

- `len(words)` gives the number of names (rows in the file).
- `words[:10]` slices the first 10 names so you can eyeball them.
- We build a `lengths` list with a simple list comprehension, then use `min` and `max`.

This tells us roughly how long typical names are and how wide our context window (number of previous characters) might reasonably be.



## 2. Character vocabulary

We work at the **character** level. We need:

- a list of all characters in the dataset,
- a mapping from character → integer index (`stoi`),
- a mapping from index → character (`itos`),
- a special character `'.'` to represent both **start** and **end** of a word.

We will:

- assign index **0** to `'.'`,
- and indices **1..26** to `a..z`.


In [None]:
# Build the character vocabulary and mappings

chars = sorted(list(set(''.join(words))))  # all unique characters
stoi = {s: i + 1 for i, s in enumerate(chars)}  # reserve 0 for '.'
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

vocab_size = len(stoi)
print("chars:", chars)
print("vocab_size:", vocab_size)
print("stoi:", stoi)


## 3. From bigrams to longer context (and why we need an MLP)

In the **bigram** model (Part 1), we used a 27×27 table of counts/probabilities:

- rows: previous character,
- columns: next character.

This worked but was weak: it only looks at **one** previous character.

If we tried to extend this table to longer context:

- 1-char context: 27 possible histories → 27 rows.
- 2-char context: 27² = 729 rows.
- 3-char context: 27³ = 19 683 rows.
- 4-char context: 27⁴ = 531 441 rows.

Two problems:

1. The table grows **exponentially** with context length.
2. Many rows will be almost empty; we won’t have enough data to estimate them.
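
The growth above is easy to reproduce. A minimal sketch, assuming the 27-symbol vocabulary used in this notebook (26 letters plus `'.'`); `table_rows` is a name introduced only for this illustration:

```python
# Size of a count table as the context length grows.
vocab_size = 27  # assumption: 26 letters plus the '.' token

table_rows = {}
for k in range(1, 5):
    rows = vocab_size ** k       # one row per possible context of length k
    cells = rows * vocab_size    # one column per possible next character
    table_rows[k] = rows
    print(f"context length {k}: {rows:>7,} rows, {cells:>10,} cells")
```

Each extra character of context multiplies the table by 27, while the amount of training data stays fixed, so most cells would hold a count of zero.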

Instead, Bengio et al. (2003) proposed:

- give each symbol (word in their paper, character here) a **learnable embedding vector**, e.g. 10‑dimensional,
- feed a **fixed-length block of previous symbols** into a neural network,
- have the network output a probability distribution over the next symbol,
- train the whole thing by **maximising log-likelihood** (minimising negative log-likelihood).

Even if the model hasn’t seen the **exact** context `"a dog was running in a ..."`, it may have seen:

- `"the dog was running in a ..."`
- `"a cat was running in a ..."`

and learn that `"a"` and `"the"` have similar embeddings, `"dog"` and `"cat"` are similar, etc.
This way, it can **generalise** to new but related contexts.

We will build a character-level version of that idea.



## 4. Context windows (`block_size`) and a rolling example

We choose a **context length** `block_size`, the number of previous characters we give the model.

In this notebook we’ll start with:

```python
block_size = 3
```

That means:

- for each position in a word, the input is the previous 3 characters (padded with `'.'` at the start of the word),
- the target is the character at that position.

For example, for `block_size = 3` and the word `"emma"` we will generate:

```text
... -> e
..e -> m
.em -> m
emm -> a
mma -> .   (end of name)
```

Let’s build these contexts for **just one word** to see exactly what’s happening.


In [None]:
# Exercise 2 – your turn: contexts for one word

block_size = 3

# Use the FIRST word in the dataset
w = words[0]

### YOUR CODE HERE
raise NotImplementedError("Exercise 2: print all (context -> next_char) pairs for one word")

# Hints:
# - start with context = [0] * block_size  ('.' -> index 0)
# - loop over characters in w + '.'
# - convert ch to index with stoi[ch]
# - print ''.join(itos[i] for i in context), '--->', itos[ix]
# - then update context = context[1:] + [ix]

In [None]:
# Solution 2 – contexts for one word

block_size = 3

w = words[0]
print("Word:", w)

context = [0] * block_size  # start with '...' (all dots)
for ch in w + '.':
    ix = stoi[ch]
    # show the mapping from context to next char
    print(''.join(itos[i] for i in context), '--->', itos[ix])
    # roll the context window forward
    context = context[1:] + [ix]


**Solution discussion:**

- We start with `context = [0, 0, 0]`, which decodes to `"..."`.
- For each character `ch` in `w + '.'`:
  - we look up its index `ix = stoi[ch]`,
  - we print the current context and the next character,
  - we **slide the window**: drop the oldest index (`context[1:]`) and append the new one (`+ [ix]`).

This pattern is exactly how we’ll build the full dataset.



## 5. Building the full dataset and splitting into train/dev/test

Now we’ll turn *all* words into a big dataset of `(context, next_char)` pairs.

We’ll wrap the logic in a function:

```python
def build_dataset(words):
    X, Y = [], []
    # for each word:
    #   start with context = [0] * block_size
    #   loop over characters in w + '.'
    #   append context to X and ix (target) to Y
    # return X, Y as tensors
```

Then we’ll:

1. Shuffle the list of words.
2. Split into 3 parts:
   - 80% train,
   - 10% dev (validation),
   - 10% test.
3. Build `Xtr, Ytr`, `Xdev, Ydev`, `Xte, Yte` with `build_dataset`.

We’ll use these splits to talk about **overfitting** and **generalisation** later.
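
The 80/10/10 slicing can be checked on a toy list before applying it to the names. A minimal sketch, where `items` is a hypothetical stand-in for `words`:

```python
import random

# Hypothetical stand-in for the list of names: ten integers.
items = list(range(10))

random.seed(42)        # fixed seed so the shuffle is reproducible
random.shuffle(items)  # shuffles in place

n1 = int(0.8 * len(items))  # end of the train slice
n2 = int(0.9 * len(items))  # end of the dev slice

train, dev, test = items[:n1], items[n1:n2], items[n2:]
print(len(train), len(dev), len(test))
```

Because the three slices partition one shuffled list, no item can end up in more than one split.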


In [None]:
# Exercise 3 – your turn: build dataset + splits

block_size = 3  # keep using 3-char context

### YOUR CODE HERE
raise NotImplementedError("Exercise 3: implement build_dataset and create train/dev/test splits")

# Hints:
# - def build_dataset(words):
#       X, Y = [], []
#       for each word w:
#           context = [0] * block_size
#           for ch in w + '.':
#               ix = stoi[ch]
#               X.append(context)
#               Y.append(ix)
#               context = context[1:] + [ix]
#       convert X, Y to tensors and return
# - use random.seed(...) then random.shuffle(words)
# - n1 = int(0.8 * len(words))
#   n2 = int(0.9 * len(words))
# - words[:n1], words[n1:n2], words[n2:]

In [None]:
# Solution 3 – build dataset + splits

import random

def build_dataset(words_subset):
    X, Y = [], []
    for w in words_subset:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print("built dataset:", X.shape, Y.shape)
    return X, Y

# shuffle words once with a fixed seed for reproducibility
random.seed(42)
random.shuffle(words)

n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])

Xtr.shape, Ytr.shape, Xdev.shape, Ydev.shape, Xte.shape, Yte.shape


**Solution discussion:**

- `build_dataset` encapsulates the context-building logic we tested on a single word.
- We convert `X` and `Y` to tensors so they work nicely with PyTorch.
- We split the shuffled names into 3 disjoint sets:
  - **train** (`Xtr, Ytr`) is used to fit model parameters,
  - **dev** (`Xdev, Ydev`) is used to tune hyperparameters (like hidden size, embedding size, learning rate),
  - **test** (`Xte, Yte`) is only used at the very end to get a final performance number.

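Now we have a large supervised learning dataset. With the `names.txt` used in the video (about 32k names), the train split has shape roughly `(182k, 3)` → `(182k,)`; the cell above prints the exact shapes for your file.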



## 6. Character embeddings – replacing huge tables with vectors

The bigram model stored probabilities in a **table** with one row per context, and that table becomes huge as soon as the context grows beyond one character.

Instead, we’ll give every character a small **embedding vector** of size `n_embed`.

- Think of `C` as a matrix of shape `(vocab_size, n_embed)`.
- Row `i` is the embedding for character with index `i`.
- These embeddings start as random values and are tuned by gradient descent.

This is exactly like Bengio’s word embeddings, but at the character level.

To build intuition, let’s work with a tiny embedding dimension (2D) and embed a single index in two ways:

1. Direct lookup: `C[ix]`
2. One-hot → matrix multiply (like a linear layer with no bias).


In [None]:
# Exercise 4 – your turn: embedding a single index in two equivalent ways

vocab_size = len(stoi)
C_demo = torch.randn((vocab_size, 2))  # 2D embeddings for illustration

ix = 5  # arbitrary character index (not special in any way)

### YOUR CODE HERE
raise NotImplementedError("Exercise 4: embed ix with direct indexing and with one-hot @ C_demo")

# Hints:
# - emb_direct = C_demo[ix]
# - one_hot = F.one_hot(torch.tensor(ix), num_classes=vocab_size).float()
# - emb_via_one_hot = one_hot @ C_demo
# - print both and compare

In [None]:
# Solution 4 – embedding a single index in two equivalent ways

vocab_size = len(stoi)
C_demo = torch.randn((vocab_size, 2))

ix = 5  # some index

# 1) Direct lookup: treat C_demo as a lookup table
emb_direct = C_demo[ix]

# 2) One-hot encode ix, then multiply by C_demo
one_hot = F.one_hot(torch.tensor(ix), num_classes=vocab_size).float()  # shape: (vocab_size,)
emb_via_one_hot = one_hot @ C_demo  # shape: (2,)

print("emb_direct     :", emb_direct)
print("emb_via_one_hot:", emb_via_one_hot)
print("difference     :", (emb_direct - emb_via_one_hot).abs().max().item())


**Solution discussion:**

- The one-hot vector is all zeros except a 1 at position `ix`.
- When we compute `one_hot @ C_demo`, the matrix multiply picks out exactly the `ix`‑th row of `C_demo`.
- So `C_demo[ix]` and `one_hot @ C_demo` are numerically identical (up to floating-point rounding).

This shows that:

> “Embedding lookup” is just a special, efficient case of a linear layer with one-hot inputs.

We’ll mostly use the **lookup** style (`C[X]`) because it’s faster and more concise.



## 7. Embedding an entire batch of contexts

We don’t want to embed characters one at a time; we want to embed entire **batches** of contexts.

Recall:

- `Xtr` has shape `(N, block_size)` and stores character indices.
- If we have an embedding matrix `C` of shape `(vocab_size, n_embed)`, then `C[Xtr]` gives us:

```python
C[Xtr].shape == (N, block_size, n_embed)
```

Let’s check this on a small batch.


In [None]:
# Exercise 5 – your turn: embed a batch of contexts

# Take a small batch of 4 training examples
X_batch_demo = Xtr[:4]

### YOUR CODE HERE
raise NotImplementedError("Exercise 5: embed X_batch_demo and print shapes")

# Hints:
# - for now, reuse C_demo with n_embed = 2 (just for shape intuition)
# - emb = C_demo[X_batch_demo]
# - print X_batch_demo.shape and emb.shape

In [None]:
# Solution 5 – embed a small batch and inspect shapes

X_batch_demo = Xtr[:4]  # shape: (4, block_size)

emb_demo = C_demo[X_batch_demo]  # shape: (4, block_size, 2)

print("X_batch_demo.shape:", X_batch_demo.shape)
print("emb_demo.shape    :", emb_demo.shape)


**Solution discussion:**

- `X_batch_demo` has shape `(4, 3)` → 4 examples, each a context of length 3.
- `C_demo` has shape `(vocab_size, 2)`. Indexing with a 2D tensor `X_batch_demo` gives a 3D tensor:
  - first dimension: which example (0..3),
  - second dimension: position in the context (0..2),
  - third dimension: embedding dimension (0..1).
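
The one-hot equivalence from the previous section carries over to whole batches. A minimal sketch with toy sizes; `C_toy` and `X_toy` are names introduced only for this illustration:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
vocab_size, n_embed = 27, 2  # toy sizes, assumed for illustration
C_toy = torch.randn((vocab_size, n_embed), generator=g)

# Four hand-written contexts of length 3, i.e. shape (4, 3).
X_toy = torch.tensor([[0, 0, 5], [0, 5, 13], [5, 13, 13], [13, 13, 1]])

# Lookup: index the table with the whole 2D tensor at once.
emb_lookup = C_toy[X_toy]  # (4, 3, 2)

# One-hot route: (4, 3, 27) @ (27, 2) -> (4, 3, 2)
emb_onehot = F.one_hot(X_toy, num_classes=vocab_size).float() @ C_toy

print(emb_lookup.shape, torch.allclose(emb_lookup, emb_onehot))
```

The lookup avoids building the `(4, 3, 27)` one-hot tensor entirely, which is why it is the form used in the rest of the notebook.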

We’ll now switch from the toy `C_demo` to a “real” embedding matrix with a larger dimension (e.g. 10).


## 8. Defining the MLP parameters

We now define our actual model hyperparameters:

- `n_embed`: embedding dimension (e.g. 10),
- `n_hidden`: number of neurons in the hidden layer (e.g. 200),
- `block_size = 3`: context length (already chosen).

Our parameters will be:

- `C` : `(vocab_size, n_embed)` – character embedding table,
- `W1`: `(block_size * n_embed, n_hidden)` – weights of hidden layer,
- `b1`: `(n_hidden,)` – biases of hidden layer,
- `W2`: `(n_hidden, vocab_size)` – weights of output layer,
- `b2`: `(vocab_size,)` – biases of output layer.
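
The parameter count follows directly from these shapes. A small sketch, assuming `vocab_size = 27` and the example values `n_embed = 10`, `n_hidden = 200`:

```python
import math

vocab_size, block_size = 27, 3  # assumption: 26 letters plus '.'
n_embed, n_hidden = 10, 200     # example values from the list above

shapes = {
    "C":  (vocab_size, n_embed),
    "W1": (block_size * n_embed, n_hidden),
    "b1": (n_hidden,),
    "W2": (n_hidden, vocab_size),
    "b2": (vocab_size,),
}

# Number of elements per tensor is the product of its dimensions.
counts = {name: math.prod(shape) for name, shape in shapes.items()}
total = sum(counts.values())
print(counts, total)  # total is 11897
```

Most of the parameters sit in `W1` and `W2`; the embedding table itself is tiny.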


In [None]:
# Define model hyperparameters
n_embed = 10   # embedding dimension
n_hidden = 200 # hidden layer size

g = torch.Generator().manual_seed(2147483647)  # for reproducibility

C = torch.randn((vocab_size, n_embed), generator=g)
W1 = torch.randn((block_size * n_embed, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab_size), generator=g)
b2 = torch.randn(vocab_size, generator=g)

parameters = [C, W1, b1, W2, b2]
print("Total parameters:", sum(p.nelement() for p in parameters))

for p in parameters:
    p.requires_grad = True


## 9. Forward pass: from indices to logits

Given a batch `Xb` of shape `(B, block_size)` and targets `Yb` of shape `(B,)`, the forward pass is:

1. `emb = C[Xb]`
   - shape: `(B, block_size, n_embed)`
2. `emb_flat = emb.view(B, block_size * n_embed)`
   - shape: `(B, block_size * n_embed)` – concatenate embeddings in the context.
3. Hidden layer: `h = torch.tanh(emb_flat @ W1 + b1)`
   - `@` is matrix multiplication,
   - `b1` is broadcast over the batch dimension.
4. Output layer: `logits = h @ W2 + b2`
   - `logits` shape: `(B, vocab_size)`.

Let’s implement this for a small batch and check shapes.


In [None]:
# Exercise 6 – your turn: forward pass to logits on a small batch

batch_size = 32
ix = torch.randint(0, Xtr.shape[0], (batch_size,))
Xb = Xtr[ix]
Yb = Ytr[ix]

### YOUR CODE HERE
raise NotImplementedError("Exercise 6: compute emb, emb_flat, h, logits and print shapes")

# Hints:
# - emb = C[Xb]
# - B = emb.shape[0]
# - emb_flat = emb.view(B, -1)
# - h = torch.tanh(emb_flat @ W1 + b1)
# - logits = h @ W2 + b2

In [None]:
# Solution 6 – forward pass to logits

batch_size = 32
ix = torch.randint(0, Xtr.shape[0], (batch_size,))
Xb = Xtr[ix]
Yb = Ytr[ix]

# 1. Embed
emb = C[Xb]                    # (B, block_size, n_embed)
B = emb.shape[0]

# 2. Flatten context embeddings
emb_flat = emb.view(B, -1)     # (B, block_size * n_embed)

# 3. Hidden layer with tanh nonlinearity
h = torch.tanh(emb_flat @ W1 + b1)  # (B, n_hidden)

# 4. Output layer logits
logits = h @ W2 + b2           # (B, vocab_size)

print("emb.shape     :", emb.shape)
print("emb_flat.shape:", emb_flat.shape)
print("h.shape       :", h.shape)
print("logits.shape  :", logits.shape)


**Solution discussion:**

Key PyTorch tricks:

- `emb.view(B, -1)` **reshapes** the last two dimensions `(block_size, n_embed)` into one `(block_size * n_embed)`.
  - This avoids copying: `view` only reinterprets the existing memory, so it is much more efficient than manually concatenating tensors.
- When we do `emb_flat @ W1 + b1`:
  - `emb_flat` is `(B, 30)` (for `block_size=3, n_embed=10`),
  - `W1` is `(30, n_hidden)`,
  - result is `(B, n_hidden)`,
  - adding `b1` (shape `(n_hidden,)`) uses **broadcasting**: the same bias vector is added to every row.
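
Both tricks can be verified on a small random tensor. A minimal sketch with toy sizes (the values of `B` and `n_hidden` here are chosen only for illustration):

```python
import torch

g = torch.Generator().manual_seed(0)
B, block_size, n_embed, n_hidden = 4, 3, 10, 5  # toy sizes
emb = torch.randn((B, block_size, n_embed), generator=g)

# Flattening with view gives the same values as explicit concatenation ...
flat_view = emb.view(B, -1)
flat_cat = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)
same_values = torch.equal(flat_view, flat_cat)

# ... but view reuses the original memory instead of allocating a copy.
shares_memory = flat_view.data_ptr() == emb.data_ptr()

# Broadcasting: adding b1 of shape (n_hidden,) matches adding a copy per row.
W1_toy = torch.randn((block_size * n_embed, n_hidden), generator=g)
b1_toy = torch.randn(n_hidden, generator=g)
pre_broadcast = flat_view @ W1_toy + b1_toy
pre_explicit = flat_view @ W1_toy + b1_toy.unsqueeze(0).expand(B, n_hidden)

print(same_values, shares_memory, torch.equal(pre_broadcast, pre_explicit))
```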

So far we’ve built the forward pass of a **2-layer MLP** (a `tanh` hidden layer followed by a linear output layer) that outputs unnormalised scores (`logits`) for each possible next character.



## 10. From logits to loss: softmax and cross-entropy

To train the model, we want a **loss** that measures how well the logits match the actual next characters.

Steps (manually):

1. Convert logits to **unnormalised counts** by exponentiating: `counts = logits.exp()`
2. Convert counts to probabilities by row-wise normalisation:
   \[ P_{ij} = \frac{\exp(\text{logits}_{ij})}{\sum_k \exp(\text{logits}_{ik})} \]
3. For each example `i`, take the probability of the correct class `Yb[i]`.
4. Take log-probabilities and average the **negative** log-probabilities.

PyTorch has a ready-made function `F.cross_entropy(logits, targets)` that does all this:

- applies softmax internally in a numerically stable way,
- then computes the average negative log-likelihood.

Let’s compute the loss both ways and compare.


In [None]:
# Exercise 7 – your turn: manual NLL vs F.cross_entropy

### YOUR CODE HERE
raise NotImplementedError("Exercise 7: compute manual NLL and compare with F.cross_entropy")

# Hints:
# - counts = logits.exp()
# - probs = counts / counts.sum(1, keepdims=True)
# - log_probs_correct = probs[torch.arange(B), Yb].log()
# - loss_manual = -log_probs_correct.mean()
# - loss_ce = F.cross_entropy(logits, Yb)
# - print(loss_manual.item(), loss_ce.item())

In [None]:
# Solution 7 – manual NLL vs F.cross_entropy

# Reuse logits and Yb from the previous cell
B = logits.shape[0]

# Manual computation
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
log_probs_correct = probs[torch.arange(B), Yb].log()
loss_manual = -log_probs_correct.mean()

# Using PyTorch helper
loss_ce = F.cross_entropy(logits, Yb)

print("Manual NLL loss :", loss_manual.item())
print("F.cross_entropy :", loss_ce.item())


**Solution discussion:**

- Both methods compute the **average negative log-likelihood** over the batch.
- `F.cross_entropy` and our manual version should match very closely.

Why use `F.cross_entropy` in practice?

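1. **Efficiency** – it can fuse the softmax and log steps, so it doesn’t need to materialise intermediate tensors such as `counts` and `probs`, and the backward pass is simpler.
2. **Numerical stability** – it effectively subtracts each row’s maximum logit before exponentiating, which leaves the probabilities unchanged but avoids overflow when logits get large.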
3. **Convenience** – less boilerplate code and less room for subtle bugs.
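
The stability point is easy to see with one extreme logit. A minimal sketch; the logit values are made up for illustration, and the "shifted" version is a hand-written imitation of the max-subtraction idea, not PyTorch's actual internals:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-5.0, 0.0, 3.0, 100.0]])  # one row with a very large logit
target = torch.tensor([3])

# Naive softmax: exp(100) overflows float32 to inf, and inf / inf is nan.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss_naive = -probs[0, target[0]].log()  # nan

# Subtracting the row maximum first keeps every exponent <= 0.
shifted = logits - logits.max(1, keepdim=True).values
counts_s = shifted.exp()
probs_s = counts_s / counts_s.sum(1, keepdim=True)
loss_shifted = -probs_s[0, target[0]].log()  # finite, close to 0

loss_ce = F.cross_entropy(logits, target)  # finite as well
print(loss_naive.item(), loss_shifted.item(), loss_ce.item())
```

Subtracting a constant from every logit in a row does not change the softmax output, which is why the shift is safe.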

From now on we’ll use `F.cross_entropy` for the loss.



## 11. One gradient descent step with autograd

Let’s now train the network a tiny bit.

The steps for **one update** are:

1. Pick a mini-batch of examples.
2. Forward pass → logits → loss.
3. Zero all parameter gradients.
4. Backward pass: `loss.backward()`.
5. Update each parameter with gradient descent:
   \[ p \leftarrow p - \text{learning\_rate} \times p.\text{grad} \]

We’ll implement just **one update** and check that the loss goes down **on the same batch**.


In [None]:
# Exercise 8 – your turn: one training step on a mini-batch

# Use a fresh random mini-batch
ix = torch.randint(0, Xtr.shape[0], (batch_size,))
Xb = Xtr[ix]
Yb = Ytr[ix]

### YOUR CODE HERE
raise NotImplementedError("Exercise 8: do one gradient descent step and print loss before/after")

# Hints:
# - Forward pass to get loss_before (use F.cross_entropy)
# - Zero gradients: for p in parameters: p.grad = None
# - loss_before.backward()
# - learning rate, e.g. lr = 0.1
# - update parameters: p.data += -lr * p.grad
# - recompute loss_after on the *same* batch and print both

In [None]:
# Solution 8 – one training step on a mini-batch

# Fresh mini-batch
ix = torch.randint(0, Xtr.shape[0], (batch_size,))
Xb = Xtr[ix]
Yb = Ytr[ix]

# Forward pass – loss before update
emb = C[Xb]
B = emb.shape[0]
emb_flat = emb.view(B, -1)
h = torch.tanh(emb_flat @ W1 + b1)
logits = h @ W2 + b2
loss_before = F.cross_entropy(logits, Yb)

print("loss before update:", loss_before.item())

# Backward pass
for p in parameters:
    p.grad = None
loss_before.backward()

# Gradient descent update
lr = 0.1
for p in parameters:
    p.data += -lr * p.grad

# Forward pass – loss after update (on the *same* batch)
emb = C[Xb]
B = emb.shape[0]
emb_flat = emb.view(B, -1)
h = torch.tanh(emb_flat @ W1 + b1)
logits = h @ W2 + b2
loss_after = F.cross_entropy(logits, Yb)

print("loss after  update:", loss_after.item())


**Solution discussion:**

- `loss.backward()` walks backward through all operations and fills `p.grad` for each parameter `p`.
- The gradient tells us which direction **increases** the loss.
- Stepping with `p.data += -lr * p.grad` roughly moves us towards a parameter setting with lower loss.

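On the **same batch**, the loss should almost always go down after one step, provided the learning rate is not too large. Over a full training run the recorded loss is noisier, because every step evaluates a different random mini-batch; what matters is the trend over many steps.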
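
The same five steps can be traced by hand on a one-parameter toy problem. A minimal sketch, assuming a made-up loss `(w - 3)^2`; it also shows `torch.no_grad()` as an alternative way to write the update without touching `p.data`:

```python
import torch

# Single scalar parameter and a toy quadratic loss (both hypothetical).
w = torch.tensor(0.0, requires_grad=True)

def loss_fn(w):
    return (w - 3.0) ** 2

loss_before = loss_fn(w)  # (0 - 3)^2 = 9
w.grad = None             # zero the gradient
loss_before.backward()    # d/dw (w - 3)^2 = 2 * (w - 3) = -6 at w = 0

lr = 0.1
with torch.no_grad():     # the update itself must not be recorded by autograd
    w -= lr * w.grad      # 0 - 0.1 * (-6) = 0.6

loss_after = loss_fn(w)   # (0.6 - 3)^2, about 5.76
print(loss_before.item(), loss_after.item())
```

The gradient is negative at `w = 0`, so stepping against it increases `w` towards the minimum at 3.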



## 12. Full training loop with mini-batches and a simple LR schedule

We now put it all together:

- mini-batch size: 32
- number of steps: e.g. 200,000
- learning rate schedule:
  - `lr = 0.1` for the first 100,000 steps,
  - `lr = 0.01` for the remaining 100,000 steps.

In practice, you can lower these numbers if you’re just experimenting or if training is too slow.

We will record the loss every 1000 steps so we can plot it.


In [None]:
# Exercise 9 – your turn: training loop with mini-batches

max_steps = 200000
batch_size = 32

loss_history = []

### YOUR CODE HERE
raise NotImplementedError("Exercise 9: implement the training loop over many mini-batches")

# Hints:
# For i in range(max_steps):
#   - sample ix = torch.randint(0, Xtr.shape[0], (batch_size,))
#   - forward to compute loss (F.cross_entropy)
#   - zero grads, backward, update with lr = 0.1 if i < 100000 else 0.01
#   - every 1000 steps: append loss.item() to loss_history and print it

In [None]:
# Solution 9 – training loop with mini-batches

max_steps = 200000
batch_size = 32

loss_history = []

for i in range(max_steps):
    # Sample a mini-batch of indices
    ix = torch.randint(0, Xtr.shape[0], (batch_size,))
    Xb = Xtr[ix]
    Yb = Ytr[ix]

    # Forward pass
    emb = C[Xb]               # (B, block_size, n_embed)
    B = emb.shape[0]
    emb_flat = emb.view(B, -1)
    h = torch.tanh(emb_flat @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Yb)

    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # Update
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad

    # Logging
    if i % 1000 == 0:
        loss_history.append(loss.item())
        print(f"step {i:6d} | loss = {loss.item():.4f}")


> **Note:** If 200k steps is too slow for your machine, feel free to reduce `max_steps`
and/or the hidden size and embedding size. The logic remains exactly the same.

Let’s plot the training loss samples.


In [None]:
plt.figure()
plt.plot(loss_history)
plt.xlabel("checkpoint (every 1000 steps)")
plt.ylabel("mini-batch loss")
plt.title("Training loss over time")
plt.show()


## 13. Evaluating train and dev loss

The training loss we plotted is computed on **mini-batches**, not the whole dataset.

To see how well the model really fits the data, we compute:

- **train loss** on *all* `Xtr, Ytr`,
- **dev loss** on *all* `Xdev, Ydev`.

We expect the dev loss to be a bit higher than the train loss (because the model was trained on train, not dev).
If dev loss is much higher than train loss, we are **overfitting**.
If dev loss is similar but both are high, we are probably **underfitting**.
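
A useful reference point for what counts as "high" is the loss of a model that assigns equal probability to every character. A small sketch, assuming `vocab_size = 27`; the targets are arbitrary:

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 27  # assumption: 26 letters plus '.'

# All-zero logits give a uniform distribution over the vocabulary.
uniform_logits = torch.zeros((4, vocab_size))
targets = torch.tensor([0, 5, 13, 26])  # arbitrary targets; the loss is the same for any

loss_uniform = F.cross_entropy(uniform_logits, targets)
print(loss_uniform.item(), math.log(vocab_size))  # both about 3.2958
```

A trained model should land well below this value of `log(27)`; a loss near it means the model has learned essentially nothing.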


In [None]:
# Exercise 10 – your turn: compute train and dev loss

### YOUR CODE HERE
raise NotImplementedError("Exercise 10: compute average loss on train and dev splits")

# Hints:
# - define a helper function split_loss(X, Y):
#      emb = C[X]
#      N = emb.shape[0]
#      emb_flat = emb.view(N, -1)
#      h = torch.tanh(emb_flat @ W1 + b1)
#      logits = h @ W2 + b2
#      return F.cross_entropy(logits, Y)
# - then call it on Xtr, Ytr and Xdev, Ydev

In [None]:
# Solution 10 – compute train and dev loss

@torch.no_grad()  # evaluation only: no need to build the autograd graph
def split_loss(X, Y):
    emb = C[X]
    N = emb.shape[0]
    emb_flat = emb.view(N, -1)
    h = torch.tanh(emb_flat @ W1 + b1)
    logits = h @ W2 + b2
    return F.cross_entropy(logits, Y)

loss_train = split_loss(Xtr, Ytr)
loss_dev = split_loss(Xdev, Ydev)

print("train loss:", loss_train.item())
print("dev   loss:", loss_dev.item())


**Solution discussion:**

- `split_loss` is just the forward pass applied to *all* examples in a split at once.
- We get a single scalar cross-entropy loss for each split.
- In the original lesson, typical numbers (after enough training) were roughly:
  - train ≈ 2.12
  - dev   ≈ 2.17

If your numbers are similar, your model is behaving as in the video.
If dev is much lower or higher, you may have changed hyperparameters or training length.



## 14. Visualising character embeddings

Character embeddings live in an `n_embed`-dimensional space (here 10D), so we can’t really “see” them directly.

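However, we can at least plot the **first two** dimensions to see whether the model has learned anything interesting. These are only 2 of the 10 coordinates, so any structure may be fainter than in the video, where this plot is made for a model with 2D embeddings.

You may see: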

- vowels clustering near each other,
- special tokens like `'.'` or rare letters (`'q'`) sitting in special places.


In [None]:
# Exercise 11 – your turn: scatter plot of embeddings (first two dims)

### YOUR CODE HERE
raise NotImplementedError("Exercise 11: scatter plot C[:,0] vs C[:,1] with character labels")

# Hints:
# - plt.figure(figsize=(6,6))
# - plt.scatter(C[:,0].data, C[:,1].data)
# - loop over i in range(vocab_size): plt.text(C[i,0].item(), C[i,1].item(), itos[i], ...)
# - plt.grid(True)

In [None]:
# Solution 11 – scatter plot of embeddings

plt.figure(figsize=(6, 6))
plt.scatter(C[:, 0].data, C[:, 1].data, s=200)

for i in range(vocab_size):
    ch = itos[i]
    plt.text(C[i, 0].item(), C[i, 1].item(), ch, ha='center', va='center', color='white')

plt.xlabel("embedding dim 0")
plt.ylabel("embedding dim 1")
plt.grid(True)
plt.title("Character embeddings (first two dimensions)")
plt.show()


**Solution discussion:**

Remember that these embeddings started as random noise.

After training, the network has moved them around so that characters with **similar roles** in names
(e.g. vowels, consonants, start/end token `'.'`) tend to have related vectors.

In the Bengio paper, this was done at the **word** level; here we see a smaller, character-level analogue.



## 15. Sampling new names from the trained MLP

We can now use the trained model to **generate** names.

Sampling procedure:

1. Start with `context = [0] * block_size` (all dots).
2. Repeat:
   - embed the context,
   - run through MLP to get logits,
   - convert logits to probabilities with `F.softmax`,
   - sample an index from the distribution using `torch.multinomial`,
   - shift the context and append this index,
   - save the index in an output list,
   - stop when we sample `ix == 0` (end token `'.'`).
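
The softmax-then-sample step can be tried in isolation. A minimal sketch with three made-up logits, checking that `torch.multinomial` draws indices in proportion to the probabilities:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)

logits = torch.tensor([[2.0, 1.0, 0.0]])  # made-up scores for 3 classes
probs = F.softmax(logits, dim=1)          # each row sums to 1

# Draw many samples (with replacement) from the single row of probs.
samples = torch.multinomial(probs, num_samples=10000, replacement=True, generator=g)

# Empirical frequencies should be close to probs.
freqs = torch.bincount(samples[0], minlength=3).float() / samples.shape[1]
print(probs[0], freqs)
```

In the sampling loop only one index is drawn per step (`num_samples=1`), so the randomness is what makes each generated name different.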

Let’s implement a loop that samples 20 names.


In [None]:
# Exercise 12 – your turn: sampling names from the trained model

g = torch.Generator().manual_seed(2147483647 + 10)  # deterministic sampling

### YOUR CODE HERE
raise NotImplementedError("Exercise 12: implement sampling loop to print 20 generated names")

# Hints:
# for _ in range(20):
#   context = [0] * block_size
#   out = []
#   while True:
#       x = torch.tensor([context])
#       emb = C[x]
#       emb_flat = emb.view(1, -1)
#       h = torch.tanh(emb_flat @ W1 + b1)
#       logits = h @ W2 + b2
#       probs = F.softmax(logits, dim=1)
#       ix = torch.multinomial(probs, num_samples=1, generator=g).item()
#       context = context[1:] + [ix]
#       out.append(ix)
#       if ix == 0: break
#   print(''.join(itos[i] for i in out))

In [None]:
# Solution 12 – sampling names from the trained model

g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    context = [0] * block_size
    out = []
    while True:
        x = torch.tensor([context])          # (1, block_size)
        emb = C[x]                           # (1, block_size, n_embed)
        emb_flat = emb.view(1, -1)           # (1, block_size * n_embed)
        h = torch.tanh(emb_flat @ W1 + b1)   # (1, n_hidden)
        logits = h @ W2 + b2                 # (1, vocab_size)
        probs = F.softmax(logits, dim=1)     # (1, vocab_size)

        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break

    print(''.join(itos[i] for i in out))


You should see names that are much more realistic than those from the pure bigram model:

- They tend to have plausible syllable structure,
- They often look like variations of real names,
- But many will be unique (not exactly present in `names.txt`).

This is exactly what the MLP is designed to do: capture longer-range patterns than bigrams,
without exploding into a gigantic count table.



## 16. Wrap-up and next steps

In this notebook you:

- Revisited the bigram model’s limitation: **too little context** or an **exploding table**.
- Built a **character-level dataset** of `(context, next_char)` pairs for arbitrary `block_size`.
- Implemented a **learned embedding table** and saw its equivalence to one-hot + linear.
- Constructed a **2-layer MLP** by hand (no `nn.Module` magic):
  - embedding lookup,
  - concatenation via `view`,
  - hidden layer with `tanh`,
  - output layer producing logits.
- Used **`F.cross_entropy`** to compute negative log-likelihood in a numerically stable way.
- Trained with **mini-batch stochastic gradient descent** and a simple learning rate schedule.
- Evaluated **train vs dev loss** and discussed under- and overfitting.
- Visualised character embeddings (first two dimensions).
- Sampled new names from the trained MLP.

From here, you can experiment with:

- **Changing context length** (`block_size`).
- **Changing embedding size** (`n_embed`).
- **Changing hidden size** (`n_hidden`).
- Trying different **learning rate schedules**, batch sizes, or optimisers.
- Implementing a **learning rate finder** like in the video (sweep over log-spaced learning rates and plot loss).
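
A learning rate sweep of that kind could look roughly like this. It is only a sketch under stated assumptions: the data `X_toy`, `Y_toy` and the linear model `W_toy`, `b_toy` are hypothetical stand-ins for the notebook's dataset and MLP, and the exponent range from -3 to 0 is chosen for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)

# Hypothetical toy problem: 512 random 8-dimensional inputs, 5 classes.
X_toy = torch.randn((512, 8), generator=g)
Y_toy = torch.randint(0, 5, (512,), generator=g)
W_toy = torch.randn((8, 5), generator=g, requires_grad=True)
b_toy = torch.zeros(5, requires_grad=True)
params = [W_toy, b_toy]

# Log-spaced candidates: exponents from -3 to 0, so rates from 0.001 to 1.0.
lre = torch.linspace(-3, 0, 200)
lrs = 10 ** lre

losses = []
for lr in lrs:
    # One mini-batch step per candidate learning rate.
    ix = torch.randint(0, X_toy.shape[0], (32,), generator=g)
    loss = F.cross_entropy(X_toy[ix] @ W_toy + b_toy, Y_toy[ix])
    for p in params:
        p.grad = None
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad
    losses.append(loss.item())

# To inspect the sweep: plt.plot(lre, losses) and look for the exponent
# range where the loss falls fastest before it becomes unstable.
```

Plotting against the exponent rather than the rate itself spreads the small learning rates out, which makes the useful region easier to read.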

All the core ideas from the MLP makemore video are now encoded in this notebook as exercises + solutions,
so you can re-derive everything yourself just by working through it.
