# Day 6: Intro to Language Modeling — The Bigram Model

**Building LLMs from Scratch** — Following Andrej Karpathy's makemore lectures.

---

## 1. Introduction

Having built autograd, we now turn to **language modeling**: predicting the next token (a character or a word) from the preceding context. The simplest such model is the **bigram model**, which predicts the next character from the previous character alone.

This notebook builds a character-level bigram model from scratch using PyTorch.
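
To make "the previous character predicts the next one" concrete, the sketch below lists the training pairs a bigram model would see for a single name. It assumes `.` as the boundary marker, matching the convention this notebook uses in the later sections; the helper name `bigrams` is hypothetical.

```python
def bigrams(word, boundary="."):
    """Return the (previous, next) character pairs for one word."""
    chs = [boundary] + list(word) + [boundary]
    return list(zip(chs[:-1], chs[1:]))

pairs = bigrams("emma")
print(pairs)
# [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
```

A four-letter name yields five pairs, because the start and end markers each contribute one.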

## 2. Loading the Dataset

We use a small hardcoded list of common names. Each name is a sequence of characters we'll model.

In [None]:
words = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', 'charlotte', 'amelia', 'harper', 'evelyn',
         'abigail', 'emily', 'ella', 'elizabeth', 'camila', 'luna', 'sofia', 'avery', 'mila', 'aria']

print(f"Dataset: {len(words)} names")
print(words)

## 3. Character Mappings

Build two lookup tables: `stoi` (character → index) and `itos` (index → character). We use `.` as a single start/end token, so every name is wrapped as `.name.` before counting.

In [None]:
chars = ['.'] + [chr(i) for i in range(ord('a'), ord('z') + 1)]  # 27 chars: . + a-z
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}

print(f"Vocabulary size: {len(stoi)} (including '.' as start/end token)")
print(f"stoi: {stoi}")
print(f"itos: {itos}")
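
As a quick check that the two mappings are inverses, a name can be encoded to indices and decoded back. This is a small sketch; the helpers `encode` and `decode` are hypothetical names, not part of the notebook.

```python
chars = ['.'] + [chr(i) for i in range(ord('a'), ord('z') + 1)]
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}

def encode(name):
    """Wrap the name in '.' markers and map each character to its index."""
    return [stoi[c] for c in '.' + name + '.']

def decode(ids):
    """Map indices back to characters."""
    return ''.join(itos[i] for i in ids)

ids = encode('emma')
print(ids)          # [0, 5, 13, 13, 1, 0]
print(decode(ids))  # .emma.
```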

## 4. Building the Bigram Count Matrix

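Count every bigram (pair of consecutive characters) in the dataset. The counts go into a 27×27 matrix `N`, where `N[i, j]` is the number of times character `j` follows character `i`; the 27 symbols are the 26 letters plus `.` at index 0.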

In [None]:
import torch

N = torch.zeros((27, 27), dtype=torch.int32)

for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs[:-1], chs[1:]):
        ix1, ix2 = stoi[c1], stoi[c2]
        N[ix1, ix2] += 1

print("Bigram count matrix (rows=first char, cols=second char):")
print(N)
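
A dictionary-based count can serve as a cross-check on the tensor. The sketch below uses a hypothetical three-name list, `sample_words`, rather than the full dataset, and also verifies that each name contributes `len(name) + 1` bigrams.

```python
from collections import Counter

sample_words = ["emma", "ava", "mia"]  # hypothetical subset for illustration

counts = Counter()
for w in sample_words:
    chs = ["."] + list(w) + ["."]
    counts.update(zip(chs[:-1], chs[1:]))

total = sum(counts.values())
expected_total = sum(len(w) + 1 for w in sample_words)

print(counts.most_common(1))  # [(('a', '.'), 3)]
print(total, expected_total)  # 13 13
```

The same identity should hold for the count matrix: `N.sum()` equals `sum(len(w) + 1 for w in words)`.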

## 5. Visualizing the Matrix

Plot the count matrix with matplotlib. Rows and columns are labeled by character.

In [None]:
import matplotlib.pyplot as plt

labels = [itos[i] for i in range(27)]

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(N, cmap='Blues')

ax.set_xticks(range(27))
ax.set_yticks(range(27))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel('Next character')
ax.set_ylabel('Current character')
ax.set_title('Bigram Count Matrix')

# Annotate every nonzero cell with its count
for i in range(27):
    for j in range(27):
        if N[i, j].item() > 0:
            ax.text(j, i, int(N[i, j].item()), ha='center', va='center', fontsize=8)

plt.colorbar(im, ax=ax, label='Count')
plt.tight_layout()
plt.show()

## 6. Converting to Probabilities

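Normalize each row of `N` so it sums to 1, giving P(next | current). We apply **add-1 smoothing** first, `P = (N + 1).float()`, so that no bigram has probability zero; a zero would make the log-likelihood in the evaluation step negative infinity. We then divide each row by its sum.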
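
Written out, with $N_{ij}$ the number of times character $j$ follows character $i$ and 27 the vocabulary size, the smoothed estimate is:

$$
P(j \mid i) = \frac{N_{ij} + 1}{\sum_{k} N_{ik} + 27}
$$

A row with no observed counts therefore falls back to the uniform distribution, $1/27$ for every next character.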

In [None]:
P = (N + 1).float()  # add-1 smoothing
P = P / P.sum(1, keepdim=True)

print("Probability matrix P[current, next] = P(next | current) (sample row):")
print(f"P[0,:] (after '.'): {P[0].tolist()}")
print(f"Row sums: {P.sum(1)}")
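
The `keepdim=True` argument matters for the row normalization. A minimal sketch on a hypothetical 2×2 count table shows what changes without it: the division still runs, but broadcasting lines the sums up with columns instead of rows.

```python
import torch

counts = torch.tensor([[1.0, 3.0],
                       [1.0, 1.0]])  # hypothetical toy counts

# keepdim=True keeps shape (2, 1), so each row is divided by its own sum.
good = counts / counts.sum(1, keepdim=True)

# Without keepdim the sums have shape (2,), which broadcasts as a row vector:
# column j ends up divided by the sum of row j.
bad = counts / counts.sum(1)

print(good.sum(1))  # tensor([1., 1.])
print(bad.sum(1))   # rows no longer sum to 1
```

Checking the row sums after normalizing is a cheap way to catch this mistake.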

## 7. Sampling from the Model

Generate new names by sampling from the bigram distribution. Start at `.`, sample the next character from the current character's row of `P`, and repeat until `.` is sampled again.

In [None]:
torch.manual_seed(42)

def sample_name():
    out = []
    ix = 0  # start with '.'
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1).item()
        if ix == 0:
            break
        out.append(itos[ix])
    return ''.join(out)

print("Generated names:")
for _ in range(10):
    print(sample_name())
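
One possible variation, sketched here on a hypothetical three-symbol model, passes an explicit `torch.Generator` so the samples do not depend on the global seed, and caps the length as a guard against very long outputs. The parameters `generator` and `max_len` and the toy matrix `P_toy` are assumptions made for illustration.

```python
import torch

def sample_name(P, itos, generator, max_len=50):
    """Sample characters until '.' (index 0) is drawn or max_len is reached."""
    out, ix = [], 0  # start from '.'
    for _ in range(max_len):
        ix = torch.multinomial(P[ix], num_samples=1, generator=generator).item()
        if ix == 0:
            break
        out.append(itos[ix])
    return ''.join(out)

# Toy model over '.', 'a', 'b'; each row sums to 1.
itos_toy = {0: '.', 1: 'a', 2: 'b'}
P_toy = torch.tensor([[0.0, 0.5, 0.5],
                      [0.5, 0.25, 0.25],
                      [0.5, 0.25, 0.25]])

g1 = torch.Generator().manual_seed(42)
g2 = torch.Generator().manual_seed(42)
first = [sample_name(P_toy, itos_toy, g1) for _ in range(5)]
second = [sample_name(P_toy, itos_toy, g2) for _ in range(5)]
print(first == second)  # True: same seed, same samples
```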

## 8. Evaluating with NLL

Compute the mean **negative log-likelihood** (NLL) on the training data: for each bigram (ix1, ix2) we take -log P[ix1, ix2], then average over all bigrams. A lower NLL means the model assigns higher probability to the data.

In [None]:
log_likelihood = 0.0
n = 0

for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs[:-1], chs[1:]):
        ix1, ix2 = stoi[c1], stoi[c2]
        log_likelihood += torch.log(P[ix1, ix2]).item()
        n += 1

nll = -log_likelihood / n
print(f"Negative Log-Likelihood (mean): {nll:.4f}")
print("(Lower is better)")

In [None]:
# Alternative: vectorized NLL using tensors
ix1_list, ix2_list = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs[:-1], chs[1:]):
        ix1_list.append(stoi[c1])
        ix2_list.append(stoi[c2])

ix1 = torch.tensor(ix1_list)
ix2 = torch.tensor(ix2_list)
nll_vec = -torch.log(P[ix1, ix2]).mean()
print(f"Vectorized NLL: {nll_vec.item():.4f}")
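
A useful reference point when reading these numbers is the uniform model, which assigns probability 1/27 to every next character and so has a mean NLL of log 27. The sketch below computes that baseline with the standard library; `mean_nll` is a hypothetical helper.

```python
import math

def mean_nll(probs):
    """Average negative log-likelihood of a list of per-bigram probabilities."""
    return -sum(math.log(p) for p in probs) / len(probs)

vocab_size = 27
# Every bigram gets probability 1/27, regardless of how many bigrams there are.
uniform = mean_nll([1.0 / vocab_size] * 10)
print(f"{uniform:.4f}")  # 3.2958
```

A bigram model fit to the training names would be expected to score below this value on that same data.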

---

**Blog:** [Day 6 — Bigram Model](https://omkarray.com/llm-day6.html)

**Prev:** [Day 5 — Training the MLP](llm_day05_training.ipynb) · **Next:** [Day 7 — MLP Language Model](llm_day07_mlp_lm.ipynb)