In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and set random seeds (no packages are installed here)

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. That is fine: the N-gram code in this notebook runs on the CPU.")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# N-gram Language Models: Predicting the Next Word by Counting -- Vizuara

---

## 1. Why Does This Matter?

Every time your phone suggests the next word, every time a chatbot completes your sentence, every time a search engine guesses what you are about to type -- a **language model** is at work. And it all started with the simplest idea imaginable: **counting.**

Before neural networks, before transformers, before GPT -- the dominant approach to language modeling was the **N-gram model.** The idea is disarmingly simple: look at the last few words, and predict what comes next based on how often you have seen that pattern before.

In this notebook, you will build an N-gram language model **completely from scratch.** The model itself uses no deep learning libraries and no pre-trained weights -- just Python, counting, and probability, with NumPy and Matplotlib for sampling and plots. By the end, you will have a working model that can generate (surprisingly coherent) text, and you will understand exactly why this approach eventually hit a wall that only neural networks could overcome.

**What you will build:**
- A bigram (2-gram) and trigram (3-gram) language model
- Probability estimation from raw counts
- Text generation using sampling
- Perplexity evaluation to measure model quality
- Visualizations of the sparsity problem that killed N-grams

Let us begin.

---

## 2. Building Intuition

Think about how you predict the next word yourself. If I say "I went to the grocery ___", you instantly think "store." Why? Because in your lifetime of reading and listening, you have encountered "grocery store" thousands of times -- far more often than "grocery elephant" or "grocery democracy."

You are doing something remarkably similar to counting: your brain has observed word patterns so frequently that the prediction feels automatic.

An N-gram model does exactly this, but explicitly. It counts every pair (or triple, or quadruple) of consecutive words in a large corpus and uses those counts to estimate probabilities.

Let us start with the simplest case: what is the probability that a given word appears at all?

In [None]:
# The foundation: word frequencies in a corpus
corpus_text = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the bird flew over the house
the cat sat on the rug
"""

# Tokenize
words = corpus_text.lower().split()
total_words = len(words)

# Count each word
from collections import Counter
word_counts = Counter(words)

print("Word frequencies:")
print("-" * 30)
for word, count in word_counts.most_common():
    prob = count / total_words
    print(f"  '{word}': count={count}, P('{word}') = {count}/{total_words} = {prob:.3f}")

print(f"\nTotal words: {total_words}")
print(f"Vocabulary size: {len(word_counts)}")

Now let us visualize these frequencies. This is our first look at the **distributional structure** of language.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Sort by frequency
words_sorted = word_counts.most_common()
labels = [w for w, c in words_sorted]
counts = [c for w, c in words_sorted]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of word frequencies
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(labels)))
axes[0].barh(labels[::-1], counts[::-1], color=colors[::-1])
axes[0].set_xlabel('Count', fontsize=12)
axes[0].set_title('Word Frequencies in Our Tiny Corpus', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Zipf's law preview: rank vs frequency
ranks = np.arange(1, len(counts) + 1)
axes[1].plot(ranks, sorted(counts, reverse=True), 'o-', color='#2196F3', markersize=8, linewidth=2)
axes[1].set_xlabel('Rank', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title("Rank vs. Frequency (Zipf's Law Preview)", fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

Notice something: "the" appears far more often than any other word. This reflects a well-documented regularity of natural language called **Zipf's Law**: a word's frequency is roughly inversely proportional to its frequency rank, so a handful of words account for a large share of the text while most words are rare. This will matter a lot when we hit the sparsity problem later.
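
To put a number on the "handful of words" idea, here is a minimal sketch that measures what share of all tokens the top-k words cover in the same toy corpus. The helper name `top_k_coverage` is an assumption of this sketch, not part of the notebook, and with only 29 tokens it illustrates the skew rather than demonstrating Zipf's Law.

```python
from collections import Counter

toy_corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the bird flew over the house
the cat sat on the rug
"""

def top_k_coverage(text, k):
    """Fraction of all tokens accounted for by the k most frequent words."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(tokens)

# How quickly does coverage grow as we add more of the frequent words?
for k in (1, 3, 5):
    print(f"Top {k} word(s) cover {top_k_coverage(toy_corpus, k):.1%} of tokens")
```

On a large real corpus the same computation typically shows a much longer tail of rare words, which is where sparsity comes from.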

---

## 3. The Mathematics

A **language model** assigns a probability to a sequence of words $w_1, w_2, \ldots, w_n$. Using the **chain rule of probability**, we decompose:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

The problem: estimating $P(w_i \mid w_1, \ldots, w_{i-1})$ by counting requires seeing the exact history $w_1, \ldots, w_{i-1}$ many times. For long histories this is hopeless, because almost every long word sequence appears once or never in any finite corpus.

The **Markov assumption** fixes this by truncating the history:

- **Bigram** (n=2): $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$
- **Trigram** (n=3): $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
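
As a rough sketch of what this truncation means in practice, the hypothetical helper below (`markov_context`, not part of the notebook's model) returns the history an n-gram model would actually condition on at position `i`, discarding everything older.

```python
def markov_context(tokens, i, n):
    """Return the (n-1)-word history used to predict tokens[i] under an n-gram model."""
    start = max(0, i - (n - 1))   # never reach before the start of the sentence
    return tuple(tokens[start:i])

tokens = "the cat sat on the mat".split()
i = 5  # position of the word we want to predict ("mat")

print("Full history:   ", tuple(tokens[:i]))
print("Bigram context: ", markov_context(tokens, i, n=2))
print("Trigram context:", markov_context(tokens, i, n=3))
```

Near the start of a sentence this helper simply returns a shorter tuple; the model class later in the notebook pads with `<s>` tokens instead so that every context has the same length.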

And we estimate these conditional probabilities by counting:

$$P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}$$

Let us verify this computationally. If our corpus has "cat sat" appearing 2 times and "cat" appearing 3 times, then $P(\text{sat} \mid \text{cat}) = 2/3$.

In [None]:
# Let's compute this by hand and verify
# Chain rule: P("the cat sat") = P("the") * P("cat"|"the") * P("sat"|"cat")

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the mat",
    "the bird flew over the house",
    "the cat sat on the rug",
]

from collections import defaultdict

# Count unigrams and bigrams
unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

for sentence in corpus:
    words = sentence.lower().split()
    for i in range(len(words)):
        unigram_counts[words[i]] += 1
        if i < len(words) - 1:
            bigram_counts[(words[i], words[i+1])] += 1

# Bigram probability
def bigram_prob(w1, w2):
    """P(w2 | w1) = Count(w1, w2) / Count(w1)"""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# Compute P("the cat sat")
p_the = unigram_counts["the"] / sum(unigram_counts.values())
p_cat_given_the = bigram_prob("the", "cat")
p_sat_given_cat = bigram_prob("cat", "sat")

print("Computing P('the cat sat') using the chain rule + bigram assumption:")
print(f"  P('the')         = {unigram_counts['the']}/{sum(unigram_counts.values())} = {p_the:.4f}")
print(f"  P('cat'|'the')   = {bigram_counts[('the','cat')]}/{unigram_counts['the']} = {p_cat_given_the:.4f}")
print(f"  P('sat'|'cat')   = {bigram_counts[('cat','sat')]}/{unigram_counts['cat']} = {p_sat_given_cat:.4f}")
print(f"\n  P('the cat sat') = {p_the:.4f} × {p_cat_given_the:.4f} × {p_sat_given_cat:.4f} = {p_the * p_cat_given_the * p_sat_given_cat:.6f}")

---

## 4. Let's Build It -- Component by Component

Now let us build a proper N-gram language model class that can handle any value of n.

In [None]:
import numpy as np
from collections import defaultdict, Counter

class NgramLanguageModel:
    """
    An N-gram language model built from scratch.

    This model estimates P(next_word | previous n-1 words)
    by counting occurrences in the training corpus.
    """

    def __init__(self, n=2, smoothing=0.0):
        """
        Args:
            n: The N in N-gram (2 = bigram, 3 = trigram, etc.)
            smoothing: Laplace smoothing parameter (0 = no smoothing)
        """
        self.n = n
        self.smoothing = smoothing
        self.ngram_counts = defaultdict(int)   # Count of full n-grams
        self.context_counts = defaultdict(int)  # Count of (n-1)-gram contexts
        self.vocabulary = set()
        self.total_tokens = 0

    def _get_ngrams(self, tokens):
        """Extract all n-grams from a list of tokens."""
        # Add start/end tokens for proper sentence boundaries
        padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]
        ngrams = []
        for i in range(len(padded) - self.n + 1):
            ngram = tuple(padded[i:i + self.n])
            ngrams.append(ngram)
        return ngrams

    def train(self, corpus):
        """
        Train the model by counting n-grams in the corpus.

        Args:
            corpus: List of sentences (strings)
        """
        for sentence in corpus:
            tokens = sentence.lower().split()
            self.vocabulary.update(tokens)
            self.total_tokens += len(tokens)

            ngrams = self._get_ngrams(tokens)
            for ngram in ngrams:
                context = ngram[:-1]  # First n-1 words
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1

        self.vocabulary.add("<s>")
        self.vocabulary.add("</s>")
        print(f"Trained {self.n}-gram model:")
        print(f"  Vocabulary size: {len(self.vocabulary)}")
        print(f"  Total tokens: {self.total_tokens}")
        print(f"  Unique {self.n}-grams: {len(self.ngram_counts)}")
        print(f"  Unique contexts: {len(self.context_counts)}")

    def probability(self, word, context):
        """
        Compute P(word | context) using counts + optional smoothing.

        Args:
            word: The word to predict
            context: Tuple of previous (n-1) words
        """
        ngram = context + (word,)

        numerator = self.ngram_counts[ngram] + self.smoothing
        denominator = self.context_counts[context] + self.smoothing * len(self.vocabulary)

        if denominator == 0:
            return 1 / len(self.vocabulary)  # Uniform fallback

        return numerator / denominator

    def get_distribution(self, context):
        """Get the full probability distribution over vocabulary given a context."""
        dist = {}
        for word in self.vocabulary:
            dist[word] = self.probability(word, context)
        return dist

# Train a bigram model
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the mat",
    "the bird flew over the house",
    "the cat sat on the rug",
]

bigram_model = NgramLanguageModel(n=2)
bigram_model.train(corpus)

Let us examine what the model learned:

In [None]:
# What does the model predict after "the"?
context = ("the",)
dist = bigram_model.get_distribution(context)

# Sort by probability
sorted_dist = sorted(dist.items(), key=lambda x: x[1], reverse=True)

print(f"P(word | 'the') -- top predictions:")
print("-" * 40)
for word, prob in sorted_dist[:10]:
    bar = "█" * int(prob * 50)
    print(f"  {word:12s}  {prob:.3f}  {bar}")

In [None]:
# Visualize the bigram transition matrix
# This is the heart of the N-gram model -- a lookup table of probabilities

# Get the most common words for visualization
common_words = [w for w, _ in Counter(
    [w for s in corpus for w in s.lower().split()]
).most_common(8)]
common_words = ["<s>"] + common_words + ["</s>"]

# Build transition matrix
matrix = np.zeros((len(common_words), len(common_words)))
for i, w1 in enumerate(common_words):
    for j, w2 in enumerate(common_words):
        matrix[i, j] = bigram_model.probability(w2, (w1,))

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(matrix, cmap='YlOrRd', aspect='auto')

ax.set_xticks(range(len(common_words)))
ax.set_yticks(range(len(common_words)))
ax.set_xticklabels(common_words, rotation=45, ha='right', fontsize=11)
ax.set_yticklabels(common_words, fontsize=11)

# Annotate cells with probabilities
for i in range(len(common_words)):
    for j in range(len(common_words)):
        if matrix[i, j] > 0.01:
            color = 'white' if matrix[i, j] > 0.3 else 'black'
            ax.text(j, i, f'{matrix[i,j]:.2f}', ha='center', va='center',
                    fontsize=9, color=color, fontweight='bold')

ax.set_xlabel('Next Word', fontsize=13, fontweight='bold')
ax.set_ylabel('Current Word', fontsize=13, fontweight='bold')
ax.set_title('Bigram Transition Probabilities\nP(next word | current word)',
             fontsize=15, fontweight='bold')
plt.colorbar(im, label='Probability')
plt.tight_layout()
plt.show()

---

## 5. Your Turn

**TODO 1: Build a Trigram Model**

The bigram model only looks at the previous word. A trigram model looks at the previous TWO words, which can give sharper predictions when there is enough data to estimate the longer contexts.

In [None]:
# TODO: Train a trigram model on the same corpus and compare
#
# Instructions:
# 1. Create a trigram model using NgramLanguageModel(n=3)
# 2. Train it on the corpus
# 3. Compare P("sat" | "cat") from bigram vs P("sat" | "the", "cat") from trigram
# 4. Do the two probabilities differ on this corpus? Why or why not?

# YOUR CODE HERE
trigram_model = NgramLanguageModel(n=3)
# trigram_model.train(???)
#
# bigram_p = bigram_model.probability("sat", ("cat",))
# trigram_p = trigram_model.probability("sat", ("the", "cat"))
#
# print(f"Bigram  P('sat' | 'cat')        = {bigram_p:.4f}")
# print(f"Trigram P('sat' | 'the', 'cat') = {trigram_p:.4f}")

**TODO 2: Implement Laplace Smoothing**

Without smoothing, unseen n-grams get probability zero. Add-$\alpha$ smoothing adds a small pseudo-count $\alpha$ to every possible n-gram; the special case $\alpha = 1$ is Laplace (add-one) smoothing.

The smoothed probability is:

$$P_{\text{smooth}}(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i) + \alpha}{\text{Count}(w_{i-1}) + \alpha \cdot |V|}$$

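where $|V|$ is the vocabulary size and $\alpha$ is the pseudo-count ($\alpha = 1$ gives classic Laplace smoothing; smaller values move less probability mass away from observed n-grams).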
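
Before trying it on the model, here is a minimal sketch of the formula on its own. The function name `add_alpha_prob` and the counts are illustrative assumptions (a context seen 3 times, a vocabulary of 15 types), not values read from the notebook's model.

```python
def add_alpha_prob(ngram_count, context_count, vocab_size, alpha=1.0):
    """Add-alpha estimate: (count + alpha) / (context_count + alpha * |V|)."""
    return (ngram_count + alpha) / (context_count + alpha * vocab_size)

# Illustrative numbers (assumed): the context occurs 3 times and |V| = 15.
context_count, vocab_size = 3, 15

seen = add_alpha_prob(2, context_count, vocab_size)    # continuation seen twice
unseen = add_alpha_prob(0, context_count, vocab_size)  # continuation never seen

print(f"seen continuation:   {seen:.4f}")
print(f"unseen continuation: {unseen:.4f}")

# Sanity check: the smoothed values still form a probability distribution.
# Assume the 3 observations split 2 + 1 across two words; the other 13 words have count 0.
counts = [2, 1] + [0] * (vocab_size - 2)
total = sum(add_alpha_prob(c, context_count, vocab_size) for c in counts)
print(f"sum over vocabulary: {total:.4f}")
```

With these numbers the seen continuation drops from an unsmoothed 2/3 to 3/18, which shows how aggressively add-one smoothing redistributes mass when the vocabulary is large relative to the counts.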

In [None]:
# TODO: Compare predictions with and without smoothing
#
# Instructions:
# 1. Create a smoothed bigram model with smoothing=1.0
# 2. Train it on the same corpus
# 3. Compare P("ran" | "cat") with and without smoothing
# 4. What happens to zero-probability events?

# YOUR CODE HERE
# smoothed_model = NgramLanguageModel(n=2, smoothing=1.0)
# smoothed_model.train(corpus)
#
# p_no_smooth = bigram_model.probability("ran", ("cat",))
# p_smooth = smoothed_model.probability("ran", ("cat",))
#
# print(f"Without smoothing: P('ran' | 'cat') = {p_no_smooth:.6f}")
# print(f"With smoothing:    P('ran' | 'cat') = {p_smooth:.6f}")

---

## 6. Putting It All Together

Now let us add text generation. The model samples the next word from its probability distribution, then uses that word as context for the next prediction.

In [None]:
class NgramGenerator(NgramLanguageModel):
    """Extends the N-gram model with text generation capabilities."""

    def generate(self, max_length=20, temperature=1.0, seed=None):
        """
        Generate text by sampling from the model.

        Args:
            max_length: Maximum number of words to generate
            temperature: Controls randomness (lower = more deterministic)
            seed: Random seed for reproducibility
        """
        if seed is not None:
            np.random.seed(seed)

        # Start with the beginning-of-sentence context
        context = tuple(["<s>"] * (self.n - 1))
        generated = []

        for _ in range(max_length):
            # Get probability distribution
            dist = self.get_distribution(context)

            # Candidate next words; "<s>" is only a padding symbol, so never emit it
            words = [w for w in dist if w != "<s>"]
            probs = np.array([dist[w] for w in words])

            # Temperature scaling
            if temperature != 1.0:
                log_probs = np.log(probs + 1e-10)
                scaled = log_probs / temperature
                probs = np.exp(scaled) / np.sum(np.exp(scaled))

            # Normalize
            probs = probs / probs.sum()

            # Sample
            idx = np.random.choice(len(words), p=probs)
            next_word = words[idx]

            if next_word == "</s>":
                break

            generated.append(next_word)

            # Update context: slide the window
            context = tuple(list(context[1:]) + [next_word])

        return " ".join(generated)

# Build and train the generator
gen_model = NgramGenerator(n=2, smoothing=0.1)
gen_model.train(corpus)

# Generate several sentences
print("Generated sentences (bigram, temperature=1.0):")
print("=" * 50)
for i in range(10):
    text = gen_model.generate(max_length=15, temperature=1.0, seed=i)
    print(f"  {i+1}. {text}")

Let us see how temperature affects generation:
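
First, the scaling step in isolation. This is a minimal sketch on a made-up three-word distribution; `apply_temperature` is a hypothetical helper that mirrors the log-probability scaling inside `generate`, minus the small epsilon that method adds to guard against zero entries. It assumes all input probabilities are strictly positive.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector as p_i ** (1 / T), then renormalize."""
    probs = np.asarray(probs, dtype=float)
    scaled = np.exp(np.log(probs) / temperature)  # same as probs ** (1 / temperature)
    return scaled / scaled.sum()

base = [0.7, 0.2, 0.1]  # hypothetical next-word distribution
for t in (0.3, 1.0, 2.0):
    print(f"T={t}: {np.round(apply_temperature(base, t), 3)}")
```

Temperatures below 1 push mass toward the most likely word, and temperatures above 1 flatten the distribution toward uniform. The cell below shows the same effect on whole generated sentences.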

In [None]:
# Temperature comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
temperatures = [0.3, 1.0, 2.0]
labels = ["Low (0.3)\nMore deterministic", "Normal (1.0)\nBalanced", "High (2.0)\nMore random"]

for ax, temp, label in zip(axes, temperatures, labels):
    sentences = []
    for seed in range(8):
        text = gen_model.generate(max_length=10, temperature=temp, seed=seed*10)
        sentences.append(text)

    ax.set_xlim(0, 1)
    ax.set_ylim(-0.5, len(sentences) - 0.5)
    for i, s in enumerate(sentences):
        ax.text(0.05, len(sentences) - 1 - i, s, fontsize=9,
                fontfamily='monospace', verticalalignment='center')
    ax.set_title(f'Temperature = {temp}\n{label}', fontsize=12, fontweight='bold')
    ax.axis('off')

plt.suptitle('Effect of Temperature on Text Generation', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Now let us train on a somewhat larger (but still toy) corpus to see the model in action:

In [None]:
# A larger toy corpus of simple, hand-written sentences
large_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat ate the fish",
    "the dog ate the bone",
    "the bird flew over the house",
    "the bird sat on the tree",
    "the cat ran after the dog",
    "the dog ran after the cat",
    "the fish swam in the pond",
    "the bird flew over the pond",
    "i like the cat",
    "i like the dog",
    "i saw the bird fly",
    "the cat is on the mat",
    "the dog is on the rug",
    "the rug is on the floor",
    "i want to eat the fish",
    "the cat wants to eat the fish",
    "the dog wants to play with the cat",
    "i saw the cat on the mat",
    "the bird is in the tree",
    "the fish is in the water",
    "i like to see the cat sit on the mat",
    "the cat and the dog are friends",
    "i have a cat and a dog",
]

# Train bigram and trigram generators
bigram_gen = NgramGenerator(n=2, smoothing=0.1)
bigram_gen.train(large_corpus)

trigram_gen = NgramGenerator(n=3, smoothing=0.1)
trigram_gen.train(large_corpus)

print("\n--- Bigram Generated Sentences ---")
for i in range(5):
    print(f"  {bigram_gen.generate(max_length=12, seed=i+42)}")

print("\n--- Trigram Generated Sentences ---")
for i in range(5):
    print(f"  {trigram_gen.generate(max_length=12, seed=i+42)}")

---

## 7. Training and Results

Let us measure our model's quality using **perplexity** -- the standard evaluation metric for language models. Perplexity measures how "surprised" the model is by test data.

$$\text{Perplexity} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid \text{context})\right)$$

Lower perplexity = better model. A perplexity of $k$ means the model is as uncertain as if it were choosing uniformly among $k$ words at each step.
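
The "choosing uniformly among $k$ words" reading can be checked directly. The sketch below uses a hypothetical helper, `perplexity_from_probs`, which takes the probabilities a model assigned to each actual next token; the probability lists are made up for illustration.

```python
import math

def perplexity_from_probs(token_probs):
    """exp of the negative mean log-probability over the given tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

k = 8
uniform = [1 / k] * 20     # a model that always spreads its mass evenly over k words
print(f"Uniform over {k} words: {perplexity_from_probs(uniform):.3f}")

confident = [0.9] * 20     # a model that gives each correct word probability 0.9
print(f"Confident model:      {perplexity_from_probs(confident):.3f}")
```

The uniform model's perplexity comes out at $k$ (up to floating-point error), and the confident model lands just above 1, the lowest value perplexity can take.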

In [None]:
def compute_perplexity(model, test_sentences):
    """
    Compute perplexity of the model on test sentences.

    Perplexity = exp(-1/N * sum(log P(w_i | context)))
    """
    total_log_prob = 0.0
    total_tokens = 0

    for sentence in test_sentences:
        tokens = sentence.lower().split()
        padded = ["<s>"] * (model.n - 1) + tokens + ["</s>"]

        for i in range(model.n - 1, len(padded)):
            context = tuple(padded[i - model.n + 1:i])
            word = padded[i]

            prob = model.probability(word, context)
            if prob > 0:
                total_log_prob += np.log(prob)
            else:
                total_log_prob += np.log(1e-10)  # Floor to stay finite; the true perplexity would be infinite

            total_tokens += 1

    avg_log_prob = total_log_prob / total_tokens
    perplexity = np.exp(-avg_log_prob)
    return perplexity

# Test on held-out sentences
test_sentences = [
    "the cat sat on the rug",
    "the dog ate the fish",
    "the bird flew over the tree",
]

# Compare models
models = {
    "Bigram (no smooth)": NgramLanguageModel(n=2, smoothing=0.0),
    "Bigram (smooth=0.1)": NgramLanguageModel(n=2, smoothing=0.1),
    "Bigram (smooth=1.0)": NgramLanguageModel(n=2, smoothing=1.0),
    "Trigram (smooth=0.1)": NgramLanguageModel(n=3, smoothing=0.1),
}

results = {}
for name, model in models.items():
    model.train(large_corpus)
    ppl = compute_perplexity(model, test_sentences)
    results[name] = ppl
    print(f"  {name:25s}  Perplexity = {ppl:.2f}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
names = list(results.keys())
ppls = list(results.values())
colors = ['#E53935', '#FB8C00', '#43A047', '#1E88E5']
bars = ax.bar(names, ppls, color=colors, edgecolor='white', linewidth=2)

for bar, ppl in zip(bars, ppls):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
            f'{ppl:.1f}', ha='center', fontsize=12, fontweight='bold')

ax.set_ylabel('Perplexity (lower is better)', fontsize=12)
ax.set_title('Model Comparison: Perplexity on Test Set', fontsize=14, fontweight='bold')
ax.set_ylim(0, max(ppls) * 1.2)
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

---

## 8. Final Output

Now let us demonstrate the fundamental limitation of N-grams: the **sparsity problem.**

In [None]:
# The sparsity problem: visualizing how much of the count table is empty

# Even with the small vocabulary of large_corpus, how many possible bigrams are there?
vocab = list(bigram_gen.vocabulary)
vocab_size = len(vocab)
possible_bigrams = vocab_size ** 2
observed_bigrams = len(bigram_gen.ngram_counts)

print(f"Vocabulary size: {vocab_size}")
print(f"Possible bigrams: {vocab_size}^2 = {possible_bigrams}")
print(f"Observed bigrams: {observed_bigrams}")
print(f"Coverage: {observed_bigrams/possible_bigrams*100:.1f}%")
print(f"Zero entries: {possible_bigrams - observed_bigrams} ({(1 - observed_bigrams/possible_bigrams)*100:.1f}%)")

# Scale this up to realistic vocabulary sizes
vocab_sizes = [100, 1000, 10000, 50000, 100000]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: possible vs observed (log scale)
for v in vocab_sizes:
    possible = v ** 2
    # Estimate observed bigrams. Illustrative assumption (not measured here):
    # about 10*V distinct bigrams are seen, i.e. ~10 distinct followers per word
    estimated_observed = min(10 * v, possible)
    axes[0].scatter(v, possible, color='#E53935', s=80, zorder=5)
    axes[0].scatter(v, estimated_observed, color='#1E88E5', s=80, zorder=5)

axes[0].plot(vocab_sizes, [v**2 for v in vocab_sizes], 'r--', label='Possible bigrams (V²)', linewidth=2)
axes[0].plot(vocab_sizes, [min(10*v, v**2) for v in vocab_sizes], 'b--', label='Observed (≈10V)', linewidth=2)
axes[0].set_xscale('log')
axes[0].set_yscale('log')
axes[0].set_xlabel('Vocabulary Size', fontsize=12)
axes[0].set_ylabel('Number of Bigrams', fontsize=12)
axes[0].set_title('The Sparsity Explosion', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(alpha=0.3)

# Right: coverage percentage
coverage = [min(10*v, v**2) / (v**2) * 100 for v in vocab_sizes]
axes[1].bar(range(len(vocab_sizes)), coverage,
            color=['#4CAF50', '#FF9800', '#F44336', '#F44336', '#F44336'],
            edgecolor='white', linewidth=2)
axes[1].set_xticks(range(len(vocab_sizes)))
axes[1].set_xticklabels([f'V={v:,}' for v in vocab_sizes], rotation=15)
axes[1].set_ylabel('Coverage (%)', fontsize=12)
axes[1].set_title('% of Bigrams Observed in Training', fontsize=14, fontweight='bold')
axes[1].set_ylim(0, 105)

for i, c in enumerate(coverage):
    axes[1].text(i, c + 2, f'{c:.1f}%', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nThe verdict: with a realistic vocabulary of 50,000 words,")
print("a bigram model has 2.5 BILLION possible entries. Under the ~10V assumption")
print("above, only about 0.02% of them are ever observed.")
print("Most word pairs get probability ZERO, not because they are impossible,")
print("but because we simply never saw them. This is the sparsity problem.")

In [None]:
# The similarity problem: N-grams see no relationship between similar words

print("The Similarity Blind Spot")
print("=" * 50)
print()
print("To an N-gram model, 'cat' and 'dog' are COMPLETELY unrelated.")
print("Learning 'the cat sat' tells it NOTHING about 'the dog sat'.")
print()

# Demonstrate: ("cat", "sit") occurs in large_corpus, but ("dog", "sit") never does
pairs = [("cat", "sat"), ("dog", "sat"), ("cat", "sit"), ("dog", "sit")]
print("Bigram probabilities:")
for w1, w2 in pairs:
    p = bigram_gen.probability(w2, (w1,))
    status = "✓ seen" if bigram_gen.ngram_counts[(w1, w2)] > 0 else "✗ unseen"
    print(f"  P('{w2}' | '{w1}') = {p:.4f}  [{status}]")

print()
print("A human knows that cats and dogs can both 'sit' and 'run.'")
print("But to the N-gram model, each word is just an arbitrary symbol.")
print()
print("→ This is the fundamental limitation that neural language models solve.")
print("  By learning EMBEDDINGS — dense vectors where similar words are nearby —")
print("  knowledge about 'cat' automatically transfers to 'dog'.")

---

## 9. Reflection and Next Steps

**What we learned:**

1. **Language models assign probabilities to word sequences.** The chain rule lets us decompose this into a product of conditional probabilities.

2. **N-gram models estimate these probabilities by counting.** Simple, interpretable, and fast — but brittle.

3. **The Markov assumption trades accuracy for tractability.** Bigrams look at 1 word of history, trigrams look at 2, etc.

4. **Smoothing helps but does not solve the fundamental problem.** Adding small counts to unseen n-grams prevents zero probabilities, but does not capture word similarity.

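5. **The sparsity problem grows rapidly with vocabulary size and n-gram order.** The number of possible n-grams is $|V|^n$: quadratic in vocabulary size for bigrams, and exponential in $n$. With 50K words, the vast majority of bigram entries are zero.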

6. **N-grams have no notion of semantic similarity.** "Cat" and "dog" are as different as "cat" and "democracy."
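
For a sense of scale on the sparsity point: the number of possible n-grams is $|V|^n$. A small sketch, with vocabulary sizes chosen arbitrarily for illustration and a hypothetical helper named `possible_ngrams`:

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams that can be formed from a vocabulary."""
    return vocab_size ** n

for v in (1_000, 50_000):
    for n in (2, 3, 4):
        print(f"|V|={v:>6,}  n={n}  possible n-grams = {possible_ngrams(v, n):.2e}")
```

Each extra word of context multiplies the table size by $|V|$, while the amount of training text stays fixed, so longer contexts are observed less and less often.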

**What comes next:**

In the next notebook, we will see how **neural language models** solve both of these problems at once. By representing words as dense vectors (embeddings), a neural network learns that similar words should have similar predictions — and knowledge transfers automatically.

This is the leap from **counting** to **learning.**

In [None]:
print("=" * 60)
print("  NOTEBOOK COMPLETE: N-gram Language Models")
print("  You built a bigram/trigram model from scratch,")
print("  generated text, measured perplexity, and saw")
print("  why counting alone cannot capture language.")
print()
print("  Next: Neural Language Models & Word Embeddings")
print("=" * 60)