# Week 1 Lab: N-grams and Statistical Language Models

## Learning Objectives
- Build n-gram language models from scratch
- Calculate perplexity to evaluate model quality
- Generate text using different n-gram models
- Understand smoothing techniques
- Compare unigram, bigram, and trigram models

---

## Part 1: Setup and Data Loading

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import random
import re
import math
import pandas as pd
from typing import List, Dict, Tuple

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

# Configure visualization
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Setup complete!")

In [None]:
# Sample text corpus (Alice in Wonderland excerpt)
sample_text = """Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do. Once or twice she had peeped into the book her sister was reading,
but it had no pictures or conversations in it. And what is the use of a book, thought Alice,
without pictures or conversations?

So she was considering in her own mind, as well as she could, for the hot day made her feel
very sleepy and stupid, whether the pleasure of making a daisy chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that. Alice did not even think it so very much out of
the way to hear the Rabbit say to itself, Oh dear! Oh dear! I shall be late! But when the
Rabbit actually took a watch out of its waistcoat pocket, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had never before seen a
rabbit with either a waistcoat pocket, or a watch to take out of it."""

print(f"Corpus length: {len(sample_text)} characters")
print(f"First 200 characters:\n{sample_text[:200]}...")

## Part 2: Text Preprocessing

In [None]:
def preprocess_text(text: str, lowercase: bool = True) -> List[str]:
    """Preprocess text into list of tokens"""
    # Convert to lowercase
    if lowercase:
        text = text.lower()
    
    # Replace newlines with spaces
    text = text.replace('\n', ' ')
    
    # Add spaces around punctuation
    text = re.sub(r'([.!?,;])', r' \1 ', text)
    
    # Split into tokens
    tokens = text.split()
    
    # Remove empty tokens
    tokens = [token for token in tokens if token]
    
    return tokens

# Preprocess the corpus
tokens = preprocess_text(sample_text)
print(f"Total tokens: {len(tokens)}")
print(f"Unique tokens: {len(set(tokens))}")
print(f"\nFirst 20 tokens:\n{tokens[:20]}")

In [None]:
# Analyze token frequencies
token_freq = Counter(tokens)
print("Most common tokens:")
for token, count in token_freq.most_common(15):
    print(f"  '{token}': {count}")

# Plot token frequency distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot top 20 most common tokens
top_tokens = token_freq.most_common(20)
ax1.bar(range(len(top_tokens)), [count for _, count in top_tokens])
ax1.set_xticks(range(len(top_tokens)))
ax1.set_xticklabels([token for token, _ in top_tokens], rotation=45, ha='right')
ax1.set_xlabel('Token')
ax1.set_ylabel('Frequency')
ax1.set_title('Top 20 Most Frequent Tokens')

# Plot Zipf's law
frequencies = sorted(token_freq.values(), reverse=True)
ax2.loglog(range(1, len(frequencies)+1), frequencies, 'b-', alpha=0.6)
ax2.set_xlabel('Token Rank (log scale)')
ax2.set_ylabel('Frequency (log scale)')
ax2.set_title("Zipf's Law Distribution")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 3: Building N-gram Models
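
Before the full model class, it may help to see what an n-gram list looks like for a tiny input. The sketch below is illustrative only: `extract_ngrams` is a hypothetical helper (not part of the lab code) that mirrors the padding used in `NGramModel.train`, with `n - 1` start markers and one end marker.

```python
from typing import List, Tuple

def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all n-grams of a token list, padded the same way as NGramModel.train."""
    # n-1 start markers give the first real token a full-length context
    padded = ['<START>'] * (n - 1) + tokens + ['<END>']
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

toy = "the cat sat".split()
for n in (1, 2, 3):
    print(n, extract_ngrams(toy, n))
```

Each n-gram splits into a context (all but the last token) and a predicted token (the last one), which is exactly what the model counts.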

In [None]:
class NGramModel:
    def __init__(self, n: int, smoothing: str = 'none', alpha: float = 1.0):
        """
        N-gram language model
        
        Args:
            n: Order of n-gram (1=unigram, 2=bigram, 3=trigram, etc.)
            smoothing: Smoothing technique ('none', 'laplace', 'add-k')
            alpha: Additive constant used by both 'laplace' and 'add-k' (1.0 gives add-one)
        """
        self.n = n
        self.smoothing = smoothing
        self.alpha = alpha
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.vocabulary = set()
        self.total_tokens = 0
        
    def train(self, tokens: List[str]):
        """Train the n-gram model on tokenized text"""
        # Add special tokens for sentence boundaries
        tokens = ['<START>'] * (self.n - 1) + tokens + ['<END>']
        
        # Build vocabulary
        self.vocabulary = set(tokens)
        self.vocab_size = len(self.vocabulary)
        self.total_tokens = len(tokens)
        
        # Count n-grams
        for i in range(len(tokens) - self.n + 1):
            ngram = tuple(tokens[i:i+self.n])
            context = ngram[:-1]
            
            self.ngram_counts[ngram] += 1
            self.context_counts[context] += 1
        
        print(f"Trained {self.n}-gram model:")
        print(f"  Vocabulary size: {self.vocab_size}")
        print(f"  Unique {self.n}-grams: {len(self.ngram_counts)}")
        
    def probability(self, ngram: Tuple[str, ...]) -> float:
        """Calculate probability of an n-gram"""
        context = ngram[:-1]
        
        if self.n == 1:
            # Unigram model
            count = self.ngram_counts[ngram]
            if self.smoothing == 'laplace' or self.smoothing == 'add-k':
                return (count + self.alpha) / (self.total_tokens + self.alpha * self.vocab_size)
            else:
                return count / self.total_tokens if self.total_tokens > 0 else 0
        else:
            # N-gram model (n > 1)
            ngram_count = self.ngram_counts[ngram]
            context_count = self.context_counts[context]
            
            if self.smoothing == 'laplace' or self.smoothing == 'add-k':
                return (ngram_count + self.alpha) / (context_count + self.alpha * self.vocab_size)
            else:
                return ngram_count / context_count if context_count > 0 else 0
    
    def generate(self, num_tokens: int = 50, temperature: float = 1.0) -> str:
        """Generate text using the n-gram model"""
        # Start with initial context
        context = ['<START>'] * (self.n - 1)
        generated = []
        
        for _ in range(num_tokens):
            # Get possible next tokens
            candidates = defaultdict(float)
            
            for ngram in self.ngram_counts:
                if ngram[:-1] == tuple(context):
                    next_token = ngram[-1]
                    prob = self.probability(ngram)
                    candidates[next_token] = prob ** (1/temperature) if prob > 0 else 0
            
            if not candidates or all(p == 0 for p in candidates.values()):
                # Fallback to a uniformly random token (sorted so the seed gives repeatable output)
                next_token = random.choice(sorted(self.vocabulary - {'<START>', '<END>'}))
            else:
                # Sample from distribution
                tokens = list(candidates.keys())
                probs = np.array(list(candidates.values()))
                probs = probs / probs.sum()
                next_token = np.random.choice(tokens, p=probs)
            
            if next_token == '<END>':
                break
                
            generated.append(next_token)
            
            # Update context: keep only the last n-1 tokens (unigrams use an empty context)
            context = (context + [next_token])[-(self.n - 1):] if self.n > 1 else []
        
        return ' '.join(generated)

# Train different n-gram models
unigram = NGramModel(n=1)
unigram.train(tokens)

bigram = NGramModel(n=2)
bigram.train(tokens)

trigram = NGramModel(n=3)
trigram.train(tokens)

# Train with Laplace smoothing
bigram_smooth = NGramModel(n=2, smoothing='laplace')
bigram_smooth.train(tokens)
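
The unsmoothed branch of `probability` is a relative-frequency (maximum likelihood) estimate: the count of the n-gram divided by the count of its context. As a rough, standalone check of that idea, the sketch below recomputes bigram estimates on a toy sentence. `bigram_mle` is a hypothetical helper, and unlike the class above it adds no boundary markers.

```python
from collections import Counter, defaultdict

def bigram_mle(tokens):
    """Map each context word to its maximum-likelihood next-word distribution."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # the last token never acts as a context
    dist = defaultdict(dict)
    for (prev, word), count in pair_counts.items():
        dist[prev][word] = count / context_counts[prev]
    return dist

toy = "the cat sat on the mat the dog".split()
dist = bigram_mle(toy)
print(dist['the'])                 # three continuations, each seen once out of three
print(sum(dist['the'].values()))   # sums to 1 up to floating-point rounding
```

Any word that never followed `the` in training is simply missing from the distribution, which is the zero-probability problem that smoothing addresses.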

## Part 4: Text Generation

In [None]:
# Generate text with different models
print("=== Text Generation Examples ===")

print("\n1. UNIGRAM Model (random words):")
print(unigram.generate(30, temperature=1.0))

print("\n2. BIGRAM Model (pairs of words):")
print(bigram.generate(30, temperature=1.0))

print("\n3. TRIGRAM Model (triplets of words):")
print(trigram.generate(30, temperature=1.0))

print("\n4. BIGRAM with Smoothing:")
print(bigram_smooth.generate(30, temperature=1.0))

print("\n5. TRIGRAM with Low Temperature (more deterministic):")
print(trigram.generate(30, temperature=0.5))

print("\n6. TRIGRAM with High Temperature (more random):")
print(trigram.generate(30, temperature=2.0))
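
The `temperature` argument works by raising each candidate probability to the power `1 / temperature` and renormalizing, as `generate` does. A minimal sketch of that rescaling on a made-up three-way distribution (`apply_temperature` and the numbers in `base` are illustrative assumptions, not lab code):

```python
def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / T), then renormalize so they sum to 1."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

base = [0.7, 0.2, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(base, t)])
```

Temperatures below 1 sharpen the distribution toward the most likely token, and temperatures above 1 flatten it toward uniform.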

## Part 5: Perplexity Calculation

In [None]:
def calculate_perplexity(model: NGramModel, test_tokens: List[str]) -> float:
    """
    Calculate perplexity of a model on test data
    Lower perplexity = better model
    """
    # Add boundary tokens
    test_tokens = ['<START>'] * (model.n - 1) + test_tokens + ['<END>']
    
    log_prob_sum = 0
    num_tokens = 0
    
    for i in range(len(test_tokens) - model.n + 1):
        ngram = tuple(test_tokens[i:i+model.n])
        prob = model.probability(ngram)
        
        if prob > 0:
            log_prob_sum += math.log2(prob)
        else:
            # Zero probability would make perplexity infinite; floor it at 1e-10
            # so the result stays finite (but very large)
            log_prob_sum += math.log2(1e-10)
        
        num_tokens += 1
    
    # Calculate perplexity
    avg_log_prob = log_prob_sum / num_tokens
    perplexity = 2 ** (-avg_log_prob)
    
    return perplexity

# Split data for evaluation
split_point = int(len(tokens) * 0.8)
train_tokens = tokens[:split_point]
test_tokens = tokens[split_point:]

print(f"Train set: {len(train_tokens)} tokens")
print(f"Test set: {len(test_tokens)} tokens\n")

# Retrain models on train set
models = [
    ('Unigram', NGramModel(n=1)),
    ('Bigram', NGramModel(n=2)),
    ('Trigram', NGramModel(n=3)),
    ('Bigram + Laplace', NGramModel(n=2, smoothing='laplace')),
    ('Trigram + Laplace', NGramModel(n=3, smoothing='laplace'))
]

perplexities = []
for name, model in models:
    model.train(train_tokens)
    perp = calculate_perplexity(model, test_tokens)
    perplexities.append((name, perp))
    print(f"{name:20} Perplexity: {perp:.2f}")

In [None]:
# Visualize perplexity comparison
fig, ax = plt.subplots(figsize=(10, 6))

names = [name for name, _ in perplexities]
values = [perp for _, perp in perplexities]

bars = ax.bar(names, values, color=['#FF6B6B', '#4ECDC4', '#95E77E', '#FFA07A', '#98D8C8'])
ax.set_ylabel('Perplexity (lower is better)')
ax.set_title('Model Comparison: Perplexity on Test Data')
ax.set_ylim(0, max(values) * 1.1)

# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{value:.1f}', ha='center', va='bottom')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
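
To build intuition for the numbers above, note that `calculate_perplexity` computes 2 raised to the negative average log2 probability per token. The standalone sketch below applies the same formula to hand-picked probabilities. `perplexity_from_probs` is a hypothetical helper, and the probability lists are made up for illustration.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity of a sequence, given the probability the model assigned to each token."""
    avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-avg_log2)

# A model that always spreads its belief evenly over 4 options
print(perplexity_from_probs([0.25] * 8))

# Two confident predictions and one floored zero (1e-10, as in calculate_perplexity)
print(perplexity_from_probs([0.5, 0.5, 1e-10]))
```

The first case gives a perplexity of 4, matching the reading of perplexity as an effective branching factor. The second shows how a single floored zero can dominate the score, which is why unsmoothed models fare so badly on unseen n-grams.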

## Part 6: Analyzing Model Behavior

In [None]:
def get_next_word_distribution(model: NGramModel, context: List[str], top_k: int = 10):
    """Get probability distribution for next word given context"""
    # Adjust context length for model (a unigram model uses no context)
    if model.n == 1:
        context = []
    elif len(context) >= model.n - 1:
        context = context[-(model.n-1):]
    else:
        context = ['<START>'] * (model.n - 1 - len(context)) + context
    
    # Get probabilities for all possible next words
    next_word_probs = {}
    
    for ngram in model.ngram_counts:
        if ngram[:-1] == tuple(context):
            next_word = ngram[-1]
            prob = model.probability(ngram)
            next_word_probs[next_word] = prob
    
    # Sort by probability
    sorted_probs = sorted(next_word_probs.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_probs[:top_k]

# Test with different contexts
test_contexts = [
    ['alice', 'was'],
    ['the', 'rabbit'],
    ['she', 'had'],
    ['white', 'rabbit']
]

print("=== Next Word Predictions ===")

for context in test_contexts:
    print(f"\nContext: {' '.join(context)}")
    
    for model_name, model in [('Bigram', bigram), ('Trigram', trigram)]:
        predictions = get_next_word_distribution(model, context, top_k=5)
        
        if predictions:
            print(f"  {model_name} predictions:")
            for word, prob in predictions:
                print(f"    '{word}': {prob:.3f}")
        else:
            print(f"  {model_name}: No predictions (unseen context)")

In [None]:
# Analyze most common n-grams
def show_top_ngrams(model: NGramModel, k: int = 10):
    """Display most frequent n-grams"""
    top_ngrams = sorted(model.ngram_counts.items(), key=lambda x: x[1], reverse=True)[:k]
    
    print(f"\nTop {k} {model.n}-grams:")
    for ngram, count in top_ngrams:
        ngram_str = ' '.join(ngram)
        print(f"  '{ngram_str}': {count}")
    
    return top_ngrams

# Show top n-grams for each model
for name, model in [('Unigram', unigram), ('Bigram', bigram), ('Trigram', trigram)]:
    top = show_top_ngrams(model, k=8)
    
# Visualize bigram frequencies
top_bigrams = show_top_ngrams(bigram, k=15)

fig, ax = plt.subplots(figsize=(12, 5))
bigram_labels = [' '.join(bg[0]) for bg in top_bigrams]
bigram_counts = [bg[1] for bg in top_bigrams]

ax.barh(range(len(bigram_labels)), bigram_counts)
ax.set_yticks(range(len(bigram_labels)))
ax.set_yticklabels(bigram_labels)
ax.set_xlabel('Frequency')
ax.set_title('Most Common Bigrams')
ax.invert_yaxis()

plt.tight_layout()
plt.show()

## Part 7: Advanced - Interpolation and Backoff

In [None]:
class InterpolatedNGramModel:
    """N-gram model with interpolation between different orders"""
    
    def __init__(self, max_n: int = 3, lambdas: List[float] = None):
        """
        Args:
            max_n: Maximum n-gram order
            lambdas: Interpolation weights (must sum to 1)
        """
        self.max_n = max_n
        self.models = [NGramModel(n=i, smoothing='laplace') for i in range(1, max_n+1)]
        
        if lambdas is None:
            # Equal weights by default
            self.lambdas = [1/max_n] * max_n
        else:
            assert len(lambdas) == max_n and abs(sum(lambdas) - 1.0) < 1e-6
            self.lambdas = lambdas
    
    def train(self, tokens: List[str]):
        """Train all component models"""
        for model in self.models:
            model.train(tokens)
        print(f"Trained interpolated model with orders 1-{self.max_n}")
        print(f"Interpolation weights: {self.lambdas}")
    
    def probability(self, word: str, context: List[str]) -> float:
        """Calculate interpolated probability"""
        prob = 0
        
        for i, model in enumerate(self.models):
            n = i + 1
            if n == 1:
                # Unigram
                ngram = (word,)
            else:
                # Use appropriate context length
                context_len = min(n-1, len(context))
                if context_len < n-1:
                    # Pad with START tokens
                    padded_context = ['<START>'] * (n-1-context_len) + context[-context_len:]
                else:
                    padded_context = context[-(n-1):]
                ngram = tuple(padded_context) + (word,)
            
            prob += self.lambdas[i] * model.probability(ngram)
        
        return prob

# Train interpolated model
interpolated = InterpolatedNGramModel(max_n=3, lambdas=[0.1, 0.3, 0.6])
interpolated.train(train_tokens)

# Compare with individual models
print("\n=== Probability Comparison ===")
test_cases = [
    (['alice'], 'was'),
    (['the', 'white'], 'rabbit'),
    (['she', 'had'], 'never')
]

for context, word in test_cases:
    print(f"\nP({word} | {' '.join(context)}):")
    
    # Individual models
    for i, model in enumerate(interpolated.models):
        n = i + 1
        if n == 1:
            ngram = (word,)
        else:
            context_for_model = context[-(n-1):] if len(context) >= n-1 else ['<START>'] * (n-1-len(context)) + context
            ngram = tuple(context_for_model) + (word,)
        
        prob = model.probability(ngram)
        print(f"  {n}-gram: {prob:.4f}")
    
    # Interpolated
    interp_prob = interpolated.probability(word, context)
    print(f"  Interpolated: {interp_prob:.4f}")
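
The code above covers interpolation only. For the backoff half of this part, here is a rough standalone sketch in the style usually called "stupid backoff": use the longest context that has been seen, and multiply by a fixed discount each time the context is shortened. The helper names are hypothetical. The discount of 0.4 is the value commonly quoted for this scheme, and the results are scores rather than normalized probabilities.

```python
from collections import Counter
from typing import List, Tuple

def count_ngrams(tokens: List[str], max_n: int) -> Counter:
    """Count every n-gram of order 1..max_n in one Counter keyed by tuple."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(word: str, context: Tuple[str, ...], counts: Counter,
                  total_tokens: int, discount: float = 0.4) -> float:
    """Score `word` after `context`, shortening the context until a match is found."""
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]   # drop the oldest context word
        penalty *= discount     # pay a fixed penalty for each backoff step
    # No context left: fall back to the unigram relative frequency
    return penalty * counts[(word,)] / total_tokens

toy = "the cat sat on the mat the dog sat".split()
counts = count_ngrams(toy, max_n=3)
print(backoff_score('sat', ('the', 'cat'), counts, len(toy)))  # trigram was seen
print(backoff_score('sat', ('on', 'dog'), counts, len(toy)))   # backs off to the bigram
print(backoff_score('mat', ('cat', 'sat'), counts, len(toy)))  # backs off to the unigram
```

Compared with interpolation, backoff consults lower-order counts only when the higher-order n-gram is missing, whereas interpolation always mixes all orders.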

## Part 8: Practical Exercise - Build Your Own Corpus

In [None]:
# Exercise: Create your own specialized language model
print("=== Exercise: Domain-Specific Language Model ===")
print("\nTry creating a language model for a specific domain!")
print("Ideas:")
print("  1. Scientific abstracts")
print("  2. Recipe instructions")
print("  3. News headlines")
print("  4. Poetry")
print("  5. Technical documentation")

# Example: Simple recipe corpus
recipe_corpus = """Preheat oven to 350 degrees. Mix flour and sugar in bowl.
Add eggs and milk. Stir until smooth. Pour batter into pan.
Bake for 30 minutes. Let cool before serving.
Preheat oven to 400 degrees. Season chicken with salt and pepper.
Place in baking dish. Add vegetables around chicken.
Bake for 45 minutes until golden. Serve hot with rice."""

# Train a model on recipe text
recipe_tokens = preprocess_text(recipe_corpus)
recipe_model = NGramModel(n=3, smoothing='laplace')
recipe_model.train(recipe_tokens)

print("\n=== Recipe Language Model ===")
print("Generated recipe instructions:")
for i in range(3):
    print(f"\n{i+1}. {recipe_model.generate(20, temperature=0.8)}")

# Show common patterns
print("\nCommon recipe phrases:")
show_top_ngrams(recipe_model, k=5)

## Part 9: Smoothing Experiments

Deep dive into smoothing techniques - implement and compare different methods.

### 9.1 Before/After Smoothing Visualization

See exactly how smoothing shifts probability mass.

In [None]:
# Before/After Smoothing Comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Sample words following "the" 
test_context = ['the']
sample_words = ['cat', 'dog', 'mat', 'rabbit', 'alice', 'book', 'UNSEEN1', 'UNSEEN2', 'UNSEEN3']

# Get probabilities without smoothing
probs_before = []
for word in sample_words:
    ngram = ('the', word)
    prob = bigram.probability(ngram)  # No smoothing model
    probs_before.append(prob)

# Get probabilities with smoothing
probs_after = []
for word in sample_words:
    ngram = ('the', word)
    prob = bigram_smooth.probability(ngram)  # Laplace smoothing
    probs_after.append(prob)

# Before smoothing
colors_before = ['#95E77E' if p > 0 else '#FF6B6B' for p in probs_before]
ax1.bar(range(len(sample_words)), probs_before, color=colors_before, alpha=0.7,
        edgecolor='#404040', linewidth=1.5)
ax1.set_ylabel('Probability', fontsize=11, fontweight='bold')
ax1.set_title('Before Smoothing (MLE)', fontsize=12, fontweight='bold')
ax1.set_xticks(range(len(sample_words)))
ax1.set_xticklabels(sample_words, rotation=45, ha='right')
ax1.grid(True, alpha=0.2, axis='y')

# Mark zeros
for i, prob in enumerate(probs_before):
    if prob == 0:
        ax1.text(i, 0.0005, 'ZERO', ha='center', fontsize=8, color='red', fontweight='bold')

# After smoothing  
colors_after = ['#4ECDC4' if probs_before[i] > 0 else '#95E77E' for i in range(len(sample_words))]
ax2.bar(range(len(sample_words)), probs_after, color=colors_after, alpha=0.7,
        edgecolor='#404040', linewidth=1.5)
ax2.set_ylabel('Probability', fontsize=11, fontweight='bold')
ax2.set_title('After Smoothing (Laplace)', fontsize=12, fontweight='bold')
ax2.set_xticks(range(len(sample_words)))
ax2.set_xticklabels(sample_words, rotation=45, ha='right')
ax2.grid(True, alpha=0.2, axis='y')

# Mark rescued probabilities
for i, (prob_before, prob_after) in enumerate(zip(probs_before, probs_after)):
    if prob_before == 0 and prob_after > 0:
        ax2.text(i, prob_after, f'{prob_after:.4f}', ha='center', va='bottom',
                fontsize=7, color='#95E77E', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nObservations:")
print("  - Seen bigrams: probability is slightly reduced (mass is given up)")
print("  - Unseen bigrams: receive a small non-zero probability (mass is received)")
print("  - For each context, probabilities over the training vocabulary still sum to 1.0")

### 9.2 Comparing k Values

Experiment with different smoothing parameters to see the effect.

In [None]:
# Experiment with different k values
k_values = [0.001, 0.01, 0.1, 0.5, 1.0]

print("=== Effect of k Parameter on Perplexity ===\n")
print(f"{'k Value':<10} {'Perplexity':<15} {'Effect':<30}")
print("-" * 55)

k_results = []
for k in k_values:
    model_k = NGramModel(n=2, smoothing='add-k', alpha=k)
    model_k.train(train_tokens)
    perp_k = calculate_perplexity(model_k, test_tokens)
    k_results.append((k, perp_k))
    
    if k == 0.01:
        effect = "Often optimal for large corpora"
    elif k == 1.0:
        effect = "Laplace (too aggressive)"
    elif k < 0.01:
        effect = "Very light smoothing"
    else:
        effect = "Moderate smoothing"
    
    print(f"{k:<10.3f} {perp_k:<15.2f} {effect:<30}")

# Visualize effect of k
fig, ax = plt.subplots(figsize=(10, 6))

ks = [k for k, _ in k_results]
perps_k = [perp for _, perp in k_results]

ax.plot(ks, perps_k, marker='o', markersize=10, linewidth=2.5, color='#3333B2')
ax.axvline(x=0.01, color='#95E77E', linestyle='--', linewidth=2, alpha=0.7, label='Often optimal (k=0.01)')

ax.set_xlabel('Smoothing Parameter k', fontsize=12, fontweight='bold')
ax.set_ylabel('Perplexity', fontsize=12, fontweight='bold')
ax.set_title('Effect of Smoothing Parameter on Model Performance', fontsize=14, fontweight='bold')
ax.set_xscale('log')
ax.grid(True, alpha=0.2)
ax.legend(fontsize=10)

plt.tight_layout()
plt.show()

print("\nKey insight: Smaller k usually better for large corpora!")

### 9.3 Probability Mass Redistribution

Visualize how smoothing redistributes probability from seen to unseen events.

In [None]:
# Experiment: Probability redistribution visualization

# Train a small bigram model
small_corpus = "the cat sat on the mat the dog".split()
test_model = NGramModel(n=2, smoothing='none')
test_model.train(small_corpus)

test_model_smooth = NGramModel(n=2, smoothing='add-k', alpha=0.1)
test_model_smooth.train(small_corpus)

# Compare probabilities for "the" followed by different words
test_words = ['cat', 'dog', 'mat', 'xylophone', 'elephant', 'computer']

print("=== Probability Comparison: P(word | 'the') ===\n")
print(f"{'Word':<15} {'No Smoothing':<18} {'With Smoothing (k=0.1)':<25}")
print("-" * 60)

for word in test_words:
    prob_no_smooth = test_model.probability(('the', word))
    prob_smooth = test_model_smooth.probability(('the', word))
    
    marker_no = "*" if prob_no_smooth == 0 else " "
    marker_smooth = "+" if prob_smooth > 0 and prob_no_smooth == 0 else " "
    
    print(f"{word:<15} {prob_no_smooth:<18.6f}{marker_no}  {prob_smooth:<25.6f}{marker_smooth}")

print("\n* = Zero probability (unseen bigram)")
print("+ = Non-zero after smoothing (rescued from zero!)")
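
The smoothed column in the table above follows from the add-k formula, (count + k) / (context count + k × V). As a hedged cross-check, the sketch below redoes the arithmetic by hand for the same toy sentence, using the same boundary markers as `NGramModel.train`. The variable names and the `add_k` helper are illustrative only.

```python
from collections import Counter

toy_tokens = ['<START>'] + "the cat sat on the mat the dog".split() + ['<END>']
vocab_size = len(set(toy_tokens))   # 8, counting both boundary markers
toy_bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
# Number of bigrams whose first word is 'the' (3 in this sentence)
context_count = sum(c for (prev, _), c in toy_bigram_counts.items() if prev == 'the')

def add_k(word: str, k: float) -> float:
    """P(word | 'the') with add-k smoothing: (count + k) / (context_count + k * V)."""
    return (toy_bigram_counts[('the', word)] + k) / (context_count + k * vocab_size)

print(f"seen   P(cat | the)       = {add_k('cat', 0.1):.6f}")        # 1.1 / 3.8
print(f"unseen P(xylophone | the) = {add_k('xylophone', 0.1):.6f}")  # 0.1 / 3.8
print(f"sum over vocabulary       = {sum(add_k(w, 0.1) for w in set(toy_tokens)):.6f}")
```

The seen bigram drops from 1/3 to about 0.289, and that freed mass is what funds the small probability given to every unseen continuation.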

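## Part 10: Visualization Matching Presentation

In this section, we'll recreate the key charts from the Week 1 presentation. This helps you understand how the visualizations were created and allows you to experiment with different parameters.

### 10.1 Text Quality Progression

See how generated text improves with higher n-gram order (from Slide 2).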

In [None]:
# Chart 4: Text Quality Progression
# This recreates the concept from Slide 2

print("=== Text Quality with Different N-gram Orders ===\n")

# Generate text with each model
models_for_quality = [
    ("1-gram (no context)", unigram),
    ("2-gram (1 word)", bigram),
    ("3-gram (2 words)", trigram)
]

for i, (name, model) in enumerate(models_for_quality, 1):
    print(f"{i}. {name}:")
    generated = model.generate(20, temperature=0.8)
    print(f"   \"{generated}\"")
    print()

print("Notice how text becomes more coherent with higher n!")
print("\n1-gram: Random words, no grammar")
print("2-gram: Local coherence, some grammar")  
print("3-gram: Better grammar, more natural flow")

### 10.2 Smoothing Methods Comparison

Compare different smoothing techniques (from Slide 27).

In [None]:
# Chart 3: Smoothing Methods Comparison
# This recreates the chart from Slide 27

# Train models with different smoothing
smoothing_models = [
    ('No Smoothing', NGramModel(n=2, smoothing='none')),
    ('Add-1 (Laplace)', NGramModel(n=2, smoothing='laplace', alpha=1.0)),
    ('Add-0.1', NGramModel(n=2, smoothing='add-k', alpha=0.1)),
    ('Add-0.01', NGramModel(n=2, smoothing='add-k', alpha=0.01))
]

smoothing_perps = []
for name, model in smoothing_models:
    model.train(train_tokens)
    perp = calculate_perplexity(model, test_tokens)
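    # Unsmoothed models hit zero probabilities; calculate_perplexity floors those
    # at 1e-10, so the value is huge rather than truly infinite
    if perp > 10000:
        perp = 9999  # Cap for visualization
    smoothing_perps.append((name, perp))
    # Build the label first: a conditional expression is not valid inside a format spec
    perp_label = f'{perp:.1f}' if perp < 1000 else 'Infinity (zeros)'
    print(f"{name:20} Perplexity: {perp_label}")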

fig, ax = plt.subplots(figsize=(11, 6))

methods = [name for name, _ in smoothing_perps]
perps = [perp for _, perp in smoothing_perps]

colors = ['#CC0000', '#FF9999', '#4ECDC4', '#95E77E']
bars = ax.bar(methods, perps, color=colors, alpha=0.7, edgecolor='#404040', linewidth=2)

ax.set_ylabel('Perplexity (lower is better)', fontsize=12, fontweight='bold')
ax.set_title('Smoothing Methods: Performance Comparison', fontsize=14, fontweight='bold')

# Add value labels
for i, (method, perp) in enumerate(smoothing_perps):
    label = '∞' if perp > 1000 else f'{perp:.0f}'
    ax.text(i, perp + 100 if perp < 1000 else 300, label, ha='center', fontsize=10, fontweight='bold')

plt.grid(True, alpha=0.2, linestyle='--')
plt.xticks(rotation=15, ha='right')
plt.tight_layout()
plt.show()

print("\nKey takeaway: Smoothing is essential. Without it, unseen n-grams get zero probability and perplexity is effectively infinite!")

### 10.3 Perplexity Comparison Across N-gram Orders

Recreate the perplexity comparison chart from Slide 33.

In [None]:
# Chart 2: Perplexity Comparison (from models already trained)
# This recreates the chart from Slide 33

fig, ax = plt.subplots(figsize=(10, 6))

# Use the perplexities already calculated on test data
n_gram_types = ['Unigram', 'Bigram', 'Trigram']
n_gram_perps = [perplexities[0][1], perplexities[1][1], perplexities[2][1]]

bars = ax.bar(n_gram_types, n_gram_perps, color='#3333B2', alpha=0.7,
              edgecolor='#404040', linewidth=2)

# Highlight trigram
bars[2].set_color('#95E77E')

ax.set_ylabel('Perplexity', fontsize=12, fontweight='bold')
ax.set_title('Perplexity vs N-gram Order (unsmoothed models)', fontsize=14, fontweight='bold')

# Add value labels and the signed change in perplexity relative to the previous order
for i in range(len(n_gram_perps)):
    ax.text(i, n_gram_perps[i] + 5, f'{n_gram_perps[i]:.1f}', ha='center',
            fontsize=11, fontweight='bold')
    
    if i > 0:
        change = ((n_gram_perps[i] - n_gram_perps[i-1]) / n_gram_perps[i-1]) * 100
        ax.text(i, n_gram_perps[i] - 15, f'{change:+.0f}%', ha='center',
                fontsize=9, color='#95E77E', fontweight='bold')

plt.grid(True, alpha=0.2, linestyle='--')
plt.tight_layout()
plt.show()

print("Change in perplexity as context grows (positive = improvement):")
print(f"  Unigram → Bigram: {((n_gram_perps[0]-n_gram_perps[1])/n_gram_perps[0]*100):.1f}%")
print(f"  Bigram → Trigram: {((n_gram_perps[1]-n_gram_perps[2])/n_gram_perps[1]*100):.1f}%")
print("Note: these models are unsmoothed, so on a tiny corpus higher orders can score worse because most test n-grams are unseen.")

### 10.4 N-gram Context Windows

Let's visualize how unigram, bigram, and trigram models use different amounts of context.

In [None]:
# Chart 1: N-gram Context Windows Visualization
# This recreates the chart from Slide 12 of the presentation

from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

fig, axes = plt.subplots(3, 1, figsize=(12, 8))

sentence = ["The", "cat", "sat", "on", "the", "mat"]

COLOR_CURRENT = '#FF6B6B'  # Red for current
COLOR_CONTEXT = '#4ECDC4'  # Teal for context
COLOR_PREDICT = '#95E77E'  # Green for prediction
COLOR_NEUTRAL = '#E0E0E0'  # Gray for neutral

# Unigram (n=1)
ax = axes[0]
ax.set_title("Unigram (n=1): No Context", fontsize=12, fontweight='bold')
ax.set_xlim(-0.5, 6.5)
ax.set_ylim(-0.5, 1.5)
ax.axis('off')

for i, word in enumerate(sentence):
    color = COLOR_CURRENT if i == 2 else COLOR_NEUTRAL
    rect = FancyBboxPatch((i-0.4, 0), 0.8, 0.8, boxstyle="round,pad=0.08",
                          facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(i, 0.4, word, ha='center', va='center', fontsize=10, fontweight='bold')

ax.text(3, -0.3, "Focus: 'sat' (P(sat) only)", ha='center', fontsize=9)

# Bigram (n=2)
ax = axes[1]
ax.set_title("Bigram (n=2): Previous Word", fontsize=12, fontweight='bold')
ax.set_xlim(-0.5, 6.5)
ax.set_ylim(-0.5, 1.5)
ax.axis('off')

for i, word in enumerate(sentence):
    if i == 1:
        color = COLOR_CONTEXT
    elif i == 2:
        color = COLOR_PREDICT
    else:
        color = COLOR_NEUTRAL
    rect = FancyBboxPatch((i-0.4, 0), 0.8, 0.8, boxstyle="round,pad=0.08",
                          facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(i, 0.4, word, ha='center', va='center', fontsize=10, fontweight='bold')

arrow = FancyArrowPatch((1.5, 0.4), (2.5, 0.4), arrowstyle='->', mutation_scale=15,
                       linewidth=2, color='black')
ax.add_patch(arrow)
ax.text(2, -0.3, "P(sat | cat)", ha='center', fontsize=9, fontweight='bold')

# Trigram (n=3)
ax = axes[2]
ax.set_title("Trigram (n=3): Two Previous Words", fontsize=12, fontweight='bold')
ax.set_xlim(-0.5, 6.5)
ax.set_ylim(-0.5, 1.5)
ax.axis('off')

for i, word in enumerate(sentence):
    if i in [0, 1]:
        color = COLOR_CONTEXT
    elif i == 2:
        color = COLOR_PREDICT
    else:
        color = COLOR_NEUTRAL
    rect = FancyBboxPatch((i-0.4, 0), 0.8, 0.8, boxstyle="round,pad=0.08",
                          facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(i, 0.4, word, ha='center', va='center', fontsize=10, fontweight='bold')

for start in [0.5, 1.5]:
    arrow = FancyArrowPatch((start, 0.4), (2.5, 0.4), arrowstyle='->',
                           mutation_scale=15, linewidth=2, color='black', alpha=0.7)
    ax.add_patch(arrow)

ax.text(2, -0.3, "P(sat | The, cat)", ha='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

print("This chart shows how different n-gram models use different amounts of context.")

## Summary and Key Takeaways

### What We've Learned:

1. **N-gram Models**: Statistical approach to language modeling based on sequences of n tokens

2. **Perplexity**: Metric for evaluating language models (lower is better)
   - Measures how "surprised" the model is by test data
   - Equals 2 raised to the cross-entropy (in bits per token) of the model on the test data

3. **Trade-offs**:
   - **Unigrams**: Simple but ignore context
   - **Bigrams**: Capture local dependencies
   - **Trigrams**: Better context but sparser data
   - **Higher-order**: Diminishing returns, data sparsity

4. **Smoothing Techniques**:
   - **Laplace (Add-1)**: Simple but can over-smooth
   - **Add-k**: Tunable smoothing parameter
   - **Interpolation**: Combine different n-gram orders

5. **Limitations of N-grams**:
   - Fixed context window
   - Number of possible n-grams (parameters) grows exponentially with n
   - No semantic understanding
   - Can't capture long-range dependencies

### Next Steps:
- Week 2: Word embeddings to capture semantic meaning
- Week 3: RNNs for variable-length contexts
- Week 4: Attention mechanisms for long-range dependencies
- Week 5: Transformers - the modern approach!

## Exercises

1. **Experiment with corpus size**: How does perplexity change with more training data?

2. **Compare smoothing techniques**: Implement Good-Turing or Kneser-Ney smoothing

3. **Character-level models**: Build n-gram models at the character level

4. **Cross-domain evaluation**: Train on one domain, test on another

5. **Optimize interpolation weights**: Use held-out data to find optimal lambdas

6. **Build a spell checker**: Use n-grams to detect and correct typos

7. **Language identification**: Use character n-grams to identify languages