# 5) KenLM with Kneser-Ney Smoothing

This notebook implements a language model using KenLM, an optimized n-gram library with Modified Kneser-Ney smoothing.

**Model Description:**
- Uses KenLM library (highly optimized C++ implementation)
- Modified Kneser-Ney smoothing (state-of-the-art for n-grams)
- Supports 5-gram with proper backoff
- Much faster and more memory-efficient than pure Python

**Expected Performance:**
- 10K data: ~25-30% accuracy
- 100K data: ~45-50% accuracy
- 1M data: ~55-60% accuracy
- Full (3.8M) data: ~58-65% accuracy

**Why KenLM?**
- Modified Kneser-Ney > Simple MLE (our previous models)
- Handles unseen n-grams much better
- Industry standard for n-gram models
- Used in Moses MT, speech recognition, etc.

## 5.1 Setup and Imports

First, we need to install KenLM. This requires:
1. Installing the KenLM library
2. Building the language model with `lmplz`
3. Querying the model with Python bindings

In [1]:
# Install KenLM
# This may take 2-3 minutes
!pip install https://github.com/kpu/kenlm/archive/master.zip -q

print("KenLM installed successfully!")

KenLM installed successfully!


In [2]:
import pandas as pd
import kenlm
import os
import subprocess
from collections import defaultdict
from typing import List, Dict
from tqdm import tqdm
import time

print("Imports successful!")
print(f"KenLM version: {kenlm.__version__ if hasattr(kenlm, '__version__') else 'installed'}")

Imports successful!
KenLM version: installed


## 5.2 Load Data

In [3]:
# Load training data
print("Loading training data...")
with open('train.src.tok', 'r', encoding='utf-8') as f:
    train_lines = [line.strip() for line in f.readlines()]

print(f"Total training sentences: {len(train_lines):,}")

# Load dev set
print("\nLoading development set...")
dev_df = pd.read_csv('dev_set.csv')
print(f"Development set size: {len(dev_df):,} predictions")
print(f"Columns: {list(dev_df.columns)}")

# Show sample
print("\nSample dev set entries:")
print(dev_df.head(3))

Loading training data...
Total training sentences: 3,803,957

Loading development set...
Development set size: 94,825 predictions
Columns: ['context', 'first letter', 'answer']

Sample dev set entries:
                                             context first letter   answer
0  south korea and the united states on monday wa...            d      day
1  after agreeing to drastically cut its car impo...            t      the
2  three soldiers were injured in a bombing ambus...            m  morning


## 5.3 Data Sampling

We'll use simple sequential sampling (first N sentences) as decided in EDA.

In [4]:
# Data sizes for experiments
DATA_SIZES = {
    'debug': 10_000,
    'dev': 100_000,
    'large': 1_000_000,
    'full': 3_803_957
}

def sample_data(train_lines: List[str], size_key: str = 'debug') -> List[str]:
    """
    Sample training data sequentially (simple, no shuffling).
    
    Args:
        train_lines: Full training corpus
        size_key: One of 'debug', 'dev', 'large', 'full'
    
    Returns:
        First N sentences from corpus
    """
    size = DATA_SIZES[size_key]
    if size >= len(train_lines):
        return train_lines
    return train_lines[:size]

# Start with debug size (10K) for fast testing
# Change to 'dev', 'large', or 'full' later
CURRENT_SIZE = 'debug'
# CURRENT_SIZE = 'dev'
# CURRENT_SIZE = 'large'
# CURRENT_SIZE = 'full'

train_data = sample_data(train_lines, CURRENT_SIZE)
print(f"Using {CURRENT_SIZE} dataset: {len(train_data):,} sentences")
print(f"\nFirst 3 training sentences:")
for i, sent in enumerate(train_data[:3]):
    print(f"{i+1}. {sent}")

Using debug dataset: 10,000 sentences

First 3 training sentences:
1. australia ' s current account deficit shrunk by a record 1 . 11 billion dollars - lrb - 1 . 11 billion us - rrb - in the june quarter due to soaring commodity prices , figures released monday showed .
2. at least two people were killed in a suspected bomb attack on a passenger bus in the strife - torn southern philippines on monday , the military said .
3. australian shares closed down 1 . 1 percent monday following a weak lead from the united states and lower commodity prices , dealers said .


## 5.4 Prepare Training Data for KenLM

KenLM requires:
1. Text file with one sentence per line
2. We'll add sentence boundaries `<s>` and `</s>` (KenLM adds these automatically, but we can control it)
3. Build model using `lmplz` command-line tool

In [6]:
# Save training data to file for KenLM
train_file = f'kenlm_train_{CURRENT_SIZE}.txt'

print(f"Writing training data to {train_file}...")
with open(train_file, 'w', encoding='utf-8') as f:
    for sentence in train_data:
        # KenLM automatically adds <s> and </s>, so we just write the sentence
        f.write(sentence + '\n')

print(f"Saved {len(train_data):,} sentences to {train_file}")
print(f"File size: {os.path.getsize(train_file) / (1024*1024):.2f} MB")

Writing training data to kenlm_train_debug.txt...
Saved 10,000 sentences to kenlm_train_debug.txt
File size: 1.75 MB


## 5.5 Build KenLM Model

We'll use `lmplz` to build a 5-gram model with Modified Kneser-Ney smoothing.

**Command explanation:**
- `-o 5`: Build 5-gram model
- `--discount_fallback`: Handle edge cases in discounting
- `-S 80%`: Use 80% of RAM for sorting (adjust if needed)
- `-T /tmp`: Use /tmp for temporary files

**Note:** This takes:
- 10K: ~5-10 seconds
- 100K: ~30-60 seconds
- 1M: ~5-10 minutes
- Full: ~30-60 minutes

In [8]:
# Build KenLM model
model_file = f'kenlm_{CURRENT_SIZE}_5gram.arpa'

print(f"Building 5-gram KenLM model...")
print(f"This may take a while for large datasets...\n")

start_time = time.time()

# Build model using lmplz
cmd = [
    'lmplz',
    '-o', '5',  # 5-gram
    '--discount_fallback',
    '-S', '80%',  # Use 80% RAM
    '-T', '/tmp',  # Temp directory
]

with open(train_file, 'r') as input_f:
    with open(model_file, 'w') as output_f:
        result = subprocess.run(
            cmd,
            stdin=input_f,
            stdout=output_f,
            stderr=subprocess.PIPE,
            text=True
        )

elapsed = time.time() - start_time

if result.returncode == 0:
    print(f"✓ Model built successfully in {elapsed:.2f} seconds")
    print(f"Model saved to: {model_file}")
    print(f"Model size: {os.path.getsize(model_file) / (1024*1024):.2f} MB")
else:
    print(f"✗ Error building model:")
    print(result.stderr)

Building 5-gram KenLM model...
This may take a while for large datasets...



FileNotFoundError: [Errno 2] No such file or directory: 'lmplz'

## 5.6 Load KenLM Model

Now we'll load the model into Python for querying.

In [None]:
# Load the model
print(f"Loading KenLM model from {model_file}...")
model = kenlm.Model(model_file)

print(f"✓ Model loaded successfully!")
print(f"Model order: {model.order}")

# Test the model
test_sentence = "the president of the united states"
score = model.score(test_sentence, bos=True, eos=True)
perplexity = model.perplexity(test_sentence)

print(f"\nTest sentence: '{test_sentence}'")
print(f"Log10 probability: {score:.4f}")
print(f"Perplexity: {perplexity:.4f}")

## 5.7 Build Vocabulary Index

We need to index vocabulary by first letter for efficient candidate filtering.

In [None]:
# Build vocabulary from training data
print("Building vocabulary index...")

vocab_by_first_char = defaultdict(set)

for sentence in tqdm(train_data, desc="Indexing vocabulary"):
    tokens = sentence.split()
    for token in tokens:
        first_char = token[0]
        vocab_by_first_char[first_char].add(token)

print(f"\nVocabulary indexed by first character:")
print(f"Total unique words: {sum(len(words) for words in vocab_by_first_char.values()):,}")
print(f"Number of first characters: {len(vocab_by_first_char)}")

# Show sample
print(f"\nSample: Words starting with 'a': {len(vocab_by_first_char['a']):,}")
print(f"Sample: Words starting with 't': {len(vocab_by_first_char['t']):,}")

## 5.8 Prediction Function

We'll score all candidate words and select the one with highest probability.

In [None]:
def predict(context: str, first_letter: str, model: kenlm.Model, vocab_by_first_char: Dict) -> str:
    """
    Predict next word given context and first letter constraint.
    
    Args:
        context: Previous words as string (e.g., "the cat sat on the")
        first_letter: Required first character of prediction
        model: KenLM model
        vocab_by_first_char: Dictionary mapping first char to set of words
    
    Returns:
        Predicted word (most likely word starting with first_letter)
    """
    # Get candidate words
    candidates = vocab_by_first_char.get(first_letter, set())
    
    if not candidates:
        # No words in vocabulary start with this letter
        return first_letter
    
    # Score each candidate
    best_word = None
    best_score = float('-inf')
    
    for word in candidates:
        # Create full sentence with candidate word
        full_sentence = context + ' ' + word if context else word
        
        # Score with KenLM
        # We use bos=True to add <s>, eos=False since we're predicting next word
        score = model.score(full_sentence, bos=True, eos=False)
        
        if score > best_score:
            best_score = score
            best_word = word
    
    return best_word if best_word else first_letter

print("Prediction function defined!")

## 5.9 Test Predictions

Let's test on a few manual examples before evaluating on dev set.

In [None]:
# Test examples
test_cases = [
    ("the cat sat on the", "m"),  # mat?
    ("president of the united", "s"),  # states?
    ("new york", "c"),  # city?
    ("in the", "m"),  # morning? middle?
    ("on", "m"),  # monday?
]

print("Testing predictions:\n")
for context, first_letter in test_cases:
    prediction = predict(context, first_letter, model, vocab_by_first_char)
    print(f"Context: '{context}'")
    print(f"First letter: '{first_letter}'")
    print(f"Prediction: {prediction}")
    print()

## 5.10 Evaluate on Dev Set

Now let's evaluate on the full development set (all 94,825 examples).

**Note:** This may take 5-10 minutes depending on dataset size.

In [None]:
# Evaluate on dev set
print(f"\nEvaluating on development set...")

# For testing, you can limit examples
max_examples = None  # Set to 1000 for quick testing
eval_df = dev_df.head(max_examples) if max_examples else dev_df

correct = 0
total = len(eval_df)
predictions = []

start_time = time.time()

for idx, row in tqdm(eval_df.iterrows(), total=total, desc="Predicting"):
    context = row['context']
    first_letter = row['first letter']
    answer = row['answer']
    
    # Predict
    prediction = predict(context, first_letter, model, vocab_by_first_char)
    predictions.append(prediction)
    
    # Check correctness
    if prediction == answer:
        correct += 1

accuracy = correct / total
elapsed = time.time() - start_time

print(f"\nResults:")
print(f"  Total examples: {total:,}")
print(f"  Correct: {correct:,}")
print(f"  Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"  Time: {elapsed:.2f} seconds ({elapsed/total*1000:.2f} ms/prediction)")

print(f"\nExpected accuracy for {CURRENT_SIZE} dataset: ~25-30%")
print(f"Actual accuracy: {accuracy*100:.2f}%")

## 5.11 Error Analysis

Let's look at some examples where the model got it right vs wrong.

In [None]:
# Add predictions to dev_df
dev_sample = eval_df.copy()
dev_sample['prediction'] = predictions
dev_sample['correct'] = dev_sample['prediction'] == dev_sample['answer']

# Show correct predictions
print("="*80)
print("CORRECT PREDICTIONS (Sample of 5)")
print("="*80)
correct_samples = dev_sample[dev_sample['correct']].head(5)
for idx, row in correct_samples.iterrows():
    print(f"\nContext: {row['context']}")
    print(f"First letter: '{row['first letter']}'")
    print(f"Prediction: {row['prediction']}")
    print(f"Answer: {row['answer']}")
    print(f"✓ CORRECT")

# Show incorrect predictions
print("\n" + "="*80)
print("INCORRECT PREDICTIONS (Sample of 5)")
print("="*80)
incorrect_samples = dev_sample[~dev_sample['correct']].head(5)
for idx, row in incorrect_samples.iterrows():
    print(f"\nContext: {row['context']}")
    print(f"First letter: '{row['first letter']}'")
    print(f"Prediction: {row['prediction']}")
    print(f"Answer: {row['answer']}")
    print(f"✗ INCORRECT")

## 5.12 Scaling Experiments

Now let's see how accuracy changes with more training data.

**Note:** This will take progressively longer:
- 10K: ~1-2 minutes total
- 100K: ~5-10 minutes total
- 1M: ~20-30 minutes total
- Full (3.8M): ~1-2 hours total

In [5]:
# Run experiments on different data sizes
# Uncomment to run scaling experiments
sizes_to_test = [
    'debug',   # 10K
    # 'dev',     # 100K
    # 'large',   # 1M
    # 'full',    # 3.8M
]

scaling_results = []

for size_key in sizes_to_test:
    print("\n" + "="*80)
    print(f"TRAINING ON {size_key.upper()} DATASET ({DATA_SIZES[size_key]:,} sentences)")
    print("="*80)
    
    # Sample data
    data = sample_data(train_lines, size_key)
    
    # Prepare training file
    train_file = f'kenlm_train_{size_key}.txt'
    with open(train_file, 'w', encoding='utf-8') as f:
        for sentence in data:
            f.write(sentence + '\n')
    
    # Build model
    model_file = f'kenlm_{size_key}_5gram.arpa'
    cmd = ['lmplz', '-o', '5', '--discount_fallback', '-S', '80%', '-T', '/tmp']
    
    with open(train_file, 'r') as input_f:
        with open(model_file, 'w') as output_f:
            subprocess.run(cmd, stdin=input_f, stdout=output_f, stderr=subprocess.PIPE)
    
    # Load model
    model = kenlm.Model(model_file)
    
    # Build vocabulary
    vocab = defaultdict(set)
    for sentence in data:
        for token in sentence.split():
            vocab[token[0]].add(token)
    
    # Evaluate (use sample for speed)
    eval_sample = dev_df.head(1000)  # Use 1000 for faster testing
    correct = 0
    for idx, row in eval_sample.iterrows():
        pred = predict(row['context'], row['first letter'], model, vocab)
        if pred == row['answer']:
            correct += 1
    
    accuracy = correct / len(eval_sample)
    
    # Store results
    scaling_results.append({
        'size': size_key,
        'num_sentences': len(data),
        'accuracy': accuracy
    })

# Show summary
print("\n" + "="*80)
print("SCALING RESULTS SUMMARY")
print("="*80)
print(f"{'Dataset':<15} {'# Sentences':<15} {'Accuracy':<15}")
print("-"*45)
for result in scaling_results:
    print(f"{result['size']:<15} {result['num_sentences']:<15,} {result['accuracy']*100:<14.2f}%")


TRAINING ON DEBUG DATASET (10,000 sentences)


FileNotFoundError: [Errno 2] No such file or directory: 'lmplz'

## 5.13 Next Steps

**Current Status:**
- ✅ Trigram model implemented (58.12% on full data)
- ✅ 4-gram model implemented
- ✅ 5-gram model implemented
- ✅ KenLM with Kneser-Ney implemented
- ✅ Best n-gram baseline complete

**KenLM vs Our Models:**
- Our 5-gram: Simple MLE with backoff
- KenLM: Modified Kneser-Ney smoothing (better generalization)
- Expected improvement: +3-8% over our 5-gram

**To improve further:**

1. **Neural Models**: LSTM/GRU
   - Expected: 60-70% accuracy
   - Can capture longer dependencies

2. **Fine-tune GPT-2**: Transformer-based
   - Expected: 65-75% accuracy
   - Best single model performance

3. **Ensemble**: Combine KenLM + LSTM + GPT-2
   - Expected: 70-80% accuracy
   - Weighted voting or stacking

**Next notebook:**
- `6_LSTM.ipynb` - Implement LSTM from scratch
- Or `7_GPT2.ipynb` - Fine-tune GPT-2