# 10) 6-Gram Language Model

This notebook implements a 6-gram language model for next-word prediction.

**Model Description:**
- Predicts next word based on previous 5 words
- Uses Maximum Likelihood Estimation (MLE): P(w_i | w_{i-5}, w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) = Count(w_{i-5}, w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}, w_i) / Count(w_{i-5}, w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
- Filters candidates by first letter constraint
- Falls back to 5-gram → 4-gram → trigram → bigram → unigram if 6-gram not seen
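
As a concrete illustration of the MLE ratio above, the following minimal sketch computes a conditional probability from raw counts on a toy token list. The function name `mle_prob` and the toy corpus are illustrative assumptions, not part of the notebook's code. A three-word context is used for brevity; the notebook's model applies the same ratio with a five-word context.

```python
from collections import Counter
from typing import Sequence, Tuple


def mle_prob(tokens: Sequence[str], context: Tuple[str, ...], word: str) -> float:
    """MLE estimate P(word | context) = Count(context + word) / Count(context)."""
    n = len(context) + 1
    # Count every n-gram, and every (n-1)-gram that is followed by some word.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    if contexts[context] == 0:
        return 0.0  # unseen context: the ratio is undefined, report 0
    return ngrams[context + (word,)] / contexts[context]


toy = "the cat sat on the mat the cat sat on the rug".split()
print(mle_prob(toy, ("sat", "on", "the"), "mat"))  # 1 of 2 continuations -> 0.5
print(mle_prob(toy, ("sat", "on", "the"), "dog"))  # never seen -> 0.0
```

A probability of 0.0 for an unseen continuation is what triggers the fallback to shorter contexts described in the last bullet.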

**Expected Performance:**
- 10K data: ~23-28% accuracy
- 100K data: ~41-45% accuracy
- 1M data: ~55-59% accuracy
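- Full (3.8M) data: ~58-62% accuracy

The recorded run in section 10.9 came in higher at the small sizes: 43.78% on 10K and 50.68% on 100K. The 1M evaluation was interrupted, so no measured figure is available for 1M or the full data.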

In [None]:
# !pip install pandas
# !pip install tqdm
# !pip install gdown

In [None]:
# !gdown --fuzzy "https://drive.google.com/file/d/1kJvvOgscBFP_gohf-q3f2xa1WfT7GRAx/view?usp=drive_link"
# !gdown --fuzzy "https://drive.google.com/file/d/1PKk222dXuTdtqQc7M6nBR240DmbcBdUx/view?usp=drive_link"

## 10.1 Setup and Imports

In [1]:
import pandas as pd
from collections import defaultdict, Counter
from typing import List, Tuple, Dict
import time
from tqdm import tqdm
import json

print("Imports successful!")

Imports successful!


## 10.2 Load Data

In [2]:
# Load training data
print("Loading training data...")
with open('train.src.tok', 'r', encoding='utf-8') as f:
    train_lines = [line.strip() for line in f.readlines()]

print(f"Total training sentences: {len(train_lines):,}")

# Load dev set
print("\nLoading development set...")
dev_df = pd.read_csv('dev_set.csv')
print(f"Development set size: {len(dev_df):,} predictions")
print(f"Columns: {list(dev_df.columns)}")

# Show sample
print("\nSample dev set entries:")
print(dev_df.head(3))

Loading training data...
Total training sentences: 3,803,957

Loading development set...
Development set size: 94,825 predictions
Columns: ['context', 'first letter', 'answer']

Sample dev set entries:
                                             context first letter   answer
0  south korea and the united states on monday wa...            d      day
1  after agreeing to drastically cut its car impo...            t      the
2  three soldiers were injured in a bombing ambus...            m  morning


## 10.3 Data Sampling

We'll use simple sequential sampling (first N sentences) as decided in EDA.

In [3]:
# Data sizes for experiments
DATA_SIZES = {
    'debug': 10_000,
    'dev': 100_000,
    'large': 1_000_000,
    'full': 3_803_957
}

def sample_data(train_lines: List[str], size_key: str = 'debug') -> List[str]:
    """
    Sample training data sequentially (simple, no shuffling).
    
    Args:
        train_lines: Full training corpus
        size_key: One of 'debug', 'dev', 'large', 'full'
    
    Returns:
        First N sentences from corpus
    """
    size = DATA_SIZES[size_key]
    if size >= len(train_lines):
        return train_lines
    return train_lines[:size]

# Start with debug size (10K) for fast testing
# Change to 'dev', 'large', or 'full' later
CURRENT_SIZE = 'debug'
# CURRENT_SIZE = 'dev'
# CURRENT_SIZE = 'large'
# CURRENT_SIZE = 'full'

train_data = sample_data(train_lines, CURRENT_SIZE)
print(f"Using {CURRENT_SIZE} dataset: {len(train_data):,} sentences")
print(f"\nFirst 3 training sentences:")
for i, sent in enumerate(train_data[:3]):
    print(f"{i+1}. {sent}")

Using debug dataset: 10,000 sentences

First 3 training sentences:
1. australia ' s current account deficit shrunk by a record 1 . 11 billion dollars - lrb - 1 . 11 billion us - rrb - in the june quarter due to soaring commodity prices , figures released monday showed .
2. at least two people were killed in a suspected bomb attack on a passenger bus in the strife - torn southern philippines on monday , the military said .
3. australian shares closed down 1 . 1 percent monday following a weak lead from the united states and lower commodity prices , dealers said .


## 10.4 6-Gram Model Implementation

### Key Design Decisions:

1. **Sentence boundaries**: Add FIVE `<s>` tokens at start and `</s>` at end
   - This ensures we have enough context for 6-grams
   - Follows standard n-gram practice

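2. **Backoff strategy**: 6-gram → 5-gram → 4-gram → Trigram → Bigram → Unigram
   - Backoff is applied per candidate word: each candidate is scored with the first nonzero MLE probability found in this order
   - If 6-gram (w1, w2, w3, w4, w5, ?) not seen, fall back to 5-gram (w2, w3, w4, w5, ?)
   - If 5-gram not seen, fall back to 4-gram (w3, w4, w5, ?)
   - If 4-gram not seen, fall back to trigram (w4, w5, ?)
   - If trigram not seen, fall back to bigram (w5, ?)
   - If bigram not seen, fall back to unigram (?)
   - Scores from different orders are compared directly, so a candidate scored at a lower order can outrank one scored by a 6-gram
   - Candidates come from the training vocabulary, so the unigram probability is always nonzero; if no vocabulary word starts with the given letter, `predict` returns the letter itself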

3. **First letter filtering**:
   - Build vocabulary index by first character
   - Only consider words starting with given first letter
   - Handles special characters (`,`, `'`, `1`, etc.)

4. **Data structures**:
   - `sixgram_counts`: Dict[(w1, w2, w3, w4, w5, w6)] → count
   - `fivegram_counts`: Dict[(w1, w2, w3, w4, w5)] → count
   - `fourgram_counts`: Dict[(w1, w2, w3, w4)] → count
   - `trigram_counts`: Dict[(w1, w2, w3)] → count
   - `bigram_counts`: Dict[(w1, w2)] → count
   - `unigram_counts`: Dict[w] → count
   - `vocab_by_first_char`: Dict[char] → Set[words]
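
To make the boundary handling in point 1 concrete, here is a small sketch that pads one sentence and lists its 6-grams. `pad_sentence` and `extract_ngrams` are hypothetical helpers written for this illustration; the class below does the same work inline inside `train`.

```python
from typing import List, Tuple


def pad_sentence(sentence: str, n: int = 6) -> List[str]:
    """Prepend n-1 start tokens and append one end token."""
    return ['<s>'] * (n - 1) + sentence.split() + ['</s>']


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every window of n consecutive tokens, left to right."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = pad_sentence("shares closed down")
sixgrams = extract_ngrams(tokens, 6)

# One 6-gram per real token plus one for </s>: 3 words + 1 = 4 windows.
for gram in sixgrams:
    print(gram[:-1], "->", gram[-1])
```

Because every real token gets a full five-token history, the first word of a sentence is predicted from five `<s>` tokens instead of being skipped. This is also why the training log in section 10.9 reports the same total for 6-grams and unigrams.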

In [4]:
class SixgramModel:
    """
    6-gram language model with backoff strategy.
    
    Predicts P(w_i | w_{i-5}, w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) using Maximum Likelihood Estimation.
    Falls back to 5-gram/4-gram/trigram/bigram/unigram if 6-gram not seen.
    """
    
    def __init__(self):
        # N-gram counts
        self.sixgram_counts = defaultdict(int)    # (w1, w2, w3, w4, w5, w6) -> count
        self.fivegram_counts = defaultdict(int)   # (w1, w2, w3, w4, w5) -> count
        self.fourgram_counts = defaultdict(int)   # (w1, w2, w3, w4) -> count
        self.trigram_counts = defaultdict(int)    # (w1, w2, w3) -> count
        self.bigram_counts = defaultdict(int)     # (w1, w2) -> count
        self.unigram_counts = defaultdict(int)    # w -> count
        
        # Context counts (for probability calculation)
        self.fivegram_context_counts = defaultdict(int)  # (w1, w2, w3, w4, w5) -> count
        self.fourgram_context_counts = defaultdict(int)  # (w1, w2, w3, w4) -> count
        self.trigram_context_counts = defaultdict(int)   # (w1, w2, w3) -> count
        self.bigram_context_counts = defaultdict(int)    # (w1, w2) -> count
        self.unigram_context_counts = defaultdict(int)   # w1 -> count
        
        # Vocabulary indexed by first character
        self.vocab_by_first_char = defaultdict(set)  # char -> {words}
        
        # Statistics
        self.total_sixgrams = 0
        self.total_fivegrams = 0
        self.total_fourgrams = 0
        self.total_trigrams = 0
        self.total_bigrams = 0
        self.total_unigrams = 0
        
    def train(self, sentences: List[str]):
        """
        Train the 6-gram model on a list of sentences.
        
        Args:
            sentences: List of tokenized sentences (strings)
        """
        print(f"Training 6-gram model on {len(sentences):,} sentences...")
        start_time = time.time()
        
        for sentence in tqdm(sentences, desc="Processing sentences"):
            # Tokenize sentence
            tokens = sentence.split()
            
            # Add sentence boundaries
            # We use FIVE <s> tokens at start for 6-gram context
            tokens = ['<s>', '<s>', '<s>', '<s>', '<s>'] + tokens + ['</s>']
            
            # Extract n-grams and count
            for i in range(len(tokens)):
                # Unigram
                if i >= 5:  # Skip the <s> tokens
                    word = tokens[i]
                    self.unigram_counts[word] += 1
                    self.total_unigrams += 1
                    
                    # Add to vocabulary index
                    if word not in ['<s>', '</s>']:
                        first_char = word[0]
                        self.vocab_by_first_char[first_char].add(word)
                
                # Bigram
                if i >= 1:
                    bigram = (tokens[i-1], tokens[i])
                    self.bigram_counts[bigram] += 1
                    self.total_bigrams += 1
                    
                    # Count context (for probability: count(w1, w2) / count(w1))
                    if i >= 5:
                        self.unigram_context_counts[tokens[i-1]] += 1
                
                # Trigram
                if i >= 2:
                    trigram = (tokens[i-2], tokens[i-1], tokens[i])
                    self.trigram_counts[trigram] += 1
                    self.total_trigrams += 1
                    
                    # Count context (for probability: count(w1, w2, w3) / count(w1, w2))
                    if i >= 5:
                        context = (tokens[i-2], tokens[i-1])
                        self.bigram_context_counts[context] += 1
                
                # 4-gram
                if i >= 3:
                    fourgram = (tokens[i-3], tokens[i-2], tokens[i-1], tokens[i])
                    self.fourgram_counts[fourgram] += 1
                    self.total_fourgrams += 1
                    
                    # Count context (for probability: count(w1, w2, w3, w4) / count(w1, w2, w3))
                    if i >= 5:
                        context = (tokens[i-3], tokens[i-2], tokens[i-1])
                        self.trigram_context_counts[context] += 1
                
                # 5-gram
                if i >= 4:
                    fivegram = (tokens[i-4], tokens[i-3], tokens[i-2], tokens[i-1], tokens[i])
                    self.fivegram_counts[fivegram] += 1
                    self.total_fivegrams += 1
                    
                    # Count context (for probability: count(w1, w2, w3, w4, w5) / count(w1, w2, w3, w4))
                    if i >= 5:
                        context = (tokens[i-4], tokens[i-3], tokens[i-2], tokens[i-1])
                        self.fourgram_context_counts[context] += 1
                
                # 6-gram
                if i >= 5:
                    sixgram = (tokens[i-5], tokens[i-4], tokens[i-3], tokens[i-2], tokens[i-1], tokens[i])
                    self.sixgram_counts[sixgram] += 1
                    self.total_sixgrams += 1
                    
                    # Count context (for probability: count(w1, w2, w3, w4, w5, w6) / count(w1, w2, w3, w4, w5))
                    context = (tokens[i-5], tokens[i-4], tokens[i-3], tokens[i-2], tokens[i-1])
                    self.fivegram_context_counts[context] += 1
        
        elapsed = time.time() - start_time
        print(f"\nTraining complete in {elapsed:.2f} seconds")
        print(f"Total 6-grams: {self.total_sixgrams:,}")
        print(f"Total 5-grams: {self.total_fivegrams:,}")
        print(f"Total 4-grams: {self.total_fourgrams:,}")
        print(f"Total trigrams: {self.total_trigrams:,}")
        print(f"Total bigrams: {self.total_bigrams:,}")
        print(f"Total unigrams: {self.total_unigrams:,}")
        print(f"Unique 6-grams: {len(self.sixgram_counts):,}")
        print(f"Unique 5-grams: {len(self.fivegram_counts):,}")
        print(f"Unique 4-grams: {len(self.fourgram_counts):,}")
        print(f"Unique trigrams: {len(self.trigram_counts):,}")
        print(f"Unique bigrams: {len(self.bigram_counts):,}")
        print(f"Unique unigrams: {len(self.unigram_counts):,}")
        print(f"Vocabulary size: {sum(len(words) for words in self.vocab_by_first_char.values()):,}")
    
    def get_sixgram_prob(self, w1: str, w2: str, w3: str, w4: str, w5: str, w6: str) -> float:
        """
        Calculate P(w6 | w1, w2, w3, w4, w5) using MLE.
        
        Returns:
            Probability (0 if 6-gram never seen)
        """
        sixgram = (w1, w2, w3, w4, w5, w6)
        context = (w1, w2, w3, w4, w5)
        
        sixgram_count = self.sixgram_counts.get(sixgram, 0)
        context_count = self.fivegram_context_counts.get(context, 0)
        
        if context_count == 0:
            return 0.0
        
        return sixgram_count / context_count
    
    def get_fivegram_prob(self, w1: str, w2: str, w3: str, w4: str, w5: str) -> float:
        """
        Calculate P(w5 | w1, w2, w3, w4) using MLE.
        
        Returns:
            Probability (0 if 5-gram never seen)
        """
        fivegram = (w1, w2, w3, w4, w5)
        context = (w1, w2, w3, w4)
        
        fivegram_count = self.fivegram_counts.get(fivegram, 0)
        context_count = self.fourgram_context_counts.get(context, 0)
        
        if context_count == 0:
            return 0.0
        
        return fivegram_count / context_count
    
    def get_fourgram_prob(self, w1: str, w2: str, w3: str, w4: str) -> float:
        """
        Calculate P(w4 | w1, w2, w3) using MLE.
        
        Returns:
            Probability (0 if 4-gram never seen)
        """
        fourgram = (w1, w2, w3, w4)
        context = (w1, w2, w3)
        
        fourgram_count = self.fourgram_counts.get(fourgram, 0)
        context_count = self.trigram_context_counts.get(context, 0)
        
        if context_count == 0:
            return 0.0
        
        return fourgram_count / context_count
    
    def get_trigram_prob(self, w1: str, w2: str, w3: str) -> float:
        """
        Calculate P(w3 | w1, w2) using MLE.
        
        Returns:
            Probability (0 if trigram never seen)
        """
        trigram = (w1, w2, w3)
        context = (w1, w2)
        
        trigram_count = self.trigram_counts.get(trigram, 0)
        context_count = self.bigram_context_counts.get(context, 0)
        
        if context_count == 0:
            return 0.0
        
        return trigram_count / context_count
    
    def get_bigram_prob(self, w1: str, w2: str) -> float:
        """
        Calculate P(w2 | w1) using MLE.
        
        Returns:
            Probability (0 if bigram never seen)
        """
        bigram = (w1, w2)
        
        bigram_count = self.bigram_counts.get(bigram, 0)
        context_count = self.unigram_context_counts.get(w1, 0)
        
        if context_count == 0:
            return 0.0
        
        return bigram_count / context_count
    
    def get_unigram_prob(self, w: str) -> float:
        """
        Calculate P(w) using MLE.
        
        Returns:
            Probability (0 if word never seen)
        """
        if self.total_unigrams == 0:
            return 0.0
        
        return self.unigram_counts.get(w, 0) / self.total_unigrams
    
    def predict(self, context: str, first_letter: str) -> str:
        """
        Predict next word given context and first letter constraint.
        
        Args:
            context: Previous words as string (e.g., "the cat sat on the")
            first_letter: Required first character of prediction
        
        Returns:
            Predicted word (most likely word starting with first_letter)
        """
        # Tokenize context and get last 5 words
        context_tokens = context.split()
        
        # Handle short contexts: left-pad with <s> so exactly five tokens remain
        w1, w2, w3, w4, w5 = (['<s>'] * 5 + context_tokens)[-5:]
        
        # Get candidate words (all words starting with first_letter)
        candidates = self.vocab_by_first_char.get(first_letter, set())
        
        if not candidates:
            # No words in vocabulary start with this letter
            # This shouldn't happen with our data, but handle gracefully
            return first_letter  # Return just the letter
        
        # Score candidates using backoff strategy
        best_word = None
        best_score = -1
        
        for word in candidates:
            # Try 6-gram first
            score = self.get_sixgram_prob(w1, w2, w3, w4, w5, word)
            
            # If 6-gram not seen, back off to 5-gram
            if score == 0:
                score = self.get_fivegram_prob(w2, w3, w4, w5, word)
            
            # If 5-gram not seen, back off to 4-gram
            if score == 0:
                score = self.get_fourgram_prob(w3, w4, w5, word)
            
            # If 4-gram not seen, back off to trigram
            if score == 0:
                score = self.get_trigram_prob(w4, w5, word)
            
            # If trigram not seen, back off to bigram
            if score == 0:
                score = self.get_bigram_prob(w5, word)
            
            # If bigram not seen, back off to unigram
            if score == 0:
                score = self.get_unigram_prob(word)
            
            # Update best
            if score > best_score:
                best_score = score
                best_word = word
        
        # Defensive fallback only: every score is >= 0 and best_score starts at -1,
        # so best_word is always set when candidates is non-empty
        if best_word is None:
            # Get most common word by unigram count
            candidates_list = list(candidates)
            best_word = max(candidates_list, 
                          key=lambda w: self.unigram_counts.get(w, 0))
        
        return best_word
    
    def evaluate(self, dev_df: pd.DataFrame, max_examples: int = None) -> Dict:
        """
        Evaluate model on development set.
        
        Args:
            dev_df: DataFrame with columns ['context', 'first letter', 'answer']
            max_examples: Optional limit on number of examples to evaluate
        
        Returns:
            Dictionary with accuracy and other metrics
        """
        print(f"\nEvaluating on development set...")
        
        if max_examples:
            dev_df = dev_df.head(max_examples)
        
        correct = 0
        total = len(dev_df)
        
        predictions = []
        
        for idx, row in tqdm(dev_df.iterrows(), total=total, desc="Predicting"):
            context = row['context']
            first_letter = row['first letter']
            answer = row['answer']
            
            # Predict
            prediction = self.predict(context, first_letter)
            predictions.append(prediction)
            
            # Check correctness
            if prediction == answer:
                correct += 1
        
        # Guard against an empty dev set to avoid ZeroDivisionError
        accuracy = correct / total if total else 0.0
        
        print(f"\nResults:")
        print(f"  Total examples: {total:,}")
        print(f"  Correct: {correct:,}")
        print(f"  Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
        
        return {
            'accuracy': accuracy,
            'correct': correct,
            'total': total,
            'predictions': predictions
        }

print("SixgramModel class defined successfully!")

SixgramModel class defined successfully!
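
Note that `predict` above backs off per candidate word: each candidate takes the first nonzero probability it finds, so a score from a bigram can outrank a score from a 6-gram. A common alternative is to back off per context: use the longest context that has at least one continuation starting with the required letter, then pick the most frequent such continuation. The sketch below illustrates that alternative on toy data. `FollowerIndex` and its methods are hypothetical names, not part of this notebook, and its accuracy on this task has not been measured here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class FollowerIndex:
    """Maps every context of 0..max_order-1 previous tokens to a Counter of next words."""

    def __init__(self, max_order: int = 6):
        self.max_order = max_order
        self.followers: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)

    def train(self, sentences: List[str]) -> None:
        pad = ['<s>'] * (self.max_order - 1)
        for sentence in sentences:
            tokens = pad + sentence.split() + ['</s>']
            for i in range(len(pad), len(tokens)):
                # Register tokens[i] under each context length, from empty to longest.
                for k in range(self.max_order):
                    self.followers[tuple(tokens[i - k:i])][tokens[i]] += 1

    def predict(self, context: str, first_letter: str) -> str:
        m = self.max_order - 1
        history = (['<s>'] * m + context.split())[-m:] if m else []
        # Try the longest context first, dropping the oldest token on each miss.
        for start in range(len(history) + 1):
            seen = self.followers.get(tuple(history[start:]))
            if not seen:
                continue
            matches = [(-count, word) for word, count in seen.items()
                       if word.startswith(first_letter) and word != '</s>']
            if matches:
                # Highest count wins; ties are broken alphabetically.
                return min(matches)[1]
        return first_letter  # no known word starts with this letter


index = FollowerIndex()
index.train([
    "the united states said",
    "the united nations said",
    "the united states agreed",
])
print(index.predict("the united", "s"))  # 'states' follows this context twice
print(index.predict("a new", "n"))       # unseen context, backs off to the unigram level
```

Looking up one `Counter` per context also avoids scoring every vocabulary word that starts with the letter, which is likely the main cost in `predict` as the vocabulary grows. The trade-off is that the index stores a counter for every observed context.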


## 10.5 Train Model

Let's train on the debug dataset (10K sentences) first to test everything works.

In [None]:
# Initialize model
model = SixgramModel()

# Train
model.train(train_data)

## 10.6 Test Predictions

Let's test on a few manual examples before evaluating on dev set.

In [None]:
# Test examples
test_cases = [
    ("the cat sat on the", "m"),  # mat?
    ("president of the united", "s"),  # states?
    ("new york", "c"),  # city?
    ("in the", "m"),  # morning? middle?
    ("on", "m"),  # monday?
]

print("Testing predictions:\n")
for context, first_letter in test_cases:
    prediction = model.predict(context, first_letter)
    print(f"Context: '{context}'")
    print(f"First letter: '{first_letter}'")
    print(f"Prediction: {prediction}")
    print()

## 10.7 Evaluate on Dev Set

Now let's evaluate on the full development set (all 94,825 examples).

**Note:** In the recorded run (section 10.9), evaluating the full dev set took about 72 seconds with 10K training data and about 5 minutes with 100K. Larger models are slower still.

In [None]:
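# Evaluate on full dev set
# Took ~72 seconds for 10K training data in the recorded run (section 10.9)
# Change max_examples=1000 for quick testing
results = model.evaluate(dev_df, max_examples=None)

# Expected ranges from the notebook introduction, keyed by dataset size
EXPECTED_ACCURACY = {'debug': '~23-28%', 'dev': '~41-45%', 'large': '~55-59%', 'full': '~58-62%'}
print(f"\nExpected accuracy for {CURRENT_SIZE} dataset: {EXPECTED_ACCURACY[CURRENT_SIZE]}")
print(f"Actual accuracy: {results['accuracy']*100:.2f}%")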

## 10.8 Error Analysis

Let's look at some examples where the model got it right vs wrong.

In [None]:
# Add predictions to a copy of dev_df
# evaluate() uses dev_df.head(max_examples), so align on the number of predictions returned
dev_sample = dev_df.head(len(results['predictions'])).copy()
dev_sample['prediction'] = results['predictions']
dev_sample['correct'] = dev_sample['prediction'] == dev_sample['answer']

# Show correct predictions
print("=" * 80)
print("CORRECT PREDICTIONS (Sample of 5)")
print("=" * 80)
correct_samples = dev_sample[dev_sample['correct']].head(5)
for idx, row in correct_samples.iterrows():
    print(f"\nContext: {row['context']}")
    print(f"First letter: '{row['first letter']}'")
    print(f"Prediction: {row['prediction']}")
    print(f"Answer: {row['answer']}")
    print(f"✓ CORRECT")

# Show incorrect predictions
print("\n" + "=" * 80)
print("INCORRECT PREDICTIONS (Sample of 5)")
print("=" * 80)
incorrect_samples = dev_sample[~dev_sample['correct']].head(5)
for idx, row in incorrect_samples.iterrows():
    print(f"\nContext: {row['context']}")
    print(f"First letter: '{row['first letter']}'")
    print(f"Prediction: {row['prediction']}")
    print(f"Answer: {row['answer']}")
    print(f"✗ INCORRECT")
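
Beyond reading individual examples, accuracy can be broken down by first letter to see where the errors concentrate. The sketch below uses a small hand-made DataFrame with the same column names as `dev_sample`; the rows are invented for illustration and are not results from this notebook. It assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Toy stand-in for dev_sample: same column names as the notebook's dev set.
toy = pd.DataFrame({
    'first letter': ['t', 't', 'm', 'm', 'm', 's'],
    'answer':       ['the', 'this', 'monday', 'morning', 'market', 'said'],
    'prediction':   ['the', 'the', 'monday', 'monday', 'market', 'said'],
})
toy['correct'] = toy['prediction'] == toy['answer']

# Mean of a boolean column is the fraction of correct predictions.
by_letter = (
    toy.groupby('first letter')['correct']
       .agg(accuracy='mean', examples='count')
       .sort_values('examples', ascending=False)
)
print(by_letter)
```

Running the same `groupby` on `dev_sample` would give the per-letter breakdown for the real predictions.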

## 10.9 Scaling Experiments

Now let's see how accuracy changes with more training data.

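**Note:** This takes progressively longer, and evaluation dominates the total. Times from the recorded run below:
- 10K: ~73 seconds (train 1.1 s, eval 71.8 s)
- 100K: ~5.5 minutes (train 13.4 s, eval 315.1 s)
- 1M: ~5.7 minutes of training (342.4 s); evaluation was interrupted at 5% with about 16 minutes remaining
- Full (3.8M): not measured (original estimate: ~40-60 minutes)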

In [5]:
# Run experiments on different data sizes
# Select which sizes you want to test
sizes_to_test = [
    'debug',   # 10K - ~73 sec in the recorded run
    'dev',     # 100K - ~5.5 min in the recorded run
    'large',   # 1M - ~6 min training; evaluation much longer (interrupted in the recorded run)
    # 'full',    # 3.8M - very slow (not measured)
]

scaling_results = []

print("=" * 80)
print("STARTING MULTI-SIZE TRAINING EXPERIMENT")
print("=" * 80)
print(f"Will train on {len(sizes_to_test)} different dataset sizes:")
for size_key in sizes_to_test:
    print(f"  - {size_key}: {DATA_SIZES[size_key]:,} sentences")
print()

for size_key in sizes_to_test:
    print("\n" + "=" * 80)
    print(f"EXPERIMENT {len(scaling_results) + 1}/{len(sizes_to_test)}: {size_key.upper()} DATASET")
    print("=" * 80)
    print(f"Training on {DATA_SIZES[size_key]:,} sentences...")
    
    # Sample data
    data = sample_data(train_lines, size_key)
    
    # Train model
    current_model = SixgramModel()
    train_start = time.time()
    current_model.train(data)
    train_time = time.time() - train_start
    
    # Evaluate on full dev set
    print(f"\nEvaluating on full dev set ({len(dev_df):,} examples)...")
    eval_start = time.time()
    results = current_model.evaluate(dev_df, max_examples=None)
    eval_time = time.time() - eval_start
    
    # Store results
    scaling_results.append({
        'size_key': size_key,
        'num_sentences': len(data),
        'train_time_sec': train_time,
        'eval_time_sec': eval_time,
        'total_time_sec': train_time + eval_time,
        'accuracy': results['accuracy'],
        'correct': results['correct'],
        'total': results['total'],
        'unique_6grams': len(current_model.sixgram_counts),
        'unique_5grams': len(current_model.fivegram_counts),
        'unique_4grams': len(current_model.fourgram_counts),
        'unique_trigrams': len(current_model.trigram_counts),
        'unique_bigrams': len(current_model.bigram_counts),
        'unique_unigrams': len(current_model.unigram_counts),
    })
    
    print(f"\n✓ Completed {size_key} in {train_time + eval_time:.1f}s (train: {train_time:.1f}s, eval: {eval_time:.1f}s)")
    print(f"  Accuracy: {results['accuracy']*100:.2f}%")

# Show comprehensive summary
print("\n" + "=" * 80)
print("FINAL RESULTS SUMMARY")
print("=" * 80)
print()

# Table 1: Performance Summary
print("PERFORMANCE SUMMARY:")
print("-" * 80)
print(f"{'Dataset':<12} {'Sentences':>12} {'Accuracy':>10} {'Correct':>10} {'Train Time':>12} {'Eval Time':>12}")
print("-" * 80)
for result in scaling_results:
    print(f"{result['size_key']:<12} {result['num_sentences']:>12,} {result['accuracy']*100:>9.2f}% "
          f"{result['correct']:>10,} {result['train_time_sec']:>11.1f}s {result['eval_time_sec']:>11.1f}s")
print("-" * 80)

# Table 2: Model Size Summary
print("\nMODEL SIZE SUMMARY:")
print("-" * 80)
print(f"{'Dataset':<12} {'6-grams':>12} {'5-grams':>12} {'4-grams':>12} {'Trigrams':>12} {'Bigrams':>12}")
print("-" * 80)
for result in scaling_results:
    print(f"{result['size_key']:<12} {result['unique_6grams']:>12,} {result['unique_5grams']:>12,} "
          f"{result['unique_4grams']:>12,} {result['unique_trigrams']:>12,} {result['unique_bigrams']:>12,}")
print("-" * 80)

# Table 3: Accuracy Improvement
print("\nACCURACY IMPROVEMENT:")
print("-" * 80)
if len(scaling_results) > 1:
    baseline_acc = scaling_results[0]['accuracy']
    for i, result in enumerate(scaling_results):
        if i == 0:
            print(f"{result['size_key']:<12} {result['accuracy']*100:>9.2f}% (baseline)")
        else:
            improvement = (result['accuracy'] - baseline_acc) * 100  # percentage points
            print(f"{result['size_key']:<12} {result['accuracy']*100:>9.2f}% ({improvement:+.2f} pp vs {scaling_results[0]['size_key']})")
print("-" * 80)

# Save results to JSON
results_filename = 'scaling_results_6gram.json'
with open(results_filename, 'w') as f:
    json.dump(scaling_results, f, indent=2)
print(f"\n✓ Results saved to {results_filename}")

# Create a simple learning curve visualization using text
print("\nLEARNING CURVE (Accuracy vs Data Size):")
print("-" * 80)
max_acc = max(r['accuracy'] for r in scaling_results)
for result in scaling_results:
    bar_length = int((result['accuracy'] / max_acc) * 50)
    bar = '█' * bar_length
    print(f"{result['size_key']:<12} {result['num_sentences']:>12,} |{bar} {result['accuracy']*100:.2f}%")
print("-" * 80)

STARTING MULTI-SIZE TRAINING EXPERIMENT
Will train on 3 different dataset sizes:
  - debug: 10,000 sentences
  - dev: 100,000 sentences
  - large: 1,000,000 sentences


EXPERIMENT 1/3: DEBUG DATASET
Training on 10,000 sentences...
Training 6-gram model on 10,000 sentences...


Processing sentences: 100%|██████████| 10000/10000 [00:01<00:00, 9557.89it/s]



Training complete in 1.06 seconds
Total 6-grams: 337,976
Total 5-grams: 347,976
Total 4-grams: 357,976
Total trigrams: 367,976
Total bigrams: 377,976
Total unigrams: 337,976
Unique 6-grams: 256,373
Unique 5-grams: 246,429
Unique 4-grams: 227,026
Unique trigrams: 185,370
Unique bigrams: 102,716
Unique unigrams: 14,885
Vocabulary size: 14,884

Evaluating on full dev set (94,825 examples)...

Evaluating on development set...


Predicting: 100%|██████████| 94825/94825 [01:11<00:00, 1321.51it/s]



Results:
  Total examples: 94,825
  Correct: 41,511
  Accuracy: 0.4378 (43.78%)

✓ Completed debug in 72.8s (train: 1.1s, eval: 71.8s)
  Accuracy: 43.78%

EXPERIMENT 2/3: DEV DATASET
Training on 100,000 sentences...
Training 6-gram model on 100,000 sentences...


Processing sentences: 100%|██████████| 100000/100000 [00:13<00:00, 7454.22it/s]



Training complete in 13.42 seconds
Total 6-grams: 3,435,020
Total 5-grams: 3,535,020
Total 4-grams: 3,635,020
Total trigrams: 3,735,020
Total bigrams: 3,835,020
Total unigrams: 3,435,020
Unique 6-grams: 2,631,235
Unique 5-grams: 2,446,576
Unique 4-grams: 2,100,219
Unique trigrams: 1,454,744
Unique bigrams: 576,344
Unique unigrams: 42,351
Vocabulary size: 42,350

Evaluating on full dev set (94,825 examples)...

Evaluating on development set...


Predicting: 100%|██████████| 94825/94825 [05:15<00:00, 300.90it/s]



Results:
  Total examples: 94,825
  Correct: 48,055
  Accuracy: 0.5068 (50.68%)

✓ Completed dev in 328.6s (train: 13.4s, eval: 315.1s)
  Accuracy: 50.68%

EXPERIMENT 3/3: LARGE DATASET
Training on 1,000,000 sentences...
Training 6-gram model on 1,000,000 sentences...


Processing sentences: 100%|██████████| 1000000/1000000 [05:42<00:00, 2920.37it/s]



Training complete in 342.42 seconds
Total 6-grams: 34,184,775
Total 5-grams: 35,184,775
Total 4-grams: 36,184,775
Total trigrams: 37,184,775
Total bigrams: 38,184,775
Total unigrams: 34,184,775
Unique 6-grams: 22,676,480
Unique 5-grams: 20,011,877
Unique 4-grams: 15,494,989
Unique trigrams: 8,776,976
Unique bigrams: 2,433,677
Unique unigrams: 79,021
Vocabulary size: 79,020

Evaluating on full dev set (94,825 examples)...

Evaluating on development set...


Predicting:   5%|▍         | 4393/94825 [00:47<16:19, 92.31it/s]  


KeyboardInterrupt: 
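
The run above was interrupted during the 1M evaluation. Because the `json.dump` call sits after the loop, the two completed results were not written to `scaling_results_6gram.json` by this run. One way to guard against this, sketched below, is to rewrite a results file after each experiment. `save_checkpoint` and the checkpoint file name are hypothetical, and `fake_experiment` merely stands in for the train-and-evaluate step.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_checkpoint(results: List[Dict], path: str) -> None:
    """Write all results collected so far, replacing the previous file atomically."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, indent=2)
    os.replace(tmp_path, path)  # avoids leaving a half-written file behind


def fake_experiment(size_key: str) -> Dict:
    # Placeholder for training and evaluating a model on one data size.
    return {'size_key': size_key, 'accuracy': None}


checkpoint_path = os.path.join(tempfile.mkdtemp(), 'scaling_results_checkpoint.json')
collected: List[Dict] = []
try:
    for size_key in ['debug', 'dev', 'large']:
        collected.append(fake_experiment(size_key))
        save_checkpoint(collected, checkpoint_path)  # survives a later interrupt
except KeyboardInterrupt:
    print(f"Interrupted; {len(collected)} completed result(s) are in {checkpoint_path}")

with open(checkpoint_path, encoding='utf-8') as f:
    print([r['size_key'] for r in json.load(f)])
```

In the experiment loop, the equivalent change would be one `save_checkpoint(scaling_results, results_filename)` call right after each `scaling_results.append(...)`.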

## 10.10 Save Model (Optional)

Save the trained model for later use.

**Note:** 6-gram models can be very large (>1GB for full dataset). Only save if you have enough disk space.

In [None]:
import pickle
import sys

# Quick model size check
def get_size_mb(obj):
    """Get approximate size in MB (shallow estimate: strings inside tuple keys and items inside sets are not counted)"""
    size = sys.getsizeof(obj)
    if hasattr(obj, '__dict__'):
        for key, val in obj.__dict__.items():
            size += sys.getsizeof(val)
            if isinstance(val, dict):
                for k, v in val.items():
                    size += sys.getsizeof(k) + sys.getsizeof(v)
    return size / (1024 * 1024)

# Check model size
size_mb = get_size_mb(model)
print(f"Estimated model size: {size_mb:.2f} MB")

# Ask for confirmation only when the model is very large
should_save = True
if size_mb > 1000:
    print("⚠️  WARNING: Model is very large (>1GB). Consider not saving or using compression.")
    answer = input("Do you want to save anyway? (yes/no): ")
    should_save = answer.strip().lower() == 'yes'

if should_save:
    # Save model
    model_filename = f'sixgram_model_{CURRENT_SIZE}.pkl'
    with open(model_filename, 'wb') as f:
        pickle.dump(model, f)
    
    print(f"✓ Model saved to {model_filename}")
    print(f"  Size: {size_mb:.2f} MB")
    
    # To load later:
    # with open(model_filename, 'rb') as f:
    #     loaded_model = pickle.load(f)
else:
    print("Model not saved.")
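
The warning above mentions compression as an option. A minimal sketch of that, using the standard-library `gzip` module around `pickle`, is shown below on a toy count dictionary. `save_compressed`, `load_compressed`, and the file name are hypothetical helpers for illustration; how much space this saves on the real model has not been measured here.

```python
import gzip
import os
import pickle
import tempfile
from collections import defaultdict


def save_compressed(obj, path: str) -> None:
    """Pickle obj into a gzip-compressed file."""
    with gzip.open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path: str):
    """Load an object written by save_compressed."""
    # Only unpickle files you created yourself; pickle can execute arbitrary code.
    with gzip.open(path, 'rb') as f:
        return pickle.load(f)


# Toy stand-in for one of the model's count dictionaries.
counts = defaultdict(int)
counts[('<s>', '<s>', 'the')] += 3

path = os.path.join(tempfile.mkdtemp(), 'toy_counts.pkl.gz')
save_compressed(counts, path)
restored = load_compressed(path)
print(restored[('<s>', '<s>', 'the')])  # 3
```

The same two helpers would accept the `model` object directly, since `defaultdict(int)` attributes pickle without extra work.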

## 10.11 Next Steps

**Current Status:**
- ✅ Trigram model implemented
- ✅ 4-gram model implemented
- ✅ 5-gram model implemented
- ✅ 6-gram model implemented
- ✅ Tested on debug dataset (10K)
- ✅ Evaluated on dev set

**Performance Observations:**
- 6-grams provide slightly better accuracy than 5-grams (~1-2% improvement)
- Diminishing returns: going beyond 6-grams typically doesn't help much
- Model size grows significantly with higher n-grams

**To improve performance further:**

1. **More data**: Train on larger datasets (100K, 1M, Full)
   - Expected: ~41-45% on 100K, ~55-59% on 1M, ~58-62% on Full
   - Measured in section 10.9: 50.68% on 100K (the 1M evaluation was interrupted)

2. **Better smoothing**: Add-k smoothing or Kneser-Ney
   - Current: Simple MLE with backoff
   - Improvement: +3-8% accuracy

3. **KenLM**: Use optimized library with Modified Kneser-Ney
   - Expected: 58-65% accuracy
   - Best pure n-gram performance

4. **Advanced methods**: Neural models (LSTM, BiLSTM) or model ensembles
   - See notebooks 7-9 for neural approaches
   - See notebook 6 for ensemble methods
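
For reference, add-k smoothing (item 2) replaces the MLE ratio with (Count(context, w) + k) / (Count(context) + k·V), where V is the vocabulary size. The sketch below applies it to toy counts. `add_k_prob`, the value k = 0.1, and the counts are illustrative assumptions, and no accuracy gain is claimed for it here.

```python
def add_k_prob(ngram_count: int, context_count: int, vocab_size: int, k: float = 0.1) -> float:
    """Add-k estimate: (c + k) / (C + k * V). Never zero when k > 0."""
    if vocab_size <= 0 or k <= 0:
        raise ValueError("vocab_size and k must be positive")
    return (ngram_count + k) / (context_count + k * vocab_size)


# Toy numbers: the context was seen 4 times, the vocabulary has 10 words.
seen = add_k_prob(ngram_count=3, context_count=4, vocab_size=10)
unseen = add_k_prob(ngram_count=0, context_count=4, vocab_size=10)
print(f"seen: {seen:.3f}, unseen: {unseen:.3f}")  # seen: 0.620, unseen: 0.020
```

Unlike plain MLE, an unseen continuation gets a small nonzero probability, so the estimate from the longest context can be used directly instead of falling through to shorter ones.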