# NLTK Complete Guide - Section 10: N-Grams & Language Models

This notebook covers:
- What are N-Grams?
- Generating N-Grams
- N-Gram Frequency Analysis
- Collocations
- Simple Language Models
- Text Generation

In [1]:
import nltk
import random
from collections import Counter, defaultdict

nltk.download('punkt', quiet=True)
nltk.download('gutenberg', quiet=True)
nltk.download('stopwords', quiet=True)

from nltk import ngrams, bigrams, trigrams
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import gutenberg, stopwords
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

## 10.1 What are N-Grams?

**N-grams** are contiguous sequences of n items from text. They're the foundation of statistical language models and many NLP applications.

### The Core Idea

Given a sequence of words, an n-gram captures a "sliding window" of n consecutive items:

| Type | N | Window Size | Example ("I love NLP") |
|------|---|-------------|------------------------|
| Unigram | 1 | 1 word | ["I", "love", "NLP"] |
| Bigram | 2 | 2 words | [("I", "love"), ("love", "NLP")] |
| Trigram | 3 | 3 words | [("I", "love", "NLP")] |
| 4-gram | 4 | 4 words | Not enough words! |
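
The sliding window can be sketched in plain Python without NLTK. The helper name `sliding_ngrams` is an illustrative assumption, not an NLTK function; it only mirrors what `nltk.ngrams` does for unpadded input.

```python
def sliding_ngrams(tokens, n):
    """Return all n-grams (as tuples) by sliding a window of size n over tokens."""
    # zip stops at the shortest slice, so windows never run past the end
    return list(zip(*(tokens[i:] for i in range(n))))


words = ["I", "love", "NLP"]
for n in range(1, 5):
    # For n=4 the list is empty: three words cannot fill a window of four
    print(n, sliding_ngrams(words, n))
```

A sequence of length L yields L - n + 1 n-grams, which is why the 4-gram row in the table is empty.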

### Why N-Grams Matter

1. **Capture Local Context**: "New York" means something different than "New" and "York" separately
2. **Statistical Patterns**: We can count how often word sequences appear
3. **Prediction**: If we see "New", what word likely follows? ("York", "Zealand", "Year"...)
4. **Simplicity**: No deep learning needed - just counting!

### The Trade-off: N Size

| Small N (1-2) | Large N (4+) |
|---------------|--------------|
| ✅ More training examples | ❌ Fewer training examples |
| ✅ Better coverage | ❌ Sparse data problem |
| ❌ Less context | ✅ More context |
| ❌ Less accurate predictions | ✅ More accurate (when data exists) |

**Sweet spot**: Trigrams (n=3) often balance context and data availability.
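
One rough way to see the sparsity side of this trade-off is to measure how many distinct n-grams occur only once as n grows. The sketch below uses a toy sentence and a hypothetical helper, `singleton_share`; real corpora show the same trend far more strongly.

```python
from collections import Counter


def singleton_share(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    grams = Counter(zip(*(tokens[i:] for i in range(n))))
    if not grams:
        return 0.0
    return sum(1 for count in grams.values() if count == 1) / len(grams)


toy_tokens = "the cat sat on the mat and the cat sat on the rug".split()
shares = {n: singleton_share(toy_tokens, n) for n in (1, 2, 3)}
for n, share in shares.items():
    print(f"n={n}: {share:.0%} of distinct n-grams were seen only once")
```

The larger the share of one-off n-grams, the less evidence the model has for each prediction.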

In [2]:
text = "I love natural language processing"
tokens = word_tokenize(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}\n")

# Generate n-grams
unigrams = list(ngrams(tokens, 1))
bi_grams = list(ngrams(tokens, 2))
tri_grams = list(ngrams(tokens, 3))
four_grams = list(ngrams(tokens, 4))

print(f"Unigrams (1): {unigrams}")
print(f"Bigrams (2):  {bi_grams}")
print(f"Trigrams (3): {tri_grams}")
print(f"4-grams (4):  {four_grams}")

Text: I love natural language processing
Tokens: ['I', 'love', 'natural', 'language', 'processing']

Unigrams (1): [('I',), ('love',), ('natural',), ('language',), ('processing',)]
Bigrams (2):  [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
Trigrams (3): [('I', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing')]
4-grams (4):  [('I', 'love', 'natural', 'language'), ('love', 'natural', 'language', 'processing')]


## 10.2 NLTK Convenience Functions

NLTK provides helper functions that wrap the general `ngrams()` function for common cases.

In [3]:
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())

print(f"Text: {text}\n")

# Using convenience functions
print("Bigrams (using bigrams()):")
for bg in bigrams(tokens):
    print(f"  {bg}")

print("\nTrigrams (using trigrams()):")
for tg in trigrams(tokens):
    print(f"  {tg}")

Text: The quick brown fox jumps over the lazy dog

Bigrams (using bigrams()):
  ('the', 'quick')
  ('quick', 'brown')
  ('brown', 'fox')
  ('fox', 'jumps')
  ('jumps', 'over')
  ('over', 'the')
  ('the', 'lazy')
  ('lazy', 'dog')

Trigrams (using trigrams()):
  ('the', 'quick', 'brown')
  ('quick', 'brown', 'fox')
  ('brown', 'fox', 'jumps')
  ('fox', 'jumps', 'over')
  ('jumps', 'over', 'the')
  ('over', 'the', 'lazy')
  ('the', 'lazy', 'dog')


## 10.3 N-Grams with Padding

### The Boundary Problem

Consider the sentence "I love NLP" with bigrams:
- `("I", "love")` ✅
- `("love", "NLP")` ✅

But what about:
- What comes **before** "I"? (sentence start)
- What comes **after** "NLP"? (sentence end)

Without handling boundaries, the model can't learn:
- How sentences typically **start**
- How sentences typically **end**

### The Solution: Padding Tokens

Add special markers:
- `<s>` = Start of sentence
- `</s>` = End of sentence

```
Original:  I love NLP
Padded:    <s> I love NLP </s>

Bigrams:   (<s>, I), (I, love), (love, NLP), (NLP, </s>)
```

Now the model learns:
- Sentences often start with "I", "The", "She", etc.
- Sentences often end with nouns, periods, etc.
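
The padding step itself is small enough to write by hand. This is a minimal sketch, assuming n-1 markers on each side (the convention `pad_both_ends` follows); `pad_sentence` is a hypothetical helper name.

```python
def pad_sentence(tokens, n, start="<s>", end="</s>"):
    """Add n-1 start markers and n-1 end markers around a tokenized sentence."""
    return [start] * (n - 1) + list(tokens) + [end] * (n - 1)


padded_bigram = pad_sentence(["I", "love", "NLP"], n=2)
padded_trigram = pad_sentence(["I", "love", "NLP"], n=3)
print(padded_bigram)   # one marker per side
print(padded_trigram)  # two markers per side
```

Using n-1 markers means the very first real word already has a full context of n-1 tokens.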

In [4]:
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends

text = "I love NLP"
tokens = word_tokenize(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}\n")

# Without padding
print("Bigrams without padding:")
print(list(bigrams(tokens)))

# With padding
print("\nBigrams with padding:")
padded = list(pad_both_ends(tokens, n=2))
print(f"Padded tokens: {padded}")
print(f"Padded bigrams: {list(bigrams(padded))}")

Text: I love NLP
Tokens: ['I', 'love', 'NLP']

Bigrams without padding:
[('I', 'love'), ('love', 'NLP')]

Bigrams with padding:
Padded tokens: ['<s>', 'I', 'love', 'NLP', '</s>']
Padded bigrams: [('<s>', 'I'), ('I', 'love'), ('love', 'NLP'), ('NLP', '</s>')]


## 10.4 N-Gram Frequency Analysis

Counting n-gram frequencies reveals patterns in text:
- **Common bigrams**: "of the", "in the", "to be" - often function words
- **Rare bigrams**: Usually content-specific or unusual combinations

This is the foundation of language models: **frequent patterns are more likely**.

In [5]:
# Load sample text
text = gutenberg.raw('austen-emma.txt')[:10000]  # First 10K chars
tokens = word_tokenize(text.lower())

# Filter to alphabetic tokens only
tokens = [t for t in tokens if t.isalpha()]

print(f"Total tokens: {len(tokens)}")
print(f"Sample: {tokens[:20]}")

Total tokens: 1777
Sample: ['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy']


In [6]:
# Bigram frequencies
bi_grams = list(bigrams(tokens))
bigram_freq = Counter(bi_grams)

print("Top 15 Most Common Bigrams:")
print("-" * 40)
for bg, count in bigram_freq.most_common(15):
    print(f"{bg[0]:<10} {bg[1]:<10} {count:>5}")

Top 15 Most Common Bigrams:
----------------------------------------
miss       taylor        13
of         her           11
it         was            9
she        had            9
of         the            7
was        a              7
he         was            7
for        her            6
her        own            6
her        father         6
in         the            5
of         a              4
a          very           4
had        been           4
thought    of             4


In [7]:
# Trigram frequencies
tri_grams = list(trigrams(tokens))
trigram_freq = Counter(tri_grams)

print("Top 15 Most Common Trigrams:")
print("-" * 50)
for tg, count in trigram_freq.most_common(15):
    print(f"{tg[0]:<10} {tg[1]:<10} {tg[2]:<10} {count:>5}")

Top 15 Most Common Trigrams:
--------------------------------------------------
it         was        a              3
how        she        had            3
a          mile       from           3
a          house      of             3
house      of         her            3
of         her        own            3
had        miss       taylor         2
miss       taylor     had            2
her        own        the            2
not        at         all            2
of         miss       taylor         2
of         the        family         2
only       half       a              2
half       a          mile           2
mile       from       them           2


## 10.5 Collocations

**Collocations** are word combinations that occur together more often than chance would predict. Unlike simple frequency counting, collocation finding uses **statistical measures** to identify meaningful phrases.

### Why Not Just Use Frequency?

High-frequency bigrams like "of the" or "in the" aren't interesting - they appear often simply because "the" is common. We want phrases where the words have a **special relationship**.

### Statistical Measures

| Measure | What It Finds | Good For |
|---------|---------------|----------|
| **PMI** (Pointwise Mutual Information) | Words that co-occur more than expected | Rare but meaningful phrases |
| **Chi-Square** | Statistical significance of co-occurrence | Technical/domain terms |
| **Likelihood Ratio** | How much more likely together vs apart | Balanced approach |

### PMI Formula (Under the Hood)

$$PMI(x, y) = \log_2 \frac{P(x, y)}{P(x) \cdot P(y)}$$

- If words are **independent**: $P(x,y) = P(x) \cdot P(y)$, so $PMI = 0$
- If words **attract**: $P(x,y) > P(x) \cdot P(y)$, so $PMI > 0$
- If words **repel**: $P(x,y) < P(x) \cdot P(y)$, so $PMI < 0$
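
To make the formula concrete, here is a from-scratch sketch that computes PMI for every bigram in a toy token list. It assumes probabilities are estimated as counts divided by the number of tokens (a simplification), and `pmi_scores` is a hypothetical helper, not part of NLTK.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """PMI for each adjacent word pair, using count / len(tokens) as probability."""
    total = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): math.log2(count * total / (word_counts[x] * word_counts[y]))
        for (x, y), count in pair_counts.items()
    }


toy_tokens = "of the people of the city of hong kong".split()
scores = pmi_scores(toy_tokens)
for pair in [("of", "the"), ("hong", "kong")]:
    print(pair, round(scores[pair], 3))
```

Although "of the" occurs twice and "hong kong" only once, "hong kong" scores higher because neither word appears anywhere else. The same effect makes PMI unstable for very rare pairs, which is one reason to apply a frequency filter first.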

In [8]:
# Load more text
text = gutenberg.raw('austen-emma.txt')
tokens = word_tokenize(text.lower())
tokens = [t for t in tokens if t.isalpha() and len(t) > 2]

print(f"Total tokens: {len(tokens):,}")

Total tokens: 121,268


In [9]:
# Find bigram collocations
bigram_finder = BigramCollocationFinder.from_words(tokens)

# Filter low-frequency bigrams
bigram_finder.apply_freq_filter(5)

# Get top collocations using PMI (Pointwise Mutual Information)
bigram_measures = BigramAssocMeasures()

print("Top 15 Bigram Collocations (PMI):")
print("-" * 40)
for colloc in bigram_finder.nbest(bigram_measures.pmi, 15):
    print(f"  {colloc[0]} {colloc[1]}")

Top 15 Bigram Collocations (PMI):
----------------------------------------
  sore throat
  brunswick square
  william larkins
  baked apples
  box hill
  sixteen miles
  maple grove
  hair cut
  south end
  colonel campbell
  protest against
  robert martin
  five couple
  vast deal
  ready wit


In [10]:
# Different scoring methods
print("Top 10 by Likelihood Ratio:")
for colloc in bigram_finder.nbest(bigram_measures.likelihood_ratio, 10):
    print(f"  {colloc[0]} {colloc[1]}")

print("\nTop 10 by Chi-Square:")
for colloc in bigram_finder.nbest(bigram_measures.chi_sq, 10):
    print(f"  {colloc[0]} {colloc[1]}")

Top 10 by Likelihood Ratio:
  frank churchill
  had been
  miss woodhouse
  have been
  any thing
  could not
  she had
  miss bates
  miss fairfax
  did not

Top 10 by Chi-Square:
  maple grove
  brunswick square
  box hill
  william larkins
  sore throat
  frank churchill
  colonel campbell
  robert martin
  baked apples
  great deal


In [11]:
# Trigram collocations
trigram_finder = TrigramCollocationFinder.from_words(tokens)
trigram_finder.apply_freq_filter(3)

trigram_measures = TrigramAssocMeasures()

print("Top 15 Trigram Collocations:")
print("-" * 50)
for colloc in trigram_finder.nbest(trigram_measures.pmi, 15):
    print(f"  {' '.join(colloc)}")

Top 15 Trigram Collocations:
--------------------------------------------------
  bad sore throat
  lovely woman reigns
  woman reigns alone
  beg your pardon
  monarch the seas
  box hill party
  but frozen maid
  husbands and wives
  laid down upon
  fair but frozen
  kitty fair but
  eating and drinking
  woman lovely woman
  pray take care
  like maple grove


## 10.6 Simple Language Model

### What is a Language Model?

A **language model** assigns probabilities to sequences of words. It answers: "How likely is this sentence?"

### The Chain Rule of Probability

For a sentence $W = w_1, w_2, ..., w_n$:

$$P(W) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2) \cdot ... \cdot P(w_n|w_1,...,w_{n-1})$$

**Problem**: We'd need to count every possible word history - impossible!

### The Markov Assumption (Key Insight!)

**Assume** the next word depends only on the previous $n-1$ words:

- **Bigram**: $P(w_i|w_1...w_{i-1}) \approx P(w_i|w_{i-1})$
- **Trigram**: $P(w_i|w_1...w_{i-1}) \approx P(w_i|w_{i-2}, w_{i-1})$

This is the **Markov assumption** - the future depends only on the recent past.

### Maximum Likelihood Estimation (MLE)

We estimate probabilities by **counting**:

$$P(w_i|w_{i-1}) = \frac{Count(w_{i-1}, w_i)}{Count(w_{i-1})}$$

Example: What's $P(\text{knightley}|\text{mr})$?

$$P(\text{knightley}|\text{mr}) = \frac{\text{Times "mr knightley" appears}}{\text{Times "mr" appears}}$$
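
Combining the Markov assumption with MLE, a sentence probability becomes a product of counted ratios. The sketch below is a simplified illustration on a toy corpus: `bigram_sentence_probability` is a hypothetical helper, and it leaves out the start-of-sentence term $P(w_1)$, which padding would normally supply.

```python
from collections import Counter


def bigram_sentence_probability(sentence, corpus_tokens):
    """Product of MLE bigram probabilities P(w_i | w_{i-1}) over the sentence."""
    unigram_counts = Counter(corpus_tokens)
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    probability = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # unseen context: MLE has no estimate, treat as 0
        probability *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return probability


corpus = "i love nlp . i love python . you love nlp .".split()
seen = bigram_sentence_probability(["i", "love", "nlp"], corpus)
unseen = bigram_sentence_probability(["i", "love", "java"], corpus)
print(f"P(i love nlp)  = {seen:.4f}")
print(f"P(i love java) = {unseen:.4f}")
```

A single unseen bigram drives the whole product to zero, which is the motivation for smoothing.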

### Building Our Own Bigram Model

Let's implement this from scratch to understand how it works:

In [12]:
class SimpleBigramModel:
    """Simple bigram language model"""
    
    def __init__(self):
        self.bigram_counts = defaultdict(Counter)
        self.unigram_counts = Counter()
    
    def train(self, tokens):
        """Train on a list of tokens"""
        # Count unigrams (update, so repeated train() calls accumulate like the bigram counts)
        self.unigram_counts.update(tokens)
        
        # Count bigrams (word1 -> word2)
        for w1, w2 in bigrams(tokens):
            self.bigram_counts[w1][w2] += 1
    
    def probability(self, word, context):
        """P(word | context)"""
        if context not in self.bigram_counts:
            return 0
        
        total = sum(self.bigram_counts[context].values())
        return self.bigram_counts[context][word] / total
    
    def next_word_probs(self, context):
        """Get probabilities for all possible next words"""
        if context not in self.bigram_counts:
            return {}
        
        total = sum(self.bigram_counts[context].values())
        return {word: count/total 
                for word, count in self.bigram_counts[context].items()}
    
    def generate(self, start_word, length=10):
        """Generate text starting from a word"""
        words = [start_word]
        current = start_word
        
        for _ in range(length - 1):
            if current not in self.bigram_counts:
                break
            
            # Get next word probabilities
            probs = self.next_word_probs(current)
            if not probs:
                break
            
            # Choose next word weighted by probability
            next_words = list(probs.keys())
            weights = list(probs.values())
            current = random.choices(next_words, weights=weights)[0]
            words.append(current)
        
        return ' '.join(words)

In [13]:
# Train the model
text = gutenberg.raw('austen-emma.txt')
tokens = word_tokenize(text.lower())
tokens = [t for t in tokens if t.isalpha()]

model = SimpleBigramModel()
model.train(tokens)

print(f"Vocabulary size: {len(model.unigram_counts):,}")
print(f"Unique bigram contexts: {len(model.bigram_counts):,}")

Vocabulary size: 6,932
Unique bigram contexts: 6,931


### How the Model Works Internally

The `SimpleBigramModel` stores two data structures:

1. **`unigram_counts`**: How often each word appears
   ```
   {"the": 5000, "mr": 800, "emma": 500, ...}
   ```

2. **`bigram_counts`**: For each word, what words follow it
   ```
   {"mr": {"knightley": 200, "woodhouse": 150, "elton": 100, ...},
    "the": {"house": 50, "young": 40, ...}}
   ```

When we call `probability("knightley", "mr")` with the illustrative counts above:
- Look up all words following "mr": 200 + 150 + 100 + ... = 500 total
- "knightley" appears 200 times after "mr"
- Return: 200/500 = 0.40 (40%)

In [14]:
# Check probabilities
context = "mr"
print(f"Words that follow '{context}':")
print("-" * 30)

probs = model.next_word_probs(context)
sorted_probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)

for word, prob in sorted_probs[:10]:
    print(f"  {word:<15} {prob:.2%}")

Words that follow 'mr':
------------------------------
  knightley       22.58%
  elton           22.58%
  weston          6.45%
  woodhouse       4.84%
  dixon           3.23%
  richard         3.23%
  churchill       3.23%
  perry           3.23%
  martin          1.61%
  robert          1.61%


In [15]:
# Generate text
print("Generated text samples:")
print("=" * 60)

start_words = ["the", "she", "he", "it", "mr"]

for start in start_words:
    generated = model.generate(start, length=12)
    print(f"\n'{start}' → {generated}")

Generated text samples:

'the' → the sofa for elton as they proceeded a sincerity it is beyond

'she' → she heard and much exultation i ever shall grant you cried emma

'he' → he would rather not let us to speak therefore every sacrifice for

'it' → it will be going this belief on so active half a moment

'mr' → mr knightley you heard you apprehended from him and the same they


### Text Generation: How It Works Step by Step

The `generate()` method uses **random sampling** weighted by probabilities:

```
Start: "mr"
Step 1: P(next | "mr") = {"knightley": 0.40, "woodhouse": 0.30, "elton": 0.20, ...}
        Random choice → "knightley" (40% chance)
        
Step 2: P(next | "knightley") = {"was": 0.25, "had": 0.20, "could": 0.15, ...}
        Random choice → "was" (25% chance)
        
Step 3: P(next | "was") = {"not": 0.15, "a": 0.12, "very": 0.10, ...}
        ...continue...
```

**Key insight**: Each word choice is **probabilistic**, so running generation multiple times gives different results!
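
If repeatable output is needed (for tests or demos), the sampling step can be driven by a seeded random generator. This is a small sketch with made-up probabilities; `sample_next` is a hypothetical helper and the numbers are illustrative, not measured from the corpus.

```python
import random


def sample_next(probs, rng):
    """Pick one word from a {word: probability} dict, weighted by probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights)[0]


# Illustrative probabilities only
next_after_mr = {"knightley": 0.40, "woodhouse": 0.30, "elton": 0.20, "weston": 0.10}

rng_a = random.Random(42)
rng_b = random.Random(42)
run_a = [sample_next(next_after_mr, rng_a) for _ in range(5)]
run_b = [sample_next(next_after_mr, rng_b) for _ in range(5)]
print(run_a)
print(run_a == run_b)  # two generators with the same seed make the same choices
```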

## 10.7 NLTK's Language Model

NLTK provides a more sophisticated implementation with:
- Built-in padding handling
- Support for different n-gram sizes
- Several estimators: unsmoothed MLE (Maximum Likelihood Estimation, used here) and smoothed alternatives such as Laplace and Kneser-Ney

### MLE vs Smoothed Models

| Model | Handles Unseen N-grams? | Use Case |
|-------|------------------------|----------|
| **MLE** | ❌ Returns 0 probability | When training data is comprehensive |
| **Laplace** | ✅ Adds 1 to all counts | Simple smoothing |
| **Kneser-Ney** | ✅ Advanced smoothing | Production systems |

For learning purposes, MLE is clearest to understand.
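
The difference between the first two rows of the table can be sketched from scratch. The code below assumes add-one (Laplace) smoothing over a bigram model with a closed vocabulary; the function names are hypothetical and this is not NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[prev][word] += 1
    return counts


def mle_prob(counts, word, prev):
    """Unsmoothed estimate: 0 for anything not seen after prev."""
    total = sum(counts[prev].values()) if prev in counts else 0
    return counts[prev][word] / total if total else 0.0


def laplace_prob(counts, vocab, word, prev):
    """Add-one estimate: every vocabulary word gets one extra pseudo-count."""
    context = counts.get(prev, Counter())
    return (context[word] + 1) / (sum(context.values()) + len(vocab))


toy_tokens = "i love nlp i love python".split()
vocab = set(toy_tokens)
counts = train_bigram_counts(toy_tokens)

print("MLE     P(i | love) =", mle_prob(counts, "i", "love"))
print("Laplace P(i | love) =", round(laplace_prob(counts, vocab, "i", "love"), 4))
```

Smoothing takes a little probability from seen continuations and spreads it over unseen ones, so the smoothed distribution still sums to 1.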

In [16]:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Prepare training data
text = gutenberg.raw('austen-emma.txt')[:50000]
sentences = sent_tokenize(text)
tokenized_sents = [word_tokenize(s.lower()) for s in sentences]
tokenized_sents = [[t for t in s if t.isalpha()] for s in tokenized_sents]

# Remove empty sentences
tokenized_sents = [s for s in tokenized_sents if len(s) > 0]

print(f"Number of sentences: {len(tokenized_sents)}")
print(f"Sample: {tokenized_sents[0][:10]}")

Number of sentences: 378
Sample: ['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse']


In [17]:
# Create training data with padding
n = 3  # trigram model
train_data, vocab = padded_everygram_pipeline(n, tokenized_sents)

# Train MLE (Maximum Likelihood Estimation) model
lm = MLE(n)
lm.fit(train_data, vocab)

print(f"Vocabulary size: {len(lm.vocab):,}")

Vocabulary size: 1,650


### What `padded_everygram_pipeline` Does

This function prepares training data by:

1. **Padding each sentence** with `<s>` and `</s>` markers
2. **Generating all n-grams** from 1 to n (hence "everygram")
3. **Building a vocabulary** of all unique words

For a trigram model (n=3), each sentence generates:
- Unigrams: (word1), (word2), ...
- Bigrams: (word1, word2), (word2, word3), ...
- Trigrams: (word1, word2, word3), ...

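Counting every order lets the model score words with shorter contexts too: `lm.score(word, ["mr"])` uses bigram counts even though `n=3`. Note that `MLE` itself does **not** back off when scoring: a trigram that never occurred in training gets probability 0.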
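
The "everygram" idea can be reproduced by hand for a single sentence. This is a minimal sketch under the assumption that the sentence is padded with n-1 markers per side and then every order from 1 to n is extracted; `padded_everygrams` here is a local helper written for illustration, and the order of its output may differ from NLTK's.

```python
def padded_everygrams(tokens, n, start="<s>", end="</s>"):
    """Pad one sentence, then collect all n-grams of order 1..n."""
    padded = [start] * (n - 1) + list(tokens) + [end] * (n - 1)
    grams = []
    for size in range(1, n + 1):
        grams.extend(zip(*(padded[i:] for i in range(size))))
    return grams


grams = padded_everygrams(["I", "love", "NLP"], n=2)
for gram in grams:
    print(gram)
```

With n=2 the padded sentence has 5 tokens, giving 5 unigrams and 4 bigrams.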

In [18]:
# Score some words given context
print("P(word | context)")
print("-" * 40)

contexts = [
    (["she", "was"], "very"),
    (["she", "was"], "not"),
    (["mr"], "knightley"),
    (["mr"], "woodhouse"),
]

for context, word in contexts:
    prob = lm.score(word, context)
    print(f"P({word} | {' '.join(context)}) = {prob:.4f}")

P(word | context)
----------------------------------------
P(very | she was) = 0.0455
P(not | she was) = 0.0455
P(knightley | mr) = 0.0000
P(woodhouse | mr) = 0.0000


### Scoring Words: The `score()` Method

`lm.score(word, context)` returns $P(\text{word} | \text{context})$

For a trigram model with context `["she", "was"]`:
- It looks up: How often does "very" follow "she was"?
- Divides by: How often does "she was" appear?

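`MLE` does not back off: if the word was never seen after this context, or the context itself was never seen, the score is 0. A shorter context such as `["mr"]` is scored from bigram counts, which works because the pipeline also counted the lower-order n-grams.

The zero scores for `mr` above are most likely a preprocessing effect: `word_tokenize` tends to keep the abbreviation as `mr.` with its period, and the `isalpha()` filter then drops that token.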

### Understanding Padding Tokens in Generation

The `padded_everygram_pipeline` adds special boundary markers:
- `<s>` - Start of sentence
- `</s>` - End of sentence

**Why Generation Can Be Short**

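When the model generates without a seed, there is no context to condition on:
1. The first token is sampled from the overall unigram counts, where `<s>` and `</s>` are very frequent
2. Sooner or later the model generates `</s>` (learned as "end of sentence")
3. After `</s>`, the only continuation seen in training is... another `</s>`!

This is because each sentence is padded on its own: with n=3, every training sentence ends in `</s> </s>`, and `</s>` is never followed by `<s>` or a real word. The model gets "stuck" generating end markers, and once those markers are filtered out only a short text remains.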

**Solution**: Use `text_seed` to start with real words, giving the model meaningful context.
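
A quick way to check what can follow `</s>` is to pad a few sentences by hand and record each token's successors. The sketch assumes every sentence is padded independently with n-1 markers per side, which is my understanding of how `pad_both_ends` is applied by the pipeline; `successors_after_padding` is a hypothetical helper.

```python
from collections import defaultdict


def successors_after_padding(sentences, n, start="<s>", end="</s>"):
    """For each token, collect the set of tokens seen directly after it."""
    followers = defaultdict(set)
    for sentence in sentences:
        padded = [start] * (n - 1) + list(sentence) + [end] * (n - 1)
        for prev, word in zip(padded, padded[1:]):
            followers[prev].add(word)
    return followers


toy_sentences = [["she", "was", "happy"], ["he", "was", "not"]]
followers = successors_after_padding(toy_sentences, n=3)
print("after </s>:", followers["</s>"])
print("after <s>: ", sorted(followers["<s>"]))
```

Under this padding scheme the end marker never leads back to a real word, so a generator that reaches it has nowhere else to go.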

### Why Trigram Models Need 2-Word Seeds

For an n-gram model, the context size is **n-1 words**:

| Model | Context Size | Good Seed |
|-------|-------------|-----------|
| Bigram (n=2) | 1 word | `["she"]` |
| Trigram (n=3) | 2 words | `["she", "was"]` |
| 4-gram (n=4) | 3 words | `["she", "was", "very"]` |

With a shorter seed, the first generated word is conditioned on less context than the model can use.

In [19]:
# Generate text using NLTK's model
print("Generated text (NLTK MLE model):")
print("=" * 50)

def generate_clean_text(model, num_words=15, seed=None, text_seed=None):
    """Generate text with a starting word to avoid short outputs"""
    # Use text_seed to start generation with actual words (not <s>)
    raw = model.generate(num_words, text_seed=text_seed, random_seed=seed)
    # Filter out any padding tokens
    cleaned = [word for word in raw if word not in ['<s>', '</s>']]
    return ' '.join(cleaned)

# Use different starting words/phrases for better results
# Using 2-word seeds works better with trigram model
start_words = [
    ['she', 'was'],
    ['the', 'young'],
    ['mr', 'knightley'],
    ['it', 'was'],
    ['emma', 'could']
]

for i, seed_words in enumerate(start_words):
    text = generate_clean_text(lm, num_words=15, seed=i*10, text_seed=seed_words)
    print(f"{i+1}. ({' '.join(seed_words)}...) {text}")

Generated text (NLTK MLE model):
1. (she was...) sure whenever he does not read
2. (the young...) man had made highbury feel a sort of pride and importance which the connexion would
3. (mr knightley...) to mean
4. (it was...) most unlikely therefore that he had made his fortune entirely to make atonement to herself
5. (emma could...) not walk half so far


### Text Generation Deep Dive

The `lm.generate()` method works like this:

```python
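import random

def generate_sketch(lm, num_words, text_seed=(), random_seed=None):
    """Simplified sketch of lm.generate(); not NLTK's actual source."""
    rng = random.Random(random_seed)
    n = lm.order                      # 3 for a trigram model
    generated = []                    # like lm.generate(), the seed is not returned

    for _ in range(num_words):
        # 1. Context = last n-1 words of seed + output so far (the sliding window)
        history = list(text_seed) + generated
        context = history[-(n - 1):] if n > 1 else []

        # 2. Words seen after this context; drop the oldest word if there are none
        candidates = lm.context_counts(lm.vocab.lookup(context))
        while context and not candidates:
            context = context[1:]
            candidates = lm.context_counts(lm.vocab.lookup(context))

        # 3. Sample the next word, weighted by P(word | context)
        words = sorted(candidates)
        weights = [lm.score(w, context) for w in words]
        generated.append(rng.choices(words, weights=weights)[0])

    return generated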
```

**Key points**:
1. **Context window slides**: Only the last n-1 words matter
2. **Random sampling**: The same `text_seed` gives different outputs each run (unless `random_seed` is set)
3. **Probability-weighted**: Common continuations are chosen more often

## 10.8 Practical: N-Gram Text Analysis

Let's put it all together with a reusable analysis function.

In [20]:
def analyze_ngrams(text, n=2, top_k=10, remove_stopwords=True):
    """Comprehensive n-gram analysis"""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha()]
    
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stop_words]
    
    # Generate n-grams
    grams = list(ngrams(tokens, n))
    freq = Counter(grams)
    
    return {
        'total_ngrams': len(grams),
        'unique_ngrams': len(freq),
        'top_ngrams': freq.most_common(top_k),
    }

In [21]:
# Analyze a text
text = """Machine learning is a subset of artificial intelligence.
Machine learning enables computers to learn from data.
Deep learning is a subset of machine learning.
Natural language processing uses machine learning.
Machine learning models can process natural language."""

print(f"Text:\n{text}\n")
print("=" * 50)

for n in [1, 2, 3]:
    result = analyze_ngrams(text, n=n, remove_stopwords=True)
    
    print(f"\n{n}-grams Analysis:")
    print(f"  Total: {result['total_ngrams']}")
    print(f"  Unique: {result['unique_ngrams']}")
    print(f"  Top {n}-grams:")
    for gram, count in result['top_ngrams']:
        print(f"    {' '.join(gram)}: {count}")

Text:
Machine learning is a subset of artificial intelligence.
Machine learning enables computers to learn from data.
Deep learning is a subset of machine learning.
Natural language processing uses machine learning.
Machine learning models can process natural language.


1-grams Analysis:
  Total: 28
  Unique: 16
  Top 1-grams:
    learning: 6
    machine: 5
    subset: 2
    natural: 2
    language: 2
    artificial: 1
    intelligence: 1
    enables: 1
    computers: 1
    learn: 1

2-grams Analysis:
  Total: 27
  Unique: 21
  Top 2-grams:
    machine learning: 5
    learning subset: 2
    natural language: 2
    subset artificial: 1
    artificial intelligence: 1
    intelligence machine: 1
    learning enables: 1
    enables computers: 1
    computers learn: 1
    learn data: 1

3-grams Analysis:
  Total: 26
  Unique: 26
  Top 3-grams:
    machine learning subset: 1
    learning subset artificial: 1
    subset artificial intelligence: 1
    artificial intelligence machine: 1
    in

## Summary

### N-Gram Functions

| Function | Description |
|----------|-------------|
| `ngrams(tokens, n)` | Generate n-grams of any size |
| `bigrams(tokens)` | Shortcut for `ngrams(tokens, 2)` |
| `trigrams(tokens)` | Shortcut for `ngrams(tokens, 3)` |
| `pad_both_ends(tokens, n)` | Add `<s>` and `</s>` markers |
| `padded_everygram_pipeline(n, sents)` | Full preprocessing for LM training |

### Collocation Finders

| Class | Use |
|-------|-----|
| `BigramCollocationFinder` | Find significant 2-word phrases |
| `TrigramCollocationFinder` | Find significant 3-word phrases |

### Collocation Measures

| Measure | Formula Intuition | Best For |
|---------|-------------------|----------|
| **PMI** | $\log_2 \frac{P(x,y)}{P(x)P(y)}$ | Rare meaningful phrases |
| **Chi-Square** | Statistical test for independence | Technical terms |
| **Likelihood Ratio** | Ratio of hypotheses | Balanced results |

### Language Model Key Concepts

| Concept | Explanation |
|---------|-------------|
| **Markov Assumption** | Next word depends only on previous n-1 words |
| **MLE** | Estimate P(word\|context) by counting |
| **Padding** | `<s>` and `</s>` tokens mark sentence boundaries |
| **Context Window** | The n-1 words used to predict the next word |
| **Backoff** | Fall back to smaller n-grams if exact match not found (NLTK's `MLE` does not do this when scoring; an unseen n-gram scores 0) |

### Text Generation Process

1. **Initialize** with seed words (n-1 words for n-gram model)
2. **Look up** probability distribution for next word given context
3. **Sample** randomly from distribution (weighted by probability)
4. **Slide** context window and repeat

### Limitations of N-Gram Models

| Limitation | Why It Happens |
|------------|----------------|
| **Short memory** | Only sees n-1 previous words |
| **Data sparsity** | Many valid n-grams never seen in training |
| **No semantics** | Treats words as arbitrary symbols |
| **Large storage** | Must store all observed n-grams |

### When to Use N-Grams

✅ **Good for**: Autocomplete, spell checking, simple text generation, keyphrase extraction, language identification

❌ **Not ideal for**: Long-form coherent text, understanding meaning, handling rare words

### Next Steps

- **Smoothing techniques**: Laplace, Good-Turing, Kneser-Ney
- **Neural approaches**: Word2Vec embeddings, LSTM and Transformer language models (GPT)
- **Perplexity**: Evaluating language model quality