# Lab 8.5: N-Gram Language Model Implementation and Evaluation (Perplexity)

---

## Learning Objectives
- Understand the concept of language modeling in NLP  
- Build Unigram, Bigram, and Trigram language models  
- Apply smoothing techniques (Add-one / Laplace)  
- Calculate probabilities of sentences using N-gram models  
- Evaluate models using Perplexity  
- Compare performance across models  


## STEP 1 — Create a new notebook/script

Open Jupyter Notebook or Google Colab, create a new notebook, and give it a descriptive file name.


## STEP 2 — Import required libraries

- **re**: cleaning text (remove punctuation/numbers)  
- **math**: probability and perplexity computations  
- **Counter, defaultdict**: counting n-grams efficiently  
- **numpy**: numerical operations  
- **pandas**: creating frequency/probability tables  


In [1]:
import re
import math
from collections import Counter, defaultdict

import numpy as np
import pandas as pd


## STEP 3 — Load dataset

### Dataset Description (5–6 lines)

The dataset is a fictional story corpus of about **770 words** (772 whitespace-separated tokens).  
It describes a cartographer and a sailor who find a mysterious island and a hidden magical library.  
The story contains multiple sentences with common vocabulary and repeated patterns.  
This makes it suitable for training unigram, bigram, and trigram models.  
The dataset is used as a training corpus for language modeling tasks.  
We use this corpus to compute probabilities and perplexity.



In [2]:
corpus_text = """Once upon a time, in a quiet coastal town, there lived a young cartographer named Arin.
Arin spent his days drawing maps of the sea, measuring the distance between islands, and
listening to sailors describe storms that appeared without warning. The town was small,
but the harbor was busy, and every week new ships arrived with stories from faraway lands.

One evening, an old fisherman gave Arin a weathered journal wrapped in cloth.
The fisherman claimed the journal belonged to a captain who had vanished decades ago.
Its pages were filled with sketches of stars, strange symbols, and routes that did not match
any known chart. Arin felt a thrill of curiosity as he turned the pages, noticing a repeated
mark shaped like a crescent moon.

That night, Arin could not sleep. The journal described an island that appeared only during
certain tides. It spoke of a lighthouse that never went dark and a library hidden beneath its stones.
The next morning, Arin packed food, fresh ink, and his best compass. He asked his friend Mira,
a brave sailor with quick hands, to join him. Mira laughed and agreed, because she loved mysteries
more than calm routines.

They sailed at dawn when the sky was pale and the water was smooth. As they moved away from the harbor,
the wind changed direction, and the sea grew colder. The journal instructed them to follow a line
between two distant cliffs until the clouds formed a circle. Hours passed, and Arin worried the
instructions were only myths. But suddenly the mist parted, revealing dark rocks rising from the waves.

The island was silent. No birds flew above it, and no footprints marked the sand.
At the center stood a tall lighthouse made of black stone. Its light shone even in daylight,
spinning slowly like a watchful eye. Arin and Mira climbed the stairs and found a heavy door
covered with salt. It opened with a groan, and the air inside smelled of dust and old paper.

Below the lighthouse, they discovered a staircase leading underground. The steps were narrow,
and the walls were carved with the same crescent symbol from the journal. At the bottom, a chamber
expanded into a vast room lined with shelves. Thousands of books rested there, untouched by water
or time. A single lantern floated near the ceiling, glowing without flame.

Mira reached for a book and felt the cover warm beneath her fingers.
When she opened it, the pages were blank, but as Arin spoke aloud, words appeared like ink forming
from invisible hands. The room listened. Every sentence they whispered became part of the library.
Arin realized the place collected language itself, preserving stories that would otherwise be forgotten.

They tested the magic carefully. Mira told a tale about a storm that sang like a choir, and the book
captured it perfectly. Arin described the harbor, the smell of fish, and the sound of bells at noon.
The library accepted every detail. Yet the journal warned them: if they spoke a lie, the shelves would
shake and the sea would swallow the island.

Arin wondered why such a library existed. He found a map carved into the floor, showing currents and
paths that connected every coast. At the center of the map was the crescent symbol.
It was not only a mark; it was a key. The journal explained that the library was built by travelers
who believed that words guided ships as much as stars did.

Before leaving, Arin wrote a promise in his notebook: he would return to the town and teach others
to value stories. Mira, however, hesitated. She looked at the endless shelves and said softly that
some stories were dangerous. Arin agreed, but he also believed that silence could be more dangerous.

When the tide began to rise, they climbed back to the surface. The lighthouse beam pointed toward the sea,
and the air felt warmer. They sailed home quickly, watching the island fade into mist behind them.
In the harbor, the world looked ordinary again, but Arin carried the journal close to his chest.

Days later, Arin started a small school for sailors. He taught them how to read maps and how to listen
to one another. He collected legends and recorded them carefully. Sometimes, at night, he stared at the
moon and wondered if the library was still listening, waiting for new voices.

Mira continued sailing, but she often returned with fresh stories.
Together they made sure the town never forgot the power of language.
And far beyond the horizon, the lighthouse kept turning, guarding the hidden books beneath the waves."""

print("Total words in dataset:", len(corpus_text.split()))
print("\nSample Text (first 500 characters):\n")
print(corpus_text[:500])


Total words in dataset: 772

Sample Text (first 500 characters):

Once upon a time, in a quiet coastal town, there lived a young cartographer named Arin.
Arin spent his days drawing maps of the sea, measuring the distance between islands, and
listening to sailors describe storms that appeared without warning. The town was small,
but the harbor was busy, and every week new ships arrived with stories from faraway lands.

One evening, an old fisherman gave Arin a weathered journal wrapped in cloth.
The fisherman claimed the journal belonged to a captain who h


## STEP 4 — Preprocess Text

Steps:
1. Convert to lowercase  
2. Remove numbers and punctuation, keeping periods as sentence delimiters  
3. Split into sentences  
4. Tokenize into words  
5. Add `<s>` and `</s>` sentence boundary tokens  


In [3]:
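def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)            # drop digits
    text = re.sub(r'[^a-z\s\.]', ' ', text)    # keep only letters, whitespace, and periods
    text = re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
    return text

def split_into_sentences(text):
    # Periods are the only sentence delimiter that survives clean_text
    return [s.strip() for s in text.split('.') if s.strip()]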

def tokenize_sentence(sentence):
    return sentence.split()

def preprocess_corpus(text):
    cleaned = clean_text(text)
    sentences = split_into_sentences(cleaned)
    tokenized_sentences = []
    for sent in sentences:
        tokens = tokenize_sentence(sent)
        tokens = ["<s>"] + tokens + ["</s>"]
        tokenized_sentences.append(tokens)
    return tokenized_sentences

tokenized_sentences = preprocess_corpus(corpus_text)

print("Total sentences:", len(tokenized_sentences))
print("Example tokens:", tokenized_sentences[0])


Total sentences: 59
Example tokens: ['<s>', 'once', 'upon', 'a', 'time', 'in', 'a', 'quiet', 'coastal', 'town', 'there', 'lived', 'a', 'young', 'cartographer', 'named', 'arin', '</s>']
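
One limitation of the preprocessing above is that `clean_text` replaces `?` and `!` with spaces, so only periods end a sentence. The story corpus uses periods throughout, so this does not matter here. For other text, a variant along the following lines might help; the function name `split_sentences_any_terminator` and the example string are hypothetical and not part of the lab.

```python
import re

def split_sentences_any_terminator(text):
    """Hypothetical variant: treat runs of '.', '!' or '?' as sentence ends."""
    parts = re.split(r"[.!?]+", text.lower())
    # Keep letters only, then tokenize on whitespace
    tokenized = [re.sub(r"[^a-z\s]", " ", part).split() for part in parts]
    # Skip empty fragments and add the boundary tokens used in this lab
    return [["<s>"] + tokens + ["</s>"] for tokens in tokenized if tokens]

example = "Where is the island? Arin checked the map. Land!"
sentences = split_sentences_any_terminator(example)
for sentence in sentences:
    print(sentence)
```

With this variant the example yields three padded sentences. With the lab's functions, the question mark is erased before splitting, so the first two sentences merge into one.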


## STEP 5 — Build N-Gram Models

We build:
- Unigram model
- Bigram model
- Trigram model


In [4]:
def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

all_tokens = [token for sent in tokenized_sentences for token in sent]

unigram_counts = Counter(all_tokens)

bigram_counts = Counter()
for sent in tokenized_sentences:
    bigram_counts.update(get_ngrams(sent, 2))

trigram_counts = Counter()
for sent in tokenized_sentences:
    trigram_counts.update(get_ngrams(sent, 3))

vocab = set(all_tokens)
V = len(vocab)

print("Vocabulary size:", V)
print("Total unigram tokens:", sum(unigram_counts.values()))


Vocabulary size: 389
Total unigram tokens: 890
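
To see what `get_ngrams` produces, here is a small standalone check on one padded sentence. It repeats the function from the cell above so that it runs on its own; the token list is just an illustrative example.

```python
def get_ngrams(tokens, n):
    # Slide a window of width n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["<s>", "the", "island", "was", "silent", "</s>"]
bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)

print(bigrams[:2])                  # [('<s>', 'the'), ('the', 'island')]
print(len(bigrams), len(trigrams))  # 5 4
```

A sentence of length L yields L - n + 1 n-grams, which is why the trigram model has one fewer factor per sentence than the bigram model.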


In [5]:
pd.DataFrame(unigram_counts.most_common(10), columns=["Word", "Count"])

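|   | Word | Count |
|---|------|-------|
| 0 | the | 71 |
| 1 | `<s>` | 59 |
| 2 | `</s>` | 59 |
| 3 | a | 32 |
| 4 | and | 27 |
| 5 | arin | 16 |
| 6 | of | 12 |
| 7 | to | 12 |
| 8 | that | 12 |
| 9 | was | 10 |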


In [6]:
pd.DataFrame(bigram_counts.most_common(10), columns=["Bigram", "Count"])

|   | Bigram | Count |
|---|--------|-------|
| 0 | `(<s>, the)` | 11 |
| 1 | `(and, the)` | 8 |
| 2 | `(<s>, arin)` | 7 |
| 3 | `(the, journal)` | 7 |
| 4 | `(<s>, mira)` | 5 |
| 5 | `(at, the)` | 5 |
| 6 | `(the, sea)` | 4 |
| 7 | `(the, harbor)` | 4 |
| 8 | `(<s>, he)` | 4 |
| 9 | `(the, library)` | 4 |


In [7]:
pd.DataFrame(trigram_counts.most_common(10), columns=["Trigram", "Count"])

|   | Trigram | Count |
|---|---------|-------|
| 0 | `(<s>, the, journal)` | 3 |
| 1 | `(the, harbor, the)` | 3 |
| 2 | `(<s>, at, the)` | 3 |
| 3 | `(<s>, they, sailed)` | 2 |
| 4 | `(and, the, sea)` | 2 |
| 5 | `(the, waves, </s>)` | 2 |
| 6 | `(at, the, center)` | 2 |
| 7 | `(and, the, air)` | 2 |
| 8 | `(the, library, was)` | 2 |
| 9 | `(<s>, once, upon)` | 1 |


## STEP 6 — Apply Smoothing (Laplace)

Without smoothing, any unseen n-gram gets probability 0, which makes the probability of the whole sentence 0.  
Laplace (add-one) smoothing adds 1 to every n-gram count and adds V, the vocabulary size, to the denominator, so every word sequence receives a non-zero probability.  
This keeps the model usable on test sentences that contain unseen n-grams.
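
In symbols, the add-one estimates implemented in this step can be written as follows. Here $C(\cdot)$ is a training count, $N$ is the total number of training tokens, and $V$ is the vocabulary size. This notation is introduced only for illustration; it mirrors the arguments of `unigram_prob`, `bigram_prob`, and `trigram_prob`.

$$
\begin{aligned}
P(w_i) &= \frac{C(w_i) + 1}{N + V} \\
P(w_i \mid w_{i-1}) &= \frac{C(w_{i-1}\, w_i) + 1}{C(w_{i-1}) + V} \\
P(w_i \mid w_{i-2}\, w_{i-1}) &= \frac{C(w_{i-2}\, w_{i-1}\, w_i) + 1}{C(w_{i-2}\, w_{i-1}) + V}
\end{aligned}
$$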


In [8]:
def unigram_prob(word, unigram_counts, total_tokens, V):
    return (unigram_counts[word] + 1) / (total_tokens + V)

def bigram_prob(w1, w2, bigram_counts, unigram_counts, V):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

def trigram_prob(w1, w2, w3, trigram_counts, bigram_counts, V):
    return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + V)
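
As a worked example of the bigram formula, the sketch below plugs in counts reported earlier in this lab: "the" occurs 71 times, the bigram ("the", "journal") occurs 7 times, and V = 389. The variable names are illustrative only.

```python
V = 389                 # vocabulary size reported in STEP 5
count_the = 71          # unigram count of "the"
count_the_journal = 7   # bigram count of ("the", "journal")

# Unsmoothed (maximum-likelihood) estimate of P(journal | the)
mle = count_the_journal / count_the
# Laplace-smoothed estimate, same arithmetic as bigram_prob
laplace = (count_the_journal + 1) / (count_the + V)
# Smoothed estimate for a word never seen after "the"
unseen = (0 + 1) / (count_the + V)

print(round(mle, 4), round(laplace, 4), round(unseen, 4))  # 0.0986 0.0174 0.0022
```

Because V is large relative to the counts in this small corpus, add-one smoothing moves a lot of probability mass away from seen bigrams: the estimate for "journal" after "the" drops from about 0.099 to about 0.017.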


## STEP 7 — Sentence Probability Calculation

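We compute the probability of five test sentences under the unigram, bigram, and trigram models. Each sentence probability is the product of the smoothed probabilities of its tokens (unigram) or of its consecutive n-grams (bigram and trigram).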


In [9]:
test_sentences = [
    "arin spent his days drawing maps of the sea",
    "the island was silent and no birds flew above it",
    "the library accepted every detail",
    "mira continued sailing but she often returned",
    "the lighthouse kept turning guarding the hidden books"
]

def preprocess_sentence(sentence):
    sentence = clean_text(sentence)
    tokens = tokenize_sentence(sentence)
    return ["<s>"] + tokens + ["</s>"]

def sentence_probability_unigram(tokens):
    total = sum(unigram_counts.values())
    prob = 1.0
    for w in tokens:
        prob *= unigram_prob(w, unigram_counts, total, V)
    return prob

def sentence_probability_bigram(tokens):
    prob = 1.0
    for i in range(len(tokens)-1):
        prob *= bigram_prob(tokens[i], tokens[i+1], bigram_counts, unigram_counts, V)
    return prob

def sentence_probability_trigram(tokens):
    prob = 1.0
    for i in range(len(tokens)-2):
        prob *= trigram_prob(tokens[i], tokens[i+1], tokens[i+2], trigram_counts, bigram_counts, V)
    return prob

rows = []
for sent in test_sentences:
    tokens = preprocess_sentence(sent)
    rows.append({
        "Sentence": sent,
        "Unigram Probability": sentence_probability_unigram(tokens),
        "Bigram Probability": sentence_probability_bigram(tokens),
        "Trigram Probability": sentence_probability_trigram(tokens)
    })

prob_df = pd.DataFrame(rows)
prob_df


|   | Sentence | Unigram Probability | Bigram Probability | Trigram Probability |
|---|----------|---------------------|--------------------|---------------------|
| 0 | arin spent his days drawing maps of the sea | 4.129287e-24 | 8.497540e-23 | 1.192555e-21 |
| 1 | the island was silent and no birds flew above it | 7.999087e-27 | 1.275928e-25 | 7.645961e-25 |
| 2 | the library accepted every detail | 6.481406e-15 | 1.972952e-13 | 3.431655e-12 |
| 3 | mira continued sailing but she often returned | 1.760944e-21 | 6.027122e-19 | 4.616322e-17 |
| 4 | the lighthouse kept turning guarding the hidde... | 6.882493e-22 | 9.069513e-21 | 2.319937e-19 |
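
The probabilities above are already as small as 1e-27 for a ten-word sentence. For longer texts, a product of this kind can underflow to 0.0 in floating point, so a common alternative is to sum log probabilities instead. The sketch below illustrates the effect with made-up per-token probabilities; the helper `log_sentence_probability` is hypothetical and not part of the lab code.

```python
import math

def log_sentence_probability(cond_probs):
    """Sum of natural-log probabilities, one entry per predicted token."""
    return sum(math.log(p) for p in cond_probs)

# Made-up example: 400 tokens, each assigned probability 0.001
cond_probs = [1e-3] * 400

product = math.prod(cond_probs)                 # underflows to 0.0
log_prob = log_sentence_probability(cond_probs)

print(product)    # 0.0
print(log_prob)   # about -2763.1
```

The log probability stays well within floating-point range and can be passed directly to a log-space perplexity computation.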


## STEP 8 — Perplexity Calculation

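Perplexity measures how well a language model predicts a sentence: it is the inverse of the sentence probability, normalized by the number of tokens N (computed below as `prob ** (-1/N)`).  
Lower perplexity means the model predicts the sentence better.  
Note that the code uses N = `len(tokens)`, including `<s>` and `</s>`, for all three models, although the bigram and trigram products contain only N-1 and N-2 factors; this slightly favors the higher-order models in the comparison.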
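
Written out, with $W = w_1 \dots w_N$ and each token predicted from whatever context the model uses (the symbol $\text{context}_i$ is shorthand introduced here), the definition is:

$$
PP(W) = P(w_1 \dots w_N)^{-1/N} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid \text{context}_i)\right)
$$

The second form is equivalent when the product has exactly $N$ factors, and it is usually the numerically safer way to compute perplexity.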


In [10]:
def perplexity(prob, N):
    return prob ** (-1/N)

rows = []
for sent in test_sentences:
    tokens = preprocess_sentence(sent)
    N = len(tokens)
    p1 = sentence_probability_unigram(tokens)
    p2 = sentence_probability_bigram(tokens)
    p3 = sentence_probability_trigram(tokens)

    rows.append({
        "Sentence": sent,
        "Unigram Perplexity": perplexity(p1, N),
        "Bigram Perplexity": perplexity(p2, N),
        "Trigram Perplexity": perplexity(p3, N)
    })

perplexity_df = pd.DataFrame(rows)
perplexity_df


|   | Sentence | Unigram Perplexity | Bigram Perplexity | Trigram Perplexity |
|---|----------|--------------------|-------------------|--------------------|
| 0 | arin spent his days drawing maps of the sea | 133.607101 | 101.491084 | 79.824888 |
| 1 | the island was silent and no birds flew above it | 149.536301 | 118.717429 | 102.261932 |
| 2 | the library accepted every detail | 106.390879 | 65.310552 | 43.429537 |
| 3 | mira continued sailing but she often returned | 202.315085 | 105.786982 | 65.324818 |
| 4 | the lighthouse kept turning guarding the hidde... | 130.684903 | 100.981450 | 73.021773 |
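
A possible refinement, sketched below under stated assumptions, is to compute perplexity in log space and to normalize by the number of predictions each model actually makes rather than by `len(tokens)`. The helpers `perplexity_from_log_prob` and `count_predictions` are hypothetical names, and the padding convention (one `<s>` and one `</s>` per sentence) is the one used in this lab. Applying this to the lab's models would change the numbers in the table above.

```python
import math

def perplexity_from_log_prob(log_prob, n_predictions):
    """Perplexity = exp(-log P / n), with n the number of predicted tokens."""
    return math.exp(-log_prob / n_predictions)

def count_predictions(tokens, order):
    """Number of factors in an order-n product over one padded sentence.

    Assumes a single <s> and </s>, as in this lab: the unigram product has
    len(tokens) factors, the bigram len(tokens) - 1, the trigram len(tokens) - 2.
    """
    return len(tokens) - (order - 1)

V = 389  # vocabulary size reported in STEP 5
tokens = ["<s>", "the", "library", "accepted", "every", "detail", "</s>"]

# Sanity check: a model that assigns 1/V to every prediction has perplexity V
n_bigram = count_predictions(tokens, 2)
log_prob_uniform = n_bigram * math.log(1 / V)
pp_uniform = perplexity_from_log_prob(log_prob_uniform, n_bigram)

print(n_bigram, round(pp_uniform, 3))  # 6 389.0
```

The uniform-model check gives a useful reference point: any perplexity below V = 389 means the model does better than guessing uniformly over the vocabulary.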


## STEP 9 — Comparison and Analysis (8–10 sentences)

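- The unigram model has the highest perplexity on all five sentences because it ignores word order and context.  
- The bigram model lowers perplexity on every sentence because it conditions each word on the one before it.  
- The trigram model scores lowest in the table above, which reflects its extra context; the five test sentences are also drawn almost verbatim from the training corpus, so most of their trigrams were seen during training.  
- On held-out text, a trigram model trained on a corpus this small is likely to suffer from sparsity, because most three-word sequences never occur in training.  
- Unseen trigrams still receive very small probabilities after smoothing: with V = 389, an unseen trigram whose context occurred once gets 1/390 ≈ 0.0026.  
- For that reason, a bigram model often gives the best balance between context and available data on small corpora.  
- Laplace smoothing ensures that unseen sequences do not get zero probability.  
- Without smoothing, a single unseen n-gram would make the sentence probability 0 and the perplexity infinite.  
- Overall, lower perplexity indicates a model that predicts the text better.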
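
The last two points can be checked on a toy example. The sketch below uses a two-sentence training set invented for illustration and a hypothetical helper `bigram_perplexity` with an add-k parameter, where k = 0 means no smoothing and k = 1 is Laplace smoothing. It is not the lab's implementation, and it normalizes by the number of bigram predictions.

```python
import math
from collections import Counter

# Toy training data, invented for this illustration
train = [
    ["<s>", "the", "sea", "was", "calm", "</s>"],
    ["<s>", "the", "harbor", "was", "busy", "</s>"],
]
unigrams = Counter(tok for sent in train for tok in sent)
bigrams = Counter(
    (sent[i], sent[i + 1]) for sent in train for i in range(len(sent) - 1)
)
V = len(unigrams)

def bigram_perplexity(tokens, k):
    """Bigram perplexity with add-k smoothing (k=0: unsmoothed, k=1: Laplace)."""
    log_prob = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        numerator = bigrams[(w1, w2)] + k
        if numerator == 0:
            return math.inf  # zero probability makes perplexity infinite
        log_prob += math.log(numerator / (unigrams[w1] + k * V))
    return math.exp(-log_prob / (len(tokens) - 1))

# ("the", "calm") and ("calm", "sea") never occur in the training data
test_tokens = ["<s>", "the", "calm", "sea", "</s>"]
pp_unsmoothed = bigram_perplexity(test_tokens, k=0)
pp_laplace = bigram_perplexity(test_tokens, k=1)

print(pp_unsmoothed)             # inf
print(math.isfinite(pp_laplace)) # True
```

A single unseen bigram is enough to drive the unsmoothed perplexity to infinity, while the smoothed model still returns a finite (if large) value.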
