

# Lab 8: N-Gram Language Models  
## Objective
To implement Unigram, Bigram, and Trigram language models, apply smoothing, calculate sentence probabilities, and evaluate models using perplexity.



## STEP 2 — Import Required Libraries

- **re**: text cleaning (remove punctuation and numbers)  
- **nltk**: sentence splitting and tokenization  
- **collections.Counter**: counting N-grams  
- **math**: probability and perplexity calculation  


In [5]:
import re
import math
import nltk
from collections import Counter

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True


## STEP 3 — Dataset Description

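The dataset is a manually written English corpus of ten sentences on technology, health, politics, and general topics, repeated 40 times to give about 2,700 words.
Because every sentence is duplicated, the corpus contains fewer than 60 distinct words, so the counts are large but the models see very little variety.
The copies are concatenated without a separator, so the last word of one copy fuses with the first word of the next (`access.Artificial`).
The data is split by word position into 80% training and 20% testing.
Perplexity on this corpus illustrates the method rather than measuring real-world model quality.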


In [6]:

corpus = """Artificial intelligence is transforming modern technology.
Machine learning models are widely used in healthcare and finance.
Governments invest in digital infrastructure and innovation.
Healthcare systems rely on data to improve patient outcomes.
Technology companies focus on research and development.
Education systems adopt online learning platforms.
Politics influences economic and social policies.
Renewable energy supports sustainable development.
Scientists continuously improve medical treatments.
The internet has changed communication and information access.""" * 40

print(corpus[:300])


Artificial intelligence is transforming modern technology.
Machine learning models are widely used in healthcare and finance.
Governments invest in digital infrastructure and innovation.
Healthcare systems rely on data to improve patient outcomes.
Technology companies focus on research and developme



### Train-Test Split (80% / 20%)


In [7]:

words = corpus.split()
split_index = int(0.8 * len(words))

train_text = " ".join(words[:split_index])
test_text = " ".join(words[split_index:])

print("Training words:", len(train_text.split()))
print("Testing words:", len(test_text.split()))


Training words: 2176
Testing words: 545
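
The corpus string is multiplied by 40 with no trailing newline, and the split above cuts by word position, so the test text can begin mid-sentence. The following is a minimal sketch, not the notebook's method, of joining the copies with an explicit separator and splitting on sentence boundaries instead. The names `base`, `fused`, and `separated` are hypothetical, only two corpus sentences are used, and a simple regex stands in for `nltk.sent_tokenize` so the example stays self-contained.

```python
import re

# Two of the corpus sentences, used here as a small stand-in for the full text.
base = (
    "Artificial intelligence is transforming modern technology.\n"
    "Machine learning models are widely used in healthcare and finance."
)

fused = base * 3                   # original pattern: copies run together
separated = "\n".join([base] * 3)  # explicit newline between copies

# Naive sentence splitter (assumption: every sentence ends with a period
# followed by whitespace).
sentences = [s for s in re.split(r"(?<=\.)\s+", separated) if s]

# Split on sentence boundaries so the test set never starts mid-sentence.
split_index = int(0.8 * len(sentences))
train_sentences = sentences[:split_index]
test_sentences = sentences[split_index:]

print("finance.Artificial" in fused)      # fused word is present
print("finance.Artificial" in separated)  # fused word is gone
print(len(train_sentences), len(test_sentences))
```

With a separator in place, the first word of each copy keeps its own count instead of being merged into a fused token.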



## STEP 4 — Text Preprocessing

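Steps:
1. Convert text to lowercase  
2. Remove punctuation and numbers  
3. Tokenize sentences  
4. Add `<s>` and `</s>` sentence boundary tokens  

Note: step 2 removes the periods before step 3 runs, so `nltk.sent_tokenize` finds no sentence boundaries and treats the whole text as one sentence. The token stream therefore contains a single `<s>` at the start and a single `</s>` at the end, as the output below shows.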


In [8]:

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    sentences = nltk.sent_tokenize(text)
    tokens = []
    for sent in sentences:
        tokens.extend(['<s>'] + sent.split() + ['</s>'])
    return tokens

train_tokens = preprocess(train_text)
test_tokens = preprocess(test_text)

train_tokens[:20]


['<s>',
 'artificial',
 'intelligence',
 'is',
 'transforming',
 'modern',
 'technology',
 'machine',
 'learning',
 'models',
 'are',
 'widely',
 'used',
 'in',
 'healthcare',
 'and',
 'finance',
 'governments',
 'invest',
 'in']
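
The output above has no `</s>` between `technology` and `machine`, because the periods were stripped before sentence splitting. A possible variant, sketched here under assumptions, splits into sentences first and cleans each one afterwards. The function name `preprocess_by_sentence` is hypothetical, and a regex splitter replaces `nltk.sent_tokenize` to keep the example self-contained; either splitter could be used at that step.

```python
import re

def preprocess_by_sentence(text):
    """Split into sentences first, then clean each sentence separately."""
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = []
    for sent in raw_sentences:
        # Lowercase, then drop everything except letters and whitespace.
        cleaned = re.sub(r"[^a-z\s]", "", sent.lower())
        words = cleaned.split()
        if words:  # skip fragments that were only punctuation or digits
            tokens.extend(["<s>"] + words + ["</s>"])
    return tokens

sample = (
    "Artificial intelligence is transforming modern technology. "
    "Machine learning models are widely used."
)
sample_tokens = preprocess_by_sentence(sample)
print(sample_tokens)
```

Each sentence now carries its own boundary markers, so bigrams such as `('<s>', 'machine')` and `('technology', '</s>')` can be counted.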


## STEP 5 — Build N-Gram Models


In [9]:

def build_ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1))

unigrams = build_ngrams(train_tokens, 1)
bigrams = build_ngrams(train_tokens, 2)
trigrams = build_ngrams(train_tokens, 3)

print("Unigram sample:", list(unigrams.items())[:5])
print("Bigram sample:", list(bigrams.items())[:5])
print("Trigram sample:", list(trigrams.items())[:5])


Unigram sample: [(('<s>',), 1), (('artificial',), 1), (('intelligence',), 32), (('is',), 32), (('transforming',), 32)]
Bigram sample: [(('<s>', 'artificial'), 1), (('artificial', 'intelligence'), 1), (('intelligence', 'is'), 32), (('is', 'transforming'), 32), (('transforming', 'modern'), 32)]
Trigram sample: [(('<s>', 'artificial', 'intelligence'), 1), (('artificial', 'intelligence', 'is'), 1), (('intelligence', 'is', 'transforming'), 32), (('is', 'transforming', 'modern'), 32), (('transforming', 'modern', 'technology'), 32)]



## STEP 6 — Add-One (Laplace) Smoothing

Smoothing is required to handle unseen words or N-grams.
Without smoothing, unseen N-grams receive zero probability.
Add-one smoothing adds 1 to every N-gram count and the vocabulary size to the denominator, so unseen N-grams receive a small non-zero probability.
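
For reference, the add-one estimates can be written as follows. Here $C(\cdot)$ is a training count, $N$ is the total number of training tokens (`total_unigrams`), and $V$ is the vocabulary size (`vocab_size`). The symbols are notation chosen for this note; the formulas match the expressions used in the Step 7 code.

$$P_{\text{add-1}}(w_i) = \frac{C(w_i) + 1}{N + V}$$

$$P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i) + 1}{C(w_{i-1}) + V}$$

$$P_{\text{add-1}}(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i) + 1}{C(w_{i-2}\, w_{i-1}) + V}$$

When both the N-gram and its context are unseen, the estimate reduces to $1/V$.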


In [10]:

vocab_size = len(unigrams)
total_unigrams = sum(unigrams.values())



## STEP 7 — Sentence Probability Calculation


In [11]:

def unigram_probability(sentence):
    # Product of add-one smoothed unigram probabilities
    p = 1
    for w in sentence:
        p *= (unigrams.get((w,), 0) + 1) / (total_unigrams + vocab_size)
    return p

def bigram_probability(sentence):
    # Product of P(w_i | w_{i-1}) with add-one smoothing
    p = 1
    for i in range(len(sentence)-1):
        p *= ((bigrams.get((sentence[i], sentence[i+1]), 0) + 1) /
              (unigrams.get((sentence[i],), 0) + vocab_size))
    return p

def trigram_probability(sentence):
    # Product of P(w_i | w_{i-2}, w_{i-1}) with add-one smoothing
    p = 1
    for i in range(len(sentence)-2):
        p *= ((trigrams.get((sentence[i], sentence[i+1], sentence[i+2]), 0) + 1) /
              (bigrams.get((sentence[i], sentence[i+1]), 0) + vocab_size))
    return p

sentences = [
    ['<s>', 'artificial', 'intelligence', 'is', 'important', '</s>'],
    ['<s>', 'technology', 'drives', 'innovation', '</s>'],
    ['<s>', 'healthcare', 'relies', 'on', 'data', '</s>'],
    ['<s>', 'education', 'uses', 'online', 'platforms', '</s>'],
    ['<s>', 'politics', 'shapes', 'policy', '</s>']
]

for s in sentences:
    print("Sentence:", " ".join(s))
    print("Unigram:", unigram_probability(s))
    print("Bigram:", bigram_probability(s))
    print("Trigram:", trigram_probability(s))
    print()


Sentence: <s> artificial intelligence is important </s>
Unigram: 6.952195274743246e-17
Bigram: 7.504756992557328e-08
Trigram: 2.069493594917324e-07

Sentence: <s> technology drives innovation </s>
Unigram: 1.5316423544303666e-13
Bigram: 2.523772676728444e-08
Trigram: 4.869046981434324e-06

Sentence: <s> healthcare relies on data </s>
Unigram: 4.4504583387560946e-15
Bigram: 6.771097425368996e-09
Trigram: 5.3506010784992574e-08

Sentence: <s> education uses online platforms </s>
Unigram: 1.1471122203326356e-15
Bigram: 3.7486298664122523e-10
Trigram: 8.252622002431058e-08

Sentence: <s> politics shapes policy </s>
Unigram: 2.356372852969795e-15
Bigram: 5.2614243938576025e-08
Trigram: 4.869046981434324e-06
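
The probabilities above are already as small as 1e-17 for five-word sentences, and a raw product over a much longer text can underflow to zero. A common workaround is to sum log-probabilities instead. The sketch below shows this for the bigram model on a toy token list. The names `bigram_log_probability` and `toy_tokens` are hypothetical, and the toy counts are not taken from the notebook's corpus.

```python
import math
from collections import Counter

def build_ngrams(tokens, n):
    # Count every contiguous n-token window.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_log_probability(sentence, unigrams, bigrams, vocab_size):
    """Sum of log add-one bigram probabilities for one sentence."""
    log_p = 0.0
    for prev, word in zip(sentence, sentence[1:]):
        numerator = bigrams.get((prev, word), 0) + 1
        denominator = unigrams.get((prev,), 0) + vocab_size
        log_p += math.log(numerator / denominator)
    return log_p

# Toy training data (illustrative only).
toy_tokens = ["<s>", "data", "improves", "outcomes", "</s>",
              "<s>", "data", "drives", "research", "</s>"]
toy_unigrams = build_ngrams(toy_tokens, 1)
toy_bigrams = build_ngrams(toy_tokens, 2)
toy_vocab_size = len(toy_unigrams)  # 7 distinct tokens, including <s> and </s>

sentence = ["<s>", "data", "improves", "research", "</s>"]
log_p = bigram_log_probability(sentence, toy_unigrams, toy_bigrams, toy_vocab_size)
print(log_p, math.exp(log_p))
```

Exponentiating the result recovers the same value as the direct product, here (3/9) × (2/9) × (1/8) × (2/8) = 1/432.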




## STEP 8 — Perplexity Calculation

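Lower perplexity indicates a better-performing language model.
The code below normalizes all three models by `N = len(s)`, although the bigram model makes `len(s) - 1` predictions and the trigram model `len(s) - 2`, so the values are not strictly comparable across models.
Perplexity is computed here on five hand-written sentences; the held-out `test_tokens` are not scored.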
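
As a reminder of the definition, the perplexity of a word sequence $W = w_1 \dots w_N$ is the inverse probability normalized by the number of predicted tokens $N$. The notation is chosen for this note; the first form is what the `perplexity` function computes as `(1/prob) ** (1/N)`, and the second is the equivalent log-space form.

$$\mathrm{PP}(W) = P(w_1 \dots w_N)^{-1/N} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid \text{context}_i)\right)$$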


In [12]:

def perplexity(prob, N):
    return (1/prob) ** (1/N)

for s in sentences:
    N = len(s)
    print("Sentence:", " ".join(s))
    print("Unigram Perplexity:", perplexity(unigram_probability(s), N))
    print("Bigram Perplexity:", perplexity(bigram_probability(s), N))
    print("Trigram Perplexity:", perplexity(trigram_probability(s), N))
    print()


Sentence: <s> artificial intelligence is important </s>
Unigram Perplexity: 493.1507216374406
Bigram Perplexity: 15.397275986796908
Trigram Perplexity: 13.002373949513785

Sentence: <s> technology drives innovation </s>
Unigram Perplexity: 365.56829779063446
Bigram Perplexity: 33.08186257049151
Trigram Perplexity: 11.548117842660861

Sentence: <s> healthcare relies on data </s>
Unigram Perplexity: 246.56563302063614
Bigram Perplexity: 22.990945020584743
Trigram Perplexity: 16.290443372457652

Sentence: <s> education uses online platforms </s>
Unigram Perplexity: 309.07628410953856
Bigram Perplexity: 37.240991908214006
Trigram Perplexity: 15.15542109400205

Sentence: <s> politics shapes policy </s>
Unigram Perplexity: 842.4637014557621
Bigram Perplexity: 28.561389587437223
Trigram Perplexity: 11.548117842660861
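
The choice of `N` has a large effect on the reported trigram values. The sketch below is an illustration under assumptions rather than the notebook's method: it computes perplexity from a log-probability and compares normalizing by sentence length with normalizing by the number of trigram predictions. The function name `perplexity_from_logprob` is hypothetical, and `V = 59` is inferred from the printed trigram probability 4.869e-06, which equals (1/59)³.

```python
import math

def perplexity_from_logprob(log_prob, num_predictions):
    """Perplexity = exp(-log P / number of predicted tokens)."""
    return math.exp(-log_prob / num_predictions)

V = 59  # assumption: vocabulary size implied by the printed trigram probability

# A 5-token sentence whose three trigrams are all unseen:
# each add-one term is (0 + 1) / (0 + V) = 1 / V.
log_prob = 3 * math.log(1 / V)

pp_by_length = perplexity_from_logprob(log_prob, 5)       # N = len(s)
pp_by_predictions = perplexity_from_logprob(log_prob, 3)  # N = len(s) - 2

print(pp_by_length)
print(pp_by_predictions)
```

Normalizing by sentence length reproduces the 11.548 reported above, while normalizing by the three actual predictions gives a perplexity equal to the vocabulary size, which is what a uniform guess over the vocabulary would score.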




## STEP 9 — Comparison and Analysis

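In the results above, the bigram model achieves lower perplexity than the unigram model for every sentence, because it conditions each word on the preceding word.
The trigram model reports the lowest values, but these should be read with care. For the second and fifth sentences every trigram is unseen, so each smoothed term is 1/V and both sentences receive the identical probability 4.869e-06, which is consistent with V = 59. The low figure reflects the small vocabulary and the shared `N = len(s)` normalization rather than better predictions.
Trigram models generally need much more data than bigram models to estimate their counts well.
When unseen words appear, probabilities drop sharply; add-one smoothing keeps them above zero by assigning a small probability to every unseen N-gram.
Overall, the bigram model offers a reasonable balance between context and data requirements for a corpus this small.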
