

# N-Gram Language Model  
## Smoothing, Perplexity, and Sentence Completion



## 1. Libraries Used


In [2]:

import nltk
from nltk.util import ngrams
from collections import Counter
import math
import random
import matplotlib.pyplot as plt

nltk.download('punkt_tab')  # Punkt tokenizer data needed by nltk.word_tokenize
nltk.download('brown')      # Brown corpus


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True


## 2. Load General-Purpose Corpus (Brown)


In [3]:

from nltk.corpus import brown

words = [w.lower() for w in brown.words() if w.isalpha()]
print("Total words:", len(words))


Total words: 981716



## 3. Train–Test Split


In [4]:

train_words = words[:50000]
test_words = words[50000:51000]
test_sentence = " ".join(test_words)

print("Train size:", len(train_words))
print("Test size:", len(test_words))

Train size: 50000
Test size: 1000



## 4. Build N-Gram Model


In [5]:

def build_ngram_model(tokens, n):
    ngram_counts = Counter()
    context_counts = Counter()

    for gram in ngrams(tokens, n):
        context = gram[:-1]
        ngram_counts[gram] += 1
        context_counts[context] += 1

    return ngram_counts, context_counts



## 5. Probability Functions


In [6]:

def mle_probability(ngram, ngram_counts, context_counts):
    context = ngram[:-1]
    if context_counts[context] == 0:
        return 0
    return ngram_counts[ngram] / context_counts[context]

def smoothed_probability(ngram, ngram_counts, context_counts, vocab_size):
    context = ngram[:-1]
    return (ngram_counts[ngram] + 1) / (context_counts[context] + vocab_size)
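
To make the difference between the two estimates concrete, here is a minimal sketch on a six-word toy corpus. It uses only the standard library; `toy_ngrams`, `mle`, and `laplace` are illustrative names that are not part of the notebook, and the toy sentence is an assumption chosen so the counts are easy to check by hand.

```python
from collections import Counter

def toy_ngrams(tokens, n):
    """Return n-grams as tuples (a stand-in for nltk.util.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(toy_ngrams(tokens, 2))
# Count how often each one-word context starts a bigram
context_counts = Counter(gram[:-1] for gram in bigram_counts.elements())
vocab_size = len(set(tokens))  # 5 unique words

def mle(gram):
    """Maximum-likelihood estimate: count(gram) / count(context)."""
    ctx = gram[:-1]
    return bigram_counts[gram] / context_counts[ctx] if context_counts[ctx] else 0.0

def laplace(gram):
    """Add-one estimate: (count(gram) + 1) / (count(context) + V)."""
    return (bigram_counts[gram] + 1) / (context_counts[gram[:-1]] + vocab_size)

p_mle_seen = mle(("the", "cat"))        # 1 / 2
p_mle_unseen = mle(("the", "sat"))      # 0 / 2
p_lap_seen = laplace(("the", "cat"))    # (1 + 1) / (2 + 5)
p_lap_unseen = laplace(("the", "sat"))  # (0 + 1) / (2 + 5)

# With add-one smoothing the probabilities after "the" still sum to 1
total_after_the = sum(laplace(("the", w)) for w in set(tokens))
```

Smoothing takes probability mass away from the seen bigram (0.5 drops to 2/7) and spreads it over every word in the vocabulary, which is why the unseen bigram is no longer zero.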



## 6. Perplexity


In [7]:

def sentence_log_prob(sentence, n, ngram_counts, context_counts, vocab_size, smooth=True):
    tokens = [w for w in nltk.word_tokenize(sentence.lower()) if w.isalpha()]
    log_prob = 0

    for gram in ngrams(tokens, n):
        if smooth:
            prob = smoothed_probability(gram, ngram_counts, context_counts, vocab_size)
        else:
            prob = mle_probability(gram, ngram_counts, context_counts)
            if prob == 0:
                return float('-inf')
        log_prob += math.log(prob)

    return log_prob


def perplexity(sentence, n, ngram_counts, context_counts, vocab_size, smooth=True):
    tokens = [w for w in nltk.word_tokenize(sentence.lower()) if w.isalpha()]
    N = len(tokens)

    log_prob = sentence_log_prob(sentence, n, ngram_counts, context_counts, vocab_size, smooth)

    if log_prob == float('-inf'):
        return float('inf')

    return math.exp(-log_prob / N)
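
One detail of the function above is worth spelling out: the log-probability is a sum over the N - n + 1 n-grams in the text, but it is divided by N, the number of tokens. For small n the two are nearly equal; for large n the token count is noticeably larger, which lowers the reported perplexity. The sketch below compares the two normalizations on made-up probabilities. `perplexity_from_probs` and its `denominator` parameter are hypothetical names introduced only for this illustration.

```python
import math

def perplexity_from_probs(probs, denominator=None):
    """Perplexity from per-n-gram probabilities.

    By default the log-probability is divided by the number of scored
    n-grams; pass `denominator` to divide by something else, such as the
    token count.
    """
    if not probs:
        raise ValueError("need at least one n-gram probability")
    if any(p == 0 for p in probs):
        return float("inf")
    log_prob = sum(math.log(p) for p in probs)
    return math.exp(-log_prob / (denominator or len(probs)))

# A 10-token text scored by a 10-gram model yields a single 10-gram.
n = 10
probs = [0.001]
num_tokens = len(probs) + n - 1  # 10

pp_per_ngram = perplexity_from_probs(probs)              # about 1000
pp_per_token = perplexity_from_probs(probs, num_tokens)  # 1000 ** 0.1, about 2
```

Either convention can be defended, but values computed with different n are only comparable if the same convention is used throughout. This may be part of the reason the 10-gram figure in the comparison below comes out lower than the 5-gram one.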



## 7. Train Models


In [8]:

vocab = set(train_words)
vocab_size = len(vocab)

models = {}
for n in [1, 2, 3, 5, 10]:
    models[n] = build_ngram_model(train_words, n)

# The vocabulary is the set of unique words in the training corpus; the corpus itself contains repeated words.



## 8. Perplexity Comparison


In [9]:

for n in models:
    counts, contexts = models[n]
    print(f"\n{n}-gram model")
    print("Non-smoothed:", perplexity(test_sentence, n, counts, contexts, vocab_size, False))
    print("Smoothed:", perplexity(test_sentence, n, counts, contexts, vocab_size, True))
# Question: why does the 1-gram model have the lowest (best) smoothed perplexity?


1-gram model
Non-smoothed: inf
Smoothed: 1423.2897990857884

2-gram model
Non-smoothed: inf
Smoothed: 4572.492418499444

3-gram model
Non-smoothed: inf
Smoothed: 7509.528848567658

5-gram model
Non-smoothed: inf
Smoothed: 7752.670458088694

10-gram model
Non-smoothed: inf
Smoothed: 7412.500217939968


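In the cell above:

Non-smoothed is the maximum-likelihood estimate, before Laplace smoothing. It comes out as inf (infinite) because at least one n-gram in the test text never occurs in the training data, so its probability is 0 and sentence_log_prob returns -inf. No division by zero takes place: mle_probability returns 0 when the context count is 0.

Smoothed is the estimate after Laplace (add-one) smoothing, which gives every n-gram a non-zero probability and therefore a finite perplexity. The smoothed perplexity rises from the 1-gram to the 5-gram model largely because longer n-grams are rarely seen in a 50,000-word corpus, and add-one smoothing then assigns most of them a probability close to 1 / vocab_size.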
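
A quick way to see why the non-smoothed result is infinite for every n is to measure how many test n-grams never occur in the training data. The sketch below does this on two toy sentences using only the standard library. `ngram_list` and `unseen_fraction` are hypothetical helpers, not functions from the notebook, and the sentences are invented for the example.

```python
def ngram_list(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def unseen_fraction(train_tokens, test_tokens, n):
    """Fraction of test n-grams that do not occur in the training tokens."""
    train = set(ngram_list(train_tokens, n))
    test = ngram_list(test_tokens, n)
    if not test:
        return 0.0
    return sum(gram not in train for gram in test) / len(test)

train = "the cat sat on the mat and the dog sat on the rug".split()
test = "the bird sat on the mat".split()

# "bird" is unknown, so every n-gram that contains it is unseen
fractions = {n: unseen_fraction(train, test, n) for n in (1, 2, 3, 4)}
for n, frac in fractions.items():
    print(f"{n}-gram unseen fraction: {frac:.2f}")
```

On this toy data the unseen fraction grows with n. A single unseen n-gram is enough to make the unsmoothed probability of the whole text zero, so running the same check with `train_words` and `test_words` would show how much of the test text each model has never encountered.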


## 9. Corpus Size Effect


In [10]:
# Use a fixed vocabulary so that the two corpus sizes are compared fairly

# Take the vocabulary from the full training data
fixed_vocab = set(train_words)
fixed_vocab_size = len(fixed_vocab)

pp_values = []

for label, corpus in [("5k", train_words[:5000]), ("50k", train_words[:50000])]:
    c_counts, c_ctx = build_ngram_model(corpus, 3)

    pp = perplexity(
        test_sentence,
        3,
        c_counts,
        c_ctx,
        fixed_vocab_size,   # same vocabulary size for both corpus sizes
        smooth=True
    )

    pp_values.append(pp)
    print(label, "perplexity:", pp)


5k perplexity: 7839.723486406142
50k perplexity: 7509.528848567658



## 10. Sentence Completion


In [11]:

def complete_sentence(start_sentence, n, ngram_counts, context_counts, vocab, max_words=20):
    tokens = [w for w in nltk.word_tokenize(start_sentence.lower()) if w.isalpha()]
    output = tokens.copy()

    if len(tokens) >= n - 1:
        context = tuple(tokens[-(n-1):])
    else:
        context = tuple()

    for _ in range(max_words):
        candidates = []

        for word in vocab:
            gram = context + (word,)
            if gram in ngram_counts:
                candidates.append((word, ngram_counts[gram]))

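        if not candidates and n > 1:
            # Shorten the context by one word. Note that ngram_counts only holds
            # n-grams of order n, so a shortened context cannot match, and each
            # retry uses up one of the max_words iterations.
            context = context[1:]
            continue

        if not candidates:
            break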

        words_, weights = zip(*candidates)
        next_word = random.choices(words_, weights=weights)[0]

        output.append(next_word)
        context = tuple(output[-(n-1):])

    return " ".join(output)
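
The context slice `tokens[-(n-1):]` used above has an edge case at n = 1: `-(n-1)` evaluates to 0, and `tokens[-0:]` is the same as `tokens[0:]`, so it selects the whole list instead of nothing. A minimal sketch of a guard follows; `last_context` is a hypothetical helper name, not part of the notebook.

```python
def last_context(tokens, n):
    """Return the last n-1 tokens as a tuple; an empty tuple for unigram models."""
    if n <= 1:
        return ()
    return tuple(tokens[-(n - 1):])

tokens = ["the", "government", "announced", "that"]

naive_unigram_context = tuple(tokens[-(1 - 1):])  # tokens[-0:] is the whole list
fixed_unigram_context = last_context(tokens, 1)   # ()
trigram_context = last_context(tokens, 3)         # ("announced", "that")
```

With the naive slice, a unigram model looks up 5-word tuples in a table of 1-word tuples and finds no candidates, so it returns the prompt without adding a word.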



## 12. Generation Comparison


In [12]:

start = "the government announced that"

for n in [1, 2, 3, 5, 10]:
    counts, contexts = models[n]
    print(f"\n{n}-gram completion:")
    print(complete_sentence(start, n, counts, contexts, vocab, 50))



1-gram completion:
the government announced that

2-gram completion:
the government announced that a teacher now witnessing an engineering in front in the report the group friday emory university bringing them from his time i challenge was among the other seven men agree to arouse those of a second off kunkel bob day hundreds of rumor combined proceeds will sponsor of washington thousands

3-gram completion:
the government announced that it would force banks to violate their contractual obligations with depositors and undermine the confidence of bank customers if you destroy confidence in following molvar who kept reiterating her request that they tend to be not only reiterated the united states seek instead to detach the castro dragnet that the

5-gram completion:
the government announced that

10-gram completion:
the government announced that


In [13]:

start = "the government announced that they aren't allowing any foreign nationals to go to voyage on any of the coming months of 2026"

for n in [1, 2, 3, 5, 10]:
    counts, contexts = models[n]
    print(f"\n{n}-gram completion:")
    print(complete_sentence(start, n, counts, contexts, vocab,50))



1-gram completion:
the government announced that they are allowing any foreign nationals to go to voyage on any of the coming months of

2-gram completion:
the government announced that they are allowing any foreign nationals to go to voyage on any of the coming months of the jurors said no reason to be done even for popular prices on slim lines each month reporting weight than morton foods stock was driving the mantle and high rate of that it could use the three of the first major american naval announcement that he also criticized bernard gimbel

3-gram completion:
the government announced that they are allowing any foreign nationals to go to voyage on any of the coming months of were about or extra points in tries during three games on the republicans must hold a public hearing before the trial the announcement of a new record crowd for the gala richard newburger is chairman of the other uncles and aunts the rush butlers the homer robertsons and the players

5-gram completion:
the 

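A higher n does not by itself give better generation. A longer context makes the prediction more specific only when that exact context occurs in the training data.

In the first run the 5-gram and 10-gram models return the prompt unchanged. The 5-gram model finds no training 5-gram that starts with the four prompt words, and the 10-gram model needs nine words of context while the prompt has only four. Because complete_sentence looks up n-grams of one fixed order, it has no lower-order counts to fall back on. This is data sparsity combined with exact context matching.

The 1-gram model also returns the prompt unchanged, but for a different reason: with n = 1 the slice tokens[-(n-1):] is tokens[-0:], which selects the whole prompt instead of an empty context, so no unigram is ever matched.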
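
One common way around exact context matching is to back off: try the longest context first and shorten it until some continuation has been seen. The sketch below illustrates the idea on a toy corpus using only the standard library. `build_successors` and `generate_with_backoff` are hypothetical names, and this is plain backoff without discounting, not Katz backoff and not the method used in this notebook.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens, max_n):
    """Map every context of length 0 .. max_n - 1 to a Counter of next words."""
    successors = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            successors[gram[:-1]][gram[-1]] += 1
    return successors

def generate_with_backoff(prompt_tokens, successors, max_n, max_words=10, rng=None):
    """Extend the prompt, shortening the context whenever it has not been seen."""
    rng = rng or random.Random(0)  # seeded by default for repeatable output
    output = list(prompt_tokens)
    for _ in range(max_words):
        # Start with the longest usable context and shrink it until it is known
        k = min(max_n - 1, len(output))
        while k > 0 and tuple(output[-k:]) not in successors:
            k -= 1
        context = tuple(output[-k:]) if k > 0 else ()
        options = successors.get(context)
        if not options:
            break  # only happens when the training data is empty
        words, weights = zip(*options.items())
        output.append(rng.choices(words, weights=weights)[0])
    return output

train = "the cat sat on the mat and the dog sat on the rug".split()
successors = build_successors(train, max_n=5)

prompt = ["a", "bird", "saw", "the"]  # this 4-word context never occurs in train
generated = generate_with_backoff(prompt, successors, max_n=5, max_words=8)
print(" ".join(generated))
```

Because the empty context is always available, generation continues even when the prompt contains unseen words. The trade-off is that the output falls back to lower-order statistics whenever the long context is missing.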


## 13. Final Notes

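- N-gram models suffer from sparsity: most long n-grams in the test text never occur in the training data  
- Smoothing is essential: without it, a single unseen n-gram makes the perplexity infinite  
- A higher N does not always mean better: here the smoothed perplexity is lowest for the 1-gram model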
