**🔹 STEP 1 - Import Required Libraries**

In [1]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter, defaultdict
import numpy as np
import matplotlib.pyplot as plt


**Explanation**

1. re → remove punctuation and numbers

2. nltk → sentence & word tokenization

3. Counter → count word frequencies

4. numpy → probability & perplexity calculations

5. matplotlib → visualize results

**🔹 STEP 2 - Load Dataset**

In [2]:
with open("story_dataset.txt", "r", encoding="utf-8") as file:
    text = file.read()

print(text[:500])   # Display sample


The evening sky was painted with shades of orange and purple as the
small town slowly prepared for the night. The streets were quiet except
for the distant sound of children laughing near the park. Rohan walked
slowly along the old road, thinking about the events of the day. He had
always loved this town, with its narrow lanes and familiar faces.

As he reached the tea shop near the corner, he saw his friend Arjun
sitting on the wooden bench. “You’re late again,” Arjun said with a
smile. “I know


**Dataset Description:**

The dataset contains approximately 1500 words in around 120 sentences, combining narrative storytelling and conversational dialogue. It includes character interactions, emotional expressions, and descriptive sentences. The mixture of formal narration and informal speech makes it suitable for testing N-gram language modeling, and this variation helps analyze contextual word dependencies.

**🔹 STEP 3 - Text Preprocessing**

In [3]:
#Convert to lowercase
text = text.lower()


In [4]:
#Split into sentences first, while the sentence-ending punctuation is still present
nltk.download('punkt_tab')
sentences = sent_tokenize(text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [7]:
#Remove punctuation & numbers within each sentence
sentences = [re.sub(r'[^a-z\s]', '', sentence) for sentence in sentences]


In [8]:
#Add Start/End Tokens
processed_sentences = []
for sentence in sentences:
    words = word_tokenize(sentence)
    if not words:
        continue  # skip sentences left empty after cleaning
    words = ['<s>'] + words + ['</s>']
    processed_sentences.append(words)


**Explanation**

Lowercase → avoids duplicate forms

Remove punctuation → cleaner vocabulary

Tokenization → splits into words
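The preprocessing steps can be sketched on a toy string as follows. This is a minimal sketch, not the notebook's exact pipeline: `simple_sentence_split` and `preprocess` are hypothetical helpers, and a plain regex stands in for NLTK's `sent_tokenize` and `word_tokenize` so the example runs without downloading tokenizer data. In this sketch, sentences are split before punctuation is stripped, because the split relies on `.`, `!` and `?`.

```python
import re

def simple_sentence_split(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough stand-in for sent_tokenize)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def preprocess(text):
    """Lowercase, split into sentences, strip non-letters, and add <s>/</s> markers."""
    processed = []
    for sentence in simple_sentence_split(text.lower()):
        cleaned = re.sub(r'[^a-z\s]', '', sentence)  # drop punctuation and digits
        words = cleaned.split()
        if words:  # skip sentences that are empty after cleaning
            processed.append(['<s>'] + words + ['</s>'])
    return processed

toy_text = "Rohan walked slowly. He saw 2 friends! Were they late?"
toy_sentences = preprocess(toy_text)
for tokens in toy_sentences:
    print(tokens)
```

Each printed list is one sentence wrapped in `<s>` and `</s>`, which is the shape the later N-gram counting expects.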


**🔹 STEP 4 - Build N-Gram Models**

In [9]:
#unigram
unigram_counts = Counter()

for sentence in processed_sentences:
    unigram_counts.update(sentence)

total_words = sum(unigram_counts.values())


In [10]:
#bigram
bigram_counts = Counter()

for sentence in processed_sentences:
    for i in range(len(sentence)-1):
        bigram = (sentence[i], sentence[i+1])
        bigram_counts[bigram] += 1


In [11]:
#trigram
trigram_counts = Counter()

for sentence in processed_sentences:
    for i in range(len(sentence)-2):
        trigram = (sentence[i], sentence[i+1], sentence[i+2])
        trigram_counts[trigram] += 1
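The three counting loops above follow the same pattern, so they could be folded into one helper. The sketch below assumes a hypothetical function `count_ngrams` and a two-sentence toy corpus; unlike the cells above, it stores unigrams as 1-tuples rather than plain strings.

```python
from collections import Counter

def count_ngrams(tokenized_sentences, n):
    """Count n-grams (as tuples) within each sentence; n-grams never cross sentence boundaries."""
    counts = Counter()
    for tokens in tokenized_sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

toy_sentences = [
    ['<s>', 'the', 'night', 'was', 'dark', '</s>'],
    ['<s>', 'the', 'night', 'was', 'quiet', '</s>'],
]

toy_unigrams = count_ngrams(toy_sentences, 1)
toy_bigrams = count_ngrams(toy_sentences, 2)
toy_trigrams = count_ngrams(toy_sentences, 3)

print(toy_bigrams[('the', 'night')])            # occurs in both sentences
print(toy_trigrams[('night', 'was', 'dark')])   # occurs in the first sentence only
print(len(toy_unigrams), len(toy_bigrams), len(toy_trigrams))
```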


**🔹 STEP 5 - Apply Laplace Smoothing**

In [12]:
vocab_size = len(unigram_counts)

def unigram_prob(word):
    return (unigram_counts[word] + 1) / (total_words + vocab_size)


In [13]:
def bigram_prob(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)


In [14]:
def trigram_prob(w1, w2, w3):
    return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + vocab_size)


**Why Smoothing?**

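Without smoothing:

If an n-gram never appears in the training data → its probability = 0

The probability of the entire sentence becomes 0

The model fails for unseen words and word combinations

Laplace smoothing adds +1 to each count (and the vocabulary size to each denominator), so every n-gram gets a small non-zero probability.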
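The effect of add-one smoothing can be checked on a toy corpus. In this sketch the names `toy_sentences`, `mle_bigram_prob` and `laplace_bigram_prob` are illustrative and not part of the notebook; the smoothed formula is the same one used in `bigram_prob` above.

```python
from collections import Counter

toy_sentences = [
    ['<s>', 'the', 'night', 'was', 'dark', '</s>'],
    ['<s>', 'the', 'town', 'was', 'quiet', '</s>'],
]

toy_unigram_counts = Counter(w for s in toy_sentences for w in s)
toy_bigram_counts = Counter(
    (s[i], s[i + 1]) for s in toy_sentences for i in range(len(s) - 1)
)
toy_vocab = sorted(toy_unigram_counts)
V = len(toy_vocab)  # vocabulary size, including <s> and </s>

def mle_bigram_prob(w1, w2):
    """Unsmoothed estimate: count(w1, w2) / count(w1)."""
    return toy_bigram_counts[(w1, w2)] / toy_unigram_counts[w1]

def laplace_bigram_prob(w1, w2):
    """Add-one estimate: (count(w1, w2) + 1) / (count(w1) + V)."""
    return (toy_bigram_counts[(w1, w2)] + 1) / (toy_unigram_counts[w1] + V)

# ('night', 'quiet') never occurs in the toy corpus
unseen_mle = mle_bigram_prob('night', 'quiet')
unseen_laplace = laplace_bigram_prob('night', 'quiet')

# Smoothed probabilities over all possible next words still sum to 1
total_after_the = sum(laplace_bigram_prob('the', w) for w in toy_vocab)

print(unseen_mle, unseen_laplace, total_after_the)
```

The unseen bigram moves from probability 0 to 1 / (1 + V), while the distribution after `the` remains properly normalised.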

**🔹 STEP 6 - Sentence Probability**

In [15]:
test_sentences = [
    "she was very happy",
    "he went to the market",
    "they were talking quietly",
    "the night was dark",
    "i will see you tomorrow"
]


In [16]:
def sentence_bigram_probability(sentence):
    words = ['<s>'] + word_tokenize(sentence.lower()) + ['</s>']
    prob = 1
    for i in range(len(words)-1):
        prob *= bigram_prob(words[i], words[i+1])
    return prob


**Interpretation**

Lower probability → sentence less likely

Higher probability → sentence more natural
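Multiplying many probabilities below 1 can underflow to 0.0 for long sentences, so a common alternative is to sum log-probabilities instead. The sketch below is an illustration under assumptions: `toy_probs` and `toy_bigram_prob` are hypothetical stand-ins for the notebook's smoothed `bigram_prob`, and the fallback value 0.01 is arbitrary.

```python
import math

# Hypothetical smoothed bigram probabilities, standing in for bigram_prob() above
toy_probs = {
    ('<s>', 'the'): 0.5,
    ('the', 'night'): 0.25,
    ('night', '</s>'): 0.5,
}

def toy_bigram_prob(w1, w2):
    """Look up a bigram probability; unseen pairs get a small fixed value (an assumption)."""
    return toy_probs.get((w1, w2), 0.01)

def sentence_log_probability(words, prob_fn):
    """Sum of log-probabilities over consecutive word pairs."""
    return sum(math.log(prob_fn(words[i], words[i + 1])) for i in range(len(words) - 1))

words = ['<s>', 'the', 'night', '</s>']
log_p = sentence_log_probability(words, toy_bigram_prob)
direct_p = 0.5 * 0.25 * 0.5  # the same product computed directly
print(log_p, math.exp(log_p), direct_p)

# For a long sequence the plain product underflows, while the log sum stays finite
long_product = 1.0
for _ in range(2000):
    long_product *= 0.01
long_log = 2000 * math.log(0.01)
print(long_product, long_log)
```

Log-probabilities preserve the ranking of sentences, so the interpretation above (higher means more natural) carries over unchanged.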

**🔹 STEP 7 - Perplexity Calculation**

**Formula**

$$\text{Perplexity} = P(\text{sentence})^{-1/N}$$

Where:

N = number of tokens in the sentence, including the `<s>` and `</s>` markers
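For the bigram model, the formula can be expanded into a product over word pairs and then into an equivalent log form, which is easier to compute without underflow. This is the standard expansion written with the notation above, assuming $w_1$ is `<s>` and $w_N$ is `</s>`.

```latex
\text{Perplexity}(w_1 \dots w_N)
  = P(w_1 \dots w_N)^{-1/N}
  = \left( \prod_{i=2}^{N} P(w_i \mid w_{i-1}) \right)^{-1/N}
  = \exp\!\left( -\frac{1}{N} \sum_{i=2}^{N} \log P(w_i \mid w_{i-1}) \right)
```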

In [17]:
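def perplexity(sentence, model="bigram"):
    words = ['<s>'] + word_tokenize(sentence.lower()) + ['</s>']
    N = len(words)
    log_prob = 0.0  # sum logs instead of multiplying, to avoid underflow

    if model == "unigram":
        for word in words:
            log_prob += np.log(unigram_prob(word))
    elif model == "bigram":
        for i in range(len(words)-1):
            log_prob += np.log(bigram_prob(words[i], words[i+1]))
    elif model == "trigram":
        for i in range(len(words)-2):
            log_prob += np.log(trigram_prob(words[i], words[i+1], words[i+2]))
    else:
        raise ValueError("model must be 'unigram', 'bigram' or 'trigram'")

    return np.exp(-log_prob / N)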


**Interpretation**

Lower perplexity = the model predicts the sentence better

Higher perplexity = the model is more uncertain about the sentence
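One way to read a perplexity value is as an effective branching factor: a model that chooses uniformly among k words at every step has perplexity k. The sketch below checks this with a hypothetical helper `perplexity_from_probs` that takes the per-step probabilities directly; it normalises by the number of steps unless told otherwise, which is an assumption of this example.

```python
import math

def perplexity_from_probs(step_probs, n_tokens=None):
    """exp of the negative average log-probability.

    n_tokens defaults to the number of predicted steps; pass a different value
    to normalise by another token count.
    """
    if n_tokens is None:
        n_tokens = len(step_probs)
    log_prob = sum(math.log(p) for p in step_probs)
    return math.exp(-log_prob / n_tokens)

# A model that always chooses uniformly among 4 words has perplexity 4
uniform_pp = perplexity_from_probs([0.25] * 10)

# A model that is confident at every step has perplexity close to 1
confident_pp = perplexity_from_probs([0.9] * 10)

print(uniform_pp, confident_pp)
```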

**🔹 STEP 8 - Comparison & Analysis**

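1. The trigram model generally produced the lowest perplexity.

2. The bigram model performed better than the unigram model.

3. The unigram model ignores context entirely.

4. The trigram model captures more context (the two preceding words).

5. However, the trigram model suffers most from data sparsity, since many trigrams never occur in a small corpus.

6. Unseen words increased perplexity.

7. Laplace smoothing prevented zero probabilities.

8. Smoothing slightly reduced the probabilities of observed n-grams, because part of the probability mass is reassigned to unseen ones.

9. The bigram model provided a balance between context and sparsity.

10. The trigram model worked best when enough data was available.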
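A comparison like the one summarised above could be set up as in the sketch below. It is self-contained and uses a three-sentence toy corpus, so the numbers it prints say nothing about the story dataset; the names `train`, `held_out`, `laplace_prob` and this local `perplexity` are illustrative. Unlike the notebook's function, this version normalises by the number of predicted n-grams, which is an assumption made here to keep the three models comparable.

```python
import math
from collections import Counter

train = [
    ['<s>', 'the', 'night', 'was', 'dark', '</s>'],
    ['<s>', 'the', 'town', 'was', 'quiet', '</s>'],
    ['<s>', 'the', 'night', 'was', 'quiet', '</s>'],
]
held_out = ['<s>', 'the', 'town', 'was', 'dark', '</s>']

def count_ngrams(sentences, n):
    """Count n-grams as tuples, never crossing sentence boundaries."""
    return Counter(tuple(s[i:i + n]) for s in sentences for i in range(len(s) - n + 1))

counts = {n: count_ngrams(train, n) for n in (1, 2, 3)}
V = len(counts[1])                       # vocabulary size, including <s> and </s>
total_tokens = sum(counts[1].values())

def laplace_prob(ngram):
    """Add-one probability of the last word of `ngram` given the preceding words."""
    n = len(ngram)
    context_count = total_tokens if n == 1 else counts[n - 1][ngram[:-1]]
    return (counts[n][ngram] + 1) / (context_count + V)

def perplexity(tokens, n):
    """Perplexity under the n-gram model, normalised by the number of predicted n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    log_prob = sum(math.log(laplace_prob(g)) for g in ngrams)
    return math.exp(-log_prob / len(ngrams))

results = {n: perplexity(held_out, n) for n in (1, 2, 3)}
for n, pp in results.items():
    print(f"{n}-gram perplexity: {pp:.2f}")
```

On a corpus this small, the held-out sentence contains trigrams that never occur in training, which is the data sparsity effect noted in point 5.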