
Import Required Libraries

In [1]:
# Import nltk for NLP utilities like tokenization
import nltk

# Import re for text cleaning using regular expressions
import re

# Import Counter for counting words and n-grams
from collections import Counter

# Import numpy for mathematical operations
import numpy as np

# Download tokenizer resources (run once); newer NLTK releases may also
# need nltk.download('punkt_tab') before word_tokenize works
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Load Dataset

In [18]:
# Open the dataset text file in read mode
with open("/content/corpus.txt", "r", encoding="utf-8") as file:

    # Read entire content of file
    text = file.read()

# Print first 500 characters to verify data
print(text[:500])



Natural language processing is a field of artificial intelligence that focuses on the interaction between computers and human language. It enables machines to understand, interpret, and generate text in a meaningful way. Language models play a vital role in natural language processing by predicting the likelihood of a sequence of words. These models are widely used in applications such as speech recognition, machine translation, and text generation.

Language modeling is based on the idea that w


Text Preprocessing

In [20]:
# Convert all text to lowercase
text = text.lower()

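# Keep only lowercase letters and whitespace; this drops punctuation and digits,
# and also joins hyphenated words (for example, "n-gram" becomes "ngram")
text = re.sub(r'[^a-z\s]', '', text)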

# Tokenize text into words
tokens = nltk.word_tokenize(text)

# Print first 20 tokens to verify
print(tokens[:20])


['natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', 'it']
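
To see what the cleaning step keeps and drops, here is a minimal sketch on a made-up string. The sample text is hypothetical, and `str.split` stands in for `nltk.word_tokenize` as a simplification so the example runs without downloading NLTK data.

```python
import re

# Hypothetical sample text, not taken from corpus.txt
sample = "N-gram models, built in 2024, predict the NEXT word!"

# Same two cleaning steps as the preprocessing cell: lowercase, then keep
# only the letters a-z and whitespace
cleaned = re.sub(r'[^a-z\s]', '', sample.lower())

# Whitespace splitting is a stand-in for nltk.word_tokenize (assumption)
sample_tokens = cleaned.split()

print(sample_tokens)
```

The digits and punctuation disappear, and the hyphenated "N-gram" collapses into the single token "ngram", which matters when test sentences spell the term differently.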


Train-Test Split

In [30]:
# Calculate split index for 80% training
split_index = int(0.8 * len(tokens))

# Split tokens into training data
train_tokens = tokens[:split_index]

# Split tokens into testing data
test_tokens = tokens[split_index:]
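
Because the split is positional, the last 20% of the text can contain words that never occur in the training part. A rough sketch of how to measure that, using a toy token list in place of the notebook's `tokens` (the names `toy_tokens`, `toy_train`, `toy_test`, and `oov_rate` are assumptions for illustration):

```python
# Toy token list standing in for the notebook's `tokens` (assumption)
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "dog", "sat", "down"]

# Same 80/20 positional split as the cell above
toy_split = int(0.8 * len(toy_tokens))
toy_train = toy_tokens[:toy_split]
toy_test = toy_tokens[toy_split:]

# Words in the test part that the training part never saw
train_vocab = set(toy_train)
oov_tokens = [t for t in toy_test if t not in train_vocab]

# Fraction of test tokens that are out of vocabulary
oov_rate = len(oov_tokens) / len(toy_test)

print(toy_test, oov_tokens, oov_rate)
```

A high out-of-vocabulary rate on the real corpus would be one reason to expect the unsmoothed models to assign zero probability to much of the test data.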


Build Unigram Model

In [31]:
# Count frequency of each word in training data
unigram_counts = Counter(train_tokens)

# Calculate total number of words
total_unigrams = sum(unigram_counts.values())

# Create unigram probability dictionary
unigram_probs = {}

# Loop through each word and its count
for word, count in unigram_counts.items():

    # Calculate probability of each word
    unigram_probs[word] = count / total_unigrams
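
As a small worked example of the same maximum-likelihood estimate, the sketch below builds unigram probabilities for a six-word toy corpus and checks that they sum to 1. The toy data and the `toy_` names are assumptions, not part of the notebook.

```python
from collections import Counter
import math

# Toy training tokens (assumption, not the real corpus)
toy_train = ["the", "cat", "sat", "on", "the", "mat"]

# Count each word and the total number of tokens
toy_counts = Counter(toy_train)
toy_total = sum(toy_counts.values())

# P(word) = count(word) / total tokens, written as a dict comprehension
toy_unigram_probs = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_unigram_probs["the"])          # "the" occurs 2 times out of 6
print(sum(toy_unigram_probs.values()))   # the distribution should sum to about 1
```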


Build Bigram Model

In [32]:
# Create bigrams from training tokens
bigrams = list(zip(train_tokens[:-1], train_tokens[1:]))

# Count bigram frequencies
bigram_counts = Counter(bigrams)

# Create bigram probability dictionary
bigram_probs = {}

# Loop through each bigram
for (w1, w2), count in bigram_counts.items():

    # Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
    bigram_probs[(w1, w2)] = count / unigram_counts[w1]
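
The sketch below repeats the bigram estimate on a toy corpus and sums the conditional probabilities for a given context word. The toy data and the helper `context_mass` are hypothetical. It also shows a side effect of dividing by the unigram count: the final token of the corpus is counted as a word but never starts a bigram.

```python
from collections import Counter

# Toy training tokens (assumption, not the real corpus)
toy_train = ["the", "cat", "sat", "on", "the", "mat"]

toy_unigrams = Counter(toy_train)
toy_bigrams = Counter(zip(toy_train[:-1], toy_train[1:]))

# P(w2 | w1) = count(w1, w2) / count(w1)
toy_bigram_probs = {
    (w1, w2): count / toy_unigrams[w1]
    for (w1, w2), count in toy_bigrams.items()
}

def context_mass(w1):
    """Total probability assigned to all observed continuations of w1."""
    return sum(p for (first, _), p in toy_bigram_probs.items() if first == w1)

print(toy_bigram_probs[("the", "cat")])  # "the" is followed by "cat" once out of 2
print(context_mass("the"))               # continuations of "the" cover all its mass
print(context_mass("mat"))               # "mat" is the last token: no continuations
```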


Build Trigram Model

In [33]:
# Create trigrams from training tokens
trigrams = list(zip(train_tokens[:-2], train_tokens[1:-1], train_tokens[2:]))

# Count trigram frequencies
trigram_counts = Counter(trigrams)

# Create trigram probability dictionary
trigram_probs = {}

# Loop through each trigram
for (w1, w2, w3), count in trigram_counts.items():

    # Maximum-likelihood estimate: P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    trigram_probs[(w1, w2, w3)] = count / bigram_counts[(w1, w2)]
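
The bigram and trigram cells follow the same pattern, so the n-gram extraction can be written once for any order. The helper `ngrams` below is a hypothetical sketch, shown on toy data, that generalises the `zip` idiom used above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-token tuple in `tokens` (hypothetical helper)."""
    # zip stops at the shortest slice, so this yields len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy training tokens (assumption, not the real corpus)
toy_train = ["the", "cat", "sat", "the", "cat", "ran"]

toy_bigram_counts = Counter(ngrams(toy_train, 2))
toy_trigram_counts = Counter(ngrams(toy_train, 3))

# P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
p_sat = toy_trigram_counts[("the", "cat", "sat")] / toy_bigram_counts[("the", "cat")]

print(ngrams(toy_train, 3))
print(p_sat)
```

Here "the cat" occurs twice and is followed once by "sat" and once by "ran", so each continuation gets probability 0.5.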


Add-One (Laplace) Smoothing

In [34]:
# Vocabulary size: number of distinct words seen in the training data
V = len(unigram_counts)

# Define smoothed bigram probability function
def smoothed_bigram_prob(w1, w2):

    # Get bigram count with default 0
    bigram_count = bigram_counts.get((w1, w2), 0)

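    # Add-one smoothing: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V).
    # Counter returns 0 for an unseen w1, so the result is then 1 / V.
    return (bigram_count + 1) / (unigram_counts[w1] + V)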
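
To make the formula concrete, the sketch below applies add-one smoothing to a toy corpus and checks three cases: a seen bigram, an unseen bigram with a known first word, and an unknown first word. All `toy_` names are assumptions for illustration.

```python
from collections import Counter
import math

# Toy training tokens (assumption, not the real corpus)
toy_train = ["the", "cat", "sat", "on", "the", "mat"]

toy_unigrams = Counter(toy_train)
toy_bigrams = Counter(zip(toy_train[:-1], toy_train[1:]))
toy_V = len(toy_unigrams)  # 5 distinct words

def toy_smoothed(w1, w2):
    """Add-one smoothed P(w2 | w1) on the toy counts."""
    return (toy_bigrams.get((w1, w2), 0) + 1) / (toy_unigrams[w1] + toy_V)

seen = toy_smoothed("the", "cat")            # (1 + 1) / (2 + 5)
unseen_pair = toy_smoothed("the", "sat")     # (0 + 1) / (2 + 5)
unseen_context = toy_smoothed("dog", "cat")  # (0 + 1) / (0 + 5)

# Summing over the vocabulary for the context "the" should give about 1
mass_the = sum(toy_smoothed("the", w) for w in toy_unigrams)

print(seen, unseen_pair, unseen_context, mass_the)
```

Every pair now gets a non-zero probability, at the cost of moving mass away from the bigrams that were actually observed: "the cat" drops from 0.5 unsmoothed to 2/7.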


Sentence Probability (Unigram)

In [35]:
# Function to compute unigram sentence probability
def unigram_sentence_prob(sentence):

    # Tokenize sentence
    words = nltk.word_tokenize(sentence.lower())

    # Initialize probability
    prob = 1

    # Loop through each word
    for word in words:

        # Multiply by the word's probability; unseen words get an arbitrary
        # floor of 1e-6, so the result is a score rather than a true probability
        prob *= unigram_probs.get(word, 1e-6)

    return prob


Sentence Probability (Bigram)

In [36]:
# Function to compute bigram sentence probability
def bigram_sentence_prob(sentence):

    # Tokenize sentence
    words = nltk.word_tokenize(sentence.lower())

    # Initialize probability
    prob = 1

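    # Loop through adjacent word pairs; the first word's own probability is
    # not included, so the product has len(words) - 1 factors
    for i in range(len(words) - 1):

        # Multiply smoothed bigram probability
        prob *= smoothed_bigram_prob(words[i], words[i+1])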

    return prob


Perplexity Calculation

In [37]:
# Function to calculate perplexity
def perplexity(sentence, model_func):

    # Tokenize sentence
    words = nltk.word_tokenize(sentence.lower())

    # Calculate sentence probability
    prob = model_func(sentence)

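    # Perplexity = (1 / P(sentence)) ** (1 / N), where N is the number of tokens;
    # this fails if the sentence is empty or if prob underflows to 0
    return pow(1 / prob, 1 / len(words))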
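
Multiplying many small probabilities can underflow to 0.0 on longer inputs, which makes `1 / prob` fail. One common alternative is to sum log probabilities and exponentiate at the end. The function `perplexity_from_probs` below is a hypothetical sketch of that idea, not the notebook's method; it takes the per-token probabilities directly and normalises by the number of predicted tokens.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity from the probability assigned to each predicted token.

    Computes exp(-(1/N) * sum(log p_i)), which equals (1 / prod(p_i)) ** (1/N)
    mathematically but avoids underflow in the product.
    """
    if not token_probs:
        raise ValueError("need at least one predicted token")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))

# A model that gives every token probability 0.25 has perplexity of about 4
uniform_ppl = perplexity_from_probs([0.25] * 4)

# 100 tokens at 1e-6 each: the direct product underflows to 0.0 ...
tiny_probs = [1e-6] * 100
direct_product = 1.0
for p in tiny_probs:
    direct_product *= p

# ... while the log-space version still returns a finite perplexity
stable_ppl = perplexity_from_probs(tiny_probs)

print(uniform_ppl, direct_product, stable_ppl)
```

Passing one probability per predicted token also keeps the exponent consistent with the number of factors, for example `len(words) - 1` for a bigram model that does not score the first word.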


Test Sentences

In [38]:
# List of test sentences
sentences = [
    "language models are useful",
    "this is a simple example",
    "n gram models predict words",
    "machine learning is powerful",
    "unseen words cause problems"
]

# Loop through sentences and print perplexity
for s in sentences:

    print("Sentence:", s)

    print("Unigram Perplexity:", perplexity(s, unigram_sentence_prob))

    print("Bigram Perplexity:", perplexity(s, bigram_sentence_prob))

    print("-" * 50)


Sentence: language models are useful
Unigram Perplexity: 88.54595240526768
Bigram Perplexity: 34.31120589604889
--------------------------------------------------
Sentence: this is a simple example
Unigram Perplexity: 519.831225360002
Bigram Perplexity: 63.07148173600104
--------------------------------------------------
Sentence: n gram models predict words
Unigram Perplexity: 765.4594600044012
Bigram Perplexity: 60.658218456421935
--------------------------------------------------
Sentence: machine learning is powerful
Unigram Perplexity: 1749.3426073136952
Bigram Perplexity: 59.697599999987254
--------------------------------------------------
Sentence: unseen words cause problems
Unigram Perplexity: 13866.018957806076
Bigram Perplexity: 59.49551913027847
--------------------------------------------------
