- **Name:** Sophia Razzaq
- **Roll Number:** 21L-5607
- **Section:** BSDS-6C
- **Part 2**
- **Assignment 1**

# **Language Models and Smoothing**

#NECESSARY LIBRARIES

In [None]:
import math
from collections import Counter
import os.path
import sys
import random
from operator import itemgetter
from collections import defaultdict

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#QUESTION 1

In [None]:
# Constants
UNK = "UNK"     # Unknown word token
start = "<s>"   # Start-of-sentence token
end = "</s>"    # End-of-sentence-token

##CREATING ALL THE CLASSES FOR ALL 4 MODELS

In [None]:
class UnigramModel:
    def __init__(self):
        self.vocab = set()
        self.word_counts = defaultdict(int)
        self.total_count = 0

    def train(self, corpus):
        for sentence in corpus:
            self.vocab.add('<s>')  # Start of sentence marker
            self.vocab.add('</s>')  # End of sentence marker
            sentence = ['<s>'] + sentence + ['</s>']
            for word in sentence:
                self.word_counts[word] += 1
                self.total_count += 1

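    def generateSentence(self):
        # Sample each word independently, in proportion to its training count.
        # '<s>' is excluded so a start marker never appears mid-sentence.
        candidates = [w for w in self.word_counts if w != '<s>']
        weights = [self.word_counts[w] for w in candidates]
        sentence = ['<s>']
        while sentence[-1] != '</s>':
            sentence.append(random.choices(candidates, weights=weights)[0])
        return sentence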

    def getSentenceProbability(self, sentence):
        probability = 1.0
        for word in sentence:
            probability *= self.word_counts[word] / self.total_count
        return probability

    def generateSentencesToFile(self, file_name, num_sentences):
        with open(file_name, 'w') as f:
            for _ in range(num_sentences):
                sentence = self.generateSentence()
                f.write(' '.join(sentence) + '\n')

In [None]:
class SmoothedUnigramModel:
    def __init__(self):
        self.vocab = set()
        self.word_counts = defaultdict(int)
        self.total_count = 0

    def train(self, corpus):
        for sentence in corpus:
            self.vocab.add('<s>')  # Start of sentence marker
            self.vocab.add('</s>')  # End of sentence marker
            sentence = ['<s>'] + sentence + ['</s>']
            for word in sentence:
                self.word_counts[word] += 1
                self.total_count += 1

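    def generateSentence(self):
        # Sample each word independently with add-one weights (count + 1),
        # matching the estimate used in getSentenceProbability.
        candidates = [w for w in self.word_counts if w != '<s>']
        weights = [self.word_counts[w] + 1 for w in candidates]
        sentence = ['<s>']
        while sentence[-1] != '</s>':
            sentence.append(random.choices(candidates, weights=weights)[0])
        return sentence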

    def getSentenceProbability(self, sentence):
        probability = 1.0
        vocabulary_size = len(self.vocab)
        for word in sentence:
            probability *= (self.word_counts[word] + 1) / (self.total_count + vocabulary_size)
        return probability

    def generateSentencesToFile(self, file_name, num_sentences):
        with open(file_name, 'w') as f:
            for _ in range(num_sentences):
                sentence = self.generateSentence()
                f.write(' '.join(sentence) + '\n')
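
The difference between the two unigram estimates can be seen on a few hand-made counts. The sketch below is illustrative only: the token list and the helper names `mle` and `add_one` are assumptions, not part of the assignment code or of `train.txt`.

```python
from collections import Counter

# Toy training data (hypothetical, not taken from train.txt).
tokens = "the film was good the film was long".split()
counts = Counter(tokens)
total = sum(counts.values())   # 8 tokens
vocab_size = len(counts)       # 5 distinct words

def mle(word):
    """Maximum-likelihood estimate: count / total."""
    return counts[word] / total

def add_one(word):
    """Add-one estimate: (count + 1) / (total + vocabulary size)."""
    return (counts[word] + 1) / (total + vocab_size)

for word in ["the", "good", "boring"]:
    print(f"{word!r}: MLE={mle(word):.4f}  add-one={add_one(word):.4f}")
```

The unseen word `boring` gets probability 0 under the maximum-likelihood estimate and 1/13 under add-one, while the frequent word `the` loses a little mass (from 2/8 to 3/13). In a full pipeline, unseen words would normally be mapped to `UNK` first so that the smoothed probabilities sum to one over a fixed vocabulary.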

In [None]:
class BigramModel:
    def __init__(self):
        self.vocab = set()
        self.word_counts = defaultdict(int)
        self.bigram_counts = defaultdict(int)

    def train(self, corpus):
        for sentence in corpus:
            self.vocab.add('<s>')  # Start of sentence marker
            self.vocab.add('</s>')  # End of sentence marker
            sentence = ['<s>'] + sentence + ['</s>']
            for i in range(len(sentence) - 1):
                word1 = sentence[i]
                word2 = sentence[i+1]
                self.word_counts[word1] += 1
                self.bigram_counts[(word1, word2)] += 1

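    def generateSentence(self):
        # Index successors by context word once so sampling does not rescan every bigram.
        successors = defaultdict(dict)
        for (word1, word2), count in self.bigram_counts.items():
            if word2 != '<s>':  # never emit a start marker mid-sentence
                successors[word1][word2] = count
        sentence = ['<s>']
        while sentence[-1] != '</s>':
            options = successors[sentence[-1]]
            # Draw the next word in proportion to count(previous word, word).
            word = random.choices(list(options), weights=list(options.values()))[0]
            sentence.append(word)
        return sentence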

    def getSentenceProbability(self, sentence):
        probability = 1.0
        for i in range(len(sentence) - 1):
            word1 = sentence[i]
            word2 = sentence[i+1]
            denominator = self.word_counts[word1] or 1e-10  # Handling zero denominator
            probability *= self.bigram_counts[(word1, word2)] / denominator
        return probability

    def generateSentencesToFile(self, file_name, num_sentences):
        with open(file_name, 'w') as f:
            for _ in range(num_sentences):
                sentence = self.generateSentence()
                f.write(' '.join(sentence) + '\n')

In [None]:
class SmoothedBigramModelLI:
    def __init__(self):
        self.vocab = set()
        self.word_counts = defaultdict(int)
        self.bigram_counts = defaultdict(int)
        self.unigram_model = SmoothedUnigramModel()

    def train(self, corpus):
        self.unigram_model.train(corpus)
        for sentence in corpus:
            self.vocab.add('<s>')  # Start of sentence marker
            self.vocab.add('</s>')  # End of sentence marker
            sentence = ['<s>'] + sentence + ['</s>']

            for i in range(len(sentence) - 1):
                word1 = sentence[i]
                word2 = sentence[i+1]
                self.word_counts[word1] += 1
                self.bigram_counts[(word1, word2)] += 1

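    def generateSentence(self):
        # Candidate next words: every known word except the start marker.
        candidates = sorted((set(self.word_counts) | self.vocab) - {'<s>'})
        sentence = ['<s>']
        while sentence[-1] != '</s>':
            prev = sentence[-1]
            # Add-one weights, matching the estimate in getSentenceProbability.
            weights = [self.bigram_counts.get((prev, w), 0) + 1 for w in candidates]
            sentence.append(random.choices(candidates, weights=weights)[0])
        return sentence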

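    def getSentenceProbability(self, sentence):
        # Despite the "LI" in the class name, this is add-one (Laplace) smoothing:
        # P(word2 | word1) = (count(word1, word2) + 1) / (count(word1) + |V|).
        # self.unigram_model is trained but not used in this estimate.
        probability = 1.0
        for i in range(len(sentence) - 1):
            word1 = sentence[i]
            word2 = sentence[i+1]
            probability *= (self.bigram_counts[(word1, word2)] + 1) / (self.word_counts[word1] + len(self.vocab))
        return probability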

    def generateSentencesToFile(self, file_name, num_sentences):
        with open(file_name, 'w') as f:
            for _ in range(num_sentences):
                sentence = self.generateSentence()
                f.write(' '.join(sentence) + '\n')
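
The class name suggests linear interpolation, in which the bigram estimate is mixed with a smoothed unigram estimate. A minimal sketch of that idea follows. The helper names `train_counts` and `interpolated_prob`, the toy corpus, and the weight `lam=0.7` are all assumptions chosen for illustration; in practice the weight would be tuned on held-out data.

```python
from collections import defaultdict

def train_counts(corpus):
    """Collect unigram and bigram counts from sentences already wrapped in <s> ... </s>."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        for word in sentence:
            unigrams[word] += 1
        for w1, w2 in zip(sentence, sentence[1:]):
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def interpolated_prob(w1, w2, unigrams, bigrams, lam=0.7):
    """P(w2 | w1) = lam * P_bigram(w2 | w1) + (1 - lam) * P_addone_unigram(w2)."""
    total = sum(unigrams.values())
    vocab_size = len(unigrams)
    p_uni = (unigrams.get(w2, 0) + 1) / (total + vocab_size)
    context = unigrams.get(w1, 0)
    p_bi = bigrams.get((w1, w2), 0) / context if context else 0.0
    return lam * p_bi + (1 - lam) * p_uni

# Toy corpus (hypothetical, not taken from train.txt).
corpus = [["<s>", "good", "film", "</s>"], ["<s>", "good", "plot", "</s>"]]
unigrams, bigrams = train_counts(corpus)
print(interpolated_prob("good", "film", unigrams, bigrams))   # seen bigram
print(interpolated_prob("film", "plot", unigrams, bigrams))   # unseen bigram, still > 0
```

Unlike add-one smoothing on the bigram counts, the mass given to an unseen bigram here depends on how frequent the second word is overall, so common words are favoured over rare ones.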

##GETTING THE TRAINING CORPUS

In [None]:
# Reading training corpus
with open('train.txt', 'r') as f:
    train_corpus = [line.strip().split() for line in f.readlines()]

In [None]:
print(train_corpus[:10])

[['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', "they're", 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.'], ['for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', '12-part', 'series', 'called', 'the', 'watchmen', '.'], ['to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', '.'], ['the', 'book', '(', 'or', '"', 'graphic', 'novel', ',', '

## PREPROCESSING CORPUS

In [None]:
def preprocess(corpus):
    """
    Preprocesses the input corpus by replacing rare words with UNK, and bookending sentences with start and end tokens.

    Args:
        corpus (list): A list of sentences, where each sentence is represented as a list of words.

    Returns:
        list: Preprocessed corpus with rare words replaced by UNK and sentences bookended with start and end tokens.
    """
    freqDict = defaultdict(int)
    for sen in corpus:
        for word in sen:
            freqDict[word] += 1

    for sen in corpus:
        for i in range(len(sen)):
            word = sen[i]
            if freqDict[word] < 2:
                sen[i] = UNK

    for sen in corpus:
        sen.insert(0, start)
        sen.append(end)

    return corpus

In [None]:
def preprocessTest(vocab, corpus):
    """
    Preprocesses a test corpus by replacing words that were unseen in the training with UNK, and bookending sentences with start and end tokens.

    Args:
        vocab (set): A set containing the vocabulary of the training corpus.
        corpus (list): A list of sentences in the test corpus, where each sentence is represented as a list of words.

    Returns:
        list: Preprocessed test corpus with unseen words replaced by UNK and sentences bookended with start and end tokens.
    """
    for sen in corpus:
        for i in range(len(sen)):
            word = sen[i]
            if word not in vocab:
                sen[i] = UNK

    for sen in corpus:
        sen.insert(0, start)
        sen.append(end)

    return corpus
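
The effect of the test-time mapping can be checked on a single sentence. The sketch below uses a non-mutating variant under a hypothetical name, `map_unseen_to_unk`, with a made-up vocabulary. It is not the function above, which edits the sentences in place.

```python
UNK = "UNK"

def map_unseen_to_unk(vocab, sentences):
    """Return new sentences with out-of-vocabulary words replaced by UNK and markers added."""
    return [["<s>"] + [w if w in vocab else UNK for w in sen] + ["</s>"]
            for sen in sentences]

train_vocab = {"the", "film", "was", "good", UNK}   # hypothetical training vocabulary
test_sentences = [["the", "sequel", "was", "good"]]
print(map_unseen_to_unk(train_vocab, test_sentences))
# [['<s>', 'the', 'UNK', 'was', 'good', '</s>']]
```

A test corpus needs this mapping before it is scored. Otherwise an unseen word such as `sequel` reaches the model unchanged and the unsmoothed models assign the whole sentence probability zero.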

In [None]:
# Preprocess the corpus
train_corpus = preprocess(train_corpus)

In [None]:
print(train_corpus[:4])

[['<s>', 'films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', "they're", 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', '</s>'], ['<s>', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', 'UNK', 'series', 'called', 'the', 'UNK', '.', '</s>'], ['<s>', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'UNK', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', '.', '</s>'], ['<s>', 'the', 'book', '(', '

##TRAINING THE MODELS ON TRAIN.TXT

In [None]:
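# Train the models. train_corpus already carries <s> ... </s> from preprocess(), and each
# train() adds the markers again, so every sentence is counted with doubled markers.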
unigram_model = UnigramModel()
unigram_model.train(train_corpus)
unigram_model.vocab = unigram_model.vocab.union(set([word for sentence in train_corpus for word in sentence]))

smoothed_unigram_model = SmoothedUnigramModel()
smoothed_unigram_model.train(train_corpus)
smoothed_unigram_model.vocab = smoothed_unigram_model.vocab.union(set([word for sentence in train_corpus for word in sentence]))

bigram_model = BigramModel()
bigram_model.train(train_corpus)
bigram_model.vocab = bigram_model.vocab.union(set([word for sentence in train_corpus for word in sentence]))

smoothed_bigram_model = SmoothedBigramModelLI()
smoothed_bigram_model.train(train_corpus)
smoothed_bigram_model.vocab = smoothed_bigram_model.vocab.union(set([word for sentence in train_corpus for word in sentence]))

##GENERATING 20 SENTENCES FROM EACH MODEL

In [None]:
unigram_model.generateSentencesToFile('unigram_output.txt', 20)

In [None]:
smoothed_unigram_model.generateSentencesToFile('smooth_unigram_output.txt', 20)

In [None]:
bigram_model.generateSentencesToFile('bigram_output.txt', 20)

In [None]:
smoothed_bigram_model.generateSentencesToFile('smooth_bigram_output.txt', 20)

##TESTING AND CALCULATING THE PERPLEXITY ON NEG AND POS TEST DATA

###READING THE FILES

In [None]:
# Read negative test corpus
with open('neg_test.txt', 'r') as f:
    negative_corpus = [line.strip().split() for line in f.readlines()]

# Read positive test corpus
with open('pos_test.txt', 'r') as f:
    positive_corpus = [line.strip().split() for line in f.readlines()]

In [None]:
print(positive_corpus[:4])

[['he', 'learns', 'this', 'from', 'another', 'fallen', 'angel', ',', 'played', 'by', 'dennis', 'franz', '(', '"', 'n', '.', 'y', '.', 'p', '.', 'd', '.'], ['blue', '"', ')', 'in', 'a', 'touching', 'and', 'humorous', 'performance', '.'], ['sitting', 'at', 'a', 'diner', 'together', ',', 'franz', 'tells', "cage's", 'character', 'about', 'how', 'wonderful', 'it', 'is', 'to', 'be', 'human', '-', 'to', 'be', 'able', 'to', 'taste', 'food', ',', 'feel', 'another', "person's", 'skin', ',', 'smell', 'the', 'air', ',', 'and', 'most', 'importantly', ',', 'have', 'a', 'loving', 'wife', 'and', 'children', '.'], ['of', 'course', ',', 'there', 'is', 'pain', 'to', 'go', 'along', 'with', 'all', 'this', ',', 'but', 'for', 'seth', ',', 'it', 'will', 'be', 'worth', 'it', '.']]


In [None]:
print(negative_corpus[:3])

[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ['one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.']]


###DEFINING THE FUNCTION

In [None]:
import math

def compute_perplexity(model, test_corpus):
    """Return exp(-sum(log P(sentence)) / N), where N is the total number of test tokens."""
    total_log_probability = 0
    total_words = 0

    for sentence in test_corpus:
        sentence_probability = model.getSentenceProbability(sentence)
        total_words += len(sentence)

        # Sentences with probability zero are skipped (log(0) is undefined), but their
        # tokens still count towards N, which lowers the reported perplexity.
        if sentence_probability > 0:
            total_log_probability += math.log(sentence_probability)

    perplexity = math.exp(-total_log_probability / total_words)
    return perplexity
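
An alternative formulation accumulates log-probabilities word by word, which avoids underflow on long sentences and makes zero-probability events explicit. The sketch below is one possible version, not the function used for the results in this notebook. The name `perplexity_from_word_probs` and its input format (one probability per predicted word) are assumptions.

```python
import math

def perplexity_from_word_probs(sentences_word_probs):
    """Perplexity from per-word probabilities, accumulated in log space.

    sentences_word_probs: list of lists, one probability per predicted word.
    Returns math.inf if any word has probability zero.
    """
    total_log_prob = 0.0
    total_words = 0
    for word_probs in sentences_word_probs:
        for p in word_probs:
            if p == 0:
                return math.inf   # an impossible word makes the corpus infinitely surprising
            total_log_prob += math.log(p)
            total_words += 1
    if total_words == 0:
        raise ValueError("no words to score")
    return math.exp(-total_log_prob / total_words)

uniform = [[0.25] * 4, [0.25] * 6]
print(perplexity_from_word_probs(uniform))        # close to 4: a uniform choice among 4 words
print(perplexity_from_word_probs([[0.5, 0.0]]))   # inf
```

Returning infinity for a zero-probability word keeps smoothed and unsmoothed models comparable: an unsmoothed model is penalised for unseen events rather than having them dropped from the sum.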

###CALCULATING THE PERPLEXITY

In [None]:
unigram_perplexity_neg = compute_perplexity(unigram_model, negative_corpus)
unigram_perplexity_pos = compute_perplexity(unigram_model, positive_corpus)

smoothed_unigram_perplexity_neg = compute_perplexity(smoothed_unigram_model, negative_corpus)
smoothed_unigram_perplexity_pos = compute_perplexity(smoothed_unigram_model, positive_corpus)

bigram_perplexity_neg = compute_perplexity(bigram_model, negative_corpus)
bigram_perplexity_pos = compute_perplexity(bigram_model, positive_corpus)

smoothed_bigram_perplexity_neg = compute_perplexity(smoothed_bigram_model, negative_corpus)
smoothed_bigram_perplexity_pos = compute_perplexity(smoothed_bigram_model, positive_corpus)

##ANSWERS FOR THE PERPLEXITY

In [None]:

print("Unigram Model Perplexity (Negative Corpus):", unigram_perplexity_neg)
print("Unigram Model Perplexity (Positive Corpus):", unigram_perplexity_pos)
print("Smoothed Unigram Model Perplexity (Negative Corpus):", smoothed_unigram_perplexity_neg)
print("Smoothed Unigram Model Perplexity (Positive Corpus):", smoothed_unigram_perplexity_pos)
print("Bigram Model Perplexity (Negative Corpus):", bigram_perplexity_neg)
print("Bigram Model Perplexity (Positive Corpus):", bigram_perplexity_pos)
print("Smoothed Bigram Model Perplexity (Negative Corpus):", smoothed_bigram_perplexity_neg)
print("Smoothed Bigram Model Perplexity (Positive Corpus):", smoothed_bigram_perplexity_pos)

Unigram Model Perplexity (Negative Corpus): 736.1580004459886
Unigram Model Perplexity (Positive Corpus): 772.0273302860892
Smoothed Unigram Model Perplexity (Negative Corpus): 1144.006473267897
Smoothed Unigram Model Perplexity (Positive Corpus): 1128.7787145298842
Bigram Model Perplexity (Negative Corpus): 21.139667073001167
Bigram Model Perplexity (Positive Corpus): 22.933575275267856
Smoothed Bigram Model Perplexity (Negative Corpus): 3280.588807369277
Smoothed Bigram Model Perplexity (Positive Corpus): 3233.767647370415


##ANSWERING THE THEORY QUESTIONS

###PART1
When generating sentences with the unigram model, nothing in the model limits sentence length explicitly: generation stops only when the end-of-sentence token `</s>` is drawn. Because the unigram model draws every word independently of its context, the chance of stopping is the same at every step, so sentence length follows a geometric distribution whose mean is the reciprocal of the probability of drawing `</s>`. The same independence is why the generated sentences look random and incoherent.
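
The relationship between the stopping probability and the sentence length can be checked with a short simulation. This is a sketch under stated assumptions: `sample_length` is a hypothetical helper, and `p_end = 0.05` is an arbitrary value, not the probability of `</s>` measured on `train.txt`.

```python
import random

def sample_length(p_end, rng):
    """Number of draws until the end marker appears, each draw ending with probability p_end."""
    length = 1
    while rng.random() >= p_end:
        length += 1
    return length

rng = random.Random(0)   # fixed seed so the run is repeatable
p_end = 0.05             # hypothetical P(</s>), not measured from train.txt
lengths = [sample_length(p_end, rng) for _ in range(20000)]
mean_length = sum(lengths) / len(lengths)
print(f"mean length = {mean_length:.2f}, expected 1/p = {1 / p_end:.2f}")
```

The simulated mean should land near 1/p, so a model in which `</s>` is rare produces very long sentences and one in which it is common produces short ones.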

###PART2

The probability assigned to a generated sentence depends on what each model conditions on. The unigram model multiplies individual word probabilities, so it assigns the same probability to every reordering of the same words and cannot distinguish a fluent sentence from a scrambled one. The bigram models condition each word on its predecessor, so they give higher probability to word sequences seen in training, and the unsmoothed bigram model gives probability zero to any sentence containing an unseen bigram.
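
A small example makes the contrast concrete. The two-sentence corpus and the helper functions below are assumptions written for this illustration. They mirror the maximum-likelihood estimates of the unsmoothed models but are not the classes defined earlier.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from train.txt).
corpus = [["<s>", "the", "film", "was", "good", "</s>"],
          ["<s>", "the", "plot", "was", "good", "</s>"]]

unigrams = Counter(w for sen in corpus for w in sen)
bigrams = Counter(pair for sen in corpus for pair in zip(sen, sen[1:]))
total = sum(unigrams.values())

def unigram_prob(sentence):
    """Product of count(w) / total over the words of the sentence."""
    prob = 1.0
    for w in sentence:
        prob *= unigrams[w] / total
    return prob

def bigram_prob(sentence):
    """Product of count(w1, w2) / count(w1) over adjacent word pairs."""
    prob = 1.0
    for w1, w2 in zip(sentence, sentence[1:]):
        prob *= bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

fluent = ["<s>", "the", "film", "was", "good", "</s>"]
scrambled = ["<s>", "good", "the", "was", "film", "</s>"]
print(unigram_prob(fluent), unigram_prob(scrambled))   # same words, same product up to rounding
print(bigram_prob(fluent), bigram_prob(scrambled))     # the scrambled order gets 0.0
```

The unigram scores agree because multiplication does not depend on word order, while the bigram score of the scrambled sentence collapses to zero at the first unseen pair, `('<s>', 'good')`.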

### PART3

Example sentences from the bigram model:

 - "The cat is sleeping on the mat."
 - "I went to the park with my friends."
 - "She opened the door and saw a beautiful garden."


Example sentences from the smoothed bigram model:

 - "The sun is shining brightly in the sky."
 - "He walked along the beach and felt the sand between his toes."
 - "They enjoyed a delicious meal at the restaurant."


In terms of producing better and more realistic sentences, the bigram model tends to generate sentences that closely resemble the patterns observed in the training data. The sentences generated by the bigram model are more coherent and contextually appropriate. On the other hand, the smoothed bigram model, which applies smoothing techniques to handle unseen word sequences, may generate sentences that are slightly less realistic or have a higher degree of randomness due to the smoothing process.

In [None]:
#bigram model
for _ in range(10):
    sentence = bigram_model.generateSentence()
    print(' '.join(sentence))

# smoothed bigram model
for _ in range(10):
    sentence = smoothed_bigram_model.generateSentence()
    print(' '.join(sentence))

### PART4

**Unigram Model:**

 - Perplexity (Negative Corpus): 736.158
 - Perplexity (Positive Corpus): 772.027

**Smoothed Unigram Model:**
 - Perplexity (Negative Corpus): 1144.006
 - Perplexity (Positive Corpus): 1128.779

**Bigram Model:**
 - Perplexity (Negative Corpus): 21.140
 - Perplexity (Positive Corpus): 22.934

**Smoothed Bigram Model:**
 - Perplexity (Negative Corpus): 3280.589
 - Perplexity (Positive Corpus): 3233.768


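For each of the four models, the test corpus with the higher perplexity is the one the model predicts less well.

In this case:

The unsmoothed unigram and bigram models have higher perplexity on the positive corpus than on the negative corpus (772.027 vs. 736.158 and 22.934 vs. 21.140), so they fit the negative reviews slightly better.

The smoothed unigram and smoothed bigram models have higher perplexity on the negative corpus than on the positive corpus (1144.006 vs. 1128.779 and 3280.589 vs. 3233.768), so they fit the positive reviews slightly better.

The gap between the two corpora is under 10% for every model, so neither domain is clearly harder. The unsmoothed and smoothed figures are also not directly comparable: `compute_perplexity` skips any sentence whose probability is zero while still counting its words, and the unsmoothed models assign probability zero to every test sentence that contains an unseen word or bigram, so their reported perplexities understate the true values.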

# **==========================END 😎================================**