<a href="https://colab.research.google.com/github/0Nexus/0Nexus.github.io/blob/main/ngram_language_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from IPython.display import Math, HTML
display(HTML("<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/"
               "latest.js?config=default'></script>"))

#Language models

As we noted in the lecture today, language models are models that are trained to predict the likelihood of a sequence of words or characters in a language. The primary goal of language models is to learn the structure and patterns of a language by analyzing large amounts of text.


We will begin our exploration by looking at the key fundamentals of the simplest language model.

We will start of with a toy passage from Simple English Wikipedia.

In [2]:
# our toy text (ideally we would train over large amounto of data, this is just a tiny snapshot)
input_text = """This is the front page of the Simple English Wikipedia.
                Wikipedias are places where people work together to write encyclopedias
                in different languages. We use Simple English words and grammar here.
                The Simple English Wikipedia is for everyone! That includes children and
                adults who are learning English. There are 227,530 articles on the Simple
                English Wikipedia. All of the pages are free to use."""

## Unigram language model

Recall that the unigram language model is the simplest version of the language model, where we use the chain rule of probabilities and make a (very bad) independence assumption: we consider only the frequency (normalise this and obtain the probability) of individual words in a text and assume that each word is independent of other words (or contexts) in the sentence.

To build a unigram model, the text is first split into individual words, and the frequency of each word is calculated. Then, the probability of each word is calculated by dividing its frequency by the total number of words in the text. This gives a probability distribution over the entire vocabulary of the language.

Given a new text, a unigram model can predict the likelihood of each word in the vocabulary based on its frequency. To generate a sentence using a unigram model, words are sampled randomly from the unigram distribution. Given sufficient amount of data this can result in sentences that are grammatically correct but may not make sense semantically.

In the code below: a sentence of length 10 is sampled from the unigram distribution using the choices() method from the random module.

In [8]:
import random

# Split the text into a list of words
words = input_text.split()

# Count the frequency of each word and calculate the unigram probabilities
unigram_counts = {}
for word in words:
    if word in unigram_counts:
        unigram_counts[word] += 1
    else:
        unigram_counts[word] = 1

num_words = len(words)
unigram_probs = {word: count / num_words for word, count in unigram_counts.items()}

# Sample a sentence of length 10 from the unigram distribution
sampled_sentence = []
for i in range(100):
    sampled_word = random.choices(list(unigram_probs.keys()), list(unigram_probs.values()))[0]
    sampled_sentence.append(sampled_word)

# Print out the sampled sentence
print("Sampled sentence:")
print(" ".join(sampled_sentence))

Sampled sentence:
This use children Simple Wikipedia All English Wikipedia. is Wikipedia. English. are English work and This Simple adults front people the All Wikipedias are where are the Wikipedia. who All of languages. English. everyone! page Wikipedia and children the pages of Simple is languages. write on are are free in the Simple to Simple here. Simple English Wikipedia to There English. 227,530 That includes All people Wikipedia here. words All are Simple There Simple use the use languages. of The Wikipedia includes Wikipedia. This Simple adults the English the Simple and grammar is are are words of Simple English is


### TODO:

1.   Try using a large sized input sample and see how the sampling changes.
2.   Change the length of the generated samples
3.   What happens to words that are not seen in the training set?
4.   There is something interesting happening in the `else` loop above. What is this? Why is this done? What is the effect of not having this loop?



## Bigram language model

Recall that the bigram language model predicts the probability of a word based on the probability of its preceding word. In other words, a bigram model considers the probability of a word based on the occurrence of its preceding word in the sentence.

Notice in the code below the bigram counts are calculated by looping over each adjacent pair of words in the text. The frequency of each bigram is stored in a dictionary called bigram_counts.

In [9]:
# Calculate bigram counts
bigram_counts = {}
for i in range(len(words)-1):
    current_word = words[i]
    next_word = words[i+1]
    bigram = (current_word, next_word)
    if bigram in bigram_counts:
        bigram_counts[bigram] += 1
    else:
        bigram_counts[bigram] = 1

# Calculate bigram probabilities
num_bigrams = sum(bigram_counts.values())
bigram_probs = {bigram: count / num_bigrams for bigram, count in bigram_counts.items()}

# Sample a sentence of length 10 from the bigram distribution
start_word = random.choice(words)
sampled_sentence = [start_word]
for i in range(100):
    current_word = sampled_sentence[-1]
    possible_bigrams = [bigram for bigram in bigram_probs.keys() if bigram[0] == current_word]
    if len(possible_bigrams) > 0:
        sampled_bigram = random.choices(possible_bigrams, [bigram_probs[bigram] for bigram in possible_bigrams])[0]
        next_word = sampled_bigram[1]
    else:
        next_word = random.choice(words)
    sampled_sentence.append(next_word)

# Print out the sampled sentence
print("Sampled sentence:")
print(" ".join(sampled_sentence))


Sampled sentence:
where people work together to write encyclopedias in different languages. We use Simple English Wikipedia. All of the Simple English Wikipedia. Wikipedias are free to use. Wikipedia. All of the front page of the Simple English Wikipedia. All of the Simple English Wikipedia. Wikipedias are free to write encyclopedias in different languages. We use Simple English Wikipedia. Wikipedias are places where people work together to write encyclopedias in different languages. We use Simple English words and adults who are 227,530 articles on the front page of the front page of the Simple English words and grammar here. The Simple English Wikipedia.


## Trigram language model

Now let's do trigram language model where a trigram model considers the probability of a word based on the occurrence of its two preceding words in the sentence.

It is very similar to the bigram language model and is an extension. It considers preceding two word context.


In [16]:
# Calculate trigram counts
trigram_counts = {}
for i in range(len(words)-2):
    current_word = words[i]
    next_word = words[i+1]
    third_word = words[i+2]
    trigram = (current_word, next_word, third_word)
    if trigram in trigram_counts:
        trigram_counts[trigram] += 1
    else:
        trigram_counts[trigram] = 1

# Calculate trigram probabilities
num_trigrams = sum(trigram_counts.values())
trigram_probs = {trigram: count / num_trigrams for trigram, count in trigram_counts.items()}

# Sample a sentence of length 10 from the trigram distribution
start_bigram = random.choices([(words[i], words[i+1]) for i in range(len(words)-1)], k=1)[0]
sampled_sentence = [start_bigram[0], start_bigram[1]]
for i in range(10):
    current_bigram = (sampled_sentence[-2], sampled_sentence[-1])
    possible_trigrams = [trigram for trigram in trigram_probs.keys() if trigram[:2] == current_bigram]
    if len(possible_trigrams) > 0:
        sampled_trigram = random.choices(possible_trigrams, [trigram_probs[trigram] for trigram in possible_trigrams])[0]
        next_word = sampled_trigram[2]
    else:
        next_word = random.choice(words)
    sampled_sentence.append(next_word)

# Print out the sampled sentence
print("Sampled sentence:")
print(" ".join(sampled_sentence))

Sampled sentence:
are places where people work together to write encyclopedias in different languages.


In [15]:
# Calculate trigram counts
quadgram_counts = {}
for i in range(len(words)-3):
    current_word = words[i]
    next_word = words[i+1]
    third_word = words[i+2]
    quad_word = words[i+3]
    quadgram = (current_word, next_word, third_word, quad_word)
    if quadgram in quadgram_counts:
        quadgram_counts[quadgram] += 1
    else:
        quadgram_counts[quadgram] = 1

# Calculate trigram probabilities
num_quadgrams = sum(quadgram_counts.values())
quadgram_probs = {quadgram: count / num_quadgrams for quadgram, count in quadgram_counts.items()}

# Sample a sentence of length 10 from the trigram distribution
start_trigram = random.choices([(words[i], words[i+1], words[i+2]) for i in range(len(words)-2)], k=2)[0]
sampled_sentence = [start_trigram[0], start_trigram[1],start_trigram[2]]
for i in range(10):
    current_trigram = (sampled_sentence[-3], sampled_sentence[-2], sampled_sentence[-1])
    possible_quadgram = [trigram for trigram in trigram_probs.keys() if trigram[:3] == current_bigram]
    if len(possible_quadgram) > 0:
        sampled_quadgram = random.choices(possible_quadgram, [quadgram_probs[quadgram] for quadgram in possible_quadgram])[0]
        next_word = sampled_quadgram[3]
    else:
        next_word = random.choice(words)
    sampled_sentence.append(next_word)

# Print out the sampled sentence
print("Sampled sentence:")
print(" ".join(sampled_sentence))

Sampled sentence:
Wikipedia. All of Wikipedia. pages free All pages pages the is of are


### TODO:

1.   Try both the bigram and the trigram models with the larger dataset, do you see a difference in the generated samples? What is happening? Is the quality improving? Why is this happening?   
2.   (Challenge excercise) Extend this for 4-gram language model. What does this involve?



## Evaluating language models: Perplexity

Recall that perplexity is a measure of how well a language model can predict a sequence of words. It is calculated as the inverse probability of the test set, normalized by the number of words in the test set, it is given as: $ppl(s) = 2^{-\scriptscriptstyle{\frac{\log P(s)}{N}}}$,

 where $s$ is a sentence, $P(s)$ is the probability of the sentence according to the LM.

 We will define this below.

In [6]:
import math

def calculate_perplexity(test_set, ngram_probs):
    # Split the test set into a list of words
    test_words = test_set.split()

    # Calculate the perplexity of the test set
    N = len(test_words)
    log_prob = 0
    for i in range(len(test_words)-(len(ngram_probs)-1)):
        ngram = tuple(test_words[i:i+len(ngram_probs)])
        if ngram in ngram_probs:
            log_prob += math.log(ngram_probs[ngram])
        else:
            # Add probability of unknown word
            log_prob += math.log(1 / (sum(ngram_probs.values()) + 1))
    perplexity = math.exp(-log_prob/N)

    return perplexity

This code calculates the perplexity of a test set using a trigram language model. The test set is defined as the string "this article is from Wikipedia". Remember, lower the perplexity -- better the quality of the LM (when trained on language related data).

Another way to think of perplexity is essentially asking to what extent is the LM surprised with a given sample of data.


In [19]:
# Define the test set as a string
test_set = "this article is from Wikipedia"

# Calculate perplexity of test set using the trigram language model
trigram_perplexity = calculate_perplexity(test_set, trigram_probs)
print("Trigram perplexity:", trigram_perplexity)
# Define the test set as a string
test_set = "this article is from Wikipedia"

# Calculate perplexity of test set using the trigram language model
quadgram_perplexity = calculate_perplexity(test_set, quadgram_probs)
print("quadgram perplexity:", quadgram_perplexity)
# Define the test set as a string
test_set = "this article is from Wikipedia"

# Calculate perplexity of test set using the trigram language model
bigram_perplexity = calculate_perplexity(test_set, bigram_probs)
print("bigram perplexity:", bigram_perplexity)

# Calculate perplexity of test set using the trigram language model
unigram_perplexity = calculate_perplexity(test_set, unigram_probs)
print("Unigram perplexity:", unigram_perplexity)




Trigram perplexity: 1.0
quadgram perplexity: 1.0
bigram perplexity: 1.0
Unigram perplexity: 1.0


### TODO

1. Can you compare the perplexity from each of the LMs that we have implemented?

2. If you have constructed 4-gram language model, please evaluate the perplexity of the model.

3. Can you go on in this way to {5, 6, 7, 9}-grams? Where is the bottleneck? What would extending the context window do?