## Evaluating Language Models with Perplexity

**Perplexity** is one of the most common metrics used to evaluate language models.  
It measures how well a model predicts a sequence of words: intuitively, how *uncertain* the model is about each next word given the words before it.

A **lower perplexity** means the model assigns higher probabilities to the correct words (better performance),  
while a **higher perplexity** indicates greater uncertainty or poorer predictions.

---

## Mathematical Definition

Given a test sequence of $N$ words $w_1, w_2, \dots, w_N$ and a language model that assigns a conditional probability $P(w_i \mid w_{1:i-1})$ to each word, the **perplexity** is defined as

$$ \text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{1:i-1})} $$

which can also be expressed in an equivalent multiplicative form:

$$ \text{Perplexity} = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{1:i-1})} \right)^{\frac{1}{N}} $$

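Both forms give the same value. The exponent in the first is the average negative log-probability per word (the cross-entropy in bits), and the second is the geometric mean of the inverse word probabilities. In either form, a lower value means the model assigns higher probabilities to the words that actually occur.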
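As a quick numerical check of the two formulas, the sketch below evaluates both on a hand-picked list of per-word probabilities. The values are illustrative assumptions, not the output of any model, and the function names are hypothetical. It also shows that a model spreading its probability uniformly over 10 choices has a perplexity of about 10, which is why perplexity is often read as an effective branching factor.

```python
import math


def perplexity_log_form(probs):
    """Perplexity as 2 raised to the negative mean log2-probability."""
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)


def perplexity_product_form(probs):
    """Perplexity as the geometric mean of the inverse probabilities."""
    n = len(probs)
    return math.prod(1 / p for p in probs) ** (1 / n)


# Illustrative per-word probabilities (made up for this example).
probs = [0.5, 0.25, 0.125, 0.25]

# Both forms agree: the log2 values sum to -8, so the result is 2 ** (8 / 4).
print(perplexity_log_form(probs))
print(perplexity_product_form(probs))

# A uniform model over 10 choices: perplexity is 10 up to floating-point rounding.
uniform = [1 / 10] * 5
print(perplexity_log_form(uniform))
```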

In [9]:
from nltk import bigrams
from collections import Counter, defaultdict

# Training text
training_text = ["Alice", "wonders", "what", "is", "happening", "in", "Wonderland"]

# Bigram and unigram counts
bigram_counts = Counter(bigrams(training_text))
unigram_counts = Counter(training_text)
# set() counts each distinct word once (every word is already unique in this tiny corpus)
vocabulary_size = len(set(training_text))


def bigram_probability(bigram, bigram_counts, unigram_counts, vocabulary_size, alpha=1):
    """Estimate P(w2 | w1) with add-alpha (Laplace) smoothing."""
    w1, w2 = bigram
    numerator = bigram_counts[bigram] + alpha
    denominator = unigram_counts[w1] + alpha * vocabulary_size
    return numerator / denominator
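To see what the smoothing does to individual estimates, the following self-contained sketch rebuilds the counts with plain `zip`, assumed here to yield the same adjacent pairs as `nltk.bigrams`, so that it runs without NLTK. It compares a seen bigram with an unseen one and checks that the smoothed probabilities of all vocabulary words following "Alice" sum to 1. The simplified one-argument signature is an assumption made for brevity.

```python
from collections import Counter

training_text = ["Alice", "wonders", "what", "is", "happening", "in", "Wonderland"]

# Adjacent word pairs; stands in for nltk.bigrams in this sketch.
bigram_counts = Counter(zip(training_text, training_text[1:]))
unigram_counts = Counter(training_text)
vocabulary_size = len(set(training_text))  # 7 distinct words


def bigram_probability(bigram, alpha=1):
    """Add-alpha estimate of P(w2 | w1), using the module-level counts."""
    w1, _ = bigram
    return (bigram_counts[bigram] + alpha) / (unigram_counts[w1] + alpha * vocabulary_size)


seen = bigram_probability(("Alice", "wonders"))   # (1 + 1) / (1 + 7) = 0.25
unseen = bigram_probability(("Alice", "dreams"))  # (0 + 1) / (1 + 7) = 0.125

# One word gets 2/8 and the other six get 1/8 each, so the total is 1.
total = sum(bigram_probability(("Alice", w)) for w in unigram_counts)

print(seen, unseen, total)
```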




## Testing the model 

To compute perplexity, we first need the probabilities of all the bigrams in a given test sequence. Laplace smoothing ensures that no bigram has zero probability, which would otherwise make the perplexity infinite.
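The effect of a single zero probability can be sketched directly from the definition. The probability lists below are made-up illustrations, not values computed from the corpus, and the helper `perplexity` is hypothetical. Because `math.log2(0)` raises an error, the helper returns infinity explicitly in that case.

```python
import math


def perplexity(probs):
    """Perplexity of a sequence given its per-word probabilities."""
    if any(p == 0 for p in probs):
        # log2(0) is undefined; the perplexity diverges to infinity.
        return math.inf
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))


# Without smoothing, an unseen bigram would get probability 0.
unsmoothed = [0.25, 0.0, 0.25]
# With smoothing, the same bigram keeps a small but nonzero probability.
smoothed = [0.25, 0.125, 0.25]

print(perplexity(unsmoothed))  # inf
print(perplexity(smoothed))    # finite
```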

In [10]:
import numpy as np

# Function to calculate perplexity
def calculate_perplexity(test_text, bigram_counts, unigram_counts, vocabulary_size, alpha=1):
    """Calculate the perplexity of a test text using a bigram language model."""
    N = len(test_text)
    log_probability_sum = 0

    # Sum the log2-probabilities over the N - 1 bigrams of the sequence
    for i in range(1, N):
        bigram = (test_text[i - 1], test_text[i])
        probability = bigram_probability(bigram, bigram_counts, unigram_counts, vocabulary_size, alpha)
        log_probability_sum += np.log2(probability)

    # Dividing by N (not N - 1) treats the first word as if it had probability 1
    perplexity = 2 ** (-log_probability_sum / N)
    return perplexity

# Test text (deliberately identical to the training text, to compare with an unseen sequence later)
test_text = ["Alice", "wonders", "what", "is", "happening", "in", "Wonderland"]

#Calculate perplexity
perplexity = calculate_perplexity(test_text, bigram_counts, unigram_counts, vocabulary_size)
print(f"Perplexity of the test text : {perplexity}")

Perplexity of the test text : 3.2813414240305514


In [11]:
#Test with unseen sequence
unseen_test_text = ["Alice", "dreams", "about", "Wonderland"]

#Calculate perplexity for unseen sequence
perplexity_unseen = calculate_perplexity(unseen_test_text, bigram_counts, unigram_counts, vocabulary_size)

print(f"Perplexity of the unseen test text : {perplexity_unseen}")

Perplexity of the unseen test text : 4.449605586254059
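Note that `calculate_perplexity` sums over the N - 1 bigrams of a sequence but divides by N. A common alternative is to divide by the number of words actually predicted. The sketch below is that variant under the same add-one model. The helper name `perplexity_per_bigram` is hypothetical, and `zip` stands in for `nltk.bigrams` so the example runs on its own. It yields larger values than the ones printed above, but the ordering of the two sequences is unchanged.

```python
import math
from collections import Counter

training_text = ["Alice", "wonders", "what", "is", "happening", "in", "Wonderland"]
bigram_counts = Counter(zip(training_text, training_text[1:]))
unigram_counts = Counter(training_text)
vocabulary_size = len(set(training_text))


def perplexity_per_bigram(tokens, alpha=1):
    """Perplexity normalized by the number of bigrams (hypothetical variant)."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = 0.0
    for w1, w2 in pairs:
        # Same add-alpha estimate as bigram_probability.
        p = (bigram_counts[(w1, w2)] + alpha) / (unigram_counts[w1] + alpha * vocabulary_size)
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(pairs))


seen_ppl = perplexity_per_bigram(training_text)
unseen_ppl = perplexity_per_bigram(["Alice", "dreams", "about", "Wonderland"])

print(seen_ppl)    # 4.0: each of the six training bigrams has probability 0.25
print(unseen_ppl)  # about 7.32, the cube root of 8 * 7 * 7
```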


## Conclusion
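As expected, perplexity is higher for the unseen sequence. Every bigram of `test_text` occurs in the training sample, so `bigram_probability` returns (1 + 1) / (1 + 7) = 0.25 for each of them. None of the bigrams of `unseen_test_text` occurs in training, so they receive only the smoothing mass: 1/8 for ("Alice", "dreams"), and 1/7 for the two bigrams whose first word never appears in the training text.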