# Language Model Metrics: Perplexity - Exercises

### Definition of Perplexity

A commonly used metric for evaluating language models is *perplexity*.

Assume we are given data samples, i.e. sentences $x^{(i)}, i = 1, ..., N$, where every sentence consists of a sequence of tokens (words) $x^{(i)} = x^{(i)}_1 x^{(i)}_2 ... x^{(i)}_{k_i}$.
The perplexity of a model on the given data is defined as

$ PPL(\theta) = \exp \left( - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{k_i} \log {P_{\theta}(x^{(i)}_{k}|x^{(i)}_1...x^{(i)}_{k-1})} \right) $

where $\theta$ are the model's parameters and $P_{\theta}(x^{(i)}_{k}|x^{(i)}_1...x^{(i)}_{k-1})$ is the probability that the model outputs $x^{(i)}_{k}$ given the previous sequence of tokens.
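
As a quick sanity check of the definition, consider a hypothetical data set of $N = 2$ samples, each consisting of a single token, to which the model assigns the probabilities $0.5$ and $0.25$ (illustrative values, not taken from any model):

$ PPL = \exp \left( - \frac{1}{2} \left( \log 0.5 + \log 0.25 \right) \right) = \exp \left( \frac{1}{2} \log 8 \right) = \sqrt{8} \approx 2.83 $

A model that assigns probability $1$ to every true token would reach the minimum value $PPL = \exp(0) = 1$.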

The following implementation of the softmax function, already known from a previous tutorial, is needed to convert the model scores to probabilities.

In [None]:
import math

import numpy as np

def softmax(values):
    exp_values = np.exp(values)
    exp_values_sum = np.sum(exp_values)
    return exp_values/exp_values_sum
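
The implementation above exponentiates the raw scores directly, which can overflow for large scores (for example, `np.exp(1000.0)` evaluates to `inf`). The following is a minimal sketch of a common workaround, not part of the original tutorial: subtracting the maximum score before exponentiating leaves the result mathematically unchanged. The name `softmax_stable` is hypothetical.

```python
import numpy as np

def softmax_stable(values):
    """Softmax that subtracts the maximum score before exponentiating."""
    values = np.asarray(values, dtype=float)
    shifted = values - np.max(values)  # largest entry becomes 0, so exp() cannot overflow
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

print(softmax_stable([1.0, 2.0, 7.0]))
print(softmax_stable([1000.0, 1001.0]))  # stays finite even for very large scores
```

For moderate scores such as those used in the tests of this tutorial, both variants should return the same values up to floating-point error.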

### Exercise: Implementation of Perplexity

Implement the computation of the perplexity function as defined above.

The function parameters are the scores output by the model (`token_scores`), an array of shape (`n_samples`, `n_classes`), and the true class indices (`true_token_index`), an array of shape (`n_samples`).

If the model scores do not form a probability distribution, the softmax function is first applied to the scores of each prediction.
The last function parameter, `apply_softmax`, indicates whether the softmax function should be applied.

In [None]:
def perplexity(token_scores, true_token_index, apply_softmax=True):
    log_prob_sum = 0
    if apply_softmax:
        token_probabilities = [softmax(scores) for scores in token_scores]
    else:
        token_probabilities = token_scores
    
    ### YOUR SOLUTION HERE
    ### END OF SOLUTION
    
    return perplexity

In [None]:
### test implementation
assert np.isclose(perplexity([[1e8, -1e8]], [0], apply_softmax=False), 0)
assert np.isclose(perplexity([[1,0,0,0]], [0], apply_softmax=False), 1)
assert np.isclose(perplexity([[0.5, 0.5]], [0], apply_softmax=False), 2)
assert np.isclose(perplexity([[1, 2, 7], [2, -1, 0], [0, 1, 0], [1, 0.2, 0.2]], [2, 0, 1, 0], apply_softmax=True), 1.409032255704535)

### Exercise: Perplexity function using Cross Entropy Loss

You might have noticed that the perplexity function closely resembles the cross-entropy loss, which we have already seen in previous lectures and tutorials.
Remember:

$ \text{CrossEntropyLoss} = - \sum_{i=1}^{N} \sum_{k=1}^{k_i} \log {P_{\theta}(x^{(i)}_{k}|x^{(i)}_1...x^{(i)}_{k-1})} $

The cross-entropy loss is already implemented in PyTorch's class `torch.nn.CrossEntropyLoss` (compare the previous tutorial).
If the loss is initialized without any parameters, the returned result is already averaged over the number of samples in the data (the default `reduction='mean'`). This could be changed by setting the named parameter `reduction='sum'` or `reduction='none'`, but that is not necessary in this case.
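
To make the effect of the `reduction` parameter concrete, here is a small sketch with made-up scores for three predictions over four classes. It only compares the three reduction modes and does not compute a perplexity. The variable names are chosen for illustration.

```python
import torch

# Illustrative scores (logits) for 3 predictions over 4 classes; values are made up
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.3, 1.7, 0.0, 0.2],
                       [1.0, 1.0, 1.0, 1.0]])
targets = torch.tensor([0, 1, 3])  # true class indices (int64)

loss_mean = torch.nn.CrossEntropyLoss()(logits, targets)                  # default: averaged over samples
loss_sum = torch.nn.CrossEntropyLoss(reduction='sum')(logits, targets)    # summed over samples
loss_none = torch.nn.CrossEntropyLoss(reduction='none')(logits, targets)  # one loss value per sample

print(loss_mean.item(), loss_sum.item(), loss_none.tolist())
```

The averaged loss equals the summed loss divided by the number of samples, and the per-sample losses add up to the summed loss.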

Use this existing implementation to compute the perplexity score based on the cross entropy loss.

The function `perplexity_ce_based` takes the same parameters as the previously implemented function `perplexity`, except for `apply_softmax`: `torch.nn.CrossEntropyLoss` expects unnormalized scores and applies the (log-)softmax internally.
The first step in the implementation is to convert the given arrays to tensors, so that they can be passed to PyTorch's cross-entropy computation.

In [None]:
import torch

def perplexity_ce_based(token_scores, true_token_index):
    token_scores_tensor = torch.tensor(token_scores)
    true_token_index_tensor = torch.tensor(true_token_index).long()
    ### YOUR SOLUTION HERE
    ### END OF SOLUTION
    return perplexity

The following code cell initializes randomized numpy arrays which can be used to test your function implementations. 

If implemented correctly, the difference between the functions' return values should be extremely close to zero.

In [None]:
### randomized test case
scores = np.double(np.random.random((12, 8)))
true_classes = np.random.randint(0, 8, 12)

ppl_1 = perplexity(scores, true_classes, apply_softmax=True)
ppl_2 = perplexity_ce_based(scores, true_classes)

ppl_diff = np.abs(ppl_1 - ppl_2)
print(ppl_diff)

### Application of Perplexity

Next, we will apply the perplexity computation to the n-gram model from the previous exercise.

The following code cell once again defines a small sample corpus and computes the corresponding n-gram frequencies for n=3.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

n = 3

corpus = [
    "sos the fox is brown and quick eos",
    "sos the dog is brown and lazy eos",
    "sos the dog is very lazy eos",
    "sos the fox is very quick eos"
]

vectorizer = CountVectorizer(ngram_range=(n, n))  # Generate n-grams of size n
X = vectorizer.fit_transform(corpus)
n_grams = vectorizer.get_feature_names_out()

n_gram_freq = {}
for ngram, count in zip(n_grams, X.toarray().sum(axis=0)):
    n_gram_freq[ngram] = count
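
To make the content of `n_gram_freq` more tangible, the following dependency-free sketch counts the same trigrams by hand. It assumes plain whitespace tokenization, which should agree with `CountVectorizer` for this corpus because all words are lowercase and longer than one character. The helper `count_ngrams` is a hypothetical name, not part of the tutorial.

```python
from collections import Counter

corpus = [
    "sos the fox is brown and quick eos",
    "sos the dog is brown and lazy eos",
    "sos the dog is very lazy eos",
    "sos the fox is very quick eos",
]

def count_ngrams(sentences, n):
    """Count all word n-grams; n-grams never span two sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts

trigram_counts = count_ngrams(corpus, 3)
print(trigram_counts["sos the fox"], trigram_counts["is very quick"])
```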

We use another `CountVectorizer` to get a list of all single tokens in the corpus vocabulary.
This list is needed later to compute the probability of each token from the n-gram frequencies.

In [None]:
vectorizer_single_tokens = CountVectorizer(ngram_range=(1, 1))
vectorizer_single_tokens.fit_transform(corpus)
tokens = vectorizer_single_tokens.get_feature_names_out()

We will use a single test sentence to compute the n-gram model's perplexity score.

In [None]:
eval_text = ["sos the fox is very quick eos"]

Next, we will implement a function which computes, for each word in the vocabulary, the probability that it is the next token (even if the corresponding n-gram does not occur in the corpus).
These probabilities are required to compute a perplexity score for the model.
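
One common choice, assumed here rather than prescribed by the exercise, is the maximum-likelihood estimate without smoothing: the count of the n-gram formed by the prefix and the candidate token $w$, divided by the summed counts of all n-grams that start with the prefix.

$ P(w \mid \text{prefix}) = \frac{\text{count}(\text{prefix } w)}{\sum_{w'} \text{count}(\text{prefix } w')} $

In the sample corpus, the prefix `fox is` is continued once by `brown` and once by `very`. Under this assumption, $P(\text{very} \mid \text{fox is}) = 1/2$, while every token that never follows `fox is` receives probability $0$.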

In [None]:
def get_next_token_probabilities(ngram_prefix, n_gram_freq, tokens):
    candidates = {ngram: count for ngram, count in n_gram_freq.items() if ngram.startswith(ngram_prefix)}
    freq_sum = sum(candidates.values())
    # probs will be the list containing the probabilities for each token in tokens to be predicted by the n-gram language model 
    probs = []
    ### YOUR SOLUTION HERE
    # for each token in tokens, compute the corresponding probability and append it to the list 'probs'
    ### END OF SOLUTION
    return probs

Now, we will iterate over all n-grams in the evaluation text and compute, for each token in the vocabulary, the probability that it is output as the final token of the n-gram.
These probabilities are collected in the array `eval_ngram_probabilities`.
At the same time, the corresponding indices of the true next tokens are stored in the array `eval_true_tokens`.
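
The iteration over n-grams can be pictured as a sliding window over the words of a sentence. The sketch below shows only this windowing step, assuming whitespace tokenization. The generator `iter_prefix_target_pairs` is a hypothetical helper, and looking up the probabilities and the true token indices is left to the exercise.

```python
def iter_prefix_target_pairs(sentence, n):
    """Yield (prefix, target) for every n-gram of a whitespace-tokenized sentence."""
    words = sentence.split()
    for start in range(len(words) - n + 1):
        prefix = " ".join(words[start:start + n - 1])  # the first n-1 words of the window
        target = words[start + n - 1]                  # the word the model should predict
        yield prefix, target

for prefix, target in iter_prefix_target_pairs("sos the fox is very quick eos", 3):
    print(repr(prefix), "->", target)
```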

In [None]:
eval_ngram_probabilities = []
eval_true_tokens = []

### YOUR SOLUTION HERE
### END OF SOLUTION

Based on the previously computed token probabilities, we can compute a perplexity score.

Note that the softmax function should not be applied within the perplexity computation here, as our model already outputs a probability distribution.
(Applying the softmax function repeatedly levels out the differences between the scores.)
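
The levelling effect can be observed directly with a small experiment. The sketch below redefines a compact `softmax` so that it runs on its own and applies it to a made-up vector that already is a probability distribution.

```python
import numpy as np

def softmax(values):
    exp_values = np.exp(values)
    return exp_values / np.sum(exp_values)

probs = np.array([0.7, 0.2, 0.1])  # already a probability distribution (made-up values)
once = softmax(probs)              # softmax applied although it is not needed
twice = softmax(once)              # applied a second time

print(probs, once, twice)
```

With each application the largest probability shrinks and the smallest grows, so the distribution moves toward uniform and the resulting perplexity no longer reflects the model's actual predictions.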

In [None]:
ppl = perplexity(eval_ngram_probabilities, eval_true_tokens, apply_softmax=False)
print(ppl)