# Perplexity for language models


## introduction

### What is perplexity?

- Perplexity is a metric used to evaluate language models, in particular how well they predict the next token in a sequence given the preceding tokens.
- it quantifies how well a probability model predicts a sample, and is defined mathematically as the exponential of the average negative log-likelihood per token of a sequence
- the lower the better → the model assigns higher probability to the tokens it observes, i.e. it is more confident → (in the context of LLMs, a low perplexity on the model's own output tends to go along with more consistent output)

### the formula of perplexity:

$$
\text{Perplexity} = 2^{-\frac{1}{N}\sum_{i=1}^N \log_2 p(x_i|x_1, \dots, x_{i-1})}
$$

where

- $x_i$ is the current token, and $N$ is the number of tokens in the sequence
- $x_1, \dots, x_{i-1}$ are all the tokens that come before $x_i$
- $p(x_i|x_1, \dots, x_{i-1})$ is the conditional probability of $x_i$ occurring, given that we've already seen the sequence $x_1, \dots, x_{i-1}$; it represents how likely the model thinks the token $x_i$ is to appear next, based on the preceding tokens
- $\log_2 p(x_i|x_1, \dots, x_{i-1})$ is the log-likelihood of the token $x_i$, i.e. the logarithm of the conditional probability above. we use logarithms for reasons such as numerical stability and converting products into sums
- $-\sum_{i=1}^N \log_2 p(\dots)$ is the negative log-likelihood of the entire sequence of tokens
- $\frac{1}{N} \sum_{i=1}^{N} -\log_2 p(x_i|x_1, \dots, x_{i-1})$ is the average negative log-likelihood per token, which is an empirical estimate of the cross-entropy between the true distribution of the data and our model's distribution
- $2^{...}$ "cancels out" the base 2 logarithm used in the sum, which allows for a particular interpretation of perplexity. any base gives the same result as long as the exponentiation and the logarithm use the same one (for example $e$ together with the natural logarithm)
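
As a small worked example of the formula, the sketch below computes perplexity for a sequence of four tokens. The probabilities are made up for illustration and do not come from any model, and `perplexity_from_probs` is a hypothetical helper name.

```python
import math


def perplexity_from_probs(probs):
    """Perplexity from per-token conditional probabilities p(x_i | x_1..x_{i-1})."""
    n = len(probs)
    # average negative log2-likelihood per token (cross-entropy in bits)
    avg_nll_bits = -sum(math.log2(p) for p in probs) / n
    # exponentiate with the same base as the logarithm
    return 2 ** avg_nll_bits


# hypothetical probabilities the model assigned to the four observed tokens
probs = [0.5, 0.25, 0.5, 0.25]
print(f"Perplexity: {perplexity_from_probs(probs):.2f}")
```

Here the log-probabilities are -1, -2, -1 and -2 bits, so the average negative log-likelihood is 1.5 bits and the perplexity is $2^{1.5} \approx 2.83$.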

### interpretation of perplexity

perplexity can be thought of as the effective number of choices the model has when predicting each token (a weighted average branching factor). a perplexity of 10 means that the model is, on average, as uncertain as if it had to choose uniformly between 10 possibilities for each token.
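
To make this reading concrete, here is a minimal sketch using made-up probabilities (not taken from any model). It compares a model that gives every observed token a probability of 1/10 with a hypothetical, more confident model.

```python
import math


def perplexity(probs):
    """exp of the mean negative natural-log probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)


# every observed token received probability 1/10 (uniform over 10 choices)
uniform_ppl = perplexity([1 / 10] * 20)

# a hypothetical, more confident model on the same 20 tokens
confident_ppl = perplexity([0.8] * 20)

print(f"uniform over 10 choices: {uniform_ppl:.2f}")
print(f"confident model:         {confident_ppl:.2f}")
```

The uniform case gives a perplexity of 10 and the confident case gives 1 / 0.8 = 1.25. Put differently, perplexity is the reciprocal of the geometric mean of the probabilities assigned to the observed tokens.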

### perplexity vs cross-entropy

since perplexity and cross-entropy are directly related, why do we still need perplexity?

- **interpretability**: while cross-entropy measures the average uncertainty in predicting the next token (in bits or nats), perplexity translates this uncertainty into a more interpretable form, an effective number of choices (see above)
- **historical precedent**: in the context of language models, perplexity has been a standard metric since the early days, hence it is easier to compare new results with older benchmarks
- **scale**: perplexity is an exponential function of cross-entropy, so a small absolute change in cross-entropy shows up as a multiplicative change in perplexity, which makes differences between models easier to see
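
The relationship between the two quantities can be checked numerically. The sketch below uses illustrative probabilities (an assumption, not model output). It shows that the base of the logarithm does not matter as long as the exponentiation matches it, and that an improvement of 0.1 nats in cross-entropy scales perplexity by a factor of $e^{-0.1}$.

```python
import math

# illustrative per-token probabilities, not produced by a real model
probs = [0.2, 0.5, 0.1, 0.4]

ce_nats = -sum(math.log(p) for p in probs) / len(probs)   # cross-entropy in nats
ce_bits = -sum(math.log2(p) for p in probs) / len(probs)  # cross-entropy in bits

ppl_from_nats = math.exp(ce_nats)  # base e matches the natural logarithm
ppl_from_bits = 2 ** ce_bits       # base 2 matches log2

# lowering the cross-entropy by 0.1 nats multiplies perplexity by exp(-0.1)
ratio = math.exp(ce_nats - 0.1) / ppl_from_nats

print(f"cross-entropy: {ce_nats:.3f} nats = {ce_bits:.3f} bits")
print(f"perplexity: {ppl_from_nats:.3f} (base e) vs {ppl_from_bits:.3f} (base 2)")
print(f"perplexity ratio after a 0.1-nat improvement: {ratio:.3f}")
```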


## how to calculate perplexity

- Tokenization: segment the text into tokens.
- Log-Likelihood Calculation: for each token in the sequence, compute its log-likelihood (the log of the probability the model assigns to it), conditioned on the preceding tokens.
- Average Negative Log-Likelihood: take the mean of these log-likelihoods across all tokens and negate it.
- Exponentiation: finally, exponentiate the average negative log-likelihood, using the same base as the logarithm, to obtain the perplexity score.
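
The four steps can be sketched end to end without a neural network. The example below swaps in a toy add-one-smoothed bigram model trained on a made-up sentence; the whitespace tokenizer, the training text and the helper names are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # step 1: a deliberately simple whitespace tokenizer
    return text.lower().split()


def train_bigram(tokens, vocab):
    """Return p(cur | prev) for an add-one (Laplace) smoothed bigram model."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1

    def prob(cur, prev):
        return (counts[prev][cur] + 1) / (sum(counts[prev].values()) + len(vocab))

    return prob


train_tokens = tokenize("the cat sat on the mat the cat ate")
test_tokens = tokenize("the cat sat on the mat")
vocab = set(train_tokens) | set(test_tokens)
prob = train_bigram(train_tokens, vocab)

# step 2: log-likelihood of each token given the previous one
# (the first token has no context in a bigram model, so it is not scored)
log_likelihoods = [
    math.log(prob(cur, prev)) for prev, cur in zip(test_tokens, test_tokens[1:])
]

# step 3: average negative log-likelihood (in nats)
avg_nll = -sum(log_likelihoods) / len(log_likelihoods)

# step 4: exponentiate with the matching base
ppl = math.exp(avg_nll)
print(f"Perplexity: {ppl:.3f}")
```

A neural language model replaces only step 2: the conditional probabilities come from the model's softmax output instead of smoothed counts.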


# example I: perplexity for GPT2

for autoregressive models like GPT2, the terms "loss" and "negative log-likelihood" are often used interchangeably, because:

- the standard loss function used during training is the cross-entropy loss, which is calculated over the entire sequence
- for classification problems (which next-token prediction essentially is), the cross-entropy loss is equivalent to the negative log-likelihood of the correct token
- the loss returned by `outputs.loss` is the cross-entropy loss averaged over the predicted tokens, which is mathematically equivalent to the average negative log-likelihood per token in the sequence (measured in nats, since PyTorch uses the natural logarithm)
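
The equivalence in the second point can be verified by hand for a single prediction. This is a minimal pure-Python sketch with a made-up three-token vocabulary and made-up logits; `softmax` and `cross_entropy` are small helpers written for the illustration, not library calls.

```python
import math


def softmax(logits):
    # subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, target):
    """Cross-entropy between a target distribution and a predicted one (nats)."""
    return -sum(t * math.log(p) for t, p in zip(target, pred))


logits = [2.0, 0.5, -1.0]  # hypothetical model scores for 3 candidate tokens
probs = softmax(logits)

correct = 0  # index of the token that actually came next
one_hot = [1.0 if i == correct else 0.0 for i in range(len(probs))]

ce = cross_entropy(probs, one_hot)
nll = -math.log(probs[correct])

print(f"cross-entropy: {ce:.6f}")
print(f"negative log-likelihood of the correct token: {nll:.6f}")
```

Because the target is one-hot, every term of the cross-entropy sum vanishes except the one for the correct token, which leaves exactly its negative log-likelihood.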


In [14]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch
import math

# Load pre-trained model and tokenizer
model_id = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

# Set the model to evaluation mode
model.eval()

# Example text
text = "This is an example sentence to evaluate."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Passing labels makes the model return the cross-entropy loss, i.e. the
# mean negative log-likelihood per predicted token in nats
# (the labels are shifted internally so that each token is predicted from its prefix)
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
    neg_log_likelihood = outputs.loss.item()

# Perplexity is the exponential of the mean negative log-likelihood
perplexity = math.exp(neg_log_likelihood)

print(f"Perplexity: {perplexity:.2f}")

Perplexity: 102.93


# example II: perplexity for BERT models

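BERT is a bidirectional model trained with a masked language model (MLM) objective, so it does not define a left-to-right probability for a sequence, and its perplexity cannot be calculated the same way as for GPT2. instead, we mask a token and score the model's prediction for that position; the example below does this for a single masked token. repeating this for every position and averaging the negative log-probabilities gives what is usually called pseudo-perplexity.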


In [31]:
import warnings

import torch
from transformers import BertForMaskedLM, BertTokenizer

warnings.filterwarnings("ignore")

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Input text with masked tokens
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Get the predicted probabilities for the masked token
masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
predicted_probs = torch.softmax(predictions[0, masked_index], dim=-1)

# Calculate log probabilities for the actual token
actual_token = tokenizer.convert_tokens_to_ids("paris")  # Example actual token
log_probs = torch.log(predicted_probs[:, actual_token])

# Calculate perplexity
perplexity = torch.exp(-log_probs.mean()).item()
print(f"Perplexity: {perplexity}")



Perplexity: 2.399293899536133
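
To score a whole sentence rather than one position, the same idea can be applied to every token in turn. The sketch below shows only the loop structure: `toy_masked_prob` is a hypothetical stand-in for a masked language model (a lookup table with made-up probabilities), which in practice would be replaced by a call to a model such as the BERT one above.

```python
import math


def pseudo_perplexity(tokens, masked_prob):
    """exp of the mean negative log-probability of each token when it alone is masked."""
    nlls = []
    for i, target in enumerate(tokens):
        # mask position i and keep every other token visible
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        p = masked_prob(masked, i, target)
        nlls.append(-math.log(p))
    return math.exp(sum(nlls) / len(nlls))


def toy_masked_prob(masked_tokens, position, target):
    # hypothetical scorer: ignores the context and returns made-up probabilities
    table = {"the": 0.5, "capital": 0.2, "of": 0.8, "france": 0.1, "is": 0.8, "paris": 0.4}
    return table.get(target, 0.01)


tokens = ["the", "capital", "of", "france", "is", "paris"]
print(f"Pseudo-perplexity: {pseudo_perplexity(tokens, toy_masked_prob):.3f}")
```

With a real model this needs one forward pass per token, so the cost grows with sentence length.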


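# example III: perplexity for ChatGPT

the weights of ChatGPT models are not available, so we cannot score arbitrary text with them. the chat completions API can, however, return the log probabilities of the tokens it generates, which lets us calculate the perplexity of the model's own response.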


In [2]:
import numpy as np
from openai import OpenAI

client = OpenAI()


def get_response_and_perplexity(text, model="gpt-3.5-turbo"):

    # Prepare messages for the chat completion
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": text},
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=100,
        logprobs=True,
        top_logprobs=5,
    )

    response_text = response.choices[0].message.content

    # extract log probabilities
    logprobs = response.choices[0].logprobs.content

    # collect the log probability of each generated (selected) token;
    # every token entry carries it directly in its `logprob` field
    token_logprobs = [token_info.logprob for token_info in logprobs]

    # calculate cross-entropy (in nats: the API returns natural-log probabilities)
    cross_entropy = -np.mean(token_logprobs)

    # calculate perplexity; the base of the exponential must match the base of the logarithm
    perplexity = np.exp(cross_entropy)

    return response_text, perplexity

In [3]:
test_text = "what is self-RAG"
response, perplexity = get_response_and_perplexity(test_text)
print(f"Response: {response}")
print(f"Perplexity: {perplexity:.4f}")

Response: I'm not familiar with the term "self-RAG." It could be a specific acronym or term used in a particular context. If you provide more information or context, I may be able to help you better.
Perplexity: 1.2662


In [4]:
test_text = "who is superman"
response, perplexity = get_response_and_perplexity(test_text)
print(f"Response: {response}")
print(f"Perplexity: {perplexity:.4f}")

Response: Superman is a fictional superhero appearing in American comic books published by DC Comics. He was created by writer Jerry Siegel and artist Joe Shuster and first appeared in Action Comics #1 in 1938. Superman is known for his superhuman abilities, including super strength, flight, invulnerability, heat vision, and more. He is also known as Clark Kent, a journalist for the Daily Planet in the fictional city of Metropolis.
Perplexity: 1.0995


In [12]:
test_text = "which is larger, 9.11 or 9,9?"
response, perplexity = get_response_and_perplexity(test_text)
print(f"Response: {response}")
print(f"Perplexity: {perplexity:.4f}")

Response: 9.11 is larger than 9.9.
Perplexity: 1.0141


In [13]:
test_text = "which is larger, 9.11 or 9.9?"
response, perplexity = get_response_and_perplexity(test_text)
print(f"Response: {response}")
print(f"Perplexity: {perplexity:.4f}")

Response: 9.9 is larger than 9.11.
Perplexity: 1.0015


In [7]:
test_text = "how many r are there in strawberry?"
response, perplexity = get_response_and_perplexity(test_text)
print(f"Response: {response}")
print(f"Perplexity: {perplexity:.4f}")

Response: There are 2 "r"s in the word "strawberry."
Perplexity: 1.1977
