# Purpose

This notebook serves to understand how the perplexity of a model is calculated

# Preparation

## 1. Installing the dependencies

In [1]:
# %%cmd
# conda install -q -c nvidia cuda-python --yes
# conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia --yes
# conda install -q transformers --yes
# conda install -q plotly --yes
# conda install -q nbformat --yes

# Calculation

## 1. Loading the model

Using this small model [(SmolLM2-135M)](https://huggingface.co/HuggingFaceTB/SmolLM2-135M), we load it onto the device (preference GPU).

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

checkpoint = "HuggingFaceTB/SmolLM2-135M"
# checkpoint = "HuggingFaceTB/SmolLM2-1.7B"
device = "cuda" # for GPU usage or "cpu" for CPU usage

# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Loading the model onto the device
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
print(f'Model loaded with full precision ({model.get_memory_footprint() / 1e6:.2f} MB)')


# # Loading the model (torch.bfloat16) onto the device
# model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
# print(f'Model loaded with torch.bfloat16 ({model.get_memory_footprint() / 1e6:.2f} MB)')

Model loaded with full precision (538.06 MB)


In [3]:
# Testing the model 
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs, pad_token_id=tokenizer.eos_token_id, max_length=50)
print(tokenizer.decode(outputs[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Gravity is the force that holds the Earth and the Moon together.

The Moon is a satellite of the Earth. It is a rocky body that orbits the Earth. The Moon is the only natural satellite of the Earth.

The Moon


# 2. Defining Perplexity

Using some code found online, defining perplexity

$$\text { Perplexity }=\exp \left(-\frac{1}{t} \sum_i^t \log p_\theta\left(x_i \mid x_{\text {context }}\right)\right)$$
- $x_i$ represents the token that is generated
- $x_{\text {context }}$ represents the preceding tokens that the current generated token is conditioned on.

source: [https://docs.kolena.com/metrics/perplexity/](https://docs.kolena.com/metrics/perplexity/)

In [None]:
def soft_max(arr):
    return arr.exp() / arr.exp().sum()

def real_likelihood(pred_dist, real_token):
    pred_dist = soft_max(pred_dist)
    dist_real_token = pred_dist[0, real_token]
    return dist_real_token.log()

def perplexity(input_tokens) -> float:
    perplex = 0
    for n in range(1, input_tokens.shape[1]):
        test_tokens = input_tokens[:, :n].to(device)
        real_token = input_tokens[0, n]
        predicted_distribution = model.generate(
            test_tokens,
            max_length=n+1,
            output_scores=True,
            return_dict_in_generate=True,
            pad_token_id = tokenizer.eos_token_id
        )['scores'][0]
        perplex += real_likelihood(predicted_distribution, real_token)
    return torch.exp(-perplex / input_tokens.shape[1]).to('cpu').item()


def compute_perplexity(model, tokenizer, text: str | list[str]) -> float:
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(text, return_tensors="pt", padding=True).to(device)
    loss = model(
        input_ids=inputs["input_ids"], labels=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    ).loss
    return torch.exp(loss).item()

# 3. Loading the text to evaluate the model on

Using a few sentences in different languages, evaluate the model perplexity on them.

In [9]:
eval_texts = [
    {
        "language": "pt",
        "text": "O meu nome é João. Tenho 20 anos e gosto de comer. Hoje à noite vou comer batatas.",
        "desc": "Talking about myself."
    },
    {
        "language": "en",
        "text": "My name is John. I am 20 years old and I like to eat. Tonight I will eat potatoes.",
        "desc": "Talking about myself."
    },
    {
        "language": 'es',
        'text': 'Mi nombre es Juan. Tengo 20 años y me gusta comer. Esta noche comeré papas.',
        'desc': 'Talking about myself.'
    },
    {
        'language': 'fr',
        'text': 'Mon nom est Jean. J\'ai vingt ans et j\'aime manger. Ce soir, je vais manger des pommes de terre.',
        'desc': 'Talking about myself.'
    },
    # Talking about favorite books and authors
    {
        "language": "pt",
        "text": "Eu gosto de ler livros. Meu autor favorito é Machado de Assis.",
        "desc": "Talking about favorite books and authors"
    },
    {
        "language": "en",
        "text": "I enjoy reading books. My favorite author is Jane Austen.",
        "desc": "Talking about favorite books and authors"
    },
    {
        "language": "es",
        "text": "Me gusta leer libros. Mi autor favorito es Gabriel García Márquez.",
        "desc": "Talking about favorite books and authors"
    },
    {
        "language": "fr",
        "text": "J'aime lire des livres. Mon auteur préféré est Victor Hugo.",
        "desc": "Talking about favorite books and authors"
    },
    # Talking about visiting family
    {
        "language": "pt",
        "text": "Amanhã vou visitar meus avós. Faz tempo que não os vejo.",
        "desc": "Talking about visiting family"
    },
    {
        "language": "en",
        "text": "Tomorrow I'm visiting my grandparents. It's been a while since I last saw them.",
        "desc": "Talking about visiting family"
    },
    {
        "language": "es",
        "text": "Mañana visitaré a mis abuelos. Hace tiempo que no los veo.",
        "desc": "Talking about visiting family"
    },
    {
        "language": "fr",
        "text": "Demain, je vais rendre visite à mes grands-parents. Ça fait longtemps que je ne les ai pas vus.",
        "desc": "Talking about visiting family"
    },

    # Talking about weekend plans
    {
        "language": "pt",
        "text": "No fim de semana, quero ir à praia e relaxar um pouco.",
        "desc": "Talking about weekend plans"
    },
    {
        "language": "en",
        "text": "This weekend, I want to go to the beach and relax a bit.",
        "desc": "Talking about weekend plans"
    },
    {
        "language": "es",
        "text": "Este fin de semana quiero ir a la playa y relajarme un poco.",
        "desc": "Talking about weekend plans"
    },
    {
        "language": "fr",
        "text": "Ce week-end, je veux aller à la plage et me détendre un peu.",
        "desc": "Talking about weekend plans"
    }
]

In [10]:
for n, input_text in enumerate(eval_texts):
    input_tokens = tokenizer.encode(input_text['text'], return_tensors="pt").to(device)
    eval_texts[n]['perplexity'] = perplexity(input_tokens)
    eval_texts[n]['compute_perplexity'] = compute_perplexity(model, tokenizer, input_text['text'])
    print(f'lan: {input_text["language"]} | {input_text["desc"]} - perplexity = {input_text["perplexity"]}  | {input_text["compute_perplexity"]}')

lan: pt | Talking about myself. - perplexity = 32.49676513671875  | 36.00029373168945
lan: en | Talking about myself. - perplexity = 9.010916709899902  | 9.875325202941895
lan: es | Talking about myself. - perplexity = 20.19571304321289  | 22.251785278320312
lan: fr | Talking about myself. - perplexity = 14.724316596984863  | 15.866511344909668
lan: pt | Talking about favorite books and authors - perplexity = 39.04146194458008  | 45.7850227355957
lan: en | Talking about favorite books and authors - perplexity = 16.815593719482422  | 21.733985900878906
lan: es | Talking about favorite books and authors - perplexity = 25.389171600341797  | 30.100793838500977
lan: fr | Talking about favorite books and authors - perplexity = 45.63804244995117  | 54.74489212036133
lan: pt | Talking about visiting family - perplexity = 106.76671600341797  | 132.0194091796875
lan: en | Talking about visiting family - perplexity = 10.95315933227539  | 12.511032104492188
lan: es | Talking about visiting family 

# 4. A quick analysis

Quickly comparing perplexity between languages

In [None]:
import pandas as pd
pd.options.plotting.backend = "plotly"

df = pd.DataFrame(eval_texts)

In [None]:
df.pivot(index='desc', columns='language', values='perplexity').plot(kind='line')