## Explanation

### PLL

Pseudo-log-likelihood (PLL) is a scoring method for masked language models, described in this paper: https://aclanthology.org/2020.acl-main.240.pdf.

It scores a sentence by masking one token at a time and asking the model for the probability of the original token given all the other tokens in the sentence, both to its left and to its right.


The PLL score is calculated as follows:
1. For each token $w_t$ in the sentence, replace it with the mask token and compute the log probability the model assigns to $w_t$ given the remaining, unmasked tokens.
2. Sum these log probabilities over all tokens: $\mathrm{PLL}(W) = \sum_{t} \log P(w_t \mid W_{\setminus t})$. The sum is not normalized by sentence length; dividing by the number of tokens gives a per-token average that is easier to compare across sentences of different lengths.
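
The masking step can be sketched without loading a model. The helper below is a hypothetical illustration: the name `masked_variants` and the choice to skip special tokens are assumptions, not part of the code in this document, and the ids 101, 102 and 103 are used only as stand-ins for `[CLS]`, `[SEP]` and `[MASK]`.

```python
def masked_variants(token_ids, mask_id, special_ids=()):
    """Yield (position, masked copy) for every token that is not a special token."""
    for position, token in enumerate(token_ids):
        if token in special_ids:
            continue  # special tokens are not scored in this sketch
        masked = list(token_ids)    # copy so the original sequence is untouched
        masked[position] = mask_id  # hide exactly one token
        yield position, masked


# Stand-in ids: 101 = [CLS], 102 = [SEP], 103 = [MASK]; 7 and 8 are arbitrary word ids.
sentence = [101, 7, 8, 102]
for position, masked in masked_variants(sentence, mask_id=103, special_ids={101, 102}):
    print(position, masked)
```

Each masked copy would then be passed to the model, and the log probability of the original token at `position` added to the running sum.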


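Because PLL is a sum of log probabilities, it is never positive, and higher values (closer to zero) mean that the model assigns higher probabilities to the true tokens, i.e. it is better at predicting the masked tokens. The code below sums cross-entropy losses instead, which is the negative of the PLL, so for the numbers it prints, lower is better.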
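
To make the sign convention concrete, here is a small arithmetic sketch. The probabilities are made up for illustration and do not come from any model. `pseudo_perplexity` uses the exponentiated negative average and is an assumption about how one might normalize the score.

```python
import math


def pseudo_log_likelihood(true_token_probs):
    """Sum of log P(true token | all other tokens), one term per masked position."""
    return sum(math.log(p) for p in true_token_probs)


def pseudo_perplexity(true_token_probs):
    """Exponential of the negative per-token average of the PLL."""
    return math.exp(-pseudo_log_likelihood(true_token_probs) / len(true_token_probs))


# Hypothetical probabilities assigned to the true token at three masked positions.
probs = [0.5, 0.25, 0.125]
pll = pseudo_log_likelihood(probs)
pppl = pseudo_perplexity(probs)
print(pll)   # approximately -4.16: a sum of logs of values below 1 is negative
print(-pll)  # the summed cross-entropy, which is what a loss-based loop accumulates
print(pppl)  # approximately 4.0
```

A better model would push each probability toward 1, moving the PLL toward 0 and the pseudo-perplexity toward 1.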

### Perplexity 

Perplexity is a measure of how well a probability model predicts a sample.
It is calculated as the exponential of the average cross-entropy loss between the predicted token distributions and the true tokens.


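Lower perplexity values indicate that the model assigns higher probabilities to the true tokens in the input text. For an autoregressive model this means it is better at predicting the next token in a sequence. BERT, used below, is a masked language model, and the code scores it on the unmasked input, so the printed value is best read as the exponential of that loss rather than as a conventional next-token perplexity.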
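
As a sanity check on the definition, the following sketch computes perplexity from raw scores in plain Python. The function names are hypothetical and the logits are invented. A model that is uniformly unsure over 4 candidate tokens should come out with a perplexity of about 4, and a confident, correct model close to 1.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max before exponentiating for numerical stability
    log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_normalizer - logits[target]


def perplexity(logits_per_position, targets):
    """Exponential of the mean cross-entropy over all positions."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
    return math.exp(sum(losses) / len(losses))


uniform = [[0.0, 0.0, 0.0, 0.0]] * 3      # no preference among 4 tokens
confident = [[10.0, 0.0, 0.0, 0.0]] * 3   # strongly prefers token 0
print(perplexity(uniform, [0, 1, 2]))     # approximately 4.0
print(perplexity(confident, [0, 0, 0]))   # close to 1
```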

## A PyTorch version

In [1]:
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Define a function to calculate the pseudo-log-likelihood (PLL) score for a given masked language model, tokenizer, and input text.
# Note: it returns the sum of the per-token cross-entropy losses, i.e. the negative of the PLL, so lower is better.
def calculate_pll(model, tokenizer, text):
    # Tokenize the input text (this also adds the special [CLS] and [SEP] tokens)
    input_ids = tokenizer.encode(text, return_tensors='pt')
    # Define the loss function as cross-entropy loss
    loss_fn = torch.nn.CrossEntropyLoss()
    # Initialize the running sum to 0
    pll = 0
    # Iterate over each token position in the input, including the special tokens
    for i in range(len(input_ids[0])):
        # Create a copy of the input ids and mask the current token
        masked_input_ids = input_ids.clone()
        masked_input_ids[0][i] = tokenizer.mask_token_id
        # Run the model on the masked input text without calculating gradients
        with torch.no_grad():
            outputs = model(masked_input_ids)
            predictions = outputs[0]
        # Cross-entropy between the logits at the masked position and the true token, added to the running sum
        pll += loss_fn(predictions[0][i].unsqueeze(0), input_ids[0][i].unsqueeze(0)).item()
    # Return the summed loss (the negative PLL)
    return pll

In [2]:
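# Define a function to calculate the perplexity for a given language model, tokenizer, and input text.
# Note: the input is passed unmasked, so a masked language model can see every token it is scored on.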
def calculate_perplexity(model, tokenizer, text):
    # Tokenize the input text
    input_ids = tokenizer.encode(text, return_tensors='pt')
    # Run the model on the input text with labels and without calculating gradients
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
    # Calculate and return the perplexity as the exponential of the cross-entropy loss
    return torch.exp(loss).item()

In [3]:
# Example usage:
# Define the name of the pre-trained language model to use
model_name = 'bert-base-uncased'
# Load the tokenizer associated with the pre-trained language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the pre-trained language model for masked language modeling
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Define an example input text
text = "I love my video Games"
# Calculate PLL and perplexity for the given language model, tokenizer, and input text using the defined functions
pll = calculate_pll(model, tokenizer, text)
perplexity = calculate_perplexity(model, tokenizer, text)

# Print out the calculated PLL and perplexity values
print(f'PLL: {pll}')
print(f'Perplexity: {perplexity}')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


PLL: 82.82536670565605
Perplexity: 87.86251831054688


## A TensorFlow version

In [4]:
# Import necessary libraries
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM, AutoTokenizer

# Define a function to calculate the pseudo-log-likelihood (PLL) score for a given masked language model, tokenizer, and input text.
# Note: it returns the sum of the per-token cross-entropy losses, i.e. the negative of the PLL, so lower is better.
def calculate_pll(model, tokenizer, text):
    # Tokenize the input text (this also adds the special [CLS] and [SEP] tokens)
    input_ids = tokenizer.encode(text, return_tensors='tf')
    # Define the loss function as sparse categorical cross-entropy on raw logits
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    # Initialize the running sum to 0
    pll = 0
    # Iterate over each token position in the input, including the special tokens
    for i in range(len(input_ids[0])):
        # Create a copy of the input ids and mask the current token
        masked_input_ids = input_ids.numpy().copy()
        masked_input_ids[0][i] = tokenizer.mask_token_id
        masked_input_ids = tf.convert_to_tensor(masked_input_ids)
        # Run the model on the masked input text
        outputs = model(masked_input_ids)
        predictions = outputs[0]
        # Cross-entropy between the logits at the masked position and the true token, added to the running sum
        pll += loss_fn(input_ids[0][i], predictions[0][i]).numpy()
    # Return the summed loss (the negative PLL)
    return pll

In [5]:
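# Define a function to calculate the perplexity for a given language model, tokenizer, and input text.
# Note: the input is passed unmasked, so a masked language model can see every token it is scored on.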
def calculate_perplexity(model, tokenizer, text):
    # Tokenize the input text
    input_ids = tokenizer.encode(text, return_tensors='tf')
    # Run the model on the input text with labels
    outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
    # Calculate and return the perplexity as the exponential of the cross-entropy loss
    return tf.math.exp(loss).numpy()

In [6]:
# Example usage:
# Define the name of the pre-trained language model to use
model_name = 'bert-base-uncased'
# Load the tokenizer associated with the pre-trained language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the pre-trained language model for masked language modeling
model = TFAutoModelForMaskedLM.from_pretrained(model_name)

# Define an example input text
text = "I love my video Games"
# Calculate PLL and perplexity for the given language model, tokenizer, and input text using the defined functions
pll = calculate_pll(model, tokenizer, text)
perplexity = calculate_perplexity(model, tokenizer, text)

# Print out the calculated PLL and perplexity values
print(f'PLL: {pll}')
print(f'Perplexity: {perplexity}')

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


PLL: 82.82089997828007
Perplexity: [87.87434]
