# Measuring perplexity on fine-tuned versions of LLaMA2

Some fine-tuned Llama-2-7b models available on the Hugging Face Hub:

1.   [4i-ai/Llama-2-7b-alpaca-es](https://huggingface.co/4i-ai/Llama-2-7b-alpaca-es)
2.   [cherrybomb3649/llama-2-7b-imdb](https://huggingface.co/cherrybomb3649/llama-2-7b-imdb)
3.   [mrm8488/llama-2-coder-7b](https://huggingface.co/mrm8488/llama-2-coder-7b)
4.   [Harshvir/Llama-2-7B-physics](https://huggingface.co/Harshvir/Llama-2-7B-physics)
5.   [botch/Llama-2-7b-pubmed](https://huggingface.co/botch/Llama-2-7b-pubmed)
6.   [unionai/Llama-2-7b-hf-wikipedia](https://huggingface.co/unionai/Llama-2-7b-hf-wikipedia)

In [1]:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from torch.nn import functional as F
import accelerate
import bitsandbytes  # Works with CUDA
import numpy as np
from tqdm import tqdm
import pandas as pd
import time

# device = torch.device("mps") if torch.backends.mps.is_built() else torch.device("cpu")  # To run on mac
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [2]:
# Get model and tokenizer
model_name = "4i-ai/Llama-2-7b-alpaca-es"
access_token = os.environ["HF_API_KEY"]

# Quantization: https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/quantization
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.float16,
  bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=bnb_config,  token=access_token);  # In colab cache_dir can be set to a folder in GDrive
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, token=access_token);
tokenizer.pad_token = tokenizer.eos_token
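
The 4-bit configuration above matters mostly for memory. The sketch below is a back-of-the-envelope estimate of the weight footprint at different precisions. It assumes a nominal 7e9 parameters (the real checkpoint differs slightly) and ignores activations, the KV cache, and the layers and quantization constants that stay in higher precision, so treat the numbers as rough lower bounds.

```python
# Rough weight-memory estimate for a nominal 7B-parameter model.
# Assumption: exactly 7e9 parameters; overheads of 4-bit loading are ignored.
N_PARAMS = 7e9

def weight_gib(bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return N_PARAMS * bits_per_param / 8 / 2**30

for name, bits in [("float32", 32), ("float16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{name:>12}: {weight_gib(bits):5.1f} GiB")
```

By this estimate the 4-bit weights take about a quarter of the float16 footprint, which is what makes a 7B model fit on a single consumer or Colab GPU.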




## Input Sequences and predictions

In [3]:
prompts_ds = ["¿Qué significa DNA?", "¿Cuál es la capital de Francia?", "Encuentra la capital de España.", "¿Cuáles son los tres colores primarios?", "Genera una lista de 5 adjetivos que describan a una persona como valiente."]
# prompts_other = ["Que signifie l'ADN ?", "Quelle est la capitale de la France ?", "Trouver la capitale de l'Espagne.", "Quelles sont les trois couleurs primaires ?", "Générer une liste de 5 adjectifs qui décrivent une personne comme courageuse. "]
prompts_other = ["What is DNA? Answer in english.", "What is the capital of France?", "What is the capital of Spain?", "What are the three primary colors?", "Generate a list of 5 adjectives that describe a person as brave."]

prompts = prompts_ds + prompts_other
sources = ["ds" for _ in range(len(prompts_ds))] + ["other" for _ in range(len(prompts_other))]

## 1. Using loss to compute perplexity

In [4]:
# Get predictions from model
predictions = []
input_predictions = []
input_length = []
temp = 1

for p in prompts:
  prompt = "### Instruction:\n"+ p +"\n\n### Response:\n"
  model_inputs = tokenizer(prompt, return_tensors="pt").to(device)
  output = model.generate(**model_inputs, temperature=temp, max_new_tokens=20, do_sample=False, output_scores=True, return_dict_in_generate=True)

  predictions.append(tokenizer.decode(output.sequences[0, model_inputs.input_ids.shape[1]:], skip_special_tokens=True))
  input_predictions.append(tokenizer.decode(output.sequences[0], skip_special_tokens=True))
  input_length.append(model_inputs.input_ids.shape[1])



In [5]:
def compute_perplexity(seq, idx):
  # Compute the perplexity of a prediction with a direct model call (forward pass)
  # seq = prompt + prediction, idx = number of prompt tokens (where the prediction begins)
  model_inputs = tokenizer(seq, return_tensors="pt").to(device)
  input_ids = model_inputs.input_ids  # Already on `device`

  target_ids = input_ids.clone()
  target_ids[:, :idx] = -100  # Don't compute the loss over the input

  with torch.no_grad():
    outputs = model(input_ids, labels=target_ids)
    neg_log_likelihood = outputs.loss

    ppl = torch.exp(neg_log_likelihood)

  return ppl

perplexities = []
for prediction, idx in zip(input_predictions, input_length):
  ppl = compute_perplexity(prediction, idx)
  perplexities.append(ppl.item())
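
The function above relies on two facts: labels equal to `-100` are skipped by the loss, and perplexity is the exponential of the mean negative log-likelihood over the remaining tokens. The model also shifts the labels internally, so each token is scored against the logits of the previous position. The pure-Python sketch below illustrates only the masking and averaging step. The per-token probabilities and token ids are made-up values, and `masked_perplexity` is a hypothetical helper, not part of any library.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips, as in the cell above

def masked_perplexity(token_log_probs, labels):
    """exp(mean negative log-likelihood) over positions whose label is kept.

    token_log_probs[i] is the log-probability the model assigned to the token
    at position i; labels[i] is the token id, or IGNORE_INDEX to skip it.
    """
    kept = [lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    if not kept:
        raise ValueError("every position is masked")
    return math.exp(-sum(kept) / len(kept))

# Made-up probabilities: 3 prompt tokens followed by 2 response tokens.
log_probs = [math.log(p) for p in (0.01, 0.02, 0.05, 0.9, 0.8)]
labels = [11, 12, 13, 14, 15]

prompt_len = 3
masked = [IGNORE_INDEX] * prompt_len + labels[prompt_len:]

print(masked_perplexity(log_probs, labels))  # whole sequence
print(masked_perplexity(log_probs, masked))  # response only
```

Masking the prompt keeps the unlikely prompt tokens from dominating the average, which is why the response-only value is much lower in this toy case.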

In [6]:
results = pd.DataFrame({"Prompt": prompts, "Predictions": predictions, "Source": sources, "Perplexity": perplexities})
display(results)

| | Prompt | Predictions | Source | Perplexity |
|---|---|---|---|---|
| 0 | ¿Qué significa DNA? | El ADN es una molécula de biología molecular q... | ds | 1.629931 |
| 1 | ¿Cuál es la capital de Francia? | La capital de Francia es París. | ds | 1.069187 |
| 2 | Encuentra la capital de España. | La capital de España es Madrid. | ds | 1.084588 |
| 3 | ¿Cuáles son los tres colores primarios? | Los tres colores primarios son el rojo, el azu... | ds | 1.113913 |
| 4 | Genera una lista de 5 adjetivos que describan ... | Valiente, corajeoso, audaz, heroico, valiente. | ds | 1.83929 |
| 5 | What is DNA? Answer in english. | DNA es una molécula que contiene la informació... | other | 1.668605 |
| 6 | What is the capital of France? | París es la capital de Francia. | other | 1.290016 |
| 7 | What is the capital of Spain? | Madrid es la capital de España. | other | 1.177768 |
| 8 | What are the three primary colors? | Los tres colores primarios son el rojo, el azu... | other | 1.153605 |
| 9 | Generate a list of 5 adjectives that describe ... | Valiente, coraje, audaz, valiente, valiente. | other | 1.796446 |


## 2. Using TorchMetrics to compute perplexity
[Documentation](https://torchmetrics.readthedocs.io/en/stable/text/perplexity.html#)

In [7]:
import torch
from torchmetrics.text import Perplexity

predictions = []
perplexities = []
temp = 1

for i, p in enumerate(prompts):
  prompt = "### Instruction:\n"+ p +"\n\n### Response:\n"
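  model_inputs = tokenizer(prompt, return_tensors="pt").to(device)  # Keep inputs on the same device as the model

  output = model.generate(**model_inputs, temperature=temp, do_sample=False, output_scores=True, return_dict_in_generate=True)

  labels = output.sequences[:, model_inputs.input_ids.shape[1]:]  # Generated tokens only
  logits = torch.stack(output.scores, dim=1)  # (batch, new_tokens, vocab_size)

  perp = Perplexity(ignore_index=-100).to(device)  # Metric state on the same device as the logits
  ppl = perp(logits, labels)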

  predictions.append(tokenizer.decode(labels[0], skip_special_tokens=True))
  perplexities.append(ppl.item())
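
As far as I understand it, the metric used above reduces to a softmax over each step's scores, the negative log-probability of the target token, and the exponential of the mean. The sketch below spells that out in pure Python on made-up logits. `log_softmax` and `perplexity_from_logits` are hypothetical helpers written for this illustration, not the TorchMetrics implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def perplexity_from_logits(step_logits, targets):
    """exp of the mean cross-entropy between each step's logits and its target id."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(step_logits, targets)]
    return math.exp(sum(nll) / len(nll))

# Made-up scores for 3 generation steps over a 4-token vocabulary.
step_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 3.0, 0.2, 0.1],
    [1.0, 1.0, 1.0, 1.0],
]
# Greedy decoding picks the argmax at each step (first index on ties).
greedy_targets = [max(range(len(l)), key=l.__getitem__) for l in step_logits]
print(greedy_targets, perplexity_from_logits(step_logits, greedy_targets))
```

A useful sanity check is that uniform logits over a vocabulary of size V give a perplexity of exactly V for that step, so values close to 1 mean the model was nearly certain of each generated token.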



In [8]:
results = pd.DataFrame({"Prompt": prompts, "Predictions": predictions, "Source": sources, "Perplexity": perplexities})
display(results)

| | Prompt | Predictions | Source | Perplexity |
|---|---|---|---|---|
| 0 | ¿Qué significa DNA? | El ADN es una molécula de biología molecular q... | ds | 1.704543 |
| 1 | ¿Cuál es la capital de Francia? | La capital de Francia es París. | ds | 1.064529 |
| 2 | Encuentra la capital de España. | La capital de España es Madrid. | ds | 1.075414 |
| 3 | ¿Cuáles son los tres colores primarios? | Los tres colores primarios son el rojo, el azu... | ds | 1.119825 |
| 4 | Genera una lista de 5 adjetivos que describan ... | Valiente, corajeoso, audaz, heroico, valiente. | ds | 1.775416 |
| 5 | What is DNA? Answer in english. | DNA es una molécula que contiene la informació... | other | 1.934921 |
| 6 | What is the capital of France? | París es la capital de Francia. | other | 1.256139 |
| 7 | What is the capital of Spain? | Madrid es la capital de España. | other | 1.162318 |
| 8 | What are the three primary colors? | Los tres colores primarios son el rojo, el azu... | other | 1.152615 |
| 9 | Generate a list of 5 adjectives that describe ... | Valiente, coraje, audaz, valiente, valiente. | other | 1.732764 |


## 3. Using model callback to compute perplexity

In [9]:
def generate_seq(input_ids, eos_token_id, max_len=20, temperature=1):
  inputs = input_ids
  sequences_stack = []
  logit_stack = []

  count = 0
  while True:
    inputs = inputs.to(device)
    with torch.no_grad():
      output = model(inputs).logits

    # Get predicted token: logits > softmax > argmax
    logit_stack.append(output[0, -1, :])
    probs = F.softmax(output[0, -1, :] / temperature, dim=-1)
    sequences_stack.append(torch.argmax(probs, dim=-1))
    # Add output to the next input sequence, prompt model autoregressively
    inputs = torch.cat((inputs, sequences_stack[-1].reshape(1, 1)), dim=1)

    if sequences_stack[-1].item() == eos_token_id or count > max_len:
      # Stop once the eos token is generated or the length limit is exceeded
      break

    count += 1

  logits = torch.stack(logit_stack, dim=0)
  logits = logits.reshape(1, logits.shape[0], logits.shape[1])  # Add a batch dimension: (1, seq_len, vocab_size)
  sequences = torch.stack(sequences_stack, dim=-1)
  sequences = sequences.reshape(1, sequences.shape[0])

  return sequences, logits

predictions = []
perplexities = []
temp = 1

for i, p in enumerate(prompts):
  prompt = "### Instruction:\n"+ p +"\n\n### Response:\n"
  model_inputs = tokenizer(prompt, return_tensors="pt").to(device)

  output_sequences, output_logits = generate_seq(model_inputs.input_ids, tokenizer.eos_token_id, temperature=temp)

  loss = F.cross_entropy(input=output_logits[0] / temp, target=output_sequences[0])
  ppl = torch.exp(loss)

  predictions.append(tokenizer.decode(output_sequences[0], skip_special_tokens=True))
  perplexities.append(ppl.item())

  # logits = torch.nn.functional.log_softmax(logits, dim=-1) # to get log probabilities from score
  # transition_scores = model.compute_transition_scores(output.sequences, output.scores, normalize_logits=True)
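
The loop in `generate_seq` is plain greedy decoding: score the next token, take the argmax, append it, and repeat until the end-of-sequence token appears or a length limit is hit. The sketch below shows the same control flow on a toy stand-in for the model. `greedy_decode` and `toy_next_logits` are hypothetical names for this illustration. It uses `for _ in range(max_new_tokens)` so the budget is exact, whereas the `count > max_len` check above, as written, lets the loop emit up to `max_len + 2` tokens.

```python
def greedy_decode(next_logits, prompt_ids, eos_id, max_new_tokens=20):
    """Append the argmax token until eos_id is produced or the budget runs out."""
    ids = list(prompt_ids)
    new_ids, logit_rows = [], []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)  # scores for the next token only
        token = max(range(len(logits)), key=logits.__getitem__)
        new_ids.append(token)
        logit_rows.append(logits)
        ids.append(token)
        if token == eos_id:
            break
    return new_ids, logit_rows

# Toy stand-in for the model: favours (last_id + 1) mod 5.
def toy_next_logits(ids):
    favoured = (ids[-1] + 1) % 5
    return [5.0 if i == favoured else 0.0 for i in range(5)]

tokens, rows = greedy_decode(toy_next_logits, prompt_ids=[0], eos_id=4)
print(tokens)  # [1, 2, 3, 4]
```

Returning the per-step logits alongside the tokens is what allows the cross-entropy, and therefore the perplexity, to be computed afterwards without a second pass.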

In [10]:
results = pd.DataFrame({"Prompt": prompts, "Predictions": predictions, "Source": sources, "Perplexity": perplexities})
display(results)

| | Prompt | Predictions | Source | Perplexity |
|---|---|---|---|---|
| 0 | ¿Qué significa DNA? | El ADN es una molécula de biología molecular q... | ds | 1.735896 |
| 1 | ¿Cuál es la capital de Francia? | La capital de Francia es París. | ds | 1.064531 |
| 2 | Encuentra la capital de España. | La capital de España es Madrid. | ds | 1.075364 |
| 3 | ¿Cuáles son los tres colores primarios? | Los tres colores primarios son el rojo, el azu... | ds | 1.120004 |
| 4 | Genera una lista de 5 adjetivos que describan ... | Valiente, corajeoso, audaz, heroico, valiente. | ds | 1.776104 |
| 5 | What is DNA? Answer in english. | DNA es una molécula que contiene la informació... | other | 1.805926 |
| 6 | What is the capital of France? | París es la capital de Francia. | other | 1.257107 |
| 7 | What is the capital of Spain? | Madrid es la capital de España. | other | 1.16192 |
| 8 | What are the three primary colors? | Los tres colores primarios son el rojo, el azu... | other | 1.152752 |
| 9 | Generate a list of 5 adjectives that describe ... | Valiente, coraje, audaz, valiente, valiente. | other | 1.732207 |


## Perplexity of the prompts (questions)

In [11]:
perplexities = []

for prompt in prompts:
  # prompt = "### Instruction:\n"+ p +"\n\n### Response:\n"

  ppl = compute_perplexity(prompt, 0)
  perplexities.append(ppl.item())

In [12]:
results = pd.DataFrame({"Prompt": prompts, "Source": sources, "Perplexity": perplexities})
display(results)

| | Prompt | Source | Perplexity |
|---|---|---|---|
| 0 | ¿Qué significa DNA? | ds | 693.11792 |
| 1 | ¿Cuál es la capital de Francia? | ds | 36.610397 |
| 2 | Encuentra la capital de España. | ds | 96.876656 |
| 3 | ¿Cuáles son los tres colores primarios? | ds | 26.673796 |
| 4 | Genera una lista de 5 adjetivos que describan ... | ds | 16.830029 |
| 5 | What is DNA? Answer in english. | other | 356.459351 |
| 6 | What is the capital of France? | other | 133.285233 |
| 7 | What is the capital of Spain? | other | 133.331436 |
| 8 | What are the three primary colors? | other | 234.822952 |
| 9 | Generate a list of 5 adjectives that describe ... | other | 24.460264 |
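
To compare the two prompt groups, one option is to average in log space, because perplexity is itself an exponential of an average. The sketch below computes a per-source geometric mean from the values reported in the table above. It weights every prompt equally, which is an assumption: a corpus-level perplexity would weight by token count, and that is not available here.

```python
import math
from collections import defaultdict

# (source, perplexity) pairs copied from the table above.
rows = [
    ("ds", 693.11792), ("ds", 36.610397), ("ds", 96.876656),
    ("ds", 26.673796), ("ds", 16.830029),
    ("other", 356.459351), ("other", 133.285233), ("other", 133.331436),
    ("other", 234.822952), ("other", 24.460264),
]

log_ppl = defaultdict(list)
for source, ppl in rows:
    log_ppl[source].append(math.log(ppl))

# Geometric mean = exp(mean of log-perplexities), one equal weight per prompt.
geo_mean = {s: math.exp(sum(v) / len(v)) for s, v in log_ppl.items()}
for source, value in geo_mean.items():
    print(f"{source}: {value:.1f}")
```

With only five prompts per group, a single outlier such as the first row still moves the result noticeably, so any comparison here is indicative at best.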


## Issue with logits returned by the generate function
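When sampling is enabled, the scores returned by `generate` have most entries other than the maximum set to `-inf`, so they no longer match the raw logits. Setting `do_sample=False` avoids this: as the comparison below shows, the scores then equal the logits from a direct model call.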

In [13]:
predictions = []
perplexities = []

for i, p in enumerate(prompts):
  prompt = "### Instruction:\n"+ p +"\n\n### Response:\n"
  model_inputs = tokenizer(prompt, return_tensors="pt").to(device)

  output = model(model_inputs.input_ids, model_inputs.attention_mask).logits

  output_gen = model.generate(**model_inputs, temperature=0.0001, do_sample=False, output_scores=True, return_dict_in_generate=True)

  break



In [14]:
# Logits of the first output token, not counting the input
y = output[0, -1, :] # Logits from the model callback
y_gen = output_gen.scores[0][0, :]  # Logits from the generate function

# Decoding
print("Output with model callback (token id): ", y.argmax().item())
print("Output with generate (token id): ", y_gen.argmax().item())
print()

# Issue with logits
print("Logits returned by model callback: ")
print(y)
print("Tensor data type: ", y.type())
print("Max of the tensor: ", y.max())
print()
print("Scores returned by model generate: ")
print(y_gen)
print("Tensor data type: ", y_gen.type())
print("Max of the tensor: ", y_gen.max())

Output with model callback (token id):  6489
Output with generate (token id):  6489

Logits returned by model callback: 
tensor([ 0.1577,  2.0371,  8.9219,  ..., -0.1796, -0.2822,  2.3770],
       device='cuda:0', grad_fn=<SliceBackward0>)
Tensor data type:  torch.cuda.FloatTensor
Max of the tensor:  tensor(23.8750, device='cuda:0', grad_fn=<MaxBackward1>)

Scores returned by model generate: 
tensor([ 0.1577,  2.0371,  8.9219,  ..., -0.1796, -0.2822,  2.3770],
       device='cuda:0')
Tensor data type:  torch.cuda.FloatTensor
Max of the tensor:  tensor(23.8750, device='cuda:0')
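
Scores with entries set to `-inf` are a problem for perplexity because the softmax then puts all its mass on the few surviving tokens. The sketch below illustrates this with a top-k style filter on made-up logits. Top-k filtering as the cause is an assumption here, and `top_k_filter` and `target_prob` are hypothetical helpers, not library functions.

```python
import math

NEG_INF = float("-inf")

def top_k_filter(logits, k):
    """Keep the k largest logits (ties at the threshold included); others become -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else NEG_INF for x in logits]

def target_prob(logits, target):
    """Softmax probability of `target`; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[target] / sum(exps)

raw = [0.2, 2.0, 8.9, 9.5]  # made-up logits for a 4-token vocabulary
best = raw.index(max(raw))

p_raw = target_prob(raw, best)
p_top1 = target_prob(top_k_filter(raw, 1), best)
print(p_raw, p_top1)
```

With only the maximum kept, the chosen token gets probability 1, its negative log-likelihood is 0, and the perplexity collapses to 1 regardless of how uncertain the model really was.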


## Using the `evaluate` package to compute perplexity
**Issues:**
- It loads the model again, even after setting the environment variable to the Drive folder where the model is cached.
- The process is killed after running out of RAM.

[Implementation](https://huggingface.co/spaces/evaluate-measurement/perplexity/blob/ac4135177bfee71b1efd7bd3aff62e456e30aef9/perplexity.py)

In [15]:
# !pip install -q -q evaluate

In [16]:
# import evaluate
# import os

# os.environ['TRANSFORMERS_CACHE'] = '/content/drive/MyDrive/Colab Notebooks/LLMs/cache'
# perplexity = evaluate.load("perplexity", module_type="metric")

# perplexities = []
# for prediction in predictions:
#   ppl = perplexity.compute(model_id=model_name, add_start_token=False, predictions=prediction)  # batch_size=1
#   print("OUTPUT TO THE FUNCTION: ", ppl)
#   perplexities.append(ppl)

# print(perplexities)