Metrics are needed to evaluate the performance of LLMs, similar to classic machine learning models. The evaluate library helps to understand the complexities of LLMs.

In [27]:
import torch
import evaluate
import transformers
from transformers import pipeline

In [32]:
%pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=51e6e9dcd34bd6ef8e001592481e6d5892895223a40cff2ae7e6df3c691a4a4a
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [6]:
# Load the metrics
accuracy = evaluate.load('accuracy')
precision = evaluate.load('precision')
recall = evaluate.load('recall')
f1 = evaluate.load('f1')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

In [8]:
# Obtain a description of each metric
print(accuracy.description)
print(precision.description)
print(recall.description)
print(f1.description)


Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative


Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation:
Precision = TP / (TP + FP)
where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).


Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:
Recall = TP / (TP + FN)
Where TP is the true positives and FN is the false negatives.


The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
F1 = 2 * (precision * recall) / (precision + recall)



In [7]:
# See the required data types
print(f"The required data types for accuracy are: {accuracy.features}.")
print(f"The required data types for precision are: {precision.features}.")
print(f"The required data types for recall are: {recall.features}.")
print(f"The required data types for f1 are: {f1.features}.")

The required data types for accuracy are: {'predictions': Value('int32'), 'references': Value('int32')}.
The required data types for precision are: {'predictions': Value('int32'), 'references': Value('int32')}.
The required data types for recall are: {'predictions': Value('int32'), 'references': Value('int32')}.
The required data types for f1 are: {'predictions': Value('int32'), 'references': Value('int32')}.


In [12]:
from transformers.modeling_outputs import SequenceClassifierOutput
from torch import tensor

In [16]:
from transformers import BertTokenizer, BertForSequenceClassification

# Load pretrained tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [19]:
validate_labels = [0, 0, 1, 1, 1]
validate_text = ['This product works great, thank you! Just as described.',
 "I'm really impressed with the quality.",
 "I've only had the product for two days and it's already broken.",
 'Where is my item? The delivery is delayed. This is frustrating.',
 "I've reached out multiple times but no one is answering me."]

outputs = SequenceClassifierOutput(loss=None, logits=tensor([[0.3170, 0.8129],
        [0.2573, 0.7800],
        [0.3672, 0.7954],
        [0.4176, 0.8140],
        [0.4592, 0.7013]]), hidden_states=None, attentions=None)

In [20]:
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# Extract the new predictions
predicted_labels = torch.argmax(outputs.logits, dim=1).tolist()

# Compute the metrics by comparing real and predicted labels
print(accuracy.compute(references=validate_labels, predictions=predicted_labels))
print(precision.compute(references=validate_labels, predictions=predicted_labels))
print(recall.compute(references=validate_labels, predictions=predicted_labels))
print(f1.compute(references=validate_labels, predictions=predicted_labels))

{'accuracy': 0.6}
{'precision': 0.6}
{'recall': 1.0}
{'f1': 0.75}


These metrics are a start but there is room for improvement, such as adding more data to the fine-tuning process.

Lets generate text and evaluate the perplexity score.

In [22]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [23]:
input_text= 'Current trends show that by 2030'

In [24]:
# Encode the input text, generate and decode it
input_text_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_text_ids, max_length=20)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Generated Text: ", generated_text)

# Load and compute the perplexity score
perplexity = evaluate.load("perplexity", module_type="metric")
results = perplexity.compute(model_id="gpt2", predictions=generated_text)
print("Perplexity: ", results['mean_perplexity'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Text:  Current trends show that by 2030, the number of people living in poverty will be at its lowest level


Downloading builder script: 0.00B [00:00, ?B/s]

  0%|          | 0/7 [00:00<?, ?it/s]

Perplexity:  3441.671438598633


BLEU translations

In [25]:
input_sentence_1 = "Hola, ¿cómo estás?"

reference_1 = [
     ["Hello, how are you?", "Hi, how are you?"]
     ]

input_sentences_2 = ["Hola, ¿cómo estás?", "Estoy genial, gracias."]

references_2 = [
     ["Hello, how are you?", "Hi, how are you?"],
     ["I'm great, thanks.", "I'm great, thank you."]
     ]

In [29]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

# Translate the first input sentence then calucate the BLEU metric for translation quality
translated_output = translator(input_sentence_1)

translated_sentence = translated_output[0]['translation_text']

print("Translated:", translated_sentence)

# Load the bleu metric
bleu = evaluate.load("bleu")

results = bleu.compute(predictions=[translated_sentence], references=reference_1)
print(results)

Device set to use cpu


Translated: Hey, how are you?


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

{'bleu': 0.7598356856515925, 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 6, 'reference_length': 6}


In [30]:
# Translate the input sentences, extract the translated text, and compute BLEU score
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

translated_outputs = translator(input_sentences_2)

predictions = [translated_output['translation_text'] for translated_output in translated_outputs]
print(predictions)

results = bleu.compute(predictions=predictions, references = references_2)
print(results)

Device set to use cpu


['Hey, how are you?', "I'm great, thanks."]
{'bleu': 0.8627788640890415, 'precisions': [0.9090909090909091, 0.8888888888888888, 0.8571428571428571, 0.8], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 11, 'reference_length': 11}


The BLEU score is an easy way to compare the similarity between machine-generated translations and reference translations, but bear in mind it heavily relies on the quality and correctness of such reference translations.

Evaluating with ROUGE

ROUGE is commonly used to evaluate summarization tasks as it checks for similarities between predictions and references.

In [33]:
# Load the rouge metric
rouge = evaluate.load('rouge')

predictions = ["""Pluto is a dwarf planet in our solar system, located in the Kuiper Belt beyond Neptune, and was formerly considered the ninth planet until its reclassification in 2006."""]
references = ["""Pluto is a dwarf planet in the solar system, located in the Kuiper Belt beyond Neptune, and was previously deemed as a planet until it was reclassified in 2006."""]

# Calculate the rouge scores between the predicted and reference summaries
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE results: ", results)

ROUGE results:  {'rouge1': np.float64(0.7719298245614034), 'rouge2': np.float64(0.6181818181818182), 'rougeL': np.float64(0.736842105263158), 'rougeLsum': np.float64(0.736842105263158)}


It looks like the model did a pretty decent job here

Evaluating with METEOR

METEOR excels at evaluating some of the more semantic features in text. It works similar to ROUGE by comparing a model-generated output to a reference output.

In [34]:
meteor = evaluate.load("meteor")

generated = ["The burrow stretched forward like a narrow corridor for a while, then plunged abruptly downward, so quickly that Alice had no chance to stop herself before she was tumbling into an extremely deep shaft."]
reference = ["The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well."]

# Compute and print the METEOR score
results = meteor.compute(predictions = generated, references = reference)
print("Meteor: ", results)

Downloading builder script: 0.00B [00:00, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Meteor:  {'meteor': np.float64(0.37180012567275916)}


The model may have gotten too creative here and lost some of the semantic features. In this case, you'd consider going back to your training or fine-tuning to see what could be improved.

Evaluating with EM

Exact Match helps us evaluate models when it comes to extractive question and answering but looking for exact matches

In [35]:
# Load the metric
exact_match = evaluate.load('exact_match')

predictions = ["It's a wonderful day", "I love dogs", "DataCamp has great AI courses", "Sunshine and flowers"]
references = ["What a wonderful day", "I love cats", "DataCamp has great AI courses", "Sunsets and flowers"]

# Compute the exact match and print the results
results = exact_match.compute(predictions=predictions, references=references)
print("EM results: ", results)

Downloading builder script: 0.00B [00:00, ?B/s]

EM results:  {'exact_match': np.float64(0.25)}


Checking toxicity

In [36]:
user_1 = ['Everyone that tried it love it', 'This artist is a true genius, pure talent']

user_2 = ["Nobody i've talked to likes this product", 'Terrible singer']
toxicity_metric = evaluate.load('toxicity', module_type='measurement')

Downloading builder script: 0.00B [00:00, ?B/s]



config.json:   0%|          | 0.00/816 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Device set to use cpu


In [37]:
# Calculate the individual toxicities
toxicity_1 = toxicity_metric.compute(predictions=user_1)
toxicity_2 = toxicity_metric.compute(predictions=user_2)
print("Toxicities (user_1):", toxicity_1['toxicity'])
print("Toxicities (user_2): ", toxicity_2['toxicity'])

# Calculate the maximum toxicities
toxicity_1_max = toxicity_metric.compute(predictions = user_1, aggregation="maximum")
toxicity_2_max = toxicity_metric.compute(predictions = user_2, aggregation="maximum")
print("Maximum toxicity (user_1):", toxicity_1_max['max_toxicity'])
print("Maximum toxicity (user_2): ", toxicity_2_max['max_toxicity'])

# Calculate the toxicity ratios
toxicity_1_ratio = toxicity_metric.compute(predictions=user_1, aggregation='ratio')
toxicity_2_ratio = toxicity_metric.compute(predictions=user_2, aggregation='ratio')
print("Toxicity ratio (user_1):", toxicity_1_ratio['toxicity_ratio'])
print("Toxicity ratio (user_2): ", toxicity_2_ratio['toxicity_ratio'])

Toxicities (user_1): [0.00013486333773471415, 0.00013348401989787817]
Toxicities (user_2):  [0.00013559251965489239, 0.0001377112348563969]
Maximum toxicity (user_1): 0.00013486333773471415
Maximum toxicity (user_2):  0.0001377112348563969
Toxicity ratio (user_1): 0.0
Toxicity ratio (user_2):  0.0


It is important to be critical of your metric results to add more context as to which score is a better analytics approach

Evaluating regard

The regard metric to assess polarity in two lists of sample texts associated to two groups of product reviewers with different domains, perspectives, and demographics.

In [38]:
group1 = ['abc are described as loyal',
          'abc are honest but kind']
group2 = ['abc are known for being confrontational',
          'abc are very blunt']


In [39]:
# Load the regard and regard-comparison metrics
regard = evaluate.load('regard')
regard_comp = evaluate.load("regard", "compare")

# Compute the regard (polarities) of each group separately
polarity_results_1 = regard.compute(data=group1)
print("Polarity in group 1:\n", polarity_results_1)
polarity_results_2 = regard.compute(data=group2)
print("Polarity in group 2:\n", polarity_results_2)

# Compute the relative regard between the two groups for comparison
polarity_results_comp = regard_comp.compute(data=group1, references=group2)
print("Polarity comparison between groups:\n", polarity_results_comp)

Downloading builder script: 0.00B [00:00, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/681 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Device set to use cpu
Device set to use cpu


model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

Polarity in group 1:
 {'regard': [[{'label': 'neutral', 'score': 0.958617091178894}, {'label': 'negative', 'score': 0.020242031663656235}, {'label': 'positive', 'score': 0.01440910529345274}, {'label': 'other', 'score': 0.006731771863996983}], [{'label': 'positive', 'score': 0.835424542427063}, {'label': 'other', 'score': 0.1241120994091034}, {'label': 'neutral', 'score': 0.030531246215105057}, {'label': 'negative', 'score': 0.009932072833180428}]]}
Polarity in group 2:
 {'regard': [[{'label': 'negative', 'score': 0.9745951890945435}, {'label': 'other', 'score': 0.01715262047946453}, {'label': 'neutral', 'score': 0.007746343966573477}, {'label': 'positive', 'score': 0.0005058045499026775}], [{'label': 'neutral', 'score': 0.7666085362434387}, {'label': 'negative', 'score': 0.10047462582588196}, {'label': 'positive', 'score': 0.0714685320854187}, {'label': 'other', 'score': 0.061448290944099426}]]}
Polarity comparison between groups:
 {'regard_difference': {'neutral': 0.10739672859199345

The first group's text is perceived much more positively (or neutral) than the second group's.