# 5. Choosing, Loading and Using Metrics for Automatic Evaluation

Welcome to the 5th session of the LLM tutorial!

Congratulations! Through the previous sessions, you've learned how to use LLMs to generate text in various creative and effective ways (e.g., using prompts, RAG, etc.).

In the final two sessions, we will focus on the evaluation of LLMs, exploring how to evaluate models efficiently and effectively.

The goal of **this session** is to learn how to use common automatic metrics to assess the performance of LLMs.

## 5.1 Introduction to Automatic Evaluation Metrics

**Automatic metrics** are automated standards for evaluating the performance of NLP models **without requiring human intervention** during the evaluation process.

These metrics can be categorized along several dimensions:
- **Reference-based** vs. **Reference-free**:
    - **Reference-based metrics** (e.g., BLEU, TER) compare the generated text against reference texts to assess similarity.
    - **Reference-free metrics** evaluate the generated text directly, without requiring reference texts. These can focus on:
        - **Fluency and linguistic quality** (e.g., Perplexity), which measures how well a language model predicts the generated text; lower values mean the text is closer to the statistical patterns the model learned from its training data.
        - **Input alignment** (e.g., entailment detection or faithfulness metrics), which directly evaluates the relationship between the input and the output without comparing to reference texts.
- **Design structure (Pattern-matched vs. Semantics-based)**:
    - **Pattern-matched metrics** assess surface-level overlap between generated and reference texts, such as n-gram matches (e.g., BLEU, ROUGE).
    - **Semantics-based metrics** evaluate the semantic similarity of the texts, focusing on meaning rather than exact word matching (e.g., BERTScore).
- **Task-specific metrics**:
    - These are tailored for specific NLP tasks, such as BLEU for machine translation, ROUGE for text summarization, and Accuracy for classification/retrieval problems.
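
To make the perplexity item in the list above more concrete, here is a minimal sketch of the usual definition: the exponential of the average negative log-probability per token. The function name `perplexity` and the per-token probabilities are illustrative assumptions, not outputs of a real model; in practice the probabilities would come from a language model scoring the text.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token)."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities assigned by some language model
fluent_text = [0.5, 0.5, 0.5, 0.5]      # every token is fairly predictable
awkward_text = [0.5, 0.01, 0.5, 0.01]   # two tokens are very surprising

print(f"Fluent:  {perplexity(fluent_text):.2f}")
print(f"Awkward: {perplexity(awkward_text):.2f}")
```

The text with surprising tokens gets the higher perplexity, which is why lower values are read as "more fluent" under the scoring model.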

It's important to emphasize that **there is no perfect metric**—each has its limitations. Therefore, it is crucial to select metrics based on the specific requirements and goals of your task.
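
One such limitation can be illustrated with a small sketch of clipped n-gram precision, the building block behind pattern-matched metrics. This is a simplified stand-in and not BLEU itself (there is no brevity penalty and no combination of several n-gram orders); the helper names and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference (counts are clipped)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "the house no longer belongs to us"
exact_copy = "the house no longer belongs to us"
paraphrase = "we do not own the home anymore"

print(f"Exact copy: {ngram_precision(exact_copy, reference):.2f}")
print(f"Paraphrase: {ngram_precision(paraphrase, reference):.2f}")
```

The paraphrase keeps the meaning but shares almost no words with the reference, so a surface-overlap score penalizes it heavily. Semantics-based metrics are designed to be less sensitive to this kind of rewording.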

In this session, we will explore how to use different metrics from the Hugging Face Evaluate library to evaluate a machine translation model (EN-FR).

**Note.** Before you start running the code, please enable a GPU in Colab: Runtime -> Change runtime type -> T4 ([How to activate GPU in Colab?](https://saturncloud.io/blog/how-to-activate-gpu-computing-in-google-colab/))

## 5.2 Environment Setup

In [79]:
! pip install -q datasets
! pip install -q transformers
! pip install -q evaluate
! pip install -q tqdm

## 5.3 Load an EN-FR Dataset
We will use a dataset with translations from English to French. You can preview the data [here](https://huggingface.co/datasets/Helsinki-NLP/opus_books/viewer/en-fr).

In [80]:
from datasets import load_dataset

# Load part of the dataset to speed up the evaluation (optional)
data = load_dataset("Helsinki-NLP/opus_books", "en-fr", split='train[:100]') # 100 items of the dataset

# Show data structure
print(data)

# Print the first instance of data
print(data[0])

Dataset({
    features: ['id', 'translation'],
    num_rows: 100
})
{'id': '0', 'translation': {'en': 'The Wanderer', 'fr': 'Le grand Meaulnes'}}


In [81]:
print("Shape of the data: ", data.shape)

Shape of the data:  (100, 2)


## 5.4 Load an LLM for Translation

In [82]:
from transformers import AutoTokenizer
from transformers import pipeline
import torch
# Check if GPU is available
print("GPU Available:", torch.cuda.is_available())

# Set up model name (Feel free to try different models)
model_name = "google-t5/t5-small"

# Define translator
if torch.cuda.is_available():
    translator = pipeline("translation_en_to_fr", model=model_name, device=0)
else:
    translator = pipeline("translation_en_to_fr", model=model_name)

GPU Available: False


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Here is an example that translates a few sentences in one batch.

In [83]:
# Example list
texts = ["Good morning", "How are you?", "This is a test for batch translation."]

# Perform batch translation
translations = translator(texts)

# Print translations
for i, translation in enumerate(translations):
    print(f"Original: {texts[i]}")
    print(f"Translated: {translation['translation_text']}\n")

Original: Good morning
Translated: Bonjour

Original: How are you?
Translated: Comment êtes-vous?

Original: This is a test for batch translation.
Translated: Il s'agit d'un test pour la traduction par lots.



In [84]:
# Create a French translation for the English data
from tqdm import tqdm
def translation_en_to_fr(data, translator):
    # Collect the English source texts and the French reference translations into lists
    text_en_list = [i['translation']['en'] for i in data]
    ref_fr_list = [i['translation']['fr'] for i in data]

    # Process in batches with tqdm for progress
    batch_size = 16
    french_translation = []
    for i in tqdm(range(0, len(text_en_list), batch_size)):
        batch_texts = text_en_list[i:i+batch_size]
        batch_translations = translator(batch_texts)
        french_translation.extend(batch_translations)

    return text_en_list, ref_fr_list, french_translation

In [85]:
text_en, ref_fr, predict_fr = translation_en_to_fr(data, translator)

100%|██████████| 7/7 [02:15<00:00, 19.35s/it]


In [86]:
print("Original English Text: ", text_en[0])
print("Original French Text: ", ref_fr[0])

print("Predicted French Text: ", predict_fr[0])

Original English Text:  The Wanderer
Original French Text:  Le grand Meaulnes
Predicted French Text:  {'translation_text': 'Le Wanderer'}


In [87]:
for i in range(6, 10):
    print(f"English: {text_en[i]}")
    print(f"Reference: {ref_fr[i]}")
    print(f"Predicted: {predict_fr[i]['translation_text']}\n")

English: I still say 'our home,' although the house no longer belongs to us.
Reference: Je continue à dire « chez nous », bien que la maison ne nous appartienne plus.
Predicted: Je dirais encore « notre maison », même si la maison ne nous appartient plus.

English: We left that part of the country nearly fifteen years ago and shall certainly never go back to it.
Reference: Nous avons quitté le pays depuis bientôt quinze ans et nous n’y reviendrons certainement jamais.
Predicted: Nous avons quitté cette partie du pays il y a près de quinze ans et ne reviendrons certainement jamais à celle-ci.

English: We were living in the building of the Higher Elementary Classes at Sainte-Agathe's School.
Reference: Nous habitions les bâtiments du Cours Supérieur de Sainte-Agathe.
Predicted: Nous vivions dans l'édifice des classes élémentaires supérieures de l'école Sainte-Agathe's.

English: My father, whom I used to call M. Seurel as did other pupils, was head of the Middle School and also of the H

## 5.5 Evaluate

### 5.5.1 SacreBLEU

In [88]:
! pip install -q sacrebleu

import evaluate

metric = evaluate.load("sacrebleu")


In [89]:
def postprocess_text(preds, refs):
    # Extract the translated strings from the pipeline output and strip surrounding whitespace
    preds = [pred['translation_text'].strip() for pred in preds]
    refs = [ref.strip() for ref in refs]

    return preds, refs

In [90]:
def compute_bleu_metrics(predict_fr, ref_fr):
    predict_fr, ref_fr = postprocess_text(predict_fr, ref_fr)

    print("Predicted: ", predict_fr[:5])
    print("Reference: ", ref_fr[:5])

    result = metric.compute(predictions=predict_fr, references=ref_fr)
    result = {"sacrebleu": result["score"]}

    return result

In [91]:
compute_bleu_metrics(predict_fr, ref_fr)

Predicted:  ['Le Wanderer', 'Alain-Fournier', 'Première partie', 'I', 'LE CONSEIL']
Reference:  ['Le grand Meaulnes', 'Alain-Fournier', 'PREMIÈRE PARTIE', 'CHAPITRE PREMIER', 'LE PENSIONNAIRE']


{'sacrebleu': 15.840510492761013}

### 5.5.2 METEOR

In [92]:
def compute_meteor_metrics(predict_fr, ref_fr):
    predict_fr, ref_fr = postprocess_text(predict_fr, ref_fr)

    metric = evaluate.load("meteor")
    result = metric.compute(predictions=predict_fr, references=ref_fr)

    return result

In [93]:
compute_meteor_metrics(predict_fr, ref_fr)

[nltk_data] Downloading package wordnet to /Users/tunji/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/tunji/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/tunji/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


{'meteor': 0.3671263141061779}

### 5.5.3 BERTScore

In [94]:
! pip install -q bert_score
def compute_bertscore_metrics(predict_fr, ref_fr):
    predict_fr, ref_fr = postprocess_text(predict_fr, ref_fr)
    metric = evaluate.load("bertscore")
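    # Note: distilbert-base-uncased is an English, lowercased model; for French text,
    # a multilingual checkpoint (or passing lang="fr" instead of model_type) is likely a better fit.
    result = metric.compute(predictions=predict_fr, references=ref_fr, model_type="distilbert-base-uncased")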

    return sum(result['precision'])/len(result['precision']), sum(result['recall'])/len(result['recall']), sum(result['f1'])/len(result['f1'])


In [95]:
precision, recall, f1 = compute_bertscore_metrics(predict_fr, ref_fr)
print(f"Precision: {precision}, Recall: {recall}, F1: {f1}")

Precision: 0.8877962577342987, Recall: 0.8896603620052338, F1: 0.8883174216747284


## 5.6 Exercise

In the previous sections, we demonstrated how to evaluate translation models using the Evaluate library. However, for real-world projects, **I strongly recommend consulting the official documentation or GitHub repositories of the metrics you intend to use.** The metrics integrated into the Evaluate library often face update lags—for instance, a metric might already be at version 1.5, while Evaluate still deploys version 1.1. Additionally, the Evaluate library supports only a limited selection of commonly used metrics.

In the next exercise, we encourage you to explore **reference-less** (reference-free) metrics to evaluate the translation model on the same EN-FR dataset. This means assessing translation quality without relying on the French reference translations: instead, you compare the model's French outputs directly with the original English inputs. You may need to explore libraries and resources beyond the Evaluate library to achieve this.

Below are some suggested reference-less metrics you can explore (you are also welcome to choose others):

- Prism: https://github.com/thompsonb/prism
- NMTScore: https://github.com/ZurichNLP/nmtscore/tree/master
- CometKiwi: https://github.com/Unbabel/COMET
- SBERT (Multilingual embedding model + cosine similarity): https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#multilingual-models

Feel free to explore, experiment, and apply these or other suitable reference-less metrics to evaluate your translation models!


### Group A

- Oyetunji ABIOYE
- Mehsen AZIZI
- Mohammad AL TAKACH
- Hawawou  Oumarou Tchapchet

In [96]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

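# Note: this checkpoint was trained on English question-answer pairs for semantic search;
# a multilingual model (see the SBERT link above) is likely better suited to comparing English and French.
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")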


In [97]:

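# Encode the English sources and the predicted French translations
en_embedding = model.encode(text_en)
# predict_fr is a list of dicts, so pass the translated strings explicitly
fr_embedding = model.encode([pred['translation_text'] for pred in predict_fr])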

# Compute cosine similarity
cosine_similarity = cos_sim(en_embedding, fr_embedding)


In [98]:
print(cosine_similarity.shape)

torch.Size([100, 100])


In [99]:
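# Compute the average cosine similarity over the full 100 x 100 matrix.
# Note: this includes mismatched pairs (source i vs. translation j), not only the matched pairs on the diagonal.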
avg_cosine_similarity = cosine_similarity.mean()
print(f"Average Cosine Similarity: {avg_cosine_similarity}")

Average Cosine Similarity: 0.2794296443462372
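
The average above is taken over every source/translation combination, so most of the terms compare a sentence with an unrelated translation. For a reference-less quality score, the matched pairs on the diagonal are usually what matters. The sketch below shows the difference on toy 2-D vectors; the vectors and variable names are made-up assumptions for illustration, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values): index i in both lists belongs to the same sentence pair
src_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hyp_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]]

# Full similarity matrix: entry [i][j] compares source i with translation j
matrix = [[cosine(s, h) for h in hyp_vecs] for s in src_vecs]

n = len(matrix)
mean_all_pairs = sum(sum(row) for row in matrix) / (n * n)   # what .mean() on the matrix computes
mean_matched = sum(matrix[i][i] for i in range(n)) / n       # diagonal only: source i vs. its own translation

print(f"Mean over all pairs:     {mean_all_pairs:.3f}")
print(f"Mean over matched pairs: {mean_matched:.3f}")
```

With the tensor from the cells above, `cosine_similarity.diagonal().mean()` should give the matched-pair average. That has not been run here, so no value is reported.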
