In [None]:
pip install sacrebleu

In [None]:
pip install rouge-score

In [None]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

# Evaluation Methods and Metrics for LLMs

Evaluation metrics are essential for measuring how well language models perform various tasks, such as text generation, classification, and translation. We will cover some of the most commonly used metrics: accuracy, precision, recall, F1-score, BLEU, ROUGE, and perplexity. You need to choose the metric that is best suited to your use case.

## Accuracy, Precision, Recall & F1-Score
All of these metrics are typically used for classification tasks.  

* **Accuracy** measures how many predictions made by the model are correct out of the total predictions. $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

* **Precision** measures how many of the predicted positive labels were actually correct. $$\text{Precision} = \frac{TP}{TP + FP}$$
* **Recall** measures how many of the actual positive labels were correctly predicted. $$\text{Recall} = \frac{TP}{TP + FN}$$

* **F1-score** is the harmonic mean of precision and recall, providing a single measure that balances both. $$\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
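The four formulas can be checked by hand on a tiny example. The following is a minimal sketch in plain Python, assuming binary labels with `1` as the positive class; the function name `classification_scores` and the toy lists are illustrative and not part of any library.

```python
def classification_scores(preds, labels, positive=1):
    """Compute accuracy, precision, recall and F1 from raw label lists."""
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    tn = sum(p != positive and y != positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: TP = 2, TN = 1, FP = 0, FN = 1
scores = classification_scores([0, 1, 0, 1], [0, 1, 1, 1])
print(scores)
```

Here the model never predicts a false positive, so precision is 1.0, but it misses one of the three actual positives, so recall is 2/3 and the F1-score lands at 0.8.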


In [None]:
from datasets import load_dataset, load_metric

# Load a classification dataset
dataset = load_dataset("imdb", split="test")

In [None]:
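# Load the classification metrics (load_metric is deprecated in newer datasets releases in favour of the evaluate library)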
accuracy = load_metric("accuracy", trust_remote_code=True)
precision = load_metric("precision", trust_remote_code=True)
recall = load_metric("recall", trust_remote_code=True)
f1 = load_metric("f1", trust_remote_code=True)

In [None]:
def compute_scores(preds, labels):
    print("Accuracy: ",accuracy.compute(predictions=preds, references=labels),
         "\nPrecision: ",precision.compute(predictions=preds, references=labels),
         "\nRecall: ",recall.compute(predictions=preds, references=labels),
         "\nF1-score: ",f1.compute(predictions=preds, references=labels))

In [None]:
# Example predictions (replace with actual predictions from your model)
preds = [0, 1, 0, 1]
labels = [0, 1, 1, 1]
compute_scores(preds, labels)

## BLEU, ROUGE
Measuring performance on a text generation task is not as easy as with standard classification tasks such as sentiment analysis or named entity recognition. Take the example of translation; given a sentence like “I love dogs!” in English and translating it to Spanish there can be multiple valid possibilities, like “¡Me encantan los perros!” or “¡Me gustan los perros!” Simply checking for an exact match to a reference translation is not optimal; even humans would fare badly on such a metric because we all write text slightly differently from each other (and even from ourselves, depending on the time of the day or year!). Fortunately, there are alternatives.

Two of the most common metrics used to evaluate generated text are BLEU and ROUGE. Let’s take a look at what they do.

### BLEU
BLEU is a widely used metric, especially for machine translation. The idea of BLEU is to compare words or n-grams between the generated text and a reference. It is a precision-based metric: when we compare the two texts, we count the number of words (or n-grams) in the generation that also occur in the reference and divide by the length of the generation. Because precision favours short generations, BLEU compensates with a brevity penalty. One limitation of this metric is that it does not take synonyms into account.  
This is the formula of the BLEU score:
$$\text{BLEU-N} = BP \times \left( \prod_{n=1}^{N} p_n \right)^{\frac{1}{N}}$$
Here $p_n$ is the precision for n-grams of order $n$ and $BP$ is the brevity penalty.
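The formula can be evaluated directly once the n-gram precisions and the two lengths are known. The sketch below assumes the standard brevity penalty ($BP = 1$ if the candidate is longer than the reference, $e^{1 - r/c}$ otherwise) and applies no smoothing; the function name `bleu_from_precisions` is illustrative, and the result is on a 0 to 1 scale rather than the 0 to 100 scale that `sacrebleu` reports.

```python
import math


def bleu_from_precisions(precisions, candidate_len, reference_len):
    """Combine n-gram precisions and a brevity penalty into a BLEU score."""
    # Without smoothing, a single zero precision makes the product zero
    if candidate_len == 0 or min(precisions) == 0:
        return 0.0
    # Brevity penalty: only candidates shorter than the reference are penalized
    if candidate_len > reference_len:
        bp = 1.0
    else:
        bp = math.exp(1 - reference_len / candidate_len)
    # Geometric mean of the precisions, computed in log space
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)


# Precisions for "the cat is on mat" against "the cat is on the mat"
score = bleu_from_precisions([5 / 5, 3 / 4, 2 / 3, 1 / 2],
                             candidate_len=5, reference_len=6)
print(round(score, 4))
```

The geometric mean means one poor n-gram order drags the whole score down, which is why unsmoothed BLEU collapses to 0 as soon as any order has no matches.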

The `bleu_metric` object created below is an instance of the `Metric` class, and works like an aggregator: you can add single instances with `add()` or whole batches via `add_batch()`. Once you have added all the samples you need to evaluate, you call `compute()` and the metric is calculated. This returns a dictionary with several values, such as the precision for each n-gram order, the brevity penalty, and the final BLEU score. Let’s look at an example:

In [None]:
bleu_metric = load_metric("sacrebleu") # sacrebleu doesn't expect the text to be tokenized

In [None]:
import numpy as np
import pandas as pd

bleu_metric.add(
    prediction="the the the the the the", reference=["the cat is on the mat"])
results = bleu_metric.compute()
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

Let's see what the less obvious values mean here:

#### Counts:
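The counts represent the number of n-grams (1-gram, 2-gram, etc.) in the prediction that also appear in the reference. These are the "matches" found between the predicted and reference text. Each match is clipped: an n-gram is counted at most as many times as it occurs in the reference, which is why the six occurrences of "the" in the example yield only 2 matches.  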
  
- 1-gram counts: The number of single words in the prediction that are present in the reference.
- 2-gram counts: The number of consecutive pairs of words (bigrams) in the prediction that are present in the reference.
- ...

#### Totals:
The totals represent the total number of n-grams in the prediction (regardless of whether they match the reference).


- 1-gram totals: The total number of individual words in the prediction.
- 2-gram totals: The total number of consecutive pairs of words (bigrams) in the prediction.
- ...

#### Precisions:
The precision for each n-gram level (1-gram, 2-gram, etc.) is calculated as the ratio between counts (matched n-grams) and totals (total n-grams in the prediction).

#### Brevity Penalty (BP):
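This penalizes predictions that are shorter than the reference. With prediction length $c$ and reference length $r$, $BP = 1$ if $c > r$ and $BP = e^{1 - r/c}$ otherwise.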
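The counts, totals, and precisions described above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes simple whitespace tokenization (`sacrebleu` applies its own tokenizer, so results can differ on punctuated text); the helper names `ngrams` and `clipped_counts` are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_counts(prediction, reference, max_n=4):
    """Return (counts, totals) per n-gram order for one prediction/reference pair."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    counts, totals = [], []
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(ngrams(pred_tokens, n))
        ref_ngrams = Counter(ngrams(ref_tokens, n))
        # Clip each match by how often the n-gram occurs in the reference
        counts.append(sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items()))
        totals.append(sum(pred_ngrams.values()))
    return counts, totals


counts, totals = clipped_counts("the the the the the the", "the cat is on the mat")
precisions = [c / t if t else 0.0 for c, t in zip(counts, totals)]
print(counts, totals, precisions)
```

Clipping is what stops a degenerate prediction from scoring well: "the" appears six times in the prediction but only twice in the reference, so only two of the six unigrams count as matches.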

We can see the precision of the 1-gram is 2/6, whereas the precisions for the 2/3/4-grams are all 0. The overall score should become 0, but `bleu_metric` applies some smoothing, so the score doesn't drop to 0 just because one n-gram order gets 0 precision. If you would like to get the exact value according to the formula, use `results = bleu_metric.compute(smooth_method="floor", smooth_value=0)`.

In [None]:
bleu_metric.add(
    prediction="the cat is on mat", reference=["the cat is on the mat"])
results = bleu_metric.compute()
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

We observe that the precision scores are much better. The 1-grams in the prediction all match, and only in the higher-order precision scores do we see that something is off. For the 4-gram there are only two candidates, `["the", "cat", "is", "on"]` and `["cat", "is", "on", "mat"]`, where the last one does not match, hence the precision of 0.5.

### ROUGE
The ROUGE score was specifically developed for applications like summarization where high recall is more important than just precision. The approach is very similar to the BLEU score in that we look at different n-grams and compare their occurrences in the generated text and the reference texts. The difference is that with ROUGE we check how many n-grams in the reference text also occur in the generated text.
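To make the recall orientation concrete, here is a minimal sketch of ROUGE-N in plain Python. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not always match the `rouge-score` package; the function name `rouge_n` is illustrative.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """Return ROUGE-N recall, precision and F-measure for one text pair."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_ngrams, ref_ngrams = ngram_counts(prediction), ngram_counts(reference)
    # Overlap: n-grams of the reference that are also found in the prediction
    overlap = sum(min(c, pred_ngrams[g]) for g, c in ref_ngrams.items())
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    precision = overlap / max(sum(pred_ngrams.values()), 1)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "fmeasure": f}


rouge1 = rouge_n("the cat is on mat", "the cat is on the mat", n=1)
rouge2 = rouge_n("the cat is on mat", "the cat is on the mat", n=2)
print(rouge1)
print(rouge2)
```

The denominator of the recall is the number of n-grams in the reference, so a prediction that drops content from the reference is penalized even if everything it does contain is correct.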

Let's look at an example:

In [None]:
rouge_metric = load_metric("rouge")

In [None]:
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(f"Features: {dataset['train'].column_names}")

The dataset has three columns: article, which contains the news articles, highlights with the summaries, and id to uniquely identify each article. Let’s look at an excerpt from an article:

In [None]:
sample = dataset["train"][1]
print(f"""
Article (excerpt of 500 characters, total length: {len(sample["article"])}):
""")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])

In [None]:
sample_text = dataset["train"][1]["article"][:2000]
# We'll collect the generated summaries of each model in a dictionary
summaries = {}

A convention in summarization is to separate the summary sentences by a newline. We could add a newline token after each full stop, but this simple heuristic would fail for strings like “U.S.” or “U.N.” The Natural Language Toolkit (NLTK) package includes a more sophisticated algorithm that can differentiate the end of a sentence from punctuation that occurs in abbreviations:

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")
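To see why the full-stop heuristic is not enough, the following sketch applies it to a made-up pair of sentences. The regular expression and the example text are illustrative assumptions and not part of the original pipeline.

```python
import re

# Two sentences, each containing an abbreviation with internal full stops
text = "The U.S. economy grew. The U.N. met on Monday."

# Naive heuristic: break after every full stop that is followed by whitespace
naive_sentences = re.split(r"(?<=\.)\s+", text)
print(naive_sentences)
```

The heuristic produces four fragments for two sentences, because it also splits after "U.S." and "U.N."; this is the failure mode that `sent_tokenize` is meant to avoid.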

In [None]:
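from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-xl")
# Prompt GPT-2 to summarize by appending "TL;DR" to the article text
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, truncation=True, clean_up_tokenization_spaces=True)
# Keep only the generated continuation and put each sentence on its own line
summaries["gpt2"] = "\n".join(
    sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :]))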

In [None]:
reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

In [None]:
for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())

ROUGE-1 refers to the overlap of unigrams (single words) between the system-generated summary and the reference summary. ROUGE-2 refers to the overlap of bigrams (two consecutive words) between the two. ROUGE-L is based on the longest common subsequence (LCS) shared by the generated text and the reference. In the Hugging Face Datasets implementation, two variations of it are calculated: ROUGE-L treats each text as a single sequence and ignores newlines, while ROUGE-Lsum splits the texts into sentences at newlines and aggregates the LCS matches sentence by sentence over the whole summary. This is why the summary sentences are separated by newlines.
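The longest-common-subsequence idea behind ROUGE-L can be sketched in plain Python. This minimal version treats each text as a single token sequence and assumes lowercased whitespace tokenization without stemming, so it is not a drop-in replacement for the `rouge-score` package; the names `lcs_length` and `rouge_l` are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match diagonally, or carry over the best shorter result
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Return LCS-based recall, precision and F-measure for one text pair."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    recall = lcs / max(len(ref_tokens), 1)
    precision = lcs / max(len(pred_tokens), 1)
    f = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"recall": recall, "precision": precision, "fmeasure": f}


rouge_l_scores = rouge_l("the cat is on mat", "the cat is on the mat")
print(rouge_l_scores)
```

Unlike ROUGE-2, the subsequence does not have to be contiguous, so the prediction is credited for keeping the words in the right order even though "the" is missing before "mat".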

### Human Evaluation
Human evaluation may be necessary for tasks like summarization, where metrics like BLEU and ROUGE may not fully capture the quality of the generated text.

## Perplexity
Perplexity is a commonly used metric for evaluating language models, particularly those involved in tasks like text generation, machine translation, or language modeling. It measures how well a language model predicts a sample of text and is directly related to the probability the model assigns to the test data. In other words, perplexity quantifies how uncertain or "perplexed" the model is about the next word in a sequence. A lower perplexity indicates that the model is better at predicting the next word, and a perplexity of 1 would mean perfect predictions. A higher perplexity means the model is more "confused." So, if a model has a perplexity of 10, this can be interpreted as the model being as uncertain as if it were choosing the next word from a set of 10 equally likely possibilities. For a sequence of $N$ tokens $w_1, \dots, w_N$:
$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)$$
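Unlike accuracy, BLEU, or ROUGE, perplexity is not computed by comparing predictions against references. It is derived from the probabilities the model itself assigns to each token of the test text, so it does not fit the `compute(predictions=..., references=...)` pattern used for the other metrics in this notebook.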
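The perplexity formula can be evaluated directly once per-token probabilities are available. The following is a minimal sketch in plain Python, assuming the probabilities have already been extracted from a model; the function name `perplexity` and the probability lists are made-up illustrations.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("token_probs must contain at least one probability")
    # Average negative log-likelihood per token, then exponentiate
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 0.1 is as uncertain as a
# uniform choice among 10 options
uniform_ppl = perplexity([0.1] * 5)
# A model that is mostly confident, with one surprising token
mixed_ppl = perplexity([0.9, 0.8, 0.05, 0.9])
print(uniform_ppl, mixed_ppl)
```

Because the log-probabilities are averaged, a single very unlikely token raises the perplexity noticeably even when the remaining predictions are confident.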

## Conclusion
This notebook shows how to evaluate different aspects of an LLM using multiple metrics. Each task may require different metrics depending on the output format and objectives.

In [None]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)