# Summary evaluation

Today we'll take a look at how we can evaluate the quality of model-generated summaries in different ways.

## Install packages

Tip: You might need to restart the Jupyter kernel after installation.

In [None]:
%pip install rouge_score
%pip install bert_score 
%pip install blanc 
%pip install nltk 
%pip install sentencepiece 
%pip install protobuf 
%pip install transformers 
%pip install datasets 
%pip install spacy
%pip install evaluate
!python -m spacy download en_core_web_sm

## Load the data

We'll use a small slice of the English part of the `xlsum` dataset from the `datasets` library. You can take a look at what kind of data this includes [here](https://huggingface.co/datasets/csebuetnlp/xlsum).

In [None]:
from datasets import load_dataset

ds = load_dataset("csebuetnlp/xlsum", "english", split='train[:1%]')

In [None]:
ds

The articles are in the `text` column and the summaries are in the `summary` column. Let's extract them and take a look at a few examples.

In [None]:
articles = ds["text"][0:10]
articles

In [None]:
reference_summaries = ds["summary"][0:10]
reference_summaries

Discuss:
- Based on these examples, what do you think of the quality of the dataset?
- Do you foresee any potential pitfalls for evaluation, based on your observations?

Let's take a look into the density of the summaries.

In [None]:
from utils.fragments import Fragments

fragment = [Fragments(summary, article, lang="en") for summary, article in zip(reference_summaries, articles)]
density = [frag.density() for frag in fragment]

In [None]:
len(list(filter(lambda x: x <= 1.5, density))) / len(density)

Recall that summaries with a density of 1.5 or below are considered abstractive. The cell above computes the share of reference summaries that meet this threshold, and by that measure these appear to be highly abstractive summaries.
However, the density values are not a perfect measure of abstractive quality:
- Can you think of a way we might be able to "game" the density metric?
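
To make the density number less of a black box, here is a minimal sketch of extractive fragment density: greedily find the longest spans the summary shares with the article, then average the squared span lengths over the summary length. It is not the code in `utils.fragments`; the function names and the whitespace tokenisation are simplifying assumptions made for illustration.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest token spans shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0  # length of the longest article match starting at summary position i
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        i += max(best, 1)
    return fragments


def density(article, summary):
    """Sum of squared fragment lengths divided by the summary length."""
    # Assumption: lowercase whitespace tokens instead of a real tokeniser.
    article_tokens = article.lower().split()
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) ** 2 for f in fragments) / len(summary_tokens)


article = "the cat sat on the mat while the dog slept"
print(density(article, "the cat sat on the mat"))              # copied span: 6.0
print(density(article, "a feline rested as a canine napped"))  # no overlap: 0.0
```

Because fragment lengths are squared, one long copied span raises the density far more than many isolated shared words.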

## Generating summaries
Now let's generate some summaries using a pre-trained model. We'll use the `google/mt5-small` model, loaded via the `transformers` library.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, min_length=10, max_length=50)

To make everything a bit easier for ourselves, let's make a function which:
1. Takes an input text
2. Tokenises the text (remember to set the padding and truncation arguments to True)
3. Generates a summary based on the tokenised input (and prompt, if you're so inclined)
4. Decodes the generated summary from tokens into words, and
5. Returns the output

(Hint: there is one potential solution in the class_8_solution notebook, if you're in need :-)).
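
If you want a starting point, the five steps could look roughly like the sketch below. This is one possible shape, not the solution from the class notebook: the name `summarise`, the `"summarize: "` prompt and the `max_new_tokens` default are assumptions, and the tokenizer and model are passed in explicitly so the function does not depend on global variables.

```python
def summarise(text, tokenizer, model, prompt="summarize: ", max_new_tokens=50):
    """Tokenise `text`, generate a summary with `model`, and decode it to a string."""
    # Steps 1-2: prepend the (optional) prompt and tokenise with padding and truncation.
    inputs = tokenizer(prompt + text, return_tensors="pt", padding=True, truncation=True)
    # Step 3: generate output token ids from the tokenised input.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Steps 4-5: decode the first (and only) generated sequence and return it.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects loaded earlier, a call would be `summarise(articles[0], tokenizer, model)`. Whether a prompt helps at all depends on the model, so it is worth trying with and without one.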

Now let's use that function to generate some summaries for the articles in the dataset.

In [None]:
your_pipeline_function(articles[0])

In [None]:
generated_summaries = [your_pipeline_function(article) for article in articles]

In [None]:
generated_summaries

## Evaluation
Now let's evaluate the quality of the generated summaries with some commonly used metrics.

In [None]:
from evaluate import load

rouge = load("rouge")
rouge.compute(references=reference_summaries, predictions=generated_summaries)

We can also take a look at the ROUGE scores for the individual summaries:

In [None]:
rouge.compute(references=reference_summaries, predictions=generated_summaries, use_aggregator=False)
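
To see what ROUGE-1 is counting, here is a minimal sketch that computes unigram precision, recall and F1 by hand. It is a simplification under stated assumptions: the function name `rouge1` is made up, and tokens are plain lowercase whitespace splits, so the numbers will not necessarily match the `rouge` metric loaded above, which applies its own tokenisation.

```python
from collections import Counter


def rouge1(reference, prediction):
    """Unigram-overlap precision, recall and F1 between two strings."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# 5 of the 6 unigrams overlap in each direction.
print(rouge1("the cat sat on the mat", "the cat lay on the mat"))
```

ROUGE-2 follows the same idea with bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.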

The BERTScore metric returns one precision, recall and F1 value per summary rather than an aggregate, but we can average the scores ourselves to get an overall score.

In [None]:
bertscore = load("bertscore")
bertscores = bertscore.compute(references=reference_summaries, predictions=generated_summaries, lang="en")
bertscores

In [None]:
import numpy as np

np.mean(bertscores["precision"]), np.mean(bertscores["recall"]), np.mean(bertscores["f1"])
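
The precision and recall values come from greedy matching between token embeddings. The sketch below shows that matching step on toy vectors. It is an illustration only: the function names and the 2-dimensional vectors are made up, and the real metric uses contextual embeddings from a pre-trained model, with optional extras such as idf weighting that are left out here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match(ref_vecs, cand_vecs):
    """Match every token to its most similar token on the other side."""
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: how well each candidate token is supported by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy embeddings (assumed values): the candidate covers one of two reference tokens.
reference_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidate_vectors = [[1.0, 0.0]]
print(greedy_match(reference_vectors, candidate_vectors))
```

A short candidate that only says things found in the reference gets high precision but low recall, which is why it helps to look at all three values.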

In [None]:
import nltk

nltk.download('punkt_tab')

We can also try a reference-free metric, such as BLANC, for cases where we do not have reference summaries or do not want to rely on them (for example, because their quality is questionable).

In [None]:
from blanc import BlancHelp

# Use a separate name so the instance does not shadow the `blanc` module
blanc_help = BlancHelp()
blanc_help.eval_pairs(articles, generated_summaries)

Discuss:
- What do these values tell us about the quality of the generated summaries?
- What are the strengths and weaknesses of using reference-free metrics?
- What are the potential weaknesses of using a lesser-known metric?

## Exercise

Now, the summaries we generated aren't exactly great, likely because the mt5 model was not fine-tuned for that purpose.
- Try to generate 10 new summaries using a model that has been fine-tuned for summarisation (e.g., our old friend, flan-t5-small)
- When you have the summaries, evaluate them using the same quantitative metrics as before
- Then try to conduct a qualitative evaluation of the summaries: in your groups, decide on some evaluation criteria (e.g., ranking, "stars", etc.), evaluate the summaries based on these criteria, and compare your results both within the group and with the quantitative metrics
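
One way to compare your ratings with a quantitative metric is a rank correlation. The sketch below computes Spearman's rho using only the standard library. The helper names are made up, and the two example lists are placeholder values for illustration, not results from this notebook; swap in your own ratings and per-summary metric scores.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else float("nan")


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only.
human_stars = [2, 4, 3, 5, 1]
metric_scores = [0.10, 0.25, 0.15, 0.20, 0.05]
print(spearman(human_stars, metric_scores))
```

With only 10 summaries the correlation will be noisy, so treat it as a prompt for discussion rather than a firm result.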

### Bonus exercise
Try to create an LLM judge that can evaluate the quality of the summaries based on the criteria you defined.
- Load in a generative pre-trained model from Hugging Face
- Prompt it with your evaluation criteria
- Compare its evaluation with your own
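
As a starting point for the judge, the sketch below covers only the parts around the model: building the prompt from your criteria and pulling a score out of the generated text. The function names, the prompt wording and the 1 to 5 scale are all assumptions; the generation step itself is left to whichever model you load.

```python
import re


def build_judge_prompt(article, summary, criteria):
    """Assemble a prompt asking the model for a single overall score."""
    criteria_lines = "\n".join(f"- {criterion}" for criterion in criteria)
    return (
        "You are evaluating a summary of a news article.\n"
        f"Criteria:\n{criteria_lines}\n\n"
        f"Article:\n{article}\n\n"
        f"Summary:\n{summary}\n\n"
        "Give a single overall score from 1 (worst) to 5 (best).\n"
        "Score:"
    )


def parse_score(model_output, low=1, high=5):
    """Return the first integer in the output if it is on the scale, else None."""
    match = re.search(r"\d+", model_output)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


prompt = build_judge_prompt(
    "Example article text.",
    "Example summary.",
    ["Faithful to the article", "Covers the main point", "Fluent"],
)
print(prompt)
print(parse_score(" 4, because the summary is accurate"))
```

Small models often ignore the requested format, which is why `parse_score` returns `None` instead of guessing. Counting how often that happens is itself a useful observation when you compare the judge with your own evaluation.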