# Summary evaluation

Today we'll look at several ways to evaluate the quality of model-generated summaries.

## Install packages

In [2]:
!python -m spacy download en_core_web_sm
%pip install datasets evaluate rouge_score bert_score blanc nltk sentencepiece protobuf transformers

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Load the data

We'll use a small slice of the English part of the `xlsum` dataset from the `datasets` library. You can take a look at what kind of data this includes [here](https://huggingface.co/datasets/csebuetnlp/xlsum).

In [1]:
from datasets import load_dataset

ds = load_dataset("csebuetnlp/xlsum", "english", split='train[:1%]')



In [2]:
ds

Dataset({
    features: ['id', 'url', 'title', 'summary', 'text'],
    num_rows: 3065
})

The articles are in the `text` column and the summaries are in the `summary` column. Let's extract them and take a look at a few examples.

In [9]:
articles = ds["text"][0:10]
articles

 'Atlantis Resources unveiled the marine energy device at Invergordon ahead of it being shipped to Kirkwall. Trials on the device will now be run at the European Marine Energy Centre test site off Eday. The device stands 22.5m (73ft) tall, weighs 1,300 tonnes and has two sets of blades on a single unit. It could generate enough power for 1,000 homes.',
 'Police were called to the scene outside the Coral shop on Compton Road in Harehills just before 14:00 BST. The man was taken to hospital for treatment but his condition is not known. West Yorkshire Police said the area has been cordoned off and officers remain at the scene. The force has appealed for information.',
 'Anthony ZurcherNorth America reporter@awzurcheron Twitter With tensions rising between the US and Iran, the long-term consequences will largely depend on the nature of Iran\'s response to the attack and the intensity of any conflict that follows. If the end result is a US withdrawal from Iraq, the politics of the situation

In [10]:
reference_summaries = ds["summary"][0:10]
reference_summaries

['Winds could reach gale force in Wales with stormy weather set to hit the whole of the country this week.',
 'The massive tidal turbine AK1000 has been installed in 35m (114.8ft) of water at a test site in Orkney.',
 'A man has been stabbed in broad daylight outside a betting shop in Leeds.',
 'It was inevitable that the fallout from the US airstrike that killed Iranian General Qasem Soleimani would spill into presidential politics. Everything spills into presidential politics these days, and this is without a doubt a major story.',
 'Week four of social distancing is starting to take its toll.',
 'A 37-year-old man has been arrested as part an ongoing investigation into criminality linked to the North Antrim Ulster Defence Association (UDA).',
 'Electric buses will soon be running on the roads in Coventry.',
 'A Jersey deputy is calling on the number of States members to be reduced more than current proposals.',
 'About 200 posts are to go at the Boots site in Nottingham.',
 'A degre

Discuss:
- Based on these examples, what do you think of the quality of the dataset?
- Do you foresee any potential pitfalls for evaluation, based on your observations?

Let's look at the density of the summaries, i.e., how much of each summary consists of long spans copied verbatim from its article.

In [None]:
from utils.fragments import Fragments

# One Fragments object per (summary, article) pair
fragments = [Fragments(summary, article, lang="en") for summary, article in zip(reference_summaries, articles)]
density = [frag.density() for frag in fragments]



In [55]:
# Share of summaries with a density of at most 1.5
sum(d <= 1.5 for d in density) / len(density)

1.0

If you remember, summaries with density values below 1.5 are considered abstractive. All ten of our reference summaries fall at or below that threshold, so they appear to be highly abstractive.
However, density is not a perfect measure of abstractiveness:
- Can you think of a way we might be able to "game" the density metric?
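
For reference, here is a rough idea of what a density computation can look like. This is a minimal sketch, assuming the extractive fragment density of Grusky et al. (2018) and plain lowercased whitespace tokenisation; it is not the code in `utils.fragments`, and the names `extractive_fragments` and `fragment_density` are made up for illustration.

```python
def extractive_fragments(summary_tokens, article_tokens):
    """Greedily collect the longest token spans shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        # Find the longest article span that matches the summary from position i.
        for j in range(len(article_tokens)):
            k = 0
            while (
                i + k < len(summary_tokens)
                and j + k < len(article_tokens)
                and summary_tokens[i + k] == article_tokens[j + k]
            ):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i : i + best])
            i += best
        else:
            i += 1  # this summary token does not occur in the article
    return fragments


def fragment_density(summary, article):
    """Sum of squared fragment lengths divided by the summary length."""
    summary_tokens = summary.lower().split()
    article_tokens = article.lower().split()
    if not summary_tokens:
        return 0.0
    fragments = extractive_fragments(summary_tokens, article_tokens)
    return sum(len(f) ** 2 for f in fragments) / len(summary_tokens)
```

Because each fragment contributes the square of its length, one long copied span raises the density far more than many short ones.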

## Generating summaries
Now let's generate some summaries using a pre-trained model. We'll use the `google/mt5-small` model, loaded with the `transformers` library.

In [16]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, min_length=10, max_length=50)



To make everything a bit easier for ourselves, let's make a function which:
1. Takes an input text
2. Tokenises the text (remember to set the padding and truncation arguments to True)
3. Generates a summary based on the tokenised input (and prompt, if you're so inclined)
4. Decodes the generated summary from tokens into words, and
5. Returns the output

(Hint: there is one potential solution in the class_8_solution notebook, if you're in need :-)).

In [17]:
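# A minimal baseline: tokenise the text, generate, and decode (special tokens such as <pad> and </s> are kept)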
def your_pipeline_function(input_text):
    output = model.generate(tokenizer(input_text, return_tensors="pt")["input_ids"])
    return tokenizer.decode(output[0])
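
A slightly fuller variant that follows all five steps listed above might look like the sketch below. It is only one possible solution: the name `summarise`, the `"summarize: "` prompt, and the 512-token input limit are assumptions for illustration, and the tokenizer and model are passed in as arguments so the function is not tied to a particular checkpoint.

```python
def summarise(input_text, tokenizer, model, prompt="summarize: ", max_input_tokens=512):
    """Tokenise a text, generate a summary, and decode it back to a string."""
    # Steps 1-2: prepend the (optional) prompt and tokenise with padding and truncation.
    inputs = tokenizer(
        prompt + input_text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_input_tokens,
    )
    # Step 3: generate token ids, passing the attention mask along with the input ids.
    output_ids = model.generate(**inputs)
    # Steps 4-5: decode the first sequence, dropping special tokens such as <pad> and </s>.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects loaded earlier, this would be called as `summarise(articles[0], tokenizer, model)`. Note that it drops the special tokens, so its output would not match the raw strings shown in this notebook.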

Now let's use that function to generate some summaries for the articles in the dataset.

In [18]:
your_pipeline_function(articles[0])



'<pad> <extra_id_0>.com.au./ <extra_id_10>.</s>'

In [19]:
generated_summaries = [your_pipeline_function(article) for article in articles]

In [20]:
generated_summaries

['<pad> <extra_id_0>.com.au./ <extra_id_10>.</s>',
 '<pad> <extra_id_0>.com <extra_id_10>.com an <extra_id_11>.</s>',
 '<pad> <extra_id_0>.com.uk.com <extra_id_10>.com</s>',
 '<pad> <extra_id_0> the debate. Politics.com</s>',
 '<pad> <extra_id_0> Jamie Kennedy says: "It was hard.</s>',
 '<pad> <extra_id_0>. ) <extra_id_37>.graves. <extra_id_4> - Criminal Investigations</s>',
 '<pad> <extra_id_0> - Coventry - Coventry</s>',
 '<pad> <extra_id_0> - St Helier - St Helier</s>',
 '<pad> <extra_id_0> - Consumer Reports - Business</s>',
 '<pad> <extra_id_0>.com.uk.com/courses.</s>']
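
The raw outputs still contain special tokens (`<pad>`, `</s>`, and the `<extra_id_N>` sentinels), which a word-overlap metric may count as ordinary words. One option is to pass `skip_special_tokens=True` to `tokenizer.decode`; another is to clean the strings afterwards. Below is a minimal sketch of the second option, where `strip_special_tokens` is a hypothetical helper and the token patterns are taken from the outputs above.

```python
import re

# Special tokens seen in the generated outputs, plus <unk> as a precaution.
_SPECIAL_TOKENS = re.compile(r"<pad>|</s>|<unk>|<extra_id_\d+>")


def strip_special_tokens(text):
    """Remove special tokens and collapse the leftover whitespace."""
    return " ".join(_SPECIAL_TOKENS.sub(" ", text).split())
```

For example, `[strip_special_tokens(s) for s in generated_summaries]` would give a cleaned list to score alongside the raw one.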

## Evaluation
Now let's evaluate the quality of the generated summaries with some commonly used metrics.

In [46]:
from evaluate import load

rouge = load("rouge")
rouge.compute(references=reference_summaries, predictions=generated_summaries)

{'rouge1': 0.025069921179760996,
 'rouge2': 0.0,
 'rougeL': 0.02265446224256293,
 'rougeLsum': 0.025069921179760996}

We can also take a look at the ROUGE scores for the individual summaries:

In [47]:
rouge.compute(references=reference_summaries, predictions=generated_summaries, use_aggregator=False)

{'rouge1': [0.0,
  0.0,
  0.0,
  0.08695652173913043,
  0.0,
  0.052631578947368425,
  0.1111111111111111,
  0.0,
  0.0,
  0.0],
 'rouge2': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 'rougeL': [0.0,
  0.0,
  0.0,
  0.08695652173913043,
  0.0,
  0.052631578947368425,
  0.1111111111111111,
  0.0,
  0.0,
  0.0],
 'rougeLsum': [0.0,
  0.0,
  0.0,
  0.08695652173913043,
  0.0,
  0.052631578947368425,
  0.1111111111111111,
  0.0,
  0.0,
  0.0]}
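
To see where these numbers come from, ROUGE-1 can be approximated by hand. The sketch below is a simplified version, assuming lowercased alphanumeric tokens and no stemming; `rouge1_f1` is a name made up for illustration, and the library's own tokenisation may differ in detail.

```python
import re
from collections import Counter


def rouge1_f1(reference, prediction):
    """F1 over the unigrams shared by reference and prediction."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    pred_counts = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Applied to the seventh pair (the Coventry buses summary), this gives 1/9 ≈ 0.111, the same value as in the per-summary scores above. The only shared word is "coventry", and the pieces of the special tokens ("pad", "extra", "id", "0", "s") count as predicted words, which lowers precision.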

The BERTScore metric does not use an aggregator, but we can average the scores ourselves to get an overall score.

In [51]:
bertscore = load("bertscore")
bertscores = bertscore.compute(references=reference_summaries, predictions=generated_summaries, lang="en")
bertscores



{'precision': [0.7897378206253052,
  0.765849232673645,
  0.7976936101913452,
  0.7899143695831299,
  0.8089548945426941,
  0.7626368999481201,
  0.8046826124191284,
  0.7890909314155579,
  0.7653211951255798,
  0.8175321221351624],
 'recall': [0.8295824527740479,
  0.7967321872711182,
  0.8265303373336792,
  0.8113272190093994,
  0.852196991443634,
  0.8178002834320068,
  0.873009204864502,
  0.8301175832748413,
  0.8252493143081665,
  0.8258839845657349],
 'f1': [0.8091698884963989,
  0.7809855341911316,
  0.8118559718132019,
  0.8004776239395142,
  0.83001309633255,
  0.7892558574676514,
  0.8374545574188232,
  0.8090844750404358,
  0.7941563129425049,
  0.8216868042945862],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.44.2)'}

In [53]:
import numpy as np

np.mean(bertscores["precision"]), np.mean(bertscores["recall"]), np.mean(bertscores["f1"])

(0.7891413688659668, 0.8288429558277131, 0.8084140121936798)

We can also try a reference-free metric, such as BLANC, in case we do not have access to reference summaries, or we do not want to rely on them due to quality, etc.

In [64]:
from blanc import BlancHelp

# Use a separate name so the instance does not shadow the `blanc` module
blanc_help = BlancHelp()
blanc_help.eval_pairs(articles, generated_summaries)



[-0.05405405405405406,
 0.05555555555555555,
 0.0,
 0.014005602240896359,
 0.014492753623188406,
 0.06060606060606061,
 0.044444444444444446,
 0.0,
 0.0,
 0.07407407407407407]

Discuss:
- What do these values tell us about the quality of the generated summaries?
- What are the strengths and weaknesses of using reference-free metrics?
- What are the potential weaknesses of using a lesser-known metric?

## Exercise

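Now, the summaries we generated aren't exactly great, likely because `mt5-small` has only been pre-trained and was not fine-tuned for summarisation (the `<extra_id_N>` tokens in the output are sentinel tokens from its pre-training objective).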
- Try to generate 10 new summaries using a model that has been fine-tuned for summarisation (e.g., our old friend, flan-t5-small)
- When you have the summaries, evaluate them using the same quantitative metrics as before
- Then try to conduct a qualitative evaluation of the summaries: in your groups, decide on some evaluation criteria (e.g., ranking, "stars", etc.), evaluate the summaries based on these criteria, and compare your results within the group and with the quantitative metrics
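
One way to compare your own ratings with a quantitative metric is a rank correlation. The sketch below computes Spearman's correlation in plain Python, assuming you have one human rating and one metric score per summary; `average_ranks` and `spearman` are names made up for illustration, and a library routine such as `scipy.stats.spearmanr` would do the same job.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one list is constant
    return cov / (var_x * var_y) ** 0.5
```

A value close to 1 would suggest the metric orders the summaries much as you did; with only 10 summaries, treat the result as a rough indication.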

### Bonus exercise
Try to create an LLM judge that can evaluate the quality of the summaries based on the criteria you defined.
- Load a generative pre-trained model from Hugging Face
- Prompt it with your evaluation criteria
- Compare its evaluation with your own
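
As a starting point, the prompting and score-parsing parts of such a judge could look like the sketch below. Everything here is an assumption for illustration: the helper names `build_judge_prompt` and `parse_score`, the prompt wording, and the 1 to 5 scale. The model call itself is left out, since it depends on the model you pick.

```python
import re


def build_judge_prompt(article, summary, criteria):
    """Assemble a plain-text prompt asking for a single overall score."""
    criteria_lines = "\n".join(f"- {criterion}" for criterion in criteria)
    return (
        "You are evaluating a summary of a news article.\n"
        f"Criteria:\n{criteria_lines}\n\n"
        f"Article:\n{article}\n\n"
        f"Summary:\n{summary}\n\n"
        "Give a single overall score from 1 (worst) to 5 (best).\n"
        "Score:"
    )


def parse_score(response, low=1, high=5):
    """Return the first integer in the response, or None if it is missing or out of range."""
    match = re.search(r"\d+", response)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None
```

Small models often ignore the requested format, so checking how many responses `parse_score` rejects is itself a useful sanity check on the judge.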

In [62]:
import nltk

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/au594328/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True