<a href="https://colab.research.google.com/github/JpChii/nlp-with-hugging-face/blob/main/notebooks/6-summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization

Summarization requires a range of abilites, such as understanding long passages, reasoning about the contens, and producing fluent text that incroprates the main topics from original document. Morever summaraizing a new article is different compared to legal document and each of them requires certain degree of domain generalization. For these reasons summarizing is a difficult task for natural language models, including transformers. Despite these challenges, text summarization has huge applications,

* Helps domain experts to speed up the workflow
* Helps enterprise to summarize contracts, internal knowledge
* Generate content for social media releases and more.

Summarization is a classic sequence-to-sequence(seq2seq) task with an input text and a target text. This is where encoder-decoder transformer excel.

In this notebook we'll cover,

* The challenges involved
* Pretrained transformers to summarize documents, (i.e) our own encoder-decoder model to condense dialouges between several people into a crisp summary.

***Dataset to be used in this notebook*** --> [CNN/DailyMail corpus](https://huggingface.co/datasets/cnn_dailymail)

## The CNN/DailyMail Dataset

This dataset consisits of 300,000 pairs of articles and their summaries from CNN and DailyMail. The Summaries are bullet points for the articles provided by the papers. The summaries are abstractive and not extracives, meaning summaries are new text and not simple excerpts from the article. We'll be using the dataset from hub and version 3.0.0  which is a nonanonymized version set up for summarization. We can select the version using `version` keyword.

In [1]:
!pip install transformers[sentencepiece] datasets nltk sacrebleu rouge_score -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m51.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.3/519.3 kB[0m [31m45.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m118.9/118.9 kB[0m [31m14.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m30.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m116.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m68.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m81.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m14.7 MB/s[0m e

In [2]:
import transformers
import datasets
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device used in this notebook: {device}")

Device used in this notebook: cuda


In [3]:
from datasets import load_dataset

dataset = load_dataset(
    path="cnn_dailymail",
    version="3.0.0"
)
print(f"Features: {dataset['train'].column_names}")

Downloading builder script:   0%|          | 0.00/8.33k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/9.88k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/15.1k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/159M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/376M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/661k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Features: ['article', 'highlights', 'id']


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

Dataset contains three columns: `article`, which contains the news articles, `highlights` with the summaries, and `id` to uniquely identify each article. Let's look at an excerpt from an article. Let's check out an single sample.

In [5]:
sample = dataset["train"][0]
print(f"""Article (excerpt of 500 characters, total length: {len(sample['article'])})""")
print(sample["article"][:500])
print(f"\nSummary (length: {len(sample['highlights'])})")
print(sample["highlights"])

Article (excerpt of 500 characters, total length: 2527)
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s

Summary (length: 217)
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


In [6]:
2525 // 217

11

Articles are very long compared to the target summary, in this particular case the differnce is 11-fold.
Long articles pose a challenge to transformer models since their context size limited to 1,000 tokens or so, which is equivavalent to a few paragraphs of text. The standard, yet crude way to deal with this is to truncate the text beyond model's context size. Obviously ther could be important information for the summary towards the text, but we've to live with this limitation of model architectures.

## Text summarization Pipelines

Let's checkout how the popular summarization transformer models work on our single sample. Even though model's input sizes may vary, let's keep the sample size to 2000 characters to make the model outputs comparable.

In [7]:
sample_text = dataset["train"][0]["article"][:2000]
summaries = {} # Dict to store summaries from different models
sample_text

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [8]:
import nltk
# sent_tokenize to split sequence based on sentnces
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

A convention in summarization is to summarize text in full sequence until new line. We can add new line at every dot but this heurisitic will fail for abbreviations with dot like U.N. The Natural Language Toolkit(NLTK) has more sophisticated alogirthms that can differnetiate the end of sequence from the punctutation in abbreviations.

In [9]:
# let's verify the above point
string = "The U.N is an organization. The U.s are a country."
sent_tokenize(string)

['The U.N is an organization.', 'The U.s are a country.']

### Summarization Baseline

A comman baseline for summarizing news articles is to simply take the first three sentences of the article. We can doi this using NLTK's tokenizer.

In [10]:
def three_sentence_summary(text):
  return "\n".join(sent_tokenize(text)[:3])

In [11]:
summaries["baseline"] = three_sentence_summary(
    text=sample_text
)
summaries["baseline"]

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.\nDaniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.\n"I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.'

#### GPT-2

In [5-text-genreation.ipynb](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notebooks/6_Summarization.ipynb) we checkout how GPT-2 can generate text with some input prompt. Because GPT-2 models are trained on web which includes reddit articles it's good on summarization as well. We can trigger summarization with GPT-2 by appending "TL:DR" at the end of propmt. "TL;DR" is often used to indicate blogs that are too long didn't read to indicate a short version of the long post.

We'll start the summarization experiment by with `pipeline()` from transformers.

In [12]:
from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-large")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(
    gpt2_query,
    max_length=512,
    clean_up_tokenization_spaces=True,
)
pipe_out

Downloading (…)lve/main/config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box off

In [13]:
summaries["gpt2"] = pipe_out[0]["generated_text"][len(gpt2_query):]
summaries["gpt2"]

"DANIEL RADCLIFFE: I'm not going to be an extravagant young man. I really want to do my best for his education, for his entertainment, and for him being a better person. I'm not going to be the one spending all of that on a Ferrari or"

#### T5

T5 developers performed a comprehensive study of transfer learning and found they could create a universal transformer by formulating all tasks as text-to-text-tasks.

The T5 checkpoints are pretrained on mixture of unsupervised data(to reconstruct masked words) and supervissed data for several tasks, including summarization. These checkpoints can be directly used for inference using the same prompts used during pretraining. Few input format's as follow,

* To summarize `"Summarize: <ARTICLE>"`
* To translate `translate English to German: <TEXT>`

This capabality makes T5 extremely versatile and with single model we can solve many tasks.

Let's load this and test it out.

In [14]:
pipe = pipeline("summarization", model="t5-large")
pipe_out = pipe(sample_text)
pipe_out

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'summary_text': "Harry Potter star Daniel Radcliffe turns 18 on monday . the young actor says he has no plans to fritter his cash away . details of how he'll mark his landmark birthday are under wraps ."}]

In [15]:
pipe_out

[{'summary_text': "Harry Potter star Daniel Radcliffe turns 18 on monday . the young actor says he has no plans to fritter his cash away . details of how he'll mark his landmark birthday are under wraps ."}]

In [16]:
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))
summaries

{'baseline': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.\nDaniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.\n"I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.',
 'gpt2': "DANIEL RADCLIFFE: I'm not going to be an extravagant young man. I really want to do my best for his education, for his entertainment, and for him being a better person. I'm not going to be the one spending all of that on a Ferrari or",
 't5': "Harry Potter star Daniel Radcliffe turns 18 on monday .\nthe young actor

*All T5 capabalities*

![alt t5](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notes/images/6-summarization/t5-capabalities.png?raw=1)

#### BART

BART also uses an encoder-decoder architecture and is trained to reconstruct corrupted inputs. It combines the pretraining schems of bert and GPT-2. We'll use the `facebook/bart-large-cnn` checkpoint, which has been specifically fine-tuned on CNN/DailyMail dataset:

In [17]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [18]:
summaries["bart"]

'Harry Potter star Daniel Radcliffe turns 18 on Monday.\nHe gains access to a reported £20 million ($41.1 million) fortune.\nRadcliffe says he has no plans to fritter his cash away on fast cars, drink and parties.\nHis earnings from the first five Potter films have been held in a trust fund.'

#### PEGASUS

The authors argue that the pretraining objective needs to be closer to the downstream task, the more effective it is.
With summarization as the objective instead of general language modelling, they masked the sentence that contain most of the info of their surrounding paragraphs(using summarizaiton evaluation metrics as a heuristic for content overlap) and pretrained PEGASUS model to reconstruct the senteces to obtain sota model for text summarization.

PEGAUS is an encoder-decoder transformer with it's pretraining objective to predict masked sentences in multisentence texts.

In [19]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
pipe_out

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading (…)neration_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

[{'summary_text': "Harry Potter star Daniel Radcliffe gains access to a reported £20 million fortune .<n>Young actor says he has no plans to fritter his cash away .<n>Radcliffe's earnings from the first five Potter films have been held in a trust fund ."}]

In [20]:
pipe_out[0]["summary_text"]

"Harry Potter star Daniel Radcliffe gains access to a reported £20 million fortune .<n>Young actor says he has no plans to fritter his cash away .<n>Radcliffe's earnings from the first five Potter films have been held in a trust fund ."

pegasus has different special tokens for newlines, so we process it using a different method instead of sent_tokenizer.

In [21]:
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")
summaries["pegasus"]

"Harry Potter star Daniel Radcliffe gains access to a reported £20 million fortune.\nYoung actor says he has no plans to fritter his cash away.\nRadcliffe's earnings from the first five Potter films have been held in a trust fund ."

## Comparing Different summaries

Now that we have generated summaries wit four different models. Before comparison, GPT-2 hasn't been trained on the dataset at all, T5 is fine-tuned on this task along with other and BART and PEGASUS have been exclusivley fine-tuned on this task. With this info recall. Let's compare the summaries.

In [22]:
print("GROUND TRUTH")
print(dataset['train'][0]['highlights'])
print("")

for model_name in summaries:
  print(model_name.upper())
  print(summaries[model_name])
  print("")

GROUND TRUTH
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

BASELINE
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him.
Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.
"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.

GPT2
DANIEL RADCLIFFE: I'm not going to be an extravagant young man. I really want to do my best for his

On comparing the summaries, we can see GPT-2 has summarized the character itself into third person. As it's not trained to generate facts. It often hallucinates or invents facts.

Moving on to T5, it's better than GPT-2 but it adds a twist to how daniel celebrates his birthday which deviates a bit from truth.

BART does better than T5 but has more text compared to PEGASUS which summarized it best among the four models.

All four summaries are qualitativley reasonable results, we can try out few more samplples. But this is not a systematic way to determine which model is better. The metrics we've used till now accuracy, f1, recall, precision are not applicable to this task. Because for each of these "gold standard" summary written by a human, dozens of other summaries with different synonyms, paraphrases os slightly differet way of formulating the facts might be just as acceptable.

Next we'll look at the metrics for measuring the quality of generated text.

## Measuring the Quality of Generated Text

Good evaluation metrics are important, since we use them to measure the performance of models during training and production. If we've bad metrics then we'll be oblivious to model degradation and if it doesn't align with buisness goals then we might not create any value.

Measuring performance on a text genration task is not as easy as with standard classification tasks such as sentiment analysis or ner. Take an example translating I love dogs to tamil, it can be "enaku nai na romba pudikum" or "nai na enaku usuru". The translation can vary from person to person or by the same person as well in different circumstances. Fortunatley there are alternatives.

Two of the most common metrics to evaluate generated text are BLEU and ROUGE. Let's take a look at how they're defined.

### BLEU

The idea of BLEU is simple, instead of looking at how many are generated tokens are perfectly aligned with reference tokens, we count the number of words or n-grams. We count the number of words available in that occur in reference occurs in generated text and divide it  by the lenght of generation.


There is a problem with this vanilla precision, let's say the generated text has one word from reference text repeated to the length of reference text. Then we'll have perfect precision! For this reason, the authors introduced a small modification: a word is only counted to the equivavlent of the repetitions of the same word in reference text.

Example:

Ref: The cat is on the mat
Gen: the the the the the the

p_vannila = 6 / 6 --> 1.0

p_mode = 2 / 6 --> 0.33

the occurs only twice in reference so 2 is the numerator.

With that simple correction we've a much reasonable value.  

Let's extend this by not only counting single words but n-grams as well. Let *snt* be the generated sentence and *snt'* be the reference sentence. We extract al possible n-grams of degree n and do the count(each discovered n-gram count summed together) to get the precision $p_n$

Also generation count is clipped, meaning occurence count of an n-gram is capped at how many times it occurs in reference sentence. Also sentence is not defined strict and can span multiple sentence and it would be treated as one sentence.

Let's write the equation for $p_n$ from above two points.

$p_n = \frac{∑_{\text{n-gram}\in snt'} Count_{\text{clip}}(n-gram)}{∑_{\text{n-gram}\in snt} Count(n-gram)}$

To put in simple terms, this equation calculates the count of all n-gram's available using the equation and sum them all together.

This equation is for a single sentence, let'x extend the equation for all sentences in corpus C.

$p_n = \frac{\sum_{\text{snt'} \in C}∑_{\text{n-gram}\in snt'} Count_{\text{clip}}(n-gram)}{\sum_{\text{snt} \in C}∑_{\text{n-gram}\in snt} Count(n-gram)}$

Since we are not looking at recall(number of n-grams recalled from reference text), all generated sequences that are short but precise have a benefir compared to sentences that are longer. Therefore precision favours short sentences, to overcome this authors've introduced *brevity penalty*:

Brevity penalty takes a minimum of 1 and exp of (1-ref_len) / (gen_len). When the number of n-gram is lower compared to refrence the exponential will become much smaller and if they are equal it will be exp(0.0) which'll be 1.

$BR = \min(1, e^{1 - \frac{\ell_{\text{ref}}}{\ell_{\text{gen}}}})$

So why not use something like F1 for recall? With this metric we'll prioritize only the translations with all words in reference text. To avoid that and evlauate all translations equally we'll persist with the combination of $p_n$ with $BR$.

*Code walkthrough of brevity penalty.*

In [23]:
l_ref, l_gen = torch.tensor(30), torch.tensor(15)

In [24]:
length_score = (torch.exp(1-l_ref/l_gen))

In [25]:
torch.min(torch.tensor(1), length_score)

tensor(0.3679)

In [26]:
torch.exp(1-l_ref)

tensor(2.5437e-13)

In [27]:
torch.exp(l_gen)

tensor(3269017.2500)

In [28]:
torch.exp(1-l_ref) / torch.exp(l_gen)

tensor(7.7811e-20)

In [29]:
- 30 / 15

-2.0

In [30]:
torch.exp(torch.tensor(-2.0))

tensor(0.1353)

*Code walkthrough of brevity penalty ends.*

The final BLEU score equation as follows.

$BLEU-N = BR \times \left( \prod_{n=1}^{N} p_n \right)^{\frac{1}{N}}
$

Finally we take the geometric mean of the modified precision up to n-gram N, simply put if N=1 BLEU-N refers to score using indiviudal words if 2 then two words and so on. Generally BLEU-4 is preferred.

BLEU has number of limitations:
* synonyms are not considered
* In the derivation, most of it seems like ad hoc or rather fragile heuristics. [Evaluating text output in NLP: BLEU risks](https://oreil.ly/nMXRh)
* Expects sentences to be tokenized, if reference and generated text use differnt tokenizer it can lead to varying results. *SacreBLEU* addresses this my internalizing tokenization and is the preferred metroc for benchmarking

Finding ways to overcome these limitations as well as better metrics is still an active area of research.

Haaaa! Enough with the theory, let's calculate some scores. How can we do this, Datasets already has an implementation of this metic sacrebleu. Let's code

In [31]:
from datasets import load_metric
bleu_metric = load_metric("sacrebleu")

  bleu_metric = load_metric("sacrebleu")


Downloading builder script:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

In [32]:
# Steps to use sacrebleu
# 1. Add single sequences using add or batches using add_batch
# Pass generated and reference text
bleu_metric.add(
  prediction="the the the the the the",
  reference=["the cat is on the mat"],
)

In [33]:
# 2. Call compute once all the sequences are
# Smoothing to avoid zero precision when n-gram sequence present in genrated is not present in reference
results = bleu_metric.compute(
  smooth_method="floor", # Adds a small constant value to numerator and denominator
  smooth_value=0,
)

In [34]:
# The results?
results

{'score': 0.0,
 'counts': [2, 0, 0, 0],
 'totals': [6, 5, 4, 3],
 'precisions': [33.333333333333336, 0.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 6,
 'ref_len': 6}

The result is dict with lots of info:

* score,
* counts(generated text), totals(refernce text) is at n-gram level
* precision for each n-gram level, 0th index is bleu-1, 1st index in bleu-2...
* bp - bravity penalty for length
* sys_len, ref_len --> lengths of words

In [35]:
# Let's beautify this and look
import pandas as pd
import numpy as np

# Rounding precision
results['precisions'] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["value"])

Unnamed: 0,value
score,0.0
counts,"[2, 0, 0, 0]"
totals,"[6, 5, 4, 3]"
precisions,"[33.33, 0.0, 0.0, 0.0]"
bp,1.0
sys_len,6
ref_len,6


In [36]:
# Let's add another text sequence and implement all the steps for sacrebleu in a go
bleu_metric.add(
  prediction="the cat is on the mat",
  reference=["the cat is on the mat"],
)
results = bleu_metric.compute(
  smooth_method="floor",
  smooth_value=0,
)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["value"])

Unnamed: 0,value
score,100.0
counts,"[6, 5, 4, 3]"
totals,"[6, 5, 4, 3]"
precisions,"[100.0, 100.0, 100.0, 100.0]"
bp,1.0
sys_len,6
ref_len,6


> **Pointers**:
* Smoothing for avoiding score from becoming zero by adding a smooth_value to numerator and denominator, when geneated n-gram is not present in refernce, it will become divide by 0 making the score 0.
* counts, total, precisions are all based on n-grams, 0th index of total(ref) / 0th index of counts(gen) gives 0th index of precision which is bleu-1 ngram score.

The BLEU score is widely used for evaluating machine translations where precise translations are favoured over translations that include all possible words.

For summarization, we want all the important information in the generated text which is opposite of BLEU and favours high recall. This is where the ROUGE score is used.

### ROUGE

ROUGE score is specifically developed for applications like summarization where recall is important than precision. The approach is similar to BLEU where we check the n-grams in generated text is present in reference or not(precision checks if the generated text is correct or not). In ROUGE, we check whether the n-grams in reference text is present in generated text or not(recall whether all bits in reference is present in generated or not).

To do this we reverse the precision metric, we count the unclipped reference n-grams in the generated text in denominator. This is ROUGE-N. where N specifies the n-gram size.

$ROUGE-N = \frac{\sum_{snt' \in C} \sum_{\text{n-gram} \in snt'} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{snt' \in C} \sum_{\text{n-gram} \in snt'} \text{Count}(\text{n-gram})}
$

This was the original proposal for rouge. ***`Subsequently fully removing precision has strong negative effects. Going back to BLEU formula without clipped counting, we can measure precision as well, and then combine these two in an harmonice mean to get an F1-score.`*** This is the metric commonly report for ROUGE nowadays(16thaug2022).

There is another way to evaluate summarization by comparing the longest common substring in reference and generated text which is done by ROUGE-L.
Example LCS for "abab", "abc" is "ab" with length of 2. If we're to calculate the score between two sequences, we've to normalzie them to avoid advantage to the longer sequence. To normalize the author of ROUGE came up with F-score-like scheme where the LCS is normalized with length of reference and generated text then they are mixed together.

Normalized ref:

$R_{\text {LCS}} = \frac{LCS(X, Y)}{m}
$

Normalized generation:
$P_{\text {LCS}} = \frac{LCS(X, Y)}{m}
n$

$\beta = \frac{R_{\text {LCS}}}{R_{\text {LCS}}}$

F-like LCS:

$F_{\text {LCS}} = \frac{(1 + \beta^2) \cdot R_{\text {LCS}} \cdot P_{\text {LCS}}}{R_{\text {LCS}} + \beta \cdot P_{\text {LCS}}}$

This way LCS score is properly normalized and compared across samples. There are two implementation of ROUGE availabel in Datasets:

* Calculate scores per sentence and average it for summaries(ROUGE-L)
* Caclualtes for entire summary (ROUGE-Lsum).

In [37]:
# Loading the metric
from datasets import load_metric
rouge_metric = load_metric("rouge")

Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [38]:
# Checking out rouge score for gpt2
reference = dataset["train"][0]["highlights"]
rouge_metric.add(prediction=summaries['gpt2'], reference=reference)
score = rouge_metric.compute()
print(f"Different scores in rouge metric: {score.keys()}")
score
print(f"Different scores in a single metric: {score['rouge1']}")

Different scores in rouge metric: dict_keys(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
Different scores in a single metric: AggregateScore(low=Score(precision=0.10416666666666667, recall=0.1282051282051282, fmeasure=0.11494252873563217), mid=Score(precision=0.10416666666666667, recall=0.1282051282051282, fmeasure=0.11494252873563217), high=Score(precision=0.10416666666666667, recall=0.1282051282051282, fmeasure=0.11494252873563217))


In [39]:
score["rouge1"].low

Score(precision=0.10416666666666667, recall=0.1282051282051282, fmeasure=0.11494252873563217)

> **Note**: The Rouge metric in the datasets library callcuate confidence intervals by default, 5th percentile(low) and 95th percentile(high), average scores is tored in mid.

In [40]:
# Calculate scores for summaries we've collected for four different models

reference = dataset["train"][0]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries:
  rouge_metric.add(prediction=summaries[model_name], reference=reference)
  score = rouge_metric.compute()
  rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
  records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
baseline,0.335484,0.248366,0.296774,0.335484
gpt2,0.114943,0.023529,0.114943,0.114943
t5,0.575342,0.450704,0.547945,0.575342
bart,0.717391,0.511111,0.652174,0.717391
pegasus,0.8,0.692308,0.8,0.8


The results won't be reliable for entire dataset, as it's been compared for a single sample. But comparing the results of a single sample, gpt-2 is poor as it's not been trained for this task. Baseline performs better than gpt-2. Bart and pegasus gives the best results out of all the four models. PEGAUS should outperform bart on the dataset.

Let's evaluate the entire model using pegasus.

## Evaluating PEGASUS on the CNN/DailyMail Dataset

Now we've a model, dataset and an metric. Let's start.

### Three sentence baseline

In [41]:
def evaluate_summaries_baseline(
  dataset,
  metric,
  column_text="article",
  column_summary="highlights"
):
  summaries = [three_sentence_summary(text) for text in dataset[column_text]]
  metric.add_batch(
      predictions=summaries,
      references=dataset[column_summary]
  )
  score = metric.compute()
  return score

From notebook 5 we're aware text generation requires lot'sn of compute due to it's iterative nature. CNN/DailyMail dataset has roughly 10,000 samples. To avoid memory crash we'll take 1000 random samples from test set to get a much more stable score estimation while comleting in less than one hour on a single GPU for the PEGASUS model

In [42]:
# Let's use the function and do this
test_sampled = dataset["test"].shuffle(seed=42).select(range(1000))
score = evaluate_summaries_baseline(test_sampled, rouge_metric)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame.from_dict(rouge_dict, orient="index", columns=["baseline"]).T

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
baseline,0.389276,0.171296,0.245061,0.354239


The scores are worse than the single sample score but still better than achieved by GPT-2. Let's do the same for PEGASUS model.

### PEGASUS evaluation on 1000 test samples

#### First a function to yield chunks based on batch size

In [43]:
def chunks(list_of_elements, batch_size):
  """
  Function to get chunks of batches
  """
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i:i+batch_size]

In [44]:
batch_chunks = chunks(test_sampled["article"], batch_size=32)

In [45]:
batch_chunks

<generator object chunks at 0x7eda5956f6f0>

In [46]:
for idx, chunk in enumerate(batch_chunks):
  print(f"Chunk: {idx}")
  print(f"Data: {chunk}")
  print(f"Length: {len(chunk)}")
  print("")
  if idx == 3:
    break

Chunk: 0
Length: 32

Chunk: 1
Length: 32

Chunk: 2
Length: 32

Chunk: 3
Data: ["Antarctica has experienced its highest temperature on record, according to meteorologists. Sensors at Argentina's Esperanza Base on the northern tip of the Antarctic Peninsula recorded a temperature of 17.5°C (63.5°F). According to meteorologists this is potentially a new record for the warmest temperature measured on the frozen continent. The new temperature record was measured on the northern tip of the Antarctic Peninsula (pictured) where the ice shelf covering the sea has declined considerably in recent years and glaciers are thought to be receding . Over a two day period last week temperatures reached 17.4°C (63.3°F) and 17.5°C (63.5°F). However, the heatwave on the coldest continent on Earth has still to be officially certified by the World Meteorological Organisation. Antarctica's icy edge is disappearing in warming ocean waters, with the last decade seeing the rate of ice loss increase dramatically.

Alright chunk function works as expected returns chunks based on batch size. Advantage --> uses yield from loading entire dataset in memory at one go.

#### Function to generate summaries using PEGASUS and evaluation

In [47]:
from tqdm import tqdm
import torch

def evaluate_summaries_pegasus(
  dataset,
  metric,
  model,
  tokenizer,
  batch_size=8,
  device=device,
  column_text="article",
  column_summary="highlights",
):
  """
  Function to get summaries for articles using pegasue and comparing it
  with reference highlights in the dataset using rouge metric
  """

  # Loading articles, highlights in batches
  article_batches = list(
    chunks(
      dataset[column_text],
      batch_size,
    )
  )
  summary_reference_batches = list(
    chunks(
      dataset[column_summary],
      batch_size,
    )
  )

  pegasus_summaries = []
  for article_batch, summary_reference_batch in tqdm(
    zip(article_batches, summary_reference_batches), total=len(article_batches)
  ):

    # Load text to tokens
    inputs = tokenizer(
        article_batch,
        max_length=1024, # Length of tokens to generate summary from
        truncation=True, # Truncate tokens after max_length
        padding="max_length", # If number of tokens is less than max_length pad them to max_length
        return_tensors="pt",
    )

    # Get generated summaries from model
    summaries_generated = model.generate(
        input_ids=inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        length_penalty=0.8, # BR penalty
        num_beams=8, # Number of beams to choose sequence
        max_length=128, # summary max length
    )

    # Decode summariy to text
    summaries_decoded = [
        tokenizer.decode(
            s,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
        for s in summaries_generated
    ]
    pegasus_summaries.append(summaries_decoded)

    summaries_decoded_post_processed = [
        d.replace("", " ") for d in summaries_decoded
    ]

    metric.add_batch(
        predictions=summaries_decoded_post_processed,
        references=summary_reference_batch,
    )

  score = metric.compute()
  return {
      "rouge_scores": score,
      "raw_pegasus_summaries": pegasus_summaries,
      "truth_articles": article_batches,
      "truth_summaries": summary_reference_batches
  }

In [None]:
# Function call
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hide output
# Clear gpu cache from previous executions
torch.cuda.empty_cache()
model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
score = evaluate_summaries_pegasus(
    test_sampled,
    rouge_metric,
    model,
    tokenizer,
    batch_size=8
    )

In [52]:
rouge_dict = dict((rn, score["rouge_scores"][rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame.from_dict(rouge_dict, orient="index", columns=["pegasus"]).T

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.01235,0.00052,0.012238,0.012336


>Note: The scores are not that good due to batch size.

These numbers are very close to the published results. The loss and per-token accuracy are decoupled to some degree from the ROUGE scores. The loss id independent of the decoding strategy, wheras the ROUGE score is strongly coupled.

ROUGE correlates much better with human judgement than loss or accuracy so we should be careful and use ROUGE scores to choose the decoding strategy. These metrics are far from perfect, however, one should always consider human judgement as well.

Now, we're done with the evaluation function. It's time to train our own model for summarization.

### Evaluation metrics summary

BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are both evaluation metrics commonly used in natural language processing and machine translation tasks, but they have different focuses and approaches.

BLEU:

BLEU is primarily used for evaluating the quality of machine translations by comparing them to one or more reference translations.
It measures the similarity between the machine-generated output and the reference translations based on n-gram overlap.
BLEU assigns a score between 0 and 1, where higher scores indicate better translation quality.
BLEU does not consider the order of words or their meaning, focusing more on the n-gram matches.
It is based on precision, where higher precision indicates more matching n-grams.
ROUGE:

ROUGE is used for evaluating summarization tasks, such as text summarization or generating summaries from documents.
It measures the quality of summaries by comparing them to one or more reference summaries.
ROUGE calculates various metrics, such as ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), ROUGE-S (skip-bigram), and more.
ROUGE assigns a score between 0 and 1, where higher scores indicate better summary quality.
ROUGE considers the order of words and their meaning, focusing on capturing the gist or important information in the summary.
In summary, BLEU is typically used for evaluating machine translations, while ROUGE is used for evaluating summarization tasks. BLEU focuses on n-gram overlap and precision, while ROUGE considers the order and meaning of words to capture the summary's important information.