# Text Summarisation

In [None]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", version="3.0.0")
print(f"Features: {dataset['train'].column_names}")

In [None]:
sample = dataset["train"][1]
print(f"Article (excerpt of 500 characters, total length: {len(sample['article'])})")
print(sample['article'][:500])
print(f"\nSummary (length: {len(sample['highlights'])})")
print(sample['highlights'])

The main difficulty is that long-form articles contain a large number of tokens. A simple but crude workaround is to truncate any text that exceeds the model's context size. Important information may sit at the end of the text and be lost, but for now we live with this limitation of the model architectures.
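
To make the truncation concrete, here is a minimal sketch. It uses whitespace-separated words as a stand-in for real subword tokens, and `truncate_to_context` and `max_tokens` are hypothetical names chosen for illustration, so the numbers are not those of any real model.

```python
def truncate_to_context(text, max_tokens):
    """Keep only the first `max_tokens` whitespace-separated tokens of `text`."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

article = ("The opening states the news. The middle adds detail. "
           "The final line has the key figure.")
# with a budget of 9 "tokens" the last sentence is dropped entirely
truncated = truncate_to_context(article, max_tokens=9)
print(truncated)
```

Anything past the budget, including the final sentence here, never reaches the model.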

## Text Summarisation Pipelines

Look at some qualitative examples from different model architectures.

In [None]:
sample_text = dataset["train"][1]["article"][:2000]
# collect generated summaries of each model in a dictionary
summaries = {}

In [None]:
# convention to separate sentences by newline; NLTK has a more sophisticated algo
# that can differentiate end of sentence from punctuation that occurs in abbreviations
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

string = "The U.S. are a country. The U.N. is an organisation"
sent_tokenize(string)

If we run out of memory, we can replace the models with smaller checkpoints, such as `gpt` or `t5-small`.

In [None]:
# simple baseline is to take first three sentences of the article
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

summaries["baseline"] = three_sentence_summary(sample_text)

In [None]:
# GPT-2: can generate summaries by appending "TL;DR" at the end of the input text
# create a text generation pipeline and load a GPT-2 checkpoint (here `gpt2`)
from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
# slice off query input and keep result in python dict for later comparison
summaries["gpt2"] = "\n".join(
    sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query):]))

**T5**: Creates a universal transformer architecture by formulating all tasks as text-to-text tasks: reconstructing masked words, supervised tasks, summarisation, and so on. It can therefore perform summarisation without fine-tuning by using the same prompts that were used during pretraining.

The input format for summarisation is `"summarize: <ARTICLE>"` and for translation it looks like `"translate English to German: <TEXT>"`. This makes T5 extremely versatile, since it can solve many tasks with a single model.
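
A minimal sketch of this text-to-text formatting is shown below. The helper `to_t5_prompt` is a hypothetical name used only for illustration; it simply prepends the task prefix described above.

```python
def to_t5_prompt(task_prefix, text):
    """Prepend a task prefix so one text-to-text model can serve many tasks."""
    return f"{task_prefix}: {text}"

summarize_prompt = to_t5_prompt("summarize", "The cat sat on the mat all day.")
translate_prompt = to_t5_prompt("translate English to German", "The house is wonderful.")

print(summarize_prompt)
print(translate_prompt)
```

Only the prefix changes between tasks; the model and the input/output types stay the same.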

In [None]:
# can load T5 directly with pipeline() function; also taking care of formatting 
# inputs in text-to-text format so no need to prepend with "summarize"
pipe = pipeline("summarization", model="t5-large")
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

In [None]:
# BART: encoder-decoder trained to reconstruct corrupted inputs
# can use facebook/bart-large-cnn checkpoint; specifically fine-tuned on CNN/DailyMail dataset
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

**PEGASUS** is an encoder-decoder transformer, like BART. Its pretraining objective is to predict masked sentences in multi-sentence texts. The closer the pretraining objective is to the downstream task, the more effective it is.

With the aim of finding a pretraining objective closer to summarisation than general language modelling, the authors automatically identified, in a large corpus, sentences containing most of the content of their surrounding paragraphs (using summarisation evaluation metrics as a heuristic for content overlap) and pretrained the PEGASUS model to reconstruct these sentences. This yielded a state-of-the-art model for text summarisation.
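
The sentence-selection idea can be sketched on a toy document. This is only an illustration under stated assumptions: it uses a unigram-overlap F1 on lowercased whitespace tokens as the content-overlap heuristic, and the function names and the `<mask>` placeholder are hypothetical, not the actual PEGASUS implementation.

```python
from collections import Counter

def unigram_f1(candidate_tokens, reference_tokens):
    """F1 of the unigram overlap between two token lists."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def select_gap_sentence(sentences):
    """Index of the sentence that overlaps most with the rest of the text."""
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, tokens in enumerate(tokenized):
        rest = [t for j, other in enumerate(tokenized) if j != i for t in other]
        scores.append(unigram_f1(tokens, rest))
    return max(range(len(sentences)), key=scores.__getitem__)

sentences = [
    "the storm closed the airport and flooded the runway",
    "heavy rain flooded the runway",
    "the airport will reopen tomorrow",
    "travellers were offered vouchers",
]
masked_index = select_gap_sentence(sentences)
# the selected sentence becomes the target; the input sees a placeholder instead
masked_input = [s if i != masked_index else "<mask>" for i, s in enumerate(sentences)]
target = sentences[masked_index]
```

The model would then be trained to generate `target` from `masked_input`, which resembles writing a summary of the remaining text.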

In [None]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")

### Compare Structures

Bear in mind: one model has not been trained on the dataset at all (GPT-2), one has been fine-tuned on this task among others (T5), and two have been fine-tuned exclusively on this task (BART and PEGASUS). Now look at the summaries:

In [None]:
print("GROUND TRUTH")
print(dataset["train"][1]["highlights"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")

Things to note:
- GPT-2 is different from the others; it summarises the characters rather than the article. GPT-2 often "hallucinates" or invents facts, as it was not explicitly trained to generate truthful summaries
- We see overlap between all the models, with PEGASUS's output bearing the most striking resemblance to the ground truth

We need a more systematic way to measure results: a metric that we can compute for all models on some benchmark dataset. This is not so easy, because for each "gold standard" summary written by humans there could be countless variations that are equally acceptable. Next we look at some common metrics for measuring the quality of generated text.

## Measuring the Quality of Generated Text

If our metrics are bad, we may be blind to model degradation. If they are misaligned with business goals, we may not create any value. Some options:
- **BLEU**: Looks at words or n-grams for matches. It is precision-based: we count the number of words in the generated text that occur in the reference and divide by the length of the generated text.
    - There is a problem: if the same word is generated over and over again, up to the length of the reference, we would get perfect precision
    - So there is a modification: a word is only counted as many times as it occurs in the reference, which gives a more reasonable value
    - We can also use n-grams, with the count in the numerator clipped: the occurrence count of an n-gram is capped at how many times it appears in the reference sentence. Note that the definition of sentence is not strict; if we had generated multiple sentences, we would treat them as one sentence.
    - We sum over all samples in corpus C
    - Because precision favours shorter generated texts over longer ones, there is also a brevity penalty, which exponentially penalises generated text that is shorter than the reference text
    - Precision is preferred because there may be multiple correct answers in translation; we want high precision and a generated text of similar length to the reference text
    - Issue: it doesn't take synonyms into account, and the derivation steps seem like ad hoc, fragile heuristics
    - It expects text to already be tokenised, so results can differ if a different tokenisation method is used. The SacreBLEU metric addresses this by internalising the tokenisation step, so it is preferred for benchmarking
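
The effect of clipping can be checked with a short sketch. It assumes plain whitespace tokenisation and a single reference, and `ngram_precision` is a hypothetical helper written for illustration, not the SacreBLEU implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(prediction, reference, n=1, clip=True):
    """Share of predicted n-grams that are found in the reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    if clip:
        # an n-gram is credited at most as often as it occurs in the reference
        matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    else:
        matched = sum(count for gram, count in pred_counts.items() if gram in ref_counts)
    return matched / total

reference_text = "the cat is on the mat"
repetitive = "the the the the the the"
vanilla = ngram_precision(repetitive, reference_text, clip=False)  # 6/6
clipped = ngram_precision(repetitive, reference_text, clip=True)   # 2/6
```

Without clipping the degenerate prediction gets a perfect score; with clipping it is credited only for the two occurrences of "the" in the reference.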

In [None]:
!pip install sacrebleu

In [None]:
from datasets import load_metric
bleu_metric = load_metric("sacrebleu")

`bleu_metric` is an instance of the `Metric` class and works like an aggregator. We can add single instances with `add()` or whole batches with `add_batch()`, then call `compute()` to calculate the metric.

In [None]:
import pandas as pd
import numpy as np

bleu_metric.add(
    prediction="the the the the the the", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

- The BLEU score also works with multiple reference translations, which is why the reference is passed as a list
    - It also has methods to modify the precision calculation, such as adding a constant to the numerator so that a missing n-gram does not send the score to zero. We set `smooth_value=0` for the purpose of explaining the values.

The 1-gram precision is 2/6 and the 2-, 3- and 4-gram precisions are all 0. The geometric mean is therefore 0, and so is the BLEU score. `bp` is the brevity penalty.

In [None]:
# prediction where almost correct
bleu_metric.add(
    prediction="the cat is on mat", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

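Precision scores are much better. The 1-grams in the prediction all match. For the 4-gram there are only two candidates, and the last one ("cat is on mat") does not match, hence the 4-gram precision of 0.5.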
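
As a rough check of how the pieces combine, the sketch below multiplies the geometric mean of the four n-gram precisions by the brevity penalty. The precisions are counted by hand for the two toy predictions, `bleu_from_precisions` is a hypothetical helper, and no smoothing is applied, so treat it as an illustration of the formula rather than a replacement for SacreBLEU.

```python
import math

def bleu_from_precisions(precisions, prediction_len, reference_len):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    if min(precisions) == 0:
        return 0.0  # a single zero precision makes the geometric mean zero
    if prediction_len >= reference_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_len / prediction_len)
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geometric_mean

# "the cat is on mat" (5 tokens) against "the cat is on the mat" (6 tokens)
almost_correct = bleu_from_precisions([5 / 5, 3 / 4, 2 / 3, 1 / 2],
                                      prediction_len=5, reference_len=6)
# "the the the the the the" against the same reference
repetitive_score = bleu_from_precisions([2 / 6, 0, 0, 0],
                                        prediction_len=6, reference_len=6)
print(round(almost_correct, 4), repetitive_score)
```

The first value is roughly 0.58 on a 0-1 scale; the prediction is penalised both for the missing 4-gram and for being one token shorter than the reference.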

BLEU is widely used for evaluating text, especially in machine translation, where precise translations are usually favoured over translations that include all possible and appropriate words.

Other applications, such as summarisation, are different: there we want all the important information to appear in the generated text, so we favour high recall. Here, we would use the ROUGE score.

- **ROUGE**: Specifically developed for applications where high recall is more important than precision alone. It is similar to BLEU in that it compares the generated and reference texts. The difference is that we check how many n-grams in the reference text also occur in the generated text, rather than the other way around
    - That was the original proposal for ROUGE. Later research found that fully removing precision can have strong negative effects. The BLEU formulas without clipped counting give us precision as well, so we can combine the precision and recall ROUGE scores in a harmonic mean to get an F1-score. This is the metric commonly reported for ROUGE.
    - There is a separate ROUGE score that measures the longest common subsequence (LCS), called ROUGE-L. We normalise by the length of the texts, otherwise the longer text would be at an advantage.
    - There are two variations: one calculates the score per sentence and averages it over the summary (ROUGE-L), and the other calculates it directly over the whole summary (ROUGE-Lsum)
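
The recall-oriented counting and the LCS can be sketched in a few lines. This assumes whitespace tokenisation and no stemming, and `rouge_n` and `lcs_length` are hypothetical helpers for illustration; the `rouge_score` package used in this notebook does more preprocessing.

```python
from collections import Counter

def rouge_n(prediction, reference, n=1):
    """Precision, recall and F1 of the n-gram overlap between two texts."""
    def ngram_counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = ngram_counts(prediction), ngram_counts(reference)
    overlap = sum((pred & ref).values())
    recall = overlap / max(sum(ref.values()), 1)       # share of reference n-grams found
    precision = overlap / max(sum(pred.values()), 1)   # share of predicted n-grams found
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": f1}

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

scores = rouge_n("the cat is on mat", "the cat is on the mat")
lcs = lcs_length("the cat is on mat".split(), "the cat is on the mat".split())
```

For this pair the precision is perfect but the recall is 5/6, because one "the" in the reference is missing from the prediction.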

In [None]:
!pip install rouge_score

In [None]:
rogue_metric = load_metric("rouge")

In [None]:
reference = dataset["train"][1]["highlights"]
records = []
rogue_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model in summaries:
    rogue_metric.add(prediction=summaries[model], reference=reference)
    score = rogue_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rogue_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())

By default, ROUGE calculates confidence intervals (5th and 95th percentiles). The average score is stored in the attribute `mid`, and the interval can be retrieved with `low` and `high`. The results are unreliable as we only looked at a single sample, but we can compare the quality of the summaries for this one example.
- Confirms that GPT-2 performs worst, which is not surprising as it was not explicitly trained to summarise
- It is striking that the simple first-three-sentences baseline doesn't perform too badly compared to transformers with on the order of a billion parameters
- PEGASUS and BART are the best models (their ROUGE scores are higher overall)
- T5 is slightly better on ROUGE-1 and the LCS scores.

Treat these results with caution, as they come from only one example. Looking at the PEGASUS paper, we would expect PEGASUS to outperform T5 on the CNN/DailyMail dataset.

Let's try to reproduce those results with PEGASUS.

## Evaluating PEGASUS on CNN/DailyMail Dataset

In [None]:
def evaluate_summaries_baseline(
    dataset, metric, column_text="article", column_summary="highlights"):
    summaries = [three_sentence_summary(text) for text in dataset[column_text]]
    metric.add_batch(predictions=summaries, references=dataset[column_summary])
    score = metric.compute()
    return score

We apply this to a subset of the data, because generating summaries for all articles would take a lot of time. Every generated token requires a forward pass through the model, so generating 100 tokens for each of 10,000 articles would require 1 million forward passes. With beam search, that number is multiplied by the number of beams.
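
The estimate can be written out as a back-of-the-envelope calculation. The function name is hypothetical, and the model is deliberately crude: it follows the counting in the text and ignores batching and early stopping.

```python
def forward_passes(num_articles, tokens_per_summary, num_beams=1):
    """Rough count of forward passes: one per generated token, per beam."""
    return num_articles * tokens_per_summary * num_beams

greedy_full = forward_passes(10_000, 100)              # 1,000,000
beam_full = forward_passes(10_000, 100, num_beams=8)   # 8,000,000
beam_subset = forward_passes(100, 100, num_beams=8)    # 80,000
print(greedy_full, beam_full, beam_subset)
```

Restricting the evaluation to 100 samples cuts the work by a factor of 100 under the same assumptions.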

So we just run the evaluation on 100 samples.

In [None]:
test_sampled = dataset["test"].shuffle(seed=42).select(range(100))

score = evaluate_summaries_baseline(test_sampled, rogue_metric)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rogue_names)
pd.DataFrame.from_dict(rouge_dict, orient="index", columns=["baseline"]).T

In [6]:
from tqdm import tqdm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(list_of_elements, batch_size):
    """Yield successive batch-size chunks from list_of_elements"""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i: i+batch_size]
        
def evaluate_summaries_pegasus(dataset, metric, model, tokenizer,
                              batch_size=16, device=device, 
                              column_text="article", 
                              column_summary="highlights"):
    # split dataset into smaller batches that we can process in parallel
    article_batches = list(chunks(dataset[column_text], batch_size))
    target_batches = list(chunks(dataset[column_summary], batch_size))
    
    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):
        
        # tokenise input articles
        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                          padding="max_length", return_tensors="pt")
        
        # feed to generate to produce summaries using beam search
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                  attention_mask=inputs["attention_mask"].to(device),
                                  length_penalty=0.8, num_beams=8, max_length=128)
        
        # replace <n> and add decoded texts with reference to the metric
        decoded_summaries = [tokenizer.decode(
            s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in summaries]
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        metric.add_batch(predictions=decoded_summaries, references=target_batch)
    
    # compute metric and return ROUGE scores
    score = metric.compute()
    return score
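
To see how the batching in `evaluate_summaries_pegasus` pairs inputs with targets, here is a small standalone sketch. It repeats `chunks` so that it runs by itself, and the toy `articles` and `targets` lists are made up for illustration.

```python
def chunks(list_of_elements, batch_size):
    """Yield successive batch-size chunks from list_of_elements"""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i: i + batch_size]

articles = [f"article {i}" for i in range(5)]
targets = [f"summary {i}" for i in range(5)]

# chunking both lists with the same batch size keeps articles and targets aligned
batches = list(zip(chunks(articles, 2), chunks(targets, 2)))
for article_batch, target_batch in batches:
    print(article_batch, target_batch)
```

The last batch is smaller when the dataset size is not a multiple of the batch size, which is fine for evaluation.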

In [9]:
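from datasets import load_metric
rouge_metric = load_metric("rouge")
# names of the ROUGE variants used to build the score tables below
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]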

In [11]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
score = evaluate_summaries_pegasus(test_sampled, rouge_metric,
                                  model, tokenizer, batch_size=8)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=["pegasus"])

Close to the published results! One thing to note is that the loss and per-token accuracy are decoupled to some degree from the ROUGE scores. The loss is independent of the decoding strategy, whereas the ROUGE score is strongly coupled to it.

Since ROUGE and BLEU correlate better with human judgement than loss or accuracy, it is better to focus on them and to choose the decoding strategy carefully when building text generation models. These metrics are far from perfect, though, and one should also consider human judgements.

Okay! Time to train our own model for summarisation now.

## Training a Summarisation Model

We can restart the notebook to clear memory and start from here (re-running the cells that define `evaluate_summaries_pegasus`, `rouge_metric`, `rouge_names`, the model and the tokenizer). Here we will train a custom text summarisation model using the SAMSum dataset developed by Samsung, which consists of dialogues along with brief summaries.

In an enterprise setting, these dialogues may represent the interactions between a customer and the support centre, so generating accurate summaries can help improve customer service and detect common patterns among customer requests.

In [2]:
!pip install py7zr

In [3]:
from datasets import load_dataset

dataset_samsum = load_dataset("samsum")
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")
print(dataset_samsum["test"][0]["dialogue"])
print("\nSummary:")
print(dataset_samsum["test"][0]["summary"])

### Evaluating PEGASUS on SAMSUM

In [5]:
from transformers import pipeline, set_seed

pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")

pipe_out = pipe(dataset_samsum["test"][0]["dialogue"])
print("Summary:")
print(pipe_out[0]["summary_text"].replace(" .<n>", ".\n"))

In [None]:
score = evaluate_summaries_pegasus(dataset_samsum["test"], rouge_metric, model,
                                  tokenizer, column_text="dialogue", 
                                  column_summary="summary", batch_size=8)

rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=["pegasus"])

With this, we can:
- directly measure the success of training with the metric, since we now have a good baseline
- expect fine-tuning on the new dataset to give an immediate improvement in the ROUGE metric; if it does not, we'll know that something is wrong with our training loop.

### Fine-Tune PEGASUS

In [None]:
# have a quick look at length distribution of input and outputs
import matplotlib.pyplot as plt

d_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["dialogue"]]
s_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["summary"]]

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)

axes[0].hist(d_len, bins=20, color="C0", edgecolor="C0")
axes[0].set_title("Dialogue Token Length")
axes[0].set_xlabel("Length")
axes[0].set_ylabel("Count")

axes[1].hist(s_len, bins=20, color="C0", edgecolor="C0")
axes[1].set_title("Summary Token Length")
axes[1].set_xlabel("Length")
plt.tight_layout()
plt.show()

Most dialogues are much shorter than the CNN/DailyMail articles, with around 100-200 tokens per dialogue. Similarly, the summaries are much shorter, with around 20-40 tokens (the average length of a tweet).

Keep these observations in mind as we build the data collator for the trainer.

First, we need to tokenise the dataset; we set the maximum lengths to 1024 and 128 for the dialogues and summaries, respectively.

In [None]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(
        example_batch["dialogue"], max_length=1024, truncation=True)
    
    # differentiate to let tokenizer know it is processing sequences for decoder
    # and can handle accordingly
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(
            example_batch["summary"], max_length=128, truncation=True)
        
    return {
        "input_ids": input_encodings["input_ids"],
        "attention_mask": input_encodings["attention_mask"],
        "labels": target_encodings["input_ids"]
    }

dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True)

columns = ["input_ids", "labels", "attention_mask"]
dataset_samsum_pt.set_format(type="torch", columns=columns)

Now create a data collator. This is called in the Trainer just before a batch is fed through the model. In most cases we can use the default collator, which collects all the tensors from the batch and simply stacks them.

For the summarisation task, we need to not only stack the inputs but also prepare the targets on the decoder side. PEGASUS is an encoder-decoder transformer and thus has the classic seq2seq architecture. A common approach is to apply "teacher forcing" in the decoder. With this strategy, the decoder receives input tokens that consist of the labels shifted by one, in addition to the encoder output; so when making the prediction for the next token, the decoder gets the ground truth shifted by one as an input.

We shift by one so that the decoder only sees the previous ground truth labels and not the current or future ones. Shifting alone suffices, since the decoder has masked self-attention that masks all inputs at the present and in the future. We also make sure that the padding tokens in the labels are ignored by the loss function by setting them to -100. There is no need to do this manually, since `DataCollatorForSeq2Seq` comes to the rescue and takes care of these steps for us.
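
The two preparation steps can be sketched on a single toy sequence. The token IDs below (`PAD_ID`, `EOS_ID`, `START_ID`) and the helper names are hypothetical values chosen for illustration; the real collator works on batched tensors and takes the IDs from the model and tokenizer.

```python
PAD_ID, EOS_ID, START_ID = 0, 1, 2   # made-up special token IDs

def shift_right(labels, decoder_start_token_id):
    """Decoder inputs: the labels shifted one position to the right."""
    return [decoder_start_token_id] + labels[:-1]

def mask_padding(labels, pad_token_id, ignore_index=-100):
    """Replace padding in the labels so the loss function skips those positions."""
    return [ignore_index if token == pad_token_id else token for token in labels]

labels = [11, 12, 13, EOS_ID, PAD_ID, PAD_ID]        # padded target sequence
decoder_input_ids = shift_right(labels, START_ID)
loss_labels = mask_padding(labels, PAD_ID)

for step, (seen, target) in enumerate(zip(decoder_input_ids, loss_labels), start=1):
    print(f"step {step}: decoder sees {seen:>2}, should predict {target}")
```

At every step the decoder input is the previous ground-truth token, and the two padded positions carry the label -100, so they do not contribute to the loss.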

In [None]:
from transformers import DataCollatorForSeq2Seq, TrainingArguments

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# set up TrainingArguments as usual for training
training_args = TrainingArguments(
 output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
 per_device_train_batch_size=1, per_device_eval_batch_size=1,
 weight_decay=0.01, logging_steps=10, push_to_hub=True,
 evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
 gradient_accumulation_steps=16)

We set the batch size to 1 since the model is big, but a small batch size can hurt convergence. So we use a technique called *gradient accumulation*: instead of calculating the gradients of the full batch at once, we process smaller batches and aggregate their gradients. Once we have aggregated enough gradients (here over 16 steps), we run the optimisation step. This is a bit slower than doing it in one pass but saves a lot of GPU memory.
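
The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below is an assumption-laden illustration in plain Python: a one-parameter model `y = weight * x` with a mean squared error loss, made-up data, and a hypothetical learning rate. It is not what `Trainer` does internally, only the same idea in miniature.

```python
def grad_mse(weight, batch):
    """Gradient of the mean squared error of the model y = weight * x over a batch."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 3.0 * x) for x in range(1, 17)]   # 16 toy samples
weight = 0.5
learning_rate = 0.001
accumulation_steps = 16

# reference: gradient of the full batch of 16 computed at once
full_batch_grad = grad_mse(weight, data)

# accumulation: 16 micro-batches of size 1, one optimiser step at the end
accumulated = 0.0
for step, sample in enumerate(data, start=1):
    # scale each micro-batch gradient so the sum equals the full-batch mean
    accumulated += grad_mse(weight, [sample]) / accumulation_steps
    if step % accumulation_steps == 0:
        weight_after = weight - learning_rate * accumulated
        accumulated_at_step = accumulated
        accumulated = 0.0   # reset for the next effective batch
```

Only one sample's worth of activations is needed at a time, yet the update matches that of an effective batch size of 1 x 16 = 16.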

In [None]:
# log in to the Hugging Face Hub so we can push the model after training
from huggingface_hub import notebook_login

notebook_login()

In [None]:
# initialise trainer with model, tokenizer, train args and data collator
from transformers import Trainer

trainer = Trainer(model=model, args=training_args,
 tokenizer=tokenizer, data_collator=seq2seq_data_collator,
 train_dataset=dataset_samsum_pt["train"],
 eval_dataset=dataset_samsum_pt["validation"])

Note: we can also evaluate generations as part of the training loop by using an extension of `TrainingArguments` called `Seq2SeqTrainingArguments` and specifying `predict_with_generate=True`. We pass it to the dedicated trainer called `Seq2SeqTrainer`, which uses the `generate()` function instead of the model's forward pass to create predictions.

In [None]:
# can directly run evaluation fn on test set after train to see how well model performs
trainer.train()
score = evaluate_summaries_pegasus(
 dataset_samsum["test"], rouge_metric, trainer.model, tokenizer,
 batch_size=2, column_text="dialogue", column_summary="summary")

rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=["pegasus"])

ROUGE scores improve considerably over the model without fine-tuning!

In [None]:
# Push our model to the hub
trainer.push_to_hub("Training complete!")

#### Generating Dialogue Summaries

In [None]:
gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}
sample_text = dataset_samsum["test"][0]["dialogue"]
reference = dataset_samsum["test"][0]["summary"]
pipe = pipeline("summarization", model="transformersbook/pegasus-samsum")

print("Dialogue:")
print(sample_text)
print("\nReference Summary:")
print(reference)
print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

It seems the model has learned to synthesise the dialogue into a summary without just extracting passages. Now, how well does the model work on a custom input?

In [None]:
custom_dialogue = """\
Thom: Hi guys, have you heard of transformers?
Lewis: Yes, I used them recently!
Leandro: Indeed, there is a great library by Hugging Face.
Thom: I know, I helped build it ;)
Lewis: Cool, maybe we should write a book about it. What do you think?
Leandro: Great idea, how hard can it be?!
Thom: I am in!
Lewis: Awesome, let's do it together!
"""
print(pipe(custom_dialogue, **gen_kwargs)[0]["summary_text"])

It makes sense and summarises well! 

## Conclusion

Text summarisation poses unique challenges, as conventional metrics like accuracy do not reflect the quality of generated text. Metrics such as BLEU or ROUGE evaluate generated texts better; however, human judgement remains the best measure.

A common question is how to summarise documents whose text is longer than the model's context length. This is still an open research question, and there is no single established strategy at the moment. Recent work by OpenAI scales summarisation by applying it recursively to long documents and using human feedback in the loop.
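
The recursive idea alone can be sketched as follows. This is not OpenAI's method: the human-feedback part is left out, `summarize_recursively` is a hypothetical helper, word counts stand in for token counts, and the stand-in summariser simply keeps the first five words of its input so that the example runs without a model.

```python
def summarize_recursively(text, summarize, max_words, chunk_words):
    """Summarise chunk by chunk, then summarise the joined summaries, until it fits."""
    words = text.split()
    if len(words) <= max_words:
        return summarize(text)
    pieces = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partial = " ".join(summarize(piece) for piece in pieces)
    if len(partial.split()) >= len(words):
        # guard: stop recursing if the summariser does not shorten the text
        return summarize(partial)
    return summarize_recursively(partial, summarize, max_words, chunk_words)

def first_five_words(text):
    """Stand-in summariser used only to make the sketch runnable."""
    return " ".join(text.split()[:5])

long_document = " ".join(f"w{i}" for i in range(200))   # 200 toy "words"
result = summarize_recursively(long_document, first_five_words,
                               max_words=20, chunk_words=20)
print(result)
```

In practice `summarize` would be a call to a summarisation model and the limits would be set by its context length; information can still be lost at each level of the recursion.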

The next chapter covers question answering, which provides an answer to a question based on a text passage. There, good strategies exist for dealing with long or many documents, and we'll learn to scale question answering to thousands of documents!