# Introduction

Text summarization is a complex task for recurrent neural networks, particularly in neural language modeling. Despite its complexity, text summarization offers domain experts the prospect of significantly increased productivity, and it is used at enterprise scale today to condense common domain knowledge, summarize complex text corpora such as contracts, and automatically generate content for social media, advertising, and more. In this project, the author expands on the recurrent neural network framework using encoder-decoder transformers from scratch to condense dialogues between several people into a crisp summary. The applications of this exercise are numerous; it could be especially beneficial for summarizing long meeting transcripts.

Let's first look at the dataset we will use for training: the SAMSum dialogue dataset. We will then go into the scoring metric and demonstrate how we train the model. Lastly, we will showcase our model's inference and discuss opportunities for future work and use cases.

### Dataset
The SAMSum dataset, developed by Samsung, consists of a collection of chat-style dialogues, each paired with a brief summary.

In [1]:
from datasets import load_dataset

dataset_samsum = load_dataset("samsum")
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]
sample_text = dataset_samsum["train"][1]["dialogue"][:2000]
print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")
print(dataset_samsum["test"][0]["dialogue"])
print("\nSummary:")
print(dataset_samsum["test"][0]["summary"])


Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


### Pretraining Objectives



### ROUGE Metric

The ROUGE score was specifically developed for applications like summarization, where high recall is more important than precision alone. The approach is very similar to the BLEU score in that we look at different n-grams and compare their occurrences in the generated text and the reference texts. The difference is that with ROUGE we check how many n-grams in the reference text also occur in the generated text. For BLEU we look at how many n-grams in the generated text appear in the reference, so we can reuse the precision formula with the minor modification that we count the (unclipped) occurrence of reference n-grams in the generated text in the numerator:

$$\text{ROUGE-N} = \frac{\sum_{snt' \in C}\sum_{n\text{-gram} \in snt'} Count_{match}(n\text{-gram})}{\sum_{snt' \in C}\sum_{n\text{-gram} \in snt'} Count(n\text{-gram})}$$

This was the original proposal for ROUGE. Subsequently, researchers have found that fully removing precision can have strong negative effects. Going back to the BLEU formula without the clipped counting, we can measure precision as well, and we can then combine the precision and recall ROUGE scores in the harmonic mean to get an $F_1$-score. This score is the metric that is nowadays commonly reported for ROUGE.
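
To make the recall, precision, and $F_1$ variants concrete, here is a minimal pure-Python sketch. It is not the `rouge_score` implementation: the whitespace tokenization, the function names `ngram_counts` and `rouge_n`, and the use of a clipped multiset intersection for the match count are all assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Illustrative ROUGE-N on whitespace tokens (assumed tokenization)."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Matching n-grams: multiset intersection of candidate and reference counts.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)      # share of reference n-grams found
    precision = overlap / max(sum(cand.values()), 1)  # share of generated n-grams that match
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(generated, reference, n=1))
print(rouge_n(generated, reference, n=2))
```

In this toy pair, five of the six reference unigrams and three of the five reference bigrams appear in the generated text, so the recall drops as `n` grows.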

There is a separate score in ROUGE to measure the __longest common subsequence__ (LCS), called ROUGE-L. The LCS can be calculated for any pair of strings. For example, the LCS for "abab" and "abc" would be "ab", and its length would be 2. If we want to compare this value between two samples we need to normalize it somehow, because otherwise a longer text would be at an advantage. To achieve this, the inventor of ROUGE came up with an F-score-like scheme where the LCS is normalized with the length of the reference and of the generated text, and the two normalized scores are then mixed together:

$$R_{LCS} = \frac{LCS(X,Y)}{m}$$

$$P_{LCS} = \frac{LCS(X,Y)}{n}$$

$$F_{LCS} = \frac{(1 + \beta^2) R_{LCS}P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}}, \quad \text{where } \beta = P_{LCS}/R_{LCS}$$

Here $m$ and $n$ are the lengths of the reference $X$ and the generated text $Y$, respectively. That way the LCS score is properly normalized and can be compared across samples. In the 🤗 Datasets implementation, two variations of ROUGE are calculated: one calculates the score per sentence and averages it for the summaries (ROUGE-L), and the other calculates it directly over the whole summary (ROUGE-Lsum).
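
The LCS normalization can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's code: the names `lcs_length` and `rouge_l` are made up here, the sequences are compared character by character to match the "abab"/"abc" example, and `beta=1.0` (equal weight on recall and precision) is an assumed default. The F-measure uses $\beta^2$ in the denominator, as in the standard F-measure definition.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """LCS-based recall, precision, and F-score for one candidate/reference pair."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return {"recall": 0.0, "precision": 0.0, "f": 0.0}
    recall = lcs / len(reference)     # normalize by the reference length m
    precision = lcs / len(candidate)  # normalize by the generated length n
    f = (1 + beta**2) * recall * precision / (recall + beta**2 * precision)
    return {"recall": recall, "precision": precision, "f": f}

print(lcs_length("abab", "abc"))
print(rouge_l(candidate="abab", reference="abc"))
```

With the reference "abc" and the candidate "abab", the recall is 2/3 and the precision is 2/4, so the longer candidate gains nothing from its extra characters.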

We can load the metric as follows:



In [2]:
%pip install rouge_score
from datasets import load_metric

rouge_metric = load_metric("rouge")





### PEGASUS

PEGASUS is an encoder-decoder transformer. As shown in the figure below, its pretraining objective is to predict masked sentences in multisentence texts. The PEGASUS authors argue that the closer the pretraining objective is to the downstream task, the more effective it is. With the aim of finding a pretraining objective that is closer to summarization than general language modeling, they automatically identified, in a very large corpus, sentences containing most of the content of their surrounding paragraphs (using summarization evaluation metrics as a heuristic for content overlap) and pretrained the PEGASUS model to reconstruct these sentences, thereby obtaining a state-of-the-art model for text summarization.

![PegasusArchitecture](PegasusArchitecture.png)

This model has a special token for newlines, which is why we don't need the sent_tokenize() function:

In [3]:
#We'll collect the generated summaries of each model in a dictionary 
summaries = {}

In [4]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)




['The U.S. are a country.', 'The U.N. is an organization.']

In [5]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

def evaluate_summaries_baseline(dataset, metric, column_text="article",column_summary="highlights"):
    summaries = [three_sentence_summary(text) for text in dataset[column_text]]
    metric.add_batch(predictions=summaries,references=dataset[column_summary])
    score = metric.compute()
    return score
    
    

In [6]:
from transformers import pipeline, set_seed
pipe = pipeline("summarization",model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>",".\n")

Your max_length is set to 128, but you input_length is only 26. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


## Training a Summarization Model

For our application, we'll use the SAMSum dataset, developed by Samsung, which consists of a collection of dialogues along with brief summaries. In an enterprise setting, these dialogues might represent the interactions between a customer and the support center, or a transcript of individuals taking part in a meeting, so generating accurate summaries can help improve customer service, cut down on note taking, and detect common patterns among customer requests or meeting themes. We loaded the dataset and looked at an example in the Dataset section above.

The dialogues look like what you would expect from a chat via SMS or WhatsApp, including emojis and placeholders for GIFs. The `dialogue` field contains the full text and the `summary` field contains the summarized dialogue. Could a model that was fine-tuned on the CNN/DailyMail dataset deal with that? Let's find out!


### Evaluating PEGASUS on SAMSum

First we'll run the same summarization pipeline with PEGASUS to see what the output looks like.

In [7]:
pipe_out = pipe(dataset_samsum["test"][0]["dialogue"])
print("Summary: ")
print(pipe_out[0]["summary_text"].replace(" .<n>",".\n"))

Your max_length is set to 128, but you input_length is only 122. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Summary: 
Amanda: Ask Larry Amanda: He called her last time we were at the park together.
Hannah: I'd rather you texted him.
Amanda: Just text him .


We can see that the model mostly tries to summarize by extracting the key sentences from the dialogue. This probably worked relatively well on the CNN/DailyMail dataset, but the summaries in SAMSum are more abstract. Let's confirm this by running the full ROUGE evaluation on the test set:

In [8]:
from tqdm import tqdm
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements),batch_size):
        yield list_of_elements[i : i + batch_size]

def evaluate_summaries_pegasus(dataset,
                                metric,
                                model,
                                tokenizer,
                                batch_size=16,
                                device=device,
                                column_text = "article",
                                column_summary = "highlights"):
    article_batches = list(chunks(dataset[column_text],batch_size))
    target_batches = list(chunks(dataset[column_summary],batch_size))
    
    for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):
        inputs = tokenizer(article_batch, max_length=1024, truncation=True, padding="max_length",return_tensors="pt")
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   length_penalty=0.8, num_beams=8, max_length=128)
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,clean_up_tokenization_spaces=True) for s in summaries]
        decoded_summaries = [d.replace("<n>"," ") for d in decoded_summaries]
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    score = metric.compute()

    return score
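
The batching helper can be checked in isolation before running the expensive evaluation. The sketch below repeats `chunks` so that it is self-contained; the toy list and the variable names are illustrative only, and 819 is one of the split lengths printed earlier.

```python
import math

def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

# The last chunk is simply shorter when the length is not a multiple of batch_size.
toy_batches = list(chunks(list(range(5)), 2))
print(toy_batches)

# Number of generation batches for a split of 819 examples with batch_size=8.
num_examples, batch_size = 819, 8
num_batches = len(list(chunks(list(range(num_examples)), batch_size)))
print(num_batches, math.ceil(num_examples / batch_size))
```

Each of those batches triggers one beam-search `generate` call, which is why the batch count is a useful rough estimate of the evaluation time.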

In [9]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)



In [10]:
rouge_names = ["rouge1","rouge2","rougeL","rougeLsum"]


In [12]:
import pandas as pd
import numpy as np

score = evaluate_summaries_pegasus(dataset_samsum["test"],
                                    rouge_metric, 
                                    model, 
                                    tokenizer, 
                                    column_text="dialogue",
                                    column_summary="summary",
                                    batch_size=8)
rouge_dict = dict((rn,score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=["pegasus"])


The results aren't great, but this is not unexpected since we've moved quite a bit away from the CNN/DailyMail data distribution. Nevertheless, setting up the evaluation pipeline before training has two advantages: we can directly measure the success of training with the metric, and we have a good baseline. Fine-tuning the model on our dataset should result in an immediate improvement in the ROUGE metric, and if that is not the case we'll know something is wrong with our training loop.

### Fine-Tuning PEGASUS

Before we process the data for training, let's have a quick look at the length distribution of the input and outputs:

In [None]:
import matplotlib.pyplot as plt
d_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["dialogue"]]
s_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["summary"]]

fig,axes = plt.subplots(1,2,figsize=(10, 3.5), sharey=True)
axes[0].hist(d_len, bins=20, color="C0",edgecolor = "C0")
axes[0].set_title("Dialogue Token Length")
axes[0].set_xlabel("Length")
axes[0].set_ylabel("Count")
axes[1].hist(s_len,bins=20,color="C0",edgecolor="C0")
axes[1].set_title("Summary Token Length")
axes[1].set_xlabel("Length")
plt.tight_layout()
plt.show()

We see that most dialogues are much shorter than the CNN/DailyMail articles, with 100-200 tokens per dialogue. Similarly, the summaries are much shorter, with around 20-40 tokens (the average length of a tweet).

Let's keep those observations in mind as we build the data collator for the Trainer. First we need to tokenize the dataset. For now, we'll set the maximum lengths to 1024 and 128 for the dialogues and summaries, respectively:

In [None]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch["dialogue"],max_length=1024, truncation=True)
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch["summary"], max_length=128,truncation=True)
    return {"input_ids": input_encodings["input_ids"],
            "attention_mask": input_encodings["attention_mask"],
            "labels": target_encodings["input_ids"]}

dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True)

columns = ["input_ids","labels","attention_mask"]
dataset_samsum_pt.set_format(type="torch",columns=columns)

One new element in the tokenization step is the `tokenizer.as_target_tokenizer()` context. Some models require special tokens in the decoder inputs, so it's important to differentiate between the tokenization of encoder and decoder inputs. In the `with` statement (called a _context manager_), the tokenizer knows that it is tokenizing for the decoder and can process sequences accordingly.

Now we need to create the data collator. This function is called in the `Trainer` just before the batch is fed through the model. In most cases we can use the default collator, which collects all the tensors from the batch and simply stacks them. For the summarization task we need to not only stack the inputs but also prepare the targets on the decoder side. PEGASUS is an encoder-decoder transformer and thus has the classic seq2seq architecture. In a seq2seq setup, a common approach is to apply "teacher forcing" in the decoder. With this strategy, the decoder receives input tokens (as in decoder-only models such as GPT-2) that consist of the labels shifted by one, in addition to the encoder output; so, when making the prediction for the next token, the decoder gets the ground truth shifted by one as an input.

We shift it by one so that the decoder only sees the previous ground truth labels and not the current or future ones. Shifting alone suffices since the decoder has masked self-attention that masks all inputs at the present and future positions.

So, when we prepare our batch, we set up the decoder inputs by shifting the labels to the right by one. After that, we make sure the padding tokens in the labels are ignored by the loss function by setting them to -100. We actually don't have to do this manually, though, since the `DataCollatorForSeq2Seq` comes to the rescue and takes care of all these steps for us:



In [None]:
from transformers import DataCollatorForSeq2Seq
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
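
To illustrate the two preparation steps described above (padding the labels with -100 and shifting them right to build the decoder inputs), here is a minimal sketch on plain Python lists. It is not the `DataCollatorForSeq2Seq` source: the helper names `pad_labels` and `shift_right`, the token ids, and the choice of 0 for both the decoder start token and the padding token are assumptions for this toy example.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def pad_labels(label_batch, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [ignore_index] * (max_len - len(labels)) for labels in label_batch]

def shift_right(labels, decoder_start_token_id, pad_token_id, ignore_index=IGNORE_INDEX):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # The decoder cannot embed -100, so ignored positions become ordinary padding.
    return [pad_token_id if tok == ignore_index else tok for tok in shifted]

label_batch = [[11, 12, 13, 1], [21, 1]]  # made-up token ids; 1 marks end of sequence
padded_labels = pad_labels(label_batch)
decoder_input_ids = [
    shift_right(labels, decoder_start_token_id=0, pad_token_id=0)
    for labels in padded_labels
]
print(padded_labels)
print(decoder_input_ids)
```

At position `t` the decoder input holds the label from position `t - 1`, so the model is always asked to predict a token it has not yet seen.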

Then, as usual, we set up the `TrainingArguments` for training:

In [None]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir='pegasus-samsum',
                                  num_train_epochs=1,
                                  warmup_steps=500,
                                  per_device_train_batch_size=1,
                                  per_device_eval_batch_size=1,
                                  weight_decay=0.01,
                                  logging_steps=10,
                                  push_to_hub=True,
                                  evaluation_strategy='steps',
                                  eval_steps=500,
                                  save_steps=1e6,
                                  gradient_accumulation_steps=16
                            )

One thing that is different from the previous settings is the new argument, `gradient_accumulation_steps`. Since the model is quite big, we had to set the batch size to 1. However, a batch size that is too small can hurt convergence. To resolve the issue, we can use a nifty technique called _gradient accumulation_. As the name suggests, instead of calculating the gradients of the full batch all at once, we make smaller batches and aggregate the gradients. When we have aggregated enough gradients, we run the optimization step. Naturally, this is a bit slower than doing it in one pass, but it saves us a lot of GPU memory.
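
The idea can be verified on a toy problem without any deep learning library. The sketch below is an assumption-laden illustration, not what the `Trainer` does internally: it uses a one-parameter model `y_hat = w * x` with a mean squared error loss, and the function names are hypothetical. It shows that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error with respect to w for y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Accumulate scaled micro-batch gradients; assumes len(xs) is a multiple of micro_batch_size."""
    accumulation_steps = len(xs) // micro_batch_size
    total = 0.0
    for step in range(accumulation_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient so that the sum equals the full-batch mean.
        total += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return total

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0], 1.5
full = mse_gradient(w, xs, ys)
accumulated = accumulated_gradient(w, xs, ys, micro_batch_size=1)
print(full, accumulated)
```

With `per_device_train_batch_size=1` and `gradient_accumulation_steps=16`, the optimizer therefore steps on the equivalent of a batch of 16 examples while only one example has to fit in GPU memory at a time.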

Let's now make sure that we are logged in to Hugging Face so we can push the model to the Hub after training:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

We now have everything we need to initialize the trainer with the model, tokenizer, training arguments, and data collator, as well as the training and evaluation sets:

In [None]:
trainer = Trainer(model=model,
                  args=training_args,
                  tokenizer=tokenizer,
                  data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["train"],
                  eval_dataset=dataset_samsum_pt["validation"])

We are ready for training. After training, we can directly run the evaluation function on the test set to see how well the model performs:

In [None]:
trainer.train()
score = evaluate_summaries_pegasus(dataset_samsum["test"],
                                    rouge_metric,
                                    trainer.model,
                                    tokenizer,
                                    batch_size=2,
                                    column_text="dialogue",
                                    column_summary="summary")

rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=[f"pegasus"])

We see that the ROUGE scores improved considerably over the model without fine-tuning, so even though the previous model was also trained for summarization, it was not well adapted for the new domain. Let's push our model to the Hub:

In [None]:
trainer.push_to_hub("Training complete!")

### Generating Dialogue Summaries

Looking at the losses and ROUGE scores, it seems the model is showing a significant improvement over the original model trained on CNN/DailyMail only. Let's see what a summary generated on a sample from the test set looks like:

In [None]:
gen_kwargs = {"length_penalty":0.8,"num_beams":8, "max_length":128}
sample_text = dataset_samsum["test"][0]["dialogue"]
reference = dataset_samsum["test"][0]["summary"]
pipe = pipeline("summarization",model="Lwhieldon/pegasus-samsum")

print("Dialogue")
print(sample_text)
print("\nReference Summary:")
print(reference)
print("\nModel Summary:")
print(pipe(sample_text,**gen_kwargs)[0]["summary_text"])

That looks much more like the reference summary. It seems the model has learned to synthesize the dialogue into a summary without just extracting passages. Now, the ultimate test: how well does the model work on custom input?

In [None]:
custom_dialogue = """\
Thom: Hi guys, have you heard of transformers?
Lewis: Yes, I used them recently!
Leandro: Indeed, there is a great library by Hugging Face.
Thom: I know, I helped build it ;)
Lewis: Cool, maybe we should write a book about it. What do you think?
Leandro: Great idea, how hard can it be?!
Thom: I am in!
Lewis: Awesome, let's do it together!
"""

print(pipe(custom_dialogue,**gen_kwargs)[0]["summary_text"])

The generated summary of the custom dialogue makes sense. It summarizes well that all the people in the discussion want to write the book together and does not simply extract single sentences. 