<img align='center' style='max-width: 1000px' src='banner.png'>

<img align='right' style='max-width: 200px; height: auto' src='hsg_logo.png'>

##  Lab 08 - Transformers for Summarization (FLAN-T5)

GSERM Summer School 2023, Deep Learning: Fundamentals and Applications, University of St. Gallen

The lab environment of the "Deep Learning: Fundamentals and Applications" GSERM course at the University of St. Gallen (HSG) is based on Jupyter Notebooks (https://jupyter.org), which allow you to perform a variety of statistical evaluations and data analyses.

_This worksheet is largely based on HuggingFace's Tech Lead Philipp Schmid's [excellent tutorial](https://www.philschmid.de/fine-tune-flan-t5)._

# Fine-tuning FLAN-T5 for chat and dialogue summarization

In this lab, we will fine-tune [google/flan-t5-small](https://huggingface.co/google/flan-t5-small), an instruction-tuned transformer foundation model, for chat and dialogue summarization using Hugging Face Transformers.
While it is based on the T5 architecture (with the same number of parameters), FLAN-T5 has been fine-tuned on more than 1000 additional tasks, which also cover more languages.

In this example we will use the [samsum](https://huggingface.co/datasets/samsum) dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.

You will learn how to:

1. [Setup Development Environment](#1-setup-development-environment)
2. [Load and prepare samsum dataset](#2-load-and-prepare-samsum-dataset)
3. [Fine-tune and evaluate FLAN-T5](#3-fine-tune-and-evaluate-flan-t5)
4. [Run Inference](#4-run-inference)

Before we can start, make sure you have a [Hugging Face Account](https://huggingface.co/join) to save artifacts and experiments. 

## Quick intro: FLAN-T5, just a better T5

FLAN-T5, released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. It finds that instruction finetuning is a general method for improving the performance and usability of pretrained language models.

![flan-t5](flan-t5.png)

* Paper: https://arxiv.org/abs/2210.11416
* Official repo: https://github.com/google-research/t5x

## 1. Setup Development Environment

Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages. 

In [None]:
import sys
!{sys.executable} -m pip install accelerate torch pytesseract transformers==4.29.0 datasets evaluate rouge-score nltk tensorboard py7zr --upgrade

## 2. Load and prepare samsum dataset

We will use the [samsum](https://arxiv.org/abs/1911.12237v2) dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.

```json
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}
```

It's a fairly small dataset. You can download it [from arxiv](https://arxiv.org/src/1911.12237v2/anc) or via [HuggingFace](https://huggingface.co/datasets/samsum) (which is what we will do below).

In [1]:
dataset_id = "samsum"

To load the `samsum` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.


In [2]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Train dataset size: 14732
# Test dataset size: 819

Train dataset size: 14732
Test dataset size: 819


Let's check out an example from the dataset.

In [3]:
from random import randrange        


sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")
print(f"summary: \n{sample['summary']}\n---------------")

dialogue: 
Taylor: What music do you listen to?
Valerie: Y?
Taylor: OMG. Can't u just say?
Valerie: Don't be so touchy! Country, pop, a bit of rock and sometimes rap.
Taylor: I've got some good news 4 u :)
Valerie: Rly? What is it?
Taylor: I have 2 tickets 2 a rock concert :)
Valerie: And? :)
Taylor: Would u like to come with me?
Valerie: Sure!
---------------
summary: 
Taylor wants to go to a rock concert with Valerie. She agreed to go with Taylor.
---------------


To train our model we need to convert our inputs (text) to token IDs. This is done by a tokenizer that matches our target transformer model.

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-small"

# Load tokenizer of FLAN-t5-small
tokenizer = AutoTokenizer.from_pretrained(model_id)


Before we can start training we need to preprocess our data. 
Abstractive summarization is a text2text-generation task where the summary is freely generated (as opposed to extractive summarization). 
This means our model will take a text as input and generate a summary as output. 
For this we want to understand how long our input and output will be to be able to efficiently batch our data. 

In [5]:
from datasets import concatenate_datasets

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Max source length: 512
Max target length: 95
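
Taking the maximum, as in the cell above, pads every sample to the length of the longest one, and because the cell tokenizes with `truncation=True` the reported maximum cannot exceed the tokenizer's limit. One possible alternative is to cap the length at a high percentile so that a few very long dialogues do not dictate the padding for all samples. The following is only a sketch: `length_percentile` is a hypothetical helper and the token counts are made up. In the notebook you could pass `[len(x) for x in tokenized_inputs["input_ids"]]` instead.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile: smallest length that covers at least pct percent of the samples."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # rank is 1-based; integer pct keeps the arithmetic exact for typical values
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical token counts, for illustration only (not taken from samsum)
toy_lengths = [12, 40, 55, 61, 70, 88, 95, 120, 130, 512]

for pct in (50, 90, 100):
    print(f"{pct}th percentile: {length_percentile(toy_lengths, pct)}")
```

With a percentile cap, longer samples are truncated, so the choice trades a little information for less padding per batch.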


In [6]:
def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
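
The label masking inside `preprocess_function` can be looked at in isolation. The sketch below repeats the same list comprehension on toy ids: `mask_padding` is a hypothetical helper name, the ids are made up, and the pad id `0` is an assumption for illustration (in the notebook it comes from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value that the loss skips, same constant as in the cell above

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so padded positions do not contribute to the loss."""
    return [
        [(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq]
        for seq in label_ids
    ]

# Toy ids, for illustration only; pad id 0 is an assumption here
toy_labels = [[37, 1782, 1, 0, 0], [8, 1712, 19, 6, 1]]
print(mask_padding(toy_labels, pad_token_id=0))
```

Only the labels are masked this way; the padded `input_ids` are already covered by the `attention_mask`.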


## 3. Fine-tune and evaluate FLAN-T5

After we have processed our dataset, we can start training our model.
To do so, we first need to load our [FLAN-T5](https://huggingface.co/models?search=flan-t5) model from the Hugging Face Hub.
Here we will work with the `small` version to speed things up; feel free to use a larger one (e.g. `base` or `large`) and compare the performance.

In [7]:
from transformers import AutoModelForSeq2SeqLM

# huggingface hub model id
model_id="google/flan-t5-small"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

Before we get started, let's use the original model to generate a few summaries and get an idea of the baseline performance.
As an objective measure of quality, we use the [ROUGE metric](https://aclanthology.org/W04-1013.pdf), which basically counts the n-gram overlap between the generated and the reference summary.

In [10]:
from transformers import pipeline
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

# select a random test sample
for i in range(5):
    sample = dataset['test'][randrange(len(dataset["test"]))]
    print(f"dialogue: \n{sample['dialogue']}\n---------------")

    # summarize dialogue
    res = summarizer(sample["dialogue"])
    print(f"reference summary:\n{sample['summary']}")
    print(f"flan-t5-small summary:\n{res[0]['summary_text']}")
    print(f"rouge metrics: {metric.compute(predictions=[res[0]['summary_text']], references=[sample['summary']], use_stemmer=True)}")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/riedhammerko/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Your max_length is set to 200, but your input_length is only 142. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=71)


dialogue: 
Molly: D‑':
Luca: What???
Molly: I can't think anymore today!
Luca: LOL!
Molly: Seriously!
Luca: Deep breath...
Molly: It's the silly season, isn't it?
Luca: Yep. Just hang in there.
Molly: Trying...
Luca: Try harder! LOL!
Molly: Driving me nuts!
Luca: I know, but you can't let it get you down!
Molly: Yeah, you're right.
Luca: Of course I am! LOL!
Molly: Modest, too.
---------------


Token indices sequence length is longer than the specified maximum sequence length for this model (755 > 512). Running this sequence through the model will result in indexing errors


reference summary:
Luca wants Molly to try harder.
flan-t5-small summary:
Molly can't think anymore today. Luca hangs in there and tries harder. Molly is going to try harder.
rouge metrics: {'rouge1': 0.39999999999999997, 'rouge2': 0.1739130434782609, 'rougeL': 0.39999999999999997, 'rougeLsum': 0.39999999999999997}
dialogue: 
Noah: Hi there! The quince we talked about the other day... Are you still interested?
Sophie: Hello Noah, but of course I am.Thank you.
Noah: Actually William went to collect the rest of them for you immediately. They're in a basket on our terrace, so you call collect them any time.
Sophie: That's very very kind of him! He's really a darling.
Sophie: We won't be going your direction any time soon I'm afraid. How long will they keep? Is it windfall?
Noah: Both really but they all look very healthy. No bruises afa one can see. Or only odd small ones. They' be alright for a couple of days I guess. The weather's cool.
Sophie: It would be a shame if they rotted. I'll t

Your max_length is set to 200, but your input_length is only 159. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=79)


reference summary:
Sophie accepts some quince from Noah. Noah has left the quince in a basket on his terrace and the twins can pick it up any time. The twins are in college. Noah's son is in the military. He is still single but reportedly not gay.
flan-t5-small summary:
Noah and Sophie are going to collect the quince they talked about the other day. They will be on their way from Notts tomorrow afternoon and passing Windfield. Sophie and Noah are both happy they've been doing fine at college. They have a strict regime at the academy. Noah's twins will be coming home this weekend only because of Alexa's ceremony. They'll come back to Sophie asap.
rouge metrics: {'rouge1': 0.2735042735042735, 'rouge2': 0.034782608695652174, 'rougeL': 0.15384615384615385, 'rougeLsum': 0.15384615384615385}
dialogue: 
Maria: Where are you?
Lori: Copley 
Maria: why not Prudential?
Lori: It's close, I can be there in 10min
Maria: we don't have time, we were supposed to meet at Prudential
Lori: sorry, I misund

Your max_length is set to 200, but your input_length is only 48. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


reference summary:
Lori is at Copley. Lori was supposed to meet Maria at Prudential, but misunderstood her. There's a traffic jam, but Lori believes she will meet Maria in several minutes. 
flan-t5-small summary:
Maria and Lori will meet at Prudential on Friday in 10 minutes. They will see each other in front of a mall in 5 minutes.
rouge metrics: {'rouge1': 0.3272727272727273, 'rouge2': 0.07547169811320754, 'rougeL': 0.2545454545454545, 'rougeLsum': 0.2545454545454545}
dialogue: 
Linda: I missed the train
Joseph: When is the next one?
Linda: In one hour
Darcy: Not a tragedy
Linda: It was 80 euros
Linda: Return to Amsterdam
Linda: I was shocked!! 
---------------


Your max_length is set to 200, but your input_length is only 60. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=30)


reference summary:
Linda missed her train and the next one is in one hour. A return to Amsterdam was 80 euros.
flan-t5-small summary:
Linda missed the train in one hour. It was 80 euros. She will return to Amsterdam in a few hours. Joseph and Darcy will be back in Amsterdam.
rouge metrics: {'rouge1': 0.6382978723404255, 'rouge2': 0.3111111111111111, 'rougeL': 0.3829787234042553, 'rougeLsum': 0.3829787234042553}
dialogue: 
Conrad: I'm outside the house
Conrad: I forgot my keys...
Rebecca: 💩
Tiffany: I'll be home at 10-11
Rebecca: I'm coming back even later
Conrad: Oh no...
Conrad: I'll wait in the coffee shop
---------------
reference summary:
Conrad can't enter the house because he forgot his keys. Since Rebecca and Tiffany are coming back late, he'll wait in the coffee shop. 
flan-t5-small summary:
Conrad forgot his keys. Tiffany will be home at 10-11. Conrad will wait in the coffee shop. Rebecca will wait for him.
rouge metrics: {'rouge1': 0.44897959183673475, 'rouge2': 0.25531914893

As you can see, the summaries generated by the original model already achieve fairly high ROUGE-1 and ROUGE-2 scores.
However, since the summaries are very short, you can already guess that this recall-oriented measure has clear limits: just by getting the named entities right, a summary can already score well.
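
To make this weakness concrete, here is a small sketch that scores a deliberately wrong summary of the cookie example from the beginning of this lab, in which the two names are swapped. `overlap_f1` is a hypothetical, simplified stand-in for ROUGE-N (lowercased whitespace tokens, no stemming), so its numbers are not expected to match the `evaluate` implementation exactly.

```python
from collections import Counter

def overlap_f1(prediction, reference, n):
    """F1 over clipped n-gram matches (lowercased, trailing punctuation stripped, no stemming)."""
    def grams(text):
        tokens = [t.strip(".,!?") for t in text.lower().split()]
        # zip over n shifted copies of the token list yields all n-grams as tuples
        return Counter(zip(*(tokens[i:] for i in range(n))))
    pred, ref = grams(prediction), grams(reference)
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "Amanda baked cookies and will bring Jerry some tomorrow."
swapped = "Jerry baked cookies and will bring Amanda some tomorrow."  # hypothetical wrong summary

print("unigram F1:", overlap_f1(swapped, reference, n=1))
print("bigram F1:", overlap_f1(swapped, reference, n=2))
```

The swapped summary states the opposite of what happened, yet every single word matches the reference; only the bigram score drops, because the word pairs around the two names differ.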

We want to evaluate our model during training. 
The `Trainer` supports evaluation during training if we provide a `compute_metrics` function, so let's write a wrapper that decodes the tokens and compares them to the reference.

We are going to use the `evaluate` library to compute the ROUGE score (and `punkt` to split the raw text into sentences).

In [11]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/riedhammerko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Before we can start training, we need to create a `DataCollator` that will take care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [12]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
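
To illustrate what `pad_to_multiple_of=8` together with `label_pad_token_id=-100` means for a batch of labels, here is a rough pure-Python sketch. `pad_batch` is a hypothetical helper and the ids are made up; the real collator additionally handles the model inputs, attention masks and tensor conversion, so treat this as a conceptual picture rather than its implementation.

```python
def pad_batch(sequences, pad_value, multiple_of=8):
    """Right-pad every sequence to the batch maximum, rounded up to a multiple of `multiple_of`."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    return [list(seq) + [pad_value] * (target - len(seq)) for seq in sequences]

# Toy label ids, for illustration only
toy_labels = [[37, 1782, 1], [8, 1712, 19, 6, 11, 27, 3, 9, 1]]
padded = pad_batch(toy_labels, pad_value=-100)
print([len(seq) for seq in padded])
print(padded[0])
```

Lengths that are multiples of 8 are a common choice because they tend to map well onto GPU hardware, in particular with mixed-precision training.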


The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training.

In [13]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# let's give the new model a nice name (flan-t5-small-samsum)
repository_id = f"{model_id.split('/')[1]}-{dataset_id}"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=5,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=False
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

We can start our training by using the `train` method of the `Trainer`.
Note that the training will take quite a long time (10h on an Apple M1, likely twice as much on Colab/T4).
Thus, please verify that the training runs properly, but then terminate and advance to the [next step](#4-run-inference), where you can use checkpoints and a pre-trained model provided in class.

In [14]:
# Start training
trainer.train()



  0%|          | 0/9210 [00:00<?, ?it/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


KeyboardInterrupt: 

Nice, we have trained our model. 🎉 Let's evaluate the best model again on the test set.


In [15]:
trainer.evaluate()

  0%|          | 0/103 [00:00<?, ?it/s]

KeyboardInterrupt: 
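
The totals shown in the two progress bars above (9210 training steps and 103 evaluation batches) can be reproduced from the dataset sizes and the training arguments. This is a worked calculation under the assumption of a single device and no gradient accumulation.

```python
import math

train_examples = 14732   # train split size printed earlier
test_examples = 819      # test split size printed earlier
batch_size = 8           # per_device_train_batch_size and per_device_eval_batch_size
epochs = 5               # num_train_epochs

# Assumption: one device and no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
total_train_steps = steps_per_epoch * epochs
eval_batches = math.ceil(test_examples / batch_size)

print(steps_per_epoch, total_train_steps, eval_batches)  # 1842 9210 103
```

Dividing your measured wall-clock time for a few hundred steps by the step count gives a quick estimate of the total training time on your machine.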

The best score we achieved is a `rouge1` score of `47.23`.

In [None]:
# Save our tokenizer and model
tokenizer.save_pretrained(repository_id)
trainer.save_model(repository_id)

## 4. Run Inference

### Optional: Load check-points or trained model

At the lab, we unfortunately don't have the hardware to run fine-tuning in a decent amount of time.
Thus, your instructors will provide you with checkpoints (i.e. intermediate models) and a pre-trained final model.

In [16]:
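from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Path to the provided final model, or to one of the checkpoints
repository_id = 'flan-t5-small-samsum-pretrained'
#repository_id = 'flan-t5-small-samsum/checkpoint-XXXXX'

# Load via the Auto classes so that this cell also works after a kernel restart
tokenizer = AutoTokenizer.from_pretrained(repository_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repository_id)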

Now that we have a trained model, we can use it to run inference. We will use the `pipeline` API from transformers and a `test` example from our dataset.

In [20]:
from transformers import pipeline
from random import randrange        

# load model and tokenizer from above
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

# select a random test sample
sample = dataset['test'][randrange(len(dataset["test"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")

# summarize dialogue
res = summarizer(sample["dialogue"])

print(f"flan-t5-small summary:\n{res[0]['summary_text']}")
print(f"reference summary:\n{sample['summary']}")

dialogue: 
Ollie: Okay, Kelly! Ur up nxt!
Kelly: Me? I don't wanna.
Mickey: C'mon!
Jessica: Yeah! What's yours?
Kelly: Fine. It's a sculpture garden in Finnland.
Ollie: What's scary about sculptures? Wait! Do they resemble vampires and stuff?
Mickey: Nah, I'm sure they look rly nice.
Kelly: It's not the sculptures, it's the amount of them and their faces!
Jessica: Faces? What faces?
Kelly: Well, they resemble ppl in different activities like hugging, training, doing sport and so on. But the faces are just morbid and there's like a hundred of them. All staring at you!
Ollie: Another one?
Mickey: Certainly!
Jessica: Well, Ollie, ur turn!
Ollie: Nagoro village in Japan!
Mickey: Y?
Ollie: Well, maybe it's not scary, but it similar to Kelly's place. It's just creepy as hell.
Jessica: Bt y?
Ollie: Imagine a village with ppl living in it. And in the same village u have these human-sized figures. And there's more of them than the ppl that actually live there!
Kelly: Creepy AH!
Mickey: WTF?! Y 

# Part for the (take-home) assignment

1. Work with either your own or the provided checkpoints to study the effect of fine-tuning on the performance, by making a comparison of the ROUGE-1 and ROUGE-2 scores for the test set.
2. In a similar way, try out larger variants of FLAN-T5 (`base`, `large`): is it better to fine-tune a small model, or to just use a larger one?
3. Play with the fine-tuning options to see if there is any low-hanging fruit: How do learning rate and epochs affect the scores? Can you make a smaller fine-tuned model outperform a larger foundational one? Can you find good examples where fine-tuning led to better results? (Hint: compare two summaries using https://pypi.org/project/jiwer/.)
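
As a starting point for the hint in item 3, the sketch below computes a word error rate between two summaries with plain Python. It is a simplified stand-in for what a library such as jiwer provides, not its implementation; `word_error_rate` is a hypothetical helper and both example summaries are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1] / len(ref)

before = "Linda missed the train in one hour."  # hypothetical summary before fine-tuning
after = "Linda missed her train."               # hypothetical summary after fine-tuning
print(word_error_rate(before, after))
```

A large value only tells you that two summaries differ a lot; whether the fine-tuned one is actually better still has to be judged against the reference summary.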