<a href="https://colab.research.google.com/github/axel-sirota/nlp-and-transformers/blob/main/module4/NLPTransformers_Mod4Demo3_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization with T5

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

Summarization is a challenging task, but the multi-task T5 model can handle it. We will show you how to fine-tune T5 to your needs; I highly recommend adjusting the amount of training data to the GPU power you have available.

## Prep

In [None]:
!pip install transformers datasets evaluate rouge_score transformers[sentencepiece]

In [None]:
import multiprocessing
import tensorflow as tf
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
import numpy as np
import evaluate

import sys
import keras.backend as K
import random
import os
import pandas as pd
import warnings
import time
import nltk

TRACE = False
PATIENCE = 1
EPOCHS = 1
BATCH_SIZE = 32
DIVISION_FACTOR = 6

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto( device_count = {'GPU': gpus  , 'CPU': cores} , intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
  sess = tf.compat.v1.Session(config=config)
  tf.compat.v1.keras.backend.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')
nltk.download('punkt')

## Download dataset and tokenize it

Let's download the dataset and the metric (ROUGE) used for summarization.

In [None]:
raw_datasets = load_dataset("xsum")
metric = evaluate.load("rouge")  # datasets.load_metric is deprecated in favor of the evaluate library
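
ROUGE scores a generated summary by how many n-grams it shares with a reference summary. As a rough illustration only, here is a minimal sketch of a ROUGE-1 F1 score. It assumes lowercase whitespace tokenization, whereas the real `rouge` metric applies its own tokenization and also reports ROUGE-2 and ROUGE-L. The helper `rouge1_f1` is a hypothetical name, not part of any library.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("floods hit the town", "floods hit the borders town badly")
print(round(score, 3))
```

Every predicted word appears in the reference (precision 1.0), but the prediction covers only four of the six reference words, so the F1 score ends up below 1.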

In [None]:
raw_datasets

As you can tell, we have about 200k training datapoints. If you have a less powerful GPU in Colab, using around 5% of them should suffice.

In [None]:
raw_datasets["train"][0]

As we can tell, we have a document, a summary and an ID. Let's see if we can make this work!

We are going to use the T5 model (a small version of it), which is a **multi-task model**.



<table>
<tr>
  <th colspan=1>The T5 architecture</th>
</tr>
<tr>
  <td>
   <img src="https://www.dropbox.com/s/5mcfpwuahy3yivg/t5_arch.webp?raw=1"/>
  </td>
</tr>
</table>

As you can see, it is a fairly vanilla encoder-decoder model, but it was trained on a variety of tasks with multiple targets.


<table>
<tr>
  <th colspan=1>T5 training and finetuning</th>
</tr>
<tr>
  <td>
   <img src="https://www.dropbox.com/s/r8c64dvgmz6jom2/t5-fnietuning.png?raw=1"/>
  </td>
</tr>
</table>

So in the end we can simply use it for all of these tasks, as follows:



<table>
<tr>
  <th colspan=1>T5 usage</th>
</tr>
<tr>
  <td>
   <img src="https://www.dropbox.com/s/4i71mp3axb4x8gy/t5_tasks.png?raw=1"/>
  </td>
</tr>
</table>

As before, remember to use the `AutoTokenizer` to download the correct tokenizer.

In [None]:

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

T5 is a multi-task model, so it needs a prefix to know what we want it to do (summarization, Q&A, grammatical checks, you name it). The prefix for our task is `summarize: `.
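
Because the task is encoded in the input text itself, switching tasks amounts to switching the prefix string. The following is a minimal sketch of that idea. `TASK_PREFIXES` and `build_input` are hypothetical names introduced here for illustration, and the translation prefix is the one shown in the T5 task figure above.

```python
# Hypothetical mapping from a task name to the text prefix T5 expects.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
}


def build_input(task: str, text: str) -> str:
    """Prepend the task prefix so the model knows which task to perform."""
    return TASK_PREFIXES[task] + text


example = build_input("summarization", "The River Cree overflowed into the town.")
print(example)
```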

In [None]:
tokenizer(["Hello, this is a sentence!", "This is another sentence."])

In [None]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

In [None]:
max_input_length = 1024  # This comes from the model card
max_target_length = 128  # This comes from the model card


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets (summaries); `text_target` replaces the deprecated
    # as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=examples["summary"], max_length=max_target_length, truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Notice how the preprocessing just applies the tokenizer to both the document and the summary, and sets the summary's `input_ids` as the labels.
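
To see the shape of what such a preprocessing function returns without downloading anything, here is a sketch that swaps the real tokenizer for a toy whitespace tokenizer. `toy_tokenize`, `toy_preprocess`, and the tiny length limits are assumptions for illustration only; T5 actually uses a SentencePiece tokenizer. The same three steps are visible: add the prefix, truncate, and use the summary ids as labels.

```python
def toy_tokenize(text, vocab, max_length):
    """Map each whitespace-separated word to an integer id, then truncate."""
    ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]
    return ids[:max_length]


def toy_preprocess(examples, prefix="summarize: ", max_input_length=8, max_target_length=4):
    vocab = {}
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = {
        "input_ids": [toy_tokenize(text, vocab, max_input_length) for text in inputs]
    }
    # The labels are simply the token ids of the (truncated) summaries.
    model_inputs["labels"] = [
        toy_tokenize(summary, vocab, max_target_length) for summary in examples["summary"]
    ]
    return model_inputs


batch = {
    "document": ["heavy rain flooded many roads and homes across the region today"],
    "summary": ["rain flooded roads and homes"],
}
features = toy_preprocess(batch)
print(features)
```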

In [None]:
preprocess_function(raw_datasets["train"][:2])

Now we subsample every split of the dataset so that it fits in RAM; adapt `DIVISION_FACTOR` to your needs.

In [None]:
DIVISION_FACTOR = 20
even_dataset = raw_datasets.filter(lambda example, index: index % DIVISION_FACTOR == 0, with_indices=True)
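
The filter keeps only the examples whose index is divisible by `DIVISION_FACTOR`, so the retained fraction is `1 / DIVISION_FACTOR`. A small sketch with a plain list of indices standing in for a dataset split (the size of 1,000 is an arbitrary assumption) makes the arithmetic concrete:

```python
DIVISION_FACTOR = 20

# A plain list of indices stands in for a dataset split of 1,000 examples.
indices = list(range(1000))
kept = [index for index in indices if index % DIVISION_FACTOR == 0]

fraction = len(kept) / len(indices)
print(len(kept), fraction)
```

With a factor of 20 this keeps 5% of the examples, which matches the fraction suggested earlier for smaller GPUs.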

In [None]:
tokenized_datasets = even_dataset.map(preprocess_function, batched=True)

## Downloading the model and Training

Now we can download the model.

In [None]:
from transformers import TFT5ForConditionalGeneration

model = TFT5ForConditionalGeneration.from_pretrained(model_checkpoint)

At this point, we could specify a maximum length and pad everything to it, as we used to do a lot with RNNs; however, that is not optimal. It is better to pad only up to the maximum length *of each batch*, and that is what a **DataCollator** does.
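
As a rough sketch of what per-batch padding means, the pure-Python helper below pads each batch only to its own longest sequence. `pad_batch` is a hypothetical helper and not the collator's implementation. The two constants reflect T5's padding token id (0) and the default label padding value of `DataCollatorForSeq2Seq` (-100, which the loss ignores).

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


PAD_TOKEN_ID = 0     # T5's padding token id
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss

batch_a = [[5, 6, 7], [8, 9]]
batch_b = [[1, 2, 3, 4, 5, 6], [7]]

padded_a = pad_batch(batch_a, PAD_TOKEN_ID)  # padded to length 3
padded_b = pad_batch(batch_b, PAD_TOKEN_ID)  # padded to length 6
padded_labels = pad_batch([[11, 12], [13]], LABEL_PAD_ID)
print(padded_a, padded_b, padded_labels)
```

The first batch is padded to length 3 and the second to length 6, so short batches do not pay for the longest sequence in the whole dataset.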

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")  # By default it returns PyTorch tensors which would fail, better to specify Numpy

generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np", pad_to_multiple_of=128)

In [None]:
train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=data_collator,
)

generation_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=generation_data_collator
)

In [None]:
from transformers import AdamWeightDecay
import tensorflow as tf

learning_rate = 2e-5
weight_decay = 0.01
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

In [None]:
model.summary()

Sadly, we cannot freeze only some of the layers of this particular model, since it was uploaded as a plain encoder-decoder, so we will simply fine-tune all of it.

In [None]:
model.fit(train_dataset, validation_data=validation_dataset, epochs=EPOCHS)

## Testing it out

In [None]:
document = 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled'
if 't5' in model_checkpoint:
    document = "summarize: " + document
tokenized = tokenizer([document], return_tensors='np')
out = model.generate(**tokenized, max_length=128)

In [None]:
print(document)

In [None]:
# Decode the generated ids back to text, dropping special tokens such as <pad> and </s>
print(tokenizer.decode(out[0], skip_special_tokens=True))