<a href="https://colab.research.google.com/github/Taaniya/explore-T5-model/blob/main/Text_summarization_with_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes a T5 model for text summarization, using article titles as the target summaries.

* Model - T5-base
* Dataset - [Medium articles dataset on Kaggle](https://www.kaggle.com/datasets/fabiochiusano/medium-articles) (190K+ Medium articles)
* In the interest of time, I sampled the training data from the original dataset. After filtering, the train split has about 17K examples and the test split has 851.
* Training - supervised fine-tuning on (article text, title) pairs, with no masking of the inputs as is done for the denoising pre-training objective
* Evaluation metric used - ROUGE
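
To make the difference between the two training setups concrete, here is a minimal sketch that builds one denoising (span corruption) example and one supervised example. The sentinel format `<extra_id_N>` is the one used by the T5 vocabulary; the helper names `span_corrupt` and `supervised_pair`, the hand-picked spans, and the whitespace tokenization are assumptions made for illustration (real pre-training picks spans at random over SentencePiece tokens).

```python
def span_corrupt(tokens, spans):
    """Build a denoising (input, target) pair.

    `spans` is a list of sorted, non-overlapping (start, end) token ranges.
    Each span is replaced by a sentinel in the input and spelled out after
    the same sentinel in the target.
    """
    input_tokens, target_tokens = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        input_tokens.extend(tokens[cursor:start])  # untouched text before the span
        input_tokens.append(sentinel)              # span is hidden in the input
        target_tokens.append(sentinel)
        target_tokens.extend(tokens[start:end])    # span is revealed in the target
        cursor = end
    input_tokens.extend(tokens[cursor:])
    target_tokens.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(input_tokens), " ".join(target_tokens)


def supervised_pair(article, title, prefix="summarize: "):
    """Build a supervised (input, target) pair: nothing in the input is masked."""
    return prefix + article, title


tokens = "Thank you for inviting me to your party last week .".split()
denoising_input, denoising_target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(denoising_input)   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(denoising_target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>

print(supervised_pair("Some article text.", "A title"))
```

In this notebook only the second form is used: the full (truncated) article goes in, and the title is the target.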



In [None]:
!pip install datasets transformers rouge_score nltk accelerate

In [None]:
import transformers
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

import nltk
nltk.download('punkt')
import string
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
model_checkpoint = "t5-base"
dataset_path = "./drive/MyDrive/Colab Notebooks/NLP_exploration/exploring_T5/medium_articles_sample_50k.csv"
batch_size = 8
model_name = "t5-base-medium-title-generation"
model_dir = f"./drive/MyDrive/Colab Notebooks/NLP_exploration/exploring_T5/Models/{model_name}"


#### Load dataset

In [None]:
medium_datasets = load_dataset("csv", data_files=dataset_path)


In [None]:
medium_datasets

DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags'],
        num_rows: 50000
    })
})

In [None]:
datasets_train_test = medium_datasets["train"].train_test_split(test_size=3000)
datasets_train_validation = datasets_train_test["train"].train_test_split(test_size=3000)

medium_datasets["train"] = datasets_train_validation["train"]
medium_datasets["validation"] = datasets_train_validation["test"]
medium_datasets["test"] = datasets_train_test["test"]


In [None]:
medium_datasets["train"] = medium_datasets["train"].shuffle().select(range(20000))
medium_datasets["validation"] = medium_datasets["validation"].shuffle().select(range(1000))
medium_datasets["test"] = medium_datasets["test"].shuffle().select(range(1000))
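
The two-stage split followed by subsampling can be hard to follow from the calls alone. The sketch below reproduces the same bookkeeping on plain row indices using only the standard library; `split_indices` and the fixed `seed` are hypothetical and only illustrate the resulting sizes, not how `datasets` implements the split.

```python
import random


def split_indices(n_rows, test_size, validation_size, seed=0):
    """Shuffle row indices, then carve off test and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    test = indices[:test_size]
    validation = indices[test_size:test_size + validation_size]
    train = indices[test_size + validation_size:]
    return train, validation, test


# 50,000 rows -> 3,000 test, then 3,000 validation from the remainder
train, validation, test = split_indices(50000, 3000, 3000)
print(len(train), len(validation), len(test))  # 44000 3000 3000

# Subsample each split, as done with shuffle().select(range(...))
train, validation, test = train[:20000], validation[:1000], test[:1000]
print(len(train), len(validation), len(test))  # 20000 1000 1000
```

Because the notebook's calls do not pass a seed, the exact rows in each split differ between runs; both `train_test_split` and `shuffle` accept a `seed` argument if reproducible splits are needed.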

#### Data preprocessing

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
# prepend 'summarize' to each text
prefix = "summarize: "

max_input_length = 512
max_target_length = 64

def clean_text(text):
  sentences = nltk.sent_tokenize(text.strip())
  sentences_cleaned = [s for sent in sentences for s in sent.split("\n")]

  # Keep only lines that end in punctuation (drops headings and other title-like lines)
  sentences_cleaned_no_titles = [sent for sent in sentences_cleaned
                                 if len(sent) > 0 and
                                 sent[-1] in string.punctuation]
  text_cleaned = "\n".join(sentences_cleaned_no_titles)
  return text_cleaned

def preprocess_data(examples):
  # create inputs for model by tokenizing article text
  texts_cleaned = [clean_text(text) for text in examples["text"]]
  inputs = [prefix + text for text in texts_cleaned]
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

  # Create the labels by tokenizing the article titles. Since this is text
  # summarization, where inputs and labels share the same vocabulary, the same
  # tokenizer is used for both the encoder inputs and the labels.
  # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
  # (available in recent transformers releases).
  labels = tokenizer(text_target=examples["title"], max_length=max_target_length,
                     truncation=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
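
To see what the punctuation filter in `clean_text` does, here is a minimal sketch applying the same rule to a toy article. It is a simplification: it splits on newlines only and skips the `nltk.sent_tokenize` step, and `keep_punctuated_lines` and `sample_article` are hypothetical names and data introduced for illustration.

```python
import string


def keep_punctuated_lines(text):
    """Keep only non-empty lines whose last character is a punctuation mark."""
    lines = text.strip().split("\n")
    kept = [line for line in lines
            if len(line) > 0 and line[-1] in string.punctuation]
    return "\n".join(kept)


sample_article = """My Section Heading
This sentence ends with a full stop.
Another heading without punctuation
Is this line kept? Yes, it is!"""

cleaned = keep_punctuated_lines(sample_article)
model_input = "summarize: " + cleaned  # same prefix as in preprocess_data
print(model_input)
```

The two heading-like lines are dropped because they do not end in a character from `string.punctuation`, which keeps embedded titles out of the model input.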

In [None]:
medium_datasets_cleaned = medium_datasets.filter(lambda example: (len(example['text']) >= 500) and (len(example['title']) >= 20))
tokenized_datasets = medium_datasets_cleaned.map(preprocess_data, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 17054
    })
    validation: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 849
    })
    test: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 851
    })
})

In [None]:
# Remove any previous run; the path contains a space, so it must be quoted
!rm -rf "{model_dir}"

In [None]:
args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",       # evaluate every eval_steps
    eval_steps=100,
    logging_strategy="steps",          # log every logging_steps
    logging_steps=100,
    save_strategy="steps",             # save a checkpoint every save_steps
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,                # keep at most 3 checkpoints, including the best one
    num_train_epochs=1,
    predict_with_generate=True,        # generate text at eval time for generative metrics (ROUGE, BLEU)
    fp16=True,                         # use 16-bit (mixed) precision training instead of 32-bit
    load_best_model_at_end=True,       # reload the best checkpoint when training ends
    metric_for_best_model="rouge1",    # metric used to compare checkpoints (loss is used otherwise)
    report_to="tensorboard"
)

Details on each training argument can be found in the [Hugging Face documentation for Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).

Sequence-to-sequence tasks use a dedicated data collator, [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq), because label lengths vary and generally differ from the input lengths.
It pads inputs and labels separately within each batch: inputs are padded with the tokenizer's padding token id (0 for T5), while labels are padded with -100, the `ignore_index` of the loss function, so that padded label positions do not contribute to the loss.
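
The padding behaviour described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `pad_to_longest` and `collate` are hypothetical helpers, the token ids are made up, and only right-padding to the longest sequence in the batch is shown.

```python
PAD_TOKEN_ID = 0           # padding id for input_ids (T5's pad token id)
LABEL_IGNORE_INDEX = -100  # padding id for labels, skipped by the loss


def pad_to_longest(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


def collate(features):
    """Pad inputs and labels independently, since their lengths differ."""
    input_ids = [f["input_ids"] for f in features]
    return {
        "input_ids": pad_to_longest(input_ids, PAD_TOKEN_ID),
        "attention_mask": pad_to_longest([[1] * len(ids) for ids in input_ids], 0),
        "labels": pad_to_longest([f["labels"] for f in features], LABEL_IGNORE_INDEX),
    }


batch = collate([
    {"input_ids": [21, 22, 23, 24, 1], "labels": [31, 1]},
    {"input_ids": [25, 26, 1], "labels": [32, 33, 34, 1]},
])
print(batch["input_ids"])       # [[21, 22, 23, 24, 1], [25, 26, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
print(batch["labels"])          # [[31, 1, -100, -100], [32, 33, 34, 1]]
```

Note that inputs are padded to length 5 and labels to length 4: each is padded to its own batch maximum.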

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer)

In [None]:
metric = load_metric("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip()))
                      for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip()))
                      for label in decoded_labels]

    # Compute ROUGE scores
    result = metric.compute(predictions=decoded_preds, references=decoded_labels,
                            use_stemmer=True)

    # Extract ROUGE f1 scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Add mean generated length to metrics
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id)
                      for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
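
For intuition about what the ROUGE-1 F-measure reported by `compute_metrics` represents, here is a minimal sketch of unigram-overlap F1. It is an approximation for illustration only: `rouge1_f1` is a hypothetical helper that tokenizes on whitespace, whereas the `rouge_score` package applies its own tokenization and optional stemming, and the metric wrapper aggregates over examples (hence `.mid` in the code above).

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference title."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("conversational ai in banking",
                  "the future of conversational ai")
# 2 shared unigrams: precision = 2/4, recall = 2/5, F1 = 4/9
print(round(score * 100, 2))  # 44.44
```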


In [None]:
# Function that returns a fresh copy of the pretrained model to be fine-tuned
def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
# Start TensorBoard before training to monitor it in progress
%load_ext tensorboard
%tensorboard --logdir '{model_dir}'/runs

In [None]:
trainer.train()



| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 100  | 3.1615 | 2.612373 | 25.3951 | 11.5321 | 22.7996 | 22.8965 | 12.3051 |
| 200  | 2.8239 | 2.549661 | 28.4687 | 13.7783 | 25.8592 | 25.8839 | 11.4582 |
| 300  | 2.7259 | 2.516123 | 29.1775 | 14.3767 | 26.8422 | 26.8906 | 11.5618 |
| 400  | 2.6553 | 2.504573 | 29.116  | 14.5796 | 26.7028 | 26.7205 | 11.6337 |
| 500  | 2.6696 | 2.488245 | 29.0585 | 14.5326 | 26.4908 | 26.4946 | 11.9941 |
| 600  | 2.6852 | 2.480134 | 29.3248 | 14.476  | 26.8777 | 26.8936 | 11.4229 |
| 700  | 2.6674 | 2.468766 | 29.452  | 14.5269 | 27.1131 | 27.1449 | 11.3404 |
| 800  | 2.652  | 2.458986 | 29.6303 | 14.5699 | 27.1487 | 27.1865 | 11.4499 |
| 900  | 2.6535 | 2.458302 | 29.9547 | 14.8165 | 27.244  | 27.2607 | 11.6219 |
| 1000 | 2.6858 | 2.448932 | 29.7455 | 14.8469 | 27.4643 | 27.5189 | 11.4982 |



TrainOutput(global_step=2132, training_loss=2.680229229953902, metrics={'train_runtime': 3409.3796, 'train_samples_per_second': 5.002, 'train_steps_per_second': 0.625, 'total_flos': 1.038516786561024e+16, 'train_loss': 2.680229229953902, 'epoch': 1.0})
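
As a sanity check on the training output, the step count can be derived from the dataset size and batch size. This is a sketch that assumes a single device and no gradient accumulation (neither is set in the training arguments); the variable names are illustrative.

```python
import math

num_train_examples = 17054  # rows in the tokenized train split
batch_size = 8
eval_steps = 100
save_steps = 200

# One optimizer step per batch; the last batch may be smaller
steps_per_epoch = math.ceil(num_train_examples / batch_size)
print(steps_per_epoch)  # 2132

# Steps at which a checkpoint is written during the single epoch
checkpoint_steps = list(range(save_steps, steps_per_epoch + 1, save_steps))
print(checkpoint_steps[-1])  # 2000

# load_best_model_at_end needs every save step to coincide with an evaluation
print(save_steps % eval_steps == 0)  # True
```

The result matches `global_step=2132` in the output above, and the last checkpoint written by `save_steps=200` falls at step 2000.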

In [None]:
trainer.save_model()

#### Load model from Google drive

In [None]:
model_name = "t5-base-medium-title-generation/checkpoint-2000"
model_dir = f"./drive/MyDrive/Colab Notebooks/NLP_exploration/exploring_T5/Models/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

max_input_length = 512

In [None]:
text = """
Many financial institutions started building conversational AI, prior to the Covid19
pandemic, as part of a digital transformation initiative. These initial solutions
were high profile, highly personalized virtual assistants — like the Erica chatbot
from Bank of America. As the pandemic hit, the need changed as contact centers were
under increased pressures. As Cathal McGloin of ServisBOT explains in “how it started,
and how it is going,” financial institutions were looking for ways to automate
solutions to help get back to “normal” levels of customer service. This resulted
in a change from the “future of conversational AI” to a real tactical assistant
that can help in customer service. Haritha Dev of Wells Fargo, saw a similar trend.
Banks were originally looking to conversational AI as part of digital transformation
to keep up with the times. However, with the pandemic, it has been more about
customer retention and customer satisfaction. In addition, new use cases came about
as a result of Covid-19 that accelerated adoption of conversational AI. As Vinita
Kumar of Deloitte points out, banks were dealing with an influx of calls about new
concerns, like questions around the Paycheck Protection Program (PPP) loans. This
resulted in an increase in volume, without enough agents to assist customers, and
tipped the scale to incorporate conversational AI. When choosing initial use cases
to support, financial institutions often start with high volume, low complexity
tasks. For example, password resets, checking account balances, or checking the
status of a transaction, as Vinita points out. From there, the use cases can evolve
as the banks get more mature in developing conversational AI, and as the customers
become more engaged with the solutions. Cathal indicates another good way for banks
to start is looking at use cases that are a pain point, and also do not require a
lot of IT support. Some financial institutions may have a multi-year technology
roadmap, which can make it harder to get a new service started. A simple chatbot
for document collection in an onboarding process can result in high engagement,
and a high return on investment. For example, Cathal has a banking customer that
implemented a chatbot to capture a driver’s license to be used in the verification
process of adding an additional user to an account — it has over 85% engagement
with high satisfaction. An interesting use case Haritha discovered involved
educating customers on financial matters. People feel more comfortable asking a
chatbot what might be considered a “dumb” question, as the chatbot is less judgmental.
Users can be more ambiguous with their questions as well, not knowing the right
words to use, as chatbot can help narrow things down.
"""

inputs = ["summarize: " + text]

inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]

print(predicted_title)
# Conversational AI: The Future of Customer Service

Conversational AI: The Future of Customer Service


Comparing the output above with that of the base model, which has not been fine-tuned on this summarization task:

In [None]:
model_checkpoint

't5-base'

In [None]:
base_tokenizer = AutoTokenizer.from_pretrained("t5-base")
base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


In [None]:
inputs = ["summarize: " + text]

inputs_v2 = base_tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output_v2 = base_model.generate(**inputs_v2, num_beams=8, do_sample=True, min_length=10, max_length=64)
base_model_decoded_output = base_tokenizer.batch_decode(output_v2, skip_special_tokens=True)[0]
base_model_predicted_title = nltk.sent_tokenize(base_model_decoded_output.strip())[0]

print(base_model_predicted_title)

many financial institutions started building conversational AI prior to the covid19 pandemic.


The base model simply reproduced the beginning of the first sentence of the input text.

#### Evaluate the model on test set

In [None]:
import torch

# get test split
test_tokenized_dataset = tokenized_datasets["test"]

# pad texts to the same length
def preprocess_test(examples):
  inputs = [prefix + text for text in examples["text"]]
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True,
                           padding="max_length")
  return model_inputs

test_tokenized_dataset = test_tokenized_dataset.map(preprocess_test, batched=True)

# prepare dataloader
test_tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
dataloader = torch.utils.data.DataLoader(test_tokenized_dataset, batch_size=32)

# generate predictions batch by batch
all_predictions = []
for batch in dataloader:
  predictions = model.generate(**batch)
  all_predictions.append(predictions)

# flatten predictions
all_predictions_flattened = [pred for preds in all_predictions for pred in preds]

# tokenize and pad titles; converting to an array lets compute_metrics
# compare the labels against -100 element-wise
all_titles = np.array(tokenizer(test_tokenized_dataset["title"], max_length=max_target_length,
                                truncation=True, padding="max_length")["input_ids"])

# compute metrics
predictions_labels = [all_predictions_flattened, all_titles]
compute_metrics(predictions_labels)


{'rouge1': 43.3828,
 'rouge2': 30.9557,
 'rougeL': 41.4247,
 'rougeLsum': 41.4544,
 'gen_len': 11.7368}

#### References
* [Original blog post on fine-tuning T5 for summarization](https://medium.com/nlplanet/a-full-guide-to-finetuning-t5-for-text2text-and-building-a-demo-with-streamlit-c72009631887)

* [Medium articles Kaggle dataset](https://www.kaggle.com/datasets/fabiochiusano/medium-articles)

* [Data collators - Huggingface](https://youtu.be/-RPeakdlHYo)
