Ref: https://medium.com/nlplanet/a-full-guide-to-finetuning-t5-for-text2text-and-building-a-demo-with-streamlit-c72009631887

The task of generating titles starting from the textual content of an article is a text2text generation task: we have a text in input and we want to generate some text as output.

Popular text2text generation tasks are machine translation, commonly evaluated with the BLEU score and a focus on word precision, and text summarization, commonly evaluated with the ROUGE score and a focus on word recall.
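
To make the precision/recall distinction concrete, here is a minimal sketch that computes unigram precision and unigram recall with the standard library only. It is a simplification, not the real metrics: actual BLEU combines several n-gram orders with a brevity penalty, and actual ROUGE reports precision, recall, and F-measure. The helper names are hypothetical.

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str) -> int:
    """Count unigram matches, clipped by how often each word occurs in both texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum((cand & ref).values())

def unigram_precision(candidate: str, reference: str) -> float:
    # Precision-oriented view (BLEU-like): share of generated words found in the reference.
    return unigram_overlap(candidate, reference) / len(candidate.split())

def unigram_recall(candidate: str, reference: str) -> float:
    # Recall-oriented view (ROUGE-1-like): share of reference words that were generated.
    return unigram_overlap(candidate, reference) / len(reference.split())

reference = "fine tuning t5 for title generation"
candidate = "fine tuning t5"

print(unigram_precision(candidate, reference))  # 1.0
print(unigram_recall(candidate, reference))     # 0.5
```

The short candidate is perfectly precise but covers only half of the reference, which is why a recall-oriented score penalizes it.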

I see title generation as closely related to text summarization: the title should tell the reader what the article is about, with the added requirement that it should also intrigue readers and make them curious about the article. For this reason, I decided to evaluate my models with the ROUGE score.

Steps involved:
- Load the dataset.
- Split the dataset into train, validation, and test set.
- Preprocess the dataset for T5.
- Prepare the Hugging Face trainer.
- Start TensorBoard. (optional)
- Fine-tune T5.
- Try the model.
- Evaluate the model on the test set.

Our dataset URL is kaggle.com/datasets/fabiochiusano/medium-articles

In [50]:
!pip install datasets transformers rouge_score nltk

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [51]:
import transformers
from datasets import load_dataset, load_metric

In [52]:
medium_datasets = load_dataset("csv", data_files="medium_articles.csv.zip")

In [53]:
datasets_train_test = medium_datasets["train"].train_test_split(test_size=3000)
datasets_train_validation = datasets_train_test["train"].train_test_split(test_size=3000)

medium_datasets["train"] = datasets_train_validation["train"]
medium_datasets["validation"] = datasets_train_validation["test"]
medium_datasets["test"] = datasets_train_test["test"]

In [54]:
medium_datasets["train"] = medium_datasets["train"].shuffle().select(range(100000))
medium_datasets["validation"] = medium_datasets["validation"].shuffle().select(range(1000))
medium_datasets["test"] = medium_datasets["test"].shuffle().select(range(1000))

Hugging Face provides us with a complete notebook example of how to fine-tune T5 for text summarization. As with every transformer model, we first need to tokenize the textual training data: the article content and the title.

In [55]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
import string
from transformers import AutoTokenizer

model_checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

[nltk_data] Downloading package punkt to /Users/thejus/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/thejus/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Before applying the tokenizer to the data, let’s filter out some bad samples (i.e. articles whose title is shorter than 20 characters or whose text content is shorter than 500 characters).



Then, we define the preprocess_data function, which takes a batch of samples as input and outputs a dictionary of new features to add to the samples. The preprocess_data function does the following:

- Extract the “text” feature from each sample (i.e. the article text content), fix the newlines in the article, and remove the lines without ending punctuation (i.e. the subtitles).
- Prepend the text “summarize: ” to each article text, which is the task prefix T5 uses for summarization.
- Apply the T5 tokenizer to the article text, creating the model_inputs object. This object is a dictionary containing, for each article, an input_ids array and an attention_mask array, holding the token ids and the attention mask respectively.
- Apply the T5 tokenizer to the article titles, creating the labels object. This object has the same structure as model_inputs: an input_ids array and an attention_mask array for each title. Note that this step is done inside the tokenizer.as_target_tokenizer() context manager: this is usually done because there are text2text tasks where inputs and labels must be tokenized with different tokenizers (e.g. when translating between two languages, where each language has its own tokenizer). For text summarization the labels are tokenized with the same tokenizer as the inputs, so the context manager is optional.
- Return a dictionary containing the token ids and attention masks of the inputs, and the token ids of the labels.
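
The subtitle-removal step can be tried in isolation with a simplified sketch. It uses only the standard library and splits on newlines instead of calling nltk.sent_tokenize, and it strips each line, so it is an approximation of the notebook's clean_text function rather than a copy of it. The function name and the sample article are hypothetical.

```python
import string

def drop_unpunctuated_lines(text: str) -> str:
    """Keep only non-empty lines whose last character is ASCII punctuation."""
    lines = [line.strip() for line in text.strip().split("\n")]
    kept = [line for line in lines if line and line[-1] in string.punctuation]
    return "\n".join(kept)

# Made-up article: two subtitle lines (no ending punctuation) and two sentences.
article = """Why Session State Matters
Streamlit reruns your script on every interaction.
A Quick Example
Session State lets you keep variables between reruns."""

cleaned = drop_unpunctuated_lines(article)
model_input = "summarize: " + cleaned  # task prefix added before tokenization

print(model_input)
```

One caveat of this heuristic: a subtitle that ends with a question mark or a colon is kept, because those characters are part of string.punctuation.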

In [56]:
medium_datasets_cleaned = medium_datasets.filter(
    lambda example: (len(example['text']) >= 500) and
    (len(example['title']) >= 20)
)


In [57]:
prefix = "summarize: "
max_input_length = 512
max_target_length = 64

def clean_text(text):
  sentences = nltk.sent_tokenize(text.strip())
  sentences_cleaned = [s for sent in sentences for s in sent.split("\n")]
  sentences_cleaned_no_titles = [sent for sent in sentences_cleaned
                                 if len(sent) > 0 and
                                 sent[-1] in string.punctuation]
  text_cleaned = "\n".join(sentences_cleaned_no_titles)
  return text_cleaned

def preprocess_data(examples):
  texts_cleaned = [clean_text(text) for text in examples["text"]]
  inputs = [prefix + text for text in texts_cleaned]
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

  # Setup the tokenizer for targets
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(examples["title"], max_length=max_target_length, 
                       truncation=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

The preprocess_data function can be applied to all the datasets with the map method.

In [58]:
tokenized_datasets = medium_datasets_cleaned.map(preprocess_data,
                                                 batched=True)

Map:   0%|          | 0/85497 [00:00<?, ? examples/s]



Map:   0%|          | 0/849 [00:00<?, ? examples/s]

Map:   0%|          | 0/856 [00:00<?, ? examples/s]

We can now fine-tune T5 with our preprocessed data! Let’s import some necessary classes to train text2text models.

In [59]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer #, Trainer, DataCollator

Next, we need to create a Seq2SeqTrainingArguments object containing, as the name implies, several training parameters that define how the model is trained. Refer to the Trainer documentation for the meaning of each of these parameters.

In [60]:
batch_size = 8
model_name = "t5-base-medium-title-generation"
model_dir = f"drive/MyDrive/Models/{model_name}"

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    report_to="tensorboard"
)



Here is an explanation of some of the less common parameters passed to the Seq2SeqTrainingArguments object:

- predict_with_generate: Must be set to True to calculate generative metrics such as ROUGE and BLEU, since evaluation then produces generated token sequences.
- fp16: Whether to use 16-bit (mixed) precision training instead of 32-bit training. It makes training faster on supported GPUs; it is set to False here.
- report_to: List of integrations to write logs to (here, TensorBoard).
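
To get a feel for what the step-based settings imply, the following back-of-the-envelope sketch estimates how many optimizer steps, evaluations, and checkpoints one epoch produces. It assumes a single device and no gradient accumulation (so one step equals one batch) and takes the training set size from the Map output shown earlier; the variable names are illustrative only.

```python
import math

train_examples = 85_497   # training set size after filtering, from the Map output
batch_size = 8            # per_device_train_batch_size
num_train_epochs = 1
eval_steps = 100
save_steps = 200

# Assumption: one device and no gradient accumulation, so one step = one batch.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Evaluations and checkpoints triggered by the step counters.
num_evaluations = total_steps // eval_steps
num_checkpoints = total_steps // save_steps

# As far as I know, load_best_model_at_end expects checkpoints to line up with
# evaluations, i.e. save_steps should be a multiple of eval_steps.
assert save_steps % eval_steps == 0

print(total_steps, num_evaluations, num_checkpoints)  # 10688 106 53
```

With save_total_limit=3, only a few of those checkpoints are kept on disk at any time.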


Next, we instantiate a DataCollatorForSeq2Seq object using the tokenizer. Data collators are objects that form a batch from a list of dataset elements, in some cases applying some processing along the way. In this case, all the inputs and labels in the same batch are padded to their respective maximum length in the batch. The inputs are padded with the tokenizer's pad token (<pad> for T5), whereas the labels are padded with the value -100. This is not a real token id: it is the default ignore_index of PyTorch's cross-entropy loss, so the padded label positions do not contribute to the loss.
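
The padding behavior described above can be sketched in plain Python. This is not the DataCollatorForSeq2Seq implementation (which returns tensors and does more); it only illustrates dynamic padding to the longest item in a batch. The pad_batch helper and the token ids are made up, and a pad token id of 0 is assumed for T5.

```python
PAD_TOKEN_ID = 0      # assumption: the pad token id of the T5 tokenizer
LABEL_PAD_ID = -100   # value ignored by PyTorch's cross-entropy loss by default

def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding positions.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_label_pad)
    return batch

# Two toy samples with illustrative (made-up) token ids.
features = [
    {"input_ids": [5, 6, 7, 1], "labels": [8, 1]},
    {"input_ids": [5, 6, 1], "labels": [9, 10, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[5, 6, 7, 1], [5, 6, 1, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(batch["labels"])          # [[8, 1, -100], [9, 10, 1]]
```

Because padding is computed per batch, short batches stay short instead of being padded to the global maximum length.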

In [61]:
data_collator = DataCollatorForSeq2Seq(tokenizer)
# data_collator = DataCollator(tokenizer)

Next, we download the ROUGE code using the load_metric function from the datasets library, thus instantiating a metric object. This object can then be used to compute its metrics using predictions and reference labels.

In [62]:
!pip install rouge_score


In [63]:
metric = load_metric("rouge")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


The metric object must then be called inside a compute_metrics function, which takes a tuple of predictions and reference labels as input and outputs a dictionary of metrics computed over them. Specifically, the compute_metrics function does the following:

- Decode the predictions (i.e. from token ids to words).
- Decode the labels after replacing the -100 padding value with the pad token id.
- Compute ROUGE scores using the decoded predictions and labels, and keep only the F1 score of each ROUGE variant.
- Compute an additional metric: the average length of the predictions.
- Return a dictionary whose keys are the metric names and whose values are the metric values.
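
Two of these steps, restoring the pad id in the labels and measuring the generated length, can be checked on toy data without a tokenizer. The sketch below uses plain lists instead of NumPy arrays, assumes a pad token id of 0, and uses made-up token ids and hypothetical helper names.

```python
PAD_TOKEN_ID = 0  # assumption: the pad token id of the T5 tokenizer

def restore_pad(labels):
    """Replace the -100 loss-masking value with the pad id so the ids can be decoded."""
    return [[tok if tok != -100 else PAD_TOKEN_ID for tok in row] for row in labels]

def mean_generated_length(predictions):
    """Average number of non-pad tokens per prediction."""
    lengths = [sum(tok != PAD_TOKEN_ID for tok in row) for row in predictions]
    return sum(lengths) / len(lengths)

# Toy batch: labels padded with -100, predictions padded with the pad id.
labels = [[8, 1, -100, -100], [9, 10, 11, 1]]
predictions = [[0, 8, 1, 0, 0], [0, 9, 10, 1, 0]]

print(restore_pad(labels))                 # [[8, 1, 0, 0], [9, 10, 11, 1]]
print(mean_generated_length(predictions))  # 2.5
```

The -100 value has to be replaced before decoding because it does not correspond to any entry in the tokenizer's vocabulary.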


In [64]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip()))
                      for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) 
                      for label in decoded_labels]
    
    # Compute ROUGE scores
    result = metric.compute(predictions=decoded_preds, references=decoded_labels,
                            use_stemmer=True)

    # Extract ROUGE f1 scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length to metrics
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id)
                      for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

We must now create a Seq2SeqTrainer, passing all the objects that we have just defined: the training arguments, the training and evaluation data, the data collator, the tokenizer, the compute_metrics function, and a model_init function. The model_init function must return a fresh instance of the pre-trained model to fine-tune, making sure that training always starts from the same model rather than from a model already partially fine-tuned earlier in the notebook session.

In [65]:
# Function that returns a fresh copy of the pre-trained model to be fine-tuned
def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# trainer = Trainer(
#     model_init=model_init,
#     args=args,
#     train_dataset=tokenized_datasets["train"],
#     eval_dataset=tokenized_datasets["validation"],
#     data_collator=data_collator,
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics
# )

ValueError: fp16 mixed precision requires a GPU (not 'mps').

In [None]:
# Start TensorBoard before training to monitor it in progress
%load_ext tensorboard
%tensorboard --logdir '{model_dir}'/runs

Train T5.

In [None]:
trainer.train()

Try the new model.

In [None]:
model_name = "t5-base-medium-title-generation/checkpoint-2000"
model_dir = f"drive/MyDrive/Models/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

max_input_length = 512

In [None]:
text = """
We define access to a Streamlit app in a browser tab as a session.
For each browser tab that connects to the Streamlit server, a new session is created.
Streamlit reruns your script from top to bottom every time you interact with your app.
Each reruns takes place in a blank slate: no variables are shared between runs.
Session State is a way to share variables between reruns, for each user session.
In addition to the ability to store and persist state, Streamlit also exposes the
ability to manipulate state using Callbacks. In this guide, we will illustrate the
usage of Session State and Callbacks as we build a stateful Counter app.
For details on the Session State and Callbacks API, please refer to our Session
State API Reference Guide. Also, check out this Session State basics tutorial
video by Streamlit Developer Advocate Dr. Marisa Smith to get started:
"""

inputs = ["summarize: " + text]

inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]

print(predicted_title)

In [None]:
text = """
Many financial institutions started building conversational AI, prior to the Covid19
pandemic, as part of a digital transformation initiative. These initial solutions
were high profile, highly personalized virtual assistants — like the Erica chatbot
from Bank of America. As the pandemic hit, the need changed as contact centers were
under increased pressures. As Cathal McGloin of ServisBOT explains in “how it started,
and how it is going,” financial institutions were looking for ways to automate
solutions to help get back to “normal” levels of customer service. This resulted
in a change from the “future of conversational AI” to a real tactical assistant
that can help in customer service. Haritha Dev of Wells Fargo, saw a similar trend.
Banks were originally looking to conversational AI as part of digital transformation
to keep up with the times. However, with the pandemic, it has been more about
customer retention and customer satisfaction. In addition, new use cases came about
as a result of Covid-19 that accelerated adoption of conversational AI. As Vinita
Kumar of Deloitte points out, banks were dealing with an influx of calls about new
concerns, like questions around the Paycheck Protection Program (PPP) loans. This
resulted in an increase in volume, without enough agents to assist customers, and
tipped the scale to incorporate conversational AI. When choosing initial use cases
to support, financial institutions often start with high volume, low complexity
tasks. For example, password resets, checking account balances, or checking the
status of a transaction, as Vinita points out. From there, the use cases can evolve
as the banks get more mature in developing conversational AI, and as the customers
become more engaged with the solutions. Cathal indicates another good way for banks
to start is looking at use cases that are a pain point, and also do not require a
lot of IT support. Some financial institutions may have a multi-year technology
roadmap, which can make it harder to get a new service started. A simple chatbot
for document collection in an onboarding process can result in high engagement,
and a high return on investment. For example, Cathal has a banking customer that
implemented a chatbot to capture a driver’s license to be used in the verification
process of adding an additional user to an account — it has over 85% engagement
with high satisfaction. An interesting use case Haritha discovered involved
educating customers on financial matters. People feel more comfortable asking a
chatbot what might be considered a “dumb” question, as the chatbot is less judgmental.
Users can be more ambiguous with their questions as well, not knowing the right
words to use, as chatbot can help narrow things down.
"""

inputs = ["summarize: " + text]

inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]

print(predicted_title)
