# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 3: Text Summarization Using Models in Hugging Face
### Lesson 3.2: Fine-tuning the pre-trained T5-small model in Hugging Face for text summarization

In this lesson, we will fine-tune the [T5-small](https://huggingface.co/t5-small) model on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset. We can also fine-tune other models including Google's PEGASUS model. However, for illustration, we only demonstrate the fine-tuning steps using the smaller model, t5-small, in this tutorial.

## Install Transformers and Datasets from Hugging Face

In [None]:
# Transformers installation
! pip install -q transformers[torch] datasets
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

## Load BillSum dataset

Let us load the [BillSum](https://huggingface.co/datasets/billsum) dataset with the Hugging Face Datasets library. We request only the `ca_test` split, which contains the California state bills.

In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Because we requested a single split, `load_dataset` returns a single Dataset object:

In [None]:
billsum

For fine-tuning and later evaluation, we should split the dataset into a train set and a test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method. With `test_size=0.2`, 20% of the examples are held out for testing:

In [None]:
billsum = billsum.train_test_split(test_size=0.2)
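
To make the 80/20 split concrete, here is a minimal pure-Python sketch of a shuffled split. It is only an illustration, not the library's implementation: the helper `split_indices` is hypothetical, and the row count of 1,237 is an assumption about the size of the `ca_test` split.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and cut them into train and test parts (illustrative only)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Round the test share up, so that every row lands in exactly one part.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Assumption: the ca_test split has 1,237 rows.
train_idx, test_idx = split_indices(1237, test_size=0.2)
print(len(train_idx), len(test_idx))
```

In the real library, passing a `seed` argument to `train_test_split` makes the shuffle reproducible across runs.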

Check that we have a train and test Dataset:

In [None]:
billsum

Take a look at an example:

In [None]:
example = billsum["train"][0]
for key in example:
    print("A key of the example: \"{}\"".format(key))
    print("The value corresponding to the key-\"{}\"\n \"{}\"".format(key, example[key]))

There are three fields:

- `text`: the text of the bill.
- `summary`: a given summary of the text.
- `title`: the title of the text.

## Preprocess

We will fine-tune the T5-small model. The Overview page of the [Hugging Face T5 model](https://huggingface.co/docs/transformers/model_doc/t5#overview) provides the following tips:
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format.
- T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: translate English to German: …, for summarization: summarize: ….

We will load a T5 tokenizer to process `text` and `summary` and prepend a prefix "summarize: " for our text summarization task.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

Test the tokenizer on an example:

In [None]:
tokenized_text = tokenizer(example['text'])
for key in tokenized_text:
    print(key)
    print(tokenized_text[key])

We will create a function to preprocess the training and test data in batch. The preprocessing function will perform the following actions:
- Prepend the prefix "summarize: " to each text document to indicate to the T5 model that the task at hand is summarization.
- Convert the input texts and summary labels into a tokenized format that can be processed by the T5 model.
- Set the max_length parameter to ensure that the tokenized inputs and labels do not exceed a certain length, truncating any text that is too long.
- Assign the tokenized labels to the labels field of model_inputs, which will be used during training to calculate the loss and optimize the model's parameters.
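
The four steps above can be sketched without downloading any model. The snippet below is a toy stand-in: `toy_tokenize` splits on whitespace instead of using the real T5 tokenizer, and the small length limits are chosen only so that truncation is visible. All names in it are hypothetical.

```python
PREFIX = "summarize: "

def toy_tokenize(text, max_length):
    """Whitespace 'tokenizer' standing in for the real T5 tokenizer (assumption)."""
    return text.split()[:max_length]  # truncate to at most max_length tokens

def toy_preprocess(examples, max_input_len=8, max_target_len=4):
    # Step 1: prepend the task prefix to every document.
    inputs = [PREFIX + doc for doc in examples["text"]]
    # Steps 2 and 3: tokenize inputs and summaries, truncating to the length limits.
    model_inputs = {"input_ids": [toy_tokenize(t, max_input_len) for t in inputs]}
    labels = [toy_tokenize(s, max_target_len) for s in examples["summary"]]
    # Step 4: attach the tokenized summaries under the "labels" key.
    model_inputs["labels"] = labels
    return model_inputs

batch = {
    "text": ["The bill amends the education code to extend funding for rural schools"],
    "summary": ["Extends funding for rural schools statewide"],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```

The real function below follows the same shape, but its tokenizer returns integer token ids rather than words.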

In [None]:
def preprocess_function(examples):
    # Prepends the string "summarize: " to each document in the 'text' field of the input examples.
    # This is done to instruct the T5 model on the task it needs to perform, which in this case is summarization.
    inputs = ["summarize: " + doc for doc in examples["text"]]

    # Tokenizes the prepended input texts to convert them into a format that can be fed into the T5 model.
    # Sets a maximum token length of 1024, and truncates any text longer than this limit.
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenizes the 'summary' field of the input examples to prepare the target labels for the summarization task.
    # Sets a maximum token length of 128, and truncates any text longer than this limit.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    # Assigns the tokenized labels to the 'labels' field of model_inputs.
    # The 'labels' field is used during training to calculate the loss and guide model learning.
    model_inputs["labels"] = labels["input_ids"]

    # Returns the prepared inputs and labels as a single dictionary, ready for training.
    return model_inputs

Let us apply the preprocessing function over the entire dataset using the Hugging Face Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. We can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)
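
With `batched=True`, the function receives a dictionary of columns, each holding a list of values, rather than a single row. The sketch below imitates that calling convention on a plain dictionary. It is a simplified illustration under that assumption, not the Datasets implementation, and `toy_batched_map` and `add_prefix` are hypothetical names.

```python
def toy_batched_map(columns, fn, batch_size=2):
    """Apply fn to slices of a column-oriented table and merge the new columns in."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Keep the original columns and add (or overwrite with) the returned ones.
    return {**columns, **new_columns}

def add_prefix(batch):
    return {"prefixed": ["summarize: " + t for t in batch["text"]]}

table = {"text": ["bill one", "bill two", "bill three"]}
mapped = toy_batched_map(table, add_prefix)
print(mapped["prefixed"])
```

This is why `preprocess_function` iterates over `examples["text"]` as a list.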

Let us take a look at a test example:

In [None]:
tokenized_billsum['test'][0]['text']

In [None]:
tokenized_billsum['test'][0]['summary']

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model="t5-small")
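
To illustrate what dynamic padding means, the following sketch pads one small batch to the length of its longest member. It is a simplified stand-in for the collator, not its actual code. It assumes a pad token id of 0 for the inputs and uses -100 to pad the labels so that padded label positions are ignored by the loss. The helper name `pad_batch` is hypothetical.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad inputs and labels to the longest sequence in this batch only."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label_len - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch

features = [
    {"input_ids": [21, 13, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 1], "labels": [9, 4, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Because the padded length depends on the batch, short batches stay short instead of being padded to the global maximum of 1024 tokens.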

## Evaluation Metrics for Training

We will use the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric to evaluate the model during training. We will load the metric from the Hugging Face [Evaluate](https://huggingface.co/docs/evaluate/index) library.

In [None]:
! pip install -q evaluate rouge_score

In [None]:
import evaluate

rouge = evaluate.load("rouge")
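
For intuition about what ROUGE measures, here is a minimal sketch of a ROUGE-1 style F1 score based on unigram overlap. It is a simplification under stated assumptions: it lowercases and splits on whitespace, while the real metric applies its own tokenization and optional stemming, so the numbers will generally differ from `rouge.compute`. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the bill funds rural schools",
    "the bill extends funding for rural schools",
)
print(round(score, 4))
```

ROUGE-2 applies the same idea to word pairs, and ROUGE-L is based on the longest common subsequence.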

Create a `compute_metrics` function that passes the predictions and labels to ROUGE. It works as follows:
- The eval_pred tuple is unpacked into predictions and labels.
- The tokenizer.batch_decode method decodes the tokenized predictions back to text, skipping special tokens such as padding tokens.
- The np.where function replaces every -100 in the labels array with the tokenizer's pad_token_id, because -100 marks label positions that are ignored during loss calculation and is not a valid token id. The labels are then decoded in the same way as the predictions.
- The rouge.compute method is called to calculate the ROUGE metric between the predictions and labels, which is a common metric for evaluating text summarization performance.
- The length of each prediction is calculated by counting the number of non-padding tokens, and the mean prediction length is added to the result dictionary under the key "gen_len".
- Finally, the values in the result dictionary are rounded to 4 decimal places for cleaner output, and the result is returned.
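
The two NumPy steps in this list can be tried on small arrays. The values below are made up for illustration, and a pad token id of 0 is assumed.

```python
import numpy as np

PAD_ID = 0  # assumption: the tokenizer's pad_token_id is 0

# Labels as they arrive in compute_metrics: -100 marks positions ignored by the loss.
labels = np.array([
    [37, 12, 1, -100, -100],
    [8, 1, -100, -100, -100],
])
# Keep real token ids and swap every -100 for the pad id so the labels can be decoded.
cleaned = np.where(labels != -100, labels, PAD_ID)

# Generated predictions, padded with PAD_ID.
predictions = np.array([
    [0, 37, 12, 1, 0],
    [0, 8, 9, 4, 1],
])
# Count the non-padding tokens per prediction, then average them.
prediction_lens = [np.count_nonzero(pred != PAD_ID) for pred in predictions]
gen_len = np.mean(prediction_lens)
print(cleaned)
print(prediction_lens, gen_len)
```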

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    # Unpacks the evaluation predictions tuple into predictions and labels.
    predictions, labels = eval_pred

    # Decodes the tokenized predictions back to text, skipping any special tokens (e.g., padding tokens).
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replaces any -100 values in labels with the tokenizer's pad_token_id.
    # This is done because -100 is often used to ignore certain tokens when calculating the loss during training.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decodes the tokenized labels back to text, skipping any special tokens (e.g., padding tokens).
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Computes the ROUGE metric between the decoded predictions and decoded labels.
    # The use_stemmer parameter enables stemming, which reduces words to their root form before comparison.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Calculates the length of each prediction by counting the non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Computes the mean length of the predictions and adds it to the result dictionary under the key "gen_len".
    result["gen_len"] = np.mean(prediction_lens)

    # Rounds each value in the result dictionary to 4 decimal places for cleaner output, and returns the result.
    return {k: round(v, 4) for k, v in result.items()}


## Train

Import the AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, and Seq2SeqTrainer classes from the Hugging Face transformers library:

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

Load the t5-small model:

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

Define training hyperparameters in Seq2SeqTrainingArguments. Assign a value to the parameter `output_dir` to specify the location to save the model. It is a required parameter.

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_fine_tuned_t5_small_model",
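    # Evaluate at the end of every epoch. Newer transformers releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",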
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    # Mixed-precision training; requires a CUDA GPU (set to False when training on CPU).
    fp16=True,
)
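
As a rough check on how long training will run, the number of optimizer steps follows from the batch size and the epoch count. The sketch below assumes 989 training examples (80% of an assumed 1,237-row `ca_test` split), a single device, and no gradient accumulation.

```python
import math

num_train_examples = 989          # assumption: size of the train split after the 80/20 split
per_device_train_batch_size = 16  # from the training arguments above
num_train_epochs = 4              # from the training arguments above
num_devices = 1                   # assumption: one GPU, no gradient accumulation

# The last batch of an epoch may be smaller than 16, so round up.
steps_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

With `save_total_limit=3`, at most three checkpoints are kept in `output_dir` at any time.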

Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and the `compute_metrics` function.

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Call train() to fine-tune the model:

In [None]:
trainer.train()

Observations: The `compute_metrics` function ran at the end of each epoch during training. At the last epoch, rouge1 is 0.1397, rouge2 is 0.1168, and rougeLsum is 0.1168.

Save the model:

In [None]:
trainer.save_model("my_fine_tuned_t5_small_model")

## Use the Fine-Tuned Model to Summarize Text

We have fine-tuned the t5-small model on the billsum dataset. We can use it for inference.

We will use an example from the test dataset.

In [None]:
text = billsum['test'][100]['text']
text = "summarize: " + text
text

The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Create a `pipeline` object for summarization with the fine-tuned model, and pass the text to it:

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="my_fine_tuned_t5_small_model")
pred = summarizer(text)
pred

We can also manually replicate the results of the `pipeline`.


Tokenize the text and return the `input_ids` as PyTorch tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_fine_tuned_t5_small_model")
inputs = tokenizer(text, return_tensors="pt").input_ids
inputs

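Use the generate() method to generate the summary. The `max_new_tokens` argument caps the length of the output, and `do_sample=False` turns off random sampling: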

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("my_fine_tuned_t5_small_model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
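
With `do_sample=False`, and assuming the number of beams is left at 1, generation reduces to greedy decoding: at each step the highest-scoring next token is appended until an end-of-sequence token appears or `max_new_tokens` is reached. The sketch below shows that loop on a toy scoring table. `greedy_decode`, `toy_scores`, and the table values are hypothetical and stand in for the model.

```python
def greedy_decode(next_token_scores, start_id, eos_id, max_new_tokens):
    """Repeatedly append the highest-scoring next token (greedy search)."""
    tokens = [start_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        tokens.append(next_id)
        if next_id == eos_id:
            break  # stop as soon as the end-of-sequence token is produced
    return tokens

# Toy 'model': scores over a 4-token vocabulary, depending only on the last token.
TOY_TABLE = {
    0: [0.1, 0.6, 0.2, 0.1],
    1: [0.1, 0.1, 0.7, 0.1],
    2: [0.0, 0.2, 0.1, 0.7],
}

def toy_scores(tokens):
    return TOY_TABLE[tokens[-1]]

full = greedy_decode(toy_scores, start_id=0, eos_id=3, max_new_tokens=100)
capped = greedy_decode(toy_scores, start_id=0, eos_id=3, max_new_tokens=2)
print(full)    # [0, 1, 2, 3]
print(capped)  # [0, 1, 2]
```

The second call shows how a small `max_new_tokens` can cut a summary off before the end-of-sequence token is reached.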

Decode the generated token ids back into text:

In [None]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

## Evaluate the result

We can compute the ROUGE scores of the predicted summary against the given reference summary.

In [None]:
pred[0]['summary_text']

In [None]:
preds = [pred[0]['summary_text']]

In [None]:
labels = [billsum['test'][100]['summary']]

In [None]:
rouge.compute(predictions=preds, references=labels, use_stemmer=True)

Great! We have fine-tuned a pre-trained Hugging Face model for text summarization.