# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 3: Text Summarization Using Models in Hugging Face
### Lesson 3.2: Fine-tuning the pre-trained T5-small model in Hugging Face for text summarization

In this lesson, we will fine-tune the [T5-small](https://huggingface.co/t5-small) model on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset. We can also fine-tune other models including Google's PEGASUS model. However, for illustration, we only demonstrate the fine-tuning steps using the smaller model, t5-small, in this tutorial.

## Install Transformers and Datasets from Hugging Face

In [1]:
# Transformers installation
! pip install -q transformers[torch] datasets
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git




## Load BillSum dataset

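Let us load the dataset from the Hugging Face Hub with the Datasets library. Note that the cell below loads `Kyudan/GTNT_8.25M` (about 8.25 million examples with two text columns, `GT` and `NT`) rather than BillSum.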

In [2]:
from datasets import load_dataset

billsum = load_dataset("Kyudan/GTNT_8.25M")

Downloading data: 100%|██████████| 3.22G/3.22G [01:48<00:00, 29.7MB/s]
Generating train split: 8253159 examples [00:04, 1883466.12 examples/s]


The loaded dataset contains a single `train` split, so we hold out 20% of it as a test set with `train_test_split`:

In [5]:
split_dataset = billsum['train'].train_test_split(test_size=0.2)

# Inspect the train and test datasets
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

In [7]:
train_dataset

Dataset({
    features: ['GT', 'NT'],
    num_rows: 6602527
})
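
The row count above follows directly from the 80/20 split. As a quick check, the sketch below reproduces the split sizes from the 8,253,159 examples reported by the loader. It assumes the test share is rounded up and the remainder goes to training; that rounding rule is an assumption, although it matches the sizes printed in this notebook. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest is used for training.
    """
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return n_train, n_test

n_train, n_test = split_sizes(8_253_159, 0.2)
print(n_train, n_test)  # 6602527 1650632
```

Because the split is shuffled, passing a fixed `seed` to `train_test_split` makes it reproducible across runs.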

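Take a look at an example: each row pairs an `NT` string with a `GT` string. A test example of both fields is printed after preprocessing below.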

## Preprocess

We will fine-tune the T5-small model. The Overview page of the [Hugging Face T5 model](https://huggingface.co/docs/transformers/model_doc/t5#overview) documentation provides the following tips:
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format.
- T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: `translate English to German: …`, for summarization: `summarize: …`.

We will load a T5 tokenizer to process the input column (`NT`) and the target column (`GT`) and prepend a task prefix to each input (the preprocessing code below uses "translate: ").

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")



Test the tokenizer on an example:

We will create a function to preprocess the training and test data in batch. The preprocessing function will perform the following actions:
- Prepend the task prefix (`"translate: "` in the code below) to each input document to tell the T5 model which task to perform.
- Convert the input texts (`NT`) and target labels (`GT`) into a tokenized format that can be processed by the T5 model.
- Set the `max_length` parameter (256 for both inputs and labels) together with `truncation=True`, so that any longer sequence is truncated.
- Assign the tokenized labels to the `labels` field of `model_inputs`, which is used during training to calculate the loss and optimize the model's parameters.

In [8]:
def preprocess_function(examples):
    # Prepend the task prefix "translate: " to each document in the 'NT' field of the input examples.
    # The prefix tells the T5 model which task it needs to perform.
    inputs = ["translate: " + doc for doc in examples["NT"]]

    # Tokenize the prefixed input texts so they can be fed into the T5 model.
    # Inputs longer than 256 tokens are truncated.
    model_inputs = tokenizer(inputs, max_length=256, truncation=True)

    # Tokenize the 'GT' field of the input examples to prepare the target labels.
    # Targets longer than 256 tokens are truncated.
    labels = tokenizer(text_target=examples["GT"], max_length=256, truncation=True)

    # Assign the tokenized labels to the 'labels' field of model_inputs.
    # The 'labels' field is used during training to calculate the loss and guide model learning.
    model_inputs["labels"] = labels["input_ids"]

    # Return the prepared inputs and labels as a single dictionary, ready for training.
    return model_inputs
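
To make the shape of the function's output concrete without downloading a tokenizer, here is a minimal sketch in plain Python. The whitespace tokenizer `toy_tokenize`, its vocabulary, and `toy_preprocess` are hypothetical stand-ins; the real T5 tokenizer uses subword units and also returns an attention mask. Only the structure (prefix, truncation, `labels` key) mirrors the function above.

```python
def toy_tokenize(texts, max_length, vocab):
    """Hypothetical stand-in for a tokenizer: whitespace split plus truncation."""
    batch = []
    for text in texts:
        # Ids 0 and 1 are reserved here for padding and end-of-sequence.
        ids = [vocab.setdefault(tok, len(vocab) + 2) for tok in text.split()]
        # Truncate, leaving room for the end-of-sequence id (assumed to be 1).
        batch.append(ids[: max_length - 1] + [1])
    return {"input_ids": batch}

def toy_preprocess(examples, prefix="translate: ", max_length=8):
    vocab = {}
    inputs = [prefix + doc for doc in examples["NT"]]
    model_inputs = toy_tokenize(inputs, max_length, vocab)
    labels = toy_tokenize(examples["GT"], max_length, vocab)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

batch = {"NT": ["x squared plus one", "a over b"],
         "GT": ["$x^2 + 1$", "$\\frac{a}{b}$"]}
out = toy_preprocess(batch)
print(out)
```

Each key of the result holds one list of token ids per example, which is the layout the data collator expects later on.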

Let us apply the preprocessing function over the entire dataset using the Hugging Face Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. We can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [9]:
tokenized_billsum = split_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 6602527/6602527 [07:26<00:00, 14777.87 examples/s]
Map: 100%|██████████| 1650632/1650632 [02:00<00:00, 13740.83 examples/s]
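
With `batched=True`, the function receives a dictionary of columns, each holding the values for many rows at once, instead of a single row. The following sketch imitates that calling convention in plain Python. `map_batched` and `add_prefix` are hypothetical helpers, not the library's implementation; the default of 1000 rows per batch mirrors the library's default batch size.

```python
def map_batched(columns, fn, batch_size=1000):
    """Apply fn to slices of a column-oriented table and merge the new columns."""
    n_rows = len(next(iter(columns.values())))
    merged = {name: list(values) for name, values in columns.items()}
    new_cols = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_cols.setdefault(name, []).extend(values)
    merged.update(new_cols)  # original columns are kept, new ones are added
    return merged

def add_prefix(batch):
    return {"input": ["translate: " + nt for nt in batch["NT"]]}

table = {"NT": ["a plus b", "x squared", "one half"],
         "GT": ["$a+b$", "$x^2$", "$1/2$"]}
result = map_batched(table, add_prefix, batch_size=2)
print(result["input"])
```

This is also why the tokenized dataset still exposes the original `NT` and `GT` columns next to the new token-id columns.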


Let us take a look at a test example:

In [10]:
tokenized_billsum['test'][0]['NT']

'The equation hat C equals negative i sub y C plus open parenthesis d hat z plus i sub y B sub 2 close parenthesis wedge open parenthesis C to the minus i sub y C wedge fraction E to the star to the minus i sub y E to the star close fraction close parenthesiswe can infer that'

In [11]:
tokenized_billsum['test'][0]['GT']

'The equation $\\hat { C } = - i_{ y } C + ( d \\hat { z } + i_{ y } B_{ 2 } ) \\wedge \\left( C^{ ( - ) } - i_{ y } C \\wedge \\, { \\frac { { E }^{ { * } ( - ) } } { i_{ y } { E }^{ { * } } } } \\, \\right) \\,$we can infer that'

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [12]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model="t5-small")
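
As a rough illustration of dynamic padding, the sketch below pads a batch only up to its own longest sequence. It is a conceptual stand-in in plain Python, not the collator's code: the real collator returns tensors and handles more cases. `pad_batch` is a hypothetical name; the pad id 0 (T5's pad token) and the label pad value -100 are the defaults assumed here.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sequence to the longest one in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Labels are padded with -100 so the loss ignores those positions.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch

features = [
    {"input_ids": [5, 6, 7, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "labels": [4, 3, 2, 1]},
]
padded = pad_batch(features)
print(padded)
```

A batch of short sequences therefore stays short, whereas padding the whole dataset up front would stretch every example to the global maximum length.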

## Evaluation Metrics for Training

We will use the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric to evaluate the model during training. We will load the metric from the Hugging Face [Evaluate](https://huggingface.co/docs/evaluate/index) library.

In [13]:
! pip install -q evaluate rouge_score




In [14]:
import evaluate

rouge = evaluate.load("rouge")

Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<?, ?B/s]
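
For intuition about what the metric measures, here is a simplified sketch of a ROUGE-1 F1 score: the overlap of single words between a prediction and a reference. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge` metric loaded above exactly. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 with clipped counts (simplified ROUGE-1)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # per-word minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L is based on the longest common subsequence.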


Create a function that passes the predictions and labels to the ROUGE metric as follows:
- The `eval_pred` tuple is unpacked into predictions and labels.
- `tokenizer.batch_decode` decodes the tokenized predictions back to text, skipping special tokens such as padding tokens.
- `np.where` replaces every -100 in the labels array with the tokenizer's `pad_token_id`. The value -100 marks positions that are ignored during loss calculation and is not a valid token id, so it must be replaced before decoding.
- The labels are then decoded back to text in the same way.
- `rouge.compute` calculates the ROUGE metric between the decoded predictions and labels; ROUGE is a common metric for evaluating text summarization.
- The length of each prediction is calculated by counting its non-padding tokens, and the mean prediction length is added to the result dictionary under the key "gen_len".
- Finally, the values in the result dictionary are rounded to 4 decimal places for cleaner output, and the result is returned.

In [15]:
import numpy as np

def compute_metrics(eval_pred):
    # Unpacks the evaluation predictions tuple into predictions and labels.
    predictions, labels = eval_pred

    # Decodes the tokenized predictions back to text, skipping any special tokens (e.g., padding tokens).
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replaces any -100 values in labels with the tokenizer's pad_token_id.
    # This is done because -100 is often used to ignore certain tokens when calculating the loss during training.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decodes the tokenized labels back to text, skipping any special tokens (e.g., padding tokens).
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Computes the ROUGE metric between the decoded predictions and decoded labels.
    # The use_stemmer parameter enables stemming, which reduces words to their root form before comparison.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Calculates the length of each prediction by counting the non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Computes the mean length of the predictions and adds it to the result dictionary under the key "gen_len".
    result["gen_len"] = np.mean(prediction_lens)

    # Rounds each value in the result dictionary to 4 decimal places for cleaner output, and returns the result.
    return {k: round(v, 4) for k, v in result.items()}
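
The two array manipulations in `compute_metrics` can be hard to picture, so the sketch below repeats them on small nested lists in plain Python. `unmask_labels` and `mean_generated_length` are hypothetical helpers, and the pad id 0 is an assumption that matches T5.

```python
PAD_ID = 0  # assumed pad token id (T5 uses 0)

def unmask_labels(labels, pad_id=PAD_ID, ignore_index=-100):
    """Replace the loss-ignore value with the pad id so the ids can be decoded."""
    return [[pad_id if tok == ignore_index else tok for tok in row] for row in labels]

def mean_generated_length(predictions, pad_id=PAD_ID):
    """Average number of non-padding tokens per prediction ("gen_len")."""
    lengths = [sum(tok != pad_id for tok in row) for row in predictions]
    return sum(lengths) / len(lengths)

print(unmask_labels([[5, 1, -100, -100]]))
print(mean_generated_length([[0, 5, 6, 1, 0], [0, 7, 1, 0, 0]]))
```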


## Train

Load AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer classes from the Hugging Face transformers library:

In [16]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

Load the t5-small model:

In [17]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

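Define training hyperparameters in Seq2SeqTrainingArguments. Assign a value to `output_dir` to specify where the model is saved; it is a required parameter. Note that recent Transformers releases rename `evaluation_strategy` to `eval_strategy`, so use the name that your installed version accepts.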

In [18]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_fine_tuned_t5_small_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
)



Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and the `compute_metrics` function.

In [19]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Call train() to fine-tune the model:

In [20]:
trainer.train()

  0%|          | 312/1650632 [00:34<48:02:08,  9.54it/s]

KeyboardInterrupt: 
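
The total of 1,650,632 steps in the progress bar can be reproduced from the training set size, the batch size, and the number of epochs. The sketch below assumes a single device and no gradient accumulation; `total_training_steps` is a hypothetical helper, and 9.54 it/s is the rate shown in the progress bar.

```python
import math

def total_training_steps(n_examples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

steps = total_training_steps(6_602_527, 16, 4)
hours = steps / 9.54 / 3600  # rate taken from the progress bar above
print(steps, round(hours, 1))  # 1650632 48.1
```

At roughly 48 hours for the full schedule, it can be worth testing the pipeline on a small subset of the data or with fewer epochs first.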

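Observations: The run shown above was interrupted manually (`KeyboardInterrupt`) after 312 of 1,650,632 steps, so it never reached the end-of-epoch evaluation and `compute_metrics` was not called. The values reported for the last epoch (rouge1 0.1397, rouge2 0.1168, and rougeLsum 0.1168) are not reproduced by this output; they presumably come from a separate, completed run.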

Save the model:

In [None]:
trainer.save_model("my_fine_tuned_t5_small_model")

## Use the Fine-Tuned Model to Summarize Text

We have fine-tuned the t5-small model on the billsum dataset. We can use it for inference.

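We will use an example from the test dataset. The cells below index `billsum['test']` with `text` and `summary` columns, so they assume that the BillSum data has been loaded and split; the dataset loaded earlier in this notebook has only a `train` split with `NT` and `GT` columns.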

In [None]:
text = billsum['test'][100]['text']
text = "summarize: " + text
text

'summarize: The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 8610.5 is added to the Government Code, to read:\n8610.5.\n(a) For purposes of this section:\n(1) “Office” means the Office of Emergency Services.\n(2) “Previous fiscal year” means the fiscal year immediately prior to the current fiscal year.\n(3) “Utility” means an “electrical corporation” as defined in Section 218 of the Public Utilities Code.\n(b) (1) State and local costs to carry out activities pursuant to this section and Chapter 4 (commencing with Section 114650) of Part 9 of Division 104 of the Health and Safety Code that are not reimbursed by federal funds shall be borne by a utility operating a nuclear powerplant with a generating capacity of 50 megawatts or more.\n(2) The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a powerplant for its reasonable share of state agency costs specified in paragraph (1).\n(

The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Create a `pipeline` object for summarization with the fine-tuned model, and pass the text to it:

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="my_fine_tuned_t5_small_model")
pred = summarizer(text)
pred

Token indices sequence length is longer than the specified maximum sequence length for this model (1645 > 512). Running this sequence through the model will result in indexing errors


[{'summary_text': 'The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a nuclear powerplant for its reasonable share of state agency costs specified in paragraph (1), as required, to carry out activities pursuant to this section and Chapter 4 (commencing with Section 114650) of Part 9 of Division 104 of the Health and Safety Code, upon appropriation by the office, from time to time, for allocation by the Controller for deposit in the Nuclear Planning Assessment Special Account, which is continued in'}]

We can also manually replicate the results of the `pipeline`.


Tokenize the text and return the `input_ids` as PyTorch tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_fine_tuned_t5_small_model")
inputs = tokenizer(text, return_tensors="pt").input_ids
inputs

Token indices sequence length is longer than the specified maximum sequence length for this model (1643 > 512). Running this sequence through the model will result in indexing errors


tensor([[21603,    10,    37,  ...,  2017,     5,     1]])
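
The warning above says that the text produces 1,643 token ids while the model's nominal maximum is 512. One common remedy is to truncate the input, as the preprocessing function does with `truncation=True` and a `max_length`. The sketch below shows the idea in plain Python; `truncate_with_eos` is a hypothetical helper, and keeping the end-of-sequence id (1 for T5) as the last token is an assumption about how truncation is usually applied.

```python
def truncate_with_eos(token_ids, max_length=512, eos_id=1):
    """Cut a sequence to max_length while keeping the end-of-sequence id last."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    return list(token_ids[: max_length - 1]) + [eos_id]

long_ids = list(range(2, 1644)) + [1]  # 1643 ids, like the sequence in the warning
short_ids = truncate_with_eos(long_ids)
print(len(long_ids), len(short_ids), short_ids[-1])  # 1643 512 1
```

Truncation discards everything after the cut, so for long bills the model only sees the beginning of the document.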

Use the generate() method to generate the summary:

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("my_fine_tuned_t5_small_model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
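
With `do_sample=False` and a single beam (assumed here to be the default), generation is greedy: at each step the highest-scoring next token is chosen until the end-of-sequence token appears or `max_new_tokens` is reached. The sketch below shows that loop with a toy scoring function; `greedy_decode` and `toy_scores` are hypothetical and do not call the model.

```python
def greedy_decode(next_token_scores, eos_id=1, max_new_tokens=100):
    """Repeatedly pick the highest-scoring next token until EOS or the limit."""
    generated = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(generated)
        token = max(range(len(scores)), key=scores.__getitem__)
        generated.append(token)
        if token == eos_id:
            break
    return generated

def toy_scores(prefix):
    # Hypothetical "model": favours token 3 twice, then the EOS token (id 1).
    return [0.0, 1.0, 0.0, 0.5] if len(prefix) >= 2 else [0.0, 0.1, 0.0, 0.9]

print(greedy_decode(toy_scores))  # [3, 3, 1]
```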

Decode the generated token ids back into text:

In [None]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a nuclear powerplant for its reasonable share of state agency costs specified in paragraph (1), as required, for allocation by the Controller, upon appropriation by the Legislature, for allocation by the Controller, upon appropriation by the office, from time to time, of the amount of its share of the actual or anticipated state and local agency costs, as specified, for activities'

## Evaluate the Result

We can compute the ROUGE scores for the predicted summary by comparing it to the reference summary.

In [None]:
pred[0]['summary_text']

'The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a nuclear powerplant for its reasonable share of state agency costs specified in paragraph (1), as required, to carry out activities pursuant to this section and Chapter 4 (commencing with Section 114650) of Part 9 of Division 104 of the Health and Safety Code, upon appropriation by the office, from time to time, for allocation by the Controller for deposit in the Nuclear Planning Assessment Special Account, which is continued in'

In [None]:
preds = [pred[0]['summary_text']]

In [None]:
labels = [billsum['test'][100]['summary']]

In [None]:
rouge.compute(predictions=preds, references=labels, use_stemmer=True)

{'rouge1': 0.22745098039215686,
 'rouge2': 0.05905511811023622,
 'rougeL': 0.12156862745098039,
 'rougeLsum': 0.1647058823529412}

Great!! We have fine-tuned a pre-trained model in Hugging Face for text summarization.