<a href="https://colab.research.google.com/github/Abonia1/LLM-finetuning/blob/main/Fine_tuning_flan_t5_summarize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face - Summarization 

This notebook fine-tunes [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) for summarization.


Install the packages that the T5 tokenizer depends on.

In [1]:
!pip install protobuf==3.20.3



Install the packages required for ROUGE evaluation.

In [2]:
!pip install absl-py rouge_score nltk



Install other dependent packages.

In [3]:
!pip install numpy



In [4]:
!pip install transformers==4.28.0



In [5]:
!pip install datasets




## Check device

Check whether GPU is available.

In [6]:
import torch

if torch.cuda.is_available():
    print("GPU is enabled.")
    print("device count: {}, current device: {}".format(torch.cuda.device_count(), torch.cuda.current_device()))
else:
    print("GPU is not enabled.")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

GPU is enabled.
device count: 1, current device: 0


## Prepare data

In this example, we use the [XL-Sum English dataset](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/english) on Hugging Face, which consists of annotated article-summary pairs from BBC news.<br>
This dataset has 306,522 samples for training, plus 11,535 each for validation and test.

In [7]:
from datasets import load_dataset

ds = load_dataset("csebuetnlp/xlsum", name="english")
ds




DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'summary', 'text'],
        num_rows: 306522
    })
    test: Dataset({
        features: ['id', 'url', 'title', 'summary', 'text'],
        num_rows: 11535
    })
    validation: Dataset({
        features: ['id', 'url', 'title', 'summary', 'text'],
        num_rows: 11535
    })
})

In [8]:
ds["train"][0]

{'id': 'uk-wales-56321577',
 'url': 'https://www.bbc.com/news/uk-wales-56321577',
 'title': 'Weather alert issued for gale force winds in Wales',
 'summary': 'Winds could reach gale force in Wales with stormy weather set to hit the whole of the country this week.',

To prepare the inputs for fine-tuning, I tokenize each text and convert it into token ids.

First, load the tokenizer of the pre-trained ```google/flan-t5-small``` model.

In [9]:
from transformers import AutoTokenizer
t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
#t5_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

Apply the tokenization to the dataset for fine-tuning.

In [10]:
def tokenize_sample_data(data):
    # The longest input has 14536 tokens and the longest label has 215 tokens.
    # Here I truncate both to a smaller maximum length.
    input_feature = t5_tokenizer(data["text"], truncation=True, max_length=1024)
    label = t5_tokenizer(data["summary"], truncation=True, max_length=128)
    return {
        "input_ids": input_feature["input_ids"],
        "attention_mask": input_feature["attention_mask"],
        "labels": label["input_ids"],
    }

In [11]:
tokenized_ds = ds.map(
    tokenize_sample_data,
    remove_columns=["id", "url", "title", "summary", "text"],
    batched=True,
    batch_size=128)
tokenized_ds






DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 306522
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11535
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11535
    })
})

## Fine-tune

In this example, we use the flan-t5 model.<br>
flan-t5 comes in several sizes, and I'll use the smallest one (```google/flan-t5-small```) so that it fits in the memory of my machine. The name is "small", but it still has roughly 80 million parameters.

In [12]:
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# see https://huggingface.co/docs/transformers/main_classes/configuration
flant5_config = AutoConfig.from_pretrained(
    "google/flan-t5-small",
    max_length=128,
    length_penalty=0.6,
    no_repeat_ngram_size=2,
    num_beams=15,
)
model = (AutoModelForSeq2SeqLM
         .from_pretrained("google/flan-t5-small", config=flant5_config)
         .to(device))

We prepare the data collator, which batches and pads the tokenized samples.

For a sequence-to-sequence (seq2seq) task, we need not only to stack the inputs for the encoder, but also to prepare the decoder side. In a seq2seq setup, a common technique called "teacher forcing" is applied in the decoder: the decoder receives the reference tokens, shifted one position to the right, as its input.<br>
You don't need to set these steps up manually in Hugging Face; ```DataCollatorForSeq2Seq``` takes care of all of them.

In this collator, the padded positions in the labels are filled with the label id -100.<br>
These positions are then ignored in the subsequent loss computation.

In [13]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    t5_tokenizer,
    model=model,
    return_tensors="pt")
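
To make the padding and teacher-forcing steps concrete, here is a minimal pure-Python sketch of what a seq2seq collator does to one batch. It is a simplified illustration, not the actual ```DataCollatorForSeq2Seq``` implementation: the helper names (`pad_to`, `shift_right`, `collate`) and the token ids are made up for this example, and I assume the T5 convention that the pad token id (0) also serves as the decoder start token.

```python
PAD_ID = 0           # T5 pad token id, assumed to double as the decoder start token
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def pad_to(seq, length, value):
    """Right-pad a list of ids to the given length."""
    return seq + [value] * (length - len(seq))


def shift_right(labels, start_id=PAD_ID):
    """Build decoder inputs for teacher forcing.

    Prepend the start token, drop the last label, and replace
    ignored (-100) positions with the pad id.
    """
    shifted = [start_id] + labels[:-1]
    return [PAD_ID if t == IGNORE_INDEX else t for t in shifted]


def collate(samples):
    """Simplified stand-in for a seq2seq collator (hypothetical helper)."""
    max_in = max(len(s["input_ids"]) for s in samples)
    max_lab = max(len(s["labels"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [],
             "labels": [], "decoder_input_ids": []}
    for s in samples:
        labels = pad_to(s["labels"], max_lab, IGNORE_INDEX)
        batch["input_ids"].append(pad_to(s["input_ids"], max_in, PAD_ID))
        batch["attention_mask"].append(
            pad_to([1] * len(s["input_ids"]), max_in, 0))
        batch["labels"].append(labels)
        batch["decoder_input_ids"].append(shift_right(labels))
    return batch


# Two toy samples with arbitrary token ids (1 plays the role of </s>).
batch = collate([
    {"input_ids": [37, 1712, 19, 1], "labels": [1969, 1]},
    {"input_ids": [8774, 1], "labels": [2018, 55, 1]},
])
print(batch["labels"])             # [[1969, 1, -100], [2018, 55, 1]]
print(batch["decoder_input_ids"])  # [[0, 1969, 1], [0, 2018, 55]]
```

The shorter label sequence is padded with -100 so the loss skips that position, while the decoder input for the same sample uses the ordinary pad id.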

We also prepare a metrics function for evaluation during training.<br>
Measuring the quality of generated text is difficult, and BLEU and ROUGE are the metrics most often used.

Briefly speaking, BLEU measures how many of the n-grams in the generated (predicted) text also appear in the reference text. This score is used for evaluation especially in machine translation tasks.
However, in summarization, we want the generated text to contain all the important words that appear in the reference text. This is why we often use ROUGE in summarization tasks.
The idea of ROUGE is similar to BLEU, but it also measures how many of the n-grams in the reference text appear in the generated (predicted) text. (This is why the name ROUGE includes "RO", which stands for "Recall-Oriented".)<br>
There are also variants, ROUGE-L and ROUGE-Lsum, which are based on the longest common subsequence (LCS). ROUGE-Lsum computes it sentence by sentence, with sentences separated by line breaks.
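
As a rough illustration of the ideas above, the following sketch computes n-gram precision, recall, and F1, plus the LCS length that ROUGE-L builds on. It is a toy version under simple assumptions (white space tokens, a single reference, no stemming), and the function names are my own; it is not how the `rouge_score` package is implemented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, prediction, n=1):
    """Return (precision, recall, f1) of the n-gram overlap."""
    ref, pred = ngrams(reference, n), ngrams(prediction, n)
    overlap = sum((ref & pred).values())  # clipped counts of shared n-grams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence (the basis of ROUGE-L)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


reference = "the cat sat on the mat".split()
prediction = "the cat lay on a mat".split()

p1, r1, f1 = rouge_n(reference, prediction, n=1)
p2, r2, f2 = rouge_n(reference, prediction, n=2)
print(round(r1, 3), round(r2, 3))         # 0.667 0.2
print(lcs_length(reference, prediction))  # 4
```

Recall is the share of reference n-grams found in the prediction, and precision is the share of predicted n-grams found in the reference.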

In Hugging Face, you don't need to implement this logic manually and can use built-in objects to compute these metrics.<br>
By default, ROUGE evaluation uses white space tokenization. In this example, I have instead configured the flan-t5 tokenizer (which is based on SentencePiece Unigram segmentation) as a custom tokenizer for the computation.

> Note : You can also specify multilingual stemmer.

> Note : As I have mentioned above, the data collator sets the padded label positions to -100, so I convert them back into the pad token id before decoding.

In [14]:
!pip install evaluate



In [15]:
!pip install rouge_score



In [16]:
import evaluate
import numpy as np
from nltk.tokenize import RegexpTokenizer

rouge_metric = evaluate.load("rouge")

def tokenize_sentence(arg):
    encoded_arg = t5_tokenizer(arg)
    return t5_tokenizer.convert_ids_to_tokens(encoded_arg.input_ids)

def metrics_func(eval_arg):
    preds, labels = eval_arg
    # Replace -100
    labels = np.where(labels != -100, labels, t5_tokenizer.pad_token_id)
    # Convert id tokens to text
    text_preds = t5_tokenizer.batch_decode(preds, skip_special_tokens=True)
    text_labels = t5_tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Insert a line break (\n) between sentences for ROUGE-Lsum scoring.
    # (Note : The sentence splitting below targets Japanese punctuation such as "。".
    #  Change it when you work with any other language, including English.)
    text_preds = [(p if p.endswith(("!", "！", "?", "？", "。")) else p + "。") for p in text_preds]
    text_labels = [(l if l.endswith(("!", "！", "?", "？", "。")) else l + "。") for l in text_labels]
    sent_tokenizer_jp = RegexpTokenizer(u'[^!！?？。]*[!！?？。]')
    text_preds = ["\n".join(np.char.strip(sent_tokenizer_jp.tokenize(p))) for p in text_preds]
    text_labels = ["\n".join(np.char.strip(sent_tokenizer_jp.tokenize(l))) for l in text_labels]
    # compute ROUGE score with custom tokenization
    return rouge_metric.compute(
        predictions=text_preds,
        references=text_labels,
        tokenizer=tokenize_sentence
    )
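
Since the sentence splitting in the cell above is written for Japanese punctuation, here is a sketch of an English-oriented alternative for the line-break step. It assumes a simple regular-expression heuristic (split after ".", "!" or "?" followed by white space) and the helper names are hypothetical; it will mis-split abbreviations such as "Dr. Smith", so treat it as a starting point only.

```python
import re

# Split after ., ! or ? when followed by white space (heuristic only).
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences_en(text):
    """Split English text into sentences with a simple punctuation rule."""
    return [s.strip() for s in _SENT_END.split(text.strip()) if s.strip()]


def to_rouge_lsum_format(text):
    """Join sentences with line breaks, as ROUGE-Lsum expects."""
    return "\n".join(split_sentences_en(text))


example = "Winds could reach gale force. Stay indoors! Is it safe?"
print(to_rouge_lsum_format(example))
# Winds could reach gale force.
# Stay indoors!
# Is it safe?
```

In `metrics_func`, the two list comprehensions that build `text_preds` and `text_labels` could call `to_rouge_lsum_format` instead. Note that the ROUGE values would then differ from the ones recorded in this notebook.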

Before fine-tuning, I check the ROUGE score of the plain flan-t5 model. Here I check the scores for the first 5 rows of the test dataset.

The scores are modest. flan-t5 has been instruction-tuned on a mixture of tasks, but it has not been fine-tuned on this dataset, and no task instruction is added to the input here.

> Note : To avoid suboptimal text generation, I use beam search as the text generation algorithm here.

In [17]:
from torch.utils.data import DataLoader

sample_dataloader = DataLoader(
    tokenized_ds["test"].with_format("torch"),
    collate_fn=data_collator,
    batch_size=5)
for batch in sample_dataloader:
    with torch.no_grad():
        preds = model.generate(
            batch["input_ids"].to(device),
            num_beams=15,
            num_return_sequences=1,
            no_repeat_ngram_size=1,
            remove_invalid_values=True,
            max_length=128,
        )
    labels = batch["labels"]
    break

metrics_func([preds, labels])



{'rouge1': 0.3624369577411001,
 'rouge2': 0.12602818905339913,
 'rougeL': 0.24435002241919287,
 'rougeLsum': 0.24435002241919287}
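
The beam search requested with `num_beams` in the `generate()` call above can be illustrated with a toy example. The sketch below uses a made-up next-token probability table (`NEXT`) instead of a real model, and it leaves out everything else that `generate()` does (length penalty, n-gram blocking, batching). It only shows why keeping several candidates can find a more probable sequence than greedy decoding.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token.
NEXT = {
    "<s>":   {"the": 0.6, "a": 0.4},
    "the":   {"storm": 0.5, "wind": 0.5},
    "a":     {"storm": 0.9, "wind": 0.1},
    "storm": {"</s>": 1.0},
    "wind":  {"</s>": 1.0},
}


def beam_search(num_beams, max_length=5):
    """Keep the num_beams best partial sequences by total log-probability."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_length):
        # Expand every live beam by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, prob in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        # Keep only the best num_beams candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'the', 'storm', '</s>'] 0.3
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'a', 'storm', '</s>'] 0.36
```

With one beam the search commits to "the" because it is the best first token, while with two beams it keeps "a" alive and ends with the more probable sequence overall.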

We prepare the training arguments for fine-tuning.<br>
In this example, we use the Hugging Face Transformers trainer class, with which you can run training without writing the training loop manually.

In ordinary training evaluation, the loss and accuracy are computed by comparing the output logits with the labels. However, as we saw above, we want to evaluate the ROUGE score using the predicted tokens.<br>
To simplify these sequence-to-sequence specific steps, here I use the built-in ```Seq2SeqTrainingArguments``` and ```Seq2SeqTrainer``` classes in Hugging Face, instead of the usual ```TrainingArguments``` and ```Trainer```.<br>
By setting ```predict_with_generate=True``` in this class, the tokens generated by ```model.generate()``` are used in each evaluation.

The checkpoint files (one every 500 steps) are saved in the folder named ```flant5-summarize-en```.

> Note : Do not use FP16 precision in flan-t5 fine-tuning.

> Note : In general, the checkpoints saved during training can become very large.<br>
> Set the ```save_total_limit``` property (which limits the total number of checkpoints by deleting the older ones) to save disk space, or expand disks in Azure VM. (See [here](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/expand-disks) to expand disks in Azure.)

In [18]:
!pip install --upgrade accelerate



In [19]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir = "flant5-summarize-en",
    log_level = "error",
    num_train_epochs = 10,
    learning_rate = 5e-4,
    lr_scheduler_type = "linear",
    warmup_steps = 90,
    optim = "adafactor",
    weight_decay = 0.01,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size = 1,
    gradient_accumulation_steps = 16,
    evaluation_strategy = "steps",
    eval_steps = 100,
    predict_with_generate=True,
    generation_max_length = 128,
    save_steps = 500,
    logging_steps = 10,
    push_to_hub = False
)
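
To get a feel for why the note on ```save_total_limit``` matters, the sketch below estimates how many checkpoint folders the configuration above would leave on disk. It assumes a single GPU and approximates the number of optimizer steps per epoch by dividing the 306,522 training rows by the effective batch size; `count_checkpoints` and `checkpoint_kwargs` are names I made up for this example.

```python
import math


def count_checkpoints(total_steps, save_steps, save_total_limit=None):
    """Number of checkpoint folders left on disk at the end of training."""
    written = total_steps // save_steps
    if save_total_limit is None:
        return written
    return min(written, save_total_limit)


# Values taken from the training arguments above (single GPU assumed).
train_rows = 306522
effective_batch = 2 * 16  # per-device batch size x gradient accumulation steps
steps_per_epoch = math.ceil(train_rows / effective_batch)  # approximation
total_steps = steps_per_epoch * 10  # num_train_epochs = 10

print(count_checkpoints(total_steps, save_steps=500))                      # 191
print(count_checkpoints(total_steps, save_steps=500, save_total_limit=2))  # 2

# Extra keyword arguments that could be passed to Seq2SeqTrainingArguments.
checkpoint_kwargs = {"save_steps": 500, "save_total_limit": 2}
```

With roughly 190 checkpoints for a full 10-epoch run, capping the count is the simpler option compared with expanding the disk.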

In [22]:
"""training_args = Seq2SeqTrainingArguments(
    output_dir = "flant5-summarize-en",
    log_level = "error",
    num_train_epochs = 5,   # Decreased number of epochs
    learning_rate = 1e-4,   # Decreased learning rate
    lr_scheduler_type = "linear",
    warmup_steps = 90,
    optim = "adafactor",
    weight_decay = 0.01,
    per_device_train_batch_size = 8,   # Increased batch size
    per_device_eval_batch_size = 4,   
    gradient_accumulation_steps = 4,   # Decreased gradient accumulation steps
    evaluation_strategy = "steps",
    eval_steps = 200,   # Decreased evaluation frequency
    predict_with_generate=True,
    generation_max_length = 256,   # Increased summary length
    save_steps = 500,
    logging_steps = 10,
    push_to_hub = False
)"""


Build the trainer. (Put it all together.)

Because the ROUGE scoring in evaluation is expensive to compute, I use only the first 20 rows of the validation set.

In [25]:
training_args = Seq2SeqTrainingArguments(
    output_dir="flant5-xlsum",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,  # was True in the logged run below; flan-t5 is unstable in FP16 (see the note above)
    push_to_hub=False,
    logging_dir="./logs",
    logging_first_step=True,
    overwrite_output_dir=True,
    max_steps=500,  # Stop after 500 optimizer steps (overrides num_train_epochs)
)
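
Because ```max_steps``` overrides ```num_train_epochs```, this short run sees only a small part of the training set. The following sketch works out how much, assuming a single GPU; `epochs_covered` is a helper name I made up for this example.

```python
def epochs_covered(max_steps, per_device_batch, grad_accum, train_rows,
                   num_devices=1):
    """Fraction of the training set seen after max_steps optimizer updates."""
    samples_seen = max_steps * per_device_batch * grad_accum * num_devices
    return samples_seen / train_rows


# Values from the training arguments above and the dataset size (306,522 rows).
fraction = epochs_covered(max_steps=500, per_device_batch=8,
                          grad_accum=2, train_rows=306522)
print(500 * 8 * 2)        # 8000 samples seen
print(f"{fraction:.2f}")  # 0.03
```

This agrees with the ```'epoch': 0.03``` reported in the training log of this notebook, so the run covers about 3% of one epoch.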


In [26]:
from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    compute_metrics = metrics_func,
    train_dataset = tokenized_ds["train"],
    eval_dataset = tokenized_ds["validation"].select(range(20)),
    tokenizer = t5_tokenizer,
)

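Now let's run training.<br>
When periodic evaluation is enabled (```evaluation_strategy = "steps"```), you can watch the ROUGE scores change during training. The log below comes from the 500-step run with ```fp16=True```, which has no evaluation steps. Its training loss of 0.0 is likely a symptom of the FP16 numerical problems mentioned above, so rerun without FP16 before relying on this model.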

> Note : As I have mentioned above, make sure that you have enough disk space.

In [27]:
trainer.train()



{'loss': 0.0, 'learning_rate': 0.0003, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 9.96e-05, 'epoch': 0.03}
{'train_runtime': 452.2943, 'train_samples_per_second': 17.688, 'train_steps_per_second': 1.105, 'train_loss': 0.0, 'epoch': 0.03}


TrainOutput(global_step=500, training_loss=0.0, metrics={'train_runtime': 452.2943, 'train_samples_per_second': 17.688, 'train_steps_per_second': 1.105, 'train_loss': 0.0, 'epoch': 0.03})

You can save the trained model in order to use it later.

In [28]:
import os

os.makedirs("./trained_for_summarization_en", exist_ok=True)
if hasattr(trainer.model, "module"):
    trainer.model.module.save_pretrained("./trained_for_summarization_en")
else:
    trainer.model.save_pretrained("./trained_for_summarization_en")

Load the fine-tuned model from the local directory.

In [29]:
from transformers import AutoModelForSeq2SeqLM

model = (AutoModelForSeq2SeqLM
         .from_pretrained("./trained_for_summarization_en")
         .to(device))

## Generate Text (Summarize) with Fine-Tuned Model

Now let's see how the fine-tuned model generates text for summarization.<br>
Here I generate summaries for the test data, which was not seen during training.

> Note : Each article in the XL-Sum dataset is created by removing the first sentence (headline sentence) of the BBC news source, and that first sentence is then used as the summary.<br>
>  For this reason, there might be some mismatches between article and summary in the test data. (Choose appropriate samples for checking.)

In [30]:
from torch.utils.data import DataLoader

# Predict with test data (first 5 rows)
sample_dataloader = DataLoader(
    tokenized_ds["test"].with_format("torch"),
    collate_fn=data_collator,
    batch_size=5)
for batch in sample_dataloader:
    with torch.no_grad():
        preds = model.generate(
            batch["input_ids"].to(device),
            num_beams=15,
            num_return_sequences=1,
            no_repeat_ngram_size=1,
            remove_invalid_values=True,
            max_length=128,
        )
    labels = batch["labels"]
    break

# Replace -100 (see above)
labels = np.where(labels != -100, labels, t5_tokenizer.pad_token_id)

# Convert id tokens to text
text_preds = t5_tokenizer.batch_decode(preds, skip_special_tokens=True)
text_labels = t5_tokenizer.batch_decode(labels, skip_special_tokens=True)

# Show result
print("***** Input's Text *****")
print(ds["test"][0]["text"])
print("***** Summary Text (True Value) *****")
print(text_labels[0])
print("***** Summary Text (Generated Text) *****")
print(text_preds[0])

***** Input's Text *****
By Kate DaileyBBC News Earlier this week, Trump posted a photo of himself sitting at a desk at Mar-a-Largo, a permanent marker hovering over a notepad. "Writing my inaugural address at the Winter White House, Mar-a-Lago, three weeks ago. Looking forward to Friday," he tweeted. Trump vows to end 'American carnage' Trump's angry call to arms Full text of Trump's inauguration speech It's unclear whether the president-elect actually wrote the speech himself, but the content was pure Trump: the same populist message that resonated throughout the primaries and the campaign. "Today, we are not merely transferring power from one administration to another, or from one party to another, but we are transferring power from Washington, DC, and giving it back to you, the people," he said at the beginning of his remarks. For some on Twitter, it bore an eerie similarity to the Batman villain Bane's speech in The Dark Night Rises, so much so that someone posted a 10-second mash

In [31]:
print("***** Input's Text *****")
print(ds["test"][2]["text"])
print("***** Summary Text (True Value) *****")
print(text_labels[2])
print("***** Summary Text (Generated Text) *****")
print(text_preds[2])

***** Input's Text *****
By Jon Welch and Paul MoseleyBBC News Details of health problems, family bereavements and personal issues were sent by the University of East Anglia (UEA) in Norwich to 298 students. Megan Baynes, 23, said she felt "sick and horrified" when she realised her details had been shared. The UEA apologised "unreservedly" and said an inquiry had begun. The email contained a spreadsheet listing 172 names and details extenuating circumstances in which extensions and other academic concessions were granted to 42 students. 'Felt sick' It was sent to nearly 300 undergraduates, including Ms Baynes, a former editor of student newspaper Concrete. She is currently awaiting the results of her American Literature and Creative Writing degree, and had been granted extensions for coursework because of an illness suffered by a family member. "I felt sick at seeing my personal situation written in a spreadsheet, and then seemingly sent to everyone on my course," she said. "My situati