## Fine Tuning the T5 Base Large Language Model with Low Rank Adaptation Parameter Efficient Fine Tuning for Summarization of BBC News Articles

In [2]:
#Install relevant packages
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Collecting pip
  Obtaining dependency information for pip from https://files.pythonhosted.org/packages/e0/63/b428aaca15fcd98c39b07ca7149e24bc14205ad0f1c80ba2b01835aedde1/pip-23.3-py3-none-any.whl.metadata
  Using cached pip-23.3-py3-none-any.whl.metadata (3.5 kB)
Using cached pip-23.3-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.2.1
    Uninstalling pip-23.2.1:
      Successfully uninstalled pip-23.2.1
Successfully installed pip-23.3
Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
pathos 0.3.1 requires dill>=0.3.7,

In [3]:
#Import Modules
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

### Load dataset from HuggingFace Hub (https://huggingface.co/datasets/gopalkalpande/bbc-news-summary) 
The dataset contains 2,224 articles and their corresponding summaries, which we will use to fine-tune the model.

In [4]:
#Import dataset from HuggingFace Hub - BBC News article summarization
huggingface_dataset_name = "gopalkalpande/bbc-news-summary"

dataset = load_dataset(huggingface_dataset_name)

dataset

Found cached dataset csv (/root/.cache/huggingface/datasets/gopalkalpande___csv/gopalkalpande--bbc-news-summary-f610c9f6377bc0fc/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['File_path', 'Articles', 'Summaries'],
        num_rows: 2224
    })
})

### Load T5-Base model from HuggingFaceHub (https://huggingface.co/t5-base)
This LLM was chosen to fit my limited compute budget. T5-Base is a relatively small model, with about 223 million parameters. Cutting-edge LLMs such as GPT-4 are far larger (a widely repeated but unconfirmed estimate is 1.76 trillion parameters), so we can expect noticeably weaker baseline performance than would be achievable with a less restrictive compute budget.

We load the model using the AutoModelForSeq2SeqLM class from Transformers. The AutoModelForSeq2SeqLM class is suitable for sequence to sequence tasks such as text summarization. We load the pretrained model weights using the .from_pretrained method. 
By default, the T5-Base model weights are stored as 32-bit floating point (float32) values. We sacrifice some precision to reduce the computational cost by loading the weights as bfloat16 instead. A bfloat16 number uses 16 bits (2 bytes), whereas a float32 number uses 32 bits (4 bytes), so bfloat16 halves the memory needed for the weights.
bfloat16 keeps the same 8 exponent bits as float32 and shortens the mantissa, so the range of representable values is preserved while precision drops. Reducing the precision of model weights in this way is often described as a form of quantization.
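
As a rough back-of-the-envelope check of the saving, the sketch below multiplies the parameter count by the bytes per value. It assumes the 222,903,552 parameters that this notebook prints for T5-Base and counts the weights only (activations, gradients and optimizer state are ignored). The helper name `weight_memory_mb` is illustrative, not a library function.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: weights only; activations, gradients and optimizer state are ignored.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}
T5_BASE_PARAMS = 222_903_552  # parameter count printed by this notebook


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mb(T5_BASE_PARAMS, dtype):.0f} MB")
```

Under these assumptions the weights take roughly 892 MB in float32 and 446 MB in bfloat16.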

We also load the associated tokenizer using the Transformers AutoTokenizer class.

In [5]:
#Load model as "original_model" and instantiate tokenizer
model_name='t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [6]:
#Print the number of parameters trainable in the original T5-Base model (just a sanity check, should be = 100%)
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 222903552
all model parameters: 222903552
percentage of trainable model parameters: 100.00%


### Generate an example summary using the T5-Base model prior to FineTuning.

The base model does a reasonable job of generating a summary, but we hope to improve on it with PEFT.

In [7]:
index = 42

dialogue = dataset['train'][index]['Articles']
summary = dataset['train'][index]['Summaries']

prompt = f"""
Summarize the following article.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-' * 99  # 99 dashes, used as a visual separator
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE ARTICLE ABSTRACT:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following article.

Labour MP praises Tory campaign..The Conservatives have been "a lot smarter" in the way they have conducted the general election campaign, a Labour backbencher has said...Derek Wyatt said having a five month campaign "turned off voters" and suggested people were already "rather bored of the thing". He wants a greater campaigning role for Chancellor Gordon Brown. Labour said the economy was at the heart of the campaign and Mr Brown therefore had a prominent role. But Mr Wyatt argued: "By some way, he is currently the figure in all of the polls that people trust and see that has delivered over eight years an economy unmatched anywhere in the world. "So, it would be a tad foolish of the Labour Party if we did not use him as we have done over the past three elections."..Labour's election chief Alan Milburn denied there was an attempt to sideli

### Reduce dataset size for PEFT
PEFT typically needs only around 500-1000 training examples. Therefore I keep every other example (1,112 articles) for training and assign the rest to a test dataset for model evaluation.

In [8]:
#Split the dataset: even-indexed rows are kept for training, odd-indexed rows form the test set.
#This halves the training set (PEFT needs roughly 500-1000 examples) and reduces computational expense.
test_dataset = dataset.filter(lambda example, index: index % 2 != 0, with_indices=True)
dataset = dataset.filter(lambda example, index: index % 2 == 0, with_indices=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/gopalkalpande___csv/gopalkalpande--bbc-news-summary-f610c9f6377bc0fc/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-ed3d58fccbba7cea.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/gopalkalpande___csv/gopalkalpande--bbc-news-summary-f610c9f6377bc0fc/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-9b6d52941bb1ee21.arrow
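
The every-other-row split can be checked on plain Python indices without loading the dataset. This is an illustrative sketch only: `split_by_parity` is a hypothetical helper, not part of the `datasets` library, and it simply mirrors the two lambda filters in the cell above.

```python
def split_by_parity(num_rows):
    """Return (train_indices, test_indices): even rows train, odd rows test."""
    train_idx = [i for i in range(num_rows) if i % 2 == 0]
    test_idx = [i for i in range(num_rows) if i % 2 != 0]
    return train_idx, test_idx


train_idx, test_idx = split_by_parity(2224)  # 2224 rows in this dataset
print(len(train_idx), len(test_idx))         # 1112 1112
print(set(train_idx) & set(test_idx))        # set(): the halves do not overlap
```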


### Function to assemble the prompts and completions and to tokenize using the tokenizer
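We want our training dataset to be in the prompt/completion format: each article is wrapped in the instruction prompt, and the corresponding summary is the completion used as the label.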

In [9]:
def tokenize_function(example):
    start_prompt = 'Summarize the following article.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["Articles"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["Summaries"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# This dataset has a single split ('train'), as shown by the DatasetDict output above.
# With batched=True, tokenize_function receives a batch of examples at a time.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['File_path', 'Articles', 'Summaries',])

Map:   0%|          | 0/1112 [00:00<?, ? examples/s]
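
One optional refinement, which this notebook does not apply, is to stop padding tokens in the labels from contributing to the loss. The sketch below assumes the usual Transformers convention that label positions set to -100 are ignored by the cross-entropy loss, and that T5's pad token id is 0. `mask_padding` is a hypothetical helper and the token ids in the toy batch are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Transformers
PAD_TOKEN_ID = 0     # assumption: T5's pad token id (tokenizer.pad_token_id)


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Toy batch of two padded label sequences (ids are made up for illustration).
labels = [[71, 1499, 1, 0, 0], [37, 1, 0, 0, 0]]
print(mask_padding(labels))
# [[71, 1499, 1, -100, -100], [37, 1, -100, -100, -100]]
```

Applying a step like this to `example['labels']` would change the training loss, so the results reported in this notebook would not carry over unchanged.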

### We will perform PEFT FineTuning using Low Rank Adaptation (LoRA)
Parameter Efficient Fine Tuning (PEFT) refers to the less computationally expensive adaptation of an existing LLM by adjusting only a small portion of the model weights and freezing the rest. This is in contrast to full fine tuning, which adjusts all model weights. Full fine tuning is not only more computationally expensive, it also carries a significant risk of catastrophic forgetting, where the model loses abilities it learned during pretraining. Because PEFT adjusts only a small portion of the weights, it largely avoids this problem while often achieving comparable performance improvements.

Low Rank Adaptation (LoRA) is a PEFT method that freezes the original weights and adds a pair of much smaller, lower-rank matrices alongside selected weight matrices. The product of these two matrices has the same shape as the weight matrix they adapt. Only the two lower-rank matrices are updated during fine-tuning. After training, their product is added to the frozen original weights to give the fine-tuned weight matrix, so the model is adapted while training only a much smaller number of weights.
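
In the notation of the LoRA paper (assumed here, since this notebook does not define symbols), the adapted weight for one target matrix can be written as:

$$
W' = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

Here $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight, $A$ and $B$ are the two trainable low-rank matrices, $r$ is the rank and $\alpha$ is the scaling constant (`lora_alpha`). With `r=32` and `lora_alpha=32`, as configured in this notebook, the scale factor $\alpha / r$ equals 1.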

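Here we configure LoRA through an instance of the LoraConfig class. The target modules, q and v, are the query and value projections in T5's attention layers. They were selected by inspecting the model structure, following the LoRA paper's practice of adapting attention weights.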

In [16]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM 
)

Sanity check the number of trainable weights in the PEFT model. Only the LoRA matrices should be trainable.

In [17]:
peft_model = get_peft_model(original_model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 226442496
percentage of trainable model parameters: 1.56%
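
The trainable count above can be reproduced by hand. The sketch assumes the published T5-Base shape (hidden size 768, 12 encoder blocks and 12 decoder blocks) and that `q` and `v` match the query and value projections in every self-attention and cross-attention layer. Those figures come from the model configuration, not from this notebook.

```python
D_MODEL = 768              # assumption: T5-Base hidden size from the model config
ENCODER_BLOCKS = 12        # each encoder block has one self-attention layer
DECODER_BLOCKS = 12        # each decoder block has self-attention plus cross-attention
RANK = 32                  # r in the LoraConfig above
TARGETS_PER_ATTENTION = 2  # "q" and "v"

attention_layers = ENCODER_BLOCKS + 2 * DECODER_BLOCKS       # 36
adapted_matrices = attention_layers * TARGETS_PER_ATTENTION  # 72
# Each adapted matrix gets A (RANK x D_MODEL) and B (D_MODEL x RANK).
params_per_matrix = 2 * D_MODEL * RANK                       # 49152
lora_params = adapted_matrices * params_per_matrix

base_params = 222_903_552  # printed earlier for the original model
total = base_params + lora_params
print(lora_params)                          # 3538944
print(total)                                # 226442496
print(f"{100 * lora_params / total:.2f}%")  # 1.56%
```

The numbers agree with the printed output, which suggests that the LoRA adapters on the query and value projections account for all of the trainable parameters.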


### Define training parameters using the Transformers TrainingArguments and Trainer classes


In [22]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_steps=1,
    max_steps=10
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

### Train the LoRA matrices and therefore fine-tune the model
Training loss fell from 3.09 at step 1 to 1.40 at step 10, although not monotonically: with logging_steps=1 each value reflects a single batch, so the curve is noisy. Because max_steps=10 takes precedence over num_train_epochs, training stops after 10 optimizer steps. The run was kept short to reduce computational expense. In a production environment, longer training would likely improve performance.
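
To put the 10 steps in context, the sketch below estimates how much of the training set the model actually sees. It assumes the Trainer default batch size of 8 on a single device with no gradient accumulation. Because `auto_find_batch_size=True` can lower the batch size if memory runs out, the result is an upper bound under that assumption and not a measured value.

```python
TRAIN_EXAMPLES = 1112   # rows in the training half of the dataset
MAX_STEPS = 10          # max_steps in TrainingArguments
ASSUMED_BATCH_SIZE = 8  # assumption: Trainer default per-device batch size, one device

examples_seen = MAX_STEPS * ASSUMED_BATCH_SIZE
fraction_of_epoch = examples_seen / TRAIN_EXAMPLES
print(f"{examples_seen} examples, {fraction_of_epoch:.1%} of one epoch")
```

Under these assumptions the run covers at most 80 examples, which is well under a tenth of a single epoch.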

In [23]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)


Step,Training Loss
1,3.0938
2,2.1406
3,2.0625
4,0.8945
5,2.125
6,1.0156
7,1.2891
8,1.0859
9,1.6719
10,1.3984


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

### Instantiate new version of T5-Base model for comparison with PEFT model

In [24]:
model_name='t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(print_number_of_trainable_model_parameters(original_model))
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 222903552
all model parameters: 222903552
percentage of trainable model parameters: 100.00%
trainable model parameters: 3538944
all model parameters: 226442496
percentage of trainable model parameters: 1.56%


### Qualitative analysis of PEFT output vs T5-Base model
In this example, the summary from the PEFT model improves on the T5-Base output: it draws more specific information from the article and is closer to the human baseline.

Again, the limited size of this LLM means that its performance is still inferior to that of larger LLMs such as GPT-4.

In [25]:
index = 8

dialogue = dataset['train'][index]['Articles']
human_baseline_summary = dataset['train'][index]['Summaries']

prompt = f"""
Summarize the following article.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'PEFT MODEL:\n{peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Ministers have insisted they are committed to free personal care for the elderly despite research suggesting the cost of the policy was under-estimated.But the Scottish National Party called on ministers to reassure people that enough funding is in place to support the free personal care policy."We will look in great detail at any contribution to this, because we need to be sure we can provide free personal care and nursing care for our older people into the future.A report by the Fraser of Allander Institute says the decision to push ahead with the flagship policy was based on flawed research.Ms Sturgeon said that while she had no reason to doubt the executive's support for the policy, there were questions which needed to be answered and, if necessary, sums redone.The rise in costs stems from a series of mistakes in the research used by the "care development grou

### Quantitative analysis of PEFT output vs T5-Base model
To quantitatively compare the PEFT fine-tuned model with the T5-Base model, we will use ROUGE metrics.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of summaries by comparing them to reference summaries.

Rouge1: measures the overlap of unigrams (single words) between the generated summary and the reference summary.

Rouge2: measures the overlap of bigrams (pairs of consecutive words) between the generated summary and the reference summary.

RougeL: is based on the longest common subsequence (LCS) of words shared by the generated summary and the reference summary, treating each summary as a single sequence.

RougeLsum: applies the LCS computation at the summary level, splitting each summary into sentences on newlines and combining the sentence-level matches.
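
To make the definitions concrete, here is a deliberately simplified sketch of n-gram overlap and LCS length. It is not the implementation used by the `rouge_score` package: tokens are split on whitespace and lowercased, there is no stemming, and the function names `rouge_n` and `lcs_length` are illustrative.

```python
from collections import Counter


def rouge_n(prediction: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: whitespace tokens, lowercased, no stemming."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(prediction, reference, n=1))                # 5 of 6 unigrams match
print(rouge_n(prediction, reference, n=2))                # 3 of 5 bigrams match
print(lcs_length(prediction.split(), reference.split()))  # 5
```

In this toy pair one changed word costs a single unigram but two bigrams, which is why Rouge2 scores are usually lower than Rouge1 scores for the same summaries.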

In [26]:
#Instantiate rouge object
rouge = evaluate.load('rouge')

In [27]:
#Compute rouge metrics for the summaries of 10 articles

dialogues = test_dataset['train'][0:10]['Articles']
human_baseline_summaries = test_dataset['train'][0:10]['Summaries']

original_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following article.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

Token indices sequence length is longer than the specified maximum sequence length for this model (683 > 512). Running this sequence through the model will result in indexing errors


ORIGINAL MODEL:
{'rouge1': 0.23195265368732992, 'rouge2': 0.12905813701163638, 'rougeL': 0.17414767538371534, 'rougeLsum': 0.17212970725660626}
PEFT MODEL:
{'rouge1': 0.2973414104103966, 'rouge2': 0.16645504819406864, 'rougeL': 0.19602264106657702, 'rougeLsum': 0.1956530983732903}


### ROUGE metrics show improved performance of the PEFT fine-tuned model vs the T5-Base model
Relative increase of each score over the T5-Base model:

Rouge1 increase = 28%

Rouge2 increase = 29%

RougeL increase = 13%

RougeLsum increase = 14%

This indicates that the PEFT fine-tuning process was successful on the 10 test articles evaluated here.
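
The percentages above are relative increases over the T5-Base scores. A minimal sketch of the calculation, using the scores printed by the evaluation cell (the variable names are illustrative):

```python
# Scores copied from the evaluation output in this notebook.
original = {
    "rouge1": 0.23195265368732992,
    "rouge2": 0.12905813701163638,
    "rougeL": 0.17414767538371534,
    "rougeLsum": 0.17212970725660626,
}
peft = {
    "rouge1": 0.2973414104103966,
    "rouge2": 0.16645504819406864,
    "rougeL": 0.19602264106657702,
    "rougeLsum": 0.1956530983732903,
}

# Relative gain = (new - old) / old, reported as a whole percentage.
relative_gain = {m: (peft[m] - original[m]) / original[m] for m in original}
for metric, gain in relative_gain.items():
    print(f"{metric}: {gain:.0%}")
```

In absolute terms the gains are smaller, for example about 6.5 points of Rouge1 (0.232 to 0.297), so it helps to state which of the two is being reported.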