# Fine tuning a generative AI Model for Dialgue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, which provides a high quality instruction tuned model and can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM


<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

To begin with, check that the kernel is selected correctly.

If you click on that (top right of the screen), you'll be able to see and check the details of the image, kernel, and instance type.

Now install the required packages for the LLM and datasets.


In [None]:
# Remove # to install the required packages for the LLM and datasets.

# pip install --upgrade pip
# pip install --disable-pip-version-check \
#    tokenizers==0.12.1 \
#    torch==1.13.1+cu117 torchvision>=0.13.1+cu117 torchaudio>=0.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir \
#    torchdata==0.5.1 --quiet
# 
# pip install \
#     transformers==4.27.2 \
#     datasets==2.11.0 \
#     evaluate==0.4.0 \
#     rouge_score==0.1.2 \
#     loralib==0.1.1 \
#     peft==0.3.0 --quiet

Import the necessary components. Some of them are new for this week, they will be discussed later in the notebook.

In [1]:
from datasets import load_dataset, list_datasets, load_from_disk
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, AutoModel, AutoModelForMaskedLM
import torch
import evaluate
import pandas as pd
import numpy as np

# Added for local install
import os

# Added to remove warnings
import warnings
warnings.filterwarnings('ignore')

<a name='1.2'></a>
### 1.2.1 - Load Dataset and LLM and run

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from HuggingFace. Notice that you will be using the [small version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the memory type to be used by this model.

If you want to load the dataset from the output directory instead of downloading it from Hugging Face every time, you can modify the code as follows:

In [2]:
# Define the output directory where the dataset is stored
output_dir_dataset = "./dialogsum"

# Ensure the output directory exists
os.makedirs(output_dir_dataset, exist_ok=True)

# Load the dataset from Hugging Face
dataset = load_dataset("knkarthick/dialogsum")

# Save the dataset to the output directory
dataset.save_to_disk(output_dir_dataset)

Found cached dataset csv (C:/Users/david/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-c8fac5d84cd35861/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


  0%|          | 0/3 [00:00<?, ?it/s]

Saving the dataset (0/1 shards):   0%|          | 0/12460 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/500 [00:00<?, ? examples/s]

Load a previously downloaded copy of the dataset

In [3]:
# Define the output directory where the dataset is stored
output_dir_dataset = "./dialogsum"

# Load the dataset from the output directory
dataset = load_from_disk(output_dir_dataset)

dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

To download and save the "google/flan-t5-base" model locally and run it with the specified code, you can use the following updated Python script. Make sure you have the transformers library installed by running pip install transformers.

In [4]:
# Define the model name
model_name = 'google/flan-t5-base'

# Define the output directory for the model
output_model_dir = "./flan-t5-base-checkpoint"

# Ensure the output directory exists
os.makedirs(output_model_dir, exist_ok=True)

# Download and save the model locally
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Save the model and tokenizer to the output directory
loaded_model.save_pretrained(output_model_dir)
tokenizer.save_pretrained(output_model_dir)


('./flan-t5-base-checkpoint\\tokenizer_config.json',
 './flan-t5-base-checkpoint\\special_tokens_map.json',
 './flan-t5-base-checkpoint\\spiece.model',
 './flan-t5-base-checkpoint\\added_tokens.json',
 './flan-t5-base-checkpoint\\tokenizer.json')

Load the model and tokenizer from the output directory

In [5]:
# Define the output directory for the model
output_model_dir = "./flan-t5-base-checkpoint"

# Load the model and tokenizer from the output directory
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(output_model_dir, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_model_dir)

It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function can be used to do that, at this stage, you do not need to go into details of it.

In [6]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params} \nall model parameters: {all_model_params}\npercentageof trainable model parameter: {((trainable_model_params/all_model_params)*100)}%"


print(print_number_of_trainable_model_parameters(loaded_model))

trainable model parameters: 247577856 
all model parameters: 247577856
percentageof trainable model parameter: 100.0%


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with the zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text which indicates the model can be fine-tuned to the task at hand.

In [7]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

In [8]:
prompt = f"""
Summaise the following conversation

{dialogue}

Summary:
"""

inputs = tokenizer(dialogue, return_tensors='pt')
output = tokenizer.decode(
        loaded_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

In [9]:
dash_line = '-'.join('' for x in range(100))

print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summaise the following conversation

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

---------------------------------------------------------------------

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation` and to the start of the summary with `Summary` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [10]:
def tokenize_function(example):
    start_prompt = 'Summarise the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors='pt').input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors='pt').input_ids

    return example

 
# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenizer_function code is handling all data accross all split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])


Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
# To save some time in the lab, you can run a subsample the dataset, remove the comment 
# tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

In [11]:
print(f"Shapes of the dataset:")
print(f"Training:{tokenized_datasets['train'].shape}")
print(f"Validation:{tokenized_datasets['validation'].shape}")
print(f"Test:{tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the dataset:
Training:(12460, 2)
Validation:(500, 2)
Test:(1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
})


The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

Output directory for fine tunned LLM

In [12]:
fine_tunned_LLM_dir = f"./flan-dialogue-summary-checkpoint"

Options to training on CUDA GPU

In [13]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

Verify Cuda is available

In [16]:
torch.cuda.is_available()

True



https://pytorch.org/docs/stable/generated/torch.zeros.html

In [17]:
torch.zeros(1).cuda()

tensor([0.], device='cuda:0')

In [15]:
training_args = TrainingArguments(
    output_dir=fine_tunned_LLM_dir,
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=20,
    per_device_train_batch_size=4,  # Set the batch size according to your GPU memory
    per_device_eval_batch_size=4,  # Set the batch size according to your GPU memory
    gradient_accumulation_steps=8,  # Accumulate gradients for larger effective batch size
    evaluation_strategy="steps",
    eval_steps=100,  # Evaluate every 100 steps
    save_strategy="steps",
    save_steps=100,  # Save checkpoint every 100 steps
    report_to="none",  # Disable logging
    disable_tqdm=True,  # Disable tqdm progress bar
    fp16=False,  # Enable mixed-precision training
)

trainer = Trainer(
    model=loaded_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)


Train the LLM

In [18]:
trainer.train(resume_from_checkpoint=None)

{'loss': 50.2812, 'learning_rate': 9.5e-06, 'epoch': 0.0}
{'loss': 49.2188, 'learning_rate': 9e-06, 'epoch': 0.01}
{'loss': 49.4062, 'learning_rate': 8.5e-06, 'epoch': 0.01}
{'loss': 49.75, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.01}
{'loss': 49.625, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.01}
{'loss': 49.9688, 'learning_rate': 7e-06, 'epoch': 0.02}
{'loss': 50.5312, 'learning_rate': 6.5000000000000004e-06, 'epoch': 0.02}
{'loss': 48.8438, 'learning_rate': 6e-06, 'epoch': 0.02}
{'loss': 49.8438, 'learning_rate': 5.500000000000001e-06, 'epoch': 0.02}
{'loss': 48.875, 'learning_rate': 5e-06, 'epoch': 0.03}
{'loss': 50.0938, 'learning_rate': 4.5e-06, 'epoch': 0.03}
{'loss': 48.4688, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.03}
{'loss': 49.1875, 'learning_rate': 3.5e-06, 'epoch': 0.03}
{'loss': 48.0, 'learning_rate': 3e-06, 'epoch': 0.04}
{'loss': 50.625, 'learning_rate': 2.5e-06, 'epoch': 0.04}
{'loss': 49.2812, 'learning_rate': 2.0000000000000003e-06, 'epo

TrainOutput(global_step=20, training_loss=49.49375, metrics={'train_runtime': 68.4291, 'train_samples_per_second': 9.353, 'train_steps_per_second': 0.292, 'train_loss': 49.49375, 'epoch': 0.05})

Save the trained Model

In [19]:
trainer.save_model(fine_tunned_LLM_dir)

Load the saved model

In [20]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan-dialogue-summary-checkpoint", torch_dtype=torch.bfloat16)

# Remove # if you have the ability to run both models
# loaded_model = AutoModelForSeq2SeqLM.from_pretrained(loaded_model, torch_dtype=torch.bfloat16)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [21]:
index = 200

dialogues = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summaise the following conversation

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Remove # if you have the ability to run both models
#loaded_model_outputs = loaded_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, do_sample=True, tempature=0.1, num_beams=1))
#loaded_model_text_outputs = tokenizer.decode(loaded_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, do_sample=True, tempature=0.1, num_beams=1))
instruct_model_text_outputs = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)


print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')

# Remove # if you have the ability to run both models
#print(dash_line)
#print(f'Original Model:\n{loaded_model_text_outputs}\n')

print(dash_line)
print(f'Instruct Model:\n{instruct_model_text_outputs}')
print(dash_line)


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
Instruct Model:
#Person1#: Do you actually want to upgrade the hardware? #Person1#: If you're not, look into upgrading your system
---------------------------------------------------------------------------------------------------


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [22]:
rouge = evaluate.load('rouge')

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [23]:

dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

# Remove # if you have the ability to run both models
# loaded_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
    
Summaise the following conversation

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Remove # if you have the ability to run both models
# loaded_model_outputs = loaded_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
# loaded_model_text_outputs = tokenizer.decode(loaded_model_outputs[0], skip_special_tokens=True)
# loaded_model_summaries.append(loaded_model_text_outputs)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
instruct_model_text_outputs = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
instruct_model_summaries.append(instruct_model_text_outputs)

# Remove # if you have the ability to run both models
# zipped_summaries = list(zip(human_baseline_summaries, loaded_model_summaries, instruct_model_summaries))

# df = pd.DataFrame(zipped_summaries, columns= ['human_baseline_summary', 'loaded_model_summaries', 'instruct_model_summaries'])
# df

zipped_summaries = list(zip(human_baseline_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns= ['human_baseline_summary', 'instruct_model_summaries'])
df

Unnamed: 0,human_baseline_summary,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,"#Person1#: Happy birthday, Brian. #Person2#: I..."


Evaluate the models computing ROUGE metrics. Notice the improvement in the results!

In [None]:
# Remove # if you have the ability to run both models
# loaded_model_results = rouge.compute(
#     predictions=loaded_model_summaries,
#     references=human_baseline_summaries[0:len(loaded_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True
# )

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# Remove # if you have the ability to run both models
# print('ORIGINAL MODEL:')
# print(loaded_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [None]:
results = pd.read_csv("data/dialogue-summary-training-results.csv")

human_baseline_summaries = results['human_baseline_summaries'].values
loaded_model_summaries = results['loaded_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

loaded_model_results = rouge.compute(
    predictions=loaded_model_summaries,
    references=human_baseline_summaries[0:len(loaded_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(loaded_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

In [None]:
print("Absolute percentage improvement of INSTRUCT MODEL over HUMAN BASELINE")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(loaded_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** fine-tuning as opposed to "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [24]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [25]:
peft_model = get_peft_model(loaded_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944 
all model parameters: 251116800
percentageof trainable model parameter: 1.4092820552029972%


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [36]:
peft_output_dir = f'./peft-dialogue-summary-training'

peft_training_args = TrainingArguments(
    output_dir=peft_output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=20,
    per_device_train_batch_size=4,  # Set the batch size according to your GPU memory
    per_device_eval_batch_size=4,  # Set the batch size according to your GPU memory
    gradient_accumulation_steps=8,  # Accumulate gradients for larger effective batch size
    evaluation_strategy="steps",
    eval_steps=100,  # Evaluate every 100 steps
    save_strategy="steps",
    save_steps=100,  # Save checkpoint every 100 steps
    report_to="none",  # Disable logging
    disable_tqdm=True,  # Disable tqdm progress bar
    fp16=False,  # Enable mixed-precision training
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation']
)



Now everything is ready to train the PEFT adapter and save the model.

In [37]:
peft_trainer.train(resume_from_checkpoint=None)

{'loss': 10.4766, 'learning_rate': 0.00095, 'epoch': 0.0}
{'loss': 6.8516, 'learning_rate': 0.0009000000000000001, 'epoch': 0.01}
{'loss': 4.9258, 'learning_rate': 0.00085, 'epoch': 0.01}
{'loss': 4.6953, 'learning_rate': 0.0008, 'epoch': 0.01}
{'loss': 4.4688, 'learning_rate': 0.00075, 'epoch': 0.01}
{'loss': 4.3828, 'learning_rate': 0.0007, 'epoch': 0.02}
{'loss': 4.293, 'learning_rate': 0.0006500000000000001, 'epoch': 0.02}
{'loss': 4.5195, 'learning_rate': 0.0006, 'epoch': 0.02}
{'loss': 4.1523, 'learning_rate': 0.00055, 'epoch': 0.02}
{'loss': 4.1094, 'learning_rate': 0.0005, 'epoch': 0.03}
{'loss': 3.9531, 'learning_rate': 0.00045000000000000004, 'epoch': 0.03}
{'loss': 3.9023, 'learning_rate': 0.0004, 'epoch': 0.03}
{'loss': 3.791, 'learning_rate': 0.00035, 'epoch': 0.03}
{'loss': 3.7324, 'learning_rate': 0.0003, 'epoch': 0.04}
{'loss': 3.584, 'learning_rate': 0.00025, 'epoch': 0.04}
{'loss': 3.4629, 'learning_rate': 0.0002, 'epoch': 0.04}
{'loss': 3.3887, 'learning_rate': 0.000

TrainOutput(global_step=20, training_loss=4.43095703125, metrics={'train_runtime': 64.1237, 'train_samples_per_second': 9.981, 'train_steps_per_second': 0.312, 'train_loss': 4.43095703125, 'epoch': 0.05})

In [38]:
peft_trainer.model.save_pretrained(peft_output_dir)
tokenizer.save_pretrained(peft_output_dir)

('./peft-dialogue-summary-training\\tokenizer_config.json',
 './peft-dialogue-summary-training\\special_tokens_map.json',
 './peft-dialogue-summary-training\\spiece.model',
 './peft-dialogue-summary-training\\added_tokens.json',
 './peft-dialogue-summary-training\\tokenizer.json')

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [39]:
from peft import PeftModel, PeftConfig

In [40]:
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(output_model_dir, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_model_dir)

In [41]:
peft_model = PeftModel.from_pretrained(peft_model_base,
                                       peft_output_dir,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

In [42]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0 
all model parameters: 251116800
percentageof trainable model parameter: 0.0%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model, fully fine-tuned and PEFT model.

In [44]:
index = 200
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Remove # if you have the ability to run both models
# original_model_outputs = loaded_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

# Remove # if you have the ability to run both models
# instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
# print(dash_line)
# print(f'ORIGINAL MODEL:\n{original_model_text_output}')
# print(dash_line)
# print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
PEFT MODEL: You might also want to upgrade your hardware because it is pretty outdated now.


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the sample of the test dataset (only 10 dialogues and summaries to save time).

In [45]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]

    # Remove # if you have the ability to run both models
#    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
#    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)


    # Remove # if you have the ability to run both models
 #   instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
 #   instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    #original_model_summaries.append(original_model_text_output)
    #instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries


In [47]:
rouge = evaluate.load('rouge')

# Remove # if you have the ability to run both models
# original_model_results = rouge.compute(
#     predictions=original_model_summaries,
#     references=human_baseline_summaries[0:len(original_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

# Remove # if you have the ability to run both models
# instruct_model_results = rouge.compute(
#     predictions=instruct_model_summaries,
#     references=human_baseline_summaries[0:len(instruct_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# print('ORIGINAL MODEL:')
# print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

PEFT MODEL:
{'rouge1': 0.13439120370684718, 'rouge2': 0.012, 'rougeL': 0.11861276565495285, 'rougeLsum': 0.11890560332740871}


Notice, that PEFT model results are not too bad, while the training process was much easier!

You already computed ROUGE score on the full dataset, after loading the results from the `data/dialogue-summary-training-results.csv` file. Load the values for the PEFT model now and check its performance compared to other models.

In [None]:
human_baseline_summaries = results['human_baseline_summaries'].values
# original_model_summaries = results['original_model_summaries'].values
# instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values


# Remove # if you have the ability to run both models
# original_model_results = rouge.compute(
#     predictions=original_model_summaries,
#     references=human_baseline_summaries[0:len(original_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )
# 

# Remove # if you have the ability to run both models
# instruct_model_results = rouge.compute(
#     predictions=instruct_model_summaries,
#     references=human_baseline_summaries[0:len(instruct_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# print('ORIGINAL MODEL:')
# print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

The results show less of an improvement over full fine-tuning, but the benefits of PEFT typically outweigh the slightly-lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over HUMAN BASELINE")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

In [None]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Here you see a small percentage decrease in the ROUGE metrics vs. full fine-tuned. However, the training requires much less computing and memory resources (often just a single GPU).

In [None]:
1