# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, which provides a high quality instruction tuned model and can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Set up the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [15]:
# %pip install --upgrade pip
# %pip install --disable-pip-version-check \
#     torch==1.13.1 \
#     torchdata==0.5.1 --quiet

# %pip install \
#     transformers==4.27.2 \
#     datasets==2.11.0 \
#     evaluate==0.4.0 \
#     rouge_score==0.1.2 \
#     loralib==0.1.1 \
#     peft==0.3.0 --quiet



Import the necessary components. They will be discussed later in the notebook.

In [16]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

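This notebook is written around the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset, which contains 10,000+ dialogues with the corresponding manually labeled summaries and topics. The run shown in this section, however, loads a much smaller custom dataset with the columns `input` and ` result` (note the leading space in the second name), holding 8 training rows and 2 test rows. Later cells that index the `dialogue` and `summary` columns follow the DialogSum schema.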

In [17]:
# from google.colab import drive
# drive.mount('/content/drive')

In [18]:
from datasets import load_dataset

huggingface_dataset_name = "Ankita802/formatted-data"
# huggingface_dataset_name = "Ketan3101/ConvoBrief"

dataset = load_dataset(huggingface_dataset_name)

dataset


DatasetDict({
    train: Dataset({
        features: ['input', ' result'],
        num_rows: 8
    })
    test: Dataset({
        features: ['input', ' result'],
        num_rows: 2
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [small version](https://huggingface.co/google/flan-t5-small) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type (numeric precision) in which the model weights are loaded.

In [19]:
model_name='google/flan-t5-small'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [20]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 76961152
all model parameters: 76961152
percentage of trainable model parameters: 100.00%
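
The parameter count above also gives a rough idea of why the choice of `torch_dtype` matters. The following is a minimal sketch that estimates the memory needed for the weights alone, assuming 4 bytes per parameter for `float32` and 2 bytes for `bfloat16`. The helper name `weight_memory_mib` is illustrative, and the estimate ignores activations, gradients and optimizer state.

```python
# Parameter count printed above for google/flan-t5-small.
NUM_PARAMS = 76_961_152

# Bytes used by one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in MiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mib(NUM_PARAMS, dtype):.1f} MiB")
```

Under these assumptions, loading in `bfloat16` halves the weight memory compared with `float32`.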


In [21]:
sentence = "AS a CONNECT developer, I want all assertions upgrades to be completely tested so the code can be included in the next release"
tokenized_input = tokenizer(sentence)
# Print the tokenized input
print("Input IDs:", tokenized_input["input_ids"])
# print("Token Type IDs:", tokenized_input["token_type_ids"])
print("Attention Mask:", tokenized_input["attention_mask"])

# Decode the input tokens
decoded_input = tokenizer.decode(tokenized_input["input_ids"])
print("Decoded Input:", decoded_input)

Input IDs: [6157, 3, 9, 8472, 567, 14196, 7523, 6, 27, 241, 66, 27805, 7, 15694, 12, 36, 1551, 5285, 78, 8, 1081, 54, 36, 1285, 16, 8, 416, 1576, 1]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Decoded Input: AS a CONNECT developer, I want all assertions upgrades to be completely tested so the code can be included in the next release</s>


In [22]:
print(type(dataset))

# Print the first few entries of the dataset
for i in range(2):
    print(dataset['train'][i])


<class 'datasets.dataset_dict.DatasetDict'>
{'input': 'AS a CONNECT developer, I want all assertions upgrades to be completely tested so the code can be included in the next release', ' result': 'The CONNECT developer’s objective is to thoroughly test all assertion upgrades to ensure they are ready for inclusion in the upcoming release. This comprehensive testing is crucial for verifying that the new enhancements function correctly and meet quality standards, thereby contributing to the stability and reliability of the next version of the software.'}
{'input': 'As a Publisher, I would like a tool to check data availability persistence after publication.', ' result': 'As a Publisher, I need a tool to verify the persistence of data availability after publication." This user story succinctly conveys the requirement for a tool to ensure that published data remains available and accessible over time, addressing the concerns of data persistence.'}


In [23]:
index = 0
example = dataset['test'][index]
print(example.keys())

dict_keys(['input', ' result'])


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with the zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text which indicates the model can be fine-tuned to the task at hand.

In [26]:
index = 0

dialogue = dataset['test'][index]['input']
summary = dataset['test'][index][' result']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

As a user, I want to sync events created in NeuroHub with a web-based Calendar such as Google Calendar.

Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Users seek the capability to synchronize events created within NeuroHub with a web-based calendar service like Google Calendar, facilitating seamless integration and access to scheduling information across platforms.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
W: Is there any way to sync events created in NeuroHub?


<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation` and to the start of the summary with `Summary` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [29]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["input"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example[" result"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset loaded above contains 2 splits: train and test.
# With batched=True, tokenize_function processes every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['input', ' result'])
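
One detail the cell above does not handle: the labels are padded to the maximum length, so padding positions remain in the `labels` tensor. A common refinement, which is not part of this notebook, is to replace padding ids in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the training loss. The sketch below shows the idea on plain Python lists. It assumes T5's pad token id is `0`; the helper name and the example ids are hypothetical.

```python
PAD_TOKEN_ID = 0      # assumption: pad token id of the T5 tokenizer
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss


def mask_pad_labels(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX in every label sequence."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Hypothetical, already padded label ids for two examples.
labels = [[71, 1139, 1, 0, 0], [37, 1, 0, 0, 0]]
masked_labels = mask_pad_labels(labels)
print(masked_labels)
```

Only the labels are masked this way; the padded `input_ids` are left unchanged.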


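To save some time in the lab, you can subsample the dataset. The filter below is commented out in this run because the dataset already has only a handful of rows: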

In [None]:
#tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Check the shapes of both splits of the dataset:

In [31]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
# print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (8, 2)
Test: (2, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 8
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 2
    })
})


The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [33]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=500,
    max_steps=-1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

In [34]:
trainer.train()

  0%|          | 0/1 [00:00<?, ?it/s]

{'train_runtime': 2886.0722, 'train_samples_per_second': 0.003, 'train_steps_per_second': 0.0, 'train_loss': 44.0, 'epoch': 1.0}


TrainOutput(global_step=1, training_loss=44.0, metrics={'train_runtime': 2886.0722, 'train_samples_per_second': 0.003, 'train_steps_per_second': 0.0, 'train_loss': 44.0, 'epoch': 1.0})
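
The `global_step=1` in the output above follows from the dataset size. As a rough sketch, assuming a single device, no gradient accumulation, and the `TrainingArguments` default of 8 examples per batch (none of these are set explicitly in the cell above), the number of optimizer steps can be estimated as follows. The function name is illustrative.

```python
import math


def estimated_training_steps(num_samples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for a run without gradient accumulation (simplified)."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * num_epochs


# 8 training rows, assumed batch size of 8, 1 epoch.
print(estimated_training_steps(num_samples=8, batch_size=8, num_epochs=1))
```

With only 8 training rows, one epoch amounts to a single weight update, which is worth keeping in mind when reading the evaluation results that follow.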

In [35]:
full_fine_tune_model_path="./full-fine-tune-code-generation-checkpoint-local"

trainer.model.save_pretrained(full_fine_tune_model_path)
tokenizer.save_pretrained(full_fine_tune_model_path)

('./full-fine-tune-code-generation-checkpoint-local\\tokenizer_config.json',
 './full-fine-tune-code-generation-checkpoint-local\\special_tokens_map.json',
 './full-fine-tune-code-generation-checkpoint-local\\spiece.model',
 './full-fine-tune-code-generation-checkpoint-local\\added_tokens.json',
 './full-fine-tune-code-generation-checkpoint-local\\tokenizer.json')

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [37]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(full_fine_tune_model_path, torch_dtype=torch.bfloat16)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below, compare the summary produced by the fine-tuned model with the output of the original model and with the human baseline summary.

In [23]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
You'd like to make a few more flyers and banners.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
I'd like to add a CD-ROM drive to my software.


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
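
To make the idea concrete, here is a simplified sketch of ROUGE-1, which measures unigram overlap between a prediction and a reference. It assumes lowercasing and whitespace tokenization only; the `rouge_score` package used through `evaluate` also applies its own tokenization and optional stemming, so its numbers can differ. The function name is illustrative.

```python
from collections import Counter


def rouge1_scores(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1_scores("the cat", "the cat sat on the mat")
print(scores)
```

A short prediction that only uses words from the reference gets perfect precision but low recall, which is why a combined F-measure is usually reported.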

In [24]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [25]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Messages are sent to all employees of the office.,Is this all correct?
1,In order to prevent employees from wasting tim...,#Person1#: I am a senior employee.,Is this all correct?
2,Ms. Dawson takes a dictation for #Person1# abo...,#Person1#: Please take a dictation for me.,Is this all correct?
3,#Person2# arrives late because of traffic jam....,The traffic jam was a nightmare.,"I'm sorry, but I'm not going to drive to work."
4,#Person2# decides to follow #Person1#'s sugges...,Taking the subway would be a great way to get ...,"I'm sorry, but I'm not going to drive to work."
5,#Person2# complains to #Person1# about the tra...,Get out and get ready to go to work.,"I'm sorry, but I'm not going to drive to work."
6,#Person1# tells Kate that Masha and Hero get d...,#Person1#: I'm not sure. But I'm not sure.,I think it's a good idea to have a couple of k...
7,#Person1# tells Kate that Masha and Hero are g...,The couple is getting divorced.,I think it's a good idea to have a couple of k...
8,#Person1# and Kate talk about the divorce betw...,"After a couple of days, they're divorced.",I think it's a good idea to have a couple of k...
9,#Person1# and Brian are at the birthday party ...,"Brian, thanks for the cake.","Brian, how are you?"


Evaluate the models by computing ROUGE metrics, and compare the scores of the fine-tuned model with those of the original model.

In [26]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.16261387474369216, 'rouge2': 0.026442307692307692, 'rougeL': 0.12915057221345255, 'rougeLsum': 0.12883994166752788}
INSTRUCT MODEL:
{'rouge1': 0.09064470772635075, 'rouge2': 0.013793103448275862, 'rougeL': 0.09325224576128359, 'rougeLsum': 0.09368816411183747}


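Compute the absolute difference between the two models for each ROUGE metric. In this run the fine-tuned model scores lower than the original model on all four metrics: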

In [27]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: -7.20%
rouge2: -1.26%
rougeL: -3.59%
rougeLsum: -3.52%
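
The figures above are absolute differences, expressed in percentage points. A relative change divides that difference by the original model's score and can look quite different. The sketch below contrasts the two, using the ROUGE scores printed earlier in this section as hard-coded inputs; the variable and function names are illustrative.

```python
# ROUGE scores copied from the output above.
original_scores = {
    "rouge1": 0.16261387474369216,
    "rouge2": 0.026442307692307692,
    "rougeL": 0.12915057221345255,
    "rougeLsum": 0.12883994166752788,
}
instruct_scores = {
    "rouge1": 0.09064470772635075,
    "rouge2": 0.013793103448275862,
    "rougeL": 0.09325224576128359,
    "rougeLsum": 0.09368816411183747,
}


def absolute_change(new: float, old: float) -> float:
    """Difference in score, i.e. percentage points once multiplied by 100."""
    return new - old


def relative_change(new: float, old: float) -> float:
    """Difference as a fraction of the old score."""
    return (new - old) / old


for key in original_scores:
    abs_diff = absolute_change(instruct_scores[key], original_scores[key])
    rel_diff = relative_change(instruct_scores[key], original_scores[key])
    print(f"{key}: absolute {abs_diff * 100:.2f} points, relative {rel_diff * 100:.2f}%")
```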


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
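
The idea of a frozen base weight plus a small adapter can be sketched in a few lines of plain Python. This is a toy illustration, not the `peft` implementation: it assumes the usual LoRA formulation in which a frozen weight matrix `W` is augmented by a low-rank product `B A` scaled by `alpha / r`, and it uses tiny hand-written matrices. All names and values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    right_cols = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in right_cols] for row in left]


def lora_forward(w, lora_a, lora_b, x, alpha, rank):
    """Frozen base output plus the scaled low-rank adapter output."""
    scale = alpha / rank
    base = matvec(w, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [h + scale * d for h, d in zip(base, delta)]


def merge_adapter(w, lora_a, lora_b, alpha, rank):
    """Fold the adapter into the base weight: W + (alpha / rank) * B A."""
    scale = alpha / rank
    update = matmul(lora_b, lora_a)
    return [[wij + scale * uij for wij, uij in zip(w_row, u_row)]
            for w_row, u_row in zip(w, update)]


# Toy shapes: W is 2x3 (frozen), A is 1x3 and B is 2x1 (trainable, rank 1).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
x = [1.0, 1.0, 1.0]

separate = lora_forward(W, A, B, x, alpha=2, rank=1)
merged = matvec(merge_adapter(W, A, B, alpha=2, rank=1), x)
print(separate, merged)
```

Keeping `A` and `B` separate is what lets several adapters share one base model, while merging them into `W` gives the same output with a single matrix multiplication.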

<a name='3.1'></a>
### 3.1 - Set up the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [28]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [29]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1376256
all model parameters: 78337408
percentage of trainable model parameters: 1.76%
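
The trainable-parameter count above can be reproduced by hand. The sketch below assumes the FLAN-T5 small architecture has a hidden size of 512, 6 attention heads of dimension 64 (so the `q` and `v` projections map 512 to 384), and 8 encoder plus 8 decoder layers, with each decoder layer holding both a self-attention and a cross-attention block. These configuration values are assumptions that are not stated in this notebook; the constant names are illustrative.

```python
D_MODEL = 512            # assumed hidden size
INNER_DIM = 6 * 64       # assumed num_heads * d_kv
ENCODER_LAYERS = 8       # assumed
DECODER_LAYERS = 8       # assumed
RANK = 32                # r from the LoraConfig above
BASE_PARAMS = 76_961_152 # printed earlier for the original model


def lora_params_per_module(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


# One attention block per encoder layer, two (self and cross) per decoder layer.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * 2  # "q" and "v" in every block

trainable = adapted_modules * lora_params_per_module(D_MODEL, INNER_DIM, RANK)
total = BASE_PARAMS + trainable

print(f"trainable: {trainable}")
print(f"all: {total}")
print(f"percentage: {100 * trainable / total:.2f}%")
```

Under these assumptions the arithmetic agrees with the numbers printed by `print_number_of_trainable_model_parameters`.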


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define the training arguments and create a `Trainer` instance.

In [30]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    #logging_steps=1,
    max_steps=-1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [None]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



Step,Training Loss
500,3.5448
1000,1.9017
1500,1.8499


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')



In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1376256
all model parameters: 78337408
percentage of trainable model parameters: 1.76%


Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Load the adapter from the directory it was saved to above (peft_model_path).
peft_model = PeftModel.from_pretrained(peft_model_base,
                                       peft_model_path,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False).to('cuda')

The number of trainable parameters will be `0` due to the `is_trainable=False` setting:

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 78337408
percentage of trainable model parameters: 0.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in section [2.3](#2.3), with the original model and the PEFT model. The lines for the fully fine-tuned model are commented out in the cell below.

In [None]:
original_model=original_model.to('cuda')



In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']  # name matches the print statement below

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

#instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
#instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
#print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1# thinks #Person2# is considering adding a painting program to the software, but #Person2# doesn't think a painting program would be a bonus. #Person2# suggests adding a painting program to the software. #Person1# wants to add a hard disc and a hard disc and a hard disc.
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person2# considers adding a painting program to #Person2#'s software. #Person2# wants to add a CD-ROM drive to #Person2#'s software. #Person2# wants to add a CD-ROM drive to #Pers

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

Perform inferences for the sample of the test dataset (only 10 dialogues and summaries to save time).

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,#Person1# asks #Person1# to go out as an intra...,Is this all correct?,#Person1# wants to take a dictation for #Perso...
1,In order to prevent employees from wasting tim...,#Person1# wants to take a dictation for all em...,Is this all correct?,#Person1# wants to take a dictation for #Perso...
2,Ms. Dawson takes a dictation for #Person1# abo...,Dawson and Ms. Dawson are not allowed to use i...,Is this all correct?,#Person1# wants to take a dictation for #Perso...
3,#Person2# arrives late because of traffic jam....,#Person1# thinks #Person2# is a good choice fo...,"I'm sorry, but I'm not going to drive to work.",#Person2# is stuck in traffic again and #Perso...
4,#Person2# decides to follow #Person1#'s sugges...,#Person2# asks #Person2# to find a way to get ...,"I'm sorry, but I'm not going to drive to work.",#Person2# is stuck in traffic again and #Perso...
5,#Person2# complains to #Person1# about the tra...,#Person1# got stuck in traffic and a terrible ...,"I'm sorry, but I'm not going to drive to work.",#Person2# is stuck in traffic again and #Perso...
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are getting divorced and they a...,I think it's a good idea to have a couple of k...,#Person2# is worried about the separation for ...
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting divorced and are ge...,I think it's a good idea to have a couple of k...,#Person2# is worried about the separation for ...
8,#Person1# and Kate talk about the divorce betw...,Kate and #Person2# are getting divorced and th...,I think it's a good idea to have a couple of k...,#Person2# is worried about the separation for ...
9,#Person1# and Brian are at the birthday party ...,Brian tells Brian that Brian is happy birthday...,"Brian, how are you?",Brian invites Brian to have a dance with Brian...


In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.3379713879972265, 'rouge2': 0.09917957989430226, 'rougeL': 0.2642753936216485, 'rougeLsum': 0.2656308762149359}
INSTRUCT MODEL:
{'rouge1': 0.09064470772635075, 'rouge2': 0.013793103448275862, 'rougeL': 0.09325224576128359, 'rougeLsum': 0.09368816411183747}
PEFT MODEL:
{'rouge1': 0.3007116454829918, 'rouge2': 0.08050042800988387, 'rougeL': 0.22274424500885284, 'rougeLsum': 0.22309502417319133}


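Notice that the PEFT model's results are close to those of the original model (slightly lower on all four ROUGE metrics in this run) and well above those of the fully fine-tuned model, while the training process was much easier!

In general, PEFT tends to show less of an improvement than full fine-tuning, but its benefits typically outweigh the slightly lower performance metrics.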

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: -3.73%
rouge2: -1.87%
rougeL: -4.15%
rougeLsum: -4.25%


Now calculate the improvement of PEFT over a full fine-tuned model:
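
A minimal sketch of that calculation, mirroring the earlier comparison cell. To stay self-contained it hard-codes the ROUGE scores printed above instead of reusing `peft_model_results` and `instruct_model_results`; the dictionary names below are illustrative.

```python
# ROUGE scores copied from the output above.
instruct_scores = {
    "rouge1": 0.09064470772635075,
    "rouge2": 0.013793103448275862,
    "rougeL": 0.09325224576128359,
    "rougeLsum": 0.09368816411183747,
}
peft_scores = {
    "rouge1": 0.3007116454829918,
    "rouge2": 0.08050042800988387,
    "rougeL": 0.22274424500885284,
    "rougeLsum": 0.22309502417319133,
}

print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

# Positive values mean the PEFT model scored higher than the fully fine-tuned one.
improvement = {key: peft_scores[key] - instruct_scores[key] for key in peft_scores}
for key, value in improvement.items():
    print(f"{key}: {value * 100:.2f}%")
```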

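In this run the PEFT model's ROUGE scores are higher than those of the fully fine-tuned model, whose training here consisted of a single step. More generally, PEFT is expected to show a small percentage decrease in the ROUGE metrics vs. full fine-tuning. However, the training requires much less computing and memory resources (often just a single GPU).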