# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

# Table of Contents

- [ 1 - Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Set up the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Load Required Dependencies, Dataset and LLM (5 points)

<a name='1.1'></a>
### 1.1 - Set up Required Dependencies (1 point)

Now install the required packages for the LLM and datasets.



In [1]:
# TODO
# dataset
!pip install -q \
    datasets==2.17.0 \
    evaluate \
    rouge_score \
    loralib \
    peft \
    "transformers[sentencepiece]"


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2023.10.0 which is incompatible.



Import the necessary components. Some of them are new for this week; they will be discussed later in the notebook.

In [2]:
# TODO
import time
import numpy as np
import pandas as pd

import torch

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import evaluate  # evaluation library

from rouge_score import rouge_scorer
import loralib as lora
from peft import LoraConfig, get_peft_model

In [3]:
import transformers, datasets, peft

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("evaluate:", evaluate.__version__)
print("peft:", peft.__version__)

torch: 2.8.0+cu126
transformers: 4.57.1
datasets: 2.17.0
evaluate: 0.4.6
peft: 0.17.1


<a name='1.2'></a>
### 1.2 - Load Dataset and LLM (2 points)

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [4]:
huggingface_dataset_name = "knkarthick/dialogsum" # TODO

dataset = load_dataset(huggingface_dataset_name)# TODO

dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [5]:
print("Dataset splits:", dataset.keys())
print("\nSample from the training set:")
sample = dataset["train"][0]
for k, v in sample.items():
    print(f"\n=== {k} ===")
    print(v)

Dataset splits: dict_keys(['train', 'validation', 'test'])

Sample from the training set:

=== id ===
train_0

=== dialogue ===
#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more 

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [small version](https://huggingface.co/google/flan-t5-small) of FLAN-T5. Setting `dtype=torch.bfloat16` (spelled `torch_dtype` in older releases of `transformers`) specifies the data type in which the model weights are loaded, which halves their memory footprint relative to 32-bit floats.

In [6]:
model_name= "google/flan-t5-small" # TODO

original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16
) # TODO
tokenizer = AutoTokenizer.from_pretrained(model_name) # TODO


You can pull out the number of model parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [7]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    # TODO
    for _, param in model.named_parameters():
        all_model_params += param.numel()  # total parameter numbers
        if param.requires_grad: # trainable model parameters
            trainable_model_params += param.numel()

    return (
        f"trainable model parameters: {trainable_model_params}\n"
        f"all model parameters: {all_model_params}\n"
        f"percentage of trainable model parameters: "
        f"{100 * trainable_model_params / all_model_params:.2f}%"
    )

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 76961152
all model parameters: 76961152
percentage of trainable model parameters: 100.00%
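
To put the parameter count in perspective, here is a rough sketch of how much memory the weights alone occupy in different data types. The helper name `weight_memory_mb` is hypothetical, and the estimate deliberately ignores activations, gradients, and optimizer state, which add considerably more during training.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Approximate size of the model weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

# Parameter count reported by print_number_of_trainable_model_parameters above
num_params = 76_961_152

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mb(num_params, dtype):.1f} MB")
```

Under these assumptions, the bfloat16 weights take half the space of a float32 copy of the same model.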


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing (2 Points)

Test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [8]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

# prompt
prompt = f"Summarize the following dialogue:\n\n{dialogue}"  # TODO

# encode
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
) # TODO

# generate summary + decode
generated_ids = original_model.generate(
    **inputs,
    max_new_tokens=50,
    early_stopping=True,
)

output = tokenizer.decode(
    generated_ids[0], # TODO
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

The following generation flags are not valid and may be ignored: ['early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Summarize the following dialogue:

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
-----------------------------------------------------------------------------------

The FLAN-T5-small model generated a partial summary (“You’d like to add a CD-ROM drive to your software.”), which captures a specific action but misses the broader context of the dialogue.
In contrast, the human baseline summary effectively conveys the overall topic of upgrading both software and hardware.
This shows that while the model can identify some relevant details, it lacks holistic understanding and would benefit from fine-tuning on dialogue-specific data.

<a name='2'></a>
## 2 - Perform Full Fine-Tuning (10 points)

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset (2 points)

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the dialogue and end the prompt with `Summary:`, so that the summary itself becomes the response, as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [9]:
def tokenize_function(example):
    # TODO
    dialogue_prompts = [
        f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
        for dialogue in example["dialogue"]]

    # Tokenize input dialogues
    tokenized_inputs = tokenizer(
        dialogue_prompts,
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    # Tokenize summaries (labels)
    tokenized_labels = tokenizer(
        example["summary"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

    # Add labels
    tokenized_inputs["labels"] = tokenized_labels["input_ids"]

    return tokenized_inputs

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True) # TODO

print("Tokenized Dataset Example (Train):")
print(tokenized_datasets["train"][0])


Tokenized Dataset Example (Train):
{'id': 'train_0', 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks
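
Because the labels are padded to `max_length`, the padding positions take part in the loss unless they are masked. A common variant, which is not what the cell above does and would change the reported loss, replaces padding token ids in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below uses plain Python lists; the helper name `mask_label_padding` is hypothetical, and `pad_token_id=0` is an assumption that should be read from `tokenizer.pad_token_id` in practice.

```python
def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in each label sequence so the loss skips them."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Toy batch: two label sequences padded to length 5 with id 0
toy_labels = [[71, 1712, 1, 0, 0], [363, 19, 48, 58, 1]]
masked = mask_label_padding(toy_labels)
print(masked)
```

Inside `tokenize_function`, this would be applied to `tokenized_labels["input_ids"]` before assigning them to `labels`.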

To save some time in the lab, you will subsample the dataset:

In [10]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)


Check the shapes of all three parts of the dataset:

In [11]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (125, 7)
Validation: (5, 7)
Test: (15, 7)
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 15
    })
})


The output dataset is ready for fine-tuning.
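
As a quick sanity check on the subsampling, keeping every example whose index is a multiple of 100 should leave the row counts shown above. The sketch below reproduces that arithmetic from the original split sizes; the helper name `kept_rows` is hypothetical.

```python
def kept_rows(num_rows: int, step: int = 100) -> int:
    """Number of indices in [0, num_rows) that are multiples of `step`."""
    return len(range(0, num_rows, step))

# Split sizes reported when the dataset was loaded
split_sizes = {"train": 12460, "validation": 500, "test": 1500}

subsampled = {split: kept_rows(n) for split, n in split_sizes.items()}
print(subsampled)
```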

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset (3 points)

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [12]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

training_args = TrainingArguments( # TODO
    output_dir=output_dir,
    overwrite_output_dir=True,
    learning_rate=1e-3,
    num_train_epochs=30,
    weight_decay=0.01,
)

# Data collator for seq2seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=original_model)

trainer = Trainer(
    # TODO
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,  # `tokenizer=` is deprecated in recent transformers releases
    data_collator=data_collator,
)



Start training process...



`trainer.train()` uses the Weights & Biases (wandb) library to track and visualize the training process. To proceed, sign up for a wandb account (for example, with your Gmail address) and enter your API key when prompted; this authenticates you and enables logging of the training progress.
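
If you would rather skip the wandb login, one option is to turn off logging integrations through the `report_to` argument of `TrainingArguments`. The sketch below only builds the keyword arguments as a plain dictionary (the name `training_kwargs` and the output path are hypothetical), so it runs without `transformers` installed; the comment shows how they would be passed on.

```python
# Hypothetical variant of the training configuration used in this notebook.
# report_to="none" asks the Trainer not to log to wandb or any other integration.
training_kwargs = {
    "output_dir": "./dialogue-summary-training",  # placeholder path
    "learning_rate": 1e-3,
    "num_train_epochs": 30,
    "weight_decay": 0.01,
    "report_to": "none",
}

# In the notebook you would then write:
# training_args = TrainingArguments(**training_kwargs)
print(training_kwargs["report_to"])
```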

In [13]:
# TODO
trainer.train() # Start fine-tuning
trainer.save_model(output_dir) # Save fine-tuned model

# Evaluate
results = trainer.evaluate()
print("Evaluation Results:", results)


Evaluation Results: {'eval_loss': 0.70703125, 'eval_runtime': 0.1585, 'eval_samples_per_second': 31.541, 'eval_steps_per_second': 6.308, 'epoch': 30.0}




In [19]:
output_dir # my fine-tuned model

'./dialogue-summary-training-1762724597'

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

I initially attempted to load the instructor-provided `flan-dialogue-summary-checkpoint` folder. However, this model’s configuration class was not compatible with the T5-based `AutoModelForSeq2SeqLM` (its `config.json` specified a `model_type` of "DiaConfig" instead of "T5Config"), which caused a loading error. Therefore, I saved my own fine-tuned model checkpoint from Section 2.2 and uploaded it to Google Drive. I then reloaded this checkpoint as the fine-tuned model for evaluation. This allows me to compare the pre-trained FLAN-T5-small model (original model) with my fine-tuned model on the same test examples and evaluate how fine-tuning improves the summary quality in terms of coherence, relevance, and abstraction.


In [26]:
# Mount Google Drive, which holds the flan-diaglogue-summary-checkpoint folder with the model checkpoint

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


You can find my fine-tuned model's checkpoint [here](https://drive.google.com/drive/folders/1A_4lAnK_S0CKMEy0sB39vqXuHY8n-7R-?usp=sharing).

In [27]:
# Copy my fine-tuned model's checkpoint into MyDrive
!cp -r {output_dir} /content/drive/MyDrive/flan-dialogue-summary-checkpoint

Copy the provided [data directory](https://drive.google.com/drive/folders/1u9hwecVDt_1zIIbIPIlXW3t2F_sJ0AsS) into your Drive so that its files can be used afterward:

In [18]:
!ls /content/drive/MyDrive/Data-for-Fine-Tuning-FLAN-T5-for-Dialogue-Summarization

data				   images
flan-diaglogue-summary-checkpoint  peft-dialogue-summary-checkpoint-from-s3


In [28]:
# Import the tokenizer and model classes
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Path to the checkpoint folder (the folder that contains config.json)
model_path = '/content/drive/MyDrive/flan-dialogue-summary-checkpoint' # TODO

# Load the tokenizer: use the default T5 tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small") # TODO  # or "t5-base", "t5-large", etc.

# Load the model in a way that is compatible with single-GPU environments
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path, # use my fine-tuned model via the previous process
    dtype=torch.bfloat16,
    # The following line addresses the multi-GPU loading issue
    device_map="auto",
)

# Move model to GPU if available (optional, as device_map="auto" should handle it)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
instruct_model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

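Reload the original pre-trained model. The `Trainer` updated `original_model` in place during fine-tuning, so a fresh copy is needed as the baseline for evaluating my fine-tuned model: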

In [29]:
from transformers import AutoModelForSeq2SeqLM

original_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-small",
    dtype=torch.bfloat16,
    device_map="auto",
).to(device)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation) (2 points)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model creates a reasonable summary of the dialogue, whereas the original model fails to understand what is being asked of it.

In [30]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:" # TODO

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move input_ids to the same device as the model
input_ids = input_ids.to(device)

with torch.no_grad():
    original_model_outputs = original_model.generate(
        input_ids=input_ids,
        max_new_tokens=50,
        early_stopping=True
    ) # TODO
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True) # TODO

with torch.no_grad():
    instruct_model_outputs = instruct_model.generate(
        input_ids=input_ids,
        max_new_tokens=50,
        early_stopping=True
    )
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True) # TODO

print(dialogue)
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
--------------------

**Analysis**

The original model gives a short and vague answer that doesn’t summarize the dialogue. It repeats a general question and misses the main idea of the conversation.

The fine-tuned model shows a clear improvement. It captures the key points about upgrading the system and adding a CD-ROM drive, and the overall meaning of the exchange. Although it still contains a small factual error, its output is much closer to the human-written summary in both content and structure.

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric) (3 points)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
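
To make the metric concrete, here is a minimal sketch of ROUGE-1 as unigram-overlap F1. It is a simplification for illustration, not the implementation used by the `evaluate` library: it lowercases and splits on whitespace, applies no stemming, and the function name `rouge1_f1` and the example sentences are made up.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F1: {score:.4f}")
```

Five of the six words overlap in each direction here, so precision and recall are both 5/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.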

In [31]:
from evaluate import load
rouge = load("rouge") # TODO

Downloading builder script: 0.00B [00:00, ?B/s]

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [32]:
### use helper function to generate summary
def generate_summary(model, tokenizer, dialogue, device):
    prompt = f"Summarize the following conversation.\n\n{dialogue}"

    # Move input_ids to the same device as the model
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to(device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=50,
            early_stopping=True,
        )

    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return summary

In [33]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    # use the helper function for both models
    original_model_summaries.append(
        generate_summary(original_model, tokenizer, dialogue, device))
    instruct_model_summaries.append(
        generate_summary(instruct_model, tokenizer, dialogue, device))

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | I'm sorry, sir. I'm sorry, sir. I'm sorry, sir... | #Person1# asks #Person2# to take a dictation a... |
| 1 | In order to prevent employees from wasting tim... | I'm sorry, sir. I'm sorry, sir. I'm sorry, sir... | #Person1# asks #Person2# to take a dictation a... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | I'm sorry, sir. I'm sorry, sir. I'm sorry, sir... | #Person1# asks #Person2# to take a dictation a... |
| 3 | #Person2# arrives late because of traffic jam.... | I'm not going to drive to work. | #Person1# gets stuck in traffic, but #Person2#... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | I'm not going to drive to work. | #Person1# gets stuck in traffic, but #Person2#... |
| 5 | #Person2# complains to #Person1# about the tra... | I'm not going to drive to work. | #Person1# gets stuck in traffic, but #Person2#... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | I think it's a good time to start a new year. | #Person1# tells Kate that #Person2# has filed ... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | I think it's a good time to start a new year. | #Person1# tells Kate that #Person2# has filed ... |
| 8 | #Person1# and Kate talk about the divorce betw... | I think it's a good time to start a new year. | #Person1# tells Kate that #Person2# has filed ... |
| 9 | #Person1# and Brian are at the birthday party ... | Brian, how are you? | Brian invites #Person2# to celebrate his birth... |


Evaluate the models computing ROUGE metrics. Notice the improvement in the results!

In [34]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.08150925925925925), 'rouge2': np.float64(0.016), 'rougeL': np.float64(0.08256144781144781), 'rougeLsum': np.float64(0.08339478114478113)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.34529717169215945), 'rouge2': np.float64(0.0934230058074429), 'rougeL': np.float64(0.2691694043003165), 'rougeLsum': np.float64(0.26849958428174736)}


**Analysis**

When evaluating both models on a small subset of 10 test dialogues, the fine-tuned (instruct) model achieved significantly higher ROUGE scores compared to the original pre-trained model. These results show that the fine-tuned model generates summaries that are much closer to the human-written references, both in terms of lexical similarity and structural alignment.



The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [35]:
# Update the path to the CSV file
results_path = '/content/drive/MyDrive/Data-for-Fine-Tuning-FLAN-T5-for-Dialogue-Summarization/data/dialogue-summary-training-results.csv' # TODO
results = pd.read_csv(results_path)

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True, # TODO
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True, # TODO
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.2334158581572823), 'rouge2': np.float64(0.07603964187010573), 'rougeL': np.float64(0.20145520923859048), 'rougeLsum': np.float64(0.20145899339006135)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.42161291557556113), 'rouge2': np.float64(0.18035380596301792), 'rougeL': np.float64(0.3384439349963909), 'rougeLsum': np.float64(0.33835653595561666)}


**Analysis**

When evaluated on the full dataset using the instructor-provided CSV file, the fine-tuned (instruct) model again outperformed the original pre-trained model across all ROUGE metrics. ROUGE-1 improved from 0.23 to 0.42, ROUGE-2 from 0.08 to 0.18, and ROUGE-L from 0.20 to 0.34. These consistent gains indicate that fine-tuning substantially enhanced the model’s ability to capture both key words and multi-word expressions from the reference summaries.

The results show substantial improvement in all ROUGE metrics:

In [36]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%


**Analysis**

To quantify the improvement, I calculated the absolute ROUGE score differences between the two models. The instruct (fine-tuned) model gains about 14 percentage points on average across the four ROUGE metrics. The largest gain is in ROUGE-1 (+18.8 points), followed by ROUGE-L (+13.7 points) and ROUGE-2 (+10.4 points). Because ROUGE measures n-gram overlap with the reference, these gains show that the fine-tuned summaries share substantially more wording and word order with the human-written summaries; they do not by themselves measure factual accuracy.
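
The differences printed above are percentage points (a difference of two scores), which is not the same as a relative improvement. The short sketch below contrasts the two, using the full-dataset ROUGE scores from the output above rounded to four decimals; the helper name `gains` is hypothetical.

```python
# ROUGE scores copied (rounded) from the full-dataset output above.
original = {"rouge1": 0.2334, "rouge2": 0.0760, "rougeL": 0.2015, "rougeLsum": 0.2015}
instruct = {"rouge1": 0.4216, "rouge2": 0.1804, "rougeL": 0.3384, "rougeLsum": 0.3384}


def gains(baseline, improved):
    """Return {metric: (percentage-point gain, relative gain in %)}."""
    return {
        k: ((improved[k] - baseline[k]) * 100,
            (improved[k] - baseline[k]) / baseline[k] * 100)
        for k in baseline
    }


for metric, (points, relative) in gains(original, instruct).items():
    print(f"{metric}: +{points:.1f} points, +{relative:.0f}% relative")
```

Reporting both views avoids ambiguity: a gain of about 19 points on ROUGE-1 corresponds to a much larger relative change because the baseline score is low.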

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT) (10 points)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a family of fine-tuning methods that update only a small number of parameters, which makes it much more efficient than full fine-tuning, with comparable evaluation results as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In practice, when someone says PEFT, they usually mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the original LLM remains unchanged and a newly trained "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM: on the order of a single-digit percentage of the original LLM size (MBs vs. GBs).

That said, at inference time, the LoRA adapter needs to be combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can reuse the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
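
As a rough illustration of why the adapter is small and how it is combined with the base model at inference time, the NumPy sketch below applies a rank-`r` update to a single weight matrix. The shapes, the random values, and the `alpha / r` scaling are illustrative assumptions, not values read from a trained checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 32, 32   # illustrative shapes and LoRA settings

W = rng.normal(size=(d_out, d_in))         # frozen base weight, never updated
A = rng.normal(size=(r, d_in)) * 0.01      # trainable down-projection
B = rng.normal(size=(d_out, r)) * 0.01     # trainable up-projection
x = rng.normal(size=(d_in,))               # one input vector
scale = alpha / r

# Option 1: keep the adapter separate and add its output to the base output.
y_adapter = W @ x + scale * (B @ (A @ x))

# Option 2: fold the low-rank update into the weight once, then use a plain matmul.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))
print(f"adapter parameters: {A.size + B.size}, base parameters: {W.size}")
```

Both options give the same output, so several adapters can share one copy of `W` in memory and each only needs to store its own small `A` and `B`.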

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning (2 points)

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [37]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [38]:
peft_model = get_peft_model(original_model, lora_config) # TODO
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1376256
all model parameters: 78337408
percentage of trainable model parameters: 1.76%
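
The trainable-parameter count above can be reproduced by hand. The sketch below is a back-of-the-envelope check, not part of the PEFT library: the helper name `lora_param_count` is hypothetical, the `q`/`v` shapes (512 to 384 features) are taken from the model printout later in this notebook, and the block counts for FLAN-T5 small are an assumption.

```python
def lora_param_count(d_model, d_attn, r, n_attention_blocks, n_target_modules=2):
    """Parameters added by LoRA: each targeted Linear gets A (r x d_model) and B (d_attn x r)."""
    per_module = r * d_model + d_attn * r
    return per_module * n_target_modules * n_attention_blocks


# Assumption: FLAN-T5 small has 8 encoder blocks (self-attention only) and
# 8 decoder blocks (self-attention plus cross-attention): 8 + 2 * 8 = 24 attention blocks.
# target_modules=["q", "v"] adapts two projections in each attention block.
trainable = lora_param_count(d_model=512, d_attn=384, r=32, n_attention_blocks=24)
total = 78_337_408  # "all model parameters" reported above

print(trainable)
print(f"{100 * trainable / total:.2f}%")
```

Because the count scales linearly with `r`, halving the rank would halve the number of trainable parameters under the same assumptions.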


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter (3 points)

Define training arguments and create `Trainer` instance.

In [39]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=2e-3, # higher lr than full fine-tuning
    num_train_epochs=1,
    logging_steps=1,
    # max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

The model is already on multiple devices. Skipping the move to device specified in `args`.


Now everything is ready to train the PEFT adapter and save the model.



In [40]:
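peft_trainer.train()

peft_model_path = "./peft-dialogue-summary-checkpoint-local"

# TODO - Evaluate the PEFT model
peft_results = peft_trainer.evaluate()
print("PEFT Model Evaluation Results:", peft_results)

# Save the trained LoRA adapter weights, then the tokenizer, to the same directory
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)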

Step,Training Loss
1,42.0
2,32.75
3,33.5
4,26.125
5,29.375
6,25.375
7,21.125
8,22.375
9,18.625
10,18.25


PEFT Model Evaluation Results: {'eval_loss': 12.0625, 'eval_runtime': 0.1014, 'eval_samples_per_second': 49.331, 'eval_steps_per_second': 9.866, 'epoch': 1.0}


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json')



You can find my PEFT model's checkpoint [here](https://drive.google.com/drive/folders/1lkmCDhqvBsOC4QbCQp40JM1cosjMB5WO?usp=sharing).

In [41]:
# copy my fine-tuned model's checkpoint into MyDrive
!cp -r {output_dir} /content/drive/MyDrive/peft-dialogue-summary-checkpoint-local

That training was performed on a subset of data. To load a fully trained PEFT model, read a checkpoint of a PEFT model from Google Drive.

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [42]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base", # original FLAN-T5 model
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base") # TODO

# Update the PEFT model path - given model checkpoint
peft_model_path = '/content/drive/MyDrive/Data-for-Fine-Tuning-FLAN-T5-for-Dialogue-Summarization/peft-dialogue-summary-checkpoint-from-s3'

# base + LoRA adapter = final PEFT
peft_model = PeftModel.from_pretrained(
    peft_model_base,
    peft_model_path,
    torch_dtype=torch.bfloat16,
    is_trainable=False, # inference
) # TODO

# Move the entire peft_model to the device
peft_model = peft_model.to(device)



The number of trainable parameters will be `0` due to the `is_trainable=False` setting:

In [43]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation) (2 points)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), using the original model, the fully fine-tuned model, and the PEFT model.

In [48]:
# 1.3: original model (pre-trained)
# 2.3: full fine-tuning model
# 3.2: PEFT model

## - original model
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)
original_model = original_model.to(device)
original_model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

In [46]:
instruct_model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

In [47]:
peft_model

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 768)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 768)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
            

Note that the provided PEFT checkpoint is built on flan-t5-base (hidden size 768), while the original and fully fine-tuned models loaded in this notebook are flan-t5-small (hidden size 512). Part of any performance gap between the PEFT model and the other two may therefore come from the larger base model rather than from the adapter itself.

In [50]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:" # TODO

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move input_ids to the same device as the model
input_ids = input_ids.to(device)

# --original model--
with torch.no_grad():
    original_model_outputs = original_model.generate(
        input_ids=input_ids,
        max_new_tokens=50,
        num_beams=5,
        early_stopping=True,
    )
original_model_text_output = tokenizer.decode(
    original_model_outputs[0],
    skip_special_tokens=True,
)

# -- instruction model --
with torch.no_grad():
    instruct_model_outputs = instruct_model.generate(
        input_ids=input_ids,
        max_new_tokens=50,
        num_beams=5,
        early_stopping=True,
    )
instruct_model_text_output = tokenizer.decode(
    instruct_model_outputs[0],
    skip_special_tokens=True,
)

# -- PEFT model --
with torch.no_grad():
    peft_model_outputs = peft_model.generate(
        input_ids=input_ids,
        max_new_tokens=50,
        num_beams=5,
        early_stopping=True,
    )
peft_model_text_output = tokenizer.decode(
    peft_model_outputs[0],
    skip_special_tokens=True,
)

print(dialogue)
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
--------------------

**Analysis**

The original model focuses only on the CD-ROM detail, missing the broader idea of system upgrades.
The instruct model captures both the painting program and hardware upgrades but mixes up who suggested what.
The PEFT model best summarizes the dialogue, covering both software and hardware upgrades, though it repeats phrases and has minor grammatical issues.
Overall, PEFT gives the most complete and accurate summary among the three.

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric) (3 points)
Perform inferences for the sample of the test dataset (only 10 dialogues and summaries to save time).

In [51]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]

    # Move input_ids to the same device as the model
    input_ids = input_ids.to(device)

    # Original Model
    with torch.no_grad():
        original_model_outputs = original_model.generate(
            input_ids=input_ids,
            max_new_tokens=50,
            early_stopping=True,
        )
    original_model_text_output = tokenizer.decode(
        original_model_outputs[0],
        skip_special_tokens=True,
    )

    # INSTRUCT (fully fine-tuned) MODEL
    with torch.no_grad():
        instruct_model_outputs = instruct_model.generate(
            input_ids=input_ids,
            max_new_tokens=50,
            early_stopping=True,
        )
    instruct_model_text_output = tokenizer.decode(
        instruct_model_outputs[0],
        skip_special_tokens=True,
    )

    # PEFT MODEL
    with torch.no_grad():
        peft_model_outputs = peft_model.generate(
            input_ids=input_ids,
            max_new_tokens=50,
            early_stopping=True,
        )
    peft_model_text_output = tokenizer.decode(
        peft_model_outputs[0],
        skip_special_tokens=True,
    )

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Is this all correct?,#Person1# asks #Person2# to take a dictation a...,#Person1# asks Ms. Dawson to take a dictation ...
1,In order to prevent employees from wasting tim...,Is this all correct?,#Person1# asks #Person2# to take a dictation a...,#Person1# asks Ms. Dawson to take a dictation ...
2,Ms. Dawson takes a dictation for #Person1# abo...,Is this all correct?,#Person1# asks #Person2# to take a dictation a...,#Person1# asks Ms. Dawson to take a dictation ...
3,#Person2# arrives late because of traffic jam....,Talk to your boss.,"#Person1# gets stuck in traffic, but #Person2#...",#Person2# got stuck in traffic and #Person1# s...
4,#Person2# decides to follow #Person1#'s sugges...,Talk to your boss.,"#Person1# gets stuck in traffic, but #Person2#...",#Person2# got stuck in traffic and #Person1# s...
5,#Person2# complains to #Person1# about the tra...,Talk to your boss.,"#Person1# gets stuck in traffic, but #Person2#...",#Person2# got stuck in traffic and #Person1# s...
6,#Person1# tells Kate that Masha and Hero get d...,"Kate, you know, I'm not sure.",#Person1# tells Kate that #Person2# has filed ...,Kate tells #Person2# Masha and Hero are gettin...
7,#Person1# tells Kate that Masha and Hero are g...,"Kate, you know, I'm not sure.",#Person1# tells Kate that #Person2# has filed ...,Kate tells #Person2# Masha and Hero are gettin...
8,#Person1# and Kate talk about the divorce betw...,"Kate, you know, I'm not sure.",#Person1# tells Kate that #Person2# has filed ...,Kate tells #Person2# Masha and Hero are gettin...
9,#Person1# and Brian are at the birthday party ...,"Brian, how are you?",Brian invites #Person2# to celebrate his birth...,Brian remembers his birthday and invites #Pers...


In [52]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.07088226588226587), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.07132602904342034), 'rougeLsum': np.float64(0.07267870615696703)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.323320707440769), 'rouge2': np.float64(0.09579988084584637), 'rougeL': np.float64(0.2536029154690597), 'rougeLsum': np.float64(0.25393266441108686)}
PEFT MODEL:
{'rouge1': np.float64(0.4269315712346933), 'rouge2': np.float64(0.13315451912410747), 'rougeL': np.float64(0.3230054434505235), 'rougeLsum': np.float64(0.3231705534304423)}


**Analysis**

For the 10-example sample, the fully fine-tuned (instruct) model clearly improves over the original zero-shot FLAN-T5, and the PEFT model scores higher still on all ROUGE metrics.

- Original model has very low scores (ROUGE-1 = 0.07, ROUGE-2 = 0), which matches what we saw qualitatively: it barely produces meaningful summaries.
- Instruct model jumps to ROUGE-1 = 0.32 and ROUGE-2 = 0.096, showing that full fine-tuning makes the model capture the main content of the dialogue much more reliably.
- PEFT model achieves the highest scores (ROUGE-1 = 0.43, ROUGE-2 = 0.13, ROUGE-L/Lsum = 0.32), about 10 points above the fully fine-tuned model on ROUGE-1 for this small test slice. Keep in mind that the PEFT checkpoint is built on the larger flan-t5-base, so this is not a like-for-like comparison with the flan-t5-small models.

Because this is only 10 test rows, the numbers are noisy. The repeated model outputs in the table above also suggest that these rows cover only four distinct dialogues, each paired with several reference summaries. Still, the pattern is consistent: both fine-tuning approaches are a big improvement over the original model, and the PEFT model is at least competitive with full fine-tuning while training far fewer parameters.

Notice that the PEFT model results are not too bad, while the training process was much easier!

You already computed ROUGE score on the full dataset, after loading the results from the `data/dialogue-summary-training-results.csv` file. Load the values for the PEFT model now and check its performance compared to other models.

In [53]:
results

Unnamed: 0.1,Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,0,Ms. Dawson helps #Person1# to write a memo to ...,The memo is to be distributed to all employees...,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks Ms. Dawson to take a dictation ...
1,1,In order to prevent employees from wasting tim...,The memo is to be distributed to all employees...,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks Ms. Dawson to take a dictation ...
2,2,Ms. Dawson takes a dictation for #Person1# abo...,The memo is to be distributed to all employees...,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks Ms. Dawson to take a dictation ...
3,3,#Person2# arrives late because of traffic jam....,The traffic jam at the Carrefour intersection ...,#Person2# got stuck in traffic again. #Person1...,#Person2# got stuck in traffic and got stuck i...
4,4,#Person2# decides to follow #Person1#'s sugges...,The traffic jam at the Carrefour intersection ...,#Person2# got stuck in traffic again. #Person1...,#Person2# got stuck in traffic and got stuck i...
...,...,...,...,...,...
1495,1495,Matthew and Steve meet after a long time. Stev...,#Person1#: Hi! #Person2#: Hi! How are you? #Pe...,Steve hasn't seen Matthew for a year. He's bee...,Matthew and Steve are looking for a place to l...
1496,1496,Steve has been looking for a place to live. Ma...,#Person1#: Hi! #Person2#: Hi! How are you? #Pe...,Steve hasn't seen Matthew for a year. He's bee...,Matthew and Steve are looking for a place to l...
1497,1497,Frank invites Besty to the party to celebrate ...,Person1 is going to throw a party for all of h...,Frank invites Betsy to his promotion party on ...,Frank invites Betsy to a party for all of his ...
1498,1498,Frank invites Betsy to the big promotion party...,Person1 is going to throw a party for all of h...,Frank invites Betsy to his promotion party on ...,Frank invites Betsy to a party for all of his ...


In [54]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.2334158581572823), 'rouge2': np.float64(0.07603964187010573), 'rougeL': np.float64(0.20145520923859048), 'rougeLsum': np.float64(0.20145899339006135)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.42161291557556113), 'rouge2': np.float64(0.18035380596301792), 'rougeL': np.float64(0.3384439349963909), 'rougeLsum': np.float64(0.33835653595561666)}
PEFT MODEL:
{'rouge1': np.float64(0.40810631575616746), 'rouge2': np.float64(0.1633255794568712), 'rougeL': np.float64(0.32507074586565354), 'rougeLsum': np.float64(0.3248950182867091)}


The results show less of an improvement over full fine-tuning, but the benefits of PEFT typically outweigh the slightly-lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [55]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%


Now calculate the improvement of PEFT over a full fine-tuned model:

In [56]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.70%
rougeL: -1.34%
rougeLsum: -1.35%


Here you see a small decrease in the ROUGE metrics compared with the fully fine-tuned model. However, PEFT training requires much less compute and memory (often just a single GPU).

**Analysis**

We evaluated the original FLAN-T5 model, the fully fine-tuned instruct model, and the PEFT model on the full test dataset with ROUGE metrics, using the summaries stored in the results CSV. Compared to the original model, the PEFT model shows a substantial improvement: +17.5 percentage points in ROUGE-1, +8.7 in ROUGE-2, and about +12.4 in ROUGE-L/Lsum. Compared with the fully fine-tuned instruct model, PEFT scores only 1.3 to 1.7 points lower across the ROUGE metrics.

These results indicate that PEFT effectively closes the gap with full fine-tuning, producing nearly equivalent summarization quality while requiring only a fraction of the computational cost and memory usage.
Although full fine-tuning slightly outperforms PEFT in raw ROUGE scores, the trade-off between efficiency and performance clearly favors PEFT, validating its practicality for large-scale or resource-limited scenarios.