# Fine-Tuning LLMs

In this exercise, you will fine-tune the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model for enhanced dialogue summarization. You will first explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter-Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

# Manish Kanuri
# 002315456

## 1. Set up Dependencies and Load Dataset and LLM

In [1]:
!pip install datasets evaluate rouge_score peft -q

  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m27.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 kB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m11.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m110.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m96.9 MB/s[0m eta [3

In [2]:
import torch
import time
import evaluate
import pandas as pd
import numpy as np

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
from datasets import load_dataset

In [3]:
dataset = load_dataset('knkarthick/dialogsum')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv:   0%|          | 0.00/442k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Load the pre-trained [Flan-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer from HuggingFace. Notice that you will be using the [small version](https://huggingface.co/google/flan-t5-base) of Flan-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type to be used by this model, which can reduce GPU memory usage since `bfloat16` uses half as much memory per number compared to `float32`, the default precision for most models.

In [4]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

## 2. Test the Model with Zero-Shot Inferencing

Test the model with zero-shot inference.

In [5]:
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
inputs = tokenizer(prompt, return_tensors='pt')
output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output, skip_special_tokens=True)

print(dash_line)
print(f'INPUT PROMPT:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{original_model_summary}\n')

----------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
-------------------------------------------------------

You can see that the model struggles to summarize the dialogue compared to the baseline summary, and simply repeats the first sentence from the dialogue.

## 3. Perform Full Fine-Tuning

### 3.1 Preprocess the Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation.`, and to the start of the summary with `Summary:` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.
Alice: This is her part of the conversation.
Bob: This is his part of the conversation.    
Summary:
```

Training response (summary):
```
Both Alice and Bob participated in the conversation.
```

**Exercise**: Write a function to tokenize a batch of examples from the dialogue dataset. The function should concatentate the dialogues with the predefined prompt, tokenize them along with their summaries, and define the tokenized summaries as the labels.

In [6]:
def tokenize(examples):
    ### WRITE YOUR CODE HERE
    # Add the prompt prefix to each dialogue
    inputs = ["Summarize the following conversation.\n" + dialogue for dialogue in examples["dialogue"]]

    # Tokenize the inputs (dialogues)
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Tokenize the targets (summaries)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs









In [7]:
tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]



Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

### 3.2 Fine-Tune the Model

**Exercise**: Utilize the Hugging Face Trainer API for training the model on the preprocessed dataset. Define the training arguments, a data collator, and create a `Seq2SeqTrainer` instance. Train the model for one epoch.

In [14]:
from transformers import Seq2SeqTrainingArguments


In [15]:
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)

# Load model and tokenizer
model_ckpt = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# ✅ Use correct training args class
training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-dialogsum",
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_dir="./logs",
    predict_with_generate=True,
    save_total_limit=1,
    report_to="none"
)

# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)




  trainer = Seq2SeqTrainer(


Training a fully fine-tuned version of the model should take about 10 minutes on a Google Colab GPU machine.

In [16]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
500,1.3319
1000,1.2391
1500,1.2084
2000,1.1831
2500,1.1736
3000,1.1757


TrainOutput(global_step=3115, training_loss=1.2173051126887289, metrics={'train_runtime': 557.0079, 'train_samples_per_second': 22.37, 'train_steps_per_second': 5.592, 'total_flos': 5596398799220736.0, 'train_loss': 1.2173051126887289, 'epoch': 1.0})

Save the model to a local folder:

In [17]:
model_path = './flan-t5-base-dialogsum-checkpoint'

original_model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('./flan-t5-base-dialogsum-checkpoint/tokenizer_config.json',
 './flan-t5-base-dialogsum-checkpoint/special_tokens_map.json',
 './flan-t5-base-dialogsum-checkpoint/spiece.model',
 './flan-t5-base-dialogsum-checkpoint/added_tokens.json',
 './flan-t5-base-dialogsum-checkpoint/tokenizer.json')

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [18]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained('./flan-t5-base-dialogsum-checkpoint',
                                                       torch_dtype=torch.bfloat16)

Reload the original Flan-T5-base model:

In [19]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 3.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Section 2, using the original model and the fully fine-tuned model.

In [21]:
# Move original model to same device as tokenizer inputs
original_model = original_model.to(model.device)

# Sample input
sample_input = "Summarize the following conversation.\nAlice: This is her part of the conversation.\nBob: This is his part of the conversation.\nSummary:"

# Tokenize input
inputs = tokenizer(sample_input, return_tensors="pt").to(model.device)

# Generate summary with original model
original_output = original_model.generate(**inputs, max_new_tokens=50)
original_summary = tokenizer.decode(original_output[0], skip_special_tokens=True)

# Generate summary with fine-tuned model
finetuned_output = model.generate(**inputs, max_new_tokens=50)
finetuned_summary = tokenizer.decode(finetuned_output[0], skip_special_tokens=True)

# Print both
print("🔹 Original model summary:")
print(original_summary)

print("\n🔸 Fine-tuned model summary:")
print(finetuned_summary)


🔹 Original model summary:
Bob and Alice are going to the park.

🔸 Fine-tuned model summary:
Bob and Alice are talking about the part of the conversation.


The fine-tuned model is able to create a much better summary of the dialogue compared to the original model.

### 3.4 Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [22]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

**Exercise**: Generate the outputs for a sample of the test set with the fine-tuned model (use only the first 10 dialogues and summaries to save time).

In [24]:
### WRITE YOUR CODE HERE
# Take first 10 test samples
original_texts = dataset["test"]["dialogue"][:10]
references = dataset["test"]["summary"][:10]

generated_summaries = []

# Generate predictions
for dialogue in original_texts:
    prompt = "Summarize the following conversation.\n" + dialogue + "\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_summaries.append(summary)

# Compute ROUGE
results = rouge.compute(predictions=generated_summaries, references=references)
print("ROUGE scores for the first 10 test examples:")
for k, v in results.items():
    print(f"{k}: {v:.4f}")


ROUGE scores for the first 10 test examples:
rouge1: 0.3802
rouge2: 0.1237
rougeL: 0.3156
rougeLsum: 0.3181


Evaluate the models computing ROUGE metrics:

In [26]:
# Step 1: Extract dialogues and references
test_dialogues = dataset["test"]["dialogue"][:10]
references = dataset["test"]["summary"][:10]

# Step 2: Generate summaries from both models
original_model_summaries = []
instruct_model_summaries = []

for dialogue in test_dialogues:
    prompt = "Summarize the following conversation.\n" + dialogue + "\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)

    # Original model
    orig_out = original_model.generate(**inputs, max_new_tokens=50)
    original_model_summaries.append(tokenizer.decode(orig_out[0], skip_special_tokens=True))

    # Fine-tuned model
    finetuned_out = model.generate(**inputs, max_new_tokens=50)
    instruct_model_summaries.append(tokenizer.decode(finetuned_out[0], skip_special_tokens=True))

# Step 3: Compute ROUGE
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=references
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=references
)

# Step 4: Display comparison
print("🔹 ORIGINAL MODEL ROUGE:")
for k, v in original_model_results.items():
    print(f"{k}: {v:.4f}")

print("\n🔸 FINE-TUNED MODEL ROUGE:")
for k, v in instruct_model_results.items():
    print(f"{k}: {v:.4f}")


🔹 ORIGINAL MODEL ROUGE:
rouge1: 0.2687
rouge2: 0.0953
rougeL: 0.2287
rougeLsum: 0.2313

🔸 FINE-TUNED MODEL ROUGE:
rouge1: 0.3565
rouge2: 0.1123
rougeL: 0.2877
rougeLsum: 0.2894


The results show substantial improvement in all ROUGE metrics:

In [27]:
print("Absolute percentage improvement of the instruct model over the original model:")

for key in instruct_model_results:
    improvement = instruct_model_results[key] - original_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')

Absolute percentage improvement of the instruct model over the original model:
rouge1: 8.78%
rouge2: 1.70%
rougeL: 5.90%
rougeLsum: 5.81%


## 4. Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** instead of "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results as you will see soon.

One of the most popular PEFT methods is **Low-Rank Adaptation (LoRA)**, which  introduces low-rank matrices to adapt the LLM with minimal additional parameters. In most cases, when someone says PEFT, they typically mean LoRA.  After fine-tuning for a specific task with LoRA, the result is that the original LLM remains unchanged and a newly-trained "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

At inference time, the LoRA adapter is reunited and combined with its original LLM to serve the inference request. The benefit is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

### 4.1 Setup the LoRA model for Fine-Tuning

You first need to define the configuration of the LoRA model. Have a look at the configuration below. The key configuration element to adjust is the rank (`r`) of the adapter, which influences its capacity and complexity. Experiment with various ranks, such as 8, 16, or 32, and see how they affect the results.

In [28]:
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,
    lora_alpha=32,
    lora_dropout=0.1
)

Add LoRA adapter layers/parameters to the original LLM to be trained:

In [29]:
peft_model = get_peft_model(original_model, lora_config)

The number of trainable model parameters in the LoRA model is:

In [30]:
peft_model.print_trainable_parameters()

trainable params: 3,538,944 || all params: 251,116,800 || trainable%: 1.4093


### 4.2 Train the LoRA Adapter

**Exercise**: Define training arguments and create a `Seq2SeqTrainer` instance for the LoRA model. Use a higher learning rate than full fine-tuning (e.g., `1e-3`).

In [31]:
### WRITE YOUR CODE HERE

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from peft import get_peft_model, LoraConfig, TaskType

# 1. Define LoRA configuration
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1
)

# 2. Apply LoRA to the model
lora_model = get_peft_model(model, peft_config)

# 3. Define training arguments (higher learning rate as suggested)
peft_training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-lora",
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-3,  # 🔥 higher LR for PEFT
    num_train_epochs=1,
    logging_dir="./logs",
    save_total_limit=1,
    predict_with_generate=True,
    report_to="none"
)

# 4. Data collator (same)
peft_data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=lora_model)

# 5. Trainer
peft_trainer = Seq2SeqTrainer(
    model=lora_model,
    args=peft_training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=peft_data_collator
)




  peft_trainer = Seq2SeqTrainer(
No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Train the PEFT adapter. Training should take about 6 minutes on a Google Colab GPU machine.

In [32]:
peft_trainer.train()

Step,Training Loss
500,1.1644
1000,1.1602
1500,1.1502
2000,1.1357
2500,1.135
3000,1.1422


TrainOutput(global_step=3115, training_loss=1.1482696484218249, metrics={'train_runtime': 516.0792, 'train_samples_per_second': 24.144, 'train_steps_per_second': 6.036, 'total_flos': 5618611781038080.0, 'train_loss': 1.1482696484218249, 'epoch': 1.0})

Save the model to a local folder:

In [33]:
peft_model.save_pretrained('./flan-t5-base-dialogsum-lora')

Load the PEFT model:

In [34]:
from peft import AutoPeftModelForSeq2SeqLM
from transformers import AutoTokenizer

peft_model = AutoModelForSeq2SeqLM.from_pretrained('./flan-t5-base-dialogsum-lora')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

Reload the original Flan-T5-base model:

In [35]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 4.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Sections 2 and 3, using the original model, the fully fine-tuned model and the PEFT model.

In [36]:
### WRITE YOUR CODE HERE
# Ensure all models are on the same device
original_model = original_model.to(model.device)
lora_model = lora_model.to(model.device)

# Sample test dialogue
dialogue = "Alice: I can’t believe the flight is delayed again.\nBob: Yeah, it’s been a rough travel day."
prompt = "Summarize the following conversation.\n" + dialogue + "\nSummary:"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)

# Original model summary
orig_out = original_model.generate(**inputs, max_new_tokens=50)
orig_summary = tokenizer.decode(orig_out[0], skip_special_tokens=True)

# Fully fine-tuned model summary
ft_out = model.generate(**inputs, max_new_tokens=50)
ft_summary = tokenizer.decode(ft_out[0], skip_special_tokens=True)

# LoRA (PEFT) model summary
lora_out = lora_model.generate(**inputs, max_new_tokens=50)
lora_summary = tokenizer.decode(lora_out[0], skip_special_tokens=True)

# Print results
print("🔹 Original Model Summary:\n", orig_summary)
print("\n🔸 Fully Fine-Tuned Model Summary:\n", ft_summary)
print("\n🟣 LoRA Adapter Model Summary:\n", lora_summary)



🔹 Original Model Summary:
 Alice's flight is delayed again.

🔸 Fully Fine-Tuned Model Summary:
 Alice and Bob are talking about the flight.

🟣 LoRA Adapter Model Summary:
 Bob and Alice are excited about the flight.


### 4.4 Evaluate the Model Quantitatively (with ROUGE Metric)

**Exercise**: Generate the outputs for a sample of the test set with the PEFT model (use only the first 10 dialogues and summaries to save time).

In [37]:
### WRITE YOUR CODE HERE
# Get first 10 dialogues and their gold summaries
test_dialogues = dataset["test"]["dialogue"][:10]
references = dataset["test"]["summary"][:10]

# Generate summaries using the LoRA model
lora_model_summaries = []

for dialogue in test_dialogues:
    prompt = "Summarize the following conversation.\n" + dialogue + "\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    outputs = lora_model.generate(**inputs, max_new_tokens=50)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    lora_model_summaries.append(summary)

# Compute ROUGE
peft_rouge_results = rouge.compute(predictions=lora_model_summaries, references=references)

# Display results
print("🔵 ROUGE Scores for LoRA PEFT Model:")
for metric, score in peft_rouge_results.items():
    print(f"{metric}: {score:.4f}")



🔵 ROUGE Scores for LoRA PEFT Model:
rouge1: 0.3866
rouge2: 0.1313
rougeL: 0.3206
rougeLsum: 0.3234


Compute ROUGE score for this subset of the data.

In [39]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=references[:len(original_model_summaries)]
)


In [41]:
# Step 1: Get test data
test_dialogues = dataset["test"]["dialogue"][:10]
references = dataset["test"]["summary"][:10]

# Step 2: Generate summaries from all 3 models
original_model = original_model.to(model.device)
lora_model = lora_model.to(model.device)

original_model_summaries = []
finetuned_model_summaries = []
lora_model_summaries = []

for dialogue in test_dialogues:
    prompt = "Summarize the following conversation.\n" + dialogue + "\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)

    # Original model
    orig_out = original_model.generate(**inputs, max_new_tokens=50)
    original_model_summaries.append(tokenizer.decode(orig_out[0], skip_special_tokens=True))

    # Fully fine-tuned model
    finetuned_out = model.generate(**inputs, max_new_tokens=50)
    finetuned_model_summaries.append(tokenizer.decode(finetuned_out[0], skip_special_tokens=True))

    # LoRA model
    lora_out = lora_model.generate(**inputs, max_new_tokens=50)
    lora_model_summaries.append(tokenizer.decode(lora_out[0], skip_special_tokens=True))

# Step 3: Compute ROUGE scores
original_model_results = rouge.compute(predictions=original_model_summaries, references=references)
finetuned_model_results = rouge.compute(predictions=finetuned_model_summaries, references=references)
lora_model_results = rouge.compute(predictions=lora_model_summaries, references=references)

# Step 4: Print results
print("🔹 ROUGE for Original Model:")
for k, v in original_model_results.items():
    print(f"{k}: {v:.4f}")

print("\n🔸 ROUGE for Fine-Tuned Model:")
for k, v in finetuned_model_results.items():
    print(f"{k}: {v:.4f}")

print("\n🟣 ROUGE for LoRA (PEFT) Model:")
for k, v in lora_model_results.items():
    print(f"{k}: {v:.4f}")


🔹 ROUGE for Original Model:
rouge1: 0.2687
rouge2: 0.0953
rougeL: 0.2287
rougeLsum: 0.2313

🔸 ROUGE for Fine-Tuned Model:
rouge1: 0.4166
rouge2: 0.1526
rougeL: 0.3256
rougeLsum: 0.3291

🟣 ROUGE for LoRA (PEFT) Model:
rouge1: 0.3471
rouge2: 0.0979
rougeL: 0.2789
rougeLsum: 0.2805


Notice, that PEFT model results are not too bad, while the training process was much easier!

Calculate the improvement of PEFT over the original model:

In [43]:
print("Absolute percentage improvement of the PEFT (LoRA) model over the original model:")

for key in lora_model_results:
    improvement = lora_model_results[key] - original_model_results[key]
    print(f"{key}: {improvement * 100:.2f}%")


Absolute percentage improvement of the PEFT (LoRA) model over the original model:
rouge1: 7.84%
rouge2: 0.27%
rougeL: 5.02%
rougeLsum: 4.92%


Now calculate the improvement of PEFT over a full fine-tuned model:

In [45]:
print("Absolute percentage improvement of the PEFT (LoRA) model over the fully fine-tuned model:")

for key in lora_model_results:
    improvement = lora_model_results[key] - finetuned_model_results[key]
    print(f"{key}: {improvement * 100:.2f}%")


Absolute percentage improvement of the PEFT (LoRA) model over the fully fine-tuned model:
rouge1: -6.95%
rouge2: -5.47%
rougeL: -4.67%
rougeLsum: -4.86%


You can see a small percentage decrease in the ROUGE metrics vs. full fine-tuned. However, the training requires much less computing and memory resources.