

<p style="background-color:#e6f7ff; 
          padding:15px; 
          color:#111;
          font-size:16px;
          border-width:3px; 
          border-color:#d0eefc; 
          border-style:solid;
          border-radius:6px"> 🔍 This project focuses on fine-tuning an existing LLM for better dialogue summarization. We'll use the <a style="text-decoration: underline;" href="https://huggingface.co/docs/transformers/model_doc/flan-t5"><code>FLAN-T5</code></a> model, a high-quality instruction-tuned model that can summarize text out of the box. To improve its summaries, we'll first explore full fine-tuning and evaluate the results with <code>ROUGE</code> metrics. We'll then perform Parameter Efficient Fine-Tuning (<code>PEFT</code>), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.
</p>




## Setup

In [1]:
!pip install --quiet evaluate
!pip install --quiet rouge_score
!pip install --quiet peft

In [41]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Load Dataset and LLM

In [3]:
dataset = load_dataset("knkarthick/dialogsum")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

**Inspect an example from the dataset**

In [4]:
dataset['train'][0]

{'id': 'train_0',
 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.",
 'summary': "Mr. Smith'

**Load the pre-trained FLAN-T5 model and its tokenizer**

In [None]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                torch_dtype=torch.bfloat16,
                                )
original_model = original_model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

**Number of model parameters and trainable parameters**

In [9]:
def number_of_trainable_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    print(f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%")

In [10]:
number_of_trainable_parameters(original_model)

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
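
As a rough back-of-the-envelope check, the parameter count above can be turned into a memory estimate for the weights alone. This is a minimal sketch: the helper name `weight_memory_mib` is hypothetical, and the estimate ignores activations, gradients, and optimizer state, which add considerably more during training.

```python
# Parameter count printed by number_of_trainable_parameters(original_model) above.
NUM_PARAMS = 247_577_856

# Standard storage sizes, in bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in MiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mib(NUM_PARAMS, dtype):.1f} MiB")
```

Loading the model with `torch_dtype=torch.bfloat16`, as done above, halves the weight footprint compared with `float32`.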



## Zero-shot Inference

<div style="
    background-color: #fff6ff;
    color: #111;
    font-size: 16px;
    padding: 15px;
    border-width: 3px;
    border-color: #efe6ef;
    border-style: solid;
    border-radius: 6px;
            ">
  📚 The <code>Zero-shot</code> approach lets us generate summaries without any additional fine-tuning on this dataset, relying instead on the model's pre-trained capabilities.<br>
The process involves creating a structured prompt that includes the dialogue text.
</div>

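The structured prompt described above can be wrapped in a small helper so that every cell builds it the same way. This is only a sketch: `build_prompt` is a hypothetical name that does not appear in the notebook, and the two-line dialogue is made up for illustration. The string it returns matches the f-string used in the zero-shot cell below.

```python
def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the summarization instruction (hypothetical helper)."""
    return f"\nSummarize the following conversation.\n\n{dialogue}\n\nSummary:\n"


# Made-up dialogue, for illustration only.
example_dialogue = "#Person1#: Are you free tonight?\n#Person2#: Yes, let's meet at six."
print(build_prompt(example_dialogue))
```

Reusing one helper avoids small prompt differences between cells, such as the extra indentation that appears when the f-string is written inside a loop body.
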
In [11]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

In [15]:
prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
gen_summ = original_model.generate(
        inputs["input_ids"].to(device), 
        max_new_tokens=200,
    )[0] 

output = tokenizer.decode(
    gen_summ,
    skip_special_tokens=True
)

In [16]:
dash_line = '-' * 99  # 99-character separator line
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------


## Full fine-tuning

<div style="background-color:#ffe6f7; 
          padding:15px; 
          color:#111;
          font-size:16px;
          border-width:3px; 
          border-color:#f5dce9; 
          border-style:solid;
          border-radius:6px"> 📄 In this section we'll perform <code>full fine-tuning</code>, updating all of the model's weights. To do so, we need to convert the dialogue-summary pairs into explicit instructions for the LLM.<br>
          We can use a simple instruction prompt that prepends <code>Summarize the following conversation</code> to the dialogue and starts the summary with <code>Summary</code>, as follows:
</br></br>
    
<em>Training prompt (dialogue):</em>
<p style="
    background-color: #555;
    color: #fff;
    font-size: 16px;
    padding: 10px;
    border-width: 2px;
    border-color: #111;
    border-style: solid;
    display: inline-block;
    border-radius: 6px;"
        > Summarize the following conversation:</br>
         &nbsp; &nbsp; &nbsp; Chris: This is his part of the conversation.</br>
         &nbsp; &nbsp; &nbsp; Antje: This is her part of the conversation.</br>
    Summary: </br>
</p>
    </br>

<em>Training response (summary):</em>
<p style="
    background-color: #555;
    color: #fff;
    font-size: 16px;
    padding: 10px;
    border-width: 2px;
    border-color: #111;
    border-style: solid;
    display: inline-block;
    border-radius: 6px;"
        >    Both Chris and Antje participated in the    
</p>
</div>

**Tokenize Dataset**

In [17]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation:\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    
    example['input_ids'] = tokenizer(prompt,
                                     padding="max_length",
                                     truncation=True,
                                     return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"],
                                  padding="max_length",
                                  truncation=True,
                                  return_tensors="pt").input_ids
    
    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

**Shapes of the dataset**

In [18]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
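
The tokenization step above pads the labels to a fixed length, so padding tokens count toward the training loss. A common variation, which this notebook does not apply, is to replace padding positions in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below illustrates the idea on plain lists. `mask_padding` is a hypothetical helper, the token ids are toy values, and the pad id of `0` is an assumption (in practice, read it from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they are skipped by the loss."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


# Toy label batch; 0 is assumed to be the pad id.
toy_labels = [[71, 9251, 1, 0, 0], [37, 1, 0, 0, 0]]
print(mask_padding(toy_labels, pad_token_id=0))
```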


### Fine-tune model with preprocessed Dataset

In [21]:
output_dir = f'./dialogue-summary-training'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=80,
    auto_find_batch_size=True,
    report_to='none'
)

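# Note: Trainer updates the weights of the model object it receives in place,
# so original_model no longer holds the pre-trained weights after training.
trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)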

In [None]:
trainer.train()

**Free up GPU memory**

In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [25]:
# output_dir already starts with './', so use it directly as the path prefix.
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/checkpoint-1558',
                                                       torch_dtype=torch.bfloat16)
instruct_model = instruct_model.to(device)
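
The checkpoint name `checkpoint-1558` is consistent with one epoch over the 12,460 training examples at a batch size of 8. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and a batch size of 8 (the default `per_device_train_batch_size`, which `auto_find_batch_size=True` may lower if memory runs out). `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


# 12460 training rows, assumed batch size of 8.
print(steps_per_epoch(12460, 8))
```

If the batch size found on your hardware differs, the final checkpoint directory will carry a different step number.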


### Evaluate the Model Qualitatively (Human Evaluation)

<div style="
    background-color: #e6ffe6; 
    color: #111;
    font-size: 16px;
    padding: 15px;
    border-width: 3px;
    border-color: #d9f5d9;
    border-style: solid;
    border-radius: 6px;
            ">
  📊 As with many GenAI applications, a <code>qualitative</code> approach is usually a good starting point. In the example below (the same one we started this notebook with), the fine-tuned model produces a reasonable summary of the dialogue, whereas the original model fails to understand what is being asked of it.
</div>



In [35]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

In [31]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids,
                               generation_config=GenerationConfig(max_new_tokens=200,
                                                                    num_beams=1)
                                                )
original_model_text_output = tokenizer.decode(original_model_outputs[0],
                                              skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids,
                                generation_config=GenerationConfig(max_new_tokens=200,
                                                                   num_beams=1)
                                                )
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0],
                                              skip_special_tokens=True)

In [32]:
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm not sure what you mean. #Person2#: Well, I'm not sure what you mean. I'm not sure what I'm doing wrong. I'm not sure what I'm doing wrong. I'm not sure what I'm doing wrong.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
You might want to upgrade your computer.



### Evaluate the Model Quantitatively (with ROUGE Metric)

<div style="
    background-color: #fff6e4; 
    color: #111;
    font-size: 16px;
    padding: 15px;
    border-width: 3px;
    border-color: #f5ecda;
    border-style: solid;
    border-radius: 6px;
            ">
  🤖 The <code>ROUGE</code> metric helps quantify the quality of summaries produced by models. It compares each generated summary to a "baseline" summary, which is usually written by a human. While not perfect, it does indicate the overall gain in summarization effectiveness that we have achieved by fine-tuning.
</div>

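To make the comparison concrete, here is a simplified sketch of a ROUGE-1 style F1 score based on unigram overlap. It assumes plain lowercase whitespace tokenization and no stemming, so it will not reproduce the numbers from `evaluate.load('rouge')`, which applies its own tokenization and optional stemming. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat", "the cat sat on the mat"))
```

ROUGE-2 follows the same idea with bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.
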



In [38]:
rouge = evaluate.load('rouge')

**Generate summaries with both original and Instruct models**

In [None]:
dialogues = dataset['test']['dialogue']
human_baseline_summaries = dataset['test']['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in tqdm(dialogues,desc='dialogues'):
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """
    input_ids = tokenizer(prompt,truncation=True, return_tensors="pt").input_ids
    input_ids = input_ids.to(device)

    # original model generation
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    # instruct model generation
    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

dialogues:  17%|█▋        | 261/1500 [04:09<28:21,  1.37s/it]  Token indices sequence length is longer than the specified maximum sequence length for this model (1028 > 512). Running this sequence through the model will result in indexing errors
dialogues:  46%|████▌     | 683/1500 [10:11<15:45,  1.16s/it]  

**Compute Rouge score for both models**

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

In [50]:
pd.DataFrame({
    'Original Model': original_model_results,
    'Instruct Model': instruct_model_results
}).T

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| Original Model | 0.200099 | 0.058277 | 0.172409 | 0.172571 |
| Instruct Model | 0.22283 | 0.076568 | 0.193753 | 0.194324 |


In [46]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 2.27%
rouge2: 1.83%
rougeL: 2.13%
rougeLsum: 2.18%
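
The figures above are absolute differences, that is, percentage points on the ROUGE scale. It can also help to look at the change relative to the baseline score. The sketch below recomputes both views from the scores in the table above; `compare` is a hypothetical helper.

```python
# Scores copied from the results table above.
original_scores = {"rouge1": 0.200099, "rouge2": 0.058277,
                   "rougeL": 0.172409, "rougeLsum": 0.172571}
instruct_scores = {"rouge1": 0.22283, "rouge2": 0.076568,
                   "rougeL": 0.193753, "rougeLsum": 0.194324}


def compare(baseline, candidate):
    """Return {metric: (absolute gain in points, relative gain in percent)}."""
    result = {}
    for metric, base in baseline.items():
        gain = candidate[metric] - base
        result[metric] = (gain * 100, gain / base * 100)
    return result


for metric, (points, relative) in compare(original_scores, instruct_scores).items():
    print(f"{metric}: +{points:.2f} points absolute, +{relative:.2f}% relative")
```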



## Perform Parameter Efficient Fine-Tuning 

<p style="background-color:#e6f7ff; 
          padding:15px; 
          color:#111;
          font-size:16px;
          border-width:3px; 
          border-color:#d0eefc; 
          border-style:solid;
          border-radius:6px"> 💡 Now, let's perform <code>Parameter Efficient Fine-Tuning (PEFT)</code> instead of the <strong>full fine-tuning</strong> we did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results. <br> <br>
PEFT is a generic term that includes <code>Low-Rank Adaptation (LoRA)</code> and <code>prompt tuning</code>. At a very high level, LoRA lets us fine-tune a model using fewer compute resources. After fine-tuning for a specific task, use case, or tenant with LoRA, the original LLM remains unchanged (frozen weights) and a newly trained <code>LoRA adapter</code> emerges. This adapter is much smaller than the original LLM, on the order of a single-digit percentage of its size (MBs vs GBs). <br> <br>
At inference time, the LoRA adapter must be recombined with its original LLM to serve the request. The benefit is that many LoRA adapters can reuse the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.</p>

### Setup the PEFT/LoRA model for Fine-Tuning

<p style="background-color:#e6f7ff; 
          padding:15px; 
          color:#111;
          font-size:16px;
          border-width:3px; 
          border-color:#d0eefc; 
          border-style:solid;
          border-radius:6px"> 💡 Let's set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. With PEFT/LoRA, the underlying LLM weights are frozen and only the adapter is trained. The rank <code>r</code> is a LoRA hyperparameter that defines the rank/dimension of the adapter to be trained.</p>
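
To illustrate what the adapter does, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. The frozen weight `W` is left untouched, and the trainable low-rank pair `A` (shape r × d_in) and `B` (shape d_out × r) adds an update scaled by `alpha / r`. The function names and tiny matrices are illustrative assumptions, not code from the `peft` library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (toy values)
A = [[1.0, 1.0]]              # r=1, d_in=2
B = [[0.0], [0.0]]            # zero B makes the adapter a no-op at the start
print(lora_forward(W, A, B, [3.0, 4.0], alpha=2, r=1))
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the original one, and training moves only `A` and `B`.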

In [75]:
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

In [78]:
from peft import get_peft_model

peft_model = get_peft_model(original_model, 
                            lora_config)
peft_model = peft_model.to(device)
number_of_trainable_parameters(peft_model)

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
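
The trainable-parameter count above can be reproduced by hand. The sketch below assumes architecture details that are not stated in this notebook: 12 encoder and 12 decoder blocks, and `q`/`v` projections that map 768 to 768 dimensions. Under those assumptions the arithmetic matches the printed numbers exactly.

```python
# Assumed FLAN-T5-base architecture values (not stated in the notebook).
D_IN = D_OUT = 768
ENCODER_BLOCKS = 12
DECODER_BLOCKS = 12
RANK = 32  # r from lora_config


def lora_params_per_linear(d_in: int, d_out: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so the adapter adds r * (d_in + d_out)."""
    return r * (d_in + d_out)


# Encoder blocks have self-attention; decoder blocks have self- and cross-attention.
attention_layers = ENCODER_BLOCKS + 2 * DECODER_BLOCKS
adapted_linears = attention_layers * 2  # target_modules=["q", "v"]
lora_total = adapted_linears * lora_params_per_linear(D_IN, D_OUT, RANK)

base_total = 247_577_856  # printed earlier for the base model
print(lora_total, base_total + lora_total,
      f"{100 * lora_total / (base_total + lora_total):.2f}%")
```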


### Train PEFT Adapter

In [79]:
output_dir = f'./peft-dialogue-summary-training'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=150,
    eval_strategy='epoch',
    save_strategy='epoch',
    report_to='none',
    eval_on_start=True
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation']
)

In [None]:
peft_trainer.train()

In [58]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [59]:
peft_model_path="./peft-dialogue-summary-checkpoint"
peft_trainer.model.save_pretrained(peft_model_path)

**Prepare PEFT model by adding an adapter to the original FLAN-T5 model**

In [60]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       peft_model_path, 
                                       is_trainable=False)
peft_model = peft_model.to(device)

### Evaluate the Model Qualitatively (Human Evaluation)

In [63]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

In [64]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1# suggests upgrading #Person2#'s system to make up some flyers and banners. #Person2# also suggests upgrading #Person1#'s hardware.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
You might want to upgrade your computer.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person2# wants to upgrade #Person2#'s system and hardware. #Person1# recommends adding a painting program to #Person2#'s software and adding a CD-ROM drive.


### Evaluate the Model Quantitatively (with ROUGE Metric)

In [65]:
dialogues = dataset['test']['dialogue']
human_baseline_summaries = dataset['test']['summary']

peft_model_summaries = []

for dialogue in tqdm(dialogues,desc='dialogue'):
    prompt = f"""
        Summarize the following conversation.

        {dialogue}

        Summary: """
    
    input_ids = tokenizer(prompt,truncation=True,
                          return_tensors="pt").input_ids
    input_ids = input_ids.to(device)
    
    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    peft_model_summaries.append(peft_model_text_output)

dialogue: 100%|██████████| 1500/1500 [33:16<00:00,  1.33s/it]


In [70]:
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries,
        columns = ['human_baseline', 'original_model', 'instruct_model', 'peft_model'])
df = df.sample(1)
styled_df = df.style.set_properties(
    **{
        'text-align': 'left',  
        'white-space': 'pre-wrap',
        'max-width': '300px', 
    }
).set_table_attributes('style="width: 100%;"')
styled_df

| | human_baseline | original_model | instruct_model | peft_model |
|---|---|---|---|---|
| 1307 | Greg Sonders calls Mary to ask whether Mary is interested in sports and tells Mary to wait for final admission decision later. | Mr. Sonders, please tell me a little bit about yourself. | Greg Sonders from Brown College is speaking to Mary. | Greg Sonders is speaking to Mary. He asks Mary if she'd be interested in college sports. Mary tells Greg Sonders she plays volleyball and she's impressed. |


In [71]:
rouge = evaluate.load('rouge')

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

In [72]:
pd.DataFrame({
    'Original Model': original_model_results,
    'Instruct Model': instruct_model_results,
    'PEFT Model': peft_model_results
}).T

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| Original Model | 0.200099 | 0.058277 | 0.172409 | 0.172571 |
| Instruct Model | 0.22283 | 0.076568 | 0.193753 | 0.194324 |
| PEFT Model | 0.403744 | 0.154918 | 0.321592 | 0.321941 |


**Improvement of PEFT model over original model**

In [73]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 20.36%
rouge2: 9.66%
rougeL: 14.92%
rougeLsum: 14.94%


**Improvement of PEFT model over Instruct model**

In [74]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: 18.09%
rouge2: 7.83%
rougeL: 12.78%
rougeLsum: 12.76%
