# | NLP | PEFT/LoRA | DialogSum | Dialog Summarize |

## NLP (Natural Language Processing) with PEFT (Parameter Efficient Fine-Tuning) and LoRA (Low-Rank Adaptation) for Dialogue Summarization

# <b>1 <span style='color:#78D118'>|</span> Introduction</b>

This project explores the capabilities of large language models (LLMs), with a specific focus on using Parameter Efficient Fine-Tuning (PEFT) to improve dialogue summarization with the FLAN-T5 model.

Our goal is to enhance the quality of dialogue summarization by employing a comprehensive fine-tuning approach and evaluating the results using ROUGE metrics. Additionally, we will explore the advantages of Parameter Efficient Fine-Tuning (PEFT), demonstrating that its benefits outweigh any potential minor performance trade-offs.

 - NOTE: This is an example, and we are not using the entire dataset for PEFT / LoRA training.
 
## Objectives:
 - Train an LLM for dialogue summarization.
 
 
## The DialogSum Dataset:
[DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding manually labeled summaries and topics.

## Project Workflow:

- **Setup**: Import necessary libraries and define project parameters.
- **Dataset Exploration**: Explore the DialogSum dataset.
- **Test Model Zero Shot Inferencing**: Initially, test the FLAN-T5 model for zero-shot inferencing on dialogue summarization tasks to establish a baseline performance.
- **Dataset Preprocess Dialog and Summary**: Preprocess each dialogue and its corresponding summary to prepare the dataset for training.
-  **Perform Parameter Efficient Fine-Tuning (PEFT)**: Implement Parameter Efficient Fine-Tuning (PEFT), a more efficient fine-tuning approach that can significantly reduce training time while maintaining performance.
-  **Evaluation**:
    - Perform human evaluation to gauge the model's output in terms of readability and coherence. This can involve annotators ranking generated summaries for quality.
    - Utilize ROUGE metrics to assess the quality of the generated summaries. ROUGE measures the overlap between generated summaries and human-written references.

# <b>2<span style='color:#78D118'>|</span> Setup</b>
## <b>2.1 <span style='color:#78D118'>|</span> Imports</b>

In [1]:
# %pip install --upgrade pip
# %pip install --disable-pip-version-check \
#     torch==1.13.1 \
#     torchdata==0.5.1 --quiet

# %pip install \
#     transformers==4.27.2 \
#     datasets==2.11.0 \
#     evaluate==0.4.0 \
#     rouge_score==0.1.2 \
#     loralib==0.1.1 \
#     peft==0.3.0 --quiet

In [2]:
# %pip install --upgrade pip
# %pip install \
#     torch \
#     torchdata --quiet

# %pip install \
#     transformers \
#     datasets\
#     evaluate \
#     rouge_score \
#     loralib \
#     peft --quiet

In [1]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np
from peft import LoraConfig, get_peft_model, TaskType
from peft import PeftModel, PeftConfig


In [2]:
rouge = evaluate.load('rouge')
dash_line = '-' * 99  # 99-character separator line used when printing results

Load the dataset

In [5]:
# %pip install -U datasets

In [3]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

Found cached dataset csv (C:/Users/Prajwal-S-Yallur/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-3005b557c2c04c1d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
100%|██████████| 3/3 [00:00<00:00, 38.83it/s]


Load the pre-trained [FLAN-T5 model](https://huggingface.co/google/flan-t5-base) and its tokenizer directly from Hugging Face. We'll be using the base-size checkpoint (`google/flan-t5-base`), one of the smaller FLAN-T5 variants.

To reduce memory usage, set `torch_dtype=torch.bfloat16` so that the model weights are loaded in bfloat16 precision.
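To get a rough sense of what bfloat16 saves, the sketch below estimates the memory taken by the weights alone, using the parameter count this notebook prints for the base model (247,577,856). It assumes 4 bytes per float32 value and 2 bytes per bfloat16 value, and it ignores activations, gradients, and optimizer state, so treat the numbers as a lower bound rather than a measurement.

```python
# Back-of-the-envelope weight memory for FLAN-T5 base.
# Assumption: only the weights are counted (no activations, gradients, or optimizer state).
NUM_PARAMS = 247_577_856  # parameter count printed by this notebook

BYTES_PER_VALUE = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in MiB for the given dtype."""
    return num_params * BYTES_PER_VALUE[dtype] / (1024 ** 2)


for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {weight_memory_mib(NUM_PARAMS, dtype):.0f} MiB")
```

Under these assumptions, bfloat16 halves the weight footprint compared with float32.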

In [9]:
device = torch.device("cpu")

In [10]:
model_name='google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## <b>2.2 <span style='color:#78D118'>|</span> Methods</b>

In [11]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# <b>3<span style='color:#78D118'>|</span> Data Exploration</b>

In [12]:
print(dash_line)
print(print_number_of_trainable_model_parameters(original_model))
print(dash_line)

---------------------------------------------------------------------------------------------------
trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
---------------------------------------------------------------------------------------------------


In [13]:
print(
    """
---------------------------------------------------------------------------------------------------

PROMPT:

Summarize the following conversation.


#Person1#: Have you considered upgrading your system?

#Person2#: Yes, but I'm not sure what exactly I would need.

#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.

#Person2#: That would be a definite bonus.

#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.

#Person2#: How can we do that?

#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?

#Person2#: No.

#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.

#Person2#: That sounds great. Thanks.


Summary:

---------------------------------------------------------------------------------------------------

HUMAN SUMMARY:

#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
    """
)


---------------------------------------------------------------------------------------------------

PROMPT:

Summarize the following conversation.


#Person1#: Have you considered upgrading your system?

#Person2#: Yes, but I'm not sure what exactly I would need.

#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.

#Person2#: That would be a definite bonus.

#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.

#Person2#: How can we do that?

#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?

#Person2#: No.

#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.

#Person2#: That sounds great. Thanks.


Summary:

------------------------------------------------------------

# <b>4<span style='color:#78D118'>|</span> Test Model Zero Shot Inferencing</b>

Test the model using zero-shot inference. The model clearly struggles to summarize the dialogue when compared to the human baseline summary. However, it does manage to extract some important information from the text, which suggests that fine-tuning could help it produce better summaries.

In [14]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)
print(dash_line)
print("ZERO SHOT")
print(dash_line)
print(f'PROMPT:\n{prompt}')
print(dash_line)
print(f'HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'ORIGINAL MODEL SUMMARY:\n{output}')
print(dash_line)

---------------------------------------------------------------------------------------------------
ZERO SHOT
---------------------------------------------------------------------------------------------------
PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: T

# <b>5<span style='color:#78D118'>|</span> Dataset Preprocess Dialog and Summary</b>

Transform the dialog-summary (prompt-response) pairs by adding explicit instructions for the LLM. Add the instruction "Summarize the following conversation" at the beginning of the dialog and "Summary" at the beginning of the summary, like this:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary: 
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Now we preprocess the prompt-response dataset by tokenizing the text and extracting their input_ids, with one input_id assigned per token.
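The following toy sketch illustrates the three ideas involved: wrapping a dialogue in the instruction template, padding or truncating to a fixed length, and (optionally) masking padding in the labels. It is not the real tokenizer: a whitespace split and a vocabulary built on the fly stand in for SentencePiece, and `toy_encode` and `mask_padding` are hypothetical helpers. Replacing padded label positions with `-100` is a common convention so that the loss skips them; the `tokenize_function` defined earlier in this notebook does not do this.

```python
# Toy illustration of prompt construction, fixed-length padding, and label masking.
# Assumptions: a whitespace "tokenizer" and an on-the-fly vocabulary stand in for the
# real SentencePiece tokenizer; PAD_ID = 0 mirrors T5's pad token id.
PAD_ID = 0
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the same instruction template used by tokenize_function."""
    return f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: "


def toy_encode(text: str, vocab: dict, max_length: int) -> list:
    """Map whitespace-separated words to ids, then truncate and pad to max_length."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


def mask_padding(label_ids: list) -> list:
    """Replace padding ids in the labels so the loss does not count them."""
    return [IGNORE_INDEX if token_id == PAD_ID else token_id for token_id in label_ids]


vocab = {}
input_ids = toy_encode(build_prompt("Chris: Hi.\nAntje: Hello."), vocab, max_length=12)
labels = mask_padding(toy_encode("Chris and Antje greet each other.", vocab, max_length=8))

print(input_ids)  # 9 word ids followed by 3 padding ids
print(labels)     # 6 word ids followed by 2 ignored positions
```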

In [16]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

In [15]:
# tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 10 == 0, with_indices=True)


 - NOTE: This is an example, and we are not using the entire dataset for PEFT / LoRA training.

In [17]:
print(dash_line)
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")
print(tokenized_datasets)
print(dash_line)

---------------------------------------------------------------------------------------------------
Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})
---------------------------------------------------------------------------------------------------


The output above shows the shapes of all three splits of the dataset (training, validation, and test).

# <b>6 <span style='color:#78D118'>|</span> Perform Parameter Efficient Fine-Tuning (PEFT)</b>

Let's delve into the process of Parameter Efficient Fine-Tuning (PEFT), which offers a more efficient alternative to full fine-tuning. PEFT encompasses various techniques, including Low-Rank Adaptation (LoRA) and prompt tuning (distinct from prompt engineering).

In this project, PEFT is implemented with Low-Rank Adaptation (LoRA).

LoRA, in essence, enables fine-tuning of your model with significantly fewer computational resources, often on just a single GPU. After fine-tuning for a specific task, use case, or tenant using LoRA, the original LLM remains unchanged, while a newly trained "LoRA adapter" emerges. This LoRA adapter is substantially smaller than the original LLM, often only a fraction of its size (megabytes rather than gigabytes).

However, during inference, the LoRA adapter needs to be reintegrated and combined with its original LLM to fulfill the inference request. The advantage lies in the fact that multiple LoRA adapters can reuse the same original LLM, reducing overall memory requirements when serving multiple tasks and use cases.
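To make the "frozen weights plus small adapter" idea concrete, here is a minimal sketch of a LoRA-style forward pass using plain Python lists. It computes `W x + (alpha / r) * B (A x)`, where `W` is the frozen weight and `A`, `B` are the low-rank adapter matrices. The dimensions and values are tiny and hand-picked purely for illustration; `B_trained` is a hypothetical stand-in for values learned during training.

```python
# Minimal LoRA-style forward pass with plain Python lists.
# Assumptions: tiny dimensions and hand-picked values, chosen only for illustration.
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


d, k, r, alpha = 4, 4, 1, 2
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen weight (identity)
A = [[0.5, 0.5, 0.5, 0.5]]                # r x k adapter matrix
B_zero = [[0.0] for _ in range(d)]        # d x r adapter matrix, zero at initialization
B_trained = [[1.0], [0.0], [0.0], [0.0]]  # hypothetical values after training
x = [1.0, 2.0, 3.0, 4.0]

out_init = lora_forward(W, A, B_zero, x, alpha, r)
out_trained = lora_forward(W, A, B_trained, x, alpha, r)
print(out_init)     # identical to W x, because B starts at zero
print(out_trained)  # only the first coordinate is shifted by the adapter
print("full matrix values:", d * k, "| adapter values:", r * (d + k))
```

Because `B` starts at zero, the adapted model initially behaves exactly like the original one, and only the `r * (d + k)` adapter values need to be stored per task.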

## <b>6.1 <span style='color:#78D118'>|</span> PEFT/LoRA model for Fine-Tuning</b>

To configure the PEFT/LoRA model for fine-tuning with a new parameter adapter, we follow these steps:

1. **PEFT/LoRA Setup**: 
   - We are using PEFT/LoRA, which means we freeze the underlying LLM and train only the adapter.

2. **Adapter Configuration**:
   - In the LoRA configuration below, note the `rank (r)` hyper-parameter. It determines the rank (dimensionality) of the adapter that will be trained.

By employing PEFT/LoRA, we ensure that the core LLM remains unchanged while adapting a separate parameterized layer for our specific task or use case. The `rank (r)` hyper-parameter plays a critical role in determining the adapter's complexity and capacity for the target task.

In [20]:
lora_config = LoraConfig(
    r=64, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.15,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add the LoRA adapter layers and parameters to the original LLM for training.

In [21]:
peft_model = get_peft_model(original_model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 7077888
all model parameters: 254655744
percentage of trainable model parameters: 2.78%
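The trainable-parameter count above can be reproduced by hand. The sketch below assumes the T5-base architecture (hidden size 768, 12 encoder and 12 decoder blocks, square 768 x 768 query and value projections); these figures are not stated in this notebook and are assumptions about the checkpoint. Each adapted projection gets two low-rank matrices with `r * d_in + d_out * r` values in total.

```python
# Reproduce the trainable-parameter count reported above.
# Assumptions (T5-base architecture, not stated in this notebook): hidden size 768,
# 12 encoder and 12 decoder blocks, square 768 x 768 query/value projections.
D_MODEL = 768
ENCODER_BLOCKS = 12
DECODER_BLOCKS = 12
RANK = 64                       # r from lora_config
BASE_PARAMS = 247_577_856       # frozen parameters printed earlier in this notebook


def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """Values in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


# Encoder blocks: self-attention q and v.
# Decoder blocks: self-attention q and v, plus cross-attention q and v.
adapted_modules = ENCODER_BLOCKS * 2 + DECODER_BLOCKS * 4
trainable = adapted_modules * lora_params_per_module(D_MODEL, D_MODEL, RANK)
percentage = 100 * trainable / (BASE_PARAMS + trainable)

print(f"adapted modules: {adapted_modules}")
print(f"trainable parameters: {trainable}")
print(f"percentage trainable: {percentage:.2f}%")
```

Under these assumptions the arithmetic matches the printed values: 72 adapted projections, 7,077,888 trainable parameters, and 2.78% of the total.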


## <b>6.2 <span style='color:#78D118'>|</span> Train PEFT/LoRA Adapter</b>

In [26]:
output_dir = f'./peft-dialogue-summary-training'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
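    num_train_epochs=1, # Overridden by max_steps below.
    logging_steps=1,
    max_steps=100, # Stop after 100 optimizer steps (only a subset of the data is seen).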
    do_eval=True
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"]
)

In [27]:
# record start time
start = time.time()

peft_trainer.train()

# record end time
end = time.time()


# print the elapsed training time in seconds
print("Model Training time is:",
      (end-start), "s")

Step,Training Loss
1,26.375
2,23.25
3,19.5
4,16.625
5,12.25
6,8.3125
7,5.9688
8,4.9375
9,4.7188
10,4.5


Model Training time is: 130031.78906440735 ms


In [28]:
peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

In [None]:
peft_model_path="./peft-dialogue-summary-checkpoint-local"

In [29]:
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       peft_model_path, 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

In [30]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 254655744
percentage of trainable model parameters: 0.00%


# <b>7 <span style='color:#78D118'>|</span> Evaluation</b>

## <b>7.1 <span style='color:#78D118'>|</span> Evaluate the Model Qualitatively (Human Evaluation)</b>

In [31]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summaries = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").to(device).input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'HUMAN SUMMARY:\n{human_baseline_summaries}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')
print(dash_line)

---------------------------------------------------------------------------------------------------
HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1## wants to upgrade his hardware. #Person2## needs a faster processor, to begin with.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person1# wants to upgrade his hardware because it is pretty outdated now. #Person1# wants to upgrade his hardware because it is pretty outdated now. #Person1# wants to upgrade his computer. #Person1# wants to add a painting program to his software.
---------------------------------------------------------------------------------------------------


In [32]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []


# record start time
start = time.time()

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").to(device).input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)
    
#     print(dash_line)
#     print(f'HUMAN SUMMARY:\n{human_baseline_summaries}')
#     print(dash_line)
#     print(f'ORIGINAL MODEL:\n{original_model_text_output}')
#     print(dash_line)
#     print(f'PEFT MODEL: {peft_model_text_output}')
#     print(dash_line)
#     print(dash_line)

# record end time
end = time.time()


# print the elapsed inference time in seconds
print("Model Inference time is:",
      (end-start), "s")

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

Model Inference time is: 65.43029761314392 s


| | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Memo1### is a #Person1## dicting for #Person1#... | #Person1# needs to take a dictation for #Perso... |
| 1 | In order to prevent employees from wasting tim... | #Person1## needs to take a dictation to #Perso... | #Person1# needs to take a dictation for #Perso... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1# needs to take a dictation for #Perso... | #Person1# needs to take a dictation for #Perso... |
| 3 | #Person2# arrives late because of traffic jam.... | #Person1# feels bad about his car's congestion... | #Person1# is going to have to consider a diffe... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | You're finally here, #Person1#: #Person1# thin... | #Person1# is going to have to consider a diffe... |
| 5 | #Person2# complains to #Person1# about the tra... | #Person1# is stuck in traffic jam near the Car... | #Person1# is going to have to consider a diffe... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | #Person1##: Masha and Hero are getting divorce... | #Person1# is having a separation for 2 months,... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | #Person1#: #Person1##, Masha and Hero are havi... | #Person1# is having a separation for 2 months,... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are having a separation for 2 m... | #Person1# is having a separation for 2 months,... |
| 9 | #Person1# and Brian are at the birthday party ... | You are always popular with everyone, and you ... | #Person1# is always popular with everyone. #Pe... |


In [None]:
dataset

In [None]:
dialogues = dataset['test'][:]['dialogue']
human_baseline_summaries = dataset['test'][:]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []


# record start time
start = time.time()


for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").to(device).input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)
    
#     print(dash_line)
#     print(f'HUMAN SUMMARY:\n{human_baseline_summaries}')
#     print(dash_line)
#     print(f'ORIGINAL MODEL:\n{original_model_text_output}')
#     print(dash_line)
#     print(f'PEFT MODEL: {peft_model_text_output}')
#     print(dash_line)
#     print(dash_line)
    if idx % 100 == 0:
        print(idx)
    else:
        print(idx, end=", ")
        
# record end time
end = time.time()

# print the elapsed inference time in seconds
print("Model Inference time is:",
      (end-start), "s")

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

Token indices sequence length is longer than the specified maximum sequence length for this model (1028 > 512). Running this sequence through the model will result in indexing errors

## <b>7.2 <span style='color:#78D118'>|</span> Evaluate the Model Quantitatively (ROUGE Metric)</b>

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) is a valuable tool for assessing the quality of summaries generated by models. It evaluates these summaries by comparing them to a "baseline" summary, typically crafted by a human. Although not flawless, the ROUGE metric provides insights into the improvement in the overall effectiveness of summarization achieved through fine-tuning.
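To build intuition for what ROUGE-1 measures, here is a simplified sketch that computes unigram-overlap precision, recall, and F1 between a prediction and a reference. It is only an approximation of the metric used in this notebook: it lowercases and splits on whitespace, with no stemming and none of the tokenization rules of the `rouge_score` package, and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word (clipped overlap).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
score = rouge1_f1(prediction, reference)
print(f"simplified ROUGE-1 F1: {score:.4f}")
```

Five of the six words match in this example, so precision, recall, and F1 are all 5/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.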

In [37]:
human_baseline_summaries = df['human_baseline_summaries'].values
original_model_summaries = df['original_model_summaries'].values
peft_model_summaries     = df['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print(dash_line)
print('ORIGINAL MODEL:')
print(original_model_results)
print(dash_line)
print('PEFT MODEL:')
print(peft_model_results)
print(dash_line)

---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
{'rouge1': 0.22736904083260462, 'rouge2': 0.05697901366251684, 'rougeL': 0.18679210189076217, 'rougeLsum': 0.18687460055156696}
---------------------------------------------------------------------------------------------------
PEFT MODEL:
{'rouge1': 0.19934745161507816, 'rouge2': 0.057275918136077124, 'rougeL': 0.17116570635572304, 'rougeLsum': 0.17125606161128704}
---------------------------------------------------------------------------------------------------


In [38]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: -2.80%
rouge2: 0.03%
rougeL: -1.56%
rougeLsum: -1.56%
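The figures above are absolute differences in ROUGE score, expressed in percentage points. As a complementary view, the sketch below recomputes them from the scores printed in this notebook and also shows the change relative to the original model's score. The `changes` dictionary is a hypothetical helper structure introduced only for this example.

```python
# Absolute (percentage-point) vs. relative change, using the ROUGE scores printed above.
original = {
    "rouge1": 0.22736904083260462,
    "rouge2": 0.05697901366251684,
    "rougeL": 0.18679210189076217,
    "rougeLsum": 0.18687460055156696,
}
peft = {
    "rouge1": 0.19934745161507816,
    "rouge2": 0.057275918136077124,
    "rougeL": 0.17116570635572304,
    "rougeLsum": 0.17125606161128704,
}

changes = {}
for key in original:
    delta = peft[key] - original[key]
    changes[key] = {
        "absolute_points": delta * 100,               # difference in percentage points
        "relative_percent": delta / original[key] * 100,  # change relative to the original score
    }
    print(f"{key}: {changes[key]['absolute_points']:+.2f} points, "
          f"{changes[key]['relative_percent']:+.2f}% relative")
```

The two views can differ considerably: a drop of a few points on a score near 0.2 is a much larger change in relative terms.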


## References

The creation of this document was greatly influenced by the following key sources of information:

1. [DialogSum Dataset](https://huggingface.co/datasets/knkarthick/dialogsum) - A large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding manually labeled summaries and topics.
2. [Generative AI with Large Language Models | Coursera](https://www.coursera.org/learn/generative-ai-with-llms) - An informative guide that provides in-depth explanations and examples on various LLMs.