# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly lower performance metrics.

Source:
- Original link: https://www.kaggle.com/code/paultimothymooney/fine-tune-flan-t5-with-peft-lora-deeplearning-ai
- https://www.coursera.org/learn/generative-ai-with-llms
- https://www.coursera.org/learn/generative-ai-with-llms/gradedLti/x0gc1/lab-2-fine-tune-a-generative-ai-model-for-dialogue-summarization
- https://creativecommons.org/licenses/by-sa/2.0/legalcode


# Table of Contents
- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [2]:
# Now install the required packages for the LLM and datasets.

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

!pip install awscli 

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.2
Note: you may need to restart the kernel to use updated packages.
Collecting awscli
  Downloading awscli-1.32.23-py3-none-any.whl.metadata (11 kB)
Collecting botocore==1.34.23 (from awscli)
  Downloading botocore-1.34.23-py3-none-any.whl.metadata (5.6 kB)
Collecting docutils<0.17,>=0.10 (from awscli)
  Downloading docutils-0.16-py2.py3-none-any.whl (548 kB)

In [3]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics. 

In [4]:
# Load the dataset used for supervised fine-tuning (SFT)
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
dataset

# Print the first two training examples
for index, row in enumerate(dataset['train']):
   if index < 2:
       print(row)
   else:
       break

{'id': 'train_0', 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.", 'summary': "Mr. Smith's 

* #Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n
* #Person2#: I found it would be a good idea to get a check-up.\n
* #Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n
* #Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n
* #Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n
* #Person2#: Ok.\n
* #Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n
* #Person2#: Yes.\n
* #Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n
* #Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n
* #Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n
* #Person2#: Ok, thanks doctor.",
* 'summary': "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.",
* 'topic': 'get a check-up'}

In [5]:
# Load the pretrained FLAN-T5 model and its tokenizer
# model_name='google/flan-t5-base'
model_name='/kaggle/input/flan-t5/pytorch/base/4'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16) # load the model from the given name/path
tokenizer = AutoTokenizer.from_pretrained(model_name)  # load the matching tokenizer
print('Model Loaded')

# Note: you are using the base version of FLAN-T5. Setting torch_dtype=torch.bfloat16
#       specifies the data type (and therefore the memory footprint) of the model weights.

Model Loaded


In [6]:
# Helper: count the model's parameters and how many of them are trainable

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()  # all parameters of the model
        if param.requires_grad:
            trainable_model_params += param.numel() # trainable parameters (requires_grad=True means trainable)
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with the zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text which indicates the model can be fine-tuned to the task at hand.

In [7]:
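# Tokenizer test
prompt = f"""I HAVE A CAR"""
inputs = tokenizer(prompt, return_tensors='pt') 
inputs 

# Output: {'input_ids': tensor([[   27, 21490,    71,   205,  4280,     1]]), 
#          'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
# Question: does each word map to exactly one value in input_ids? (Testing shows it does not.)
# The tokenizer works on subword pieces, so a word can be split into several tokens,
# and an end-of-sequence token (ID 1) is appended: 4 words become 6 IDs here.

# input_ids: a PyTorch tensor holding the token IDs of the prompt.
# Token IDs are the integer identifiers the model was trained with; each unique token has one ID.
# Here input_ids is tensor([[27, 21490, 71, 205, 4280, 1]]): the prompt has been converted
# into a sequence of integers, each one the vocabulary ID of a token.

# attention_mask: a PyTorch tensor that marks which positions in input_ids are valid
# (that is, whether the model should attend to that token).
# Here attention_mask is tensor([[1, 1, 1, 1, 1, 1]]), so the model attends to every input token.

# attention_mask separates real tokens from padding tokens. Sequences in a batch can have
# different lengths, so padding tokens are added to align them. The mask tells the model which
# positions are real and which are padding, so padded positions are ignored when attention is computed.

# input_ids and attention_mask have the same shape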

{'input_ids': tensor([[   27, 21490,    71,   205,  4280,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
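The relationship between padding and `attention_mask` described in the cell above can be sketched without a tokenizer. This is a minimal illustration in plain Python, not the Hugging Face implementation: `pad_batch` is a hypothetical helper, the pad ID of 0 is assumed to match the T5 tokenizer, and the second sequence is made up for the example.

```python
# Toy illustration of padding and attention masks (not the Hugging Face implementation).
PAD_ID = 0  # assumption: pad token ID 0, as in T5 tokenizers


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


batch = [
    [27, 21490, 71, 205, 4280, 1],  # "I HAVE A CAR", IDs taken from the cell above
    [27, 1],                         # a shorter, made-up sequence
]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Both returned lists have the same shape, and the zeros in the mask line up exactly with the padded positions, which is the property the tokenizer output relies on.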

In [8]:
index = 200

dialogue = dataset['test'][index]['dialogue']  # text to be summarized
summary = dataset['test'][index]['summary']    # human-written reference summary

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt') # return_tensors='pt': return PyTorch tensors
# This converts the prompt text into the format the model accepts

output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation` and to the start of the summary with `Summary` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary: 
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [73]:
# Data preprocessing
# Build the input_ids for each prompt-response pair
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.


tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

print(dataset.shape)
# {'train': (12460, 4), 'test': (1500, 4), 'validation': (500, 4)}
print(tokenized_datasets.shape)
#{'train': (12460, 2), 'test': (1500, 2), 'validation': (500, 2)}

{'train': (12460, 4), 'test': (1500, 4), 'validation': (500, 4)}
{'train': (12460, 2), 'test': (1500, 2), 'validation': (500, 2)}


To save some time in the lab, you will subsample the dataset:

In [74]:
# Subsample the data
# To save some time in the lab, you will subsample the dataset:

print('Before sampling:',tokenized_datasets.shape)
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 1 == 0, with_indices=True)
                    # Note: index % 1 == 0 is true for every index, so all examples are kept here.
                    # Use index % 100 == 0 to keep only every 100th example.


# Check the sizes of the train, validation and test splits after sampling
# Check the shapes of all three parts of the dataset:
print('After sampling:')
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Before sampling: {'train': (12460, 2), 'test': (1500, 2), 'validation': (500, 2)}

After sampling:
Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
})


<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [11]:

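import os
import time
from transformers import TrainingArguments, Trainer
 
# Set an environment variable to disable Weights & Biases tracking
os.environ["WANDB_DISABLED"] = "true"
 
# Create an output directory that includes the current timestamp so each training run gets its own location
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'
 
# Define the training arguments
training_args = TrainingArguments(
   output_dir=output_dir,  # output directory
   learning_rate=1e-5,      # learning rate
   num_train_epochs=1,      # number of epochs (overridden by max_steps below)
   weight_decay=0.01,       # weight decay
   logging_steps=1,         # log every step
   max_steps=1,             # stop after a single training step
   report_to=None           # None falls back to the default integrations; pass "none" to turn reporting off
)
 
# Initialize the Trainer
trainer = Trainer(
   model=original_model,  # pretrained model to fine-tune
   args=training_args,    # training arguments
   train_dataset=tokenized_datasets['train'],  # training set
   eval_dataset=tokenized_datasets['validation']  # validation set
)
 
# Note: this cell assumes 'original_model' and 'tokenized_datasets' are already defined.
# 'tokenized_datasets' is a dictionary-like object holding the tokenized and encoded splits.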

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Start training process...



In [12]:
# Start training!
# Remember to switch to a GPU runtime first
trainer.train()



Step,Training Loss
1,48.5


TrainOutput(global_step=1, training_loss=48.5, metrics={'train_runtime': 1.6936, 'train_samples_per_second': 4.724, 'train_steps_per_second': 0.59, 'total_flos': 5478058819584.0, 'train_loss': 48.5, 'epoch': 0.06})

* Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. This fully fine-tuned model will also be referred to as the **instruct model** in this lab.

In [13]:
!aws s3 cp --recursive s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/ ./flan-dialogue-summary-checkpoint/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
fatal error: Unable to locate credentials


The size of the downloaded instruct model is approximately 1GB.

In [14]:
!ls -alh /kaggle/working/flan-dialogue-summary-checkpoint/pytorch_model.bin

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ls: cannot access '/kaggle/working/flan-dialogue-summary-checkpoint/pytorch_model.bin': No such file or directory


Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [15]:
# Define the "original model"
original_model = original_model.to('cpu')

In [16]:
# Define the "fully fine-tuned model"
# instruct_model: fully fine-tuned model
#instruct_model = AutoModelForSeq2SeqLM.from_pretrained("/kaggle/input/generative-ai-with-llms-lab-2/lab_2/flan-dialogue-summary-checkpoint/", 
#                                                       torch_dtype=torch.bfloat16).to('cpu')

# The code above raised an error and the fully fine-tuned model could not be loaded,
# so the pre-fine-tuning model is used as a stand-in for now
instruct_model = original_model

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [17]:
index = 200  # pick one example from the test set (a fixed index, not a random draw)
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# Feed the input to the model, which generates output token IDs
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
# Decode the generated token IDs back into natural language

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1: Have you considered upgrading your system? #Person2: Yes, I'm not sure what exactly I would need. #Person1: You could consider adding a painting program to your software. #Person2: You might want to add a CD-ROM drive. #Person1: No, I'm not sure. #Person2: I'm not sure what I would need. #Person2: I'd probably need a more powerful processor, to begin with. #Person1: You might want to upgrade your hardware because it's pretty outdated now. #Person2: You could also consider adding a CD-ROM drive. #Person2: I'm not sure what I would need.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Per

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
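To make the comparison against a reference summary concrete, here is a minimal sketch of an n-gram overlap score in plain Python. It is an illustration under simplifying assumptions (lowercasing and whitespace tokenization, no stemming), not the implementation behind `evaluate.load('rouge')`; `rouge_n` is a hypothetical helper and the two sentences are made up for the example.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) computed from clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()  # assumption: simple whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # n-grams shared by both, with clipped counts
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "masha and hero are getting divorced"
prediction = "masha and hero are divorced"

uni = rouge_n(prediction, reference, n=1)  # unigram overlap
bi = rouge_n(prediction, reference, n=2)   # bigram overlap
print(uni)
print(bi)
```

In this example every predicted word appears in the reference, so unigram precision is 1.0 while recall is 5/6. The bigram score is lower because dropping "getting" breaks two of the reference bigrams.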

In [18]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [19]:
dialogues = dataset['test'][0:10]['dialogue']  # to save time, evaluate on the first 10 test examples only
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,"#Person1#: Ms. Dawson, this memo is for the af...",#Person2#: I need to take a dictation for you....
1,In order to prevent employees from wasting tim...,This memo is for all employees.,The Office of the President and CEO has issued...
2,Ms. Dawson takes a dictation for #Person1# abo...,#Person1: Please type in a memo to all employe...,Employees are required to use instant messaging.
3,#Person2# arrives late because of traffic jam....,#Person1: I'm here! #Person2: I'm finally here...,The driver of the car is stuck in traffic.
4,#Person2# decides to follow #Person1#'s sugges...,#Person1#: You're finally here! #Person2#: I'm...,The Carrefour office is closed at the end of t...
5,#Person2# complains to #Person1# about the tra...,#Person1: You're here! #Person2: I'm glad you'...,#Person1: I'm a traffic expert. #Person2: I'm ...
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are getting divorced.,#Person1: Masha and Hero are getting divorced.
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting divorced.,#Person1#: Masha and Hero are getting divorced...
8,#Person1# and Kate talk about the divorce betw...,Masha and Hero are divorced.,#Person1: Masha and Hero are getting divorced....
9,#Person1# and Brian are at the birthday party ...,Brian's birthday party is going to be a great ...,People are celebrating Brian's birthday.


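Evaluate the models by computing ROUGE metrics. With the fully fine-tuned checkpoint you would expect an improvement here; in this run `instruct_model` is only a stand-in for the original model, so the scores below do not show one.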

In [20]:
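# rouge is the ROUGE metric loaded above; it scores the quality of generated summaries.
# original_model_summaries is the list of summaries generated by the original model.
# human_baseline_summaries is the list of human-written reference summaries.

# Model before fine-tuning
original_model_results = rouge.compute(
   predictions=original_model_summaries,  # predicted summaries
   references=human_baseline_summaries[0:len(original_model_summaries)],  # reference summaries, same length as the predictions
   use_aggregator=True,  # aggregate the per-example scores into one score per metric
   use_stemmer=True  # stem words before matching
)
 
# Model after fine-tuning
# instruct_model_summaries is the list of summaries generated by the instruct (fully fine-tuned) model.
instruct_model_results = rouge.compute(
   predictions=instruct_model_summaries,  # predicted summaries
   references=human_baseline_summaries[0:len(instruct_model_summaries)],  # reference summaries, same length as the predictions
   use_aggregator=True,  # aggregate the per-example scores into one score per metric
   use_stemmer=True  # stem words before matching
)
 
# Print the evaluation results for the original model and the instruct model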
print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2625896510948609, 'rouge2': 0.09789459936773273, 'rougeL': 0.2202812192302071, 'rougeLsum': 0.22361512426226335}
INSTRUCT MODEL:
{'rouge1': 0.23379888309748356, 'rouge2': 0.06717851592851594, 'rougeL': 0.20473865264656105, 'rougeLsum': 0.20677207847136939}


- The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [38]:
results

Unnamed: 0.1,Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,0,Ms. Dawson helps #Person1# to write a memo to ...,The memo is to be distributed to all employees...,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks Ms. Dawson to take a dictation ...
1,1,In order to prevent employees from wasting tim...,The memo is to be distributed to all employees...,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks Ms. Dawson to take a dictation ...
2,2,Ms. Dawson takes a dictation for #Person1# abo...,The memo is to be distributed to all employees...,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks Ms. Dawson to take a dictation ...
3,3,#Person2# arrives late because of traffic jam....,The traffic jam at the Carrefour intersection ...,#Person2# got stuck in traffic again. #Person1...,#Person2# got stuck in traffic and got stuck i...
4,4,#Person2# decides to follow #Person1#'s sugges...,The traffic jam at the Carrefour intersection ...,#Person2# got stuck in traffic again. #Person1...,#Person2# got stuck in traffic and got stuck i...
...,...,...,...,...,...
1495,1495,Matthew and Steve meet after a long time. Stev...,#Person1#: Hi! #Person2#: Hi! How are you? #Pe...,Steve hasn't seen Matthew for a year. He's bee...,Matthew and Steve are looking for a place to l...
1496,1496,Steve has been looking for a place to live. Ma...,#Person1#: Hi! #Person2#: Hi! How are you? #Pe...,Steve hasn't seen Matthew for a year. He's bee...,Matthew and Steve are looking for a place to l...
1497,1497,Frank invites Besty to the party to celebrate ...,Person1 is going to throw a party for all of h...,Frank invites Betsy to his promotion party on ...,Frank invites Betsy to a party for all of his ...
1498,1498,Frank invites Betsy to the big promotion party...,Person1 is going to throw a party for all of h...,Frank invites Betsy to his promotion party on ...,Frank invites Betsy to a party for all of his ...


In [21]:
# Compute the evaluation scores on the full set of predictions
# The results in this file come from inference with the fully fine-tuned model
results = pd.read_csv("/kaggle/input/dialogue-summary-training-results/dialogue-summary-training-results.csv")
#results = pd.read_csv("/kaggle/input/generative-ai-with-llms-lab-2/lab_2/data/dialogue-summary-training-results.csv")
# The DataFrame holds an index, the human summaries, and the summaries from the original model,
# the fully fine-tuned model and the PEFT model:
# human_baseline_summaries	original_model_summaries	instruct_model_summaries	peft_model_summaries


human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

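# Metric: ROUGE scores
# ROUGE is a metric for evaluating the quality of text summaries (and other NLP outputs).
# Unlike BLEU, ROUGE focuses on whether the generated summary captures the information in the
# reference summary, emphasizing coverage and completeness of the reference content.
# ROUGE counts n-gram co-occurrences between the generated and reference summaries and is recall-oriented.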

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}


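#### ROUGE is a widely used family of metrics for evaluating automatically generated summaries. The main variants are:
* ROUGE-1: the most commonly reported variant. It measures unigram (single-word) overlap between the generated summary and the reference summary.
* ROUGE-2: measures bigram overlap, that is, matching pairs of consecutive words. It rewards summaries that reproduce longer phrases from the reference.
* ROUGE-L: based on the longest common subsequence (LCS) between the generated summary and the reference. It rewards words that appear in the same order without requiring them to be adjacent; it is not a weighted combination of ROUGE-1 and ROUGE-2.
* ROUGE-Lsum: the summary-level variant of ROUGE-L. The text is split into sentences (on newlines), the LCS is computed sentence by sentence, and the results are combined into one score.
* In general, a higher ROUGE score means the generated summary is closer to the human-written summary. Each variant looks at a different aspect of the match, so they are usually reported together to give a fuller picture of model performance.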
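The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines of plain Python. This is an illustrative simplification (whitespace tokenization, no stemming, a plain F1 over LCS precision and recall), not the scorer used by the `evaluate` library; `lcs_length` and `rouge_l` are hypothetical helpers and the sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(prediction, reference):
    """F1 score built from LCS-based precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
in_order = "the cat on the mat"   # same words, same order, one word missing
shuffled = "mat the on cat the"   # same words as in_order, order scrambled

print(rouge_l(in_order, reference))
print(rouge_l(shuffled, reference))
```

Both candidates contain exactly the same words, so a unigram score cannot tell them apart. The LCS-based score drops for the shuffled one because only three of its words can be matched in reference order.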

#### The results show substantial improvement in all ROUGE metrics:

In [39]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%
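The figures above are absolute differences in ROUGE score, expressed in percentage points. As a small worked example, the sketch below recomputes them from the scores printed earlier for the full results file and also shows the relative improvement, which is a different and much larger number. The dictionary names `original` and `instruct` are chosen for this illustration only.

```python
# Scores copied from the full-results evaluation output earlier in this section.
original = {'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573,
            'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
instruct = {'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792,
            'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}

absolute_gain = {}
relative_gain = {}
for key in original:
    absolute_gain[key] = instruct[key] - original[key]        # difference in score
    relative_gain[key] = absolute_gain[key] / original[key]   # difference relative to the baseline
    print(f"{key}: +{absolute_gain[key] * 100:.2f} points absolute, "
          f"+{relative_gain[key] * 100:.1f}% relative")
```

When reporting results, it helps to say which of the two is meant: a gain of about 19 points on ROUGE-1 corresponds to a relative gain of roughly 80% over the original model.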


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** instead of the full fine-tuning you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
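The relationship between the frozen weights and the adapter can be sketched with a toy linear layer. This is an illustration under stated assumptions, not the `peft` implementation: the sizes are tiny, dropout is omitted, the helper names are invented for the example, and plain Python lists are used to keep it dependency-free. It shows that keeping the adapter separate and merging it into the weight matrix give the same output.

```python
import random

random.seed(0)

d, r, alpha = 6, 2, 2   # toy sizes; the notebook uses r=32 on much larger layers
scale = alpha / r       # LoRA scales the low-rank update by alpha / r


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


W = rand_matrix(d, d)   # frozen base weight, never updated
A = rand_matrix(r, d)   # trainable, shape r x d
B = rand_matrix(d, r)   # trainable, shape d x r (real LoRA starts B at zero)
x = [random.uniform(-1, 1) for _ in range(d)]

# Path 1: adapter kept separate. Base output plus the scaled low-rank correction.
base_out = matvec(W, x)
lora_out = matvec(B, matvec(A, x))
separate = [b + scale * l for b, l in zip(base_out, lora_out)]

# Path 2: adapter merged into the weight, W' = W + scale * (B @ A).
BA = matmul(B, A)
W_merged = [[w + scale * ba for w, ba in zip(w_row, ba_row)]
            for w_row, ba_row in zip(W, BA)]
merged = matvec(W_merged, x)

max_diff = max(abs(s - m) for s, m in zip(separate, merged))
print(f"max difference between the two paths: {max_diff:.2e}")
print(f"adapter values: {2 * d * r}, full weight values: {d * d}")
```

Because `W` itself is never modified, several `(A, B)` pairs trained for different tasks can share one copy of the base weights, which is the memory saving described above.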

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [95]:
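from peft import LoraConfig, get_peft_model, TaskType

# LoraConfig holds the LoRA settings and get_peft_model wraps a base model with the adapter.
# TaskType is an enum, used here to select the sequence-to-sequence language-modeling task.

# Step 1: create the LoRA configuration
lora_config = LoraConfig(
   r=32,  # rank of the low-rank decomposition, i.e. the rank/dimension of the adapter to be trained
   lora_alpha=32,  # scaling factor: the adapter update is scaled by lora_alpha / r
   target_modules=["q", "v"],  # modules to adapt: the attention query (q) and value (v) projections
   lora_dropout=0.05,  # dropout rate applied inside the LoRA layers
   bias="none",  # bias handling; "none" means no bias parameters are trained
   task_type=TaskType.SEQ_2_SEQ_LM  # sequence-to-sequence language modeling (FLAN-T5)
)

# Note: FLAN-T5 is a T5 (Text-to-Text Transfer Transformer) model that was instruction
# fine-tuned with the FLAN (Finetuned Language Net) recipe. T5 is an encoder-decoder
# architecture, which is why the task type is TaskType.SEQ_2_SEQ_LM.
# Once the LoRA parameters are configured, get_peft_model builds the LoRA-adapted model from them.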

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [96]:
# Step 2: wrap the model that is going to be fine-tuned
# Add LoRA adapter layers/parameters to the original LLM to be trained.
peft_model = get_peft_model(original_model,   # model before fine-tuning
                            lora_config)      # LoRA configuration
print(print_number_of_trainable_model_parameters(peft_model))
# The trainable parameters are exactly the parameters of the newly added LoRA layers

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
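The trainable-parameter count above can be reproduced by hand. The sketch below assumes the dimensions of the published `flan-t5-base` configuration (hidden size 768, 12 heads of size 64, 12 encoder and 12 decoder layers); these values are not printed anywhere in this notebook, so treat them as assumptions. The base parameter count is the one this notebook reports for the model without an adapter.

```python
# Assumed flan-t5-base dimensions (from its published config, not from this notebook).
d_model = 768              # hidden size, input width of the q and v projections
inner_dim = 12 * 64        # num_heads * d_kv, output width of the q and v projections
encoder_layers = 12
decoder_layers = 12

r = 32                     # LoRA rank from lora_config
targets_per_attention = 2  # target_modules=["q", "v"]

# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_modules = attention_blocks * targets_per_attention

# Each adapted module adds A (r x d_model) and B (inner_dim x r).
params_per_module = r * d_model + inner_dim * r
trainable = adapted_modules * params_per_module

base = 247_577_856         # parameters of the base model, as reported in this notebook
total = base + trainable

print(f"trainable model parameters: {trainable}")
print(f"all model parameters: {total}")
print(f"percentage of trainable model parameters: {100 * trainable / total:.2f}%")
```

Under these assumptions the arithmetic gives 72 adapted projections with 49,152 new parameters each, which matches the 3,538,944 trainable parameters printed by the cell.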


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [103]:
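import time

output_dir = f'/kaggle/working/peft-dialogue-summary-training-{str(int(time.time()))}'

# PEFT training arguments
peft_training_args = TrainingArguments(
  output_dir=output_dir,  # output directory, made unique with the current timestamp
  auto_find_batch_size=True,  # automatically find a batch size that fits in memory
  learning_rate=0.001,  # 1e-3, higher than the learning rate used for full fine-tuning
  num_train_epochs=10,  # requested epochs; has no effect here because max_steps is set
  logging_steps=1,  # log at every step
  max_steps=1  # stop after a single training step (overrides num_train_epochs)
)

peft_trainer = Trainer(
  model=peft_model,  # the PEFT-wrapped model
  args=peft_training_args,  # training arguments
  train_dataset=tokenized_datasets["train"],  # training dataset
)

tokenized_datasets["train"].shape

# Note: no eval_dataset is passed. To evaluate during training, add one,
# for example eval_dataset=tokenized_datasets["validation"]

# With the arguments configured, the Trainer instance is ready to run training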

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


(12460, 2)
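It helps to see how little of the data one step covers. The following back-of-the-envelope sketch assumes a per-device batch size of 8 (the `TrainingArguments` default, which `auto_find_batch_size` may lower if memory runs out), a single device, and no gradient accumulation; none of these are stated in the notebook.

```python
import math

num_examples = 12460      # rows reported by tokenized_datasets["train"].shape
batch_size = 8            # assumption: default per-device batch size, one device
num_train_epochs = 10     # value passed to TrainingArguments
max_steps = 1             # value passed to TrainingArguments

steps_per_epoch = math.ceil(num_examples / batch_size)
steps_from_epochs = steps_per_epoch * num_train_epochs

# A positive max_steps takes precedence over num_train_epochs.
steps_run = max_steps if max_steps > 0 else steps_from_epochs
examples_seen = min(steps_run * batch_size, num_examples * num_train_epochs)

print(f"steps per epoch: {steps_per_epoch}")
print(f"steps for {num_train_epochs} epochs: {steps_from_epochs}")
print(f"steps actually run: {steps_run} ({examples_seen} examples)")
```

With these assumed values, ten full epochs would take 15,580 optimizer steps, whereas the configuration above stops after one.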

Now everything is ready to train the PEFT adapter and save the model.



In [104]:
import time
 
# start time
start_time = time.time()
 
    
# Start training
peft_trainer.train()

peft_model_path="/kaggle/working/peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)


# end time
end_time = time.time()
training_duration = end_time - start_time
print(f"Training duration: {training_duration} seconds")

# 0.001   Training Loss=48.5 (1 epoch)
# 0.0001  Training Loss=48.5

Step,Training Loss
1,33.0


Training duration: 2.362297773361206 seconds




That training ran for only a single step (`max_steps=1`), so it covered just a small subset of the data. To load a fully trained PEFT model, read a checkpoint of a PEFT model from S3.

In [27]:
!aws s3 cp --recursive s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/ /kaggle/working/peft-dialogue-summary-checkpoint-from-s3/ 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
fatal error: Unable to locate credentials


Check that the size of this model is much less than the original LLM:

In [28]:
!ls -al /kaggle/working/peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ls: cannot access '/kaggle/working/peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin': No such file or directory
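Since the checkpoint file is not available in this run, the size comparison can still be estimated from the parameter counts this notebook reports. The sketch assumes 4 bytes per parameter (float32 storage); the actual file sizes depend on the dtype used when saving and on file-format overhead.

```python
trainable_params = 3_538_944   # LoRA adapter parameters reported in this notebook
base_params = 247_577_856      # base FLAN-T5 parameters reported in this notebook
bytes_per_param = 4            # assumption: float32 storage

adapter_mb = trainable_params * bytes_per_param / 1e6
base_mb = base_params * bytes_per_param / 1e6
ratio_pct = 100 * adapter_mb / base_mb

print(f"adapter: ~{adapter_mb:.1f} MB")
print(f"base model: ~{base_mb:.1f} MB")
print(f"adapter / base: {ratio_pct:.2f}%")
```

Under that assumption the adapter is roughly 14 MB against roughly 990 MB for the base model, in line with the "single-digit %" figure mentioned earlier.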


Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [29]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("/kaggle/input/flan-t5/pytorch/base/4", torch_dtype=torch.bfloat16)



#tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/flan-t5/pytorch/base/4")

#peft_model = PeftModel.from_pretrained(peft_model_base, 
#                                       '/kaggle/input/generative-ai-with-llms-lab-2/lab_2/peft-dialogue-summary-checkpoint-from-s3', 
#                                       torch_dtype=torch.bfloat16,
#                                      is_trainable=False
#                                      )
peft_model = peft_model_base
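# Changed because of an error: the adapter load above is commented out, so peft_model is just the base model here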

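With the adapter loaded and `is_trainable=False`, the number of trainable parameters would be `0`. In this run the adapter load is commented out and `peft_model` is the plain base model, so the cell below reports every parameter as trainable: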

In [30]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), using the original model, the fully fine-tuned model, and the PEFT model.

In [31]:
peft_model = peft_model.to('cpu')
instruct_model = instruct_model.to('cpu')
original_model = original_model.to('cpu')

In [32]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL:\n{peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm not sure what I'm talking about. #Person2#: I'm not sure what I'm talking about. #Person1#: I'm not sure what I'm talking about. #Person2#: I'm not sure what I'm talking about. #Person1#: I'm not sure what I'm talking about. #Person2#: I'm not sure what I'm talking about. #Person1#: I'm not sure what I'm talking about. #Person2#: I'm not sure what I'm talking about. #Person1#: I'm not sure what I'm talking about. #Person1#: I'm not sure what I'm talking about.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1#: Have you considered upgrading your computer? #Person1#: Yes, but 

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [33]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Ms. Dawson will be out for an intra-office mem... | This memo is for employees only. | #Person1#: I need to take a dictation for you. |
| 1 | In order to prevent employees from wasting tim... | Employees will be required to use instant mess... | #Person1#: Ms. Dawson, please type this memo i... | #Person1#: I need to take a dictation for you. |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | The following memo is to be sent out to all em... | This is an intra-office memo. | #Person1#: I need to take a dictation for you. |
| 3 | #Person2# arrives late because of traffic jam.... | #Person1: You're here! | The conversation is about the future of public... | The traffic jam at the Carrefour intersection ... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | People are talking about the traffic jams in t... | #Person1: I'm going to the office to work. #Pe... | The traffic jam at the Carrefour intersection ... |
| 5 | #Person2# complains to #Person1# about the tra... | #P Person1#: I'm sorry to hear about your car ... | Taking public transport to work is a good option. | The traffic jam at the Carrefour intersection ... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are divorced. | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. | People are talking about Masha and Hero. | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Happy Birthday, Brian. #Person2#: Y... | #Person1#: Happy Birthday, Brian. | #Person1#: Happy birthday, Brian. #Person2#: I... |


In [34]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.21688642011010434, 'rouge2': 0.07751515151515151, 'rougeL': 0.18549546944283785, 'rougeLsum': 0.1884084141978879}
INSTRUCT MODEL:
{'rouge1': 0.28422212707926997, 'rouge2': 0.09696904830586307, 'rougeL': 0.2237588601874316, 'rougeLsum': 0.22708841423127135}
PEFT MODEL:
{'rouge1': 0.24089921652421653, 'rouge2': 0.11769053708439897, 'rougeL': 0.22001958689458687, 'rougeLsum': 0.22134175465057818}


Notice that the PEFT model results are not too bad, while the training process was much easier!

You already computed the ROUGE scores on the full dataset, after loading the results from the `data/dialogue-summary-training-results.csv` file. Load the values for the PEFT model now and compare its performance with the other models.

In [35]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}
PEFT MODEL:
{'rouge1': 0.40810631575616746, 'rouge2': 0.1633255794568712, 'rougeL': 0.32507074586565354, 'rougeLsum': 0.3248950182867091}


The results show a smaller improvement than full fine-tuning achieved, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [36]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

# Difference between the two sets of ROUGE scores, reported in percentage points
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%


Now calculate the improvement of PEFT over a full fine-tuned model:

In [37]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.70%
rougeL: -1.34%
rougeLsum: -1.35%


Here you see a small decrease, between 1 and 2 percentage points, in the ROUGE metrics compared with the fully fine-tuned model. However, the training requires far less compute and memory (often just a single GPU).
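Another way to read these numbers is to ask how much of the full fine-tuning gain PEFT retains. The sketch below copies the full-dataset scores printed earlier in this section; the "retained" ratio is an extra illustration added here and is not part of the original lab.

```python
# Full-dataset ROUGE scores as printed earlier in this section.
original = {'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573,
            'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
instruct = {'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792,
            'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}
peft = {'rouge1': 0.40810631575616746, 'rouge2': 0.1633255794568712,
        'rougeL': 0.32507074586565354, 'rougeLsum': 0.3248950182867091}

retained = {}
for key in original:
    gain_full = instruct[key] - original[key]   # gain from full fine-tuning
    gain_peft = peft[key] - original[key]       # gain from PEFT
    retained[key] = gain_peft / gain_full       # share of the full gain kept by PEFT
    print(f"{key}: full +{gain_full * 100:.2f} pts, "
          f"PEFT +{gain_peft * 100:.2f} pts, retained {retained[key]:.0%}")
```

On these scores PEFT keeps roughly 84% to 93% of the improvement that full fine-tuning delivers, depending on the metric.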
