# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

## Set up Kernel, Load Required Dependencies, Dataset and LLM

### Set up Kernel and Required Dependencies

In [1]:
# %pip install \
#     torch \
#     torchdata \
#     transformers \
#     datasets \
#     evaluate \
#     rouge_score \
#     loralib \
#     peft \
#     trl \
#     awscli --quiet

Loading necessary packages

In [2]:
import time
import random
import torch
import evaluate
import pandas as pd
import numpy as np

from datasets import \
    load_dataset, \
    Dataset

from transformers import \
        AutoModelForSeq2SeqLM, \
        AutoTokenizer, \
        GenerationConfig, \
        TrainingArguments, \
        Trainer, \
        AutoModelForCausalLM

from trl import \
    SFTConfig, \
    SFTTrainer

from peft import \
    LoraConfig, \
    TaskType, \
    PeftModel

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

In [3]:
torch.cuda.is_available()

True

### Setup GPU ID

In [10]:
## Specify which GPU to use (0, 1, or 2)
# gpu_id = 0
# device = torch.device(f'cuda:{gpu_id}' if torch.cuda.is_available() else 'cpu')

# force the code to only use CPU
# torch.cuda.is_available = lambda : False
device = torch.device(f'cuda' if torch.cuda.is_available() else 'cpu')

# Move the model to the specified device
# original_model = original_model.to(device)

In [11]:
#version check
print(torch.__version__, device)

2.2.2+cu121 cpu


### Load Dataset from Huggingface

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics. 

In [12]:
huggingface_dataset_name = "knkarthick/dialogsum"

# dataset = load_dataset(huggingface_dataset_name, split='train')
dataset = load_dataset(huggingface_dataset_name)

# show summary
display(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type in which the model weights are loaded, which roughly halves their memory footprint compared with 32-bit floats.
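To make the effect of the data type concrete, here is a back-of-the-envelope sketch of weight memory only. It assumes the 247,577,856-parameter count that the parameter-counting cell in this notebook prints for FLAN-T5 base, and it ignores activations, gradients, and optimizer state. The helper name `weight_memory_bytes` is hypothetical.

```python
# Rough weight-only memory estimate (assumption: every parameter is stored
# in the same dtype; activations, gradients and optimizer state are ignored).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Parameter count as printed by the counting cell in this notebook.
NUM_PARAMS_FLAN_T5_BASE = 247_577_856


def weight_memory_bytes(num_params: int, dtype: str) -> int:
    """Return the number of bytes needed to hold the weights in `dtype`."""
    return num_params * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    size_mib = weight_memory_bytes(NUM_PARAMS_FLAN_T5_BASE, dtype) / 2**20
    print(f"{dtype}: {size_mib:.1f} MiB")
```

The 32-bit figure is close to 1 GB, which is in line with the checkpoint size quoted later for the fully fine-tuned model.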

### Load Dataset from Local Drive

In [13]:
def load_csv_dataset(file_path, train_size=0.95, random_state=42):
    df = pd.read_csv(file_path)
    
    df['text'] = df.apply(lambda row: f"Translate the following natural language question to First Order Logic (FOL). Please respond with only the FOL statement. Don't include additional text.\nQuestion: {row['Question']}\nFOL Query: {row['FOL Query']}", axis=1)
    
    train_df, val_df = train_test_split(df, train_size=train_size, random_state=random_state)
    
    return {
        "train": Dataset.from_pandas(train_df),
        "validation": Dataset.from_pandas(val_df)
    }
#end def

In [14]:
dataset_fol = load_csv_dataset("question_query_train.csv")
display(dataset_fol)

{'train': Dataset({
     features: ['Question', 'FOL Query', 'text', '__index_level_0__'],
     num_rows: 438
 }),
 'validation': Dataset({
     features: ['Question', 'FOL Query', 'text', '__index_level_0__'],
     num_rows: 24
 })}

### Load an LLM from Huggingface for Seq2Seq tasks
It requires loading the model and its tokenizer

In [15]:
# Loading a medium model: T-5
model_name_1='google/flan-t5-base'
original_model_t5 = AutoModelForSeq2SeqLM.from_pretrained(model_name_1, torch_dtype=torch.bfloat16)
original_model_t5.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_name_1)

### Loading an LLM for Causal tasks

In [16]:
model_name_2='gpt2'
# device_map = 'auto'

original_model_gpt2 = AutoModelForCausalLM.from_pretrained(model_name_2, trust_remote_code=True)
original_model_gpt2.to(device)

tokenizer_gpt2 = AutoTokenizer.from_pretrained(model_name_2)
# GPT-2 has no pad token; reuse its own EOS token (not the T5 tokenizer's)
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token

### Identify what weights to tune

You can count the model's parameters and find out how many of them are trainable. The following functions do that; at this stage, you do not need to go into their details.

In [17]:
def count_trainable_parameters(model):
    # numel stands for "number of elements"
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def identify_trainable_layers(model):
    trainable_layers = []
    for name, param in model.named_parameters():
        if param.requires_grad:
            trainable_layers.append(name)
    return trainable_layers

In [18]:
# Count trainable parameters
trainable_params = count_trainable_parameters(original_model_t5)
print(f"Total trainable parameters: {trainable_params} for T5")

# Identify trainable layers
trainable_layers = identify_trainable_layers(original_model_t5)

# Print the total number of parameters
total_params = sum(p.numel() for p in original_model_t5.parameters())
print(f"Total parameters: {total_params}")
print(f"Percentage of trainable parameters: {trainable_params / total_params * 100:.2f}%")

print("Trainable layers:")
for layer in trainable_layers:
    print(f"- {layer}")

Total trainable parameters: 247577856 for T5
Total parameters: 247577856
Percentage of trainable parameters: 100.00%
Trainable layers:
- shared.weight
- encoder.block.0.layer.0.SelfAttention.q.weight
- encoder.block.0.layer.0.SelfAttention.k.weight
- encoder.block.0.layer.0.SelfAttention.v.weight
- encoder.block.0.layer.0.SelfAttention.o.weight
- encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight
- encoder.block.0.layer.0.layer_norm.weight
- encoder.block.0.layer.1.DenseReluDense.wi_0.weight
- encoder.block.0.layer.1.DenseReluDense.wi_1.weight
- encoder.block.0.layer.1.DenseReluDense.wo.weight
- encoder.block.0.layer.1.layer_norm.weight
- encoder.block.1.layer.0.SelfAttention.q.weight
- encoder.block.1.layer.0.SelfAttention.k.weight
- encoder.block.1.layer.0.SelfAttention.v.weight
- encoder.block.1.layer.0.SelfAttention.o.weight
- encoder.block.1.layer.0.layer_norm.weight
- encoder.block.1.layer.1.DenseReluDense.wi_0.weight
- encoder.block.1.layer.1.DenseReluDense.wi_

In [19]:
trainable_params = count_trainable_parameters(original_model_gpt2)
print(f"Total trainable parameters: {trainable_params} for GPT2")

# Identify trainable layers
trainable_layers = identify_trainable_layers(original_model_gpt2)

# Print the total number of parameters
total_params = sum(p.numel() for p in original_model_gpt2.parameters())
print(f"Total parameters: {total_params}")
print(f"Percentage of trainable parameters: {trainable_params / total_params * 100:.2f}%")

print("Trainable layers:")
for layer in trainable_layers:
    print(f"- {layer}")

Total trainable parameters: 124439808 for GPT2
Total parameters: 124439808
Percentage of trainable parameters: 100.00%
Trainable layers:
- transformer.wte.weight
- transformer.wpe.weight
- transformer.h.0.ln_1.weight
- transformer.h.0.ln_1.bias
- transformer.h.0.attn.c_attn.weight
- transformer.h.0.attn.c_attn.bias
- transformer.h.0.attn.c_proj.weight
- transformer.h.0.attn.c_proj.bias
- transformer.h.0.ln_2.weight
- transformer.h.0.ln_2.bias
- transformer.h.0.mlp.c_fc.weight
- transformer.h.0.mlp.c_fc.bias
- transformer.h.0.mlp.c_proj.weight
- transformer.h.0.mlp.c_proj.bias
- transformer.h.1.ln_1.weight
- transformer.h.1.ln_1.bias
- transformer.h.1.attn.c_attn.weight
- transformer.h.1.attn.c_attn.bias
- transformer.h.1.attn.c_proj.weight
- transformer.h.1.attn.c_proj.bias
- transformer.h.1.ln_2.weight
- transformer.h.1.ln_2.bias
- transformer.h.1.mlp.c_fc.weight
- transformer.h.1.mlp.c_fc.bias
- transformer.h.1.mlp.c_proj.weight
- transformer.h.1.mlp.c_proj.bias
- transformer.h.2.ln_

#### The Naming Convention of Weights

##### Key Components
- **c_attn**: Prepares inputs for attention
- **c_proj**: Processes the output of attention
- **c_fc**: Part of the feed-forward network in each transformer block
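As a cross-check on these names, the sketch below rebuilds the GPT-2 parameter count from the layer shapes alone. The hyperparameters (vocabulary 50257, 1024 positions, hidden size 768, 12 blocks) are the published values for the `gpt2` checkpoint and are hard-coded here as assumptions rather than read from the model. The helper `linear` is hypothetical.

```python
# Assumed hyperparameters of the "gpt2" (124M) checkpoint; not read from the model.
VOCAB, N_POS, D, N_LAYER = 50257, 1024, 768, 12


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


embeddings = VOCAB * D + N_POS * D   # wte.weight + wpe.weight
layer_norm = 2 * D                   # weight + bias

block = (
    layer_norm            # ln_1
    + linear(D, 3 * D)    # attn.c_attn (merged Q, K, V projection)
    + linear(D, D)        # attn.c_proj
    + layer_norm          # ln_2
    + linear(D, 4 * D)    # mlp.c_fc
    + linear(4 * D, D)    # mlp.c_proj
)

# The final layer norm (ln_f) is added once; the output head shares wte.weight,
# so it contributes no additional parameters.
total = embeddings + N_LAYER * block + layer_norm
print(total)
```

The result matches the total printed for GPT-2 in the cell output above.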

##### Historical Context
The "c" prefix originates from the original OpenAI GPT (Generative Pre-trained Transformer) implementation, which was based on the codebase for the Transformer model introduced in the "Attention Is All You Need" paper.

##### Meaning of "c"
The "c" likely stands for "convolution" or "conv". This might seem odd at first, given that transformers don't use convolutions in the same way convolutional neural networks do. In the Hugging Face GPT-2 implementation, however, these layers are `Conv1D` modules, which behave like linear layers whose weight matrix is stored with shape `[in_features, out_features]`.

##### Special Note on c_attn
`c_attn` is a merged representation of the Query (Q), Key (K), and Value (V) weight matrices. This is a key characteristic of the GPT-2 implementation that differentiates it from some other transformer architectures.

##### Comparison with Standard Transformer Architecture
1. **Standard Transformer:**
   - Q, K, and V are separate linear projections:
     - W_q for Query
     - W_k for Key
     - W_v for Value

2. **GPT-2 Implementation:**
   - These three projections are combined into a single larger matrix: `c_attn.weight`
   - The matrix is effectively [W_q || W_k || W_v], where '||' represents concatenation

##### Implications
- This merged approach in GPT-2 can be more efficient in terms of computation and memory usage.
- It allows for a single large matrix multiplication instead of three separate ones.
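The equivalence between the merged and the separate projections can be checked with a toy example. This is a minimal pure-Python sketch with made-up 2x2 matrices and no biases; `matmul` and `hconcat` are hypothetical helpers, not part of any library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(*mats):
    """Concatenate matrices column-wise: [A || B || C]."""
    return [sum((m[i] for m in mats), []) for i in range(len(mats[0]))]


# Toy input (2 tokens, hidden size 2) and toy projection matrices.
x = [[1, 2], [3, 4]]
W_q = [[1, 0], [0, 1]]
W_k = [[2, 1], [0, 3]]
W_v = [[0, 1], [1, 0]]

# Standard formulation: three separate projections.
q, k, v = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)

# Merged formulation: one multiplication, then split the columns in thirds.
qkv = matmul(x, hconcat(W_q, W_k, W_v))
d = len(W_q[0])
q_m = [row[:d] for row in qkv]
k_m = [row[d:2 * d] for row in qkv]
v_m = [row[2 * d:] for row in qkv]

print(q == q_m and k == k_m and v == v_m)
```

Both routes produce identical Q, K, and V; the merged form simply does the work in one multiplication.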

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1 1" width="800" height="600">  
  <style>
    .small { font: italic 0.0216666667px sans-serif; }
    .heavy { font: bold 0.05px sans-serif; }
    .caption { font: 0.025px sans-serif; }
  </style>
  
  <!-- Input Embeddings -->
  <rect x="0.0625" y="0.0833333333" width="0.25" height="0.1333333333" fill="#FFB3BA" stroke="black" stroke-width="0.001667" />
  <text x="0.1875" y="0.1583333333" text-anchor="middle" class="caption">Input Embeddings</text>
  <text x="0.1875" y="0.1916666667" text-anchor="middle" class="small">wte.weight, wpe.weight</text>
  
  <!-- Transformer Blocks -->
  <rect x="0.0625" y="0.25" width="0.25" height="0.6666666667" fill="#BAFFC9" stroke="black" stroke-width="0.001667" />
  <text x="0.1875" y="0.2916666667" text-anchor="middle" class="caption">Transformer Blocks</text>
  
  <!-- Block 0 -->
  <rect x="0.0875" y="0.3333333333" width="0.2" height="0.1666666667" fill="#BAE1FF" stroke="black" stroke-width="0.001667" />
  <text x="0.1875" y="0.3666666667" text-anchor="middle" class="small">Block 0 (h.0)</text>
  <text x="0.1875" y="0.4" text-anchor="middle" class="small">ln_1, attn, ln_2, mlp</text>
  <text x="0.1875" y="0.4333333333" text-anchor="middle" class="small">weights and biases</text>
  
  <!-- Block 1 -->
  <rect x="0.0875" y="0.5333333333" width="0.2" height="0.1666666667" fill="#BAE1FF" stroke="black" stroke-width="0.001667" />
  <text x="0.1875" y="0.5666666667" text-anchor="middle" class="small">Block 1 (h.1)</text>
  <text x="0.1875" y="0.6" text-anchor="middle" class="small">ln_1, attn, ln_2, mlp</text>
  <text x="0.1875" y="0.6333333333" text-anchor="middle" class="small">weights and biases</text>
  
  <!-- Block 2 -->
  <rect x="0.0875" y="0.7333333333" width="0.2" height="0.1666666667" fill="#BAE1FF" stroke="black" stroke-width="0.001667" />
  <text x="0.1875" y="0.7666666667" text-anchor="middle" class="small">Block 2 (h.2)</text>
  <text x="0.1875" y="0.8" text-anchor="middle" class="small">ln_1, attn, ln_2, mlp</text>
  <text x="0.1875" y="0.8333333333" text-anchor="middle" class="small">weights and biases</text>
  
  <!-- Legend -->
  <rect x="0.375" y="0.0833333333" width="0.5625" height="0.8333333333" fill="#FFFFC9" stroke="black" stroke-width="0.001667" />
  <text x="0.65625" y="0.1333333333" text-anchor="middle" class="caption">Legend: Weight and Bias Descriptions</text>
  
  <text x="0.4" y="0.1833333333" class="small">wte.weight: Token embedding weights</text>
  <text x="0.4" y="0.2166666667" class="small">wpe.weight: Position embedding weights</text>
  <text x="0.4" y="0.2666666667" class="small">For each block (h.0, h.1, h.2):</text>
  <text x="0.425" y="0.3" class="small">ln_1.weight, ln_1.bias: Layer norm 1</text>
  <text x="0.425" y="0.3333333333" class="small">attn.c_attn.weight, attn.c_attn.bias: Attention input</text>
  <text x="0.425" y="0.3666666667" class="small">attn.c_proj.weight, attn.c_proj.bias: Attention output</text>
  <text x="0.425" y="0.4" class="small">ln_2.weight, ln_2.bias: Layer norm 2</text>
  <text x="0.425" y="0.4333333333" class="small">mlp.c_fc.weight, mlp.c_fc.bias: MLP first layer</text>
  <text x="0.425" y="0.4666666667" class="small">mlp.c_proj.weight, mlp.c_proj.bias: MLP second layer</text>
</svg>

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1 1" width="800" height="600">
  <style>
    .small { font: italic 0.0216666667px sans-serif; }
    .caption { font: bold 0.0266666667px sans-serif; }
    .subcaption { font: 0.0233333333px sans-serif; }
  </style>
  
  <!-- Input -->
  <rect x="0.0625" y="0.0833333333" width="0.125" height="0.0666666667" fill="#FFB3BA" stroke="black" stroke-width="0.001667" />
  <text x="0.125" y="0.125" text-anchor="middle" class="subcaption">Input</text>

  <!-- Layer Norm 1 -->
  <rect x="0.0625" y="0.1833333333" width="0.125" height="0.1" fill="#BAFFC9" stroke="black" stroke-width="0.001667" />
  <text x="0.125" y="0.2333333333" text-anchor="middle" class="subcaption">Layer Norm 1</text>
  <text x="0.125" y="0.2666666667" text-anchor="middle" class="small">ln_1.weight</text>
  <text x="0.125" y="0.2916666667" text-anchor="middle" class="small">ln_1.bias</text>

  <!-- Self-Attention -->
  <rect x="0.0625" y="0.3166666667" width="0.375" height="0.3333333333" fill="#BAE1FF" stroke="black" stroke-width="0.001667" />
  <text x="0.25" y="0.3583333333" text-anchor="middle" class="caption">Self-Attention</text>
  
  <!-- QKV Projection -->
  <rect x="0.0875" y="0.3833333333" width="0.325" height="0.1" fill="#FFD700" stroke="black" stroke-width="0.001667" />
  <text x="0.25" y="0.425" text-anchor="middle" class="subcaption">QKV Projection</text>
  <text x="0.25" y="0.4583333333" text-anchor="middle" class="small">attn.c_attn.weight</text>
  <text x="0.25" y="0.4833333333" text-anchor="middle" class="small">attn.c_attn.bias</text>

  <!-- Attention Computation -->
  <rect x="0.0875" y="0.5" width="0.325" height="0.0666666667" fill="#FF69B4" stroke="black" stroke-width="0.001667" />
  <text x="0.25" y="0.5416666667" text-anchor="middle" class="subcaption">Attention Computation</text>

  <!-- Output Projection -->
  <rect x="0.0875" y="0.5833333333" width="0.325" height="0.1" fill="#FFD700" stroke="black" stroke-width="0.001667" />
  <text x="0.25" y="0.625" text-anchor="middle" class="subcaption">Output Projection</text>
  <text x="0.25" y="0.6583333333" text-anchor="middle" class="small">attn.c_proj.weight</text>
  <text x="0.25" y="0.6833333333" text-anchor="middle" class="small">attn.c_proj.bias</text>

  <!-- Layer Norm 2 -->
  <rect x="0.0625" y="0.6833333333" width="0.125" height="0.1" fill="#BAFFC9" stroke="black" stroke-width="0.001667" />
  <text x="0.125" y="0.7333333333" text-anchor="middle" class="subcaption">Layer Norm 2</text>
  <text x="0.125" y="0.7666666667" text-anchor="middle" class="small">ln_2.weight</text>
  <text x="0.125" y="0.7916666667" text-anchor="middle" class="small">ln_2.bias</text>

  <!-- Feed Forward Network -->
  <rect x="0.0625" y="0.8166666667" width="0.375" height="0.1666666667" fill="#FFA07A" stroke="black" stroke-width="0.001667" />
  <text x="0.25" y="0.8583333333" text-anchor="middle" class="caption">Feed Forward Network</text>
  
  <!-- First FC Layer -->
  <rect x="0.0875" y="0.8833333333" width="0.15" height="0.0833333333" fill="#98FB98" stroke="black" stroke-width="0.001667" />
  <text x="0.1625" y="0.925" text-anchor="middle" class="small">mlp.c_fc.weight</text>
  <text x="0.1625" y="0.95" text-anchor="middle" class="small">mlp.c_fc.bias</text>

  <!-- Second FC Layer -->
  <rect x="0.2625" y="0.8833333333" width="0.15" height="0.0833333333" fill="#98FB98" stroke="black" stroke-width="0.001667" />
  <text x="0.3375" y="0.925" text-anchor="middle" class="small">mlp.c_proj.weight</text>
  <text x="0.3375" y="0.95" text-anchor="middle" class="small">mlp.c_proj.bias</text>

  <!-- Output -->
  <rect x="0.3125" y="0.0833333333" width="0.125" height="0.0666666667" fill="#FFB3BA" stroke="black" stroke-width="0.001667" />
  <text x="0.375" y="0.125" text-anchor="middle" class="subcaption">Output</text>

  <!-- Arrows -->
  <line x1="0.125" y1="0.15" x2="0.125" y2="0.1833333333" stroke="black" stroke-width="0.003333" />
  <line x1="0.125" y1="0.2833333333" x2="0.125" y2="0.3166666667" stroke="black" stroke-width="0.003333" />
  <line x1="0.25" y1="0.65" x2="0.25" y2="0.6833333333" stroke="black" stroke-width="0.003333" />
  <line x1="0.125" y1="0.7833333333" x2="0.125" y2="0.8166666667" stroke="black" stroke-width="0.003333" />
  <line x1="0.375" y1="0.0833333333" x2="0.375" y2="0.05" stroke="black" stroke-width="0.003333" />
  <line x1="0.375" y1="0.05" x2="0.0375" y2="0.05" stroke="black" stroke-width="0.003333" />
  <line x1="0.0375" y1="0.05" x2="0.0375" y2="0.95" stroke="black" stroke-width="0.003333" />
  <line x1="0.0375" y1="0.95" x2="0.0625" y2="0.95" stroke="black" stroke-width="0.003333" />
</svg>

### Test the Model with Zero Shot Inferencing: T-5

Test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [20]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device) 

output = tokenizer.decode(
    original_model_t5.generate(
        input_ids, 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

### Test the Model with Zero Shot Inferencing: GPT2


In [21]:
index = 3

# test_questions = random.sample(list(zip(dataset['validation']['Question'], dataset['validation']['FOL Query'])), 1)


question = dataset_fol['validation'][index]['Question']
fol_gt = dataset_fol['validation'][index]['FOL Query']

prompt = f"""
Translate the following natural language question to First Order Logic (FOL). \
Please respond with only the FOL statement. Don't include additional text.

Question: {question}

FOL Query:
"""

input_ids = tokenizer_gpt2(prompt, return_tensors="pt").input_ids.to(original_model_gpt2.device)
output = original_model_gpt2.generate(input_ids, max_new_tokens=100, num_return_sequences=1, temperature=0.7)

generated_text = tokenizer_gpt2.decode(output[0], skip_special_tokens=True)
fol_model = generated_text.split("FOL Query:")[-1].strip()
        


dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')

print(dash_line)
print(f'Ground truth:\n{fol_gt}')


print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{fol_model}')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Translate the following natural language question to First Order Logic (FOL). Please respond with only the FOL statement. Don't include additional text.

Question: Can you spot a Mini Cooper on the left?

FOL Query:

---------------------------------------------------------------------------------------------------
Ground truth:
TypeOf(x, MiniCooper)^InitialLocation(x, NearLeft)
---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:



## Perform Full Fine-Tuning

### Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation` and to the start of the summary with `Summary` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary: 
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [22]:
def tokenize_function(example):
    # Define the start and end prompts for the summarization task
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    
    # Create a list of prompts by combining start_prompt, dialogue, and end_prompt
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    
    # Tokenize the prompts and store the input_ids in the example dictionary
    example['input_ids'] = tokenizer(
        prompt, 
        padding="max_length", 
        truncation=True, 
        return_tensors="pt"
    ).input_ids
    
    # Tokenize the summaries and store the input_ids as labels in the example dictionary
    example['labels'] = tokenizer(
        example["summary"], 
        padding="max_length", 
        truncation=True, 
        return_tensors="pt"
    ).input_ids
    
    print(example['input_ids'].shape, example['labels'].shape)
    
    return example

# The dataset actually contains 3 different splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Remove unnecessary columns from the tokenized datasets
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([460, 512]) torch.Size([460, 512])


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

torch.Size([500, 512]) torch.Size([500, 512])


Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

torch.Size([1000, 512]) torch.Size([1000, 512])
torch.Size([500, 512]) torch.Size([500, 512])


To save some time in the lab, you will subsample the dataset:

In [17]:
tokenized_datasets = tokenized_datasets.filter(lambda _, index: index % 100 == 0, with_indices=True)

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]
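As a quick sanity check on what the `index % 100 == 0` filter keeps, the sketch below counts the surviving rows for each split using the split sizes reported when the dataset was loaded. The helper `kept_rows` is hypothetical and does not touch the actual dataset.

```python
def kept_rows(num_rows: int, step: int = 100) -> int:
    """Count indices in [0, num_rows) that pass the `index % step == 0` filter."""
    return sum(1 for index in range(num_rows) if index % step == 0)


# Split sizes as reported when the DialogSum dataset was loaded.
split_sizes = {"train": 12460, "validation": 500, "test": 1500}

for split, num_rows in split_sizes.items():
    print(f"{split}: {kept_rows(num_rows)} of {num_rows} rows kept")
```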

Check the shapes of all three parts of the dataset:

In [11]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})


The output dataset is ready for fine-tuning.

### Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [23]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1,
    use_cpu=True  # Force CPU usage; replaces the deprecated no_cuda flag
)

trainer = Trainer(
    model=original_model_t5,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

max_steps is given, it will override any value given in num_train_epochs
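To see how little training `max_steps=1` actually performs, here is a rough sketch of the step arithmetic. It assumes a per-device batch size of 8 (the notebook does not set one, so this relies on the `TrainingArguments` default), a single device, and no gradient accumulation. The helper `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int = 8,
                    num_devices: int = 1, grad_accum: int = 1) -> int:
    """Approximate optimizer steps needed for one pass over the data."""
    batches = math.ceil(num_examples / (batch_size * num_devices))
    return max(batches // grad_accum, 1)


full_train = 12460  # size of the unfiltered training split
print("steps for one full epoch:", steps_per_epoch(full_train))

# With max_steps=1, training stops after a single optimizer step,
# so only one batch of examples is ever seen.
max_steps, batch_size = 1, 8
print("examples seen with max_steps=1:", max_steps * batch_size)
```

Under these assumptions a single step covers only a tiny fraction of one epoch, which is why the notebook switches to a pre-trained checkpoint afterwards.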


Start training process...



In [None]:
trainer.train()

[2024-10-10 14:53:27,104] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)






Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. This fully fine-tuned model will also be referred to as the **instruct model** in this lab.

In [19]:
instruct_model_name_from_huggingface = "truocpham/flan-dialogue-summary-checkpoint"
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_name_from_huggingface, torch_dtype=torch.bfloat16)
# instruct_model.to(device)

The size of the downloaded instruct model is approximately 1GB.

### Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [21]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model_t5.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1: I'm thinking of upgrading your software.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# suggests #Person2# adding a painting program to #Person2#'s software and upgrading the hardware. #Person2# also wants to add a CD-ROM drive.


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [22]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

The ROUGE metric is commonly used in natural language processing for evaluating text summarization and translation tasks.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It was introduced by Chin-Yew Lin in 2004. The basic idea behind ROUGE is to compare an automatically generated summary or translation against a set of reference (human-written) summaries or translations.

Here are the main types of ROUGE metrics:

1. ROUGE-N:
   This measures the overlap of n-grams between the generated and reference texts.
   - ROUGE-1: Measures unigram overlap
   - ROUGE-2: Measures bigram overlap
   - ROUGE-3: Measures trigram overlap, and so on

2. ROUGE-L:
   This measures the longest common subsequence (LCS) between the generated and reference texts. It's more flexible than ROUGE-N as it allows for gaps in the match.

3. ROUGE-W:
   A weighted version of ROUGE-L that gives more importance to consecutive matches.

4. ROUGE-S:
   This measures the overlap of skip-bigrams between the generated and reference texts. Skip-bigrams are pairs of words in their sentence order, allowing for gaps between them.
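
To make the difference between n-gram overlap and the longest common subsequence concrete, here is a minimal pure-Python sketch. It assumes simple whitespace tokenization of lowercase text. The helper names `ngram_overlap` and `lcs_length` are illustrative only and are not part of the `evaluate` library, which applies its own tokenization and scoring.

```python
from collections import Counter

def ngram_overlap(reference, generated, n):
    """Count n-grams shared by both token lists (clipped by their counts)."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    gen = Counter(tuple(generated[i:i + n]) for i in range(len(generated) - n + 1))
    return sum((ref & gen).values())

def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference = "the cat sat on the mat".split()
generated = "the cat lay on the rug".split()

unigram_matches = ngram_overlap(reference, generated, 1)  # basis of ROUGE-1
bigram_matches = ngram_overlap(reference, generated, 2)   # basis of ROUGE-2
lcs_matches = lcs_length(reference, generated)            # basis of ROUGE-L
print(unigram_matches, bigram_matches, lcs_matches)       # prints: 4 2 4
```

The bigram count is lower than the LCS length because the matching words are interrupted by "sat"/"lay", which breaks contiguous bigrams but not the in-order subsequence "the cat on the".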

For each type of ROUGE, we typically calculate three scores:

1. Precision: The ratio of the number of overlapping units to the total number of units in the generated text.
2. Recall: The ratio of the number of overlapping units to the total number of units in the reference text.
3. F1-score: The harmonic mean of precision and recall, providing a balanced measure.

Here's a simple example to illustrate ROUGE-1:

- Reference: "The cat sat on the mat."
- Generated: "The cat lay on the rug."

ROUGE-1 would count the overlapping unigrams:

- Overlapping: "The", "cat", "on", "the"
- Precision: 4/6 (4 overlapping words out of 6 in the generated text)
- Recall: 4/6 (4 overlapping words out of 6 in the reference text)
- F1-score: 2 * (4/6 * 4/6) / (4/6 + 4/6) = 0.667
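
The same ROUGE-1 arithmetic can be reproduced in a few lines of Python. This is a minimal sketch, not the implementation behind `evaluate.load('rouge')`: it assumes lowercase word tokens extracted with a regular expression, and the function name `rouge1_scores` is made up for illustration.

```python
import re
from collections import Counter

def rouge1_scores(reference, generated):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_tokens = re.findall(r"\w+", reference.lower())
    gen_tokens = re.findall(r"\w+", generated.lower())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(gen_tokens)).values())
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

precision, recall, f1 = rouge1_scores("The cat sat on the mat.", "The cat lay on the rug.")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.667 recall=0.667 f1=0.667
```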

ROUGE scores range from 0 to 1, where higher scores indicate better performance.

It's worth noting that while ROUGE is widely used, it has limitations. It focuses on lexical overlap and doesn't capture semantic similarity or factual correctness. Therefore, it's often used in conjunction with other metrics or human evaluation for a more comprehensive assessment.
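
The lexical-overlap limitation is easy to demonstrate. In the sketch below (a simplified unigram F1, again assuming regex word tokenization, with the hypothetical helper name `unigram_f1`), a faithful paraphrase scores zero while a sentence that contradicts the reference scores very high.

```python
import re
from collections import Counter

def unigram_f1(reference, generated):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercase word tokens."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    gen = Counter(re.findall(r"\w+", generated.lower()))
    overlap = sum((ref & gen).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "The cat sat on the mat."
paraphrase_f1 = unigram_f1(reference, "A feline rested upon a rug.")
contradiction_f1 = unigram_f1(reference, "The cat never sat on the mat.")

print(f"paraphrase:    {paraphrase_f1:.3f}")     # same meaning, no shared words
print(f"contradiction: {contradiction_f1:.3f}")  # opposite meaning, almost all words shared
```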

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [23]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model_t5.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Employees are required to make a memorandum of... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | Ms. Dawson, please send me a memo on the new p... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1: This memo is to be distributed to al... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | Person1: I'm finally here! #Person2: I'm final... | #Person2# got stuck in traffic again. #Person1... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The driver is stuck in a traffic jam. | #Person2# got stuck in traffic again. #Person1... |
| 5 | #Person2# complains to #Person1# about the tra... | Then there's the traffic jams. | #Person2# got stuck in traffic again. #Person1... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | #Person1: Kate, you can't believe what happene... | Masha and Hero are getting divorced. Kate can'... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | #Porning #Porning #Porning #Porning #Porning #... | Masha and Hero are getting divorced. Kate can'... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Brian, I'm so sorry to hear of you. | Brian's birthday is coming. #Person1# invites ... |


Evaluate both models by computing ROUGE metrics. Notice the improvement in the results!

In [26]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.17051333563821996, 'rouge2': 0.04698412698412698, 'rougeL': 0.15834672773022485, 'rougeLsum': 0.1590422279445028}
INSTRUCT MODEL:
{'rouge1': 0.4015906463624618, 'rouge2': 0.17568542724181807, 'rougeL': 0.2874569966059625, 'rougeLsum': 0.2886327613084294}


The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [42]:
results = pd.read_csv("data/dialogue-summary-training-results.csv")

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}


The results show substantial improvement in all ROUGE metrics. The differences below are absolute, that is, expressed in percentage points:

In [43]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%
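
The figures printed above are absolute differences in ROUGE score, expressed in percentage points. As a complementary view, the sketch below also computes the relative improvement. The scores are copied from the full-dataset results above (two metrics only, for brevity); the dictionary and function names are illustrative.

```python
# Scores copied from the full-dataset ROUGE results printed earlier.
original_scores = {"rouge1": 0.2334158581572823, "rouge2": 0.07603964187010573}
instruct_scores = {"rouge1": 0.42161291557556113, "rouge2": 0.18035380596301792}

def improvement(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (after - before) * 100
    relative_percent = (after - before) / before * 100
    return absolute_points, relative_percent

gains = {key: improvement(original_scores[key], instruct_scores[key]) for key in original_scores}
for key, (absolute_points, relative_percent) in gains.items():
    print(f"{key}: +{absolute_points:.2f} points absolute, +{relative_percent:.1f}% relative")
```

The absolute gain answers "how many ROUGE points were added", while the relative gain answers "how large is that compared with where the original model started".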


## Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). When someone says PEFT, they typically mean LoRA. At a very high level, LoRA allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the original LLM remains unchanged and a newly trained "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM: on the order of a single-digit percentage of the original LLM size (MBs vs. GBs).

That said, at inference time, the LoRA adapter needs to be combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can reuse the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
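
To illustrate how an adapter is combined with a frozen base model, the following toy sketch applies a low-rank update in pure Python. It assumes the standard LoRA formulation, in which a frozen weight matrix `W` is augmented by the product of two small matrices `B` and `A`, scaled by `lora_alpha / r`. The tiny matrices and the helper functions are made up for illustration and are not taken from the `peft` library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[x * s for x in row] for row in a]

# Frozen base weight W (3 x 3); it is never modified during LoRA training.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]

# Trainable adapter with rank r = 1: B is (3 x r), A is (r x 3).
r, lora_alpha = 1, 2
B = [[0.5], [0.0], [-1.0]]
A = [[1.0, 2.0, 0.0]]
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Option 1: keep the adapter separate and add its output to the base output.
y_separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Option 2: merge the adapter into the weights once, then use a single matrix.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_separate)  # [[12.0], [2.0], [-4.0]]
print(y_merged)    # [[12.0], [2.0], [-4.0]]
```

Because both options give the same output, one copy of `W` can be shared in memory while several small `(B, A)` pairs are swapped in for different tasks.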

### Use GPU

In [19]:
def get_available_device():
    """Return the CUDA device with the most free memory, or the CPU if CUDA is unavailable."""
    if not torch.cuda.is_available():
        print("CUDA is not available. Using CPU.")
        return torch.device('cpu')

    # torch.cuda.mem_get_info returns (free_bytes, total_bytes) as reported by the driver,
    # so memory used by other processes is taken into account as well.
    num_gpus = torch.cuda.device_count()
    free_memory = [torch.cuda.mem_get_info(i)[0] for i in range(num_gpus)]

    device_id = free_memory.index(max(free_memory))
    print(f"Using GPU {device_id} with {free_memory[device_id] / 1e9:.2f} GB free memory")
    return torch.device(f'cuda:{device_id}')

# Use the function to get the best available device
device = get_available_device()

Using GPU 2 with 8.88 GB free memory


### Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.
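
To get a feel for how the rank `r` controls the adapter size, the sketch below counts LoRA parameters for query and value projections. The architecture numbers are assumptions about FLAN-T5-base (hidden size 768, 12 encoder and 12 decoder layers) that are not stated in this notebook, and `lora_parameter_count` is a hypothetical helper, not a `peft` function.

```python
def lora_parameter_count(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is (r x d_in) and B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumed FLAN-T5-base dimensions (not taken from this notebook).
d_model = 768          # query/value projections are assumed to be 768 x 768
encoder_layers = 12    # one self-attention block per encoder layer
decoder_layers = 12    # one self-attention and one cross-attention block per decoder layer
attention_blocks = encoder_layers + 2 * decoder_layers
adapted_matrices = attention_blocks * 2  # "q" and "v" in every attention block

totals = {}
for r in (4, 8, 16, 32):
    totals[r] = adapted_matrices * lora_parameter_count(d_model, d_model, r)
    print(f"r={r:>2}: {totals[r]:,} trainable adapter parameters")
```

Under these assumptions, `r=32` gives 3,538,944 adapter parameters, roughly 1.4% of the 251,116,800 total parameters reported for the PEFT model later in this notebook. Halving `r` halves the adapter size.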

### Fine-Tuning T5 with `transformers.Trainer`
The Trainer class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs.

In [20]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

peft_model = get_peft_model(original_model_t5, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1  # Overrides num_train_epochs: a single optimizer step keeps this demo fast.
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

The cell above adds LoRA adapter layers/parameters to the original LLM, trains only those parameters, and saves the resulting adapter.

The saved adapter should be much smaller than the original LLM.
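
One way to check this is to sum the file sizes in the checkpoint directory. The sketch below is a minimal example using only the standard library. `directory_size_bytes` is a hypothetical helper, and the path is the one passed to `save_pretrained` above; note that the directory also contains the tokenizer files saved alongside the adapter.

```python
import os

def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path`, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

adapter_dir = "./peft-dialogue-summary-checkpoint-local"  # path used when saving the adapter
if os.path.isdir(adapter_dir):
    print(f"Adapter checkpoint: {directory_size_bytes(adapter_dir) / 1e6:.1f} MB")
else:
    print(f"{adapter_dir} not found; run the training cell first.")
```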



To run inference, the trained adapter is added to the original FLAN-T5 model. The loading code later in this section sets `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

### Fine-Tuning GPT-2 with `trl.SFTTrainer`

The `SFTTrainer` is a subclass of the `Trainer` from the transformers library and supports all the same features, including logging, evaluation, and checkpointing. It also adds quality-of-life features such as dataset formatting, including conversational and instruction formats.

In [2]:
from peft import LoraConfig, PeftModel, TaskType
from transformers import TrainingArguments
from trl import SFTTrainer

lora_r = 8  # Reduced from 16 to 8 for GPT-2
lora_alpha = 16
lora_dropout = 0.05
target_modules = ["c_attn", "c_proj", "c_fc"]  # GPT-2 specific target modules

# The LoraConfig object is created with the following parameters:
# 'r' (rank of the low-rank approximation) is set to 8,
# 'lora_alpha' (scaling factor) is set to 16,
# 'lora_dropout' (dropout probability for LoRA layers) is set to 0.05,
# 'task_type' is set to TaskType.CAUSAL_LM because GPT-2 is a causal language model,
# 'target_modules' (the modules to which LoRA is applied) selects GPT-2's attention and MLP projection layers, leaving the output layer untouched.

peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    task_type=TaskType.CAUSAL_LM,
    target_modules=target_modules,
    bias="none",  # Do not train any bias parameters
)

# 'TrainingArguments' is a class that holds the arguments for training a model.
# 'output_dir' is the directory where the model and its checkpoints will be saved.
# 'evaluation_strategy' is set to "steps", meaning that evaluation will be performed after a certain number of training steps.
# 'do_eval' is set to True, meaning that evaluation will be performed.
# 'optim' is set to "adamw_torch", meaning that the AdamW optimizer from PyTorch will be used.
# 'per_device_train_batch_size' and 'per_device_eval_batch_size' are set to 1, meaning that the batch size for training and evaluation will be 1 per device.
# 'gradient_accumulation_steps' is set to 8, meaning that gradients will be accumulated over 8 batches before each optimizer update (an effective training batch size of 8 per device).
# 'log_level' is set to "info", meaning that log messages at the INFO level and above will be printed.
# 'save_strategy' is set to "epoch", meaning that the model will be saved after each epoch.
# 'logging_steps' is set to 100, meaning that training metrics will be logged every 100 steps.
# 'learning_rate' is set to 5e-5, which is the learning rate for the optimizer.
# 'fp16' is set to True, meaning that 16-bit (fp16) mixed-precision training will be used; this typically requires a GPU.
# 'eval_steps' is set to 100, meaning that evaluation will be performed every 100 steps.
# 'num_train_epochs' is set to 200, meaning that the model will be trained for 200 epochs.
# 'warmup_ratio' is set to 0.1, meaning that 10% of the total training steps will be used for the warmup phase.
# 'lr_scheduler_type' is set to "cosine", meaning that a cosine learning rate scheduler will be used.
# 'seed' is set to 42, which is the seed for the random number generator.

# Training arguments
args = TrainingArguments(
    output_dir="./gpt2-LoRA-nl-to-fol-2",
    evaluation_strategy="steps",
    do_eval=True,
    optim="adamw_torch",
    per_device_train_batch_size=1, 
    gradient_accumulation_steps=8, 
    per_device_eval_batch_size=1,  
    log_level="info",  
    save_strategy="epoch",
    logging_steps=100, 
    learning_rate=5e-5, 
    fp16=True,  
    eval_steps=100,  
    num_train_epochs=200, 
    warmup_ratio=0.1,
    lr_scheduler_type="cosine", 
    seed=42,
)

# Initialize Trainer
# 'model' is the model that will be trained.
# 'train_dataset' and 'eval_dataset' are the datasets that will be used for training and evaluation, respectively.
# 'peft_config' is the LoRA configuration defined above, which enables parameter-efficient fine-tuning.
# 'dataset_text_field' is set to "text", meaning that the 'text' field of the dataset will be used as the input for the model.
# 'max_seq_length' is set to 256, meaning that the maximum length of the sequences that will be fed to the model is 256 tokens.
# 'tokenizer' is the tokenizer that will be used to tokenize the input text.
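# 'args' are the training arguments that were defined earlier.
# Note: these keyword arguments follow older trl releases; newer releases move 'dataset_text_field' and 'max_seq_length' into SFTConfig.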

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=256, 
    tokenizer=tokenizer,
    args=args,
)

trainer.train()

# Save the fine-tuned model
trainer.save_model("./gpt-2-LoRA-final")

# Load the fine-tuned model
fine_tuned_model = PeftModel.from_pretrained(model, "./gpt-2-LoRA-final")

# Merge the LoRA weights into the base GPT-2 model and save the result as a standalone checkpoint
fine_tuned_model = fine_tuned_model.merge_and_unload()
fine_tuned_model.save_pretrained("./gpt-2-merged")
tokenizer.save_pretrained("./gpt-2-merged")
print("Merged model saved to: ./gpt-2-merged")

Now you can use the model saved in the `./gpt-2-merged` folder to run inference.

### Testing the Fine-Tuned T5 Model

In [49]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to(device)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model_name_from_huggingface = "z7ye/peft-dialogue-summary-checkpoint"

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       peft_model_name_from_huggingface, 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False).to(device)

The number of trainable parameters will be `0` due to the `is_trainable=False` setting:

In [50]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
 all model parameters: 251116800
 percentage of trainable model parameters: 0.00%


### Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model, the fully fine-tuned model, and the PEFT model.

In [51]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model_t5.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

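# The PEFT model was moved to `device` when it was loaded, so move the inputs to the same device.
peft_model_outputs = peft_model.generate(input_ids=input_ids.to(peft_model.device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))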
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: Have you considered upgrading your system?
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# suggests #Person2# adding a painting program to #Person2#'s software and upgrading the hardware. #Person2# also wants to add a CD-ROM drive.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person1# recommends adding a painting program to #Person2#'s software and upgrading hardware. #Person2# also wants to upgrade the hardware because it's outdated now.


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the sample of the test dataset (only 10 dialogues and summaries to save time). 

In [52]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model_t5.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

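    # The PEFT model was moved to `device` when it was loaded, so move the inputs to the same device.
    peft_model_outputs = peft_model.generate(input_ids=input_ids.to(peft_model.device), generation_config=GenerationConfig(max_new_tokens=200))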
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: I need to take a dictation for this... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | New communication policy at the Office of the ... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | This memo is for all employees. | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | The traffic jams are a problem in the city. | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The public transport system is good for the en... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 5 | #Person2# complains to #Person1# about the tra... | #Person1#: I'm sorry for the traffic jam. #Per... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are divorced. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 9 | #Person1# and Brian are at the birthday party ... | People are celebrating Brian's birthday. | Brian's birthday is coming. #Person1# invites ... | Brian remembers his birthday and invites #Pers... |


Compute ROUGE score for this subset of the data. 

In [53]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2288514902063289, 'rouge2': 0.08646891938646062, 'rougeL': 0.2000912588009362, 'rougeLsum': 0.20380631316115186}
INSTRUCT MODEL:
{'rouge1': 0.4015906463624618, 'rouge2': 0.17568542724181807, 'rougeL': 0.2874569966059625, 'rougeLsum': 0.2886327613084294}
PEFT MODEL:
{'rouge1': 0.3725351062275605, 'rouge2': 0.12138811933618107, 'rougeL': 0.27620639623170606, 'rougeLsum': 0.2758134870822362}


Notice that the PEFT model results are not far behind those of the fully fine-tuned model, while the training process was much cheaper!

We have already computed ROUGE score on the full dataset, after loading the results from the `data/dialogue-summary-training-results.csv` file. Load the values for the PEFT model now and check its performance compared to other models.

In [45]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}
PEFT MODEL:
{'rouge1': 0.40810631575616746, 'rouge2': 0.1633255794568712, 'rougeL': 0.32507074586565354, 'rougeLsum': 0.3248950182867091}


The PEFT model improves on the original model slightly less than full fine-tuning does, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [46]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [47]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.70%
rougeL: -1.34%
rougeLsum: -1.35%


Here you see a small decrease (under two percentage points) in the ROUGE metrics compared with full fine-tuning. However, PEFT training requires far less compute and memory (often just a single GPU).