# Fine-Tuning LLMs

In this Proeject, I will fine-tune the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model for enhanced dialogue summarization. You will first explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter-Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

## 1. Set up Dependencies and Load Dataset and LLM

In [2]:
!pip install datasets evaluate rouge_score peft -q

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/510.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m163.8/510.5 kB[0m [31m4.9 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m28.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m19.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m26.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.

In [3]:
import torch
import time
import evaluate
import pandas as pd
import numpy as np
from transformers import DataCollatorForSeq2Seq


from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
from datasets import load_dataset

In [4]:
dataset = load_dataset('knkarthick/dialogsum')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Load the pre-trained [Flan-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer from HuggingFace. Notice that you will be using the [small version](https://huggingface.co/google/flan-t5-base) of Flan-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type to be used by this model, which can reduce GPU memory usage since `bfloat16` uses half as much memory per number compared to `float32`, the default precision for most models.

In [5]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

## 2. Test the Model with Zero-Shot Inferencing

Test the model with zero-shot inference.

In [None]:
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
inputs = tokenizer(prompt, return_tensors='pt')
output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output, skip_special_tokens=True)

print(dash_line)
print(f'INPUT PROMPT:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{original_model_summary}\n')

----------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
-------------------------------------------------------

You can see that the model struggles to summarize the dialogue compared to the baseline summary, and simply repeats the first sentence from the dialogue.

## 3. Perform Full Fine-Tuning

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [7]:
dataset['train']['dialogue'][0]



"#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor."

In [8]:
dataset['train']['summary'][0]


"Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking."

### 3.1 Preprocess the Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation.`, and to the start of the summary with `Summary:` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.
Alice: This is her part of the conversation.
Bob: This is his part of the conversation.    
Summary:
```

Training response (summary):
```
Both Alice and Bob participated in the conversation.
```

**Exercise**: Write a function to tokenize a batch of examples from the dialogue dataset. The function should concatentate the dialogues with the predefined prompt, tokenize them along with their summaries, and define the tokenized summaries as the labels.

In [9]:
def tokenize(examples, max_length=512, summary_max_length=150):
    prompts = ["Summarize the following conversation.\n" + dialogue for dialogue in examples['dialogue']]
    summaries = ["Summary:\n" + summary for summary in examples['summary']]

    # Tokenize prompts and summaries
    model_inputs = tokenizer(prompts, max_length=max_length, truncation=True, padding="max_length", return_tensors="pt")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(summaries, max_length=summary_max_length, truncation=True, padding="max_length", return_tensors="pt")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [10]:
tokenized_dataset=dataset.map(tokenize,batched=True)

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]



Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [11]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
})

In [None]:
#tokenized_eval = tokenize(dataset['validation'],tokenizer)

In [12]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
})

### 3.2 Fine-Tune the Model

**Exercise**: Utilize the Hugging Face Trainer API for training the model on the preprocessed dataset. Define the training arguments, a data collator, and create a `Seq2SeqTrainer` instance. Train the model for one epoch.

In [13]:
import os

# Set environment variable for CUDA memory allocation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

# Now you can import PyTorch and continue with your script
import torch
# ... rest of your code ...





In [14]:
import torch
torch.cuda.empty_cache()


In [18]:


#original_model

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
training_args = Seq2SeqTrainingArguments(

    output_dir="./results",
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    predict_with_generate=False,
    evaluation_strategy="steps",
    #gradient_accumulation_steps=2,  # Adjusted for memory management
    load_best_model_at_end=True,
    fp16=True,  # Enable mixed precision training
)




In [19]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator= DataCollatorForSeq2Seq(tokenizer,model=model),
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Training a fully fine-tuned version of the model should take about 10 minutes on a Google Colab GPU machine.

In [20]:
trainer.train()

Step,Training Loss,Validation Loss
500,0.0,
1000,0.0,
1500,0.0,


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=1558, training_loss=0.0, metrics={'train_runtime': 275.788, 'train_samples_per_second': 45.18, 'train_steps_per_second': 5.649, 'total_flos': 8532076611502080.0, 'train_loss': 0.0, 'epoch': 1.0})

Save the model to a local folder:

In [21]:
model_path = './flan-t5-base-dialogsum-checkpoint'

original_model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('./flan-t5-base-dialogsum-checkpoint/tokenizer_config.json',
 './flan-t5-base-dialogsum-checkpoint/special_tokens_map.json',
 './flan-t5-base-dialogsum-checkpoint/spiece.model',
 './flan-t5-base-dialogsum-checkpoint/added_tokens.json',
 './flan-t5-base-dialogsum-checkpoint/tokenizer.json')

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [22]:
instruct_model=AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

Reload the original Flan-T5-base model:

In [23]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 3.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Section 2, using the original model and the fully fine-tuned model.

In [24]:
# Prepare the input prompt

fine_tuned_model =instruct_model
index = 42
dialogue = dataset['test'][index]['dialogue']
prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"

# Tokenize the input prompt
inputs = tokenizer(prompt, return_tensors='pt')

# Generate a summary using the original model
output_original = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output_original, skip_special_tokens=True)

# Generate a summary using the fine-tuned model
output_fine_tuned = fine_tuned_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
fine_tuned_model_summary = tokenizer.decode(output_fine_tuned, skip_special_tokens=True)

print("\nFine-tuned Model Summary:")
print(fine_tuned_model_summary)


Fine-tuned Model Summary:
Person1: I'm not sure how to adjust my life.


The fine-tuned model is able to create a much better summary of the dialogue compared to the original model.

### 3.4 Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [25]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

**Exercise**: Generate the outputs for a sample of the test set with the fine-tuned model (use only the first 10 dialogues and summaries to save time).

In [26]:
### WRITE YOUR CODE HERE

# Sample the first 10 dialogues from the test set
sample_dialogues = dataset['test'][:10]['dialogue']

# Prepare prompts for the model
sample_prompts = [f"Summarize the following conversation.\n{dialogue}\nSummary:\n" for dialogue in sample_dialogues]

# Tokenize the prompts
encoded_inputs = tokenizer(sample_prompts, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Move the tensor to the same device as model
encoded_inputs = {k: v.to(fine_tuned_model.device) for k, v in encoded_inputs.items()}

# Generate summaries
generated_summaries_ids = fine_tuned_model.generate(**encoded_inputs, max_length=150, num_beams=5, early_stopping=True)

# Decode generated ids to texts
generated_summaries = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_summaries_ids]

# Print the generated summaries
for i, summary in enumerate(generated_summaries):
    print(f"Summary {i+1}: {summary}\n")


Summary 1: #Person1#: I need to take a dictation from Ms. Dawson.

Summary 2: #Person1#: I need to take a dictation from Ms. Dawson.

Summary 3: #Person1#: I need to take a dictation from Ms. Dawson.

Summary 4: #Person1#: You're finally here! #Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection. #Person1#: It's always rather congested down there during rush hour. #Person2#: Perhaps it would be better for the environment, too. #Person1#: Taking the subway would be a lot less stressful than driving. #Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.

Summary 5: #Person1#: You're finally here! #Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection. #Person1#: It's always rather congested down there during rush hour. #Person2#: Perhaps it would be better for the environment, too. #Person1#: Taking the subway would be a lot less stre

In [29]:
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
inputs = tokenizer(prompt, return_tensors='pt')
output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output, skip_special_tokens=True)

print(dash_line)
print(f'INPUT PROMPT:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{original_model_summary}\n')

----------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
-------------------------------------------------------

Evaluate the models computing ROUGE metrics:

In [30]:
human_baseline_summaries = summary
instruct_model_summaries = generated_summaries

original_model_results = rouge.compute(
    predictions=original_model_summary,
    references=human_baseline_summaries[0:len(original_model_summary)]
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)]
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0, 'rougeLsum': 0.0}
INSTRUCT MODEL:
{'rouge1': 0.0025974025974025974, 'rouge2': 0.0, 'rougeL': 0.0025974025974025974, 'rougeLsum': 0.0025974025974025974}


The results show substantial improvement in all ROUGE metrics:

In [31]:
print("Absolute percentage improvement of the instruct model over the original model:")

for key in instruct_model_results:
    improvement = instruct_model_results[key] - original_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')

Absolute percentage improvement of the instruct model over the original model:
rouge1: 0.26%
rouge2: 0.00%
rougeL: 0.26%
rougeLsum: 0.26%


## 4. Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** instead of "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results as you will see soon.

One of the most popular PEFT methods is **Low-Rank Adaptation (LoRA)**, which  introduces low-rank matrices to adapt the LLM with minimal additional parameters. In most cases, when someone says PEFT, they typically mean LoRA.  After fine-tuning for a specific task with LoRA, the result is that the original LLM remains unchanged and a newly-trained "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

At inference time, the LoRA adapter is reunited and combined with its original LLM to serve the inference request. The benefit is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

### 4.1 Setup the LoRA model for Fine-Tuning

You first need to define the configuration of the LoRA model. Have a look at the configuration below. The key configuration element to adjust is the rank (`r`) of the adapter, which influences its capacity and complexity. Experiment with various ranks, such as 8, 16, or 32, and see how they affect the results.

In [32]:
!pip install peft



In [33]:
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,
    lora_alpha=32,
    lora_dropout=0.1
)

In [34]:
peft_model = get_peft_model(original_model, lora_config)

Add LoRA adapter layers/parameters to the original LLM to be trained:

The number of trainable model parameters in the LoRA model is:

In [35]:
peft_model.print_trainable_parameters()

trainable params: 3,538,944 || all params: 251,116,800 || trainable%: 1.4092820552029972


### 4.2 Train the LoRA Adapter

**Exercise**: Define training arguments and create a `Seq2SeqTrainer` instance for the LoRA model. Use a higher learning rate than full fine-tuning (e.g., `1e-3`).

In [37]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoModelForSeq2SeqLM, AutoTokenizer

# Assuming peft_model is defined and loaded correctly
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = Seq2SeqTrainingArguments(
    output_dir="./lora_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=1e-3,
    eval_steps=9,
    save_total_limit=1,
    #weight_decay=0.02,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=False,  # Ensure this is set correctly
)

lora_trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['train']
)

# Start training without converting to any unsupported data type
lora_trainer.train()


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss
1,0.0,
2,0.0,
3,0.0,


TrainOutput(global_step=4674, training_loss=0.0, metrics={'train_runtime': 1024.5827, 'train_samples_per_second': 36.483, 'train_steps_per_second': 4.562, 'total_flos': 2.559622983450624e+16, 'train_loss': 0.0, 'epoch': 3.0})

In [38]:
peft_model = get_peft_model(original_model, lora_config)

In [39]:
peft_model.print_trainable_parameters()

trainable params: 3,538,944 || all params: 251,116,800 || trainable%: 1.4092820552029972


Train the PEFT adapter. Training should take about 6 minutes on a Google Colab GPU machine.

In [40]:
peft_model.train()

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 768)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 768)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
             

Save the model to a local folder:

In [41]:
peft_model.save_pretrained('./flan-t5-base-dialogsum-lora')

Load the PEFT model:

In [42]:
from peft import AutoPeftModelForSeq2SeqLM
from transformers import AutoTokenizer

peft_model = AutoModelForSeq2SeqLM.from_pretrained('./flan-t5-base-dialogsum-lora')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

Reload the original Flan-T5-base model:

In [43]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 4.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Sections 2 and 3, using the original model, the fully fine-tuned model and the PEFT model.

In [44]:
### WRITE YOUR CODE HERE


index = 42  # Adjust as per your dataset
dialogue = dataset['test'][index]['dialogue']

# Create prompts
prompt = f"Summarize the following conversation.\n{dialogue}"

# Function to generate summary
def generate_summary(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True)
    outputs = model.generate(inputs['input_ids'], max_length=150, num_beams=5, early_stopping=True)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary


peft_summary = generate_summary(peft_model, tokenizer, prompt)



In [45]:
peft_summary

"#Person1#: You look pale, don't you? #Person1#: Yes, I'm pale."

### 4.4 Evaluate the Model Quantitatively (with ROUGE Metric)

**Exercise**: Generate the outputs for a sample of the test set with the PEFT model (use only the first 10 dialogues and summaries to save time).

In [46]:



from datasets import load_metric
import numpy as np

# Load the rouge scoring function
rouge = load_metric("rouge")

# Function to evaluate the model on a subset of the test dataset
def evaluate_model(model, tokenizer, dataset, num_samples=10):
    summaries = []
    references = []

    for i in range(num_samples):
        dialogue = dataset['test'][i]['dialogue']
        summary = dataset['test'][i]['summary']
        prompt = f"Summarize the following conversation.\n{dialogue}"

        # Generate summary
        inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True)
        outputs = model.generate(inputs['input_ids'], max_length=150, num_beams=5, early_stopping=True)
        generated_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

        summaries.append(generated_summary)
        references.append(summary)

    # Compute ROUGE scores
    results = rouge.compute(predictions=summaries, references=references)
    return results

# Evaluate the PEFT model
rouge_scores = evaluate_model(peft_model, tokenizer, dataset)
print("ROUGE Scores:")
for key, value in rouge_scores.items():
    print(f"{key}: {np.mean(value)}")



  rouge = load_metric("rouge")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

ROUGE Scores:
rouge1: 0.32828210928595764
rouge2: 0.13234878439797917
rougeL: 0.29201580071277483
rougeLsum: 0.2928005367107028


Compute ROUGE score for this subset of the data.

In [48]:
human_baseline_summaries= summary
instruct_model_summaries=fine_tuned_model_summary
peft_model_summaries = peft_summary

original_model_results = rouge.compute(
    predictions=original_model_summary,
    references=human_baseline_summaries[0:len(original_model_summary)],
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rougeL': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rougeLsum': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0))}
INSTRUCT MODEL:
{'rouge1': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rouge2': AggregateScore(low=Score(precision=0.0

Notice, that PEFT model results are not too bad, while the training process was much easier!

Calculate the improvement of PEFT over the original model:

In [49]:

for key in peft_model_results:
    # Extract fmeasure values from each model's results
    peft_score = peft_model_results[key].mid.fmeasure
    original_score = original_model_results[key].mid.fmeasure

    # Calculate the percentage improvement
    if original_score != 0:  # Prevent division by zero
        improvement = (peft_score - original_score) / original_score * 100
        print(f'{key}: {improvement:.2f}% improvement')
    else:
        print(f'{key}: Cannot calculate improvement due to zero original score')


rouge1: Cannot calculate improvement due to zero original score
rouge2: Cannot calculate improvement due to zero original score
rougeL: Cannot calculate improvement due to zero original score
rougeLsum: Cannot calculate improvement due to zero original score


Now calculate the improvement of PEFT over a full fine-tuned model:

In [50]:

for key in peft_model_results:
    # Extract fmeasure values from each model's results
    peft_score = peft_model_results[key].mid.fmeasure
    instruct_score = instruct_model_results[key].mid.fmeasure

    # Calculate the percentage improvement
    if instruct_score != 0:  # Prevent division by zero
        improvement = (peft_score - instruct_score) / instruct_score * 100
        print(f'{key}: {improvement:.2f}% improvement')
    else:
        print(f'{key}: Cannot calculate improvement due to zero score from instruct model')


rouge1: Cannot calculate improvement due to zero score from instruct model
rouge2: Cannot calculate improvement due to zero score from instruct model
rougeL: Cannot calculate improvement due to zero score from instruct model
rougeLsum: Cannot calculate improvement due to zero score from instruct model


You can see a small percentage decrease in the ROUGE metrics vs. full fine-tuned. However, the training requires much less computing and memory resources.