# Introduction to LoRA and Prompt Tuning using PEFT

In this lab, you will explore two efficient fine-tuning techniques, LoRA (Low-Rank Adaptation) and Prompt Tuning, using the [PEFT (Parameter-Efficient Fine-Tuning) framework](https://huggingface.co/docs/peft/index). These techniques are gaining popularity for their ability to adapt pre-trained language models like FLAN-T5 to specific tasks, while only modifying a small percentage of model parameters. This approach reduces the computational resources needed, making it more feasible to fine-tune large models on tasks like text summarization or translation. By the end of this lab, you will have a practical understanding of full fine-tuning, LoRA, and prompt tuning, comparing their performance in both qualitative and quantitative terms. You'll be using the DialogSum dataset to fine-tune FLAN-T5 models, analyzing their results with the ROUGE metric, and reflecting on the efficiency of each method.

In [None]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

device = torch.device("mps" if torch.cuda.is_available() else "cpu")

  from .autonotebook import tqdm as notebook_tqdm


You are going to experiment with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from HuggingFace. Notice that you will be using the [small version of FLAN-T5](https://huggingface.co/google/flan-t5-small). Setting torch_dtype=torch.bfloat16 specifies the memory type to be used by this model.

In [13]:
model_name='google/flan-t5-small'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

It is possible to pull out the number of model parameters and find out how many of them are trainable.

In [14]:
def print_number_of_trainable_model_parameters(model):
    """
    Prints the number of trainable and total model parameters.

    This function iterates through the parameters of a given model and calculates:
    1. The total number of model parameters.
    2. The number of trainable parameters (those with `requires_grad=True`).

    It then returns a formatted string with the number of trainable parameters, total parameters,
    and the percentage of parameters that are trainable.

    Args:
        model (torch.nn.Module): The neural network model from which parameters are being counted.

    Returns:
        str: A string displaying the total number of parameters, trainable parameters, and
        the percentage of trainable parameters.

    Example:
        >>> model = YourModel()
        >>> print(print_number_of_trainable_model_parameters(model))
        trainable model parameters: 123456
        all model parameters: 234567
        percentage of trainable model parameters: 52.63%
    """
    # TODO: Implement the function
    trainable_model_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    all_model_params = sum(p.numel() for p in model.parameters())

    # TODO: Iterate through the parameters of the model and count the number of trainable and total parameters
    for _, param in model.named_parameters():
        if param.requires_grad == True:
          trainable_model_params += 1
        all_model_params += 1
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 76961342
all model parameters: 76961342
percentage of trainable model parameters: 100.00%


Test the model with the zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text which indicates the model can be fine-tuned to the task at hand.

In [15]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

## Perform Full Fine-Tuning

### Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation` and to the start of the summary with Summary as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.

Summary:
```
Training response (summary):

`Both Chris and Antje participated in the conversation.`

Then preprocess the prompt-response dataset into tokens and pull out their input_ids (1 per token).

In [16]:
def tokenize_function(example):
    """
    Tokenizes a given dialogue-summary example for model input.

    This function preprocesses an example from the dataset by constructing a prompt
    for summarization. It adds an instruction prompt before the dialogue and a "Summary"
    tag before the summary. Then, the input dialogue and the summary are tokenized,
    with padding and truncation applied to ensure the tokenized sequences fit the model's
    input size requirements.

    Args:
        example (dict): A dictionary containing two keys:
            - "dialogue" (list of str): List of dialogue strings to be summarized.
            - "summary" (list of str): Corresponding summaries for the dialogues.

    Returns:
        dict: A dictionary with the following updated keys:
            - "input_ids" (torch.Tensor): The tokenized input prompts.
            - "labels" (torch.Tensor): The tokenized summaries.
    """
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

To save some time in the lab, you will subsample the dataset:

In [17]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 1000 == 0, with_indices=True)

Check the shapes of all three parts of the dataset:

In [18]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (13, 2)
Validation: (1, 2)
Test: (2, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 13
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 2
    })
})


### Fine-Tune the Model
Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment. This fully fine-tuned model will also be referred to as the instruct model in this lab.

In [19]:
from copy import deepcopy

instruct_model = deepcopy(original_model)

In [20]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

# TODO: Play with different hyperparameters and training configurations, be careful with the training time
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_steps=1,
    save_strategy="no",
    report_to="none",
)

trainer = Trainer(
    model=instruct_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

In [21]:
trainer.train()

Step,Training Loss
1,58.25
2,57.25
3,56.75
4,60.25
5,58.0
6,56.75
7,57.25
8,55.25
9,57.0
10,56.0


TrainOutput(global_step=20, training_loss=57.4, metrics={'train_runtime': 10.3175, 'train_samples_per_second': 12.6, 'train_steps_per_second': 1.938, 'total_flos': 24165765611520.0, 'train_loss': 57.4, 'epoch': 10.0})

### Evaluate the model qualitatively

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model

In [22]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
How would you like to upgrade your computer?
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
How do you get the software?


### Evaluate model quantitatively (with ROUGE metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric) ) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [14]:
! pip install rouge_score

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [23]:
rouge = evaluate.load('rouge')

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [24]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Is this all correct?,"No, I don't. I don't think I'll be able to han..."
1,In order to prevent employees from wasting tim...,Is this all correct?,"- I'm sorry, sir. I'm sorry."
2,Ms. Dawson takes a dictation for #Person1# abo...,Is this all correct?,#Person1#: I'm going to be able to take a dict...
3,#Person2# arrives late because of traffic jam....,Talk to your boss.,Leaving the car is a good way to get home.
4,#Person2# decides to follow #Person1#'s sugges...,Talk to your boss.,"When I got home, I was driving to work."
5,#Person2# complains to #Person1# about the tra...,Talk to your boss.,Person1#: I'm not going to work.
6,#Person1# tells Kate that Masha and Hero get d...,"Kate, you know, I'm not sure.",Those are the two things that are going to hap...
7,#Person1# tells Kate that Masha and Hero are g...,"Kate, you know, I'm not sure.",#Person1#: I'm a little girl.
8,#Person1# and Kate talk about the divorce betw...,"Kate, you know, I'm not sure.","Masha and Hero divorced in the summer of 2014,..."
9,#Person1# and Brian are at the birthday party ...,"Brian, how are you?","Brian, how's your birthday?"


Evaluate the models computing ROUGE metrics. Notice the improvement in the results!

In [25]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.07088226588226587), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.07132602904342034), 'rougeLsum': np.float64(0.07267870615696703)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.16185858054200197), 'rouge2': np.float64(0.04228520017993702), 'rougeL': np.float64(0.1299042730297662), 'rougeLsum': np.float64(0.12921575197797286)}


The results show substantial improvement in all ROUGE metrics:

In [26]:
print("Absolute percentage improvement of INSTRUCT MODEL over HUMAN BASELINE")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over HUMAN BASELINE
rouge1: 9.10%
rouge2: 4.23%
rougeL: 5.86%
rougeLsum: 5.65%


## Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform Parameter Efficient Fine-Tuning (PEFT) fine-tuning as opposed to "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results as you will see soon.

PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

### Brief introduction to LoRA Tuning
LoRA is a re-parameterization technique. Its operation is simple, complex, and brilliant at the same time. It involves reducing the size of the matrices to be trained by dividing them in such a way that when multiplied, they yield the original matrix.

The weights that are modified are those of the reduced matrices, not the original matrix. It's better visualized in an image.

![](resources/lora_matrix_multiplication.webp)

We have an original matrix of 50x50, which means we would have to modify about 2500 parameters. However, as we know, if we multiply two matrices of (2x50) and (50x2), we obtain a 50x50 matrix. Yet, these two matrices are formed by only 100 parameters each. In other words, for the reduced matrices, we need to modify a total of 200 parameters compared to the 2500 of the original matrix. This represents a 92% reduction, and the larger the original matrix, the greater the percentage of savings.

In Language Models like GPT-3 or any of the current ones with LoRA, it's possible that we only need to train about 0.02% of the original parameters. This varies for each model. The best part is that the obtained result is very similar to that of full fine-tuning, in some cases, it can even be better.

#### Setup the LoRA model for Fine-Tuning

You need to set up the LoRA model for fine-tuning with a new layer/parameter adapter. Using LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [27]:
from peft import LoraConfig, get_peft_model, TaskType

# TODO: Play with different hyperparameters and training configurations, be careful with the training time
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="lora_only",  # this specifies if the bias parameter should be trained.
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [28]:
lora_model = get_peft_model(deepcopy(original_model), lora_config)
print(print_number_of_trainable_model_parameters(lora_model))

trainable model parameters: 1376352
all model parameters: 78337694
percentage of trainable model parameters: 1.76%


#### Train LoRA Adapter

In [29]:
output_dir = f'./lora-dialogue-summary-training-{str(int(time.time()))}'

# TODO: Play with different hyperparameters and training configurations, be careful with the training time
lora_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=100,
    logging_steps=10,
    save_strategy="no",
    report_to="none",
)

lora_trainer = Trainer(
    model=lora_model,
    args=lora_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [30]:
lora_trainer.train()



Step,Training Loss
10,41.0375
20,20.725
30,8.2937
40,5.6188
50,4.7875
60,4.5469
70,4.3312
80,4.1
90,3.8547
100,3.625


TrainOutput(global_step=200, training_loss=6.5771875, metrics={'train_runtime': 90.9669, 'train_samples_per_second': 14.291, 'train_steps_per_second': 2.199, 'total_flos': 247153872076800.0, 'train_loss': 6.5771875, 'epoch': 100.0})

#### Evaluate the model qualitatively

#### Evaluate the model qualitatively

In [31]:
index = 200
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

lora_model_outputs = lora_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
lora_model_text_output = tokenizer.decode(lora_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'LoRA MODEL: {lora_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
How would you like to upgrade your computer?
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Inventors are a great choice.
---------------------------------------------------------------------------------------------------
LoRA MODEL: #Person2#: #Person2#: Adding a painting program to your software would allow you to make up your own flyers and banners for advertising. #Person2#: Adding a painting program to your computer would allow you to make up your own flyers and banners for advertising. #Person2#: Incorporated into your own flyers and banners. #Person2#: How can we do it? #Person2#: Incorpor

#### Evaluate the model quantitatively (with ROUGE metric)

In [32]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
lora_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    lora_model_outputs = lora_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200))
    lora_model_text_output = tokenizer.decode(lora_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    lora_model_summaries.append(lora_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, lora_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'lora_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,lora_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Is this all correct?,I'm going to ask you to take a dictation for t...,#Person1##: This memo should be distributed to...
1,In order to prevent employees from wasting tim...,Is this all correct?,DG: Is there a way to terminate the employee?,#Person1#: This should go out as an intra-offi...
2,Ms. Dawson takes a dictation for #Person1# abo...,Is this all correct?,"You know, I'm not going to be able to handle t...","#Person1#: No, sir. #Person2##: No, sir. #Pers..."
3,#Person2# arrives late because of traffic jam....,Talk to your boss.,It's a good idea to start driving.,#Person1#: Taking the subway to work and takin...
4,#Person2# decides to follow #Person1#'s sugges...,Talk to your boss.,Get ready for work.,#Person1# #Person##: Taking the subway would b...
5,#Person2# complains to #Person1# about the tra...,Talk to your boss.,People: I'm not going to be driving to work.,#Person1# is now here! #Person1## is the best ...
6,#Person1# tells Kate that Masha and Hero get d...,"Kate, you know, I'm not sure.",#Person1#: I think they are getting divorced.,"#Person1##: Kate, you don't think what's happe..."
7,#Person1# tells Kate that Masha and Hero are g...,"Kate, you know, I'm not sure.","Kate, I'm sorry, but I think it is the best th...",Masha and Hero are getting divorced. Masha and...
8,#Person1# and Kate talk about the divorce betw...,"Kate, you know, I'm not sure.",Talk to Kate.,Masha and Hero get custody of the kids and the...
9,#Person1# and Brian are at the birthday party ...,"Brian, how are you?","Happy Birthday, you're welcome.",#Person's birthday is #Person1#: Happy Birthda...


In [33]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

lora_model_results = rouge.compute(
    predictions=lora_model_summaries,
    references=human_baseline_summaries[0:len(lora_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('LoRA MODEL:')
print(lora_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.07088226588226587), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.07132602904342034), 'rougeLsum': np.float64(0.07267870615696703)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.15087294839257093), 'rouge2': np.float64(0.02790697674418604), 'rougeL': np.float64(0.1196357804003981), 'rougeLsum': np.float64(0.12053682309172212)}
LoRA MODEL:
{'rouge1': np.float64(0.23964105322995477), 'rouge2': np.float64(0.051669114154919156), 'rougeL': np.float64(0.195129626878431), 'rougeLsum': np.float64(0.19591454117754542)}


Calculate the improvement of LoRA over the original model:

In [34]:
print("Absolute percentage improvement of LoRA MODEL over HUMAN BASELINE")

improvement = (np.array(list(lora_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(lora_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of LoRA MODEL over HUMAN BASELINE
rouge1: 16.88%
rouge2: 5.17%
rougeL: 12.38%
rougeLsum: 12.32%


Now calculate the improvement of LoRA over a full fine-tuned model:

In [35]:
print("Absolute percentage improvement of LoRA MODEL over INSTRUCT MODEL")

improvement = (np.array(list(lora_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(lora_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of LoRA MODEL over INSTRUCT MODEL
rouge1: 8.88%
rouge2: 2.38%
rougeL: 7.55%
rougeLsum: 7.54%


### Brief introduction to Prompt Tuning

It’s an Additive Fine-Tuning technique for models. This means that we WILL NOT MODIFY ANY WEIGHTS OF THE ORIGINAL MODEL. You might be wondering, how are we going to perform fine-tuning then? Well, we will train additional layers that are added to the model. That’s why it’s called an Additive technique.

Considering it’s an Additive technique and its name is Prompt-Tuning, it seems clear that the layers we’re going to add and train are related to the prompt.

![](resources/prompt_tuning.jpg)

We are creating a type of superprompt by enabling a model to enhance a portion of the prompt with its acquired knowledge. However, that particular section of the prompt cannot be translated into natural language. It's as if we've mastered expressing ourselves in embeddings and generating highly effective prompts.

In each training cycle, the only weights that can be modified to minimize the loss function are those integrated into the prompt.

The primary consequence of this technique is that the number of parameters to train is genuinely small. However, we encounter a second, perhaps more significant consequence, namely that, since we do not modify the weights of the pretrained model, it does not alter its behavior or forget any information it has previously learned.

The training is faster and more cost-effective. Moreover, we can train various models, and during inference time, we only need to load one foundational model along with the new smaller trained models because the weights of the original model have not been altered

#### Setup the Prompt tuning model for Fine-Tuning

You need to set up the Prompt tuning model for fine-tuning with a new layer/parameter adapter.

In [36]:
from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit

NUM_VIRTUAL_TOKENS = 20 #Number of virtual tokens to be added and trained.

# TODO: Play with different hyperparameters and training configurations, be careful with the training time
prompt_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, #This type indicates the model will generate text.
    prompt_tuning_init=PromptTuningInit.RANDOM,  #The added virtual tokens are initializad with random numbers
    num_virtual_tokens=NUM_VIRTUAL_TOKENS, #Number of virtual tokens to be added and trained.
    tokenizer_name_or_path=model_name #The pre-trained model.
)

Add Prompt tuning adapter layers/parameters to the original LLM to be trained.

In [37]:
prompt_model = get_peft_model(deepcopy(original_model),
                            lora_config)
print(print_number_of_trainable_model_parameters(prompt_model))

trainable model parameters: 1376352
all model parameters: 78337694
percentage of trainable model parameters: 1.76%


#### Train Prompt tuning Adapter

In [38]:
output_dir = f'./prompt-tuning-dialogue-summary-training-{str(int(time.time()))}'

# TODO: Play with different hyperparameters and training configurations, be careful with the training time
prompt_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=100,
    logging_steps=10,
    save_strategy="no",
    report_to="none",
)

prompt_trainer = Trainer(
    model=prompt_model,
    args=prompt_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [39]:
prompt_trainer.train()



Step,Training Loss
10,41.0125
20,20.2063
30,7.4781
40,5.3781
50,4.7469
60,4.5031
70,4.2938
80,4.1031
90,3.8781
100,3.6328


TrainOutput(global_step=200, training_loss=6.464609375, metrics={'train_runtime': 91.0175, 'train_samples_per_second': 14.283, 'train_steps_per_second': 2.197, 'total_flos': 247153872076800.0, 'train_loss': 6.464609375, 'epoch': 100.0})

#### Evaluate the model qualitatively

In [40]:
index = 200
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

lora_model_outputs = lora_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
lora_model_text_output = tokenizer.decode(lora_model_outputs[0], skip_special_tokens=True)

prompt_model_outputs = prompt_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
prompt_model_text_output = tokenizer.decode(prompt_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'LoRA MODEL: {lora_model_text_output}')
print(dash_line)
print(f'PROMPT-TUNING MODEL: {prompt_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
How would you like to upgrade your computer?
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Inventors are a great choice.
---------------------------------------------------------------------------------------------------
LoRA MODEL: #Person2#: #Person2#: Adding a painting program to your software would allow you to make up your own flyers and banners for advertising. #Person2#: Adding a painting program to your computer would allow you to make up your own flyers and banners for advertising. #Person2#: Incorporated into your own flyers and banners. #Person2#: How can we do it? #Person2#: Incorpor

#### Evaluate the model quantitatively (with ROUGE metric)

In [41]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
lora_model_summaries = []
prompt_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    lora_model_outputs = lora_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200))
    lora_model_text_output = tokenizer.decode(lora_model_outputs[0], skip_special_tokens=True)

    prompt_model_outputs = lora_model.generate(input_ids=input_ids.to(device), generation_config=GenerationConfig(max_new_tokens=200))
    prompt_model_text_output = tokenizer.decode(prompt_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    lora_model_summaries.append(lora_model_text_output)
    prompt_model_summaries.append(prompt_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, lora_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'lora_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,lora_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Is this all correct?,#Person1#: I'm sorry to say that I'm sorry. I'...,#Person1#: This should go out as an intra-offi...
1,In order to prevent employees from wasting tim...,Is this all correct?,#Person1#: This should go out as an intra-offi...,"Yours. #Person1#: #Person1#: #Person1#: No, no..."
2,Ms. Dawson takes a dictation for #Person1# abo...,Is this all correct?,Employees will be allowed to use Instant Messa...,#Person1#: #Person1#: #Person1#####Person1##: ...
3,#Person2# arrives late because of traffic jam....,Talk to your boss.,"#Person1#: I'm not going to drive to work, but...",#Person2# #Person##: Taking the subway would b...
4,#Person2# decides to follow #Person1#'s sugges...,Talk to your boss.,Get out of traffic.,#Person1#: I'm going to quit driving to work a...
5,#Person2# complains to #Person1# about the tra...,Talk to your boss.,#Person1#: I'm going to drive to work.,@Person1#: I think that's a good idea to quit ...
6,#Person1# tells Kate that Masha and Hero get d...,"Kate, you know, I'm not sure.",Apparently the divorce is going to be between ...,#Person2#: Masha and Hero are getting divorced.
7,#Person1# tells Kate that Masha and Hero are g...,"Kate, you know, I'm not sure.",They are divorced.,Masha and Hero are getting divorced. #Person1#...
8,#Person1# and Kate talk about the divorce betw...,"Kate, you know, I'm not sure.",||||||||||||||||,Masha and Hero are getting divorced. #Person1#...
9,#Person1# and Brian are at the birthday party ...,"Brian, how are you?",#Person1#: Thanks for the invite.,#Person1#: Happy birthday to you.


In [43]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

lora_model_results = rouge.compute(
    predictions=lora_model_summaries,
    references=human_baseline_summaries[0:len(lora_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

prompt_model_results = rouge.compute(
    predictions=prompt_model_summaries,
    references=human_baseline_summaries[0:len(prompt_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('LoRA MODEL:')
print(lora_model_results)
print('PROMPT-TUNING MODEL:')
print(prompt_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.07088226588226587), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.07132602904342034), 'rougeLsum': np.float64(0.07267870615696703)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.1741389918243465), 'rouge2': np.float64(0.01950904392764858), 'rougeL': np.float64(0.1505039636183801), 'rougeLsum': np.float64(0.15082299241566977)}
LoRA MODEL:
{'rouge1': np.float64(0.24383552356506355), 'rouge2': np.float64(0.07740200945119963), 'rougeL': np.float64(0.19685001344064906), 'rougeLsum': np.float64(0.19906048597629938)}
PROMPT-TUNING MODEL:
{'rouge1': np.float64(0.20347502975319065), 'rouge2': np.float64(0.04160473391803565), 'rougeL': np.float64(0.1724890007832514), 'rougeLsum': np.float64(0.17150979566882546)}


Calculate the improvement of Prompt-tuning over the original model:

In [44]:
print("Absolute percentage improvement of PROMPT-TUNING MODEL over HUMAN BASELINE")

improvement = (np.array(list(prompt_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(prompt_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PROMPT-TUNING MODEL over HUMAN BASELINE
rouge1: 13.26%
rouge2: 4.16%
rougeL: 10.12%
rougeLsum: 9.88%


Calculate the improvement of LoRA over a full fine-tuned model:

In [45]:
print("Absolute percentage improvement of PROMPT-TUNING MODEL over INSTRUCT MODEL")

improvement = (np.array(list(prompt_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(prompt_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PROMPT-TUNING MODEL over INSTRUCT MODEL
rouge1: 2.93%
rouge2: 2.21%
rougeL: 2.20%
rougeLsum: 2.07%


Now, calculate the improvement of Prompt-tuning over a LoRA:

In [46]:
print("Absolute percentage improvement of PROMPT-TUNING MODEL over LoRA MODEL")

improvement = (np.array(list(prompt_model_results.values())) - np.array(list(lora_model_results.values())))
for key, value in zip(prompt_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PROMPT-TUNING MODEL over LoRA MODEL
rouge1: -4.04%
rouge2: -3.58%
rougeL: -2.44%
rougeLsum: -2.76%


# Questions

## Preprocessing and Tokenization:

- Why is it important to prepend instructions like "Summarize the following conversation" when constructing prompts for training a language model?
>>> Porque le dice al modelo qué tarea debe hacer. Sin esa instrucción, el modelo solo vería texto y no sabría si debe resumir, traducir o continuar.

- How does tokenization affect the model’s performance? What challenges might arise from long input sequences in tasks like summarization?
>>> La tokenización afecta al rendimiento porque convierte el texto en tokens y, si genera muchos, el modelo necesita más memoria y tarda más en procesar. Una tokenización eficiente mejora velocidad y comprensión.
En secuencias largas, el modelo tiene un límite de tokens, así que puede perder información del principio y hacer resúmenes menos precisos.

## Model Performance and Training:

Why do you think full fine-tuning achieves better results than zero-shot learning but might be less efficient for large-scale applications?
>>> El full fine-tuning da mejores resultados porque el modelo aprende directamente de los datos específicos de la nueva tarea, mientras que en zero-shot solo usa su conocimiento previo. Sin embargo, es menos eficiente en aplicaciones grandes porque requiere actualizar todos los parámetros, lo que consume mucha memoria, tiempo y recurso

## LoRA Fine-Tuning:

- How does LoRA reduce the number of trainable parameters compared to full fine-tuning, and why might this be beneficial for larger models?
>>> LoRA reduce los parámetros entrenables añadiendo pequeñas matrices que se entrenan aparte, en lugar de modificar todo el modelo. Esto es útil en modelos grandes porque mantiene un rendimiento alto usando mucha menos memoria y potencia de cómputo.

- LoRA modifies certain attention weights in the model. Why do you think only specific parts of the model are updated, and how does this affect its generalization to new tasks?
>>> LoRA solo actualiza las capas de atención, que son las más importantes para entender relaciones entre palabras. Al modificar solo esas partes, el modelo se adapta a la nueva tarea sin olvidar lo que ya sabía, mejorando su capacidad de generalizar a otros contextos.

## Prompt Tuning:

- In your own words, explain how prompt tuning differs from both full fine-tuning and LoRA. Why is it referred to as an additive fine-tuning technique?
>>> Prompt tuning es diferente porque no cambia los pesos del modelo ni añade nuevas capas. En lugar de eso, entrena unos vectores especiales que se añaden al inicio del texto para guiar la respuesta del modelo. Se llama técnica aditiva porque simplemente añade información al input sin modificar el modelo original.

- How does prompt tuning impact the number of parameters that are trained? Why is this method more efficient than full fine-tuning?
>>> En prompt tuning solo se entrenan los vectores adicionales, así que el número de parámetros es muy bajo. Esto lo hace mucho más rápido, barato y eficiente que el full fine-tuning, que requiere ajustar todo el modelo.

- How do the results from prompt-tuning compare to LoRA and full fine-tuning? Which technique performed best in terms of ROUGE scores?
>>> En general, el full fine-tuning logra los mejores ROUGE scores, LoRA obtiene resultados casi igual de buenos, y prompt tuning queda un poco por debajo, aunque sigue siendo competitivo y mucho más eficiente.

## Efficiency and Trade-offs:

- Given the results of your experiments, which fine-tuning method (LoRA, full fine-tuning, or prompt-tuning) do you think strikes the best balance between computational efficiency and model performance? Why?
>>> LoRA es el método que logra el mejor equilibrio, porque mantiene un rendimiento muy cercano al full fine-tuning pero con mucho menor coste de entrenamiento y uso. Es eficiente, rápido y no necesita tantos recursos, lo que lo hace ideal para tareas grandes o repetitivas.

- If you were to deploy one of these models in a production system with limited computational resources, which approach would you choose and why?
>>>Elegiría LoRA, porque ofrece buena precisión sin exigir tanta memoria ni potencia. Permite actualizar el modelo fácilmente y usarlo en entornos con recursos limitados sin perder demasiada calidad en los resultados.

- How would you extend these methods to other tasks beyond summarization (e.g., machine translation or question-answering)?
>>> Estos métodos pueden aplicarse igual a tareas como traducción o preguntas y respuestas, solo cambiando los datos y la instrucción del prompt. El enfoque de ajuste se mantiene igual, adaptado a la nueva tarea.
