# Parameter-Efficient Fine-Tuning (PEFT) Techniques for Large Language Models (LLMs)

In this notebook, we will explore techniques for parameter-efficient fine-tuning (PEFT) of large language models. These methods allow us to adapt pre-trained language models to new tasks without updating all of the model's parameters, since full fine-tuning is computationally expensive and typically requires a large amount of data.

PEFT strategies are crucial in scenarios where computational resources are limited, or when working with large models like GPT, BERT, or T5. We'll discuss the following techniques:

- **LoRA (Low-Rank Adaptation)**
- **Prefix Tuning**

Let's dive in!

In [1]:
# Installing the necessary libraries
!pip install -q transformers datasets
# install peft from github
!pip install -q git+https://github.com/huggingface/peft

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for peft (pyproject.toml) ... done


## Low-Rank Adaptation (LoRA)
LoRA is a parameter-efficient fine-tuning technique that freezes the pre-trained weights and learns a low-rank update for selected weight matrices. Instead of training a full-rank update matrix, it trains two small matrices whose product approximates the update, so far fewer parameters (and far less optimizer memory) are needed.
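
To make the low-rank idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The sizes, the names `W`, `A`, `B`, and the helper `lora_forward` are illustrative assumptions and are not taken from the `peft` library. The sketch assumes the common convention that `B` starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 512, 384, 16, 32    # illustrative sizes, not read from a model

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init


def lora_forward(x):
    """Base projection plus the scaled low-rank update (alpha / r) * B @ A."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)
base_out = W @ x
lora_out = lora_forward(x)

full_update_params = d_out * d_in   # parameters in a full-rank update
lora_params = A.size + B.size       # parameters in the low-rank update
print(f"full update: {full_update_params:,} params, LoRA update: {lora_params:,} params")
```

Only `A` and `B` would receive gradients during training; `W` stays untouched, which is where the memory savings come from.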

### Data loading and preprocessing

In this example, we will use the SAMSum dataset, which consists of roughly 16k conversations. Each conversation comes with a summary. The objective is to fine-tune a model that generates a summary when given a dialogue.

In [2]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("nyamuda/samsum")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

print("Sample example:")
print(dataset['train'][0])

Train dataset size: 14732
Test dataset size: 819
Sample example:
{'id': '13818513', 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.', 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}


To train a model, the text should be converted to machine-readable units, which are the token IDs. This can be done by using a tokenizer.

In this example, we'll use a small model from BigScience, `mt0-small`, for demonstration.
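
As a rough illustration of what "converting text to token IDs" means, the toy tokenizer below maps whitespace-separated words to integers. It is only a stand-in: the real tokenizer loaded in the next cell uses a learned subword vocabulary, and the helper names `build_vocab`, `encode`, and `decode` are hypothetical.

```python
# Toy whitespace tokenizer, for illustration only.
def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Map words to IDs, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to words."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


corpus = ["I baked cookies", "Do you want some"]
vocab = build_vocab(corpus)
ids = encode("I want cookies", vocab)
print(ids)
print(decode(ids, vocab))
```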

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="bigscience/mt0-small"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [4]:
from datasets import concatenate_datasets
import numpy as np

# Tokenize the dialogues (the model inputs) without padding to measure their lengths.
# The maximum source length chosen here is applied later during preprocessing:
# longer sequences are truncated and shorter ones are padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# use the 85th percentile of the lengths as the cutoff for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# Tokenize the summaries (the model targets) without padding to measure their lengths.
# The maximum target length chosen here is applied later during preprocessing:
# longer sequences are truncated and shorter ones are padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use the 90th percentile of the lengths as the cutoff for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max source length: 266


Max target length: 55
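
The reason for choosing a percentile rather than the maximum length can be seen in a small sketch. The token counts below are made up for illustration; only the `np.percentile` call mirrors the cell above.

```python
import numpy as np

# Hypothetical token counts for 20 dialogues; two of them are unusually long.
lengths = [40, 52, 55, 61, 63, 70, 74, 80, 85, 90,
           95, 101, 110, 118, 125, 140, 160, 190, 480, 900]

cutoff = int(np.percentile(lengths, 85))

# Share of examples that fit below the cutoff without truncation
covered = sum(length <= cutoff for length in lengths) / len(lengths)

# Padding tokens needed when padding to the longest example vs. to the cutoff
padded_tokens_at_max = sum(max(lengths) - length for length in lengths)
padded_tokens_at_cutoff = sum(max(cutoff - length, 0) for length in lengths)

print(f"cutoff={cutoff}, max={max(lengths)}")
print(f"share of examples that fit without truncation: {covered:.0%}")
print(f"padding tokens: {padded_tokens_at_max} at max vs {padded_tokens_at_cutoff} at cutoff")
```

A few outliers would otherwise force every example to be padded to their length; the percentile cutoff trades truncating those outliers for much less wasted padding.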


We will now preprocess the data. It's handy to save the preprocessed data to disk so that this step does not have to be repeated.

In [5]:
def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs which was the dialogue
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument, which was the summary
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
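
The label-masking step inside `preprocess_function` can be isolated in a small sketch. The helper name `mask_padding` and the example IDs are hypothetical; the value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why padded positions stop contributing to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding IDs with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq]
        for seq in label_ids
    ]


# Hypothetical padded label batch, assuming pad_token_id = 0
batch = [[259, 1024, 7, 1, 0, 0],
         [88, 1, 0, 0, 0, 0]]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
```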


### Model loading and training

Now that we have our dataset ready, we can start the fine-tuning process. First we need to load the base model.

In [6]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

# When using a large model, you can quantize it to save memory by loading it in
# 4-bit or 8-bit precision, e.g. with 'load_in_4bit=True' or 'load_in_8bit=True'
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

To fine-tune a model with PEFT, you define a fine-tuning configuration and wrap the base model in a PEFT model object.

In [7]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
 r=16,
 lora_alpha=32,
 target_modules=["q", "v"],
 lora_dropout=0.05,
 bias="none",
 task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare the model for training when you use a quantized (8-bit or 4-bit) model
# model = prepare_model_for_kbit_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 688,128 || all params: 300,864,896 || trainable%: 0.2287


Here you can see that only about 0.23% of the parameters are being trained, which saves a lot of memory, especially for bigger models!
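
The reported count of 688,128 trainable parameters can be reproduced by hand. The sketch below assumes the usual mT5-small dimensions for `mt0-small` (hidden size 512, 6 heads of size 64, 8 encoder and 8 decoder layers); these values are assumptions and should be checked against `model.config`.

```python
# Assumed mt0-small (mT5-small) dimensions; check model.config to confirm.
d_model = 512          # hidden size
inner_dim = 6 * 64     # num_heads * d_kv
encoder_layers = 8
decoder_layers = 8
r = 16                 # LoRA rank from the config above

# Attention blocks: encoder self-attention, decoder self-attention, decoder cross-attention
attention_blocks = encoder_layers + 2 * decoder_layers
adapted_linears = attention_blocks * 2   # the "q" and "v" projections in each block

# Each adapted projection gets A with shape (r, d_model) and B with shape (inner_dim, r)
params_per_linear = r * d_model + inner_dim * r
trainable = adapted_linears * params_per_linear

all_params = 300_864_896  # total reported by print_trainable_parameters()
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Raising `r` or adding more entries to `target_modules` increases the trainable count linearly, which is a quick way to estimate the cost of a configuration before loading the model.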

Now we create a data collator that takes care of padding the data and creating batches.

In [8]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
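
What `pad_to_multiple_of=8` does can be sketched in a few lines. This is a simplified illustration with a hypothetical helper `pad_batch`, not the library's implementation: each batch is padded to its longest sequence, rounded up to the next multiple of 8, which tends to suit GPU kernels better than arbitrary lengths.

```python
def pad_batch(sequences, pad_id, multiple_of=8):
    """Pad every sequence to the batch maximum, rounded up to a multiple of `multiple_of`."""
    longest = max(len(seq) for seq in sequences)
    target = ((longest + multiple_of - 1) // multiple_of) * multiple_of
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]


# Hypothetical label sequences of different lengths, padded with the ignore index
labels = [[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]]
padded = pad_batch(labels, pad_id=-100)
print([len(seq) for seq in padded])
```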

Lastly, we define the hyperparameters of our training process.

In [9]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="tutorial"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=1,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

Now we can finally train the model.

In [None]:
trainer.train()

Step,Training Loss
500,2.2487
1000,2.2645
1500,2.2453


TrainOutput(global_step=1842, training_loss=2.242405861389624, metrics={'train_runtime': 406.5865, 'train_samples_per_second': 36.233, 'train_steps_per_second': 4.53, 'total_flos': 4154737764728832.0, 'train_loss': 2.242405861389624, 'epoch': 1.0})
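
The numbers in `TrainOutput` are consistent with each other, which is a useful sanity check. The sketch below infers the effective batch size from the reported throughput; it assumes a single device and no gradient accumulation, so each optimizer step consumes exactly one batch.

```python
import math

train_examples = 14732        # size of the train split printed earlier
global_steps = 1842           # from TrainOutput
samples_per_second = 36.233   # from TrainOutput
steps_per_second = 4.53       # from TrainOutput

# Assumption: one device and no gradient accumulation, so one step = one batch.
effective_batch_size = round(samples_per_second / steps_per_second)
steps_per_epoch = math.ceil(train_examples / effective_batch_size)

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch: {steps_per_epoch}")
```

If `auto_find_batch_size` had reduced the batch size because of memory pressure, the step count for one epoch would be correspondingly higher.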

### Model saving and evaluation

Make sure you save your model and reload it to check that everything works as expected!

In [None]:
# Save our LoRA model & tokenizer results
peft_model_id="path_to_trained_model"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# to also save the base model, call
# trainer.model.base_model.save_pretrained(peft_model_id)

('path_to_trained_model/tokenizer_config.json',
 'path_to_trained_model/special_tokens_map.json',
 'path_to_trained_model/spiece.model',
 'path_to_trained_model/added_tokens.json',
 'path_to_trained_model/tokenizer.json')

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc.
peft_model_id = "path_to_trained_model"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")

Peft model loaded


Try it with one example from the dataset to see if it works.

In [None]:
# use the first sample of the test set
sample = dataset['test'][0]

input_ids = tokenizer(sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}\n{'---'* 20}")

print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


input sentence: Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
------------------------------------------------------------
summary:
They called Betty last time she was


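That's it for the LoRA fine-tuning! Note that the generated summary is cut off mid-sentence because `max_new_tokens=10` caps the output at ten tokens; raise the limit (for example, to the maximum target length computed during preprocessing) to get complete summaries. The prompt above also omits the `summarize: ` prefix that was added during preprocessing, so including it keeps inference consistent with training.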

## Prefix Tuning

Prefix tuning learns a sequence of continuous, task-specific vectors (the prefix) that is prepended to the input. Only the prefix parameters are optimized while the base model stays frozen, which makes training efficient: the number of trainable parameters drops to a small fraction of the full model, reducing memory and compute costs.
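
In common implementations, the learned prefix enters each attention layer as extra key and value vectors that every real token can attend to. The NumPy sketch below shows this for a single attention head; all sizes and names (`prefix_K`, `prefix_V`, and so on) are illustrative assumptions rather than the `peft` internals.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, num_virtual_tokens, d = 5, 3, 8   # illustrative sizes

# Computed by the frozen model from the real input tokens
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

# Trainable prefix keys and values (the only parameters that would get gradients)
prefix_K = rng.normal(size=(num_virtual_tokens, d))
prefix_V = rng.normal(size=(num_virtual_tokens, d))


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


# Prepend the prefix so every real token can also attend to the virtual tokens
K_full = np.concatenate([prefix_K, K], axis=0)
V_full = np.concatenate([prefix_V, V], axis=0)

weights = softmax(Q @ K_full.T / np.sqrt(d))   # (seq_len, num_virtual_tokens + seq_len)
output = weights @ V_full                      # (seq_len, d)

print(weights.shape, output.shape)
```

The output keeps the original sequence length, so the rest of the network is unchanged; only the attention distribution is steered by the learned prefix.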


### Data loading and preprocessing

We will use the AG News dataset, which contains news texts labeled with one of four topics: world, sports, business, and sci/tech.

In [50]:
from datasets import load_dataset
from transformers import AutoTokenizer

# 1) load data
raw = load_dataset("pietrolesci/agnews")
train_ds = raw['train'].shuffle(seed=42).select(range(2000))
val_ds = raw['test'].shuffle(seed=42).select(range(500))

# 2) tokenizer and model name
MODEL_NAME = "bert-base-uncased" # For multi-class classification
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Tokenize the texts and keep the integer class labels.

In [53]:
# Tokenize inputs
def preprocess(batch):
    inputs = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    inputs["labels"] = batch["labels"]  # keep labels as integers
    return inputs

train_tok = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
val_tok = val_ds.map(preprocess, batched=True, remove_columns=val_ds.column_names)

### Model Loading and Training

Create a prefix tuning config and wrap the model with PEFT.

In [55]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import get_peft_model, PrefixTuningConfig

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)

peft_config = PrefixTuningConfig(
    task_type="SEQ_CLS",
    inference_mode=False,
    num_virtual_tokens=20,  # prefix length; try 10, 20, 50
    encoder_hidden_size=model.config.hidden_size,  # usually set automatically
)

model = get_peft_model(model, peft_config)


# Check number of trainable params vs total
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.4f}%)")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 371,716 / 109,857,032 (0.3384%)
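
The trainable count of 371,716 can be reconstructed by hand. The sketch assumes the standard `bert-base-uncased` dimensions (12 layers, hidden size 768) and that, for this sequence classification setup, the classification head is trained alongside the prefix; both are assumptions inferred from the reported number rather than read from the library.

```python
# Assumed bert-base-uncased dimensions; check model.config to confirm.
num_layers = 12
hidden_size = 768
num_virtual_tokens = 20   # from the PrefixTuningConfig above
num_labels = 4

# One key vector and one value vector per layer for every virtual token
prefix_params = num_virtual_tokens * num_layers * 2 * hidden_size

# Classification head: weight matrix plus bias
classifier_params = hidden_size * num_labels + num_labels

trainable = prefix_params + classifier_params
print(f"prefix: {prefix_params:,}  classifier: {classifier_params:,}  total: {trainable:,}")
```

Because the prefix size scales with `num_virtual_tokens`, doubling the prefix length roughly doubles the trainable parameters.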


Next, we define the training arguments for the Hugging Face `Trainer`.

In [59]:
training_args = TrainingArguments(
  output_dir="./prefix_tuning_bert",
  num_train_epochs=1,
  per_device_train_batch_size=8,
  per_device_eval_batch_size=8,
  do_eval=True,
  logging_steps=50,
  eval_strategy="epoch",
  save_total_limit=2,
  fp16=False, # enable if you have a GPU supporting mixed precision
)

In [29]:
!pip install -q evaluate

Define the evaluation metric, create the trainer, and start training the model.

In [60]:
import numpy as np
from evaluate import load as load_eval

metric = load_eval("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=train_tok,
  eval_dataset=val_tok,
  tokenizer=tokenizer,
  compute_metrics=compute_metrics,
)

trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy
1,1.3437,1.364044,0.328


TrainOutput(global_step=250, training_loss=1.3598179626464844, metrics={'train_runtime': 36.3637, 'train_samples_per_second': 55.0, 'train_steps_per_second': 6.875, 'total_flos': 131562614784000.0, 'train_loss': 1.3598179626464844, 'epoch': 1.0})
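
The accuracy column is produced by taking the arg-max over the logits and comparing with the labels. A dependency-free sketch of that computation, using made-up logits and a hypothetical helper `accuracy_from_logits`, looks like this:

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare against the reference labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four examples and four classes
logits = [[2.0, 0.1, -1.0, 0.3],
          [0.2, 1.5, 0.0, 0.1],
          [0.9, 0.1, 0.4, 0.2],
          [0.0, 0.3, 0.1, 1.2]]
labels = [0, 1, 2, 3]
result = accuracy_from_logits(logits, labels)
print(result)
```

For reference, random guessing over four roughly balanced classes gives an accuracy of about 0.25, so the 0.328 reported above after a single epoch on 2,000 examples is only modestly better than chance; more epochs or more data would likely be needed for a useful classifier.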

### Model saving and evaluation

Save and load the model

In [64]:
# Save PEFT adapter (this saves only the prefix parameters, not the entire base model)
model.save_pretrained("./prefix_tuning_adapter")

# Later: load the base model and the adapter
from peft import PeftModel
base = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
peft_model = PeftModel.from_pretrained(base, "./prefix_tuning_adapter")

# Now peft_model can be used for inference or further fine-tuning

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Inference

In [74]:
label_map = {0: 'world',
             1: 'sports',
             2: 'business',
             3: 'sci/tech'}

model.eval()  # set to evaluation mode

def predict(texts, max_length=128, device=None):
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)

    # Tokenize inputs
    enc = tokenizer(texts, truncation=True, padding=True, max_length=max_length, return_tensors="pt")
    enc = {k: v.to(device) for k, v in enc.items()}

    # Forward pass
    with torch.no_grad():
        outputs = model(**enc)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=-1).cpu().numpy()

    # Map indices to label names
    pred_names = [label_map[p] for p in preds]
    return pred_names

In [75]:
sentences = [
    "The stock market crashed yesterday due to inflation fears.",
    "The football match was thrilling with a last-minute goal!",
    "NASA successfully launched a new satellite into orbit.",
    "The president gave a speech about international trade."
]

predicted_labels = predict(sentences)
print(predicted_labels)

['world', 'sports', 'world', 'world']


## Conclusion

In this notebook, we explored two PEFT techniques for fine-tuning large language models. By training only a small number of parameters, these techniques allow us to adapt pre-trained models to new tasks more efficiently, without requiring extensive computational resources or massive amounts of data.
- **LoRA** freezes the pre-trained weights and learns low-rank updates, so only a small fraction of the parameters is trained.
- **Prefix Tuning** optimizes only the prefix parameters: a sequence of continuous task-specific vectors attached to the beginning of the input.

These methods enable us to leverage the power of large models while minimizing the computational cost.