# Lab 10: Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a method to efficiently fine-tune large pre-trained language models. Instead of updating all the parameters of a model, LoRA introduces trainable low-rank matrices that are added to the original weights, keeping the pre-trained weights frozen. This approach significantly reduces the number of parameters that need to be optimized, leading to lower computational costs and faster training.


### Key Ideas:

1. **Parameter Efficiency**: Instead of updating the entire weight matrices in the transformer layers, LoRA adds low-rank matrices with a small rank (e.g., 1 or 2; this lab uses `r=8`), which are much smaller than the original weights.
2. **Frozen Base Model**: The base pre-trained model remains untouched, preserving its knowledge while adapting it to new tasks through the low-rank updates.
3. **Scalability**: LoRA is highly scalable and can be applied to models with billions of parameters while requiring only a fraction of the resources needed for full fine-tuning.

### How It Works:

LoRA modifies a weight matrix \( W \) in the transformer model as follows:

$W \rightarrow W + \Delta W \quad \text{where} \quad \Delta W = A \cdot B$

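- For \( W \in \mathbb{R}^{d \times k} \), the low-rank factors are \( A \in \mathbb{R}^{d \times r} \) and \( B \in \mathbb{R}^{r \times k} \), with rank \( r \ll \min(d, k) \).
- \( A \) and \( B \) are the only trainable parameters, so each adapted layer trains \( r(d + k) \) values instead of \( d \cdot k \).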
  
![LoRA Schematic](https://cdn.prod.website-files.com/62c4a9809a85693c49c4674f/65b80a7f61892487cf1e3af6_lora-1.png)

1. The original weight matrix \( W \) is frozen.
2. Trainable low-rank matrices \( A \) and \( B \) are added to adapt the model for a new task.

This low-rank adaptation ensures that the model can adapt to specific tasks without updating the full weight matrices.
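
A minimal NumPy sketch of the update above, using this document's convention \( \Delta W = A \cdot B \). The shapes `d`, `k`, and `r` are hypothetical values chosen for illustration, not taken from FLAN-T5, and initialising `B` to zero is an assumption made so that the adapted layer starts out identical to the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 512, 512, 8                 # hypothetical layer shape and LoRA rank

W = rng.normal(size=(d, k))           # frozen pre-trained weight (never updated)
A = rng.normal(size=(d, r)) * 0.01    # trainable factor, d x r
B = np.zeros((r, k))                  # trainable factor, r x k (zero init: delta starts at 0)

def lora_forward(x, W, A, B):
    """Compute x @ (W + A @ B) without materialising the d x k update."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(4, d))
y = lora_forward(x, W, A, B)

full_params = W.size                  # d * k
lora_params = A.size + B.size         # r * (d + k)
ratio = lora_params / full_params

print(y.shape)
print(f"trainable fraction: {ratio:.4%}")
```

Because `B` starts at zero, the first forward pass matches the frozen model exactly; training then only moves `A` and `B`.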

### Advantages of LoRA:

1. **Efficiency**: Requires less memory and computational resources.
2. **Speed**: Faster training compared to full fine-tuning.
3. **Reusability**: Pre-trained weights are preserved and can be reused across multiple tasks.

In [None]:
# !pip install "peft==0.2.0"
# !pip install "transformers==4.27.2" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" loralib --upgrade --quiet
# !pip install rouge-score tensorboard py7zr


In [1]:
rm -rf ~/.cache/huggingface

In [1]:
import evaluate
import numpy as np
from datasets import concatenate_datasets, load_dataset, load_from_disk
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from tqdm import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)



In [2]:
dataset = load_dataset("samsum")
 
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

Train dataset size: 14732
Test dataset size: 819


In [3]:
model_id="google/flan-t5-xxl"
 
# Load tokenizer of FLAN-T5-XXL
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [4]:
# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, shorter ones will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["dialogue"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# use the 85th percentile of the input lengths (not the maximum) to limit padding
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, shorter ones will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["summary"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use the 90th percentile of the target lengths (not the maximum) to limit padding
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Max source length: 255


Max target length: 50
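
To illustrate the percentile cutoff used above, here is a small sketch on made-up token counts (the `lengths` array is hypothetical and not drawn from SAMSum). It shows how a percentile-based limit keeps a single very long outlier from dictating the padded length, at the cost of truncating the longest sequences.

```python
import numpy as np

# hypothetical token counts for ten dialogues; the last one is an outlier
lengths = np.array([40, 55, 60, 72, 80, 95, 110, 130, 150, 900])

max_len = int(np.percentile(lengths, 85))     # percentile-based limit
truncated = int((lengths > max_len).sum())    # sequences longer than the limit

print(f"85th percentile cutoff: {max_len}")
print(f"longest sequence: {lengths.max()}")
print(f"sequences that would be truncated: {truncated}")
```

Padding every example to the outlier's length would spend most of each batch on pad tokens; the percentile limit trades a small amount of truncation for much shorter tensors.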


In [5]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]
 
    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
 
    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)
 
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
 
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
 
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")
 
# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
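
The `-100` substitution in `preprocess_function` matters because the training loss skips positions carrying that label. The sketch below mimics the effect with plain NumPy. The value `pad_id = 0`, the toy labels, the log-probabilities, and the `masked_nll` helper are all hypothetical and only illustrate the idea; this is not the library's implementation.

```python
import numpy as np

pad_id = 0            # assumed pad token id for this toy example
ignore_index = -100   # label value that the loss should skip

# one target sequence padded to length 6
labels = [37, 12, 5, 1, pad_id, pad_id]
masked = [tok if tok != pad_id else ignore_index for tok in labels]

def masked_nll(token_log_probs, labels, ignore_index=-100):
    """Average negative log-likelihood over positions that are not ignored."""
    labels = np.asarray(labels)
    keep = labels != ignore_index
    return float(-np.asarray(token_log_probs)[keep].mean())

# hypothetical log-probabilities assigned to the target token at each position
log_probs = np.log([0.5, 0.25, 0.5, 0.25, 0.01, 0.01])

print(masked)
print(f"loss with padding masked:   {masked_nll(log_probs, masked):.4f}")
print(f"loss with padding included: {masked_nll(log_probs, labels):.4f}")
```

Without the mask, the poorly predicted pad positions inflate the loss and push the model toward predicting padding rather than summary tokens.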



In [9]:
# huggingface hub model id
model_id = "philschmid/flan-t5-xxl-sharded-fp16"
 
# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [10]:
############### YOUR CODE STARTS HERE ###############
# Define LoRA Config from peft library
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

# prepare the int-8 model for training
# (prepare_model_for_kbit_training is the helper imported at the top of the notebook)
model = prepare_model_for_kbit_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
############### YOUR CODE ENDS HERE ###############

model.print_trainable_parameters()

trainable params: 18874368 || all params: 11154206720 || trainable%: 0.16921300163961817
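
The trainable-parameter count can be sanity-checked by hand, since each adapted matrix adds `r * (d_in + d_out)` parameters. The sketch below assumes the T5-XXL layout: 4096-dimensional inputs and outputs for the `q` and `v` projections, 24 encoder and 24 decoder blocks, and a decoder with both self-attention and cross-attention. These shapes are assumptions that this lab does not state, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, d_in=4096, d_out=4096,
                     encoder_layers=24, decoder_layers=24,
                     targets_per_attention=2):
    """Count LoRA parameters when adapting q and v in every attention block.

    All default shapes are assumptions about the T5-XXL architecture.
    """
    # encoder blocks have self-attention; decoder blocks add cross-attention
    attention_blocks = encoder_layers + 2 * decoder_layers
    adapted_matrices = attention_blocks * targets_per_attention
    return adapted_matrices * r * (d_in + d_out)

for r in (8, 16):
    print(f"r={r}: {lora_param_count(r):,} trainable parameters")
```

Under these assumptions, rank 16 reproduces the 18,874,368 printed above, while rank 8 would give 9,437,184. It is therefore worth confirming that the printed figure comes from a run with the same `r` as the config in this cell.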


In [None]:
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

In [None]:
output_dir = "lora-flan-t5-xxl"
# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # higher learning rate
    num_train_epochs=1,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)
 
# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [13]:
# train model
trainer.train()

TrainOutput(global_step=1842, training_loss=1.1217418923310685, metrics={'train_runtime': 3233.9189, 'train_samples_per_second': 4.555, 'train_steps_per_second': 0.57, 'total_flos': 2.4942341612843827e+17, 'train_loss': 1.1217418923310685, 'epoch': 1.0})
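
The numbers in `TrainOutput` are consistent with each other, which a little arithmetic confirms. The batch size of 8 below is inferred from `global_step` and is not logged anywhere, because the run used `auto_find_batch_size=True`. Treat it as an assumption, together with a single device and no gradient accumulation.

```python
import math

train_samples = 14732       # from the dataset cell
global_step = 1842          # from TrainOutput
train_runtime = 3233.9189   # seconds, from TrainOutput

assumed_batch_size = 8      # assumption: inferred from global_step, not logged
steps_per_epoch = math.ceil(train_samples / assumed_batch_size)

samples_per_second = round(train_samples / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 2)

print(f"steps per epoch: {steps_per_epoch}")
print(f"samples per second: {samples_per_second}")
print(f"steps per second: {steps_per_second}")
print(f"runtime in minutes: {train_runtime / 60:.1f}")
```

With that batch size, one epoch over the training split accounts for every reported step, and the throughput figures follow directly from the runtime.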


In [14]:
# Metric
metric = evaluate.load("rouge")
 
def evaluate_peft_model(sample, max_target_length=50):
    # generate summary with nucleus sampling (results vary between runs)
    outputs = model.generate(
        input_ids=sample["input_ids"].unsqueeze(0).cuda(),
        do_sample=True,
        top_p=0.9,
        max_new_tokens=max_target_length,
    )
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample["labels"] != -100, sample["labels"], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # return the generated summary and the reference summary
    return prediction, labels
 
# load test dataset from disk
test_dataset = load_from_disk("data/eval/").with_format("torch")
 
# run predictions
# this can take ~45 minutes
predictions, references = [] , []
for sample in tqdm(test_dataset):
    p,l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)
 
# compute metric
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)

# print results
print(f"rouge1: {rouge['rouge1'] * 100:.2f}%")
print(f"rouge2: {rouge['rouge2'] * 100:.2f}%")
print(f"rougeL: {rouge['rougeL'] * 100:.2f}%")
print(f"rougeLsum: {rouge['rougeLsum'] * 100:.2f}%")

100%|██████████| 819/819 [44:15<00:00,  3.24s/it]
rouge1: 47.74%
rouge2: 21.63%
rougeL: 38.46%
rougeLsum: 38.45%
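
For intuition about what these scores measure, here is a simplified ROUGE-1 F1 computed from unigram overlap. It is only a sketch: the `rouge1_f1` helper and the example strings are hypothetical, and it omits the stemming and tokenisation rules that the `rouge` metric applies, so its numbers will not match `evaluate` exactly.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased whitespace tokens (no stemming)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())   # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "amanda baked cookies for jerry",
    "amanda baked cookies and will bring jerry some",
)
print(f"toy ROUGE-1 F1: {score:.2f}")
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.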
