<a href="https://colab.research.google.com/github/ThomasHeap/Examples/blob/main/M2L_summer_school/NLP/part_III_llm_finetuning/LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Natural Language Processing Tutorial
======

This is the tutorial of the 2024 [Mediterranean Machine Learning Summer School](https://www.m2lschool.org/) on Natural Language Processing!

This tutorial will explore the fundamental aspects of Natural Language Processing (NLP). Basic Python programming skills are expected.
Prior knowledge of standard NLP techniques (e.g. text tokenization and classification with ML) is beneficial but optional when working through the notebooks as they assume minimal prior knowledge.

This tutorial combines detailed analysis and development of essential NLP concepts via custom (i.e. from scratch) implementations. Other necessary NLP components will be developed using PyTorch's NLP library implementations. As a result, the tutorial offers deep understanding and facilitates easy usage in future applications.

## Outline

* Part I: Introduction to Text Tokenization and Classification
  *  Text Classification: Simple Classifier
  *  Text Classification: Encoder-only Transformer

* Part II: Introduction to Decoder-only Transformer and Sparse Mixture of Experts Architecture
  *  Text Generation: Decoder-only Transformer
  *  Text Generation: Decoder-only Transformer + MoE

* Part III: Introduction to Parameter Efficient Fine-tuning
  *  Fine-tuning the full Pre-trained Models
  *  Fine-tuning using Low-Rank Adaptation of Large Language Models (LoRA)

## Notation

* Sections marked with [📚] contain cells that you should read, modify and complete to understand how your changes alter the obtained results.
* External resources are mentioned with [✨]. These provide valuable supplementary information for this tutorial and offer opportunities for further in-depth exploration of the topics covered.


## Libraries

This tutorial leverages [PyTorch](https://pytorch.org/) for neural network implementation and training, complemented by standard Python libraries for data processing and the [Hugging Face](https://huggingface.co/) datasets library for accessing NLP resources.

GPU access is recommended for optimal performance, particularly for model training and text generation. While all code can run on CPU, a CUDA-enabled environment will significantly speed up these processes.

## Credits

The tutorial is created by:

* [Luca Herranz-Celotti](http://LuCeHe.github.io)
* [Georgios Peikos](https://www.linkedin.com/in/peikosgeorgios/)

It is inspired by and synthesizes various online resources, which are cited throughout for reference and further reading.

## Note for Colab users

To grab a GPU (if available), make sure you go to `Edit -> Notebook settings` and choose a GPU under `Hardware accelerator`



## Part III: Introduction to Parameter Efficient Fine-tuning

This part shows how a model that has been pre-trained on a lot of data can be adapted to a downstream task by fine-tuning it on a target dataset. The first idea is to update all the weights of the network for the new task, but this can be resource-intensive. A cheaper alternative is to freeze all the weights except the final output linear layer. This trains faster, but usually at the cost of accuracy on the target task. Finally, we introduce a newer family of methods, Parameter Efficient Fine-Tuning (PEFT), and one approach in that family, LoRA, which improves on head-only fine-tuning while remaining less resource-intensive than full fine-tuning.


## Step 1: Load Packages


In [1]:
!pip install peft datasets evaluate

Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading peft-0.12.0-py3-none-any.whl (296 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)

In [2]:
import math
import torch
import torch.nn as nn

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification
from transformers.utils import PushToHubMixin

from peft.tuners.lora.layer import dispatch_default, Linear
from peft.tuners.tuners_utils import BaseTunerLayer
from peft import LoraConfig, PeftModel, LoraModel, get_peft_model
from datasets import load_dataset

import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer

We will fine-tune ✨ [BERT](https://arxiv.org/pdf/1810.04805), a well-known language model built on the Transformer encoder and widely used for text classification. Pre-trained checkpoints are openly available from several sources. We use the Hugging Face libraries, which have become a standard for Large Language Models and include many convenient tools for language processing and generation.

In [3]:
model_name_or_path = "google-bert/bert-base-cased"
tokenizer_name_or_path = "google-bert/bert-base-cased"

## 📚 Step 2: Load Dataset

Let's pick a dataset and use the tokenizer that corresponds to the BERT model. The ✨ [Yelp reviews dataset](https://huggingface.co/datasets/Yelp/yelp_review_full) consists of reviews from Yelp, each rated with one to five stars. The network receives the review text as input and has to predict the number of stars that corresponds to it.
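
As a small illustration of the target encoding: the dataset stores each rating as an integer `label` from 0 to 4, where 0 stands for one star and 4 for five stars. The helper name `label_to_stars` is hypothetical and only used for this sketch.

```python
def label_to_stars(label: int) -> int:
    """Convert a yelp_review_full class index (0-4) into a star rating (1-5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"label must be in [0, 4], got {label}")
    return label + 1


# Print the full mapping between class indices and star ratings.
for label in range(5):
    print(f"label {label} -> {label_to_stars(label)} star(s)")
```

This is why a sample with `'label': 0` is a one-star review, and why the model is created with `num_labels=5`.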



In [4]:
# EXERCISE: load the yelp_review_full dataset using load_dataset
dataset = load_dataset("Yelp/yelp_review_full")

print(dataset)
print(dataset["train"][100])

# EXERCISE: load the BERT tokenizer with AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)

def tokenize_function(examples):
    # EXERCISE: pad to max length and truncate sentences
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})
{'label': 0, 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t mak

## 📚 Step 3: Define Training and Evaluation Loop

Let's standardize the training and evaluation loop, so that the three fine-tuning techniques can be compared under the same settings.
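
To make the evaluation metric concrete, here is a minimal NumPy sketch of the accuracy computation used in the loop: the predicted class is the index of the largest logit, and accuracy is the fraction of predictions that match the labels. The logits and labels are made-up toy values, and `toy_accuracy` is a hypothetical helper, not part of the `evaluate` library.

```python
import numpy as np


def toy_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of rows whose argmax over the class axis equals the label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))


# Three made-up reviews scored over the five star classes.
logits = np.array([
    [2.0, 0.1, 0.0, -1.0, -2.0],   # argmax is class 0
    [0.0, 0.2, 0.1, 1.5, 0.3],     # argmax is class 3
    [0.1, 0.0, 0.9, 0.2, 0.4],     # argmax is class 2
])
labels = np.array([0, 3, 4])       # the last review is misclassified

print(toy_accuracy(logits, labels))
```

Regarding the step count: with the default per-device batch size of 8 in `TrainingArguments` and a single device, 1000 training examples over 2 epochs give 250 optimizer steps, which matches `global_step=250` in the recorded outputs.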

In [9]:
def train_and_evaluate(model, max_steps=-1, num_train_epochs=2, learning_rate=5e-5):
    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        # EXERCISE: the greedy prediction is the argmax of the logits
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    training_args = TrainingArguments(
        output_dir="test_trainer",
        num_train_epochs=num_train_epochs,
        max_steps=max_steps,
        learning_rate=learning_rate,
        label_names=["labels"],
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

    train_metrics = trainer.train()
    print(train_metrics)
    eval_metrics = trainer.evaluate()
    print(eval_metrics)

We also introduce an auxiliary function to count the number of trainable parameters in each case.

In [6]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable: {100 * trainable_params / all_param:.2f}%"
    )

## 📚 Step 4: Full Finetuning

The simplest option is to fine-tune the whole model: the pre-trained BERT and the new linear classifier on top. With a well-chosen learning rate this usually achieves the best final accuracy, but it is the slowest option. Every weight matrix in the model has to be updated, and these matrices can be very large, so gradients and optimizer state consume a lot of memory.
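
A rough back-of-the-envelope estimate shows why updating every weight is costly. The sketch assumes 32-bit floats and an Adam-style optimizer that keeps two moment estimates per trainable parameter (`Trainer` defaults to AdamW). It ignores activations, so real usage is higher. The parameter counts are the ones reported by `print_trainable_parameters` in this notebook, and `training_state_gib` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def training_state_gib(total_params: int, trainable_params: int) -> float:
    """Memory for weights, gradients and Adam moments in GiB (activations excluded)."""
    weights = total_params * BYTES_PER_FLOAT32
    gradients = trainable_params * BYTES_PER_FLOAT32
    adam_moments = 2 * trainable_params * BYTES_PER_FLOAT32
    return (weights + gradients + adam_moments) / 2**30


total = 108_314_117  # BERT-base plus the 5-class head

full = training_state_gib(total, trainable_params=total)
head_only = training_state_gib(total, trainable_params=3_845)

print(f"full fine-tuning: {full:.2f} GiB")
print(f"head only:        {head_only:.2f} GiB")
```

Under these assumptions, training all weights needs roughly four times the memory of the weights alone, while freezing most of the model leaves the cost close to that of storing the weights.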

In [7]:
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=5)
print_trainable_parameters(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 108314117 || all params: 108314117 || trainable: 100.00%


In [10]:
# EXERCISE: explore learning rates in the set [5e-2, 5e-3, 5e-4, 5e-5] to find the best
# one with this configuration
train_and_evaluate(model, learning_rate=5e-2)


TrainOutput(global_step=250, training_loss=8.76671484375, metrics={'train_runtime': 194.5996, 'train_samples_per_second': 10.278, 'train_steps_per_second': 1.285, 'total_flos': 526236284928000.0, 'train_loss': 8.76671484375, 'epoch': 2.0})


{'eval_loss': 1.9723984003067017, 'eval_accuracy': 0.166, 'eval_runtime': 26.6091, 'eval_samples_per_second': 37.581, 'eval_steps_per_second': 4.698, 'epoch': 2.0}


## 📚 Step 5: Head Finetuning

Another option is to freeze the weights of the pre-trained BERT and fine-tune only the head, the linear classifier that HuggingFace places on top when we pass `num_labels=5`. This drastically reduces the number of trainable parameters and therefore significantly speeds up fine-tuning.
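
The size of the head follows directly from its shape. Assuming the BERT-base hidden size of 768 and a single linear layer that maps it to the 5 classes, the count can be worked out by hand:

```python
hidden_size = 768      # BERT-base hidden dimension
num_labels = 5         # one class per star rating

weight_params = hidden_size * num_labels   # classifier.weight
bias_params = num_labels                   # classifier.bias
head_params = weight_params + bias_params

total_params = 108_314_117                 # full model, as counted in Step 4
print(head_params)
print(f"{100 * head_params / total_params:.4f}% of all parameters")
```

The result is the 3845 trainable parameters reported by `print_trainable_parameters`. The two-decimal format used there rounds this share down to `0.00%`.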

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=5)

# EXERCISE: set as trainable only the parameters of the classifier
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

print_trainable_parameters(model)


trainable params: 3845 || all params: 108314117 || trainable: 0.00%


In [12]:
# EXERCISE: explore learning rates in the set [5e-2, 5e-3, 5e-4, 5e-5] to find the best
# one with this configuration
train_and_evaluate(model, learning_rate=5e-2)


TrainOutput(global_step=250, training_loss=5.99334228515625, metrics={'train_runtime': 62.0626, 'train_samples_per_second': 32.226, 'train_steps_per_second': 4.028, 'total_flos': 526236284928000.0, 'train_loss': 5.99334228515625, 'epoch': 2.0})


{'eval_loss': 1.413234829902649, 'eval_accuracy': 0.365, 'eval_runtime': 28.9827, 'eval_samples_per_second': 34.503, 'eval_steps_per_second': 4.313, 'epoch': 2.0}


## 📚 Step 6: LoRA Finetuning

A newer line of research, Parameter Efficient Fine-Tuning (PEFT), looks for techniques that drastically reduce the number of parameters to fine-tune while still achieving good performance. One of the most popular options is ✨ [LoRA](https://arxiv.org/pdf/2106.09685), short for Low-Rank Adaptation of Large Language Models. It consists of constructing each adapted matrix as $\theta = \hat{\theta} + A^TB$, where $\theta$ is the new matrix of our model, the pre-trained weights $\hat{\theta}$ are kept fixed, and only the additive component, the product of two smaller matrices $A,B$, is learned. This drastically reduces the number of parameters to train when $A$ and $B$ have far fewer entries than $\hat{\theta}$.

<img src="https://heidloff.net/assets/img/2023/08/lora.png" alt="drawing" width="50%"/>


The speed-up is noticeable with BERT, and becomes more significant for larger models. The matrices have size $A,B\in\mathbb{R}^{r\times N}$, whereas the original matrix has size $\theta,\hat{\theta}\in\mathbb{R}^{N\times N}$. Each adapted matrix therefore needs $2rN$ trainable parameters instead of $N^2$, which is a large saving when $r \ll N$.
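
The formula can be checked numerically. The following is a minimal NumPy sketch, not the PEFT implementation: it uses random stand-ins for the matrices, assumes inputs are row vectors multiplied as $x\theta$, and picks $N=768$ and $r=16$ to mirror BERT-base and the rank used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 768, 16                        # matrix size and LoRA rank

theta_hat = rng.normal(size=(N, N))   # frozen pre-trained matrix (random stand-in)
A = rng.normal(size=(r, N)) * 0.01    # trainable
B = rng.normal(size=(r, N)) * 0.01    # trainable

x = rng.normal(size=(4, N))           # a batch of 4 input vectors

# Route 1: build the full adapted matrix, then apply it.
theta = theta_hat + A.T @ B
y_merged = x @ theta

# Route 2: never form the N x N update; go through the rank-r bottleneck.
y_low_rank = x @ theta_hat + (x @ A.T) @ B

full_params = N * N
lora_params = A.size + B.size         # 2 * r * N

print(np.allclose(y_merged, y_low_rank))
print(full_params, lora_params, f"{100 * lora_params / full_params:.2f}%")
```

Both routes give the same output, so the low-rank update can be applied as two thin matrix products during training and merged into $\hat{\theta}$ afterwards if desired.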

Now, let's first define the hyper-parameters of our LoRA:

In [13]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS",  # BERT is fine-tuned here as a sequence classifier
)

Then let's define the LoRA layer itself.
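
Stripped of the PEFT bookkeeping, the layer computes `base(x) + lora_B(lora_A(dropout(x))) * scaling` with `scaling = lora_alpha / r`. The NumPy sketch below illustrates that computation on toy sizes. It omits dropout and assumes that `lora_B` starts at zero, as proposed in the LoRA paper; whether `reset_lora_parameters` does the same depends on `init_lora_weights`, so treat that as an assumption. The function name `lora_forward` is hypothetical.

```python
import numpy as np


def lora_forward(x, W, lora_A, lora_B, scaling):
    """Base output plus the scaled low-rank update (dropout omitted, as in eval mode)."""
    base = x @ W.T                        # frozen layer, weight shape (out, in)
    update = (x @ lora_A.T) @ lora_B.T    # in -> r -> out
    return base + update * scaling


rng = np.random.default_rng(0)
in_features, out_features, r, lora_alpha = 8, 8, 2, 4
scaling = lora_alpha / r                  # same rule as the layer: lora_alpha / r

W = rng.normal(size=(out_features, in_features))
lora_A = rng.normal(size=(r, in_features))
x = rng.normal(size=(3, in_features))

# With B at zero the adapted layer reproduces the frozen layer exactly.
B_init = np.zeros((out_features, r))
same_at_init = np.allclose(lora_forward(x, W, lora_A, B_init, scaling), x @ W.T)

# Once B moves away from zero, the low-rank path changes the output.
B_trained = rng.normal(size=(out_features, r))
same_after = np.allclose(lora_forward(x, W, lora_A, B_trained, scaling), x @ W.T)

print(same_at_init, same_after)
```

Starting from a zero update means fine-tuning begins from the pre-trained model's behaviour and only gradually departs from it.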

In [28]:
class CustomLinearLoRA(Linear):
    def update_layer(
            self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora=False, use_dora=False
    ):
        # This code works for linear layers, override for other layer types
        if r <= 0:
            raise ValueError(f"`r` should be a positive integer value but the value passed is {r}")

        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha

        # EXERCISE: define a dropout layer
        lora_dropout_layer = nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else nn.Identity()

        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))

        # Actual trainable parameters
        # EXERCISE: write a linear layer that goes from self.in_features to r
        self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
        # EXERCISE: write a linear layer that goes from r to self.out_features
        self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)

        self.scaling[adapter_name] = lora_alpha / r

        self.reset_lora_parameters(adapter_name, init_lora_weights)
        self.set_adapter(self.active_adapters)

    def forward(self, x, *args, **kwargs):
        result = self.base_layer(x, *args, **kwargs)
        torch_result_dtype = result.dtype
        for active_adapter in self.active_adapters:
            if active_adapter not in self.lora_A.keys():
                continue
            lora_A = self.lora_A[active_adapter]
            lora_B = self.lora_B[active_adapter]
            dropout = self.lora_dropout[active_adapter]
            scaling = self.scaling[active_adapter]

            x = x.to(lora_A.weight.dtype)

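            # EXERCISE: add to the result of the base layer, the output of
            # lora_B and lora_A and multiply by the scaling factor.
            # Dropout is applied inside the expression so that `x` is not
            # overwritten when several adapters are active.
            result = result + lora_B(lora_A(dropout(x))) * scaling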

        result = result.to(torch_result_dtype)

        return result

Since we are using the HuggingFace PEFT library, we need to override some of its internals so that it uses the LoRA layer above. The following cell is mostly plumbing and can be skimmed; it is not needed to understand LoRA itself.

In [29]:
def custom_dispatch_default(target: torch.nn.Module, adapter_name, lora_config, **kwargs):
    new_module = None
    target_base_layer = target.get_base_layer() if isinstance(target, BaseTunerLayer) else target

    if isinstance(target_base_layer, torch.nn.Linear):
        kwargs.update(lora_config.loftq_config)
        new_module = CustomLinearLoRA(target, adapter_name, **kwargs)

    if new_module is None:
        new_module = dispatch_default(target, adapter_name, lora_config=lora_config, **kwargs)
    return new_module

class CustomLoraModel(LoraModel):
    @staticmethod
    def _create_new_module(lora_config, adapter_name, target, **kwargs):
        return custom_dispatch_default(target, adapter_name, lora_config=lora_config, **kwargs)

class CustomPeftModel(PeftModel):
    def __init__(self, model, peft_config, adapter_name="default"):
        PushToHubMixin.__init__(self)
        torch.nn.Module.__init__(self)

        self.modules_to_save = None
        self.active_adapter = adapter_name
        self.peft_type = peft_config.peft_type
        # These args are special PEFT arguments that users can pass. They need to be removed before passing them to
        # forward.
        self.special_peft_forward_args = {"adapter_names"}

        self._is_prompt_learning = peft_config.is_prompt_learning
        self._peft_config = None
        self.base_model = CustomLoraModel(model, {adapter_name: peft_config}, adapter_name)

        self.set_additional_trainable_modules(peft_config, adapter_name)

        if getattr(model, "is_gradient_checkpointing", True):
            model = self._prepare_model_for_gradient_checkpointing(model)

        # the `pretraining_tp` is set for some models to simulate Tensor Parallelism during inference to avoid
        # numerical differences, https://github.com/pytorch/pytorch/issues/76232 - to avoid any unexpected
        # behavior we disable that in this line.
        if hasattr(self.base_model, "config") and hasattr(self.base_model.config, "pretraining_tp"):
            self.base_model.config.pretraining_tp = 1

Now we have everything we need to fine-tune BERT with LoRA. We load the model again, wrap it with LoRA, count the trainable parameters, and see what happens when we fine-tune it.
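
The trainable-parameter count can be predicted by hand. This sketch assumes that PEFT's default LoRA targets for BERT are the `query` and `value` projections in each of the 12 encoder layers of BERT-base, each a 768 x 768 linear layer. These assumptions are consistent with the figure printed by the next cell.

```python
num_layers = 12          # encoder layers in BERT-base
targets_per_layer = 2    # assumed targets: query and value projections
hidden_size = 768
r = 16                   # rank from peft_config

# Each adapted layer adds lora_A (hidden_size -> r) and lora_B (r -> hidden_size), without biases.
params_per_adapter = hidden_size * r + r * hidden_size
lora_params = num_layers * targets_per_layer * params_per_adapter

base_params = 108_314_117
total_params = base_params + lora_params

print(lora_params, total_params)
print(f"trainable: {100 * lora_params / total_params:.2f}%")
```

The count consists of adapter weights only. This suggests that in this setup the 3845 parameters of the newly initialized classifier head are not among the trainable ones.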

In [30]:
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=5)

model = CustomPeftModel(model, peft_config)
print_trainable_parameters(model)


trainable params: 589824 || all params: 108903941 || trainable: 0.54%


In [31]:
# EXERCISE: explore learning rates in the set [5e-2, 5e-3, 5e-4, 5e-5] to find the best
# one with this configuration
train_and_evaluate(model, learning_rate=5e-4)


TrainOutput(global_step=250, training_loss=1.5055087890625, metrics={'train_runtime': 144.689, 'train_samples_per_second': 13.823, 'train_steps_per_second': 1.728, 'total_flos': 529860163584000.0, 'train_loss': 1.5055087890625, 'epoch': 2.0})


{'eval_loss': 1.2858178615570068, 'eval_accuracy': 0.421, 'eval_runtime': 29.6613, 'eval_samples_per_second': 33.714, 'eval_steps_per_second': 4.214, 'epoch': 2.0}


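As the recorded runs show, LoRA trained faster than full fine-tuning (about 145 s versus 195 s) and reached a better final accuracy than updating only the last linear layer (42.1% versus 36.5%), while training only 0.54% of the parameters. Head-only fine-tuning remained the fastest at about 62 s. Note that the full fine-tuning run above used a learning rate of 5e-2, which is too large when all BERT weights are updated: its 16.6% accuracy is below the roughly 20% expected from random guessing over five classes. Re-run it with a smaller learning rate from the exercise set before comparing its accuracy with the other two methods.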
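
For a side-by-side view, the recorded results can be collected in a small table. The numbers are copied from the `TrainOutput` and evaluation dictionaries printed above (runtimes rounded to one decimal); they come from single runs on 1000 training examples, so they are indicative only.

```python
# Numbers copied from the outputs recorded in this notebook.
runs = {
    "full (lr=5e-2)": {"trainable": 108_314_117, "train_s": 194.6, "accuracy": 0.166},
    "head (lr=5e-2)": {"trainable": 3_845, "train_s": 62.1, "accuracy": 0.365},
    "LoRA (lr=5e-4)": {"trainable": 589_824, "train_s": 144.7, "accuracy": 0.421},
}

print(f"{'method':<16}{'trainable':>12}{'train (s)':>11}{'accuracy':>10}")
for name, run in runs.items():
    print(f"{name:<16}{run['trainable']:>12,}{run['train_s']:>11.1f}{run['accuracy']:>10.3f}")

# Pick the run with the highest evaluation accuracy.
best = max(runs, key=lambda name: runs[name]["accuracy"])
print("best accuracy:", best)
```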

## ✨ Resources used for this tutorial and references
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685)
- [DoRA: Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/pdf/2402.09353)
- [HuggingFace PEFT Tutorial](https://huggingface.co/blog/peft)
- [HuggingFace PEFT Tutorial for image classification](https://huggingface.co/docs/peft/main/en/task_guides/image_classification_lora)
- [HuggingFace Training Tutorial](https://huggingface.co/docs/transformers/training)
