# Fine-tune Falcon 40B using PEFT and LoRA with FSDP

In this blog, we are going to learn how to fine-tune [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b) using [PEFT](https://github.com/huggingface/peft) with [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685) and PyTorch [FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/). PyTorch FSDP is natively integrated into the [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#pytorch-fully-sharded-data-parallel), making it easy to parallelize models on multiple GPUs. We are going to instruction fine-tune Falcon-40B using the new [SFTTrainer](https://huggingface.co/docs/trl/main/en/sft_trainer) from the [trl](https://github.com/lvwerra/trl) library.

We will learn how to:
1. Set up Development Environment and prepare the dataset
2. Fine-Tune Falcon-40B with LoRA and FSDP
3. Test Model and run Inference


### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of LLMs to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
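
To make the LoRA idea concrete: instead of updating a full `d x k` weight matrix, LoRA freezes it and trains two small factors `B` (`d x r`) and `A` (`r x k`), so only `r * (d + k)` parameters are trained per adapted matrix. The sketch below just does this arithmetic. The matrix size of 8192 and the rank of 8 are illustrative assumptions, not values taken from the training script.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Number of parameters in the two LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Illustrative values only (assumptions, not read from the training script)
d = k = 8192  # size of one square projection matrix
r = 8         # LoRA rank

full = d * k
lora = lora_trainable_params(d, k, r)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({lora / full:.2%} of the full matrix)")
```

Because the base weights stay frozen, only the small factors need gradients and optimizer states, which is where most of the memory savings come from.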

*Note: This tutorial was created and run on a g5.48xlarge AWS EC2 Instance, including 8 NVIDIA A10G GPUs.*

## 1. Set up Development Environment and prepare the dataset

In our example, we use the [PyTorch Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-pytorch.html), which comes with CUDA drivers and PyTorch already installed. We still have to install the Hugging Face libraries, including transformers and datasets. Running the following cell will install all the required packages.

In [None]:
# install Hugging Face Libraries
!pip install "peft==0.3.0" "trl==0.4.4" "transformers==4.30.1" "datasets==2.12.0" "accelerate==0.20.3" "evaluate==0.4.0" loralib  --upgrade --quiet
# install additional dependencies needed for training
!pip install tensorboard  

We will use [dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k), an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT paper](https://arxiv.org/abs/2203.02155), including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

```python
{
  "instruction": "What is world of warcraft",
  "context": "",
  "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment"
}
```

> Note: The next steps are for demonstration. The dataset processing, formatting and tokenization will be part of the training script, [run_clm_fsdp_lora.py](./scripts/run_clm_fsdp_lora.py). 

To load the `databricks/databricks-dolly-15k` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [None]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
# dataset size: 15011


To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. This is where the `SFTTrainer` from `trl` comes in handy. The `SFTTrainer` supports formatting during training, which means we only need to define a formatting function that takes a sample and returns a string in our instruction format.

In [None]:
def format_dolly(sample):
  # dolly records use the fields "instruction", "context" and "response"
  instruction = f"### Instruction\n{sample['instruction']}"
  # the context is optional and skipped when empty
  context = f"### Context\n{sample['context']}" if len(sample['context']) > 0 else None
  response = f"### Answer\n{sample['response']}"
  # join all the parts together
  prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
  return prompt

Let's test our formatting function on a random example.

In [None]:
from random import randrange

format_dolly(dataset[randrange(len(dataset))])
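
To see what the resulting prompt looks like without downloading the dataset, here is a small self-contained sketch that applies the same formatting logic to the example record shown earlier. It assumes the dolly field names `instruction`, `context` and `response` from that record. The helper name `format_record` is hypothetical and only used for this illustration.

```python
def format_record(sample: dict) -> str:
    """Same formatting idea as above, written against the dolly field names."""
    instruction = f"### Instruction\n{sample['instruction']}"
    # skip the context block when the record has no context
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# the example record from the dataset description above
record = {
    "instruction": "What is world of warcraft",
    "context": "",
    "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment",
}

prompt = format_record(record)
print(prompt)
```

Since the context of this record is empty, the prompt only contains the `### Instruction` and `### Answer` blocks, separated by a blank line.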

## 2. Fine-Tune Falcon-40B with LoRA and FSDP

We are going to use PyTorch FSDP to train Falcon-40B on multiple GPUs. This means we need a distributed launcher, e.g. `torchrun`, to start our training on multiple GPUs. We format and tokenize the dataset during training using the `formatting_func` argument of the `SFTTrainer`.
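
A rough back-of-the-envelope calculation shows why sharding is needed here. The sketch below assumes roughly 40 billion parameters stored in a 16-bit dtype (2 bytes per parameter) and 8 GPUs, as in the launch command further down. It only counts the model weights and ignores activations, gradients, optimizer states, the LoRA adapters and the temporary memory FSDP needs when it gathers a layer, so treat the result as a lower bound rather than a measurement.

```python
def sharded_weight_gb(num_params: float, bytes_per_param: int, num_gpus: int) -> tuple:
    """Return (total weight size, per-GPU share) in GB for evenly sharded weights."""
    total_gb = num_params * bytes_per_param / 1e9
    return total_gb, total_gb / num_gpus


# Assumptions: ~40e9 parameters, 16-bit weights, 8 GPUs
total_gb, per_gpu_gb = sharded_weight_gb(num_params=40e9, bytes_per_param=2, num_gpus=8)

print(f"weights in total: {total_gb:.0f} GB")
print(f"weights per GPU:  {per_gpu_gb:.0f} GB")
```

Under these assumptions the full set of weights is far larger than the 24 GB of a single A10G, while the per-GPU shard fits, which is the basic motivation for using FSDP.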

The `SFTTrainer` also supports a native integration with `peft`, which makes it easy to efficiently instruction-tune LLMs. We prepared a script, [run_clm_fsdp_lora.py](./scripts/run_clm_fsdp_lora.py), which implements causal language modeling and accepts all relevant parameters, including the model id and the peft configuration. The `SFTTrainer` part of our script looks like this:

```python
trainer = SFTTrainer(
    model, # our loaded model
    args=training_args, # our training args
    train_dataset=dataset, # raw training dataset
    formatting_func=format_dolly, # formatting function
    peft_config=peft_config, # peft config
    packing=True, # whether to pack data samples to max length
    max_seq_length=2048 # max sequence length for packing
)
```
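
The `packing=True` option concatenates several short samples into fixed-length sequences, so fewer tokens are wasted on padding. The following is a simplified sketch of that idea in plain Python, not the `trl` implementation: the function name `pack_sequences`, the integer token ids and the choice to drop the incomplete tail are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def pack_sequences(tokenized: Iterable[List[int]], eos_id: int, max_seq_length: int) -> Iterator[List[int]]:
    """Concatenate tokenized samples (separated by EOS) and cut fixed-length chunks."""
    buffer: List[int] = []
    for ids in tokenized:
        # append the sample followed by an end-of-sequence marker
        buffer.extend(ids + [eos_id])
        # emit as many full chunks as the buffer currently holds
        while len(buffer) >= max_seq_length:
            yield buffer[:max_seq_length]
            buffer = buffer[max_seq_length:]
    # any incomplete remainder is dropped in this sketch


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_sequences(samples, eos_id=0, max_seq_length=4))
print(packed)
```

Every emitted chunk has exactly `max_seq_length` tokens, and a single sample may be split across two chunks.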


In [None]:
%%bash

MODEL_ID="tiiuae/falcon-40b"
DATASET_ID="databricks/databricks-dolly-15k"
NUM_GPUS=8

echo "Training ${MODEL_ID} on ${DATASET_ID} using ${NUM_GPUS} GPUs."

torchrun --nproc_per_node ${NUM_GPUS} run_clm_fsdp_lora.py \
  --model_name_or_path ${MODEL_ID} \
  --dataset_id ${DATASET_ID} \
  
  

The training took ~10:36:00 and cost `~$13.22` for 10h of training. For comparison, a [full fine-tuning on FLAN-T5-XXL](https://www.philschmid.de/fine-tune-flan-t5-deepspeed#3-results--experiments) with the same duration (10h) requires 8x A100 40GB GPUs and costs ~$322.

We can save our model to use it for inference and evaluate it. We will save it to disk for now, but you could also upload it to the [Hugging Face Hub](https://huggingface.co/docs/hub/main) using the `model.push_to_hub` method.

In [None]:
# Save our LoRA model & tokenizer results
peft_model_id="results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# if you also want to save the base model, call
# trainer.model.base_model.save_pretrained(peft_model_id)

Our LoRA checkpoint is only 84MB in size and includes all of the learned knowledge for samsum.

## 3. Evaluate & run Inference with LoRA FLAN-T5

After the training is done, we want to evaluate and test the model. The most commonly used metric to evaluate summarization tasks is the [rouge_score](https://en.wikipedia.org/wiki/ROUGE_(metric)) (short for Recall-Oriented Understudy for Gisting Evaluation). This metric does not behave like standard accuracy: it compares a generated summary against a set of reference summaries.
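
To give an intuition for what ROUGE measures, here is a deliberately simplified ROUGE-1 sketch in plain Python. It computes the F1 score over overlapping unigrams using whitespace tokenization and lowercasing. This is an illustration only: it does no stemming and will not reproduce the exact numbers of the `rouge` metric from `evaluate`, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over unigram overlap (no stemming, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"simplified rouge1 F1: {score:.4f}")
```

In this example five of the six unigrams overlap, so precision and recall are both 5/6 and the F1 score is 5/6 as well.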

We are going to use the `evaluate` library to compute the `rouge` score. We can run inference using `PEFT` and `transformers`. For our FLAN-T5 XXL model, we need at least 18GB of GPU memory.

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc. 
peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")

Let’s load the dataset again with a random sample to try the summarization.

In [None]:
from datasets import load_dataset 
from random import randrange


# Load dataset from the hub and get a sample
dataset = load_dataset("samsum")
sample = dataset['test'][randrange(len(dataset["test"]))]

input_ids = tokenizer(sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}\n{'---'* 20}")

print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")

Nice! Our model works! Now, let's take a closer look and evaluate it against the `test` set of the processed `samsum` dataset. To do so, we need to create some utilities to generate the summaries and group them together.

In [None]:
import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm

# Metric
metric = evaluate.load("rouge")

def evaluate_peft_model(sample,max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)    
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    return prediction, labels

# load test dataset from disk
test_dataset = load_from_disk("data/eval/").with_format("torch")

# run predictions
# this can take ~45 minutes
predictions, references = [] , []
for sample in tqdm(test_dataset):
    p,l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)

# compute metric 
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)

# print results 
print(f"rouge1: {rouge['rouge1']* 100:2f}%")
print(f"rouge2: {rouge['rouge2']* 100:2f}%")
print(f"rougeL: {rouge['rougeL']* 100:2f}%")
print(f"rougeLsum: {rouge['rougeLsum']* 100:2f}%")

# rouge1: 50.386161%
# rouge2: 24.842412%
# rougeL: 41.370130%
# rougeLsum: 41.394230%

Our PEFT fine-tuned FLAN-T5-XXL achieved a rouge1 score of `50.38%` on the test dataset. For comparison, a [full fine-tuning of flan-t5-base achieved a rouge1 score of 47.23](https://www.philschmid.de/fine-tune-flan-t5). That is an improvement of about 3 percentage points.

It is incredible to see that our LoRA checkpoint is only 84MB in size and that the model still achieves better performance than a smaller, fully fine-tuned model.