# Instruction-Tune Falcon 7B using PEFT and QLoRA with int-4 

In this blog, we are going to learn how to fine-tune [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) using [PEFT](https://github.com/huggingface/peft) with [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685). We are going to instruction-tune Falcon using the new [SFTTrainer](https://huggingface.co/docs/trl/main/en/sft_trainer) from the [trl](https://github.com/lvwerra/trl) library.

We will learn how to:
1. Set up Development Environment and prepare the dataset
2. Fine-Tune Falcon-7B with QLoRA in int-4
3. Test Model and run Inference

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of LLMs to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)

*Note: This tutorial was created and run on a g5.48xlarge AWS EC2 Instance, including 1 NVIDIA A10G.*

## 1. Set up Development Environment and prepare the dataset

In our example, we use the [PyTorch Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-pytorch.html), which comes with CUDA drivers and PyTorch preinstalled. We still have to install the Hugging Face libraries, including transformers and datasets. Running the following cell will install all the required packages.

In [None]:
# install Hugging Face Libraries
# !pip install "peft==0.3.0" "trl==0.4.4" "transformers==4.30.1" "datasets==2.12.0" "accelerate==0.20.3" "evaluate==0.4.0" "torch==2.0.1" "bitsandbytes==0.39.0" --upgrade --quiet
!pip install  "trl==0.4.4" "transformers==4.30.1" "datasets==2.12.0" "accelerate==0.20.3" "evaluate==0.4.0" "torch==2.0.1" "bitsandbytes==0.39.0" --upgrade --quiet
!pip install "git+https://github.com/huggingface/peft.git@189a6b8e357ecda05ccde13999e4c35759596a67"
# install additional dependencies needed for training
!pip install tensorboard einops loralib

We will use [dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k), an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT paper](https://arxiv.org/abs/2203.02155), including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

```python
{
  "instruction": "What is world of warcraft",
  "context": "",
  "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment"
}
```

> Note: The next steps are for demonstration. The dataset processing, formatting and tokenization will be part of the training script, [run_clm_fsdp_lora.py](./scripts/run_clm_fsdp_lora.py). 

To load the `databricks/databricks-dolly-15k` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [None]:
from datasets import load_dataset
from random import randrange
# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# Train dataset size: 14732


To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. This is where the `SFTTrainer` from `trl` comes in handy. The `SFTTrainer` supports formatting during training, so we only need to define a formatting function (passed as `formatting_func`) that takes a sample and returns a string in our instruction format.

In [8]:
def format_dolly(sample):
  instruction = f"### Instruction\n{sample['instruction']}"
  context = f"### Context\n{sample['context']}" if len(sample['context']) > 0 else None
  response = f"### Answer\n{sample['response']}"
  # join all the parts together
  prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
  return prompt

Let's test our formatting function on a random example.

In [18]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))

### Instruction
What are the ways to save money in gardening?

### Answer
1. Avoid buying potting mix by making your own potting soil
2. Compost your food scraps to make your own soil
3. Avoid buying seed germinating trays by using tofu trays and other recycled food trays to germinate seeds
4. Avoid buying pots and containers by re-using plastic milk containers with the top cut off, tetra pak with the top cut off, yoghurt containers, plastic soda bottles etc.
5. Avoid buying plants from the store by germinating plants from seed yourself
6. Collect rainwater for your plants to avoid using municipal water
7. Re-use water from rinsing vegetables/rice to water your plants to minimize the use of municipal water
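
The random sample printed above has an empty `context`, so only the instruction and answer blocks appear. As a small sanity check, here is a self-contained sketch that exercises both branches of the formatting logic. The two records are hypothetical and written by hand for illustration; they are not taken from the dataset.

```python
def format_dolly(sample):
    """Same logic as the formatting function defined earlier in this post."""
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # keep only the parts that exist and separate them with a blank line
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# Hypothetical records, written by hand for illustration (not from the dataset)
with_context = {
    "instruction": "Which city is mentioned?",
    "context": "Paris is the capital of France.",
    "response": "Paris",
}
without_context = {"instruction": "Say hello.", "context": "", "response": "Hello!"}

print(format_dolly(with_context))
print("---")
print(format_dolly(without_context))
```

The first record should produce three blocks (instruction, context, answer), the second only two.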


## 2. Fine-Tune Falcon-7B with QLoRA in int-4

We are going to use the method recently introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. The TL;DR of how QLoRA works is:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.

If you want to learn more about QLoRA and how it works, I recommend reading the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
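
To get a feeling for why training only the adapter layers is cheap, the following sketch counts the trainable parameters of a single LoRA adapter against the frozen weight matrix it is attached to. The layer size and rank are illustrative assumptions, not the values used by the training script.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of the frozen dense weight matrix W (d_out x d_in), bias ignored."""
    return d_in * d_out


# Illustrative sizes only (assumption): a square 4096 x 4096 projection and rank 8
d_in, d_out, r = 4096, 4096, 8
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, r)
print(f"full: {full:,}  lora: {lora:,}  trainable share: {lora / full:.2%}")
```

Under these assumptions the adapter holds well under 1% of the parameters of the layer it adapts, which is what keeps gradients and optimizer state small.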


First, we are going to load our model together with our quantization configuration:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Hugging Face model id
model_id = "ybelkada/falcon-7b-sharded-bf16" # sharded weights

# BitsAndBytesConfig int-4 config 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
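
As a rough back-of-the-envelope check of what the 4-bit configuration buys us, the sketch below estimates the memory needed for the weights alone. It follows the per-parameter overhead arithmetic from the QLoRA paper (block size 64 for the weights, block size 256 for the second quantization step). The 7e9 parameter count is a round-number assumption, and the estimate ignores activations, adapter weights, optimizer state, and any layers kept in higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

# Overhead of the quantization constants, following the QLoRA paper's arithmetic:
# one 32-bit constant per block of 64 weights ...
overhead_plain = 32 / 64
# ... or, with double quantization, 8-bit constants plus one 32-bit constant per 256 blocks
overhead_double = 8 / 64 + 32 / (64 * 256)

print(f"bf16:                 {weight_memory_gb(NUM_PARAMS, 16):.2f} GB")
print(f"4-bit:                {weight_memory_gb(NUM_PARAMS, 4 + overhead_plain):.2f} GB")
print(f"4-bit + double quant: {weight_memory_gb(NUM_PARAMS, 4 + overhead_double):.2f} GB")
```

The difference between the last two lines is the saving that `bnb_4bit_use_double_quant=True` is meant to provide.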

The `SFTTrainer` also supports a native integration with `peft`, which makes it super easy to efficiently instruction-tune LLMs. We prepared a script, [run_clm_fsdp_lora.py](./scripts/run_clm_fsdp_lora.py), which implements causal language modeling and accepts all relevant parameters, including the model id and the peft configuration. The `SFTTrainer` part in our script looks like this:

```python
trainer = SFTTrainer(
    model, # our loaded model
    args=training_args, # our training args
    train_dataset=dataset, # raw training dataset
    formatting_func=format_dolly, # formatting function
    peft_config=peft_config, # peft config
    packing=True, # whether to pack data samples to max length
    max_seq_length=2048 # max sequence length for packing
)
```
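
The idea behind `packing=True` and `max_seq_length` can be sketched in a few lines: formatted samples are concatenated, separated by an end-of-sequence token, and the resulting stream is cut into fixed-length blocks. This is a conceptual sketch only, not the implementation used by `trl`. The whitespace "tokenizer" and the `<eos>` marker are hypothetical stand-ins.

```python
from typing import Iterable, Iterator, List

EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def pack(texts: Iterable[str], max_seq_length: int) -> Iterator[List[str]]:
    """Concatenate tokenized samples, separated by EOS, and cut fixed-length blocks."""
    buffer: List[str] = []
    for text in texts:
        # toy whitespace "tokenizer"; a real run would use the model tokenizer
        buffer.extend(text.split() + [EOS])
        while len(buffer) >= max_seq_length:
            yield buffer[:max_seq_length]
            buffer = buffer[max_seq_length:]
    # leftover tokens that do not fill a whole block are dropped in this sketch


samples = ["a b c", "d e", "f g h i"]
blocks = list(pack(samples, max_seq_length=4))
for block in blocks:
    print(block)
```

Every block has exactly `max_seq_length` tokens, so no compute is spent on padding, at the price of samples occasionally being split across two blocks.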



In [None]:
# %%bash
!python scripts/run_clm_fsdp_lora.py \
 --model_id tiiuae/falcon-7b \
 --dataset_id "databricks/databricks-dolly-15k" \
 --per_device_train_batch_size 1 \
 --num_train_epochs 1 \
 --learning_rate 2e-4 \
 --gradient_checkpointing True \
 --bf16 True \
 --tf32 True \
 --output_dir ./tmp \
 --logging_steps 10
 
 
 #--optim adamw_apex_fused \

In [None]:
%%bash

MODEL_ID="tiiuae/falcon-40b"
DATASET_ID="databricks/databricks-dolly-15k"
NUM_GPUS=8

echo "Training ${MODEL_ID} on ${DATASET_ID} using ${NUM_GPUS} GPUs."

torchrun --nproc_per_node ${NUM_GPUS} scripts/run_clm_fsdp_lora.py \
  --model_id ${MODEL_ID} \
  --dataset_id ${DATASET_ID} \
  --per_device_train_batch_size 1 \
  --num_train_epochs 1 \
  --learning_rate 2e-4 \
  --gradient_checkpointing True \
  --bf16 True \
  --tf32 True \
  --output_dir ./tmp \
  --logging_steps 10 \
  --fsdp "full_shard auto_wrap" \
  --fsdp_transformer_layer_cls_to_wrap "DecoderLayer"
  # --optim adamw_apex_fused \

The training took ~10:36:00 and cost `~$13.22`. For comparison, a [full fine-tuning on FLAN-T5-XXL](https://www.philschmid.de/fine-tune-flan-t5-deepspeed#3-results--experiments) with roughly the same duration (10h) requires 8x A100 40GB GPUs and costs ~$322.
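
To make the comparison concrete, the short calculation below derives the implied hourly rate of both runs from the figures quoted above. No instance prices are assumed; everything is computed from the stated durations and totals.

```python
def hours(hms: str) -> float:
    """Convert an 'H:MM:SS' duration string to fractional hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


qlora_cost, qlora_hours = 13.22, hours("10:36:00")  # figures quoted above
full_ft_cost, full_ft_hours = 322.0, 10.0           # FLAN-T5-XXL comparison quoted above

qlora_rate = qlora_cost / qlora_hours
full_ft_rate = full_ft_cost / full_ft_hours
print(f"QLoRA run:      ~${qlora_rate:.2f}/h")
print(f"Full fine-tune: ~${full_ft_rate:.2f}/h")
print(f"Cost ratio:     ~{full_ft_cost / qlora_cost:.0f}x")
```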

## 3. Test Model and run Inference

After the training is done we want to run and test our model. We will use `peft` and `transformers` to load our LoRA adapter into our model. We will also use `accelerate` to run our inference on multiple GPUs. 

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc. 
peft_model_id = "tmp"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
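model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Falcon ships custom modeling code, as in the training step
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)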

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")

Let’s load the dataset again and pick a random sample to test how well our model follows instructions.

In [None]:
from datasets import load_dataset 
from random import randrange


# Load dataset from the hub and get a sample
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sample = dataset[randrange(len(dataset))]

prompt = f"### Instruction\n{sample['instruction']}\n\n"
if len(sample['context']) > 0:
  prompt += f"### Context\n{sample['context']}\n\n"
prompt += f"### Answer\n"

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=50, do_sample=True, top_p=0.9)

print(f"{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")
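
Note that the decoded output contains the prompt followed by the generated continuation. A minimal sketch of how the prompt construction and the answer extraction could be wrapped into helpers is shown here. `build_prompt` and `extract_answer` are hypothetical names introduced for illustration, and the sample and "model output" are written by hand so the snippet runs without a GPU.

```python
def build_prompt(sample: dict) -> str:
    """Build the inference prompt: same layout as training, but the answer is left open."""
    prompt = f"### Instruction\n{sample['instruction']}\n\n"
    if len(sample["context"]) > 0:
        prompt += f"### Context\n{sample['context']}\n\n"
    return prompt + "### Answer\n"


def extract_answer(decoded: str) -> str:
    """Return the text after the first '### Answer' marker (assumes the marker is present)."""
    return decoded.split("### Answer\n", 1)[-1].strip()


# Hypothetical sample and model output, written by hand for illustration
sample = {"instruction": "Name a primary color.", "context": "", "response": ""}
prompt = build_prompt(sample)
decoded = prompt + "Red is a primary color."
print(extract_answer(decoded))
```

Keeping the inference prompt identical to the training format matters, since the adapter was only ever trained on text laid out this way.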

Nice! Our model works! If we want to accelerate our model, we can deploy it with [Text Generation Inference](https://github.com/huggingface/text-generation-inference). To do that, we need to merge our adapter weights into the base model.

In [None]:
# Merge LoRA and base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model")
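
Conceptually, merging folds the low-rank update into the frozen weight, `W_merged = W + (lora_alpha / r) * B A`, so that the adapter no longer needs to be evaluated separately at inference time. The toy example below, with made-up 2x2 values and plain Python lists, checks that both variants give the same output. It is an illustration of the math under these assumptions, not the `peft` implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy values (assumption): 2x2 base weight, rank-1 adapter, lora_alpha = 2
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (d_out x d_in)
B = [[1.0], [2.0]]            # LoRA B (d_out x r)
A = [[0.5, -0.5]]             # LoRA A (r x d_in)
r, lora_alpha = 1, 2
scaling = lora_alpha / r

x = [[3.0], [1.0]]            # one input column vector

# Adapter kept separate: W x + scaling * B (A x)
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Adapter merged into the weight: (W + scaling * B A) x
W_merged = add(W, matmul(B, A), scale=scaling)
merged = matmul(W_merged, x)

print(separate, merged)
```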
