## Fine-tune Falcon-7b on Google Colab
Welcome to this Google Colab notebook, which shows how to fine-tune the recent Falcon-7b model on a single Google Colab instance and turn it into a chatbot.

We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA for more memory-efficient fine-tuning.

## Setup
Run the cells below to set up and install the required libraries. For our experiment we will need accelerate, peft, transformers, datasets and TRL, which provides the recent SFTTrainer. We will use bitsandbytes to quantize the base model to 4-bit. We will also install einops, as it is required to load Falcon models.

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb


## Dataset
For our experiment, we will use the Guanaco dataset, which is a clean subset of the OpenAssistant dataset adapted to train general purpose chatbots.

Dataset Link : https://huggingface.co/datasets/timdettmers/openassistant-guanaco

In [2]:
from datasets import load_dataset

dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name, split="train")



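Each Guanaco record stores a whole conversation in a single `text` field, which is the field the trainer reads later in this notebook. The sketch below shows one way to build a prompt in a similar shape when chatting with the fine-tuned model. The `### Human:` / `### Assistant:` markers and the newline separator are assumptions about the dataset's convention, and `build_prompt` is a hypothetical helper, not part of any library; inspect a few records (for example `dataset[0]["text"]`) before relying on it.

```python
def build_prompt(turns, human_tag="### Human:", assistant_tag="### Assistant:"):
    """Join (role, message) turns and leave the prompt open for the assistant's reply.

    The tags are ASSUMED markers; adjust them to match the training data.
    """
    parts = []
    for role, message in turns:
        tag = human_tag if role == "human" else assistant_tag
        parts.append(f"{tag} {message.strip()}")
    # End with a bare assistant tag so generation continues as the assistant.
    parts.append(assistant_tag)
    return "\n".join(parts)


prompt = build_prompt([("human", "What is QLoRA?")])
print(prompt)
```

Keeping the inference prompt close to the training format generally helps the model respond in the style it was fine-tuned on.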

## Loading the model
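In this section we will load the Falcon-7b model (https://huggingface.co/tiiuae/falcon-7b), quantize it to 4-bit and attach LoRA adapters to it. The code below loads a sharded bf16 copy of the checkpoint, `ybelkada/falcon-7b-sharded-bf16`. Let's get started!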

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,   # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4",   # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=torch.float16,  # Compute dtype for 4-bit base models
    bnb_4bit_use_double_quant=False,  # Nested (double) quantization for 4-bit base models; disabled here
)
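To make the idea of 4-bit quantization concrete, here is a minimal pure-Python sketch of blockwise absmax quantization to 16 levels. It is a simplification for illustration only: it uses evenly spaced levels, whereas the `nf4` type selected above uses levels chosen for normally distributed weights, and bitsandbytes implements the real thing in optimized GPU kernels. The function names are hypothetical.

```python
def quantize_block(values, n_levels=16):
    """Map a block of floats to integer codes in [0, n_levels - 1] plus one scale."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for all-zero blocks
    codes = [round((v / absmax + 1) / 2 * (n_levels - 1)) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, n_levels=16):
    """Invert quantize_block up to rounding error."""
    return [(c / (n_levels - 1) * 2 - 1) * absmax for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
for w, c, r in zip(weights, codes, restored):
    print(f"weight={w:+.4f}  code={c:2d}  restored={r:+.4f}")
```

Each weight is stored as a 4-bit code, and only the per-block scale stays in higher precision, which is where the memory saving comes from.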

In [3]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

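A rough back-of-the-envelope estimate shows why 4-bit loading matters on a single Colab instance. The sketch assumes about 7 billion parameters (an approximation suggested by the model name) and counts weights only; it ignores quantization constants, activations, LoRA weights and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # ASSUMPTION: approximate parameter count of Falcon-7b

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```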

## Let's also load the tokenizer below

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


Below we will define the configuration used to create the LoRA model. According to the QLoRA paper, it is important to target all linear layers in the transformer block for maximum performance. Therefore we will add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the fused `query_key_value` layer.

In [5]:
from peft import LoraConfig

lora_alpha = 16  # Alpha parameter for LoRA scaling
lora_dropout = 0.1  # Dropout probability for LoRA layers
lora_r = 64  # LoRA rank (dimension of the low-rank update matrices)

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
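For each targeted linear layer with shape `d_in x d_out`, LoRA trains two small matrices of rank `r`, adding `r * (d_in + d_out)` parameters, and scales their product by `lora_alpha / r`. The sketch below counts the trainable parameters this configuration would add. The layer shapes and the number of transformer blocks are assumptions about the Falcon-7b architecture, not values taken from this notebook; compare the result with `print_trainable_parameters()` on the PEFT-wrapped model.

```python
LORA_R = 64
LORA_ALPHA = 16


def lora_params(d_in, d_out, r=LORA_R):
    """Parameters in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# ASSUMED shapes (input dim, output dim) of the targeted layers in one block.
HIDDEN = 4544
LAYER_SHAPES = {
    "query_key_value": (HIDDEN, 4672),
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, 4 * HIDDEN),
    "dense_4h_to_h": (4 * HIDDEN, HIDDEN),
}
N_BLOCKS = 32  # ASSUMED number of transformer blocks

per_block = sum(lora_params(d_in, d_out) for d_in, d_out in LAYER_SHAPES.values())
total = per_block * N_BLOCKS
scaling = LORA_ALPHA / LORA_R

print(f"LoRA parameters per block: {per_block:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Update scaling (alpha/r):  {scaling}")
```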

## Loading the trainer
Here we will use the SFTTrainer (https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers Trainer to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [None]:
# # Batch size per GPU for evaluation
# per_device_eval_batch_size = 4

# # Weight decay to apply to all layers except bias/LayerNorm weights
# weight_decay = 0.001

# # Number of training steps (overrides num_train_epochs)
# max_steps = -1

In [10]:
from transformers import TrainingArguments

output_dir = "./results"         # Output directory where model predictions and checkpoints will be stored
per_device_train_batch_size = 4  # Batch size per GPU for training
gradient_accumulation_steps = 4  # Number of micro-batches to accumulate before each optimizer update
optim = "paged_adamw_32bit"      # Optimizer to use
save_steps = 10       # Save a checkpoint every X update steps
logging_steps = 10    # Log every X update steps
learning_rate = 2e-4  # Initial learning rate (AdamW optimizer)
max_grad_norm = 0.3   # Maximum gradient norm (gradient clipping)
max_steps = 100       # Total number of training steps (overrides num_train_epochs)
warmup_ratio = 0.03   # Ratio of steps for a linear warmup (from 0 to learning rate)
lr_scheduler_type = "constant"  # Learning rate schedule (e.g. "constant" or "cosine")

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps, 
    optim=optim,
    save_steps=save_steps,  # Save checkpoint every X updates steps
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,  # Use fp16 mixed precision (on an A100, set bf16=True instead)
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    
    # Group sequences into batches with same length
    # Saves memory and speeds up training considerably
    group_by_length=True,
    
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,  # Enable gradient checkpointing
)
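It helps to work out what these settings amount to in practice. The sketch below derives the effective batch size and how much data the run covers. It assumes a single GPU and 9,846 training examples (the size reported for the Guanaco train split when this notebook was run); the warmup step count uses a simple ceiling, which may not match the Trainer's exact rounding. A plain `constant` schedule is generally understood not to apply warmup at all, with `constant_with_warmup` being the variant that does.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100
warmup_ratio = 0.03

n_gpus = 1               # ASSUMPTION: a single Colab GPU
n_train_examples = 9846  # ASSUMPTION: size of the Guanaco train split

# Examples contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
# Examples consumed over the whole run.
examples_seen = effective_batch * max_steps
# Fraction of one pass over the training data.
epochs = examples_seen / n_train_examples
# Steps a warmup-aware scheduler would spend ramping up the learning rate.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen:        {examples_seen}")
print(f"Epochs covered:       {epochs:.2f}")
print(f"Warmup steps:         {warmup_steps}")
```

With these values the run covers only a fraction of one epoch, so it is best read as a quick demonstration; raise `max_steps` for a more thorough fine-tune.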

## Then finally pass everything to the trainer

#### SFT parameters
**Maximum sequence length to use --> max_seq_length = None (the cell below sets it to 512)**

**Pack multiple short examples in the same input sequence to increase efficiency --> packing = False**

**Load the entire model on the GPU 0 --> device_map = {"": 0}**
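To illustrate what packing means, the sketch below concatenates tokenized examples, separated by an end-of-sequence token, and cuts the stream into fixed-length blocks so that no positions are wasted on padding. This is a simplified illustration and not TRL's implementation; `pack_sequences` is a hypothetical helper, and the token ids are made up. This notebook keeps `packing = False`.

```python
def pack_sequences(tokenized_examples, max_seq_length, eos_token_id):
    """Concatenate examples with EOS separators and split into full-length blocks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_token_id)  # mark the boundary between examples
    n_full = len(stream) // max_seq_length  # any trailing partial block is dropped
    return [
        stream[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(n_full)
    ]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up token ids
blocks = pack_sequences(examples, max_seq_length=4, eos_token_id=0)
for block in blocks:
    print(block)
```

The trade-off is that a block can contain the end of one conversation and the start of another, which is why packing is an option rather than the default here.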

In [11]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)


## We will also pre-process the model by upcasting the layer norms to float32 for more stable training

In [12]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
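The reason for upcasting is that float16 has limited precision and range, which can hurt operations such as normalization that depend on small differences and accumulated sums. The stdlib-only sketch below shows both limits by round-tripping Python floats through the IEEE half-precision format; `to_fp16` and `fits_fp16` are hypothetical helpers written for this illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest float16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x):
    """Return True if x can be represented in float16 without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False


# Precision: a small increment on top of 1.0 is lost entirely.
print("1.0001 in float16:", to_fp16(1.0001))
# Range: large intermediate values cannot be stored at all.
print("70000.0 fits in float16:", fits_fp16(70000.0))
```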

## Train the model

### Now let's train the model! Simply call trainer.train()



In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10 | 1.3217 |
| 20 | 1.2541 |
| 30 | 1.3423 |
| 40 | 1.5333 |
| 50 | 1.7735 |






During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.