## Fine-tune Falcon-7b on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the recent Falcon-7b model on a single Google Colab instance and turn it into a chatbot.

We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA for more memory-efficient fine-tuning.

In [1]:
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

In [2]:
# import os

# os.environ["HF_ENDPOINT"] = "https://huggingface.co/"

In [3]:
import os

cache_dir = "/root/autodl-fs"
os.environ["TRANSFORMERS_CACHE"] = cache_dir

## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and `trl` to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, which is required to load Falcon models.

In [4]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
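After installing, it can help to confirm which versions actually ended up in the environment, since `-U` pulls whatever is latest at the time. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of any of these libraries.

```python
from importlib import metadata

# Package names taken from the pip commands above.
REQUIRED = ["trl", "transformers", "accelerate", "peft",
            "datasets", "bitsandbytes", "einops", "wandb"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:>14}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` means the corresponding `pip install` did not complete and should be re-run before continuing.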


## Dataset

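For our experiment, we will use the Guanaco dataset, a clean subset of the OpenAssistant dataset adapted to train general-purpose chatbots.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco). Note that the cell below loads `JiazhenLiu01/test_362_2` instead; set `dataset_name` to `timdettmers/openassistant-guanaco` to train on Guanaco itself.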

In [13]:
from datasets import load_dataset

dataset_name = "JiazhenLiu01/test_362_2"
dataset = load_dataset(dataset_name, split="train")
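The trainer later reads a single `text` column from each example, so a custom dataset needs its prompt and response merged into one string. The sketch below shows one way to do that on plain dictionaries. The `### Human:` / `### Assistant:` template is an assumption modelled on how Guanaco-style data is commonly laid out, and the `prompt` / `response` keys and both helper names are hypothetical.

```python
def to_chat_text(prompt, response):
    """Join one prompt/response pair into a single training string.

    The template is an assumed Guanaco-style layout, not a requirement of the trainer.
    """
    return f"### Human: {prompt}### Assistant: {response}"


def add_text_field(records, prompt_key="prompt", response_key="response"):
    """Return copies of the records with an extra 'text' field; inputs are not mutated."""
    return [
        {**record, "text": to_chat_text(record[prompt_key], record[response_key])}
        for record in records
    ]


examples = [{"prompt": "What is LoRA?", "response": "A low-rank adapter method."}]
print(add_text_field(examples)[0]["text"])
```

With a `datasets.Dataset`, the same per-record function could be applied through `dataset.map`, provided the column names match your data.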


## Loading the model

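In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it to 4-bit and attach LoRA adapters to it. Note that the cell below sets `model_name` to `TinyPixel/Llama-2-7B-bf16-sharded`, a sharded Llama-2-7B checkpoint, rather than Falcon 7B; the loading and quantization steps are the same. Let's get started!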

In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TinyPixel/Llama-2-7B-bf16-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    cache_dir=cache_dir,
)
model.config.use_cache = False

`low_cpu_mem_usage` was None, now set to True since model is quantized.
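To get a feel for why 4-bit loading matters on a single GPU, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is a rough sketch: the 7 billion parameter count is a round-number assumption, and the figures ignore quantization constants, activations, optimizer state and the LoRA adapters.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the raw weights, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumed round figure for a "7B" model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{nf4_gb:.1f} GB")
```

The four-fold reduction is what leaves room for activations and adapter gradients on a consumer or Colab GPU.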



Let's also load the tokenizer below.

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


Below we will create the configuration for the LoRA model. According to the QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. For Falcon, that means adding the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules in addition to the mixed query key value layer. The cell below, however, targets only `q_proj` and `v_proj`, the query and value projection names that match the Llama-style checkpoint loaded above.

In [16]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "v_proj",
    ])
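To see what these settings imply, the sketch below counts the trainable parameters that LoRA adds. Each adapted linear layer gets two low-rank matrices, of shape `d_in x r` and `r x d_out`, and the update is scaled by `lora_alpha / r`. The hidden size of 4096 and the 32 layers are assumptions matching a Llama-2-7B architecture; `lora_param_count` is an illustrative helper, not a PEFT function.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is (d_in x r) and B is (r x d_out)."""
    return r * (d_in + d_out)


# Values from the LoraConfig above.
lora_r = 64
lora_alpha = 16
target_modules = ["q_proj", "v_proj"]

# Assumed architecture (Llama-2-7B): square 4096 x 4096 projections, 32 layers.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, lora_r)
total_trainable = per_module * len(target_modules) * NUM_LAYERS
scaling = lora_alpha / lora_r

print(f"per adapted module: {per_module:,}")
print(f"total trainable:    {total_trainable:,}")
print(f"LoRA scaling:       {scaling}")
```

Under these assumptions the adapters hold roughly 33.5 million parameters, well under 1% of the base model. Adding more target modules raises that count proportionally.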

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first load the training arguments below.

In [10]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
)
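The batch size, accumulation and step settings above jointly determine how much data the run sees. The following sketch works that out, assuming a single GPU and a training split of 1,000 examples (the size suggested by the `Map` progress line in the trainer output further down); `training_budget` is an illustrative helper.

```python
def training_budget(per_device_batch, grad_accum_steps, max_steps, num_examples, num_gpus=1):
    """Return (effective batch size, samples seen, epochs) for a step-limited run."""
    effective_batch = per_device_batch * grad_accum_steps * num_gpus
    samples_seen = effective_batch * max_steps
    epochs = samples_seen / num_examples
    return effective_batch, samples_seen, epochs


effective_batch, samples_seen, epochs = training_budget(
    per_device_batch=4,   # per_device_train_batch_size
    grad_accum_steps=4,   # gradient_accumulation_steps
    max_steps=500,        # max_steps
    num_examples=1000,    # assumed dataset size
)

print(f"effective batch size: {effective_batch}")
print(f"samples seen:         {samples_seen}")
print(f"epochs:               {epochs}")
```

Eight passes over a small dataset is a lot, so it is worth watching for overfitting. Separately, to my understanding the plain `"constant"` scheduler in `transformers` does not apply warmup, so `warmup_ratio` likely has no effect here; `"constant_with_warmup"` is the variant that uses it.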

Then finally pass everything to the trainer.

In [18]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
[codecarbon INFO @ 22:09:35] [setup] RAM Tracking...
[codecarbon INFO @ 22:09:35] [setup] GPU Tracking...
[codecarbon INFO @ 22:09:35] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 22:09:35] [setup] CPU Tracking...
[codecarbon INFO @ 22:09:35] Tracking Intel CPU via RAPL interface
[codecarbon INFO @ 22:09:36] >>> Tracker's metadata:
[codecarbon INFO @ 22:09:36]   Platform system: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
[codecarbon INFO @ 22:09:36]   Python version: 3.10.8
[codecarbon INFO @ 22:09:36]   CodeCarbon version: 2.2.3
[codecarbon INFO @ 22:09:36]   Available RAM : 1007.512 GB
[codecarbon INFO @ 22:09:36]   CPU count: 128
[codecarbon INFO @ 22:09:36]   CPU model: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
[codecarbon INFO @ 22:09:36]   GPU count: 1
[codecarbon INFO @ 22:09:36]   GPU model: 1 x NVIDIA GeForce RTX 4090


We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [19]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
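One way to build intuition for why normalization layers benefit from higher precision: they compute means and variances, which are long sums, and half precision loses small increments once the running total grows. The sketch below simulates float16 accumulation with the standard library only. It illustrates the rounding effect and is not a reproduction of what the GPU kernels do.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(values, half_precision):
    """Sum values left to right, optionally rounding the running total to float16."""
    total = 0.0
    for value in values:
        total = total + value
        if half_precision:
            total = to_fp16(total)
    return total


values = [0.001] * 10_000  # the exact sum is 10

full_total = accumulate(values, half_precision=False)
half_total = accumulate(values, half_precision=True)

print(f"float64 running sum: {full_total:.4f}")
print(f"float16 running sum: {half_total:.4f}")  # stalls far below 10
```

Once the float16 total is large enough that `0.001` is smaller than half the spacing between representable values, every further addition rounds back to the same number.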

## Train the model

Now let's train the model! Simply call `trainer.train()`.

In [20]:
trainer.train()

wandb: Currently logged in as: jiazhen-liu01 (jiazhenliu). Use `wandb login --relogin` to force relogin



| Step | Training Loss |
|------|---------------|
| 10   | 3.0748        |
| 20   | 2.3591        |
| 30   | 2.2145        |
| 40   | 2.1324        |
| 50   | 1.9124        |
| 60   | 1.895         |
| 70   | 2.0006        |
| 80   | 1.8572        |
| 90   | 1.8892        |
| 100  | 1.8821        |


[codecarbon INFO @ 22:10:19] Energy consumed for RAM : 0.001574 kWh. RAM Power : 377.81707048416143 W
[codecarbon INFO @ 22:10:19] Energy consumed for all GPUs : 0.001009 kWh. Total GPU Power : 242.16400000000002 W
[codecarbon INFO @ 22:10:19] Energy consumed for all CPUs : 0.001116 kWh. Total CPU Power : 267.69069866083447 W
[codecarbon INFO @ 22:10:19] 0.003700 kWh of electricity used since the beginning.
[codecarbon INFO @ 22:10:34] Energy consumed for RAM : 0.003147 kWh. RAM Power : 377.81707048416143 W
[codecarbon INFO @ 22:10:34] Energy consumed for all GPUs : 0.002101 kWh. Total GPU Power : 262.29 W
[codecarbon INFO @ 22:10:34] Energy consumed for all CPUs : 0.002232 kWh. Total CPU Power : 268.0102297565652 W
[codecarbon INFO @ 22:10:34] 0.007481 kWh of electricity used since the beginning.
[codecarbon INFO @ 22:10:49] Energy consumed for RAM : 0.004816 kWh. RAM Power : 377.81707048416143 W
[codecarbon INFO @ 22:10:49] Energy consumed for all GPUs : 0.002451 kWh. Total GPU Power

TrainOutput(global_step=500, training_loss=1.6362906551361085, metrics={'train_runtime': 1155.0345, 'train_samples_per_second': 6.926, 'train_steps_per_second': 0.433, 'total_flos': 2.022408466101043e+16, 'train_loss': 1.6362906551361085, 'epoch': 8.0})
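The throughput figures in `TrainOutput` can be cross-checked against the training arguments. This sketch recomputes them from the reported runtime, assuming an effective batch size of 16 (4 per device times 4 accumulation steps on one GPU); `throughput` is an illustrative helper.

```python
def throughput(global_step, effective_batch_size, runtime_seconds):
    """Return (steps per second, samples per second) for a finished run."""
    steps_per_second = global_step / runtime_seconds
    samples_per_second = global_step * effective_batch_size / runtime_seconds
    return steps_per_second, samples_per_second


# Values copied from the TrainOutput above; the batch size is derived from the arguments.
steps_per_s, samples_per_s = throughput(
    global_step=500,
    effective_batch_size=16,
    runtime_seconds=1155.0345,
)

print(f"steps/s:   {steps_per_s:.3f}")
print(f"samples/s: {samples_per_s:.3f}")
```

Both values agree with the reported `train_steps_per_second` and `train_samples_per_second`, which confirms that each optimizer step consumed 16 examples.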

During training, the loss should decrease steadily, as in the following reference curve:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [None]:
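# Never hard-code a real access token in a notebook; load it from the environment
# or authenticate beforehand with `huggingface-cli login`.
# trainer.push_to_hub(repo_id="falcon-test3", token="<YOUR_HF_TOKEN>")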

In [None]:
trainer.save_model('Llama-test1')
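Because only the adapter weights are saved, using the result later means loading the base model again and attaching the adapter to it. The sketch below shows one way this might look with `peft`. The `load_finetuned` helper is hypothetical, and the float16 dtype and `device_map="auto"` are assumptions you may need to change for your hardware.

```python
def load_finetuned(base_model_name, adapter_dir):
    """Load the base model and attach a saved LoRA adapter (a sketch, not the only way).

    Heavy imports live inside the function so that defining it costs nothing.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,  # assumed dtype
        device_map="auto",          # assumed placement strategy
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    return model, tokenizer


# Example call, using the names from this notebook:
# model, tokenizer = load_finetuned("TinyPixel/Llama-2-7B-bf16-sharded", "Llama-test1")
```

The adapter directory must be the one passed to `trainer.save_model`, and the base model name must match the checkpoint the adapter was trained on.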