In [1]:
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

In [2]:
import os

cache_dir = "/root/autodl-fs"
os.environ["TRANSFORMERS_CACHE"] = cache_dir

## Fine-tune Falcon-7B on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the recent Falcon-7B model in a single Colab session and turn it into a chatbot.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA, for more memory-efficient fine-tuning.

## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the last of which provides the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, which is required to load Falcon models.

In [6]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb

## Dataset

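The Guanaco dataset, a clean subset of the OpenAssistant dataset adapted for training general-purpose chatbots, is a common choice for this kind of experiment. It can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).

Note that the cell below loads a different dataset, `JiazhenLiu01/SFT_LLAMA_362_threeturn`. Substitute `timdettmers/openassistant-guanaco` as `dataset_name` if you want to train on Guanaco instead.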

In [3]:
from datasets import load_dataset

dataset_name = "JiazhenLiu01/SFT_LLAMA_362_threeturn"
dataset = load_dataset(dataset_name, split="train")


## Loading the model

In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it to 4-bit and attach LoRA adapters to it. Let's get started!
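
To give a feel for why 4-bit quantization matters, here is a back-of-envelope estimate of the memory needed for the weights alone. It is only a sketch: the parameter count is an assumed round figure of 7 billion rather than a value read from the checkpoint, and the estimate ignores quantization constants, activations, gradients and optimizer state.

```python
# Rough weight-memory estimate for a ~7B-parameter model.
N_PARAMS = 7e9  # assumption: round parameter count, not read from the checkpoint


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


bf16_gib = weight_gib(N_PARAMS, 16)  # 16-bit weights
nf4_gib = weight_gib(N_PARAMS, 4)    # 4-bit weights, ignoring quantization constants

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which is what leaves room for the LoRA adapters and training state on a single GPU.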

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    cache_dir=cache_dir
)
model.config.use_cache = False

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Let's also load the tokenizer below

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Below we define the LoRA configuration. According to the QLoRA paper, it is important to adapt all linear layers in the transformer block for maximum performance. Therefore we add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the fused query-key-value layer.

In [6]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
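
The configuration above can be sanity-checked with a little arithmetic. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two small matrices with `r * (d_in + d_out)` parameters in total, and the learned update is scaled by `lora_alpha / r`. The sketch below counts the trainable parameters; the layer shapes and block count are assumptions taken from the published Falcon-7B configuration as I understand it, not values read from the loaded model.

```python
# Assumed Falcon-7B dimensions (hypothetical constants, not read from the model).
HIDDEN = 4544
HEAD_DIM = 64
N_BLOCKS = 32

# (in_features, out_features) of each targeted linear layer in one block.
LAYER_SHAPES = {
    # multi-query attention: all query heads plus one shared key and value head
    "query_key_value": (HIDDEN, HIDDEN + 2 * HEAD_DIM),
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, 4 * HIDDEN),
    "dense_4h_to_h": (4 * HIDDEN, HIDDEN),
}

lora_r, lora_alpha = 64, 16


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_block = sum(lora_params(i, o, lora_r) for i, o in LAYER_SHAPES.values())
total = per_block * N_BLOCKS
scaling = lora_alpha / lora_r

print(f"trainable LoRA parameters: {total:,}")
print(f"update scaling (alpha / r): {scaling}")
print(f"size if stored as float32: {total * 4 / 1e6:.0f} MB")
```

If the adapters are stored in float32, the estimate comes to roughly 522 MB, which is in line with the `adapter_model.safetensors` size reported in the upload log later in this notebook.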

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [7]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
)
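
One detail worth checking in these arguments is how `warmup_ratio` interacts with the scheduler. As far as I understand the transformers schedulers, `"constant"` keeps the learning rate fixed from the first step and does not apply warmup, whereas `"constant_with_warmup"` ramps linearly up to the target rate. The sketch below re-implements both schedules in plain Python for illustration; the function names are hypothetical and this is not the library code.

```python
import math

learning_rate = 2e-4
max_steps = 500
warmup_ratio = 0.03

# Number of warmup steps implied by the ratio
warmup_steps = math.ceil(max_steps * warmup_ratio)


def lr_constant(step: int) -> float:
    """Constant schedule: the warmup setting has no effect."""
    return learning_rate


def lr_constant_with_warmup(step: int) -> float:
    """Linear ramp from 0 to the target rate, then constant."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    return learning_rate


for step in (0, 5, 15, 100):
    print(step, lr_constant(step), lr_constant_with_warmup(step))
```

If warmup is actually wanted, switching `lr_scheduler_type` to `"constant_with_warmup"` would be the change to try.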

Finally, pass everything to the trainer.

In [8]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Map:   0%|          | 0/1311 [00:00<?, ? examples/s]

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [9]:
for name, module in trainer.model.named_modules():
    # Upcast normalization layers to float32; `Module.to` casts parameters in place
    if "norm" in name:
        module.to(torch.float32)
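
To illustrate why normalization layers are sensitive to half precision, the sketch below rounds ordinary Python floats to IEEE 754 half precision using the standard library's `struct` module. It is only an illustration of float16 range and resolution, not what PyTorch does internally; `to_fp16` is a hypothetical helper, and on a GPU an out-of-range value would become `inf` rather than raise an exception.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Resolution: above 2048, float16 can no longer represent every integer,
# so adding a small term to a large running sum can be lost entirely.
lost_increment = to_fp16(2048.0 + 1.0) == to_fp16(2048.0)

# Range: the largest finite float16 is 65504, so squaring a moderately
# large activation (as a variance computation does) already overflows.
try:
    to_fp16(300.0 ** 2)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True

print("increment lost:", lost_increment)
print("square overflowed:", overflowed)
```

Layer norms compute a mean and a variance over the whole hidden dimension, which is where this limited range and resolution tends to hurt most.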

In [10]:
!wandb login

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [11]:
trainer.train()

wandb: Currently logged in as: jiazhen-liu01 (jiazhenliu). Use `wandb login --relogin` to force relogin

| Step | Training Loss |
|------|---------------|
| 10 | 2.3909 |
| 20 | 2.1547 |
| 30 | 2.1732 |
| 40 | 2.0331 |
| 50 | 2.1103 |
| 60 | 1.9805 |
| 70 | 2.0693 |
| 80 | 1.8647 |
| 90 | 1.8859 |
| 100 | 1.6464 |


TrainOutput(global_step=500, training_loss=1.040332973718643, metrics={'train_runtime': 2183.2103, 'train_samples_per_second': 3.664, 'train_steps_per_second': 0.229, 'total_flos': 5.064003551380608e+16, 'train_loss': 1.040332973718643, 'epoch': 6.097560975609756})
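
The `epoch` and throughput figures in this output can be reproduced from the training arguments. The sketch below assumes a single GPU and takes the 1311 training examples from the `Map` progress output shown when the trainer was created; the way steps per epoch are derived mirrors my understanding of how `Trainer` counts them and is not copied from the library.

```python
import math

num_examples = 1311               # from the `Map` progress output
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
train_runtime = 2183.2103         # seconds, from TrainOutput

# Examples consumed per optimizer step (single GPU assumed)
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Mini-batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

epochs = max_steps / updates_per_epoch
samples_per_second = max_steps * effective_batch / train_runtime
steps_per_second = max_steps / train_runtime

print(f"effective batch size: {effective_batch}")
print(f"updates per epoch:    {updates_per_epoch}")
print(f"epochs:               {epochs:.4f}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

In other words, 500 steps at an effective batch size of 16 pass over this small dataset about six times, which is worth keeping in mind when reading the final training loss.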

During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [12]:
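# Note: the first positional argument of `Trainer.push_to_hub` is the commit message,
# not the repository id. The target repository is derived from `output_dir`
# unless `hub_model_id` is set in `TrainingArguments`.
trainer.push_to_hub("JiazhenLiu01/falcon-7b-threeturn")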



adapter_model.safetensors:   0%|          | 0.00/522M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.05k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/JiazhenLiu01/results/commit/f39ac9421d923b276efe9b05c557c9f27c230598', commit_message='JiazhenLiu01/falcon-7b-threeturn', commit_description='', oid='f39ac9421d923b276efe9b05c557c9f27c230598', pr_url=None, pr_revision=None, pr_num=None)

In [13]:
trainer.save_model('falcon-7b-oneturn')



In [13]:
input_text = "<s>[INST] <<SYS>> Your name is Tom and you are now communicating with a psychologist in a therapy session. Answer the following questions as Tom and respond in a casual tone. Each of your replies should be no longer than three sentences.<</SYS>>When do you fall asleep?[/INST]"
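
The prompt above follows a fixed pattern: a system message wrapped in `<<SYS>>` tags, followed by the user question, all inside `[INST]` markers. A small helper makes it easier to try other questions with the same persona. This is a minimal sketch: `build_prompt` is a hypothetical function, and the template is inferred from the single example string above rather than from any documented format of the training data.

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble a prompt in the same layout as `input_text` above."""
    return f"<s>[INST] <<SYS>> {system}<</SYS>>{user}[/INST]"


system_prompt = (
    "Your name is Tom and you are now communicating with a psychologist "
    "in a therapy session. Answer the following questions as Tom and "
    "respond in a casual tone. Each of your replies should be no longer "
    "than three sentences."
)

prompt = build_prompt(system_prompt, "When do you fall asleep?")
print(prompt)
```

Keeping the layout identical to the one used in the training data matters, since the fine-tuned adapters have only seen prompts in that shape.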