In [1]:
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

In [2]:
import os

cache_dir = "/root/autodl-fs"
os.environ["TRANSFORMERS_CACHE"] = cache_dir

## Fine-tune Falcon-7b on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the recent Falcon-7b model on a single Google Colab instance and turn it into a chatbot.

We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA for more memory-efficient fine-tuning.

## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the last of which provides the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, as it is required to load Falcon models.

In [6]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb


## Dataset

For our experiment, we will use the Guanaco dataset, which is a clean subset of the OpenAssistant dataset adapted to train general-purpose chatbots.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco). Note that the cell below loads a different dataset, `JiazhenLiu01/SFT_LLAMA_362_threeturn`, in its place.

In [3]:
from datasets import load_dataset

dataset_name = "JiazhenLiu01/SFT_LLAMA_362_threeturn"
dataset = load_dataset(dataset_name, split="train")

## Loading the model

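In this section we will load the base model, quantize it to 4-bit and attach LoRA adapters to it. The text of this notebook refers to the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), but the cell below loads `meta-llama/Meta-Llama-3-8B`. Let's get started!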

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    cache_dir=cache_dir
)
model.config.use_cache = False

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
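To get a feel for why 4-bit quantization matters on a single GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: the parameter count is an assumed nominal 8 billion (taken from the "8B" in the model name), and the estimate ignores quantization constants, activations, gradients and optimizer state.

```python
# Rough weight-memory estimate at different precisions.
# ASSUMPTION: a nominal 8e9 parameters; real checkpoints differ slightly,
# and 4-bit storage carries extra per-block quantization constants.

def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NUM_PARAMS = 8_000_000_000  # hypothetical round number

for label, bits in [("fp32", 32), ("fp16", 16), ("nf4", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):6.2f} GiB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the fp16 footprint, which is what leaves room for the adapters and activations during training.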



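Let's also load the tokenizer below. The tokenizer has no dedicated padding token set here, so we reuse the end-of-sequence token for padding.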

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Below we will create the configuration for the LoRA model. According to the QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. Therefore we target the MLP projections (`gate_proj`, `up_proj`, `down_proj`) and the attention output projection (`o_proj`) in addition to the query, key and value projections (`q_proj`, `k_proj`, `v_proj`).

In [8]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
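As a sanity check on the configuration above, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the layer shapes of the public Llama 3 8B configuration (hidden size 4096, intermediate size 14336, 8 key/value heads of dimension 128, 32 decoder layers); these numbers are not stated in this notebook.

```python
# Estimate the number of trainable LoRA parameters for the config above.
# ASSUMPTION: layer shapes follow the public Llama 3 8B configuration.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each targeted linear layer in one decoder block
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)

r, alpha = 64, 16
per_layer = sum(lora_param_count(i, o, r) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS

print(f"trainable LoRA parameters: {total:,}")         # 167,772,160
print(f"size in float32: {total * 4 / 1e6:.0f} MB")     # 671 MB
print(f"LoRA scaling factor alpha / r: {alpha / r}")    # 0.25
```

The float32 size is in line with the `adapter_model.safetensors` upload of 671M reported later in this notebook, which suggests the assumed shapes match the model that was actually trained.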

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [9]:
from transformers import TrainingArguments

output_dir = "./results_llama"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
)
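A few derived quantities help when reading these arguments. The sketch below computes the effective batch size per optimizer step, the total number of samples processed, and the number of warmup steps implied by `warmup_ratio`. It assumes a single GPU, which this notebook does not state explicitly. As far as we know, the plain `"constant"` schedule does not apply warmup (that is what `"constant_with_warmup"` is for), so the warmup figure would only matter with a warmup-aware scheduler.

```python
import math

# Values copied from the training arguments above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
warmup_ratio = 0.03
num_devices = 1  # ASSUMPTION: training runs on a single GPU

# Samples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total samples processed over the whole run.
samples_seen = effective_batch_size * max_steps
# Warmup steps a warmup-aware scheduler would use.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(effective_batch_size, samples_seen, warmup_steps)  # 16 8000 15
```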

Finally, pass everything to the trainer.

In [10]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
[codecarbon INFO @ 18:45:19] [setup] RAM Tracking...
[codecarbon INFO @ 18:45:19] [setup] GPU Tracking...
[codecarbon INFO @ 18:45:19] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 18:45:19] [setup] CPU Tracking...
[codecarbon INFO @ 18:45:20] CPU Model on constant consumption mode: AMD EPYC 7T83 64-Core Processor
[codecarbon INFO @ 18:45:20] >>> Tracker's metadata:
[codecarbon INFO @ 18:45:20]   Platform system: Linux-5.4.0-153-generic-x86_64-with-glibc2.35
[codecarbon INFO @ 18:45:20]   Python version: 3.10.8
[codecarbon INFO @ 18:45:20]   CodeCarbon version: 2.3.5
[codecarbon INFO @ 18:45:20]   Available RAM : 503.602 GB
[codecarbon INFO @ 18:45:20]   CPU count: 128
[codecarbon INFO @ 18:45:20]   CPU model: AMD EPYC 7T83

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [11]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() casts parameters in place, so no reassignment is needed
        module.to(torch.float32)

## Train the model

Now let's train the model! Simply call `trainer.train()`.

In [12]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mjiazhen-liu01[0m ([33mjiazhenliu[0m). Use [1m`wandb login --relogin`[0m to force relogin





| Step | Training Loss |
|------|---------------|
| 10   | 2.4163        |
| 20   | 2.2096        |
| 30   | 2.1798        |
| 40   | 2.0184        |
| 50   | 2.1268        |
| 60   | 1.9523        |
| 70   | 2.0543        |
| 80   | 1.7981        |
| 90   | 1.8148        |
| 100  | 1.4733        |


[codecarbon INFO @ 18:45:51] Energy consumed for RAM : 0.000787 kWh. RAM Power : 188.85086917877197 W
[codecarbon INFO @ 18:45:51] Energy consumed for all GPUs : 0.001380 kWh. Total GPU Power : 331.1839558360765 W
[codecarbon INFO @ 18:45:51] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 18:45:51] 0.002344 kWh of electricity used since the beginning.
[codecarbon INFO @ 18:46:06] Energy consumed for RAM : 0.001574 kWh. RAM Power : 188.85086917877197 W
[codecarbon INFO @ 18:46:06] Energy consumed for all GPUs : 0.002831 kWh. Total GPU Power : 348.18475219943787 W
[codecarbon INFO @ 18:46:06] Energy consumed for all CPUs : 0.000354 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 18:46:06] 0.004758 kWh of electricity used since the beginning.
[codecarbon INFO @ 18:46:21] Energy consumed for RAM : 0.002362 kWh. RAM Power : 188.85086917877197 W
[codecarbon INFO @ 18:46:22] Energy consumed for all GPUs : 0.003179 kWh. Total GPU Power : 81.3996930362

TrainOutput(global_step=500, training_loss=0.8486868362426758, metrics={'train_runtime': 2225.9451, 'train_samples_per_second': 3.594, 'train_steps_per_second': 0.225, 'total_flos': 5.501374268343091e+16, 'train_loss': 0.8486868362426758, 'epoch': 6.097560975609756})
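The metrics in the `TrainOutput` above can be cross-checked against the training arguments. The sketch below recomputes the throughput figures from the reported runtime, assuming an effective batch size of 16 (4 samples per device times 4 accumulation steps on a single GPU, which is an assumption about the hardware setup).

```python
# Values copied from the TrainOutput above.
train_runtime = 2225.9451          # seconds
global_step = 500
reported_epoch = 6.097560975609756

effective_batch_size = 16  # ASSUMPTION: 4 per device x 4 accumulation steps, one GPU

steps_per_second = global_step / train_runtime
samples_per_second = global_step * effective_batch_size / train_runtime
steps_per_epoch = round(global_step / reported_epoch)

print(round(steps_per_second, 3))    # 0.225
print(round(samples_per_second, 3))  # 3.594
print(steps_per_epoch)               # 82
```

Both throughput values agree with the reported `train_steps_per_second` and `train_samples_per_second`, and roughly 82 optimizer steps per epoch means the 500 steps amount to about six passes over the training set.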

During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [15]:
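# Note: the positional argument is the commit message; the target repo comes from output_dir (or hub_model_id).
trainer.push_to_hub("JiazhenLiu01/falcon-7b-oneturn")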



Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/671M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.05k [00:00<?, ?B/s]

events.out.tfevents.1715856325.autodl-container-5daf42b4c1-031359d8.2422.0:   0%|          | 0.00/16.1k [00:00…

CommitInfo(commit_url='https://huggingface.co/JiazhenLiu01/results_llama/commit/77c96866f0b717cd84dc8fc5e29d3f32d849e822', commit_message='JiazhenLiu01/falcon-7b-oneturn', commit_description='', oid='77c96866f0b717cd84dc8fc5e29d3f32d849e822', pr_url=None, pr_revision=None, pr_num=None)

In [13]:
trainer.save_model('falcon-7b-oneturn')



In [13]:
input_text = "<s>[INST] <<SYS>> Your name is Tom and you are now communicating with a psychologist in a therapy session. Answer the following questions as Tom and respond in a casual tone. Each of your replies should be no longer than three sentences.<</SYS>>When do you fall asleep?[/INST]"
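The prompt above follows a Llama-2-style chat layout with a system block wrapped in `<<SYS>>` markers. When trying several questions against the same persona, it can be convenient to build the string from its parts. The helper below is a minimal sketch; `build_prompt` is a hypothetical name, not part of any library, and it simply reproduces the exact spacing used in `input_text`.

```python
def build_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a single-turn prompt in the same layout as `input_text` above."""
    return f"<s>[INST] <<SYS>> {system_prompt}<</SYS>>{user_message}[/INST]"

system_prompt = (
    "Your name is Tom and you are now communicating with a psychologist "
    "in a therapy session. Answer the following questions as Tom and respond "
    "in a casual tone. Each of your replies should be no longer than three sentences."
)

# Rebuilds the same string as the hand-written prompt above.
input_text = build_prompt(system_prompt, "When do you fall asleep?")
print(input_text)
```

Keeping the system prompt in one variable makes it easy to vary only the question between runs while leaving the persona unchanged.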