Fine-tune Falcon on Google Colab

Let's leverage the PEFT library and QLoRA for more memory-efficient fine-tuning.

## Setup

Run the cells below to set up the environment and install the required libraries. For our experiment we need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the last of which provides the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We also install `einops`, which is required to load Falcon models.

In [None]:
!nvidia-smi

In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
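
Once the install cells have run, it can help to confirm which versions were actually picked up, since these libraries change quickly. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the packages above.

```python
from importlib import metadata

# Packages named in the setup text above
REQUIRED = ["accelerate", "peft", "transformers", "datasets", "trl",
            "bitsandbytes", "einops"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```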

## Dataset

For our experiment, we use description data compiled from publicly available sources. The spreadsheet is expected to contain `question` and `answer` columns.


In [None]:
import pandas as pd
import json
from datasets import load_dataset


from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
#df = pd.read_csv('gdrive/My Drive/Dataset/code_snippets.csv')
#df = pd.read_excel('gdrive/My Drive/Dataset/Doc_Dataset_final.xlsx')
df = pd.read_excel('gdrive/My Drive/Dataset/')


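# Keep only the two columns we train on
df_1 = df[['question', 'answer']]
print(df_1.isna().sum())  # report missing values per column before dropping them
df_1 = df_1.dropna()

# Convert each row to a {"question": ..., "answer": ...} record
data_1 = df_1.to_dict(orient="records")

# Write the records as a JSON list so that `datasets` can load them
with open("dataset.json", "w") as f:
    json.dump(data_1, f)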

data = load_dataset('json', data_files='dataset.json')

#print(data["train"])

def generate_prompt(data_point):
  return f"""
<human>: {data_point["question"]}
<assistance>: {data_point["answer"]}
""".strip()

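def generate_and_tokenize_prompt(data_point):
  # Despite its name, this helper only builds the prompt text; tokenization
  # happens later inside the trainer. Returning a dict adds a "text" column
  # (the original returned the result of print(), which is always None).
  full_prompt = generate_prompt(data_point)
  return {"text": full_prompt}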



data = data["train"].shuffle().map(generate_and_tokenize_prompt)

# Round-trip through CSV; index=False avoids writing a spurious index column
df = pd.DataFrame(data)
df.to_csv("dataset.csv", index=False)

dataset = load_dataset('csv', data_files='dataset.csv', split='train')
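
To make the data layout concrete, here is a small sketch of what `dataset.json` holds (a JSON list of question/answer records) and what the prompt template produces for each record. The two records are made up for illustration and are not taken from the real dataset.

```python
import json

def generate_prompt(data_point):
    # Same template as the notebook's generate_prompt
    return f"""
<human>: {data_point["question"]}
<assistance>: {data_point["answer"]}
""".strip()

# Hypothetical records, for illustration only
records = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Name a primary colour.", "answer": "Red"},
]

# dataset.json is a JSON list of such objects; round-trip it in memory here
restored = json.loads(json.dumps(records))
prompts = [generate_prompt(record) for record in restored]

for prompt in prompts:
    print(prompt)
    print("---")
```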

## Loading the model

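In this section we will load a Falcon model, quantize it to 4-bit and attach LoRA adapters to it. The cell below loads [Falcon 40B Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct); the smaller [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b) is left commented out as a lighter alternative.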
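
As a rough guide to why 4-bit quantization matters on a Colab GPU, the sketch below estimates the memory needed for the weights alone. It assumes the nominal parameter counts implied by the model names (7 and 40 billion) and ignores quantization constants, activations, optimizer state and the LoRA adapters, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Nominal parameter counts implied by the model names (assumed round figures)
models = {"falcon-7b": 7e9, "falcon-40b": 40e9}

for name, num_params in models.items():
    print(f"{name}: 16-bit ~{weight_memory_gb(num_params, 16):.1f} GB, "
          f"4-bit ~{weight_memory_gb(num_params, 4):.1f} GB")
```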

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

#model_name = "ybelkada/falcon-7b-sharded-bf16"
#model_name = "tiiuae/falcon-7b"
model_name = "tiiuae/falcon-40b-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

Let's also load the tokenizer below.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

## Testing before fine-tuning

In [None]:
DEVICE = "cuda:0"

In [None]:
%%time

prompt = f"""
<human>: How to create a molecule using rdkit?
<assistance>:
""".strip()

# A program that performs a multi-threaded matched pair analysis of a set of structures for
# Last updated on May 15, 2023.

encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)

generation_config = model.generation_config
generation_config.max_new_tokens = 200  # cap on generated tokens, excluding the prompt
generation_config.do_sample = True  # temperature and top_p only apply when sampling
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
# Set eos_token_id first so that pad_token_id picks up the tokenizer's value
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config.pad_token_id = generation_config.eos_token_id

with torch.inference_mode():
  outputs = model.generate(
      input_ids = encoding.input_ids,
      attention_mask = encoding.attention_mask,
      generation_config = generation_config
  )

print(tokenizer.decode(outputs[0]))
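
The `temperature` and `top_p` settings used above shape how the next token is sampled. The following pure-Python sketch illustrates the idea on a toy list of logits; it is a simplified illustration of temperature scaling and nucleus (top-p) filtering, not the implementation used inside `transformers`, and the function name `top_p_filter` is hypothetical.

```python
import math

def top_p_filter(logits, temperature=0.7, top_p=0.7):
    """Return {token_index: renormalised probability} for the kept tokens."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens
print(top_p_filter(toy_logits))             # default top_p=0.7
print(top_p_filter(toy_logits, top_p=0.9))  # a larger nucleus keeps more tokens
```

A lower `top_p` restricts sampling to fewer high-probability tokens, which makes the output more predictable at the cost of variety.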


Below we define the LoRA configuration used to create the adapter model. According to the QLoRA paper, it is important to target all linear layers in the transformer block for maximum performance. Therefore we add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the fused `query_key_value` layer.

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
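
To get a feel for how many trainable parameters a rank-64 adapter adds, the sketch below counts the LoRA weights for the four target modules in a single transformer block. The layer shapes are hypothetical (a hidden size of 4096 is assumed for round numbers) and do not match the real Falcon dimensions; only the formula carries over.

```python
lora_r = 64
lora_alpha = 16

# Hypothetical (in_features, out_features) shapes for one transformer block,
# assuming a hidden size of 4096; real Falcon shapes differ.
layer_shapes = {
    "query_key_value": (4096, 3 * 4096),
    "dense": (4096, 4096),
    "dense_h_to_4h": (4096, 4 * 4096),
    "dense_4h_to_h": (4 * 4096, 4096),
}

def lora_params(in_features, out_features, r):
    """LoRA adds two matrices: A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)

per_layer = {name: lora_params(i, o, lora_r) for name, (i, o) in layer_shapes.items()}
total_per_block = sum(per_layer.values())
scaling = lora_alpha / lora_r  # factor applied to the low-rank update

for name, count in per_layer.items():
    print(f"{name:16s} {count:>10,d}")
print(f"total per block  {total_per_block:>10,d}  (scaling = {scaling})")
```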

## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters.

In [None]:
from transformers import TrainingArguments

output_dir="gdrive/My Drive/falcon-adapter-output-new-1"
#output_dir = "./instruct-falcon"
#output_dir = './output'
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 20
logging_steps = 5
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"
gradient_checkpointing = True
group_by_length = True
save_total_limit = 40

training_arguments = TrainingArguments(
    output_dir=output_dir,
    #push_to_hub = True,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    gradient_checkpointing=gradient_checkpointing,
    save_total_limit=save_total_limit,
    lr_scheduler_type=lr_scheduler_type,
)
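
The batch-size and step settings above interact, so a quick calculation helps when judging how much data the run will see. This sketch assumes a single GPU, which is the usual Colab setup; `num_devices` is an illustrative variable, not a `TrainingArguments` field.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 100
save_steps = 20
num_devices = 1  # assumption: a single Colab GPU

# Examples contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total examples processed over the whole run
examples_seen = effective_batch_size * max_steps
# Optimizer steps at which a checkpoint is written
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))

print(effective_batch_size, examples_seen, checkpoint_steps)
```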

In [None]:
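def formatting_prompts_func(example):
    # `example` is a batch: a dict mapping column names to lists of values.
    # Note: this template differs from the <human>/<assistance> format used in
    # the test prompts; prompts at inference time should follow the template
    # the model was trained on.
    output_texts = []
    for i in range(len(example['question'])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
        output_texts.append(text)
    return output_texts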
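
The formatting function receives a batch as a dict of lists rather than a single row, which is easy to miss. The sketch below repeats the function and calls it on a made-up batch of two examples to show the shape of its input and output.

```python
def formatting_prompts_func(example):
    # Same logic as the notebook cell: one formatted string per row in the batch
    output_texts = []
    for i in range(len(example['question'])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
        output_texts.append(text)
    return output_texts

# Hypothetical batch, for illustration only
batch = {
    "question": ["What is 2 + 2?", "Name a primary colour."],
    "answer": ["4", "Red"],
}

texts = formatting_prompts_func(batch)
for text in texts:
    print(repr(text))
```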

In [None]:

from huggingface_hub import notebook_login
notebook_login()

Then finally pass everything to the trainer.

In [None]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    formatting_func=formatting_prompts_func,
    args=training_arguments,
)

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [None]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

## Train the model

In [None]:

trainer.train()

In [None]:
trainer.model.push_to_hub('falcon-40B-instruct-600steps', create_pr=True)

## Test the model performance

In [None]:
import os
DEVICE = "cuda:0"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
%%time
prompt = f"""
<human>: Which molecule type should I use in order to get the smallest memory usage for my database application?
<assistance>:
""".strip()

#A program that performs a multi-threaded matched pair analysis of a set of structures for
#Last updated on May 15, 2023.

encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)

In [None]:
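generation_config = model.generation_config
generation_config.max_new_tokens = 200  # cap on generated tokens, excluding the prompt
generation_config.do_sample = True  # temperature and top_p only apply when sampling
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
# Set eos_token_id first so that pad_token_id picks up the tokenizer's value
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config.pad_token_id = generation_config.eos_token_id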

In [None]:
import warnings
warnings.filterwarnings("ignore")

with torch.inference_mode():
  outputs = trainer.model.generate(
      input_ids = encoding.input_ids,
      attention_mask = encoding.attention_mask,
      generation_config = generation_config
  )

print(tokenizer.decode(outputs[0]))
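
The decoded output contains the prompt followed by the generated text, so it is often convenient to keep only the reply. Here is a minimal sketch of a hypothetical helper, `extract_response`, that splits on the `<assistance>:` marker used in the prompts; the end-of-sequence string `<|endoftext|>` is an assumption and should be replaced with `tokenizer.eos_token` in the notebook.

```python
def extract_response(decoded, marker="<assistance>:", eos_token="<|endoftext|>"):
    """Return the text after the first marker, without the end-of-sequence token."""
    _, found, tail = decoded.partition(marker)
    if not found:
        # Marker missing: fall back to the full text
        return decoded.strip()
    return tail.replace(eos_token, "").strip()

# Hypothetical decoded output, for illustration only
decoded = "<human>: How are you?\n<assistance>: Fine, thanks.<|endoftext|>"
print(extract_response(decoded))
```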