## Fine-tune Falcon-7B on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the recent Falcon-7B model in a single Google Colab session and turn it into a chatbot.

We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA for more memory-efficient fine-tuning.
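
To give a feel for why 4-bit quantization matters here, the following back-of-the-envelope sketch compares the memory needed for the weights alone at 16-bit and 4-bit precision. The parameter count of roughly 7 billion is an assumption inferred from the model name, and the estimate ignores quantization constants, activations, LoRA weights and optimizer state.

```python
# Rough weight-memory estimate. The 7e9 parameter count is an approximation
# inferred from the model name, not a figure stated in this notebook.
NUM_PARAMS = 7_000_000_000


def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what makes a single Colab GPU plausible for this model.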

## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the latter for its recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, which is required to load Falcon models.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving Test.csv to Test.csv


In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
import pandas as pd
from datasets import Dataset

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('/content/Test.csv')  # Update the path as per your file location
# Take an explicit copy of the first 100 rows so the assignments below
# modify test1 itself and do not trigger pandas' SettingWithCopyWarning
test1 = df[:100].copy()
test1.loc[:, 'Summary'] = ''

# Iterate over each row and assign the value to the 'Summary' column based on the conditions
for index, row in test1.iterrows():
    a = []
    if row['Computer Science'] == 1:
        a.append('Computer Science')
    if row['Mathematics'] == 1:
        a.append('Mathematics')
    if row['Physics'] == 1:
        a.append('Physics')
    if row['Statistics'] == 1:
        a.append('Statistics')
    test1.loc[index, 'Summary'] = ' '.join(a)

# Drop the specified columns from the DataFrame
test1 = test1.drop(columns=['id', 'Computer Science', 'Mathematics', 'Physics', 'Statistics'])

# Convert pandas DataFrame to a dataset
dataset = Dataset.from_pandas(test1)

# Optionally, you can set the dataset name
dataset_name = "my_custom_dataset"

# Save the dataset with a specific name
dataset.save_to_disk(dataset_name)


Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]
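
The row-by-row logic in the cell above can be captured in one small function. The sketch below is pandas-free so it runs anywhere; the column names come from the cell above, while `build_summary` and the two sample rows are hypothetical and made up for illustration.

```python
# Topic columns taken from the cell above.
TOPIC_COLUMNS = ["Computer Science", "Mathematics", "Physics", "Statistics"]


def build_summary(row) -> str:
    """Join the names of all topic columns whose flag equals 1."""
    return " ".join(topic for topic in TOPIC_COLUMNS if row.get(topic) == 1)


# Made-up rows standing in for records from Test.csv.
sample_rows = [
    {"ABSTRACT": "first abstract", "Computer Science": 1, "Mathematics": 0, "Physics": 0, "Statistics": 1},
    {"ABSTRACT": "second abstract", "Computer Science": 0, "Mathematics": 0, "Physics": 1, "Statistics": 0},
]

summaries = [build_summary(row) for row in sample_rows]
print(summaries)
```

Because a pandas row also supports `.get`, the same function should work with `test1.apply(build_summary, axis=1)` in place of the explicit loop, though that variant is not exercised here.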

## Dataset

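This walkthrough was written around the Guanaco dataset, a clean subset of the OpenAssistant dataset adapted to train general-purpose chatbots.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco). Note that the cell below does not load Guanaco; it reloads the custom dataset that was built from `Test.csv` and saved to disk in the Setup section.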

In [None]:
from datasets import load_from_disk

loaded_dataset = load_from_disk(dataset_name)


## Loading the model

In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it to 4-bit and attach LoRA adapters to it. Let's get started!

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
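
The import cell above does not define `model_name` or `model`, although later cells rely on both. The following is a minimal sketch of how they might be set up, not the code that was actually run: the checkpoint name (taken from the model link above), the NF4 quantization type and the float16 compute dtype are all assumptions. The heavy imports live inside the function so the cell can be defined without a GPU.

```python
# Sketch only: the checkpoint name and quantization settings are assumptions,
# not values recorded in this notebook.
model_name = "tiiuae/falcon-7b"

# 4-bit settings kept as plain data so they can be inspected before loading.
quantization_settings = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}


def load_quantized_model(name: str = model_name):
    """Load the base model in 4-bit precision (needs a GPU and bitsandbytes)."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,
        **quantization_settings,
    )
    model = AutoModelForCausalLM.from_pretrained(
        name,
        quantization_config=bnb_config,
        trust_remote_code=True,
    )
    # The generation cache is incompatible with gradient checkpointing.
    model.config.use_cache = False
    return model


# model = load_quantized_model()  # uncomment in a GPU runtime
```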


Let's also load the tokenizer below

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
#not used

Below we will define the configuration used to create the LoRA model. According to the QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. Therefore we will add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the mixed query-key-value layer (`query_key_value`).

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
#not used
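
To see what targeting more linear layers costs, recall that a LoRA adapter of rank `r` on a linear layer adds two small matrices, one of shape `r x in_features` and one of shape `out_features x r`. The sketch below counts the extra parameters for the three dense layers named above. The hidden size of 1024 is a hypothetical value chosen for illustration and is not read from the Falcon configuration; `query_key_value` is left out because its output width depends on the attention layout.

```python
def lora_extra_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Hypothetical layer sizes for illustration; not read from the Falcon config.
hidden = 1024
layers = {
    "dense": (hidden, hidden),
    "dense_h_to_4h": (hidden, 4 * hidden),
    "dense_4h_to_h": (4 * hidden, hidden),
}

rank = 64  # matches lora_r in the cell above
per_layer = {
    name: lora_extra_params(n_in, n_out, rank)
    for name, (n_in, n_out) in layers.items()
}
print(per_layer)
```

The count grows linearly with the rank and with the layer widths, so the two MLP projections dominate the adapter size in this toy setting.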

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [None]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
)
#not used

Then finally pass everything to the trainer.

In [None]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="ABSTRACT",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)
#not used

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [None]:
dataset

Dataset({
    features: ['ABSTRACT', 'Summary'],
    num_rows: 100
})

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [None]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
#not used

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [None]:
trainer.train()
#not used

Step,Training Loss
10,2.8932
20,2.5357
30,2.1143
40,1.6082
50,1.0911
60,0.6387
70,0.3371
80,0.1502
90,0.0813
100,0.0591




TrainOutput(global_step=100, training_loss=1.150897263288498, metrics={'train_runtime': 1770.9515, 'train_samples_per_second': 0.903, 'train_steps_per_second': 0.056, 'total_flos': 1.620691299929088e+16, 'train_loss': 1.150897263288498, 'epoch': 16.0})
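
The `'epoch': 16.0` reported above follows from the training arguments and the dataset size. A quick check, assuming a single GPU so that the per-device batch size equals the total batch size:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100
num_rows = 100  # size of the dataset shown earlier

# Each optimizer step consumes batch_size * accumulation_steps examples.
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = samples_per_step * max_steps
passes_over_data = samples_seen / num_rows
print(samples_per_step, samples_seen, passes_over_data)
```

With only 100 rows, the model sees every example about 16 times, which may partly explain how low the training loss gets by step 100.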

During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [None]:
trained_model_dir='./trained_model'
model.save_pretrained(trained_model_dir)

In [None]:
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

TypeError: FalconForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'
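
The `TypeError` above arises because the tokenizer output contains a `token_type_ids` entry that `FalconForCausalLM.forward()` does not accept. One way around it is to filter that key out before calling the model. The helper name `drop_unsupported_keys` is hypothetical, and the dictionary below is a stand-in for real tokenizer output, whose values would be tensors.

```python
def drop_unsupported_keys(encoded, unsupported=("token_type_ids",)):
    """Return a plain dict without the keys the model's forward() rejects."""
    return {key: value for key, value in encoded.items() if key not in unsupported}


# Stand-in for tokenizer output; real values would be tensors.
fake_inputs = {
    "input_ids": [[1, 2, 3]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}
clean_inputs = drop_unsupported_keys(fake_inputs)
print(sorted(clean_inputs))
```

In the notebook this would be used as `model(**drop_unsupported_keys(inputs), labels=inputs["input_ids"])`. Alternatively, passing `return_token_type_ids=False` to the tokenizer call should avoid producing the key in the first place.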

In [None]:
!rm -rf /content/sample_data

In [None]:
!du -sh /content/my_custom_dataset

120K	/content/my_custom_dataset


In [None]:
!rm -rf /content/results

In [None]:
from peft import (
    LoraConfig,
    PeftConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
text = "The quick brown fox"

# Encode the input text
input_ids = tokenizer.encode(text, return_tensors='pt')

# Generate text using the model
# Adjust parameters like max_length according to your needs
output = model.generate(input_ids, max_length=50)


In [None]:
text = "The quick brown fox"

# Encode the input text
input_ids = tokenizer.encode(text, return_tensors='pt')

# Generate text using the model
# Adjust parameters like max_length according to your needs
output = model.generate(input_ids, max_length=50)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

In [None]:
decoded_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(decoded_text)


The quick brown fox jumped over the lazy dog. a quick brown fox jumped over the lazy dog. -- this was the classic text to detect the copy of a article. inside this paper, we introduce the copycat score as the metric to evaluate
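
The warnings printed by `generate` above ask for an explicit attention mask and pad token id. A minimal sketch of a helper that passes both is shown below; the name `generate_text` is hypothetical, and it uses `max_new_tokens` (which counts only generated tokens) instead of the `max_length` used above, so output lengths will differ slightly.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Generate a continuation, passing the attention mask and pad token explicitly."""
    # Calling the tokenizer (rather than tokenizer.encode) also returns the attention mask.
    encoded = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)
    output = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the notebook's objects this would be called as `print(generate_text(model, tokenizer, "The quick brown fox"))`. Depending on where the model lives, the encoded tensors may also need to be moved to the model's device first.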


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp trained_model.zip /content/drive/My\ Drive/your_destination_folder/

cp: cannot stat 'trained_model.zip': No such file or directory


In [None]:
!zip -r trained_model.zip /content/trained_model

  adding: content/trained_model/ (stored 0%)
  adding: content/trained_model/.ipynb_checkpoints/ (stored 0%)
  adding: content/trained_model/model-00001-of-00002.safetensors (deflated 20%)
  adding: content/trained_model/model-00002-of-00002.safetensors (deflated 10%)
  adding: content/trained_model/model.safetensors.index.json (deflated 97%)
  adding: content/trained_model/config.json (deflated 55%)
  adding: content/trained_model/generation_config.json (deflated 20%)


In [None]:
!cp trained_model.zip /content/drive/My\ Drive/your_destination_folder/

cp: cannot create regular file '/content/drive/My Drive/your_destination_folder/': Not a directory
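
Both `cp` errors above have simple causes: the first copy ran before the archive had been created, and the second targeted a Drive folder that did not exist. A small sketch that guards against both cases is shown below; the helper name `copy_to_folder` is hypothetical.

```python
import os
import shutil


def copy_to_folder(src_path: str, dest_dir: str) -> str:
    """Copy src_path into dest_dir, creating dest_dir first if it is missing."""
    if not os.path.isfile(src_path):
        raise FileNotFoundError(f"{src_path} does not exist yet; create the archive first")
    os.makedirs(dest_dir, exist_ok=True)
    # shutil.copy returns the path of the newly created file.
    return shutil.copy(src_path, dest_dir)
```

After the `zip` cell has run and Drive is mounted, this could be called as `copy_to_folder("trained_model.zip", "/content/drive/My Drive/your_destination_folder")`.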


In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
import pandas as pd
from datasets import Dataset

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('/content/Test.csv')  # Update the path as per your file location
test1=df
test1.loc[:, 'Summary'] = ''

# Iterate over each row and assign the value to the 'Summary' column based on the conditions
for index, row in test1.iterrows():
    a = []
    if row['Computer Science'] == 1:
        a.append('Computer Science')
    if row['Mathematics'] == 1:
        a.append('Mathematics')
    if row['Physics'] == 1:
        a.append('Physics')
    if row['Statistics'] == 1:
        a.append('Statistics')
    test1.loc[index, 'Summary'] = ' '.join(a)
# Drop the specified columns from the DataFrame
test1 = test1.drop(columns=['id', 'Computer Science', 'Mathematics', 'Physics', 'Statistics'])

# Convert pandas DataFrame to a dataset
dataset = Dataset.from_pandas(test1)

# Optionally, you can set the dataset name
dataset_name = "my_custom_dataset"

# Save the dataset with a specific name
dataset.save_to_disk(dataset_name)


Saving the dataset (0/1 shards):   0%|          | 0/6002 [00:00<?, ? examples/s]

In [None]:
from datasets import load_from_disk

loaded_dataset = load_from_disk(dataset_name)


In [None]:
loaded_dataset=loaded_dataset.train_test_split(test_size=0.2)

In [None]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["ABSTRACT"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["Summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
tokenized_loaded_dataset= loaded_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/4801 [00:00<?, ? examples/s]

Map:   0%|          | 0/1201 [00:00<?, ? examples/s]

In [None]:
tokenized_loaded_dataset

DatasetDict({
    train: Dataset({
        features: ['ABSTRACT', 'Summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 4801
    })
    test: Dataset({
        features: ['ABSTRACT', 'Summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1201
    })
})

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)


In [None]:
!pip install -q transformers datasets evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
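
To make the ROUGE-1 number reported by `compute_metrics` more concrete, here is a deliberately simplified sketch of a unigram-overlap F1 score. It is not the `rouge_score` implementation used by `evaluate` (there is no stemming and tokenization is plain whitespace splitting), and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over overlapping lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("computer science", "computer science statistics"))
```

Since the `Summary` targets in this notebook are only a few words long, a single missing or extra topic name moves such a score noticeably.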

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_billsum_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_loaded_dataset["train"],
    eval_dataset=tokenized_loaded_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,0.436781,0.7646,0.3127,0.7653,0.7655,2.562




In [None]:
trainer.push_to_hub()

events.out.tfevents.1710879591.06daa4d29145.315.3:   0%|          | 0.00/8.03k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.05k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Agastaya/my_awesome_billsum_model/commit/b64d94bf8f316459fbd3d613837d80ed4c86c39d', commit_message='End of training', commit_description='', oid='b64d94bf8f316459fbd3d613837d80ed4c86c39d', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
text ="this is an article about computer science."
from transformers import pipeline

summarizer = pipeline("summarization", model="Agastaya/my_awesome_billsum_model")
summarizer(text)

config.json:   0%|          | 0.00/1.51k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.7k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Your max_length is set to 200, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


[{'summary_text': 'this article is an article about computer science . it is based on a series of articles on computer science and computer science in the u.s.'}]

In [None]:
text ="this is an article about computer science."
from transformers import pipeline

summarizer = pipeline("summarization", model="Agastaya/my_awesome_billsum_model")
summarizer(text)

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

Your max_length is set to 200, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


[{'summary_text': 'This is an article about computer science . a computer science course based on a symphonic aficion'}]

In [None]:
test1=pd.read_csv('/content/Test.csv')

Unnamed: 0,id,ABSTRACT,Computer Science,Mathematics,Physics,Statistics
0,9409,fundamental frequency (f0) approximation from ...,0,0,0,1
1,17934,"this large-scale study, consisting of 24.5 mil...",1,0,0,1
2,16071,we present a stability analysis of the plane c...,0,0,1,0
3,16870,we construct finite time blow-up solutions to ...,0,1,0,0
4,10496,planetary nebulae (pne) constitute an importan...,0,0,1,0
...,...,...,...,...,...,...
5997,11506,a first step inside constructing the machine l...,0,0,0,1
5998,3418,a focus of this paper was to quantify measures...,1,0,0,0
5999,7369,as autonomous vehicles become an every-day rea...,1,0,0,0
6000,8421,a hamiltonian monte carlo (hmc) method has bee...,0,0,0,1
