## Domain-specific fine-tuning of the Falcon-7B model using QLoRA, HF PEFT and bitsandbytes

Fine-tuning large language models (LLMs) lets you adapt open-source foundation models to perform better on your domain-specific tasks. In this notebook, we use Amazon SageMaker to fine-tune the open-source Falcon-7B model. We use Hugging Face's parameter-efficient fine-tuning (PEFT) library together with quantization through bitsandbytes to make fine-tuning of very large models feasible. The technique we use is Quantized LLMs with Low-Rank Adapters (QLoRA), an efficient fine-tuning approach that reduces the memory usage of LLMs while maintaining solid performance.

For this lab, we use the [Medical Q&A](https://huggingface.co/datasets/medalpaca/medical_meadow_medical_flashcards) dataset from Hugging Face.

> **Note** This link leads to a Third-Party Dataset. AWS does not own, nor does it have any control over the Third-Party Dataset. You should perform your own independent assessment, and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the Third-Party Dataset. AWS does not make any representations or warranties that the Third-Party Dataset is secure, virus-free, accurate, operational, or compatible with your own environment and standards. AWS does not make any representations, warranties or guarantees that any information in the Third-Party Dataset will result in a particular outcome or result.

This notebook was tested in Amazon SageMaker Studio with `Python 3 (Data Science 3.0)` kernel and a `ml.g5.12xlarge` instance, and in an Amazon SageMaker Notebook instance with `conda_python3` kernel in a `ml.g5.12xlarge` instance.

Install the required libraries. To load the model in 4-bit, install the Hugging Face libraries accelerate, transformers, and PEFT from source, as well as a recent version of bitsandbytes (pinned to 0.39.1 below).

In [1]:
%env PIP_DISABLE_PIP_VERSION_CHECK True
%env PIP_ROOT_USER_ACTION ignore

%pip install -q -U torch==2.0.1 bitsandbytes==0.39.1 --root-user-action=ignore
%pip install -q -U datasets py7zr einops tensorboardX --root-user-action=ignore
%pip install -q -U git+https://github.com/huggingface/transformers.git@850cf4af0ce281d2c3e7ebfc12e0bc24a9c40714 --root-user-action=ignore
%pip install -q -U git+https://github.com/huggingface/peft.git@e2b8e3260d3eeb736edf21a2424e89fe3ecf429d --root-user-action=ignore 
%pip install -q -U git+https://github.com/huggingface/accelerate.git@b76409ba05e6fa7dfc59d50eee1734672126fdba --root-user-action=ignore

env: PIP_DISABLE_PIP_VERSION_CHECK=True
env: PIP_ROOT_USER_ACTION=ignore
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


Restart the kernel before proceeding so that the new installations take effect.
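After the restart, it can be useful to confirm which versions are actually visible to the kernel. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is our own, and the package list is simply taken from the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Distribution names taken from the pip commands above
required = ["torch", "bitsandbytes", "transformers", "peft", "accelerate", "datasets"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Packages installed from a git commit usually report a development version string, so the check is mainly useful for spotting a missing or stale install.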

Next, we point the library path at the CUDA runtime that was installed as a dependency of PyTorch. This step is required for the bitsandbytes library to find and load the correct CUDA shared object binary.

In [2]:
# Add installed cuda runtime to path for bitsandbytes
import os
import nvidia

cuda_install_dir = '/'.join(nvidia.__file__.split('/')[:-1]) + '/cuda_runtime/lib/'
os.environ['LD_LIBRARY_PATH'] =  cuda_install_dir
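Note that the cell above overwrites any existing `LD_LIBRARY_PATH` value. If your environment already relies on that variable, a variant that prepends the directory may be safer. This is only a sketch: the helper `prepend_to_path_var` and the example directory are hypothetical, not part of the original notebook.

```python
import os


def prepend_to_path_var(current, new_dir):
    """Return an os.pathsep-separated path list with new_dir first and no duplicates."""
    parts = [p for p in (current or "").split(os.pathsep) if p and p != new_dir]
    return os.pathsep.join([new_dir] + parts)


# Hypothetical directory used only for illustration
example_dir = "/opt/example/nvidia/cuda_runtime/lib/"
print(prepend_to_path_var(os.environ.get("LD_LIBRARY_PATH"), example_dir))
```

In the notebook, you would pass `cuda_install_dir` instead of the example directory and assign the result back to `os.environ['LD_LIBRARY_PATH']`.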

In [3]:
import torch
import bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so


  warn(msg)


CUDA SETUP: CUDA runtime path found: /home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/nvidia/cuda_runtime/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...


To train our model, we need to convert our inputs (text) to token IDs. This is done by a Hugging Face Transformers tokenizer. Following the QLoRA approach, we use bitsandbytes to quantize the frozen Falcon-7B model to 4-bit precision and attach LoRA adapters to it. Quantizing to 4-bit lets us load the model into memory on 4 A10G GPUs using Hugging Face Accelerate's native pipeline parallelism. QLoRA tuning is shown to match 16-bit fine-tuning methods in a wide range of experiments because model weights are stored as NF4 (4-bit NormalFloat), but are dequantized to the bfloat16 compute dtype on the forward and backward passes as needed. We use NF4 based on recommendations from the QLoRA paper.

Another option is bnb_4bit_use_double_quant, which applies a second quantization to the quantization constants produced by the first one, saving roughly an additional 0.4 bits per parameter. Although bitsandbytes stores the weights in 4 bits, the computation still happens in 16-bit or 32-bit precision, and any compute dtype (float16, bfloat16, float32, etc.) can be chosen.
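To see where the saving from double quantization comes from, here is a back-of-the-envelope calculation. The block and group sizes are assumptions based on the setup described in the QLoRA paper, not values read from this notebook, and the parameter count is a rough round number.

```python
# Back-of-the-envelope storage cost per parameter.
# Assumptions (from the QLoRA paper's setup, not read from this notebook):
#   - weights are quantized in blocks of 64 with one 32-bit constant per block
#   - double quantization stores those constants in 8 bits, in groups of 256,
#     with one 32-bit constant per group
BLOCK_SIZE = 64
GROUP_SIZE = 256

overhead_single = 32 / BLOCK_SIZE
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * GROUP_SIZE)
saving = overhead_single - overhead_double


def weight_gib(n_params, bits_per_param):
    """Approximate weight storage in GiB for a given bit width."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # rough, hypothetical parameter count for a "7B" model

print(f"overhead without double quantization: {overhead_single:.3f} bits/param")
print(f"overhead with double quantization:    {overhead_double:.3f} bits/param")
print(f"saving:                               {saving:.3f} bits/param")
print(f"16-bit weights:            {weight_gib(N_PARAMS, 16):.1f} GiB")
print(f"NF4 + double quantization: {weight_gib(N_PARAMS, 4 + overhead_double):.1f} GiB")
```

Under these assumptions the saving works out to about 0.37 bits per parameter, which is in line with the rounded figure quoted above.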

Matrix multiplication and training are faster with a 16-bit compute dtype (the default is torch.float32). We use BitsAndBytesConfig from transformers to set these parameters. The example below loads the model in 4-bit using NF4 quantization with double quantization and a bfloat16 compute dtype for faster training.

When loading the pretrained weights, we specify `device_map="auto"` so that Hugging Face Accelerate automatically determines which GPU to place each layer of the model on. This process is known as model parallelism.

In [4]:
model_id = "tiiuae/falcon-7b"

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let Accelerate place layers across the available GPUs
    trust_remote_code=True,
)

Downloading (…)okenizer_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

Downloading (…)figuration_falcon.py:   0%|          | 0.00/7.16k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.




Downloading (…)n/modeling_falcon.py:   0%|          | 0.00/56.9k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Downloading (…)model.bin.index.json:   0%|          | 0.00/16.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]
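Once loading finishes, the placement chosen by `device_map="auto"` can be inspected through the model's `hf_device_map` attribute, a dictionary from module names to devices. The sketch below summarizes such a mapping; the helper name and the example dictionary are hypothetical and only illustrate the shape of the data.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many entries of a {module_name: device} mapping sit on each device."""
    return Counter(str(device) for device in device_map.values())


# Hypothetical mapping for illustration; with the model loaded above you would
# pass model.hf_device_map instead.
example_map = {
    "transformer.word_embeddings": 0,
    "transformer.h.0": 0,
    "transformer.h.1": 1,
    "transformer.h.2": 1,
    "lm_head": 1,
}
for device, count in sorted(summarize_device_map(example_map).items()):
    print(f"device {device}: {count} modules")
```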

With Hugging Face’s PEFT library, you can freeze most of the original model weights and replace or extend model layers by training an additional, much smaller, set of parameters. This makes training much less expensive in terms of required compute. We set the Falcon modules that we want to fine-tune as target_modules in the LoRA configuration:

In [5]:
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

    
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2359296 || all params: 3611104128 || trainable%: 0.06533447711203746


Notice that we’re only fine-tuning a small percentage of the model’s parameters, which makes this feasible in a reasonable amount of time.
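The trainable parameter count reported above can be reproduced by hand. The sketch below assumes Falcon-7B's dimensions (hidden size 4544, head dimension 64, 32 decoder layers, multi-query attention); these values are not printed anywhere in this notebook, so treat them as assumptions that happen to match the reported number.

```python
# Assumed Falcon-7B dimensions (not printed anywhere in this notebook)
HIDDEN_SIZE = 4544                    # input width of each query_key_value layer
HEAD_DIM = 64
QKV_OUT = HIDDEN_SIZE + 2 * HEAD_DIM  # multi-query: all query heads + one key head + one value head
NUM_LAYERS = 32
RANK = 8                              # r in the LoraConfig above


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per adapted layer."""
    return rank * (in_features + out_features)


total = NUM_LAYERS * lora_params(HIDDEN_SIZE, QKV_OUT, RANK)
reported_all_params = 3611104128      # "all params" value printed above

print(f"LoRA trainable params: {total}")
print(f"trainable%: {100 * total / reported_all_params:.4f}")
```

The "all params" figure is well below 7 billion, most likely because the 4-bit weights are stored packed, so each stored element holds more than one weight. The true share of trainable parameters is therefore probably even smaller than the printed percentage.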

### Loading dataset

For the purpose of the demo, we load only the first 10% of the dataset as the training dataset and the last 5% as the evaluation dataset.

In [6]:
import datasets
from datasets import load_dataset

# Alternative, larger split (first 20% for training, last 10% for evaluation):
# datasets.ReadInstruction('train', to=20, unit='%'),
# datasets.ReadInstruction('train', from_=-10, unit='%')

dataset_split = [
    datasets.ReadInstruction('train', to=10, unit='%'),
    datasets.ReadInstruction('train', from_=-5, unit='%')
]

train_dataset, eval_dataset = load_dataset("medalpaca/medical_meadow_medical_flashcards", split=dataset_split)
print(train_dataset.shape)
print(eval_dataset.shape)

Downloading readme:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/17.7M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

(3396, 3)
(1698, 3)


Create a prompt template, apply it to every sample, and print a random sample to check the result.

In [7]:
from random import randint

# custom instruction prompt; the placeholders are filled in by str.format below
prompt_template = "Answer this question truthfully:\n{question}\n---\nAnswer:\n{answer}{eos_token}"

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(question=sample["input"],
                                            answer=sample["output"],
                                            eos_token=tokenizer.eos_token)
    return sample


# apply prompt template per sample
train_dataset = train_dataset.map(template_dataset, remove_columns=list(train_dataset.features))
eval_dataset = eval_dataset.map(template_dataset, remove_columns=list(eval_dataset.features))

# randint is inclusive on both ends, so the upper bound must be len - 1
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

Map:   0%|          | 0/3396 [00:00<?, ? examples/s]

Map:   0%|          | 0/1698 [00:00<?, ? examples/s]

Answer this question truthfully:
What is mesenchyme, and what is its role in embryonic development?
---
Answer:
Mesenchyme is a type of embryonic connective tissue that plays a crucial role in the development of various organs and tissues in the body. During embryonic development, mesenchymal cells migrate to different parts of the body and differentiate into various types of cells, such as bone, cartilage, muscle, and blood vessels. Mesenchyme also provides structural support and helps to shape and organize developing tissues and organs. In addition to its role in embryonic development, mesenchyme can also contribute to tissue repair and regeneration in adults.<|endoftext|>


In [8]:
# tokenize the templated text (no chunking is applied here)
lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, batch_size=48, remove_columns=list(train_dataset.features)
)

lm_eval_dataset = eval_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, batch_size=48, remove_columns=list(eval_dataset.features)
)

# Print total number of samples
print(f"Total number of train samples: {len(lm_train_dataset)}")
print(f"Total number of evaluation samples: {len(lm_eval_dataset)}")

Map:   0%|          | 0/3396 [00:00<?, ? examples/s]

Map:   0%|          | 0/1698 [00:00<?, ? examples/s]

Total number of train samples: 3396
Total number of evaluation samples: 1698


### Training

Use the Hugging Face Trainer class to fine-tune the model. Define the hyperparameters we want to use. We also create a DataCollator that will take care of padding our inputs and labels.
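To make the padding step concrete, here is a simplified, pure-Python sketch of what a causal language modeling collator does with `mlm=False`: it pads every sequence in the batch to the same length and builds labels as a copy of the inputs, with positions that hold the padding ID replaced by -100 so the loss ignores them. This is an illustration under the assumption of right padding, not the library's implementation, and the token IDs are made up.

```python
def collate_for_causal_lm(batch, pad_token_id, ignore_index=-100):
    """Pad token-id lists to a common length and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad  # right padding (assumed)
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every position holding the pad id is ignored by the loss
        labels.append([ignore_index if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


PAD = 11  # hypothetical pad/EOS id for illustration
batch = collate_for_causal_lm([[5, 6, 7, PAD], [8, 9]], pad_token_id=PAD)
print(batch["input_ids"])
print(batch["labels"])
```

One consequence worth knowing: when the padding token is set to the EOS token, masking by token ID also hides the genuine end-of-sequence token from the loss, as the first sequence in the example shows.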

In [9]:
import transformers

# The data collator needs a padding token; reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
# We set num_train_epochs=1 simply to run a demonstration

trainer = transformers.Trainer(
    model=model,
    train_dataset=lm_train_dataset,
    eval_dataset=lm_eval_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=16,
        per_device_eval_batch_size=8,
        logging_steps=100,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True, # For g5 instances
        # fp16=True, # for g4 instances
        save_strategy = "no",
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [10]:
#start training
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,1.0634
200,0.9557


TrainOutput(global_step=213, training_loss=1.0083278109769866, metrics={'train_runtime': 752.4221, 'train_samples_per_second': 4.513, 'train_steps_per_second': 0.283, 'total_flos': 1.3465563254728704e+16, 'train_loss': 1.0083278109769866, 'epoch': 1.0})
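The step count and throughput in this output follow directly from the dataset size and batch size. A quick check, assuming a single training process and no gradient accumulation (both are assumptions about this run):

```python
import math

num_train_samples = 3396    # from the dataset shape printed earlier
batch_size = 16             # per_device_train_batch_size
train_runtime_s = 752.4221  # from the TrainOutput above

steps_per_epoch = math.ceil(num_train_samples / batch_size)
samples_per_second = num_train_samples / train_runtime_s

print(f"steps per epoch: {steps_per_epoch}")
print(f"samples per second: {samples_per_second:.3f}")
```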

In [11]:
#evaluate and return the metrics
trainer.evaluate()

{'eval_loss': 1.131762146949768,
 'eval_runtime': 83.9709,
 'eval_samples_per_second': 20.221,
 'eval_steps_per_second': 2.537,
 'epoch': 1.0}
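An evaluation loss is easier to interpret as perplexity. Assuming `eval_loss` is the mean per-token cross-entropy in nats, perplexity is simply its exponential:

```python
import math

eval_loss = 1.131762146949768  # value reported by trainer.evaluate() above
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.2f}")
```

Lower is better; a perplexity of 1 would mean the model predicts every token of the evaluation set with certainty.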

In [12]:
#save model to use it for inference
model.save_pretrained("qlora-finetuned-model")

### Load saved adapters

In [13]:
from peft import PeftConfig, PeftModel

In [14]:
peft_model_id = "qlora-finetuned-model"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token


model = PeftModel.from_pretrained(model, peft_model_id)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [15]:
orig_model_id = "tiiuae/falcon-7b"
orig_tokenizer = AutoTokenizer.from_pretrained(orig_model_id)
# Load the original model without quantization for comparison
orig_model = AutoModelForCausalLM.from_pretrained(orig_model_id, device_map="auto", trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### Inference using fine-tuned model and original model

Set the generation parameters used for inference.

In [16]:
generation_config = model.generation_config
generation_config.max_new_tokens = 50
# temperature and top_p only take effect when sampling is enabled (do_sample=True)
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config.repetition_penalty = 1.1

In [17]:
orig_generation_config = orig_model.generation_config
orig_generation_config.max_new_tokens = 50
# temperature and top_p only take effect when sampling is enabled (do_sample=True)
orig_generation_config.temperature = 0.7
orig_generation_config.top_p = 0.7
orig_generation_config.num_return_sequences = 1
orig_generation_config.pad_token_id = tokenizer.eos_token_id
orig_generation_config.eos_token_id = tokenizer.eos_token_id
orig_generation_config.repetition_penalty = 1.1
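For intuition about the `top_p` setting, here is a toy sketch of nucleus (top-p) filtering over a made-up next-token distribution. It is not the transformers implementation: it keeps the most probable tokens until their cumulative probability reaches `top_p`, then renormalizes. The token names and probabilities are invented for illustration.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the token that crosses the threshold is still kept
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Invented distribution for illustration
next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_p_filter(next_token_probs, top_p=0.7)
print(filtered)
```

With `top_p=0.7`, only the two most likely tokens survive in this example, and sampling would then draw from the renormalized pair.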

Interact with the model by using the following prompt.

In [18]:
%%time
prompt = f"""
Answer this question truthfully:
What are the treatments for ARDS?
---
Answer:
""".strip()

encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )

CPU times: user 8.45 s, sys: 5.27 ms, total: 8.45 s
Wall time: 8.45 s


In [19]:
def generate_response(question, model):
    prompt = f"""Answer this question truthfully:
    {question}
    ---
    Answer:
    """.strip()
    encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(outputs[0],skip_special_tokens=True)
    
    answer_start = 'Answer:'
    response_start = response.find(answer_start)
    return response[response_start + len(answer_start):].strip()

In [20]:
prompt = "What are the treatments for ARDS?"
print(generate_response(prompt, model))

The treatments for ARDS include mechanical ventilation, oxygen therapy, and prone positioning. Mechanical ventilation is used to provide support for breathing and oxygen therapy is used to deliver supplemental oxygen to the lungs. Prone positioning is a specific position in which the patient


In [21]:
prompt = "What are the treatments for ARDS?"
print(generate_response(prompt, orig_model))

1. Mechanical ventilation
    Answer: 2. Corticosteroids
    Answer: 3. Antibiotics
    Answer: 4. Extracorporeal membrane oxygenation
    Answer: 5. Lung protective ventilation

The answer is


In [24]:
prompt = "What does low mobility and bulging of TM suggest?"
print(generate_response(prompt, model))

Low mobility and bulging of TM suggest a high likelihood of a parotid abscess. This is a condition in which the parotid gland becomes inflamed and infected, causing swelling and pain in the affected area. Parotid abscesses are typically caused by


In [25]:
prompt = "What does low mobility and bulging of TM suggest?"
print(generate_response(prompt, orig_model))

(A) TMJ arthritis
    Answer: (B) TMJ ankylosis
    Answer: (C) TMJ dislocation
    Answer: (D) TMJ fracture
    Answer: (E) TMJ fracture
