# Instruction Finetuning using QLoRA

In this notebook, we look at how to perform instruction finetuning with QLoRA, a parameter-efficient finetuning (PEFT) method. The task is supervised finetuning (SFT) of Qwen2.5-Coder-7B-Instruct for function calling.

Load the required libraries

In [1]:
import os
os.environ["WANDB_PROJECT"]="codellama_instruct_finetuning"

from enum import Enum
from functools import partial
import pandas as pd
import torch
import json

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig, set_seed
from datasets import load_dataset
from trl import SFTTrainer
from peft import get_peft_model, LoraConfig, TaskType

seed = 42
set_seed(seed)

## Data preprocessing

In [2]:
model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
dataset_name = "heegyu/glaive-function-calling-v2-formatted"
tokenizer = AutoTokenizer.from_pretrained(model_name)
template = """{% for message in messages %}\n{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% if loop.last and add_generation_prompt %}{{'<|im_start|>assistant\n' }}{% endif %}{% endfor %}"""
tokenizer.chat_template = template

def preprocess(samples):
    batch = []
    for system_prompt, function_desc, conversation in zip(samples["system_message"], samples["function_description"], samples["conversations"]):
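        try:
            # Wrap the raw description in brackets so it parses as a JSON list, then pretty-print it
            function_desc_formatted = json.dumps(json.loads(f"[{function_desc}]"), indent=2, sort_keys=True)
        except json.JSONDecodeError:
            # Fall back to the raw string when the description is not valid JSON
            function_desc_formatted = f"[{function_desc}]"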
        system_message = {"role": "system", "content": f"{system_prompt}\nfunctions: {function_desc_formatted}"}
        conversation.insert(0, system_message)
        batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    return {"content": batch}

dataset = load_dataset(dataset_name)
dataset = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names
)
dataset = dataset["train"].train_test_split(0.1)
print(dataset)
print(dataset["train"][0])

DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 101664
    })
    test: Dataset({
        features: ['content'],
        num_rows: 11296
    })
})
{'content': '<|im_start|>system\nYou are a helpful assistant, with no access to external functions.\nfunctions: []<|im_end|>\n<|im_start|>user\nConsider the following equation with the added constraint that x must be a prime number: \n1 + 1 = x\nWhat is the value of x?<|im_end|>\n<|im_start|>assistant\nThe equation 1 + 1 = x has a solution of x = 2. However, with the added constraint that x must be a prime number, the only solution is x = 2.<|im_end|>\n<|im_start|>user\nCan you explain how encryption works?<|im_end|>\n<|im_start|>assistant\nEncryption is the process of converting plaintext into ciphertext, which is a scrambled version of the original message. This is done by using a mathematical algorithm and a key to encode the information in a way that can only be decoded with the corresponding key.<|im_
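
To make the preprocessing step concrete, here is a minimal, tokenizer-free sketch of what `preprocess` produces for one sample. `format_functions` and `render_chatml` are hypothetical helpers written only for illustration; they mirror the ChatML template above, assuming the tokenizer's template engine trims the newline that follows the `{% for %}` tag (which is consistent with the printed sample). The example function description is made up.

```python
import json

def format_functions(function_desc: str) -> str:
    """Pretty-print a raw function description as a JSON list, or fall back to the raw text."""
    wrapped = f"[{function_desc}]"
    try:
        return json.dumps(json.loads(wrapped), indent=2, sort_keys=True)
    except json.JSONDecodeError:
        return wrapped

def render_chatml(messages, add_generation_prompt=False):
    """Render messages as <|im_start|>role\\ncontent<|im_end|>\\n blocks (illustrative only)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # The template only emits the assistant header after the last message
    if add_generation_prompt and messages:
        text += "<|im_start|>assistant\n"
    return text

# Hypothetical function description, not taken from the dataset
functions = format_functions('{"name": "get_time", "parameters": {}}')
conversation = [
    {"role": "system", "content": f"You are a helpful assistant.\nfunctions: {functions}"},
    {"role": "user", "content": "What time is it?"},
]
rendered = render_chatml(conversation, add_generation_prompt=True)
print(rendered)
```

An empty description becomes `[]`, which matches the `functions: []` line in the sample printed above.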

In [13]:
from datasets import DatasetDict

# Rename the column to "text", the default text field that SFTTrainer reads
dataset = dataset.rename_columns({"content": "text"})

# Verify the change
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 101664
    })
    test: Dataset({
        features: ['text'],
        num_rows: 11296
    })
})


In [19]:
print(len(dataset["train"]))

101664


## Create the PEFT model

### LoRA Config

In [6]:
peft_config = LoraConfig(r=8,
                         lora_alpha=16,
                         lora_dropout=0.1,
                         target_modules=["gate_proj","q_proj","lm_head","o_proj","k_proj","embed_tokens","down_proj","up_proj","v_proj"],
                         task_type=TaskType.CAUSAL_LM)
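
As a rough guide to what `r=8` means in practice, the sketch below counts the parameters that a rank-`r` adapter adds to a single layer. It assumes the standard LoRA formulation (two low-rank factors of shapes `r x in_features` and `out_features x r`, scaled by `lora_alpha / r`); the layer shapes are copied from the model printout later in this notebook, and `lora_params` is a hypothetical helper, not a PEFT function.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)

r, lora_alpha = 8, 16
scaling = lora_alpha / r  # factor applied to the low-rank update

# (in_features, out_features) for the linear layers of one decoder block
layer_shapes = {
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
    "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944),
    "down_proj": (18944, 3584),
}

per_layer = sum(lora_params(i, o, r) for i, o in layer_shapes.values())
embed = lora_params(151672, 3584, r)  # embedding adapter: vocabulary rows x hidden size

print(f"scaling factor: {scaling}")
print(f"adapter params per decoder layer: {per_layer:,}")
print(f"adapter params for embed_tokens: {embed:,}")
```

The count covers only the low-rank factors themselves; it says nothing about how the adapter checkpoint is stored on disk.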

### bitsandbytes 4-bit quantization config

In [7]:
bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
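
The effect of `bnb_4bit_use_double_quant=True` can be estimated with a little arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (one 32-bit scaling constant per 64 weights, and, with double quantization, 8-bit constants plus one 32-bit constant per 256 blocks); these are assumptions about the defaults, and `nf4_bits_per_param` is a hypothetical helper rather than a bitsandbytes API.

```python
def nf4_bits_per_param(double_quant: bool, block_size: int = 64, dq_block_size: int = 256) -> float:
    """Approximate storage cost per weight for 4-bit blockwise quantization."""
    weight_bits = 4.0
    if not double_quant:
        # One 32-bit scaling constant shared by each block of weights
        return weight_bits + 32 / block_size
    # Scaling constants stored in 8 bits, plus a 32-bit constant per group of blocks
    return weight_bits + 8 / block_size + 32 / (block_size * dq_block_size)

for flag in (False, True):
    print(f"double_quant={flag}: {nf4_bits_per_param(flag):.3f} bits per parameter")
```

Under these assumptions, double quantization trims the overhead of the scaling constants from about 0.5 to about 0.127 bits per weight.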

In [8]:
class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    function_call = "<|im_start|>function-call"
    function_response = "<|im_start|>function-response"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]

tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        pad_token=ChatmlSpecialTokens.pad_token.value,
        bos_token=ChatmlSpecialTokens.bos_token.value,
        eos_token=ChatmlSpecialTokens.eos_token.value,
        additional_special_tokens=ChatmlSpecialTokens.list(),
        trust_remote_code=True
    )
tokenizer.chat_template = template

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             attn_implementation="flash_attention_2")
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)


Embedding(151672, 3584)
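
The `Embedding(151672, 3584)` shape reflects `pad_to_multiple_of=8`: the embedding matrix is resized to the tokenizer length rounded up to the next multiple of 8. A minimal sketch of that rounding follows; `pad_to_multiple` is a hypothetical helper, and the input lengths are illustrative values rather than the actual `len(tokenizer)`.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

# Illustrative lengths only: any value from 151665 to 151672 rounds to 151672
for n in (151665, 151672, 151673):
    print(n, "->", pad_to_multiple(n, 8))
```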

In [9]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151672, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear4bit(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear4bit(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), 

## Training

In [10]:
output_dir = "Qwen2.5-Coder-7B_function_calling_instruct"
per_device_train_batch_size = 2
per_device_eval_batch_size = 2
gradient_accumulation_steps = 4
logging_steps = 5
learning_rate = 5e-4
max_grad_norm = 1.0
num_train_epochs=1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
max_seq_length = 2048

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy="no",
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    bf16=True,
    report_to=["tensorboard", "wandb"],
    hub_private_repo=True,
    push_to_hub=True,
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)




In [22]:
# Shuffle and select subsets
train_subset = dataset['train'].shuffle(seed=42).select(range(5000))
test_subset = dataset['test'].shuffle(seed=42).select(range(500))

# Create a new DatasetDict with the subsets
subset_dataset = DatasetDict({
    'train': train_subset,
    'test': test_subset
})
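
With the subset size fixed, the training schedule can be worked out by hand. The sketch below assumes a single GPU, no packing, and the usual linear-warmup plus cosine-decay shape for `lr_scheduler_type="cosine"`; `lr_at` is a hypothetical helper that approximates the scheduler and is not a Trainer API.

```python
import math

train_examples = 5000
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 1
learning_rate = 5e-4
warmup_ratio = 0.1

# One optimizer step consumes batch_size * accumulation_steps examples
batches_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer steps: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step}: lr = {lr_at(step):.2e}")
```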

In [23]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=subset_dataset["train"],
    eval_dataset=subset_dataset["test"],
    tokenizer=tokenizer,
    # packing=True,
    # dataset_text_field="content",
    # max_seq_length=max_seq_length,
    peft_config=peft_config,
    # dataset_kwargs={
    #     "append_concat_token": False,
    #     "add_special_tokens": False,
    # },
)


In [24]:
trainer.train()
trainer.save_model()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.4847        | 0.31717         |
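
One way to read the validation loss is as a perplexity. The short sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not confirmed by the output itself.

```python
import math

eval_loss = 0.31717  # validation loss reported above
perplexity = math.exp(eval_loss)  # exp of mean cross-entropy, under the stated assumption
print(f"perplexity = {perplexity:.3f}")
```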




In [25]:
!nvidia-smi

Tue Mar 11 06:59:49 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:A1:00.0 Off |                  Off |
| 30%   33C    P8             23W /  300W |   27472MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                



## Loading the trained model and generating predictions

In [26]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
import torch

bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

peft_model_id = "badribn/Qwen2.5-Coder-7B_function_calling_instruct"
device = "cuda"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             attn_implementation="flash_attention_2")
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
model = PeftModel.from_pretrained(model, peft_model_id)
# model.to(torch.bfloat16)
# model.cuda()
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): lora.Embedding(
          (base_layer): Embedding(151672, 3584)
          (lora_dropout): ModuleDict(
            (default): Dropout(p=0.1, inplace=False)
          )
          (lora_A): ModuleDict()
          (lora_B): ModuleDict()
          (lora_embedding_A): ParameterDict(  (default): Parameter containing: [torch.cuda.FloatTensor of size 8x151672 (cuda:0)])
          (lora_embedding_B): ParameterDict(  (default): Parameter containing: [torch.cuda.FloatTensor of size 3584x8 (cuda:0)])
          (lora_magnitude_vector): ModuleDict()
        )
        (layers): ModuleList(
          (0-27): 28 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3584, out_features=3584, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dr

In [27]:
system_prompt = """You are a helpful assistant with access to the following functions. You do a reasoning step before acting. Use the functions if required -
functions: [{
    "name": "get_current_location",
    "description": "Returns the current location. ONLY use this if the user has not provided an explicit location in the query.",
    "parameters": {}
},
{
    "name": "search",
    "description": "A search engine. 
    Useful for when you need to answer questions and provide information about real-time updates, current events and latest NEWS.
    Useful to answer general knowledge questions and provide information about people, places, companies, facts, historical events, or other subjects.
    Input should be a search query."
    "parameters": {
        "type": "object",
        "properties": {
            "search_query": {
                "type": "array",
                "items": {
                    "type": "string"
                },
                "description": "The search query"
            }
        },
        "required": [
            "search_query"
        ]
    }
},
{
    "name": "code_analysis",
    "description": "Useful when the user query can be solved by writing Python code. 
    This function generates high quality Python code and runs it to solve the user query and provide the output.
    Useful when user asks queries that can be solved with Python code. 
    Useful for sorting, generating graphs for data visualization and analysis, solving arithmetic and logical questions, data wrangling and data crunching tasks for csv files etc.",
    "parameters": {
        "type": "object",
        "properties": {
            "text_prompt": {
                "type": "string",
                "description": "The description of the problem to be solved by writing python code."
            }
        },
        "required": [
            "text_prompt"
        ]
    }
},
{
    "name": "analyze_image",
    "description": "Analyze the contents of an image",
    "parameters": {
        "type": "object",
        "properties": {
            "image_url": {
                "type": "string",
                "description": "The URL of the image"
            }
        },
        "required": [
            "image_url"
        ]
    }
},
{
    "name": "generate_image",
    "description": "generate image based on the given description. ",
    "parameters": {
        "type": "object",
        "properties": {
            "text_prompt": {
                "type": "string",
                "description": "The description of the image to be generated"
            }
        },
        "required": [
            "text_prompt"
        ]
    }
}]"""

# Can you tell me what's in the image www.example.com/myimage.jpg?
# Please give me the recipe for kadhai paneer.
# Where am I?
# Generate an image of Indian festival Sankranthi where children are flying colourful kites on their terrace.
# What is the latest news of earthquakes in Japan?
# Sort the array [1,7,5,6].
# I want to know about tour packages from India to Maldives.

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "I want to know about tour packages from India to Maldives."},
]
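# add_generation_prompt=True appends the assistant header so the model starts its reply directly
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)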
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    outputs = model.generate(**inputs, 
                             max_new_tokens=128, 
                             do_sample=True, 
                             top_p=0.95, 
                             temperature=0.2, 
                             repetition_penalty=1.0, 
                             eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You are a helpful assistant with access to the following functions. You do a reasoning step before acting. Use the functions if required -
functions: [{
    "name": "get_current_location",
    "description": "Returns the current location. ONLY use this if the user has not provided an explicit location in the query.",
    "parameters": {}
},
{
    "name": "search",
    "description": "A search engine. 
    Useful for when you need to answer questions and provide information about real-time updates, current events and latest NEWS.
    Useful to answer general knowledge questions and provide information about people, places, companies, facts, historical events, or other subjects.
    Input should be a search query."
    "parameters": {
        "type": "object",
        "properties": {
            "search_query": {
                "type": "array",
                "items": {
                    "type": "string"
                },
                "description": "The search

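
Because `tokenizer.decode(outputs[0])` returns the prompt together with the generated text, it is often convenient to pull out only the final assistant turn. The sketch below does this with plain string handling on the ChatML markers; `last_assistant_turn` is a hypothetical helper, and the decoded string in the example is made up, not actual model output.

```python
def last_assistant_turn(decoded: str) -> str:
    """Return the text of the last assistant block in a decoded ChatML string."""
    header = "<|im_start|>assistant\n"
    start = decoded.rfind(header)
    if start == -1:
        return ""  # no assistant turn found
    reply = decoded[start + len(header):]
    end = reply.find("<|im_end|>")
    # Generation may stop at max_new_tokens before the end marker is produced
    return reply[:end].strip() if end != -1 else reply.strip()

# Made-up decoded text for illustration
example = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhere am I?<|im_end|>\n"
    "<|im_start|>assistant\nLet me look up your current location.<|im_end|>"
)
print(last_assistant_turn(example))
```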
In [28]:
!nvidia-smi

Tue Mar 11 07:01:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:A1:00.0 Off |                  Off |
| 30%   31C    P2            111W /  300W |   25448MiB /  49140MiB |     26%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

