# Finetuning with QLoRA

In this step, we finetune a pre-trained language model (`Llama-3.1-8B-Instruct`) on our custom-generated dataset. The process is designed to produce a specialised model on accessible hardware.

We will use **QLoRA (Quantized Low-Rank Adaptation)**, a memory-efficient finetuning technique with three parts:
1.  **Quantization:** Loading the base model in 4-bit precision to drastically reduce its memory footprint.
2.  **LoRA:** Freezing all base model parameters and injecting small, trainable low-rank "adapter" matrices into selected layers (Low-Rank Adaptation).
3.  **Training:** Training only these adapter layers, which is significantly faster and requires much less VRAM than training the full model, because gradients and optimizer state are kept only for the adapters.
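
To make the LoRA step concrete, here is a minimal sketch in plain Python (not the `peft` implementation) of how a frozen weight matrix `W` is combined with a low-rank update `B @ A` scaled by `alpha / r`. The helper names `matvec` and `lora_forward` and the toy dimensions are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W @ x + (alpha / r) * B @ (A @ x).

    W is the frozen base weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the small trainable adapter matrices.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy example: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]      # r x d_in
B_zero = [[0.0], [0.0]]    # d_out x r, zero-initialised as in standard LoRA
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing, so training starts
# from the unchanged base model.
y_start = lora_forward(W, A, B_zero, x, alpha=2, r=1)

# Once B has been updated, the adapter shifts the output.
B_trained = [[1.0], [-1.0]]
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(y_start, y_trained)
```

Only `A` and `B` would receive gradients here; `W` stays fixed, which is what keeps the number of trainable parameters small.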

In [1]:
import os
import yaml
import torch
from trl import SFTTrainer
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)

In [2]:
# Load configuration
with open("../config/config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Paths in the config are relative to the parent directory of this notebook
base_model_name = os.path.join("..", config["finetuning"]["base_model_name"])
dataset_path = os.path.join("..", config["finetuning"]["dataset_path"])
finetuned_model_output_dir = os.path.join("..", config["finetuning"]["output_dir"])

# Load Dataset
train_dataset = load_dataset('json', data_files=dataset_path, split='train')
print("Dataset loaded successfully.")
print(f"Number of training examples: {len(train_dataset)}")
print("\nSample entry:")
print(train_dataset[0])

# Configure the 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

Dataset loaded successfully.
Number of training examples: 13

Sample entry:
{'text': "<s>[INST] What is Subhojit Ghimire's professional background and experience, including education and work history? [/INST] Subhojit Ghimire is an AI/ML Developer with two years of industry experience.  He holds a Bachelor of Technology (B.Tech.) in Computer Science & Engineering from the National Institute of Technology (NIT) Silchar, Assam, India, graduating between July 2019 and June 2023 with a CGPA of 8.79/10.0. His professional experience includes working as an AI/ML Developer from July 2023 to July 2025 (the provided text cuts off before specifying the end of this role).  He has expertise in production-grade backend development, scalable machine learning automation, and Generative AI applications.  He was a core contributor to JioBrain, India's first AI/ML platform with 5G integration, where he delivered scalable solutions for business use.  His work history demonstrates a track record of delive
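
As a rough illustration of why the 4-bit quantization configured above matters, the sketch below estimates weight memory for a model of about 8 billion parameters at two precisions. This is back-of-the-envelope arithmetic under stated assumptions: it counts weights only and ignores activations, optimizer state, quantization constants, and any layers kept in higher precision. The function name `weight_memory_gib` and the rounded parameter count are hypothetical.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 8.03 billion parameters for an 8B-class model.
NUM_PARAMS = 8_030_000_000

mem_16bit = weight_memory_gib(NUM_PARAMS, 16)  # bf16 / fp16 weights
mem_4bit = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {mem_16bit:.1f} GiB")
print(f" 4-bit weights: {mem_4bit:.1f} GiB")
```

The ratio is a factor of four by construction; real usage will be somewhat higher than the 4-bit figure for the reasons listed above.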

In [3]:
# Load the base model with the quantization config
print(f"Loading base model: {base_model_name}")
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
base_model = prepare_model_for_kbit_training(base_model)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Recommended for fine-tuning

Loading base model: ../local-model/Llama-3.1-8B-Instruct



In [4]:
# Configure LoRA
lora_config = LoraConfig(
    r=8,                # Rank of the update matrices; lower rank means fewer trainable parameters.
    lora_alpha=32,      # Scaling factor; the update is scaled by lora_alpha / r (here 32 / 8 = 4).
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention projections that receive adapters.
    lora_dropout=0.05,  # Dropout probability for the LoRA layers.
    bias="none",        # Do not train any bias terms.
    task_type="CAUSAL_LM",
)

# Apply LoRA adapters to the base model
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
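
The trainable-parameter count reported above can be reproduced by hand. Each adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` parameters. The sketch below assumes the published Llama-3.1-8B shapes (hidden size 4096, 32 decoder layers, key/value projections of width 1024 because of grouped-query attention). These dimensions are not stated in this notebook, so treat them as assumptions.

```python
R = 8              # LoRA rank from lora_config
NUM_LAYERS = 32    # assumed number of decoder layers
HIDDEN = 4096      # assumed hidden size
KV_DIM = 1024      # assumed k/v projection width (8 KV heads x 128)

# (d_in, d_out) for each targeted projection in one decoder layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

# A is (r x d_in) and B is (d_out x r), so each module adds r * (d_in + d_out).
per_layer = sum(R * (d_in + d_out) for d_in, d_out in target_shapes.values())
trainable = per_layer * NUM_LAYERS

all_params = 8_037_076_992  # "all params" value printed above
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Under these assumptions the arithmetic matches the `print_trainable_parameters()` output, which suggests the adapters were attached to exactly the four attention projections in every layer.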


In [5]:
# Define the training arguments and initialise the trainer
training_args = TrainingArguments(
    output_dir=finetuned_model_output_dir,
    per_device_train_batch_size=1,   # Batch size per GPU
    gradient_accumulation_steps=8,   # Accumulate gradients over 8 batches (effective batch size 8 per GPU)
    gradient_checkpointing=True,     # Save memory by recomputing activations during the backward pass
    learning_rate=2e-4,              # Learning rate
    logging_steps=25,                # Log training progress every 25 steps
    num_train_epochs=1,              # Number of training epochs
    max_steps=-1,                    # -1 defers to num_train_epochs; a positive value overrides it
    save_steps=50,                   # Save a checkpoint every 50 steps
    fp16=True,                       # fp16 mixed precision; the quantization config above computes in bfloat16,
                                     # so bf16=True may be the more consistent choice on GPUs that support it
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=512, # Max sequence length for inputs
    tokenizer=tokenizer,
    args=training_args,
)

# Begin finetuning
print("Starting the finetuning process. This may take a while...")
trainer.train()
print("Finetuning complete.")

# Save the LoRA adapters
print(f"Saving LoRA adapters to: {finetuned_model_output_dir}")
trainer.save_model(finetuned_model_output_dir)
print("Adapters saved successfully.")


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  super().__init__(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.


Starting the finetuning process. This may take a while...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss


Finetuning complete.
Saving LoRA adapters to: ../models/Llama-3.1-8B-Instruct-Finetuned
Adapters saved successfully.
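
The empty `Step,Training Loss` table above is consistent with the training settings: with 13 examples, a per-device batch size of 1, and 8 gradient-accumulation steps, one epoch yields only one or two optimizer steps, which is below `logging_steps=25`. The sketch assumes a single GPU. Whether the final partial accumulation group counts as a step depends on the `transformers` version, so both roundings are shown.

```python
import math

num_examples = 13            # from the dataset output
per_device_batch_size = 1
grad_accum_steps = 8
num_epochs = 1
logging_steps = 25
save_steps = 50

batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Lower and upper bound on optimizer steps per epoch.
steps_floor = max(batches_per_epoch // grad_accum_steps, 1)
steps_ceil = math.ceil(batches_per_epoch / grad_accum_steps)

total_low = steps_floor * num_epochs
total_high = steps_ceil * num_epochs

print(f"optimizer steps: between {total_low} and {total_high}")
print(f"loss is logged during training: {total_high >= logging_steps}")
print(f"intermediate checkpoint is saved: {total_high >= save_steps}")
```

Lowering `logging_steps` to 1 would make the loss visible for a run this short.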


In [6]:
# Quickly testing the finetuned model
prompt = "Who is Subhojit Ghimire? Tell me in brief."
formatted_prompt = f"<s>[INST] {prompt} [/INST]"

# Note: max_length counts the prompt tokens as well as the generated ones
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(formatted_prompt)
print(result[0]['generated_text'])

Device set to use cuda:0
Caching is incompatible with gradient checkpointing in LlamaDecoderLayer. Setting `past_key_values=None`.


<s>[INST] Who is Subhojit Ghimire? Tell me in brief. [/INST] SubSlinkyHauntedimport'gc'gc骨Question以下defOccurs‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
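
The generation above is not coherent text. One possible contributor (an assumption, not a confirmed diagnosis) is the prompt format: `<s>[INST] ... [/INST]` is the Llama 2 style, whereas Llama 3 instruct models use header tokens such as `<|start_header_id|>` and `<|eot_id|>`. The sketch below builds a single-turn prompt in that style by hand. `build_llama3_prompt` is a hypothetical helper; in practice `tokenizer.apply_chat_template` is the safer way to get the exact template shipped with the model. A second possible contributor is that the model is still in training mode with gradient checkpointing enabled, as the caching warning suggests, in which case calling `model.eval()` before generation may help.

```python
def build_llama3_prompt(user_message: str, system_message: str = "") -> str:
    """Format a single-turn prompt with Llama 3 style header tokens."""
    parts = ["<|begin_of_text|>"]
    if system_message:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>"
        )
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>"
    )
    # Leave the assistant header open so the model writes the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = "Who is Subhojit Ghimire? Tell me in brief."
formatted_prompt = build_llama3_prompt(prompt)
print(formatted_prompt)
```

For the prompt and the finetuned behaviour to line up, the training examples would need to use the same format as the inference prompt.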


### Next Steps

Now that we have our specialised LoRA adapters, we need to rigorously evaluate their performance. In the next notebook, `3_Benchmarking.ipynb`, we will compare our finetuned model against the original base model and the Gemini Pro API to measure whether finetuning adds value.