## Let's go through the QLoRA flow, practically and quickly.

dataset used to fine-tune `meta-llama/Llama-3.2-3B-Instruct` (loaded with 4-bit quantization):
https://huggingface.co/datasets/fka/awesome-chatgpt-prompts

Input: I want you to act as a career coach.

Let's compare what we get before and after fine-tuning the model above!

assumptions:

1) using Google Colab
2) `HF_TOKEN` and `WANDB_API_KEY` are stored under Secrets in Colab

these experiments are not shown below (see the other .py/.ipynb files):
* other datasets (involving: evaluating with Giskard, DPO - Direct Preference Optimization [considered more efficient than RLHF], RecursiveCharacterTextSplitter vs CharacterTextSplitter, etc)
* learning_rate (involving decay, warm-up, scheduler[like cosine], adjusted adaptively/dynamically)
* num_train_epochs, batch sizes, checkpointing
* optimizers - Adam, Nadam
* weight initializations - He for ReLU and Glorot/Xavier for sigmoid/tanh
* batch norm
* r (for LoRA)
* lora_alpha
* quantization - configs, symmetric vs asymmetric, calibration, QAT (*)
* hyperparameters - GridSearchCV, RandomizedSearchCV, Bayesian optimization (like Hyperopt)
* stratified K-Fold cross-validation - each fold having similar proportion of samples from each class as the entire dataset
* Ollama
* prompt tuning # https://huggingface.co/docs/peft/en/package_reference/prompt_tuning
* .. and more

(*) https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/

In [1]:
!pip install -q accelerate # to take advantage of the GPU
!pip install -q bitsandbytes # for working with the quantized model; to create the quantization configuration
!pip install -q trl # provides a set of torch utilities needed for fine-tuning; to create the trainer
!pip install -q peft # allowing LoRA or QLoRA
!pip install -q transformers # https://huggingface.co/docs/transformers/en/index


In [2]:
import numpy as np
import math
import matplotlib.pyplot as plt
import os
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import TrainingArguments # , Trainer
from trl import SFTTrainer, SFTConfig
import torch
from datasets import load_dataset
import peft
from peft import LoraConfig, get_peft_model
from peft import AutoPeftModelForCausalLM, PeftConfig
import gc
from google.colab import userdata

In [3]:
!pip freeze

absl-py==1.4.0
accelerate==1.5.2
aiohappyeyeballs==2.6.1
aiohttp==3.11.15
aiosignal==1.3.2
alabaster==1.0.0
albucore==0.0.23
albumentations==2.0.5
ale-py==0.10.2
altair==5.5.0
annotated-types==0.7.0
anyio==4.9.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array_record==0.7.1
arviz==0.21.0
astropy==7.0.1
astropy-iers-data==0.2025.4.21.0.37.6
astunparse==1.6.3
atpublic==5.1
attrs==25.3.0
audioread==3.0.1
autograd==1.7.0
babel==2.17.0
backcall==0.2.0
backports.tarfile==1.2.0
beautifulsoup4==4.13.4
betterproto==2.0.0b6
bigframes==2.1.0
bigquery-magics==0.9.0
bitsandbytes==0.45.5
bleach==6.2.0
blinker==1.9.0
blis==1.3.0
blosc2==3.3.1
bokeh==3.6.3
Bottleneck==1.4.2
bqplot==0.12.44
branca==0.8.1
CacheControl==0.14.2
cachetools==5.5.2
catalogue==2.0.10
certifi==2025.1.31
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.4.1
chex==0.1.89
clarabel==0.10.0
click==8.1.8
cloudpathlib==0.21.0
cloudpickle==3.1.1
cmake==3.31.6
cmdstanpy==1.2.5
colorcet==3.1.0
colorlover==0.3.0
colour==0.1.5
commu

## Hugging Face login

In [1]:
HF_TOKEN = userdata.get('HF_TOKEN') # read from Secrets in Colab; do NOT hard-code it here as HF_TOKEN = "your-hf-token"
!huggingface-cli login --token $HF_TOKEN

In [5]:
model_name = "meta-llama/Llama-3.2-3B-Instruct" #also tried this: "meta-llama/Meta-Llama-3-8B"
target_modules = ["q_proj", "v_proj"]

In [6]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

see the QLoRA paper as needed: https://arxiv.org/abs/2305.14314

some background info:
{
https://codecompass00.substack.com/p/qlora-visual-guide-finetune-quantized-llms-peft
"
NormalFloat4 has 4-bits so we have 2^4 = 16 different bins available for quantization i.e. [0000, 0001, 0010, …, 1111]. Using standard quantization we could divide the range [-1, 1] into 16 equal-sized bins but we know that this is not ideal when values come from a normal distribution.
NF4 exploits the knowledge of the values following a normal distribution where a bulk of the values are around the center of the bell curve and then it flattens out at either extreme. With this QLoRA design NF4 creates bins based on the probability of finding points in that bin. Ideally, each bin has the same number of points falling in it assuring an optimal quantization.
"
}
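
To make the idea of equal-probability bins concrete, here is a minimal sketch using only the Python standard library. It is a simplified illustration, not the bitsandbytes NF4 implementation: it draws synthetic "weights" from a standard normal distribution (an assumption) and compares how many values land in 16 equal-width bins versus 16 bins whose edges are quantiles of the normal distribution. The helper `bin_counts` and the [-3, 3] range for the equal-width bins are hypothetical choices made for this example.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def bin_counts(values, interior_edges):
    """Count values per bin; the first and last bins also collect the tails."""
    counts = [0] * (len(interior_edges) + 1)
    for v in values:
        counts[bisect_right(interior_edges, v)] += 1
    return counts


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # synthetic, assumed N(0, 1)
n_bins = 16  # 4 bits -> 2^4 bins

# equal-width bins: 15 interior edges evenly spaced over [-3, 3]
equal_width_edges = [-3.0 + 6.0 * i / n_bins for i in range(1, n_bins)]
# equal-probability bins: interior edges at the 1/16, 2/16, ... quantiles of N(0, 1)
quantile_edges = [NormalDist().inv_cdf(i / n_bins) for i in range(1, n_bins)]

equal_width_counts = bin_counts(weights, equal_width_edges)
quantile_counts = bin_counts(weights, quantile_edges)

print("equal-width bins, min/max count:", min(equal_width_counts), max(equal_width_counts))
print("quantile bins,    min/max count:", min(quantile_counts), max(quantile_counts))
```

With equal-width bins the central bins hold far more values than the outer ones, while the quantile-based bins are filled almost evenly, which is the property the quoted text describes.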

{
on Double Quantization, see:
https://blog.dataiku.com/quantization-in-llms-why-does-it-matter#:~:text=Double%20Quantization&text=As%20illustrated%20below%2C%20storing%20one,32/(256*64)
"
This involves performing a second round of quantization, this time to quantize the scaling factors from the initial quantization of the weights. The 32-bit scale factors are grouped into blocks of 256 and scaled down to 8-bit precision with the introduction of a second round quantization factor.
... storing one scaling factor in 32-bit for every block of 64 parameters adds 0.5 bits per parameter (32/64). Instead, using this double quantization to compress the per block scaling factors to 8-bit results in a reduction to only 0.127 bits per parameter (8/64 + 32/(256*64)).
"
}
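
The arithmetic in the quote can be checked with a few lines of Python. The block size, group size and bit widths come from the quoted text; the 3-billion parameter count is an assumed round figure used only to turn the per-parameter overhead into megabytes.

```python
BLOCK_SIZE = 64            # weights that share one scaling factor
SCALE_BITS = 32            # precision of a scaling factor before double quantization
SCALES_PER_GROUP = 256     # scaling factors quantized together in the second round
QUANTIZED_SCALE_BITS = 8   # precision of a scaling factor after the second round

# overhead of the scaling factors, in bits per model parameter
overhead_single = SCALE_BITS / BLOCK_SIZE
overhead_double = (QUANTIZED_SCALE_BITS / BLOCK_SIZE
                   + SCALE_BITS / (SCALES_PER_GROUP * BLOCK_SIZE))

n_params = 3_000_000_000   # assumption: roughly a 3B-parameter model
saved_mb = (overhead_single - overhead_double) * n_params / 8 / 1e6

print(f"single quantization overhead: {overhead_single:.3f} bits/param")
print(f"double quantization overhead: {overhead_double:.3f} bits/param")
print(f"approximate memory saved:     {saved_mb:.0f} MB")
```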

In [7]:
device_map = {"": 0} # the "" (empty string) key means "the whole model"; the integer 0 is the index of the first CUDA GPU (cuda:0), not the CPU
foundation_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config=bnb_config, # this config is what makes it QLoRA; plain LoRA does not need it
                    device_map=device_map,
                    use_cache=False) # disables the key/value cache, which only speeds up generation and is not needed during training


The quantized version of the model is now loaded in memory!

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


## Inference with the pre-trained model (to be compared with the fine-tuned model later)

In [9]:
def get_outputs(model, inputs, max_new_tokens=100):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.5, # values > 1 discourage repeating tokens that were already generated
        early_stopping=False, # only relevant for beam search; generation still stops at EOS or after max_new_tokens
        eos_token_id=tokenizer.eos_token_id,
    )
    return outputs
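
For intuition on `repetition_penalty=1.5`, the sketch below mimics the rule commonly used for this penalty: the score of every token that already appears in the sequence is divided by the penalty when it is positive and multiplied by it when it is negative, so a repeat becomes less likely either way. The function `apply_repetition_penalty` and the toy scores are hypothetical and for illustration only; the real logic lives inside the `transformers` generation code.

```python
def apply_repetition_penalty(scores, generated_ids, penalty=1.5):
    """Return a copy of `scores` (one score per vocabulary id) with seen tokens penalized."""
    penalized = list(scores)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # positive scores shrink, negative scores move further from zero
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


toy_scores = [3.0, -1.0, 0.5, 2.0]   # made-up scores for a 4-token vocabulary
already_generated = [0, 1]           # token ids 0 and 1 were produced earlier
new_scores = apply_repetition_penalty(toy_scores, already_generated)
print(new_scores)  # [2.0, -1.5, 0.5, 2.0]
```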

Next, ask the pre-trained model to act as a career coach.

In [10]:
#Inference - original model
#os.environ["CUDA_LAUNCH_BLOCKING"] = "1" # added for: RuntimeError: CUDA error: device-side assert triggered
input_sentences = tokenizer("I want you to act as a career coach.", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(foundation_model, input_sentences, max_new_tokens=50)
print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


["I want you to act as a career coach. I have been working in the IT industry for over 10 years, and recently my company has gone through some significant changes due largely of our CEO's departure.\nThe new management team is looking at restructuring their operations including staff reductions across multiple departments.\n\nGiven"]


The answer is not good enough: the model simply continues the text with a personal story and the output is cut off mid-sentence.

In [12]:
print(input_sentences)
print(input_sentences["input_ids"])
print(input_sentences["input_ids"].shape)
print(input_sentences["attention_mask"])
print(input_sentences["attention_mask"].shape)

{'input_ids': tensor([[128000,     40,   1390,    499,    311,   1180,    439,    264,   7076,
           7395,     13]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
tensor([[128000,     40,   1390,    499,    311,   1180,    439,    264,   7076,
           7395,     13]], device='cuda:0')
torch.Size([1, 11])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')
torch.Size([1, 11])
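
Because `pad_token` was set to `eos_token`, batches that mix prompts of different lengths are padded with the end-of-sequence id, and the attention mask tells the model which positions are real. Below is a minimal sketch of that bookkeeping with plain lists, assuming right padding (for batched generation with decoder-only models, left padding is usually preferred) and the id 128009 reported in the generation warning above. `pad_batch` is a hypothetical helper, not a tokenizer method.

```python
PAD_ID = 128009  # eos_token_id reported in the generation warning; reused here as the pad id


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# token ids of "I want you to act as a career coach." as printed above
prompt_ids = [128000, 40, 1390, 499, 311, 1180, 439, 264, 7076, 7395, 13]
shorter_ids = prompt_ids[:6]  # a made-up shorter sequence, just to force padding

batch = pad_batch([prompt_ids, shorter_ids])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```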


In [13]:
dataset = "fka/awesome-chatgpt-prompts"
data = load_dataset(dataset)
print(data)
print(data['train'])
print(data.values())
# .map runs the tokenizer over batches of samples; "prompt" is one of the dataset's features
data = data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
print("INFO - character length of the first prompt:", len(data['train']['prompt'][0])) # length of the raw string, not an embedding size
train_sample = data["train"].select(range(50)) # 50 samples are enough for illustration and keep training short
del data
train_sample = train_sample.remove_columns('act')
# the prompt column could also be removed, since only input_ids (the token IDs) and attention_mask are needed for fine-tuning
display(train_sample)

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 203
    })
})
Dataset({
    features: ['act', 'prompt'],
    num_rows: 203
})
dict_values([Dataset({
    features: ['act', 'prompt'],
    num_rows: 203
})])

INFO - character length of the first prompt: 578


Dataset({
    features: ['prompt', 'input_ids', 'attention_mask'],
    num_rows: 50
})

## Fine-Tuning.
1) create a LoRA configuration object to set the variables that specify the characteristics of the fine-tuning process

In [14]:
lora_config = LoraConfig(
    r=16, # rank of the update matrices; larger -> more trainable parameters and longer training
    lora_alpha=16, # LoRA scaling factor: the low-rank update is multiplied by lora_alpha / r (*)
    target_modules=target_modules,
    lora_dropout=0.05, # dropout on the LoRA layers, to reduce overfitting
    bias="none", # which bias parameters to train: "none", "all" or "lora_only"
    task_type="CAUSAL_LM"
)
# (*) see section 4.1 of the LoRA paper: https://arxiv.org/pdf/2106.09685
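
The roles of `r` and `lora_alpha` can be sketched with plain Python lists. Following the LoRA paper, an adapted layer computes h = W x + (lora_alpha / r) * B (A x), where W is frozen, only A and B are trained, and B starts at zero so the adapter initially leaves the base output unchanged. The tiny dimensions and the helper names (`matvec`, `lora_forward`) are made up for illustration; this is not the PEFT implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # B is (d_out x r), A is (r x d_in)
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, lora_alpha = 6, 4, 2, 2   # toy sizes; the config above uses r=16, lora_alpha=16
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [1.0] * d_in

h_before = lora_forward(W, A, B, x, lora_alpha, r)
B[0][0] = 0.5  # pretend a training step changed one entry of B
h_after = lora_forward(W, A, B, x, lora_alpha, r)

print("unchanged before training:", h_before == matvec(W, x))
print("first output moved by:", h_after[0] - h_before[0])
```

With `r=16` and `lora_alpha=16` as in the configuration above, the scaling factor is 1.0.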

The most important parameter is **r**: it determines how many parameters will be trained.

A list of the available **target_modules** is in the [PEFT source on GitHub](https://github.com/huggingface/peft/blob/39ef2546d5d9b8f5f8a7016ec10657887a867041/src/peft/utils/other.py#L220).

also see: https://huggingface.co/docs/peft/en/quicktour

and see: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
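
To see how `r` drives the number of trainable parameters, note that a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies this to `q_proj` and `v_proj`. The layer shapes (hidden size 3072, a 1024-dimensional value projection, 28 decoder layers) are assumptions based on the published configuration of Llama-3.2-3B, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, shapes, num_layers):
    """Trainable LoRA weights: r * (d_in + d_out) per adapted linear layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# assumed (d_in, d_out) of q_proj and v_proj in each decoder layer
ASSUMED_SHAPES = [(3072, 3072), (3072, 1024)]
ASSUMED_NUM_LAYERS = 28

for rank in (4, 8, 16, 32):
    print(f"r={rank:>2}: {lora_param_count(rank, ASSUMED_SHAPES, ASSUMED_NUM_LAYERS):,} trainable parameters")
```

Under these assumptions `r=16` gives 4,587,520, which agrees with the trainable-parameter count printed at the end of this notebook, and doubling `r` doubles the count.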

In [15]:
working_dir = './'
output_directory = os.path.join(working_dir, "peft_lab_outputs")

In [16]:
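# note: these arguments are not passed to the trainer below, which builds its own SFTConfig with the same values
training_args = TrainingArguments(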
    output_dir=output_directory,
    auto_find_batch_size=True, # to find a batch size that will fit into memory
    learning_rate= 0.0002,
    num_train_epochs=5
)

2) train the model, with:

* Model
* training_args
* Dataset
* a data collator - an object that forms a batch from a list of dataset elements
* LoRA config
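
As a rough picture of what a causal-LM collator (`mlm=False`) produces, the sketch below pads a list of tokenized examples and copies `input_ids` into `labels`, replacing every position that holds the pad id with -100 so that the loss ignores it. This is a simplified stand-in written with plain lists, not the `DataCollatorForLanguageModeling` source; `collate_causal_lm` and the toy ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal_lm(examples, pad_id):
    """Pad token id lists to equal length and derive labels for next-token prediction."""
    max_len = max(len(ids) for ids in examples)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in examples]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in examples]
    # labels mirror input_ids, except that pad ids are masked out of the loss
    labels = [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal_lm([[5, 6, 7, 8], [9, 10]], pad_id=0)  # toy ids
print(batch["labels"])  # [[5, 6, 7, 8], [9, 10, -100, -100]]
```

One consequence worth keeping in mind: when the pad token is the same as the EOS token, as in this notebook, masking by pad id would also hide any genuine end-of-sequence tokens from the loss.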





In [17]:
tokenizer.pad_token = tokenizer.eos_token
os.environ["WANDB_API_KEY"] = userdata.get("WANDB_API_KEY")
trainer = SFTTrainer(
    model=foundation_model,
    train_dataset=train_sample,
    peft_config=lora_config,
    args = SFTConfig(
      output_dir=output_directory,
      auto_find_batch_size=True,
      learning_rate= 0.0002,
      dataset_text_field="prompt",
      num_train_epochs=5
    ),
    processing_class=tokenizer,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: mustworksmart (mustworksmart-na) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin

TrainOutput(global_step=35, training_loss=2.43212890625, metrics={'train_runtime': 166.6955, 'train_samples_per_second': 1.5, 'train_steps_per_second': 0.21, 'total_flos': 499526999310336.0, 'train_loss': 2.43212890625})
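
The reported `global_step=35` is consistent with simple arithmetic: 50 training examples over 5 epochs, assuming an effective batch size of 8. That batch size is an assumption (it is the usual default per-device value on a single GPU; the notebook does not set it explicitly, and `auto_find_batch_size` only lowers it when memory runs out).

```python
import math

num_examples = 50           # size of train_sample
num_epochs = 5              # num_train_epochs
assumed_batch_size = 8      # assumption: default per-device batch size, single GPU
train_runtime_s = 166.6955  # train_runtime from the TrainOutput above

steps_per_epoch = math.ceil(num_examples / assumed_batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("samples/s:", round(samples_per_second, 3))
print("steps/s:", round(steps_per_second, 2))
```

The derived throughput figures match the `train_samples_per_second` and `train_steps_per_second` values in the output above.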

In [18]:
peft_model_path = os.path.join(output_directory, "lora_model")

In [19]:
trainer.model.save_pretrained(peft_model_path)

In [20]:
# to free some memory
del foundation_model
del trainer
del train_sample
torch.cuda.empty_cache()
gc.collect()

877

## Load the fine-tuned model (LoRA adapter)

In [21]:
current_directory = os.getcwd()
print("INFO - current directory is: ", current_directory)
print("INFO - peft_model_path is:", peft_model_path)
print("INFO - and it contains:")
entries = os.listdir(peft_model_path)
for entry in entries:
  print(entry)

INFO - current directory is:  /content
INFO - peft_model_path is: ./peft_lab_outputs/lora_model
INFO - and it contains:
README.md
adapter_model.safetensors
adapter_config.json


In [22]:
loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        peft_model_path,
                                        is_trainable=False,
                                        quantization_config=bnb_config,
                                        device_map = 'cuda')


## Inference with the fine-tuned model

In [23]:
input_sentences = tokenizer("I want you to act as a career coach.", return_tensors="pt").to('cuda')
finetuned_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)
print(tokenizer.batch_decode(finetuned_outputs_sentence, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


["I want you to act as a career coach. My friend has recently graduated from college and is now looking for her first job in the industry she studied (computer science). She's feeling anxious about finding employment, but I think that with your guidance we can help find something suitable.\nMy goal here will"]


The result is better, although it does not end with a complete sentence. Increasing max_new_tokens or adding/improving a system prompt are worth experimenting with.


In [24]:
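# see: https://huggingface.co/docs/peft/en/quicktour
# get_peft_model is called here only to report parameter counts; it modifies the model in place, so run it after the inference above
model = get_peft_model(loaded_model, lora_config)
model.print_trainable_parameters() # prints the summary itself and returns None, so no outer print() is needed

trainable params: 4,587,520 || all params: 3,217,337,344 || trainable%: 0.1426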


