## Fine-tuning
As mentioned before, fine-tuning adapts a large language model (LLM) to specific tasks, domains, or user requirements. Depending on the size of the model and of the fine-tuning dataset, the process can take a significant amount of time and demand high-performance GPUs to handle the computational load. There are, however, several ways to make the task more efficient. One family of methods is Parameter-Efficient Fine-Tuning (PEFT), which trains only a small subset of parameters while the weights of the original pretrained LLM stay frozen, significantly reducing the computational cost.

One PEFT method is LoRA, which stands for Low-Rank Adaptation. This technique adds trainable rank-decomposition matrices (matrices A and B in the image below) to layers of the transformer architecture. Only these small matrices are trained for the downstream task, while the pretrained weights stay frozen. The method assumes that the weight change needed to adapt a large, high-dimensional matrix contains a lot of redundancy and can be approximated well by a low-rank matrix. Hence, a much more parameter-efficient pair of matrices can capture the important attributes of the data during training.

![LoRA_diagram.png](./LoRA_diagram.png "LoRA_diagram.png")
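
To make the idea concrete, here is a minimal pure-Python sketch of the forward pass of a LoRA layer. It assumes the convention from the LoRA paper, where B has shape d x r and starts at zero while A has shape r x k. The helper names (`matvec`, `lora_forward`) and the toy sizes are made up for illustration and do not come from any library.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x):
    """Return W x + B (A x): the frozen base output plus the low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


random.seed(0)
d, k, r = 6, 6, 2  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, starts at zero
x = [1.0] * k

base_out = matvec(W, x)
lora_out = lora_forward(W, A, B, x)

frozen = d * k           # values in W
trainable = r * (d + k)  # values in A and B together
print(f"frozen values: {frozen}, trainable values: {trainable}")
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly; training then only changes A and B.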

So, can we go ahead and start fine-tuning the model? Not quite: there is one more step to consider. In many fine-tuning scenarios you are limited to a single GPU, so the fine-tuning method needs to be even more memory-efficient. This is where Quantized Low-Rank Adaptation (QLoRA) comes in.

QLoRA extends LoRA by quantizing the weights of the pretrained LLM to 4-bit precision. Model parameters are typically stored in a 16-bit or 32-bit floating-point format; QLoRA compresses the frozen base weights to a 4-bit format. This reduces the memory footprint of the LLM, making it possible to fine-tune it on a single GPU.
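
As a rough back-of-the-envelope check, the sketch below estimates how much memory the weights alone need at different precisions. The parameter count of about 1.24 billion is an assumed value for a model of this size, and the helper `weight_memory_gb` is hypothetical. The estimate ignores activations, optimizer state, quantization constants, and any layers that stay unquantized, so the real footprint will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.24e9  # assumption: approximate parameter count of a 1B-class model

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Going from 16-bit to 4-bit storage cuts the weight memory by a factor of four.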

## Pre-requisites

Before continuing, you need a Hugging Face account. You can create one at https://huggingface.co/.

Next, you will need access to Llama 3.2 1B, which is the model we will use for this task. Use the link: https://huggingface.co/meta-llama/Llama-3.2-1B

Once you reach the page, complete the required access form. (Do not mention that you are affiliated with Accenture! Use a random university, maybe.)

Once you have your Hugging Face account, create an access token. Open your profile menu at the top right of the page and select "Access Tokens". Once the token is created, you can store it in a text file on your local machine.

## Install and import libraries
Let's install and import the required dependencies:

In [0]:
!pip install transformers datasets bitsandbytes peft trl accelerate torch 

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting peft
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Downloading trl-0.14.0-py3-none-any.whl.metadata (12 kB)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Downloading huggingface_hub-0.28.0-py3-none-any.whl.metadata (13 kB)
Collecting accelerate
  Downloading accelerate-1.3.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting transformers
  Downloading transformers-4.48.1-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.0-cp39-abi3-manylinux_

In [0]:
%restart_python

In [0]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from datasets import Dataset, load_dataset, load_from_disk  # load datasets from Hugging Face or from disk
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq,
    TrainingArguments,
)
from trl import SFTConfig, SFTTrainer




In [0]:
os.environ['HF_TOKEN'] = ""
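
The empty string above is a placeholder for your own token. One possible way to fail early when the token is missing is sketched below. `get_hf_token` is a hypothetical helper, not part of any Hugging Face library; it only reads a mapping such as `os.environ`.

```python
import os


def get_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the access token stored in `env`, or raise if it is missing or empty."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set. Create an access token in your "
            "Hugging Face profile and export it before running the notebook."
        )
    return token
```

Calling `get_hf_token()` before loading the model surfaces a missing token immediately instead of failing later during the download.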

## Load the dataset

In [0]:

# load the preprocessed datasets using the load_from_disk function
training_dataset = load_from_disk("training_data")
evaluation_data = load_from_disk("evaluation_data")

## Load the model 
We will use the bitsandbytes library to create a configuration that loads a quantized version of the Llama 3.2 1B model. It's important to remember that although the model's weights are quantized to 4 bits, the LoRA matrices are kept in 16-bit precision, so that the learning process does not lose detail. During the forward pass, the frozen 4-bit weights are dequantized on the fly to the compute dtype, and the output of the LoRA adapters is added on top.

In [0]:
use_4bit = True
# Model name on the Hugging Face Hub
model_name = "meta-llama/Llama-3.2-1B"
# Compute dtype for 4-bit base models: weights are dequantized to this dtype for computation
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,  # computations run in float16 (not bfloat16)
    bnb_4bit_use_double_quant=use_nested_quant,
)
# Download the model and quantize its weights to 4 bits while loading
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    token=os.environ["HF_TOKEN"],
)
# The tokenizer does not need to be quantized
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ["HF_TOKEN"])

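
To give a feel for what "store in 4 bits, dequantize on the fly" means, here is a simplified pure-Python sketch of block-wise absmax quantization. It is not the NF4 scheme used by bitsandbytes: NF4 places its 16 levels according to the quantiles of a normal distribution, while this sketch uses evenly spaced integer codes for readability. The function names and the sample weights are made up for illustration.

```python
def quantize_block(values, levels=16):
    """Quantize one block of weights to signed integer codes using its absolute maximum."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid division by zero for an all-zero block
    half = levels // 2 - 1                       # 7 for 16 levels, so codes lie in [-7, 7]
    codes = [round(v / absmax * half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Map integer codes back to approximate floating-point weights."""
    half = levels // 2 - 1
    return [c / half * absmax for c in codes]


weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.49, -0.02, 0.4]  # toy block of weights
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"largest reconstruction error: {max_error:.4f}")
```

Only the small integer codes and one scale value per block need to be stored; the approximate weights are rebuilt whenever a layer is used.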

## QLoRA variables




Now that we understand what LoRA is, let's dive into some practical aspects of it. When we fine-tune a language model with QLoRA, two new hyperparameters come into play:

1. Rank (r)

2.  Alpha (α)


Suppose that an original weight matrix W is 10,000 x 10,000. LoRA does not modify W itself. Instead, it learns an update to W expressed as the product of two much smaller matrices, B and A, with dimensions 10,000 x 8 and 8 x 10,000, respectively. Multiplying B by A yields a matrix with the same 10,000 x 10,000 shape as W, which is added to the frozen weights. In this example, 8 is the rank of the LoRA fine-tuning. Instead of 100,000,000 values, only 2 x 10,000 x 8 = 160,000 values are trained. We can choose any rank when defining A and B: a higher rank means more trainable parameters in our model, making fine-tuning more memory-intensive.

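The alpha parameter is a scaling factor that is applied to the product of the matrices B and A: the update is multiplied by alpha / r before it is added to the output of the frozen weights, so alpha controls how strongly the adapter influences the model.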
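
The arithmetic behind the rank and alpha choices can be checked with a few lines of Python. The helper names below are hypothetical; the scaling of alpha / r is the default LoRA scaling, and the values 16 and 8 match the configuration used in this notebook.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_matrix_params, lora_params) for a single weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def lora_scaling(alpha, rank):
    """Factor applied to the product B A before it is added to the frozen output."""
    return alpha / rank


full, lora = lora_param_counts(10_000, 10_000, rank=8)
print(f"full matrix: {full:,} values")
print(f"LoRA (r=8):  {lora:,} values ({100 * lora / full:.2f}% of the full matrix)")
print(f"scaling alpha / r = {lora_scaling(16, 8)}")
```

Doubling the rank doubles the number of trainable values, and with alpha held fixed it also halves the scaling factor.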

In [0]:
################################################################################
# QLoRA parameters
################################################################################
# LoRA attention dimension for the model. This is the rank of the LoRA projected matrix.
lora_r = 8
# Alpha parameter for LoRA scaling. scaling factor that controls the magnitude of the weight changes added to the base model when fine-tuning
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1
################################################################################

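Let's create the LoRA configuration that we will add to our model:

1. lora_alpha = the alpha scaling parameter
2. lora_dropout = the dropout probability applied in the LoRA layers
3. r = the rank of the LoRA matrices
4. bias = "none", so no bias terms are trained
5. task_type = causal language modeling, since the model is autoregressive
6. target_modules = the linear projection layers of the model that receive adapters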

In [0]:
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
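
To get a sense of how many parameters this configuration trains, the sketch below counts the LoRA values for the targeted projection layers. The layer shapes are assumptions about the Llama 3.2 1B architecture (hidden size 2048, MLP size 8192, key/value projection size 512, 16 decoder layers); check them against the model's `config.json` before relying on the result.

```python
RANK = 8
NUM_LAYERS = 16  # assumption: number of decoder layers

# Assumed (in_features, out_features) of each targeted linear layer in one decoder layer
PROJECTION_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}


def lora_params_per_layer(shapes, rank):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())


per_layer = lora_params_per_layer(PROJECTION_SHAPES, RANK)
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters per layer: {per_layer:,}")
print(f"trainable LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add a few million trainable values, a small fraction of the roughly one billion frozen weights.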

## Model & Tokenizer configuration

Model configuration involves setting a pad token on both the tokenizer and the model. Padding ensures that all input sequences in a batch have the same length: shorter inputs are padded to the length of the longest input in the batch.

In [0]:
# The key/value cache only speeds up generation by reusing previously computed states.
# It is not needed during training, so disable it.
model.config.use_cache = False

# Use the EOS token as the pad token on the model side as well.
if model.config.pad_token_id is None:
    model.config.pad_token_id = model.config.eos_token_id

# Add the LoRA adapter to the model architecture
model.add_adapter(peft_config)

# Reuse the EOS token as the pad token in the tokenizer
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # works around an overflow issue seen with fp16 training
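
The effect of padding can be illustrated without a tokenizer. The sketch below pads lists of token IDs to a common length and builds the matching attention mask; `pad_batch` is a hypothetical helper and the pad ID of 0 is an arbitrary placeholder, not the model's real pad token ID.

```python
def pad_batch(sequences, pad_token_id, padding_side="right"):
    """Pad every sequence to the length of the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_token_id] * (max_len - len(seq))
        mask_padding = [0] * len(padding)  # 0 marks positions the model should ignore
        if padding_side == "right":
            input_ids.append(list(seq) + padding)
            attention_mask.append([1] * len(seq) + mask_padding)
        else:
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_padding + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_token_id=0)
print(ids)
print(mask)
```

The attention mask tells the model which positions hold real tokens, so the padded positions do not affect the result.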

## Model training arguments

Let's define the training arguments for our model:

1. **num_train_epochs:** The number of times the model goes through the entire training dataset.
2. **bf16:** Whether to train in the bfloat16 floating-point format.
3. **per_device_train_batch_size:** The number of examples per GPU processed in one forward and backward pass.
4. **per_device_eval_batch_size:** The batch size used for evaluation; usually the same as the training batch size.
5. **gradient_accumulation_steps:** Instead of updating the A and B matrices after every batch, gradients are accumulated over this many batches before each weight update.
6. **max_grad_norm:** Limits the maximum norm of the gradients during training (gradient clipping) to prevent exploding gradients.
7. **learning_rate:** Controls the size of the adjustments made to the A and B matrices at each optimization step.
8. **weight_decay:** A regularization strength that shrinks the weights slightly at each update, discouraging large weight values.
9. **optim:** The optimizer that updates the trainable parameters during training in order to minimize the loss function.
10. **lr_scheduler_type:** The schedule that changes the learning rate over the course of training.
11. **warmup_steps:** The number of initial steps during which the learning rate ramps up from a low value, letting the newly added parameters (such as the LoRA matrices) adjust gradually before the full learning rate is reached.
12. **group_by_length:** Groups sequences of similar length into the same batch, which saves memory and speeds up training considerably.
13. **logging_steps:** Log metrics every X update steps.
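
The interaction between batch size and gradient accumulation can be worked out by hand. The sketch below uses the values from this notebook (batch size 3, accumulation over 2 batches) and assumes a training set of 900 examples on a single GPU; `steps_per_epoch` is a hypothetical helper that only approximates how the trainer counts update steps.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, accumulation_steps, num_devices=1):
    """Return (effective_batch_size, optimizer_steps_per_epoch)."""
    effective_batch = per_device_batch_size * accumulation_steps * num_devices
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    optimizer_steps = max(batches // accumulation_steps, 1)
    return effective_batch, optimizer_steps


effective_batch, optimizer_steps = steps_per_epoch(
    num_examples=900,  # assumption: size of the training set
    per_device_batch_size=3,
    accumulation_steps=2,
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {optimizer_steps}")
```

Accumulating gradients gives the stability of a larger batch while only ever holding 3 examples in GPU memory at a time.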


In [0]:
output_dir = "./medical_model"
# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training 
fp16 = False
bf16 = True

# Batch size per GPU for training
per_device_train_batch_size = 3

# Batch size per GPU for evaluation
per_device_eval_batch_size = 3

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 2

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping). Limits the maximum magnitude of gradients during training to prevent exploding gradients
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights. This is a regularization parameter added to the loss function for avoiding large weights 
weight_decay = 0.001

# Optimizer: paged AdamW with 32-bit optimizer states
optim = "paged_adamw_32bit"

# Learning rate schedule, how the learning rate decays over time
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# steps for a linear warmup (from 0 to learning rate)
warmup_steps = 2

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 0

# Log every X updates steps
logging_steps = 2
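
To see how `learning_rate`, `warmup_steps`, and the cosine schedule interact, here is a small standalone sketch of a cosine schedule with linear warmup. It mirrors the usual shape of such a schedule and is not the library's own code; `cosine_lr` is a hypothetical helper and the total of 150 steps is an assumed value.

```python
import math


def cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


BASE_LR = 2e-4      # learning_rate from the cell above
WARMUP_STEPS = 2    # warmup_steps from the cell above
TOTAL_STEPS = 150   # assumption: total number of optimizer steps

for step in (0, 1, 2, 50, 100, 150):
    print(f"step {step:>3}: lr = {cosine_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS):.2e}")
```

With only 2 warmup steps, the learning rate reaches its peak almost immediately and then decays smoothly for the rest of training.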


Create a training configuration using the training arguments:

In [0]:
training_args = SFTConfig(
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=num_train_epochs,  # use the variable defined above instead of a hard-coded 1
    fp16=fp16,
    bf16=bf16,
    logging_steps=logging_steps,
    optim=optim,
    weight_decay=weight_decay,
    warmup_steps=warmup_steps,
    output_dir="medical_summary/",
    dataset_text_field="prompt",  # dataset column that holds the training text
    seed=0,
)

## Model Training 

We will use the SFTTrainer class to train the model based on the training configuration from the previous step

In [0]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()

In [0]:
trainer = SFTTrainer(
    model=model,
    train_dataset= training_dataset,
    eval_dataset= evaluation_data,
    peft_config=peft_config,
    args=training_args,
)

trainer.train()

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]



| Step | Training Loss |
|------|---------------|
| 2 | 2.3931 |
| 4 | 2.5099 |
| 6 | 2.5633 |
| 8 | 2.2717 |
| 10 | 2.2875 |
| 12 | 2.4156 |
| 14 | 2.1398 |
| 16 | 2.2966 |
| 18 | 2.1442 |
| 20 | 2.0679 |
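
Individual loss values are noisy, so it can help to compare averages over a few logging steps. The sketch below does this for the values logged above, using only the standard library; the window of five entries is an arbitrary choice for illustration.

```python
from statistics import mean

# Training loss per logged step, copied from the output above
logged_loss = {
    2: 2.3931, 4: 2.5099, 6: 2.5633, 8: 2.2717, 10: 2.2875,
    12: 2.4156, 14: 2.1398, 16: 2.2966, 18: 2.1442, 20: 2.0679,
}

losses = [logged_loss[step] for step in sorted(logged_loss)]
first_half = mean(losses[:5])
second_half = mean(losses[5:])

print(f"mean loss, steps 2-10:  {first_half:.4f}")
print(f"mean loss, steps 12-20: {second_half:.4f}")
```

A lower average in the second window suggests that the loss is trending downward, although 20 steps are too few to draw firm conclusions.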


com.databricks.backend.common.rpc.CommandCancelledException