# FINE TUNING 
As mentioned before, fine-tuning large language models (LLMs) adapts a model to specific tasks, domains, or user requirements. Depending on the size of the model and the fine-tuning dataset, the process can take a significant amount of time and demand high-performance GPUs to handle the computational load. There are, however, several ways to make the task more computationally efficient. One family of methods is Parameter-Efficient Fine-Tuning (PEFT), which trains only a small number of parameters while keeping the weights of the original pretrained LLM frozen, significantly reducing the computational cost.

One PEFT method is LoRA, which stands for Low-Rank Adaptation. This technique introduces trainable rank decomposition matrices (matrices A and B in the image below) into the transformer architecture. Only these matrices are trained for the downstream task while the pretrained weights stay frozen, which greatly reduces the number of trainable parameters. The method assumes that the weight update needed for a new task contains a lot of redundancy, especially in high-dimensional spaces. Hence, a much smaller, more "parameter-efficient" pair of matrices can capture the important information during training.

![LoRA_diagram.png](./LoRA_diagram.png "LoRA_diagram.png")

So we should be able to go ahead and start fine-tuning the model, right? Unfortunately, there is another step to consider before proceeding. In many fine-tuning setups, you are limited to a single GPU, so we need to make the fine-tuning method even more memory efficient. This is where Quantized Low-Rank Adaptation (QLoRA) comes in.

QLoRA extends LoRA by quantizing the weights of the pretrained LLM to 4-bit precision. Parameters of trained models are typically stored in a 16-bit or 32-bit floating-point format; QLoRA compresses them to a 4-bit format. This reduces the memory footprint of the LLM, making it possible to fine-tune it on a single GPU.
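
To get a feel for the savings described above, here is a rough back-of-the-envelope sketch. It assumes a round figure of 1 billion parameters (an illustrative number, not the exact size of any model) and counts only the weights themselves; quantization constants, activations, gradients, and optimizer state add overhead on top.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a round 1 billion parameters, used only for illustration.
num_params = 1_000_000_000

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(num_params, bits):.2f} GB")
```

Going from 32-bit to 4-bit storage cuts the weight memory by a factor of eight, which is what makes single-GPU fine-tuning feasible.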

## Pre-requisites

Before continuing, you need a Hugging Face account. You can create one at https://huggingface.co/.

Next, you will need access to Llama 3.2 1B, which is the model we will use for this task. Use the link: https://huggingface.co/meta-llama/Llama-3.2-1B. We are going to fine-tune a base model (a model that has not been trained to follow instructions) so that it understands instructions!

Once you reach the website, complete the required form (Do not mention that you are affiliated with Accenture! Use a random university maybe)

Once you have your Hugging Face account, create an access token to use. Open your profile menu at the top right of the page and select "Access Tokens". Once the token is created, you can store it in a notepad on your local machine.

This notebook is only compatible with GPUs that have the NVIDIA Ampere architecture, for example the A100, A10, or A40.
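
As a quick sanity check before running the rest of the notebook, the sketch below tests whether the visible GPU is Ampere-class. It relies on the fact that Ampere GPUs report a CUDA compute capability of 8.x; `supports_bf16` is a hypothetical helper name introduced here, not part of any library.

```python
def supports_bf16(compute_capability: tuple) -> bool:
    """Return True for a CUDA compute capability of 8.0 or higher (Ampere reports 8.x)."""
    major, _minor = compute_capability
    return major >= 8


try:
    import torch

    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        verdict = "bf16 OK" if supports_bf16(capability) else "no bf16, consider fp16"
        print(torch.cuda.get_device_name(0), capability, verdict)
    else:
        print("No CUDA device visible")
except ImportError:
    print("PyTorch is not installed yet")
```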

## Install and import libraries
Let's install and import the required dependencies:

In [1]:
!pip install --ignore-installed transformers datasets bitsandbytes peft trl accelerate torch typing_extensions mlflow


Collecting transformers
  Using cached transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting datasets
  Using cached datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Using cached bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting peft
  Using cached peft-0.15.2-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Using cached trl-0.16.1-py3-none-any.whl.metadata (12 kB)
Collecting accelerate
  Using cached accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting torch
  Using cached torch-2.6.0-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting typing_extensions
  Using cached typing_extensions-4.13.2-py3-none-any.whl.metadata (3.0 kB)
Collecting mlflow
  Using cached mlflow-2.21.3-py3-none-any.whl.metadata (30 kB)
Collecting filelock (from transformers)
  Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Using cached huggingfac

In [2]:
%restart_python

UsageError: Line magic function `%restart_python` not found.


In [1]:
import os
import torch
from datasets import load_dataset, Dataset, load_from_disk # load datasets from Hugging Face or from disk
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForSeq2Seq, TrainerCallback) 
from trl import SFTConfig, SFTTrainer
import pandas as pd 
import numpy as np 
from peft import LoraConfig
import mlflow



## Update the package manager in the OS and install the libaio-dev package

**Note: this is only required if you do not already have the libaio package. Run it anyway and see what happens.**

In [2]:
command = "apt update"
os.system(command=command)





Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:5 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Fetched 257 kB in 1s (215 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
143 packages can be upgraded. Run 'apt list --upgradable' to see them.


0

In [3]:
command_2 = "apt-get -y install libaio-dev g++"
os.system(command=command_2)

Reading package lists...
Building dependency tree...
Reading state information...
g++ is already the newest version (4:11.2.0-1ubuntu1).
libaio-dev is already the newest version (0.3.112-13build1).
0 upgraded, 0 newly installed, 0 to remove and 143 not upgraded.


0

## Clear the GPU memory (just in case)

In [4]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import gc
gc.collect()
torch.cuda.empty_cache()

## Assign environment variables

In [None]:
## Add your HF token 
os.environ['HF_TOKEN'] = ""

## Reduce VRAM usage by reducing fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
## Assign where to cache downloaded models and datasets
## (these variables are read when the Hugging Face libraries are imported; if the cache location is not picked up, set them before the imports)
os.environ["HF_HOME"] = "/dbfs/huggingface/"
os.environ["HF_DATASETS_CACHE"] = "/dbfs/huggingface/.datasets_cache"

## Just make sure you release unused memory in the GPU
torch.cuda.empty_cache()
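
The cell above leaves the token as an empty string for you to fill in. If you prefer not to paste the token into the notebook, one option is to export `HF_TOKEN` in the environment beforehand and fail early when it is missing. The helper below is a minimal sketch of that idea; `get_hf_token` is a hypothetical name, not a Hugging Face API.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from the environment, failing early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return token
```

You could then pass `token=get_hf_token()` wherever the notebook uses `os.environ['HF_TOKEN']`.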

## Load the dataset

In [6]:

# load the preprocessed datasets from local Parquet files and rename the "prompt" column to "text"
training_dataset = load_dataset("parquet", data_files = {'train': 'training_data.parquet'}).rename_columns({"prompt": "text"})
evaluation_dataset = load_dataset("parquet", data_files = {'test': 'evaluation_data.parquet'}).rename_columns({"prompt": "text"})

## Load the model 
We will use the bitsandbytes library (through `BitsAndBytesConfig`) to load a quantized version of the Llama 3.2 1B base model. It's important to remember that although we quantize the model's weights to 4 bits, the LoRA matrices are kept in 16-bit precision, so that the learning process does not lose detail. During the forward pass, the frozen 4-bit weights are dequantized on the fly to the compute dtype (bfloat16 here), and the adapter output is added to the result.
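
To make the quantize/dequantize round trip concrete, here is a deliberately simplified toy sketch. It uses uniform absmax quantization to 15 integer levels, which is an assumption made for readability; the NF4 format used in this notebook instead maps weights to 16 levels derived from a normal distribution, and bitsandbytes implements it in optimized GPU kernels. The function names and sample weights are made up for illustration.

```python
def quantize_4bit_absmax(block):
    """Map floats to integer codes in [-7, 7] plus one scale per block."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(w / scale * 7) for w in block]
    return codes, scale


def dequantize_4bit_absmax(codes, scale):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c / 7 * scale for c in codes]


# Made-up sample weights for illustration.
weights = [0.42, -0.13, 0.07, -0.91, 0.30, 0.00, 0.66, -0.58]
codes, scale = quantize_4bit_absmax(weights)
restored = dequantize_4bit_absmax(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("largest round-trip error:", max_error)
```

The restored values are close to, but not identical to, the originals. That small loss is the price of the memory savings, and it is why the trainable LoRA matrices are kept in higher precision.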

In [7]:
use_4bit = True
# Compute dtype for 4-bit base models. 
# bnb_4bit_compute_dtype = torch.float16
bnb_4bit_compute_dtype = torch.bfloat16
# Quantization process (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_double_quant = True

bnb_config = BitsAndBytesConfig(load_in_4bit= use_4bit, bnb_4bit_quant_type= bnb_4bit_quant_type, bnb_4bit_compute_dtype= bnb_4bit_compute_dtype, bnb_4bit_use_double_quant= use_double_quant) 
## model name
model_name = "meta-llama/Llama-3.2-1B" 
## downloading the model in a 4bit format 
# model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map= "auto", token = os.environ['HF_TOKEN'], attn_implementation="flash_attention_2", use_cache=True) ## downloading the model in a 4bit format 
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map= "auto", token = os.environ['HF_TOKEN'], use_cache= False)

tokenizer = AutoTokenizer.from_pretrained(model_name, token = os.environ['HF_TOKEN'])

**Let's play with the base model a little and see what kind of outputs we get!**

In [38]:
text = "What is (are) Parasites - Taeniasis ?"
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs = tokenizer(text, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens= 500, temperature= 0.1)
outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


What is (are) Parasites - Taeniasis? - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Parasite - Para

Notice that when you ask "What is (are) Parasites - Taeniasis ?", the base model simply tries to continue the text by predicting the most likely next tokens, and it ends up repeating itself. We want to leverage its ability to learn patterns and teach it to understand questions in a domain. In our case, that means answering specific medical questions completely.

## QLoRA variables




Now that we understand what LoRA is, let's dive into some practical aspects of it. When we fine-tune a language model with QLoRA, two new hyperparameters come into play:

1. Rank (r)

2.  Alpha (α)


Let's suppose that our original weight matrix W is 10,000 x 10,000. With LoRA, W itself stays frozen; instead we train two smaller matrices, A and B, with dimensions 10,000 x 8 and 8 x 10,000, respectively. Multiplying A and B gives an update matrix with the same 10,000 x 10,000 shape as W, which is added to W. In this example, 8 is the rank of the LoRA fine-tuning. We can choose any rank when defining A and B. A higher rank means a greater number of trainable parameters, making fine-tuning more memory intensive.

The alpha parameter is a scaling factor applied to the product of the matrices A and B before it is added to W: the update is multiplied by alpha / r (or by alpha / sqrt(r) when rank-stabilized LoRA is enabled, as it is later in this notebook with `use_rslora = True`). A common choice is to set alpha to twice the rank.
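
The arithmetic behind this example can be sketched in a few lines. The helper name `lora_param_count` is made up for illustration; the rank and alpha values in the second half are the ones used later in this notebook.

```python
import math


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in the two low-rank factors (d_out x r and r x d_in)."""
    return d_out * r + r * d_in


d = 10_000
full = d * d
lora = lora_param_count(d, d, r=8)
print(f"full matrix: {full:,} parameters")
print(f"LoRA, r=8:   {lora:,} parameters ({lora / full:.2%} of the full matrix)")

# Scaling applied to the low-rank update for r=30, alpha=60.
r, alpha = 30, 60
print("standard LoRA scale (alpha / r):", alpha / r)
print("rank-stabilized scale (alpha / sqrt(r)):", round(alpha / math.sqrt(r), 2))
```

With rank 8, the two small matrices hold 160,000 values instead of the 100,000,000 in the full matrix.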

In [8]:
################################################################################
# QLoRA parameters
################################################################################
# LoRA attention dimension for the model. This is the rank of the LoRA projected matrix.
lora_r = 30
# Alpha parameter for LoRA scaling: controls the magnitude of the weight changes added to the base model when fine-tuning.
lora_alpha = 60
# Dropout probability for LoRA layers
lora_dropout = 0.1
################################################################################

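Let's create the LoRA configuration that we will add to our model.

1. lora_alpha = the alpha parameter

2. lora_dropout = the dropout parameter

3. r = the rank

4. task_type = causal language model, since the model is autoregressive

5. target_modules = the linear projection layers of the model (attention and MLP)

6. modules_to_save = modules that are trained in full and saved alongside the adapter, here the output head `lm_head`

7. use_rslora = use rank-stabilized scaling (alpha / sqrt(r)) for the update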


In [9]:
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj'],
    modules_to_save = ['lm_head'], 
    use_rslora = True
)

## Model & Tokenizer configuration

#### Model configuration involves adding a pad token to both the tokenizer and the model. This ensures that all input sequences in a batch have the same length: shorter inputs are padded to the length of the longest input in the batch.

In [10]:
# The KV cache avoids recomputing the hidden states of earlier tokens at every generation step. It only helps during generation, so we disable it for training.
model.config.use_cache = False
# Modify the tokenizer to add the pad tokens 
tokenizer_special_tokens_map = {'bos_token': '<|im_start|>',
 'eos_token': '<|im_end|>', "additional_special_tokens": ["<answer>", "</answer>", "<think>", "</think>"]}

## Update the Tokenizer 
tokenizer.add_special_tokens(tokenizer_special_tokens_map)

tokenizer.pad_token = tokenizer.eos_token 
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right" 

# Resize the model's embeddings to match the updated tokenizer and record the new special tokens in the model config
model.resize_token_embeddings(len(tokenizer))
model.config.eos_token = '<|im_end|>'
model.config.bos_token = '<|im_start|>'
# model.config.eos_token_id = tokenizer.eos_token_id
# model.config.bos_token_id = tokenizer.bos_token_id
model.config.pad_token_id = model.config.eos_token_id

# Add the LoRA adapter to the model architecture
model.add_adapter(peft_config, "adapter 1")


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


#### Let's check how many parameters we are actually fine-tuning!

In [11]:
print(sum(p.numel() for p in model.parameters() if p.requires_grad))

283815936
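
That is a large number for a "parameter-efficient" setup, so it is worth seeing where it comes from. The sketch below reproduces the count by hand. The layer dimensions are assumptions taken from the published Llama 3.2 1B configuration (they are not printed anywhere in this notebook), and the six extra vocabulary entries are the special tokens added in the tokenizer cell above.

```python
# Assumed Llama 3.2 1B dimensions (from the published model config, not printed in this notebook).
hidden, intermediate, kv_dim, num_layers = 2048, 8192, 512, 16
base_vocab, new_tokens = 128_256, 6  # 6 = bos + eos + 4 additional special tokens
r = 30

# (input features, output features) of each projection listed in target_modules
targets = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapted layer adds two low-rank factors: r x d_in and d_out x r.
lora_params = num_layers * sum(r * (d_in + d_out) for d_in, d_out in targets.values())
# lm_head is trained in full because it is listed in modules_to_save.
lm_head_params = (base_vocab + new_tokens) * hidden
total = lora_params + lm_head_params

print(f"LoRA matrices: {lora_params:,}")
print(f"lm_head copy:  {lm_head_params:,}")
print(f"total:         {total:,}")
```

Under these assumptions the total matches the 283,815,936 printed above, and it shows that the LoRA matrices account for only about 21 million of those parameters. The rest comes from training `lm_head` in full.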


## Model training arguments

Let's define the training arguments for training our model:

1. **num_train_epochs:** The number of times the model goes through your entire dataset.

2. **bf16/fp16:** The 16-bit floating-point format (bfloat16 or float16) to use during training.

3. **per_device_train_batch_size:** The number of samples processed on each device in one forward/backward pass. Larger batch sizes generally lead to more stable training, but they are limited by GPU memory. We can use gradient accumulation to effectively train on larger batches over multiple forward/backward passes before updating the model. 

4. **per_device_eval_batch_size:** The batch size used for evaluation; it is usually the same as the training batch size.

5. **gradient_accumulation_steps:** Instead of updating the A and B matrices after each batch of data, you accumulate the gradients from several batches before performing the weight update.

6. **max_grad_norm:** Limits the maximum magnitude (norm) of the gradients during training to prevent exploding gradients. Common values are 1, 3, 5, 8, 10.

7. **learning_rate:** Controls the size of the adjustments made to the A and B parameters (weights) at each optimization step. Too low, and training may be slow and the model may get stuck in local minima. Too high, and training may become unstable or diverge, which degrades performance. 

8. **weight_decay:** A regularization term that penalizes large parameters (weights). It encourages the model to learn simpler, more generalizable features. A value of 0.01 usually works well.

9. **optim:** The optimizer updates the trainable parameters during training in order to minimize the loss function. In practice, 8-bit AdamW is strongly recommended: it performs about as well as its 32-bit version while using less memory. There is also a paged version, in which parts of the optimizer state are moved automatically between CPU and GPU.

10. **lr_scheduler_type:** Changes the learning rate during training, typically starting with a higher learning rate for rapid initial progress and then decreasing it in later stages. Linear and cosine schedulers are the two most common options.

11. **warmup_steps:** The initial training period in which the learning rate is ramped up from a low value, so that the newly added parameters (like the LoRA matrices) adjust gradually before the full learning rate is reached.

12. **group_by_length:** Groups sequences of similar length into the same batch. This reduces padding, which saves memory and speeds up training considerably. 

13. **logging_steps:** Log every X update steps.
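
Two of the arguments above are easier to grasp with numbers: how gradient accumulation changes the effective batch size, and what a linear schedule with warmup does to the learning rate. The sketch below is a plain-Python illustration under the assumption of a single GPU. The helper names are made up, and the values plugged in are the ones set in the next cell.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print("effective batch size:", effective_batch_size(4, 2))
for step in (0, 3, 5, 500, 1000):
    lr = linear_schedule_lr(step, 6e-5, warmup_steps=5, total_steps=1000)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With a per-device batch of 4 and 2 accumulation steps, each weight update sees 8 samples, while only 4 samples have to fit in GPU memory at a time.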


In [12]:
output_dir = "./medical_model_results"
# Number of training epochs
num_train_epochs = 1.3

# Enable fp16/bf16 training. bf16 needs a GPU that supports it (such as the A100); on GPUs without bf16 support, set fp16 = True and bf16 = False instead
fp16 = False
bf16 = True

# Batch size per GPU for training, i.e. how many samples the model goes through in a single forward and backward pass. The Q and A pairs are long, so GPU memory limits us to a few samples per batch
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4
 
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 2

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping). Limits the maximum magnitude of the gradient vector during training to prevent exploding gradients. 
max_grad_norm = 1

# Initial learning rate (AdamW optimizer)
learning_rate = 6e-5

# Weight decay to apply to all layers except bias/LayerNorm weights. This is a regularization parameter added to the loss function for avoiding large weights 
weight_decay = 0.01

# Optimizer to use is adam 8bit 
optim = "paged_adamw_8bit"
# optim = "paged_adamw_32bit"
# Learning rate schedule, how the learning rate decays over time
lr_scheduler_type = "linear"

# steps for a linear warmup (from 0 to learning rate)
warmup_steps = 5

# Log every X updates steps
logging_steps = 3

# Group samples of similar length into the same batch 
group_by_length = True

## Whether to save checkpoints in the safetensors format (passed as save_safetensors below)
safe_tensors = False

## Keep overwriting the output directory
overwrite_output_dir = True

## Save the model after every X steps 
save_steps = 300

## Let's push the model to the hub as well (defined here but not passed to SFTConfig below)
push_to_hub = True

## Save the best model after training (defined here but not passed to SFTConfig below)
save_best = True 

## Directory for the output logs 
log_dir = "./logs"

## Defined here but not passed to SFTConfig below
load_best_model_at_end = True

## Upper bound on optimizer steps; a positive value overrides num_train_epochs
max_steps = 1000

## Create a training configuration using the training arguments:

In [13]:

training_args = SFTConfig(
        learning_rate= learning_rate,
        lr_scheduler_type= lr_scheduler_type,
        eval_strategy = "steps",
        per_device_train_batch_size= per_device_train_batch_size,
        per_device_eval_batch_size= per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        num_train_epochs= num_train_epochs,
        fp16= fp16,
        bf16= bf16,
        logging_steps= logging_steps,
        optim= optim,
        weight_decay= weight_decay,
        warmup_steps= warmup_steps,
        output_dir=output_dir,
        seed=42, 
        group_by_length= group_by_length, 
        do_eval = True, 
        overwrite_output_dir = overwrite_output_dir,
        max_grad_norm = max_grad_norm, 
        save_safetensors = safe_tensors, 
        hub_token = os.environ['HF_TOKEN'],
        save_steps = save_steps,
        logging_dir = log_dir, 
        report_to = 'mlflow', 
        max_steps = max_steps)

## Model Training 

We will use the SFTTrainer class to train the model with the training configuration from the previous step. Training metrics are logged to MLflow, so you can open the MLflow UI to watch the loss curves during training.

In [14]:

mlflow.set_experiment("fine tuning llama 3.2 1B test 1")

# mlflow.log_params(vars(peft_config))
# mlflow.log_params(vars(training_args))

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset= training_dataset['train'],
    eval_dataset= evaluation_dataset['test'],
    args=training_args)

trainer.train()


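| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 3 | 2.189 | 2.209043 |
| 6 | 2.0705 | 1.928275 |
| 9 | 1.9401 | 1.834912 |
| 12 | 1.8167 | 1.79132 |
| 15 | 1.8752 | 1.760048 |
| 18 | 1.757 | 1.740023 |
| 21 | 1.7761 | 1.721646 |
| 24 | 1.7246 | 1.705224 |
| 27 | 1.748 | 1.690865 |
| 30 | 1.7175 | 1.674868 |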


KeyboardInterrupt: 

## Save the Model locally

In [17]:
# trainer.save_state()
# trainer.save_pretrained("new_medical_model")
trainer.save_model()



## Clear the GPU memory 

In [18]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import gc
gc.collect()
torch.cuda.empty_cache()

## Import the model and infer

In [27]:
from peft import PeftModel, PeftConfig, AutoPeftModelForCausalLM

## specify the directory where the training results are stored, including the checkpoint folder path
## (named checkpoint_dir so that it does not shadow Python's built-in dir function)
checkpoint_dir = "./medical_model_results/checkpoint-1313"

## Get the tokenizer from the directory 
tokenizer_tuned = AutoTokenizer.from_pretrained(checkpoint_dir)

## get the base model from hugging face again 
# trial_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map= "auto", token = os.environ['HF_TOKEN'], use_cache = False)

trial_model = AutoPeftModelForCausalLM.from_pretrained(checkpoint_dir, is_trainable = False, device_map = "auto")

## Resize the model embeddings to the size of the tuned tokenizer
trial_model.resize_token_embeddings(len(tokenizer_tuned))
trial_model.config.eos_token = '<|im_end|>'
trial_model.config.bos_token = '<|im_start|>'
trial_model.config.eos_token_id = tokenizer_tuned.eos_token_id
trial_model.config.bos_token_id = tokenizer_tuned.bos_token_id
trial_model.config.pad_token_id = trial_model.config.eos_token_id

In [28]:
print(type(trial_model))

<class 'peft.peft_model.PeftModelForCausalLM'>


In [29]:
print(tokenizer_tuned.all_special_tokens)

['<|im_start|>', '<|im_end|>']


### Let's push the model and the tokenizer to the hub!

In [39]:
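# merge_and_unload() returns the merged model; assign the result if you want to push merged weights.
# As written, trial_model is still the PEFT wrapper, so the upload below is in adapter format.
trial_model.merge_and_unload()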

Repo_name = "Insert complete repo name here"

trial_model.push_to_hub(Repo_name)
tokenizer_tuned.push_to_hub(Repo_name)

README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]



adapter_model.safetensors:   0%|          | 0.00/2.10G [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/digitalApe14/tuned_llama/commit/754456d5a3be13c3dab11efee6e60cbf138be27b', commit_message='Upload tokenizer', commit_description='', oid='754456d5a3be13c3dab11efee6e60cbf138be27b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/digitalApe14/tuned_llama', endpoint='https://huggingface.co', repo_type='model', repo_id='digitalApe14/tuned_llama'), pr_revision=None, pr_num=None)

In [40]:
text = "What is (are) Parasites - Taeniasis ?"
inputs = tokenizer_tuned(text, return_tensors="pt").to("cuda")
outputs = trial_model.generate(**inputs,temperature=0.1, max_new_tokens=500)
print(tokenizer_tuned.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


What is (are) Parasites - Taeniasis? (also known as Taenia solium infection)
What is (are) Taenia solium infection?
Taenia solium is an intestinal parasite of humans that causes cysticercosis, a disease that can lead to serious health problems if not treated. The parasite is spread by the consumption of undercooked pork or other meat infected with Taenia solium cysts. The infection is found throughout Latin America, the Caribbean, the Middle East, the Far East, and parts of Africa and Asia. In the United States, the disease has been found in California, Florida, and Texas. The disease is also called cysticercosis, taeniasis, and taeniasis taenialis.
What are the symptoms of Taenia solium infection?
The symptoms of cysticercosis caused by Taenia solium infection are similar to those of other intestinal infections, such as gastroenteritis (inflammation of the lining of the stomach and intestines), but can be confused with other conditions. The disease is usually more severe in children a