### Install the Required Libraries
Reference articles: https://www.analyticsvidhya.com/blog/2024/12/fine-tuning-llama-3-2-3b-for-rag/, https://docs.unsloth.ai/get-started/installing-+-updating/pip-install, https://mer.vin/2024/02/unsloth-fine-tuning/
 - pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
 - pip install --upgrade pip
 - pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
 

In [1]:
import torch
import re 

In [2]:
print(torch.__version__)

2.4.0+cu121


##### Further references (videos and articles): https://www.youtube.com/watch?v=UWF6dxQYcbU&ab_channel=PromptEngineer, https://www.youtube.com/watch?v=Gpyukc6c0w8&ab_channel=MervinPraison, https://medium.com/@alexandros_chariton/how-to-fine-tune-llama-3-2-instruct-on-your-own-data-a-detailed-guide-e5f522f397d7

In [3]:
from unsloth import FastLanguageModel, is_bfloat16_supported, train_on_responses_only

from datasets import load_dataset, Dataset

from trl import SFTTrainer, apply_chat_template

from transformers import TrainingArguments, DataCollatorForSeq2Seq, TextStreamer
import warnings
warnings.filterwarnings("ignore")

import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!


In [4]:
import xformers

### Initialize the Model and Tokenizers

- More models are available at https://huggingface.co/unsloth


In [6]:
max_seq_length = 2048 
dtype = None # None for auto-detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
	model_name = "unsloth/Llama-3.2-3B-Instruct",
	max_seq_length = max_seq_length,
	dtype = dtype,
	load_in_4bit = load_in_4bit,
	# token = "hf_...", # use if using gated models like meta-llama/Llama-3.2-11b
)


==((====))==  Unsloth 2025.1.7: Fast Llama patching. Transformers: 4.48.1.
   \\   /|    GPU: NVIDIA GeForce RTX 4070 Laptop GPU. Max memory: 7.996 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.4.0+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### Initialize the Model for PEFT

In [7]:
model = FastLanguageModel.get_peft_model(
	model,
	r = 16,
	target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  	"gate_proj", "up_proj", "down_proj",],
	lora_alpha = 16,
	lora_dropout = 0, 
	bias = "none",
	use_gradient_checkpointing = "unsloth",
	random_state = 42,
	use_rslora = False, 
	loftq_config = None,
)


Unsloth 2025.1.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


#### Description for Each Parameter
- r: Rank of the LoRA matrices; higher values add capacity and can improve accuracy, but use more memory and may overfit (suggested: 8–128).
- target_modules: Modules to fine-tune; including all attention and MLP projections generally gives better results.
- lora_alpha: Scaling factor for the adapter update; typically equal to or double the rank r.
- lora_dropout: Dropout rate on the adapter; set to 0 for the optimized, faster training path.
- bias: Bias type; “none” is the optimized setting and trains no bias terms.
- use_gradient_checkpointing: Trades extra compute for lower memory in long-context training; “unsloth” is highly recommended.
- random_state: Seed for reproducible runs (e.g., 42).
- use_rslora: Enables rank-stabilized LoRA, which scales the adapter by lora_alpha / sqrt(r) instead of lora_alpha / r, keeping updates stable at higher ranks.
- loftq_config: LoftQ settings; initializes the LoRA matrices from the top r singular vectors of the weights, which can improve accuracy but is memory-intensive at startup. None disables it.
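
The effect of `r` on adapter size can be checked with a little arithmetic. The sketch below is an illustration, not part of the original notebook: the layer shapes are assumed from the published Llama 3.2 3B configuration (hidden size 3072, MLP size 8192, key/value projection width 1024), and the helper names are hypothetical. With `r = 16` it reproduces the 24,313,856 trainable parameters that the training banner reports later in this notebook.

```python
# Assumed Llama 3.2 3B shapes (not stated in this notebook; taken from the
# published model config): hidden size 3072, MLP size 8192, and 8 key/value
# heads of dimension 128, so k_proj and v_proj map 3072 -> 1024.
HIDDEN, MLP, KV_DIM = 3072, 8192, 1024
NUM_LAYERS = 28  # matches "patched 28 layers" in the output above

# (input features, output features) of each targeted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scaling(r: int, lora_alpha: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the adapter update: alpha/r, or alpha/sqrt(r) for rsLoRA."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r


print(f"{lora_param_count(16):,}")  # 24,313,856
print(lora_scaling(16, 16), lora_scaling(16, 16, use_rslora=True))  # 1.0 4.0
```

Doubling `r` doubles the adapter size, which is why memory use grows with the rank.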

### Data Processing
We will fine-tune on a RAG dataset, `neural-bridge/rag-dataset-1200`, downloaded from the Hugging Face Hub.

In [8]:
dataset = load_dataset("neural-bridge/rag-dataset-1200", split = "train")
dataset

Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 960
})

In [9]:
dataset[0:3]

{'context': ['Francisco Rogers found the answer to a search query collar george herbert essay\nLink ----> collar george herbert essay\nWrite my essay ESSAYERUDITE.COM\nconstitution research paper ideas\ndefinition essay humility\nbusiness strategy case study solution\ncorporals course essay\ndecisions in paradise essays\ncollege essay word count\ncredit cart terminal paper\nbyron don juan essay\ndemocratic party essays\ncoursework language learning material teaching\nchristmas commercialized essay\ndahrendorf essays theory society\nbuy apa format essay buy apa format essay\nconan doyle speckled band essay\ncollege essay application prompt\ncolumbia university mfa creative writing acceptance rate\ncrucible coursework questions\ncollege essay topics texas\ncover letter thesis proposal\nciting ma thesis\ncompare and contrast essays for elementary\ncoursework completed without degree\ncomparison islam christianity essay\ncheerleading stereotypes essay\ncultural diversity college essay\ncri

In [10]:
dataset[0]

{'context': 'Francisco Rogers found the answer to a search query collar george herbert essay\nLink ----> collar george herbert essay\nWrite my essay ESSAYERUDITE.COM\nconstitution research paper ideas\ndefinition essay humility\nbusiness strategy case study solution\ncorporals course essay\ndecisions in paradise essays\ncollege essay word count\ncredit cart terminal paper\nbyron don juan essay\ndemocratic party essays\ncoursework language learning material teaching\nchristmas commercialized essay\ndahrendorf essays theory society\nbuy apa format essay buy apa format essay\nconan doyle speckled band essay\ncollege essay application prompt\ncolumbia university mfa creative writing acceptance rate\ncrucible coursework questions\ncollege essay topics texas\ncover letter thesis proposal\nciting ma thesis\ncompare and contrast essays for elementary\ncoursework completed without degree\ncomparison islam christianity essay\ncheerleading stereotypes essay\ncultural diversity college essay\ncrit

The dataset has three keys: `context`, `question`, and `answer`.

The data must be converted to the format that the trainer and the model's chat template expect. Here that is a conversational prompt-completion format, in which each `prompt` and `completion` is a list of role/content messages. Read more [details](https://huggingface.co/docs/trl/main/dataset_formats#converting-a-conversational-dataset-into-a-standard-dataset) in the TRL documentation.

So, let’s convert the data into the required format:

In [11]:
def convert_dataset_to_dict(dataset):
    dataset_dict = {
        "prompt": [],
        "completion": []
    }

    for row in dataset:
        user_content = f"Context: {row['context']}\nQuestion: {row['question']}"
        assistant_content = row['answer']

        dataset_dict["prompt"].append([
            {"role": "user", "content": user_content}
        ])
        dataset_dict["completion"].append([
            {"role": "assistant", "content": assistant_content}
        ])
    return dataset_dict
    
    
converted_data = convert_dataset_to_dict(dataset)



In [12]:
converted_data["prompt"][0]

[{'role': 'user',
  'content': 'Context: Francisco Rogers found the answer to a search query collar george herbert essay\nLink ----> collar george herbert essay\nWrite my essay ESSAYERUDITE.COM\nconstitution research paper ideas\ndefinition essay humility\nbusiness strategy case study solution\ncorporals course essay\ndecisions in paradise essays\ncollege essay word count\ncredit cart terminal paper\nbyron don juan essay\ndemocratic party essays\ncoursework language learning material teaching\nchristmas commercialized essay\ndahrendorf essays theory society\nbuy apa format essay buy apa format essay\nconan doyle speckled band essay\ncollege essay application prompt\ncolumbia university mfa creative writing acceptance rate\ncrucible coursework questions\ncollege essay topics texas\ncover letter thesis proposal\nciting ma thesis\ncompare and contrast essays for elementary\ncoursework completed without degree\ncomparison islam christianity essay\ncheerleading stereotypes essay\ncultural d

In [13]:
converted_data["completion"][0]

[{'role': 'assistant',
  'content': 'Francisco Rogers found the answer to a search query collar george herbert essay.'}]

In [14]:
dataset = Dataset.from_dict(converted_data)
dataset

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 960
})

In [15]:
dataset[0]

{'prompt': [{'content': 'Context: Francisco Rogers found the answer to a search query collar george herbert essay\nLink ----> collar george herbert essay\nWrite my essay ESSAYERUDITE.COM\nconstitution research paper ideas\ndefinition essay humility\nbusiness strategy case study solution\ncorporals course essay\ndecisions in paradise essays\ncollege essay word count\ncredit cart terminal paper\nbyron don juan essay\ndemocratic party essays\ncoursework language learning material teaching\nchristmas commercialized essay\ndahrendorf essays theory society\nbuy apa format essay buy apa format essay\nconan doyle speckled band essay\ncollege essay application prompt\ncolumbia university mfa creative writing acceptance rate\ncrucible coursework questions\ncollege essay topics texas\ncover letter thesis proposal\nciting ma thesis\ncompare and contrast essays for elementary\ncoursework completed without degree\ncomparison islam christianity essay\ncheerleading stereotypes essay\ncultural diversit

In [16]:
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
dataset[0]
# f= tokenizer.apply_chat_template(dataset, tokenize=False)

Map: 100%|██████████| 960/960 [00:00<00:00, 7900.63 examples/s]


{'prompt': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Feb 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContext: Francisco Rogers found the answer to a search query collar george herbert essay\nLink ----> collar george herbert essay\nWrite my essay ESSAYERUDITE.COM\nconstitution research paper ideas\ndefinition essay humility\nbusiness strategy case study solution\ncorporals course essay\ndecisions in paradise essays\ncollege essay word count\ncredit cart terminal paper\nbyron don juan essay\ndemocratic party essays\ncoursework language learning material teaching\nchristmas commercialized essay\ndahrendorf essays theory society\nbuy apa format essay buy apa format essay\nconan doyle speckled band essay\ncollege essay application prompt\ncolumbia university mfa creative writing acceptance rate\ncrucible coursework questions\ncollege essay topics texas\ncover letter thesis proposal\nciting ma thesi

In [17]:
from pprint import pprint
pprint(dataset[6]["prompt"])

('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Feb 2025\n'
 '\n'
 '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 'Context: Wander round St Cézaire\n'
 'Explore the picturesque hilltop town, stop for a café au lait or a leisurely '
 'lunch on the square, meet the locals playing boules, and soak up the '
 'Provencal atmosphere.\n'
 'Follow the Siagne\n'
 'As well as exploring the attractive village of St Cézaire on your doorstep, '
 'you could follow the river Siagne downstream to Auribeau, another pretty '
 'village which clings to the hillside. The Siagne meets the Mediterranean '
 'ocean at Mandelieu-la-Napoule, with its beach and marina, and a 14th century '
 'fortified castle which was renovated in the 1920s by an eccentric American '
 'couple - the sculptor husband renovated the interior whilst the architect '
 'wife applied her talents to the gardens. The castle and gardens are open 
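
To make the layout of the rendered prompt easier to read, here is a hand-written approximation of the Llama 3 chat format. It is an illustration only: the function names are hypothetical, and the real template, applied through `apply_chat_template`, also injects the default system block with the date lines seen in the output above. The completion should be the answer text followed by the end-of-turn token.

```python
def render_llama3_prompt(user_content: str, system_content: str = "") -> str:
    """Hand-written approximation of the Llama 3 chat layout (illustration only)."""
    parts = ["<|begin_of_text|>"]
    if system_content:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_content}<|eot_id|>"
        )
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_content}<|eot_id|>"
    )
    # Generation prompt: the header that tells the model an answer comes next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def render_llama3_completion(assistant_content: str) -> str:
    """The completion is the answer text closed by the end-of-turn token."""
    return f"{assistant_content}<|eot_id|>"


prompt = render_llama3_prompt("Context: The sky is blue.\nQuestion: What color is the sky?")
print(prompt + render_llama3_completion("Blue."))
```

The two header strings used here are the same markers that `train_on_responses_only` receives later as `instruction_part` and `response_part`.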

### Setting Up the Trainer Parameters

Initialize the trainer for fine-tuning the small language model (SLM):



In [18]:
trainer = SFTTrainer(
	model = model,
	tokenizer = tokenizer,
	train_dataset = dataset,
	max_seq_length = max_seq_length,
	data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
	dataset_num_proc = 1,
	packing = False, # Can make training 5x faster for short sequences.
	args = TrainingArguments(
    	per_device_train_batch_size = 2,
    	gradient_accumulation_steps = 4,
    	warmup_steps = 5,
    	# num_train_epochs = 1, # Set this for 1 full training run.
    	max_steps = 6, # using small number to test
    	learning_rate = 2e-4,
    	fp16 = not is_bfloat16_supported(),
    	bf16 = is_bfloat16_supported(),
    	logging_steps = 1,
    	optim = "adamw_8bit",
    	weight_decay = 0.01,
    	lr_scheduler_type = "linear",
    	seed = 3407,
    	output_dir = "outputs",
    	report_to = "none", # Disables logging integrations; set to "wandb" etc. to enable one
	),
)


Map: 100%|██████████| 960/960 [00:00<00:00, 2668.27 examples/s]


### Description of Some of the Parameters

- per_device_train_batch_size: Batch size per device; increase to use more GPU memory, but watch for padding inefficiency (suggested: 2).
- gradient_accumulation_steps: Accumulates gradients over several batches to simulate a larger batch size with little extra memory; increase for smoother loss curves (suggested: 4).
- max_steps: Total number of optimizer steps; keep it small for a quick test (6 in this notebook, or e.g. 60 for a short run), or use `num_train_epochs` for full passes over the dataset (e.g., 1–3).
- learning_rate: Step size of the optimizer; lower rates train more slowly but tend to converge more stably, while higher rates train faster but can diverge. 2e-4 is a common starting point for LoRA.
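
The interaction between these settings is simple arithmetic. The following sketch (variable names are illustrative) works out the effective batch size and how much of the 960-row dataset the six-step test run actually covers.

```python
import math

num_examples = 960                 # rows in the training split shown earlier
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 6

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Optimizer steps needed for one full pass over the data.
steps_per_epoch = math.ceil(num_examples / effective_batch)
# Examples actually visited by the short test run.
examples_seen = max_steps * effective_batch

print(effective_batch)   # 8
print(steps_per_epoch)   # 120
print(examples_seen, f"{examples_seen / num_examples:.0%}")  # 48 5%
```

With `max_steps = 6` the run sees only 48 examples, so it is a smoke test of the pipeline. A full epoch would take 120 steps.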

In [25]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4070 Laptop GPU. Max memory = 7.996 GB.
7.064 GB of memory reserved.


In [20]:
trainer = train_on_responses_only(
	trainer,
	instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
	response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)


Map: 100%|██████████| 960/960 [00:00<00:00, 2829.50 examples/s]
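
Conceptually, `train_on_responses_only` keeps the loss on the assistant's answer and ignores the prompt tokens. The sketch below illustrates that idea on a toy sequence. It is not Unsloth's implementation: the helper names are hypothetical, and strings stand in for token ids. Positions labeled `-100` are skipped by the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def find_subsequence(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_non_response(tokens, instruction_marker, response_marker):
    """Copy tokens into labels, keeping only tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(tokens)
    pos = 0
    while True:
        start = find_subsequence(tokens, response_marker, pos)
        if start == -1:
            break
        answer_start = start + len(response_marker)
        # The answer runs until the next instruction marker or the end of the sequence.
        end = find_subsequence(tokens, instruction_marker, answer_start)
        if end == -1:
            end = len(tokens)
        labels[answer_start:end] = tokens[answer_start:end]
        pos = end
    return labels


# Toy "tokens": strings stand in for token ids to keep the example readable.
tokens = ["<user>", "Context", "Question", "<assistant>", "The", "answer", "<eot>"]
labels = mask_non_response(tokens, ["<user>"], ["<assistant>"])
print(labels)  # [-100, -100, -100, -100, 'The', 'answer', '<eot>']
```

Only the answer tokens and the end-of-turn token contribute to the loss, so the model is not trained to reproduce the long retrieved context.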


### Fine-tuning the Model

In [21]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 960 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 6
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,1.4867
2,1.5474
3,1.0767
4,1.6428
5,1.0426
6,0.7397
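
The loss values above are noisy partly because the run is so short: with `warmup_steps = 5` and `max_steps = 6`, almost the whole run is spent in warmup. The sketch below computes a linear warmup and decay schedule by hand. It assumes the formula matches the `linear` scheduler in `transformers` (0-based step counting, warmup from zero), and the function name is hypothetical.

```python
def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a 0-based optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Settings from the trainer above: learning_rate=2e-4, warmup_steps=5, max_steps=6.
for step in range(6):
    print(step + 1, f"{linear_schedule_lr(step, 2e-4, 5, 6):.1e}")
```

Under these assumptions the learning rate reaches its peak of 2e-4 only on the sixth and final step. For a real run, `max_steps` should be much larger than `warmup_steps`.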


For installation issues, see the [Unsloth Windows installation guide](https://docs.unsloth.ai/get-started/installing-+-updating/windows-installation).

### Test and Save the Model

Let’s use the model for inference:

In [23]:
FastLanguageModel.for_inference(model)

messages = [
	{"role": "user", "content": "Context: The sky is typically clear during the day. Question: What color is the water?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	tokenize = True,
	add_generation_prompt = True,
	return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
               	use_cache = True, temperature = 1.5, min_p = 0.1)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The water can be various colors, including blue, green, and others, depending on its clarity, depth, and surrounding environment.

Some examples of colored water include:

1. Ocean water: It is typically blue due to a phenomenon called scattering, where shorter (blue) wavelengths of light are scattered more than longer (red) wavelengths, giving the appearance of blue water.

2. Lake water: In clear lakes, the water may appear blue or blue-green, depending on its clarity and depth. In more turbid or clouded lakes, it might appear green or brown.

3. River water: River water is often a mix of brown


To save the trained model with the LoRA weights merged into the base model in 16-bit precision, use the code below:

In [24]:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 8.83 out of 31.71 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 61%|██████    | 17/28 [00:00<00:00, 27.93it/s]
We will save to Disk and not RAM now.
100%|██████████| 28/28 [00:03<00:00,  7.59it/s]


Unsloth: Saving tokenizer... Done.
Done.


### Conclusion
Fine-tuning Llama 3.2 3B for RAG tasks shows that a small model can be adapted on a single consumer GPU, here a laptop RTX 4070 with 8 GB of memory. LoRA on a 4-bit quantized base model trains only about 24 million adapter parameters, which keeps memory and compute costs low. The six-step run in this notebook only verifies that the pipeline works; for a usable model, increase `max_steps` or set `num_train_epochs`, then evaluate on held-out questions. This approach makes domain-specific retrieval-augmented generation more accessible, scalable, and cost-effective.