<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/fine_tuning_llama_3_2_3b_dpo_peft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLaMA with Unsloth and Direct Preference Optimization

## Introduction

This notebook walks through fine-tuning a Llama 3.2 Instruct model with the Unsloth library and Direct Preference Optimization (DPO). It sets up the environment, loads a pre-trained model with optional 4-bit quantization to reduce memory use, attaches LoRA adapters for Parameter-Efficient Fine-Tuning (PEFT), prepares a preference dataset, and configures the DPO trainer. It then shows how to run inference, stream generated text as it is produced, and save the fine-tuned model in formats suited to different deployment targets.


## Methodology

### Setup and Installation

This block removes any existing `torch`, `torchvision`, and `torchaudio` installations and reinstalls them, together with `xformers`, from the PyTorch CUDA 12.1 wheel index. It then installs `unsloth`, replaces it with the latest development version from GitHub, and upgrades `transformers`.


In [1]:
%%capture
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install --upgrade --no-cache-dir transformers
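
After a reinstall like this, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and `trl` is included on the assumption that it is pulled in as a dependency, since the notebook imports it later.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


report = installed_versions(["torch", "xformers", "unsloth", "transformers", "trl"])
for name, version in report.items():
    print(f"{name:>12}: {version or 'not installed'}")
```

The printed versions can be compared with the ones Unsloth reports in its banner when the model is loaded.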

### Load the Language Model

Imports `FastLanguageModel` from `unsloth` and initializes the model and tokenizer with specified parameters, including sequence length, data type, and optional 4-bit quantization for memory efficiency.


In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.7: Fast Llama patching. Transformers: 4.48.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
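
To see why `load_in_4bit` matters on a 14.7 GB Tesla T4, a back-of-the-envelope estimate of weight storage is enough. This is a rough sketch, not a measurement: the parameter count of 3.2e9 is an assumption for a "3B" model, and the estimate ignores activations, optimizer state, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3.2e9  # assumption: rough parameter count for a "3B" model

for label, bits in [("float16 / bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>20}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The `model.safetensors` download above is 2.24 GB, somewhat more than the 4-bit estimate, presumably because some tensors are kept at higher precision and the quantization constants add overhead.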

### Apply PEFT (Parameter-Efficient Fine-Tuning)

Configures the model for fine-tuning using LoRA (Low-Rank Adaptation) by specifying parameters like rank, target modules, dropout, and gradient checkpointing to optimize memory usage and training efficiency.


In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
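
The number of trainable parameters follows directly from the LoRA rank and the shapes of the targeted projections: each adapted linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` weights. The sketch below works this out; the layer dimensions are assumptions about the Llama 3.2 3B architecture (hidden size 3072, MLP size 8192, key/value projection size 1024, 28 layers) and are not stated in the notebook.

```python
# Assumed (in_features, out_features) for the targeted projections in one decoder layer.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3072, 8192, 1024, 28

TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Each adapter holds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(TARGET_SHAPES, rank=16, num_layers=NUM_LAYERS)
print(f"{total:,}")  # 24,313,856
```

With `r = 16` this gives 24,313,856, which matches the "Number of trainable parameters" reported in the training banner later in the notebook.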


### Prepare the Dataset

Loads the `Intel/orca_dpo_pairs` preference dataset, keeps a shuffled subset of 500 samples, and applies `format_prompt` to each one. The function builds an Alpaca-style `prompt` from a fixed instruction and the sample's `question`, and appends the end-of-sequence token to the `chosen` and `rejected` responses.


In [4]:
# Define the prompt template for formatting data
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Retrieve the end-of-sequence token from the tokenizer
EOS_TOKEN = tokenizer.eos_token

# Function to format a single sample with the defined prompt structure
def format_prompt(sample):
    instruction = "You are an AI assistant. You will be given a task. You must generate a correct answer."
    input_text = sample["question"]

    # Build the prompt, leaving the response slot empty for the model to fill
    sample["prompt"] = alpaca_prompt.format(instruction, input_text, "")
    # Add EOS_TOKEN to accepted and rejected responses
    sample["chosen"] = sample["chosen"] + EOS_TOKEN
    sample["rejected"] = sample["rejected"] + EOS_TOKEN
    return sample

# Load and preprocess the dataset
from datasets import load_dataset
dataset = load_dataset("Intel/orca_dpo_pairs")["train"]

# Shuffle the dataset, limit to 500 samples, and apply formatting
dataset = dataset.shuffle(seed=42).select(range(500))
dataset = dataset.map(format_prompt)

README.md:   0%|          | 0.00/196 [00:00<?, ?B/s]

orca_rlhf.jsonl:   0%|          | 0.00/36.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12859 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
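
The effect of `format_prompt` can be checked without downloading the dataset or loading a tokenizer. The sketch below repeats the template and the formatting logic on a made-up record; the sample contents and the `EOS_TOKEN` value are placeholders for illustration, and this version copies the record rather than modifying it in place.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<|eot_id|>"  # placeholder; the notebook reads tokenizer.eos_token instead


def format_prompt(sample):
    instruction = "You are an AI assistant. You will be given a task. You must generate a correct answer."
    formatted = dict(sample)  # copy so the input record is left untouched
    formatted["prompt"] = ALPACA_PROMPT.format(instruction, sample["question"], "")
    formatted["chosen"] = sample["chosen"] + EOS_TOKEN
    formatted["rejected"] = sample["rejected"] + EOS_TOKEN
    return formatted


# Hypothetical record with the same fields the notebook reads from the dataset.
sample = {
    "question": "What is the capital of France?",
    "chosen": "Paris",
    "rejected": "London",
}

formatted = format_prompt(sample)
print(formatted["prompt"])
print(repr(formatted["chosen"]), repr(formatted["rejected"]))
```

The prompt ends at `### Response:` with nothing after it, so during training the `chosen` and `rejected` strings are the only text scored as the model's answer.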

In [5]:
import pprint
row = dataset[1]
print('INSTRUCTION: ' + '=' * 50)
pprint.pprint(row["prompt"])
print('ACCEPTED: ' + '=' * 50)
pprint.pprint(row["chosen"])
print('REJECTED: ' + '=' * 50)
pprint.pprint(row["rejected"])

('Below is an instruction that describes a task, paired with an input that '
 'provides further context. Write a response that appropriately completes the '
 'request.\n'
 '\n'
 '### Instruction:\n'
 'You are an AI assistant. You will be given a task. You must generate a '
 'correct answer.\n'
 '\n'
 '### Input:\n'
 'Extract the answer to the following question from the movie plot. If the '
 'question isn\'t answerable, please output "Can\'t answer". Question: What '
 "does Johanna cut from Katniss's arm? Title: The Hunger Games: Catching Fire "
 'Movie plot: After winning the 74th Hunger Games, Katniss Everdeen (Jennifer '
 'Lawrence) and Peeta Mellark (Josh Hutcherson) return home to District 12. '
 'President Snow visits Katniss at her home. The two make an agreement to not '
 'lie to one another, and Snow explains that her actions in the Games have '
 'inspired rebellions across the districts. He orders her to use the upcoming '
 'victory tour to convince him that her actions were 

### Configure the DPO Trainer

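Sets up the Direct Preference Optimization (DPO) trainer with training arguments such as batch size, learning rate, and mixed-precision settings, plus DPO-specific settings such as `beta` and the maximum prompt and sequence lengths. `PatchDPOTrainer()` is called before the trainer is created to enable reward statistics in the training log.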


In [6]:
# Enable reward modelling stats (patch before creating the trainer)
from unsloth import PatchDPOTrainer, is_bfloat16_supported
PatchDPOTrainer()
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    beta = 0.1,
    train_dataset = dataset,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)

Extracting prompt from train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]
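
The batch settings above determine how many optimizer steps the run takes. The sketch below reproduces that arithmetic for this configuration. The rounding rules (floor division for steps per epoch, ceiling for warmup steps) reflect an assumption about how the Hugging Face `Trainer` computes them and may differ between versions; the function name `schedule_summary` is hypothetical.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum, epochs,
                     warmup_ratio, num_devices=1):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(micro_batches // grad_accum, 1)  # assumed floor division
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": per_device_batch * grad_accum * num_devices,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),  # assumed ceiling
    }


summary = schedule_summary(
    num_examples=500, per_device_batch=2, grad_accum=4, epochs=1, warmup_ratio=0.1
)
print(summary)
```

For 500 examples this gives an effective batch size of 8 and 62 optimizer steps, in line with the "Total batch size = 8 | Total steps = 62" line in the training banner.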

### Start Training

Initiates the training process using the configured DPO trainer.


In [None]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 62
 "-____-"     Number of trainable parameters = 24,313,856


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.7726 | 0.0 | 0.0 | 0.0 | 0.0 | -179.639725 | -149.344894 | 0.414554 | 0.507371 |
| 2 | 2.7726 | 0.0 | 0.0 | 0.0 | 0.0 | -240.43811 | -257.39566 | 0.264283 | 0.736304 |
| 3 | 2.7752 | 0.001586 | 0.002899 | 0.625 | -0.001313 | -162.001999 | -209.180008 | 0.416307 | 0.161567 |
| 4 | 2.7708 | 0.00139 | 0.000468 | 0.75 | 0.000923 | -131.183655 | -142.446167 | 0.457675 | 0.127763 |
| 5 | 2.7765 | -0.003428 | -0.001469 | 0.375 | -0.001959 | -295.34198 | -330.808929 | 0.101384 | 0.312241 |
| 6 | 2.7768 | -0.004443 | -0.002346 | 0.375 | -0.002097 | -181.925308 | -386.13031 | 0.031255 | 0.094382 |
| 7 | 2.7652 | 3.3e-05 | -0.003688 | 0.625 | 0.003721 | -202.956238 | -192.848541 | 0.387834 | 0.242426 |
| 8 | 2.7817 | -0.000662 | 0.003867 | 0.25 | -0.004529 | -201.73262 | -178.644135 | 0.244898 | 0.120074 |
| 9 | 2.7625 | 0.00068 | -0.0044 | 0.75 | 0.005081 | -170.191406 | -139.833725 | 0.458102 | 0.742569 |
| 10 | 2.7666 | -0.000883 | -0.003927 | 0.625 | 0.003044 | -153.659744 | -258.160278 | 0.141125 | 0.272556 |
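
The logged columns can be related to the standard sigmoid form of the DPO objective, in which each reward is `beta` times the log-probability gap between the policy and the reference model, and the loss is `-log(sigmoid(chosen_reward - rejected_reward))`. The sketch below is a plain-Python illustration of that formula for a single pair, not the TRL implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, plus the implied rewards."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, chosen_reward, rejected_reward, margin


# Before any update the policy equals the reference, so both rewards are zero.
loss, chosen_reward, rejected_reward, margin = dpo_loss(
    policy_chosen_logp=-149.344894, policy_rejected_logp=-179.639725,
    ref_chosen_logp=-149.344894, ref_rejected_logp=-179.639725,
)
print(f"per-pair loss: {loss:.4f}, times 4: {4 * loss:.4f}")
```

With zero rewards the per-pair loss is ln 2, about 0.6931. The logged value of 2.7726 at steps 1 and 2 equals four times that, which is consistent with the loss being summed over the four gradient-accumulation micro-batches; this is an inference from the numbers rather than something the log states. The `rewards / margins` column is the difference between the chosen and rejected rewards, for example 0.001586 - 0.002899 = -0.001313 at step 3.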


### Inference: Generate Text

Switches the model to Unsloth's faster inference mode and generates an answer to a question about GPUs, using the same prompt template as in training.


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "You are an AI assistant. You will be given a task. You must generate a correct answer.", # instruction, same as in training
        "What are GPUs and why would I use them for machine learning tasks?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
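
`tokenizer.batch_decode(outputs)` returns the prompt followed by the completion, special tokens included. A small helper can isolate the generated answer by splitting on the template's response marker. This is a minimal sketch: the helper name `extract_response` and the decoded string are made up for illustration.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text, special_tokens=()):
    """Return the text after the response marker, without the listed special tokens."""
    _, marker, completion = decoded_text.partition(RESPONSE_MARKER)
    if not marker:  # marker missing: fall back to the full text
        completion = decoded_text
    for token in special_tokens:
        completion = completion.replace(token, "")
    return completion.strip()


# Hypothetical decoded output, shortened for readability.
decoded = (
    "<|begin_of_text|>Below is an instruction that describes a task...\n\n"
    "### Response:\n"
    "GPUs are processors built for parallel computation.<|eot_id|>"
)
print(extract_response(decoded, special_tokens=("<|eot_id|>",)))
```

Passing `skip_special_tokens=True` to `batch_decode` is an alternative way to drop the special tokens before splitting.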

### Inference with Streaming

Enables faster inference and streams the generated text output in real-time as it's being produced.


In [None]:
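# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "You are an AI assistant. You will be given a task. You must generate a correct answer.", # instruction, same as in training
        "What are GPUs and why would I use them for machine learning tasks?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")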

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

### Save the Model Locally

Saves the LoRA adapter weights and the tokenizer to a local directory named `lora_model`. The base model weights are not included in this directory.


In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
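
A quick way to confirm that the adapters were written is to look for the adapter files in the output directory. The sketch below assumes the default PEFT file names `adapter_config.json` and `adapter_model.safetensors`; these may differ with other versions or save options, and the helper name is hypothetical.

```python
from pathlib import Path

# Assumed default PEFT output file names.
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(directory):
    """List the expected adapter files that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in EXPECTED_ADAPTER_FILES if not (directory / name).is_file()]


missing = missing_adapter_files("lora_model")
print("all adapter files present" if not missing else f"missing: {missing}")
```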

### Save the Model for vLLM

Saves the model for serving with vLLM, either merged with the base weights at 16-bit or 4-bit precision, or as LoRA adapters only. Each call is guarded by `if False`; change the guard to `if True` for the format you want.


In [None]:
# Saving to float16 for VLLM
# We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4.
# We also allow lora adapters as a fallback.

# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)

### Save the Model as GGUF (Ollama, llama.cpp)

Saves the model in GGUF format for use with llama.cpp or Ollama. The default quantization is `q8_0`; `f16` and `q4_k_m` are shown as alternatives. As in the previous section, each call is guarded by `if False`.


In [None]:
# GGUF / Ollama / llama.cpp Conversion
# To save to GGUF / llama.cpp, we support it natively now! We clone llama.cpp and we default save it to q8_0.
# We allow all methods like q4_k_m.

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

## Conclusion

This notebook fine-tunes a Llama 3.2 model on preference data using Unsloth and DPO. Loading the model in 4-bit and training only LoRA adapters keeps memory use low enough to run on a single Tesla T4, and the saving options cover LoRA adapters, merged weights for vLLM, and GGUF files for llama.cpp and Ollama.
