# Fine-Tuning LLaMA with Unsloth and Direct Preference Optimization

## Introduction

This notebook walks through fine-tuning a LLaMA-based language model with the Unsloth library. It sets up the environment and dependencies, then loads a pre-trained model with optional 4-bit quantization for memory efficiency. The process includes applying Parameter-Efficient Fine-Tuning (PEFT) with LoRA, preparing a preference-based dataset, and configuring the Direct Preference Optimization (DPO) trainer. The notebook also demonstrates how to run inference, stream generated text in real time, and save the fine-tuned model in formats suited to different deployment scenarios.


## Methodology

### Setup and Installation

This block installs the required Python packages. It removes any existing installations of `torch`, `torchvision`, and `torchaudio`, reinstalls them together with `xformers` from the CUDA 12.1 wheel index, installs Unsloth in editable mode from a local `unsloth-main` checkout, and upgrades the `transformers` library. The commands that would install Unsloth from PyPI or from the latest nightly GitHub source are left commented out.


In [1]:
# %%capture
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
# !pip install unsloth
# # Also get the latest nightly Unsloth!
# !pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
# !pip install --upgrade --no-cache-dir transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
torchaudio 2.5.1+cu121 (/opt/conda/lib/python3.12/site-packages)
    torch 2.5.1+cu121 (/opt/conda/lib/python3.12/site-packages)
        nvidia-cuda-nvrtc-cu12 12.1.105 (/opt/conda/lib/python3.12/site-packages)
        nvidia-cuda-runtime-cu12 12.1.105 (/opt/conda/lib/python3.12/site-packages)
        nvidia-cuda-cupti-cu12 12.1.105 (/opt/conda/lib/python3.12/site-packages)
        nvidia-cudnn-cu12 9.1.0.70 (/opt/conda/lib/python3.12/site-packages)
            nvidia-cublas-cu12 12.1.3.1 (/opt/conda/lib/python3.12/site-packages)
        nvidia-cufft-cu12 11.0.2.54 (/opt/conda/lib/python3.12/site-packages)
        nvidia-curand-cu12 10.3.2.106 (/opt/conda/lib/python3.12/site-packages)
        nvidia-cusolver-cu12 11.4.5.107 (/opt/conda/lib/python3.12/site-packages)
            nvidia-nvjitlink-cu12 12.9.86 (/opt/conda/lib/python3.12/site-packages)
            nvidia-cusparse-cu12 12.1.0.106 (/opt/conda/lib/python3

In [2]:
# !unzip ./unsloth-main.zip

In [3]:
%cd ./unsloth-main

/home/jovyan/llm_project/unsloth-main


In [4]:
!pip install -e .

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///home/jovyan/llm_project/unsloth-main
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building editable for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2025.8.10-0.editable-py3-none-any.whl size=20689 sha256=09212197acd2e8be0b093c8e6b46200fc844bfd31c0fa455055387f72a0d37cc
  Stored in directory: /tmp/pip-ephem-wheel-cache-bkn71i64/wheels/66/16/5a/d477950254e8de7e54b1fd1c92cb2563956a556964f0bc53a8
Successfully built unsloth
Installing collected packages: unsloth
  Attempting uninstall: unsloth
    Found existing installation: unsloth 2025.8.10
    Uninstalling unsloth-2025.8.10:
      Successfully uninstalled unslot

In [5]:
%cd ~/llm_project

/home/jovyan/llm_project


In [6]:
# !pip install unsloth
# Also get the latest nightly Unsloth!
# !pip uninstall unsloth -y && 
# !pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install --upgrade --no-cache-dir transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
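
Because the cells above remove and reinstall several packages, it can help to confirm what actually ended up in the environment before continuing. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical list of packages this notebook relies on.
report = installed_versions(["torch", "xformers", "transformers", "trl", "unsloth"])
for name, version in report.items():
    print(f"{name:<14}{version or 'not installed'}")
```

A `None` entry signals that a reinstall step did not complete, which is cheaper to discover here than at import time in a later cell.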


### Load the Language Model

Imports `FastLanguageModel` from `unsloth` and initializes the model and tokenizer with specified parameters, including sequence length, data type, and optional 4-bit quantization for memory efficiency.


In [7]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-09-04 16:16:50.306078: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1757002610.761908 1311793 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1757002610.881526 1311793 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1757002612.145112 1311793 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1757002612.145174 1311793 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1757002612.145182 1311793 computation_placer.cc:177] computation placer alr

🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.10: Fast Llama patching. Transformers: 4.56.0.
   \\   /|    Tesla V100S-PCIE-32GB. Num GPUs = 1. Max memory: 31.733 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
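
To get a feel for why 4-bit loading matters, the sketch below estimates the memory taken by the base weights alone at two precisions. It uses the parameter counts that the training banner reports later in this notebook. The estimate is deliberately rough: it ignores quantization metadata (scales and similar), activations, optimizer state, and the KV cache, and the helper name `weight_memory_gib` is hypothetical.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate storage for n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / GIB


total_params = 3_237_063_680   # total reported by the training banner (includes LoRA)
lora_params = 24_313_856       # trainable LoRA parameters from the same banner
base_params = total_params - lora_params

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gib(base_params, bits):.2f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the 16-bit footprint, which leaves far more of the GPU's memory for activations and adapter gradients.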


### Apply PEFT (Parameter-Efficient Fine-Tuning)

Configures the model for fine-tuning using LoRA (Low-Rank Adaptation) by specifying parameters like rank, target modules, dropout, and gradient checkpointing to optimize memory usage and training efficiency.


In [8]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.8.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
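
The trainable-parameter count follows directly from the LoRA configuration: each adapted linear layer of shape `d_in x d_out` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below reproduces that arithmetic. The layer dimensions are assumptions taken from the published Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128); they are not stated in this notebook.

```python
# Assumed Llama-3.2-3B shapes (not stated in the notebook).
HIDDEN = 3072
INTERMEDIATE = 8192
KV_DIM = 8 * 128          # 8 key/value heads x head dimension 128
NUM_LAYERS = 28           # matches the "patched 28 layers" message
RANK = 16                 # r from get_peft_model

# (input features, output features) for each targeted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the A (rank x d_in) and B (d_out x rank) matrices."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,} trainable LoRA parameters")
```

With these shapes the result is 24,313,856, the same figure the training banner reports as trainable, or about 0.75% of the 3,237,063,680 total.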


### Prepare the Dataset

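Formats the dataset with an Alpaca-style prompt template. The cell loads preference pairs from a local JSONL file (`datasets/uf_full.jsonl`); alternative sources, including the `ogbrandt/gpt4_preference_rlaif` dataset, are left commented out. It then applies the `format_prompt` function, which splits each `chosen` and `rejected` transcript on the `Assistant:` marker and produces a templated `prompt` plus the `chosen` and `rejected` responses, each terminated with the EOS token.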


In [9]:
# The data must be formatted with appropriate prompt template first.
# See details here: https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

# https://huggingface.co/datasets/ogbrandt/gpt4_preference_rlaif
def format_prompt(sample):
    instruction = "You are an AI assistant. You will be given a task. You must generate a correct answer."

    # hh
    # a = sample["chosen"].split('\n\n')
    # b = sample["rejected"].split('\n\n')
    # # print(a)
    # input       = a[1]
    # accepted    = a[2]
    # rejected    = b[2]

    # uf
    a = sample["chosen"].split('Assistant:')
    b = sample["rejected"].split('Assistant:')
    # print(a)
    # print(b)
    if len(a) < 2:
        a.append(' ')
    if len(b) < 2:
        b.append(' ')
    input       = a[0]
    accepted    = 'Assistant: ' + a[1]
    rejected    = 'Assistant: ' + b[1]

    # uf
    # print(sample)
    # input = sample["prompt"]
    # accepted = sample["chosen"][0]['content']
    # rejected = sample["rejected"][0]['content']

    # print(input, accepted, rejected)

    sample["prompt"]   = alpaca_prompt.format(instruction, input, "")
    sample["chosen"]   = accepted + EOS_TOKEN
    sample["rejected"] = rejected + EOS_TOKEN
    return sample
pass

from datasets import load_dataset
# dataset = load_dataset('json', data_files={'train':'/kaggle/input/hh-random-sample/hh_random_sample.jsonl'})["train"]
dataset = load_dataset('json', data_files={'train':'datasets/uf_full.jsonl'})["train"]
# dataset = load_dataset('json', data_files={'train':'/kaggle/input/uf-random-dataset/uf_random_sample.jsonl'})["train"]
# dataset = load_dataset('json', data_files={'train':'/kaggle/input/uf-external-reward-npz/uf_ex_reward_margin_sample_N.jsonl'})["train"]
# dataset = load_dataset('json', data_files={'train':'/kaggle/input/uf-external-reward-npz/uf_ex_reward_margin_sample_P.jsonl'})["train"]
# dataset = load_dataset("ogbrandt/gpt4_preference_rlaif")["train"]
dataset = dataset.map(format_prompt,)

# dataset = dataset.remove_columns(['messages'])
# dataset

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

In [10]:
import pprint
row = dataset[1]
print(row)
print('INSTRUCTION: ' + '=' * 50)
pprint.pprint(row["prompt"])
print('ACCEPTED: ' + '=' * 50)
pprint.pprint(row["chosen"])
print('REJECTED: ' + '=' * 50)
pprint.pprint(row["rejected"])

{'chosen': 'Assistant:  Translation: The mongoose had become accustomed to putting its life in danger.\n\nExplanation: In this sentence, the subject is "خرمگس" (mongoose), and it talks about its habit of taking risks with its life. The translation accurately captures the meaning of the Persian sentence and is engaging, sparking curiosity about the mongoose\'s adventures and why it endangers itself.<|eot_id|>', 'rejected': "Assistant:  Sure, I'm here to help! Can you please provide the Persian sentence you would like me to translate into English?<|eot_id|>", 'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are an AI assistant. You will be given a task. You must generate a correct answer.\n\n### Input:\nHuman: In this task, you are given a sentence in Persian, and your task is to translate it into English.\nExample: جناب گرانت خیلی دلش می\u200cخ
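
The splitting step can be tried in isolation, without a tokenizer or dataset. The sketch below is a simplified variant of `format_prompt`, not a copy of it: it uses `str.partition` so that everything after the first `Assistant:` marker is kept, strips the extra whitespace that produces the double space visible in the output above, and omits the Alpaca template. The function names, the sample record, and the default EOS string are illustrative assumptions.

```python
def split_at_assistant(text: str, marker: str = "Assistant:"):
    """Split a transcript at the first marker into (prompt, response)."""
    prompt, found, reply = text.partition(marker)
    reply = reply.strip() if found else ""
    # rstrip() leaves a bare marker when the transcript has no response.
    return prompt.strip(), f"{marker} {reply}".rstrip()


def to_preference_row(sample: dict, eos_token: str = "<|eot_id|>") -> dict:
    """Build the prompt/chosen/rejected triple that DPO training expects."""
    prompt, chosen = split_at_assistant(sample["chosen"])
    _, rejected = split_at_assistant(sample["rejected"])
    return {
        "prompt": prompt,
        "chosen": chosen + eos_token,
        "rejected": rejected + eos_token,
    }


# Hypothetical record in the same "Human: ... Assistant: ..." layout.
example = {
    "chosen": "Human: What is 2 + 2?\n\nAssistant: 4",
    "rejected": "Human: What is 2 + 2?\n\nAssistant: 5",
}
row = to_preference_row(example)
print(row)
```

Note that both this sketch and the original cell take the prompt from the `chosen` transcript only, so they rely on `chosen` and `rejected` sharing the same human turn.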

### Configure the DPO Trainer

Sets up the Direct Preference Optimization (DPO) trainer with training arguments such as batch size, learning rate, mixed precision settings, and other hyperparameters. It also integrates reward modeling statistics.


In [11]:
# Enable reward modelling stats
from unsloth import PatchDPOTrainer
PatchDPOTrainer()
from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 1,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    beta = 0.1,
    train_dataset = dataset,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)

Extracting prompt in train dataset (num_proc=20):   0%|          | 0/20000 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=20):   0%|          | 0/20000 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=20):   0%|          | 0/20000 [00:00<?, ? examples/s]
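
The scheduling arguments above interact in a simple way: the step count comes from the dataset size and effective batch size, `warmup_ratio` fixes how many of those steps ramp the learning rate up, and the `linear` scheduler then decays it to zero. The sketch below reproduces that shape in plain Python. It is an approximation of the behaviour, not the library's implementation, and the exact rounding of the warmup step count is an assumption.

```python
import math


def linear_schedule_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from the configuration and dataset size in this notebook.
num_examples = 20_000
per_device_batch = 8
grad_accum = 1
epochs = 1
peak_lr = 5e-6
warmup_ratio = 0.1

total_steps = math.ceil(num_examples / (per_device_batch * grad_accum)) * epochs
warmup_steps = round(warmup_ratio * total_steps)  # rounding rule is an assumption

for step in (0, 125, 250, 1375, 2500):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps, peak_lr))
```

With 20,000 examples and a batch size of 8 this gives 2,500 optimizer steps, of which the first 250 are warmup.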

### Start Training

Initiates the training process using the configured DPO trainer.


In [12]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20,000 | Num Epochs = 1 | Total steps = 2,500
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / chosen | logps / rejected | logits / chosen | logits / rejected | eval_logits / chosen | eval_logits / rejected | nll_loss | aux_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -392.445068 | -303.1203 | 0.203624 | 0.099876 | 0 | 0 | 0 | 0 |
| 2 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -400.370453 | -347.559448 | 0.216964 | -0.132337 | No Log | No Log | No Log | No Log |
| 3 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -506.807007 | -455.206543 | 0.192731 | 0.155567 | No Log | No Log | No Log | No Log |
| 4 | 0.6932 | -0.004626 | -0.004524 | 0.625 | -0.000102 | -387.293945 | -357.080872 | 0.421187 | 0.436843 | No Log | No Log | No Log | No Log |
| 5 | 0.6962 | -0.001806 | 0.004361 | 0.125 | -0.006166 | -255.607712 | -240.814713 | 0.422347 | 0.169075 | No Log | No Log | No Log | No Log |
| 6 | 0.6945 | -0.003566 | -0.000936 | 0.375 | -0.002631 | -273.684357 | -234.213531 | -0.186036 | 0.304375 | No Log | No Log | No Log | No Log |
| 7 | 0.691 | 0.000397 | -0.003901 | 0.625 | 0.004297 | -496.726746 | -368.868866 | -0.011616 | 0.287739 | No Log | No Log | No Log | No Log |
| 8 | 0.6982 | -0.005067 | 0.005068 | 0.375 | -0.010135 | -505.467285 | -660.956421 | 0.229836 | 0.039131 | No Log | No Log | No Log | No Log |
| 9 | 0.6894 | 0.004127 | -0.003434 | 0.625 | 0.007561 | -458.187256 | -465.330017 | -0.128248 | -0.15889 | No Log | No Log | No Log | No Log |
| 10 | 0.6918 | 0.001396 | -0.001359 | 0.5 | 0.002755 | -263.208923 | -274.533356 | 0.370073 | 0.498774 | No Log | No Log | No Log | No Log |


Unsloth: Will smartly offload gradients to save VRAM!




TrainOutput(global_step=2500, training_loss=0.5793678359925747, metrics={'train_runtime': 16438.841, 'train_samples_per_second': 1.217, 'train_steps_per_second': 0.152, 'total_flos': 0.0, 'train_loss': 0.5793678359925747, 'epoch': 1.0})
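
The logged loss of 0.6931 at the first step is what the DPO objective gives when the policy and the reference model still agree. The sketch below computes the per-example DPO loss, `-log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))`, in plain Python. It assumes the standard sigmoid form of the loss, and that at step 1 the policy log-probabilities equal the reference ones (plausible when `ref_model = None` and the LoRA adapters have not yet moved the model, but not verified here). The function name is hypothetical.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward, margin) for one preference pair."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward, margin


# Step-1 log-probabilities from the training log, used for both policy and reference.
loss, chosen_reward, rejected_reward, margin = dpo_loss(
    policy_chosen=-392.445068,
    policy_rejected=-303.1203,
    ref_chosen=-392.445068,
    ref_rejected=-303.1203,
)
print(f"loss={loss:.4f} rewards=({chosen_reward}, {rejected_reward}) margin={margin}")
```

With identical policy and reference values both rewards are zero and the loss is `log(2)`, about 0.6931. As training raises the chosen reward relative to the rejected one, the margin grows and the loss falls below that starting value, consistent with the final average of roughly 0.579.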

### Inference: Generate Text

Prepares the model for inference with Unsloth's optimized settings and prompts it to continue a Fibonacci sequence. In the run recorded here, the cell ended with an `AttributeError` instead of returning generated text.


In [13]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

AttributeError: 'NoneType' object has no attribute 'shape'

### Inference with Streaming

Enables faster inference and streams the generated text in real time as it is produced.


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
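
The streaming pattern itself is small: the generation loop hands each new piece of text to a streamer object as soon as it exists, and the streamer prints it immediately instead of waiting for the full sequence. The sketch below imitates that flow with a toy class and a fake generation loop. `PrintStreamer` and `fake_generate` are hypothetical stand-ins written for illustration; they work on ready-made strings, whereas the real `TextStreamer` receives token IDs and decodes them.

```python
import sys


class PrintStreamer:
    """Toy analogue of a text streamer: emit each piece as soon as it arrives."""

    def __init__(self, out=sys.stdout):
        self.out = out
        self.pieces = []

    def put(self, piece: str) -> None:
        """Called once per generated piece."""
        self.pieces.append(piece)
        self.out.write(piece)
        self.out.flush()

    def end(self) -> None:
        """Called once when generation finishes."""
        self.out.write("\n")
        self.out.flush()


def fake_generate(pieces, streamer):
    """Stand-in for a generation loop that produces one piece at a time."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()
    return "".join(pieces)


streamer = PrintStreamer()
text = fake_generate(["13", ",", " 21", ",", " 34"], streamer)
```

The caller still gets the complete text at the end, so streaming changes when output becomes visible rather than what is generated.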

### Save the Model Locally

Saves the fine-tuned LoRA adapters and the tokenizer to a local directory named `uf_random_lora_model`.


In [None]:
model.save_pretrained("uf_random_lora_model") # Local saving
tokenizer.save_pretrained("uf_random_lora_model")

### Save the Model for vLLM

Provides options to save the model in several formats and precisions: merged 16-bit weights, merged 4-bit weights, or the LoRA adapters alone. Each option is guarded by an `if True` or `if False` switch, so only the selected format is written; in this run only the 16-bit merge is enabled.


In [None]:
from peft import PeftModel
from transformers import AutoModelForCausalLM

model_to_merge = PeftModel.from_pretrained(model, 'hh_full_lora_model')

merged_model = model_to_merge.merge_and_unload()
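
Merging folds each adapter into its base weight so that no separate LoRA path is needed at inference time: the merged weight is `W + (alpha / r) * B @ A`. The sketch below checks that identity on tiny hand-written matrices in pure Python. The matrices and helper names are made up for illustration, and the scale mirrors this notebook's setting of `lora_alpha` equal to `r`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: 3 inputs, 2 outputs, rank 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]      # base weight (out x in)
A = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]      # LoRA A (rank x in)
B = [[0.5, 0.0], [1.0, -0.5]]                # LoRA B (out x rank)
alpha, rank = 2, 2                           # alpha == rank, so the scale is 1.0

x = [1.0, 2.0, 3.0]
merged = merge_lora(W, A, B, alpha, rank)

y_merged = matvec(merged, x)
# Unmerged path: base output plus the scaled adapter output.
adapter_out = matvec(B, matvec(A, x))
y_adapter = [base + (alpha / rank) * extra for base, extra in zip(matvec(W, x), adapter_out)]

print(merged, y_merged, y_adapter)
```

Both paths give the same output, which is why a merged checkpoint can be served by engines that know nothing about LoRA.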

In [None]:
model = merged_model

In [15]:
# Saving to float16 for vLLM
# We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. 
# We also allow lora adapters as a fallback.

# Merge to 16bit
if True: model.save_pretrained_merged("model_uf_full", tokenizer, save_method = "merged_16bit",)

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)

Found HuggingFace hub cache directory: /home/jovyan/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Merging weights into 16bit:   0% 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

{"timestamp":"2025-09-04T20:56:26.139132Z","level":"WARN","fields":{"message":"Reqwest(reqwest::Error { kind: Request, source: hyper_util::client::legacy::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: \"Connection reset by peer\" } }) }). Retrying..."},"filename":"/home/runner/work/xet-core/xet-core/cas_client/src/http_client.rs","line_number":242}
{"timestamp":"2025-09-04T20:56:26.139287Z","level":"WARN","fields":{"message":"Retry attempt #0. Sleeping 1.426085202s before the next attempt"},"filename":"/root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/reqwest-retry-0.7.0/src/middleware.rs","line_number":171}
{"timestamp":"2025-09-04T20:56:26.140861Z","level":"WARN","fields":{"message":"Reqwest(reqwest::Error { kind: Request, source: hyper_util::client::legacy::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: \"Connection reset by peer\" } }) }). Retrying..."},"filename":"/home/runner/work/

Unsloth: Merging weights into 16bit:  50% 1/2 [07:26<07:26, 446.50s/it]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

{"timestamp":"2025-09-04T21:06:08.092559Z","level":"WARN","fields":{"message":"Reqwest(reqwest::Error { kind: Request, source: hyper_util::client::legacy::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: \"Connection reset by peer\" } }) }). Retrying..."},"filename":"/home/runner/work/xet-core/xet-core/cas_client/src/http_client.rs","line_number":242}
{"timestamp":"2025-09-04T21:06:08.092638Z","level":"WARN","fields":{"message":"Retry attempt #0. Sleeping 1.78416986s before the next attempt"},"filename":"/root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/reqwest-retry-0.7.0/src/middleware.rs","line_number":171}
{"timestamp":"2025-09-04T21:06:08.094489Z","level":"WARN","fields":{"message":"Reqwest(reqwest::Error { kind: Request, source: hyper_util::client::legacy::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: \"Connection reset by peer\" } }) }). Retrying..."},"filename":"/home/runner/work/x

Unsloth: Merging weights into 16bit: 100% 2/2 [11:19<00:00, 339.97s/it]


In [None]:
!du -h /kaggle/working/outputs

In [None]:
!du -h /kaggle/working/lora_model

In [None]:
!du -h /kaggle/working/model

In [None]:
!pip install -U "huggingface_hub[cli]"


In [None]:
!hf auth login --token hf_asdfgh

In [None]:
!hf upload AliEdalat/dpo_uf_random_2 ~/llm_project/model_uf_random

### Save the Model as GGUF (Ollama, llama.cpp)

Enables saving the model in formats compatible with GGUF, Ollama, or llama.cpp, supporting various quantization methods like `q8_0`, `f16`, and `q4_k_m`.


In [None]:
# GGUF / Ollama / llama.cpp Conversion
# To save to GGUF / llama.cpp, we support it natively now! We clone llama.cpp and we default save it to q8_0. 
# We allow all methods like q4_k_m.

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
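
To make the `q8_0` option less opaque, the sketch below shows the general idea of block-wise 8-bit quantization: weights are grouped into small blocks, each block stores one scale, and every weight becomes a signed 8-bit integer. This is a simplified illustration and not the llama.cpp implementation. The block size of 32 and the 16-bit scale are assumptions about the `q8_0` layout, and the scale is kept as a Python float here for simplicity.

```python
import math

BLOCK_SIZE = 32  # assumed number of weights per q8_0 block


def quantize_block(values):
    """Quantize one block to (scale, list of ints in [-127, 127])."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * len(values)
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


# Synthetic weights standing in for one block of a real tensor.
block = [math.sin(0.37 * i) for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)

max_error = max(abs(a - b) for a, b in zip(block, restored))
# 32 int8 values plus one 16-bit scale per block (assumed layout).
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE

print(f"scale={scale:.6f} max_error={max_error:.6f} bits/weight={bits_per_weight}")
```

The round-trip error is bounded by half a quantization step per weight. Lower-bit methods such as `q4_k_m` trade a larger error for a smaller file.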

## Conclusion

This notebook walks through fine-tuning a LLaMA language model with Unsloth and DPO: setting up the environment, loading a 4-bit quantized model, attaching LoRA adapters, preparing preference data, training, and exporting the result. LoRA and 4-bit loading keep memory use low (about 0.75% of the parameters were trained in this run), and the merged 16-bit, merged 4-bit, adapter-only, and GGUF export options cover deployment targets such as vLLM, Ollama, and llama.cpp.
