<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/fine_tuning_llama_3_2_3b_dpo_peft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLaMA with Unsloth and Direct Preference Optimization

## Introduction

This notebook provides a comprehensive guide to fine-tuning a LLaMA-based language model using the Unsloth library. It begins by setting up the necessary environment and dependencies, followed by loading a pre-trained model with optional 4-bit quantization for memory efficiency. The process includes applying Parameter-Efficient Fine-Tuning (PEFT) using LoRA, preparing a preference-based dataset, and configuring the Direct Preference Optimization (DPO) trainer for training. Additionally, the notebook demonstrates how to perform inference, stream generated text in real-time, and save the fine-tuned model in various formats suitable for different deployment scenarios.


## Methodology

### Setup and Installation

This block installs the required Python packages. It removes any existing installations of `torch`, `torchvision`, and `torchaudio`, then reinstalls them together with `xformers` from the CUDA 12.1 wheel index. It then installs `unsloth`, replaces it with the latest development version from GitHub, and upgrades the `transformers` library.


In [1]:
%%capture
# Installs pip3-autoremove to manage package dependencies
!pip install pip3-autoremove
# Removes any existing torch, torchvision, and torchaudio installs together with their unused dependencies
!pip-autoremove torch torchvision torchaudio -y
# Installs the CUDA 12.1 builds of torch, torchvision, torchaudio, and xformers (no versions are pinned)
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
# Installs the unsloth package
!pip install unsloth
# Uninstalls the current version of unsloth and installs the latest nightly version from GitHub
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
# Upgrades the transformers package to the latest version
!pip install --upgrade --no-cache-dir transformers
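
Because the install cell runs under `%%capture`, it prints nothing, so it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages checked are illustrative choices, not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Not installed in this environment.
    return versions


# Package list is an assumption: the libraries this notebook imports or installs.
report = installed_versions(["torch", "xformers", "unsloth", "transformers", "trl"])
for name, version in report.items():
    print(f"{name:<14}{version or 'not installed'}")
```

Running this right after the install cell gives a quick record of the environment to compare against the Unsloth banner printed when the model loads.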

### Load the Language Model

Imports `FastLanguageModel` from `unsloth` and initializes the model and tokenizer with specified parameters, including sequence length, data type, and optional 4-bit quantization for memory efficiency.


In [2]:
from unsloth import FastLanguageModel
import torch

# Define the maximum sequence length for the model.
# This sets the maximum number of tokens the model can process in a single input.
# RoPE Scaling is automatically supported internally, allowing longer sequences.
max_seq_length = 2048

# Specify the data type for model computations.
# None means automatic detection based on the device.
# Use Float16 for older GPUs like Tesla T4 or V100, and Bfloat16 for newer GPUs (Ampere+).
dtype = None

# Enable 4-bit quantization to significantly reduce memory usage.
# This is useful for running large models on devices with limited VRAM.
# Set to False if higher precision is needed or memory is not a concern.
load_in_4bit = True

# Load the model and tokenizer using the FastLanguageModel class from Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # Specify the model to use. Alternative: "unsloth/Llama-3.2-1B-Instruct".
    max_seq_length=max_seq_length,  # Pass the maximum sequence length defined earlier.
    dtype=dtype,  # Use the dtype setting for automatic or custom precision.
    load_in_4bit=load_in_4bit,  # Enable or disable 4-bit quantization based on the earlier setting.
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.7: Fast Llama patching. Transformers: 4.48.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
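
To see why 4-bit loading matters on a GPU with about 15 GB of memory, a rough back-of-the-envelope estimate of weight storage helps. The sketch below assumes a parameter count of about 3.2 billion for the 3B model (an approximation, not a figure from this notebook) and counts only raw weight storage; it ignores quantization constants, layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Raw storage needed for n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


# Assumption: approximate parameter count for a "3B" model.
N_PARAMS = 3.2e9

for label, bits in [("16-bit (float16/bfloat16)", 16), ("4-bit", 4)]:
    print(f"{label:<28}{weight_memory_gib(N_PARAMS, bits):6.2f} GiB")
```

The `model.safetensors` download shown in the log above (2.24 GB) is somewhat larger than the raw 4-bit estimate, which would be expected if some tensors are stored at higher precision, though the log itself does not say which.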

### Apply PEFT (Parameter-Efficient Fine-Tuning)

Configures the model for fine-tuning using LoRA (Low-Rank Adaptation) by specifying parameters like rank, target modules, dropout, and gradient checkpointing to optimize memory usage and training efficiency.


In [3]:
# Configure and create a PEFT (Parameter-Efficient Fine-Tuning) model using the FastLanguageModel class.
# This method fine-tunes specific parameters of the model without retraining the entire model, making it more memory- and compute-efficient.
model = FastLanguageModel.get_peft_model(
    model,  # The base model to apply PEFT.

    # Rank of the Low-Rank Adaptation (LoRA) matrices.
    # Higher values allow the model to learn more complex patterns but increase memory usage.
    # Suggested values: 8, 16, 32, 64, 128.
    r=16,

    # Target modules to apply LoRA. These are specific layers where LoRA will adjust parameters.
    target_modules=[
        "q_proj", "k_proj", "v_proj",  # Query, Key, Value projection layers.
        "o_proj",                      # Output projection layer.
        "gate_proj",                   # Gate projection layer.
        "up_proj", "down_proj",        # Up and Down projection layers (for Feedforward Networks).
    ],

    # Alpha parameter controls the scaling of LoRA updates. Higher values emphasize LoRA changes.
    lora_alpha=16,

    # Dropout probability for LoRA. Dropout helps in regularizing the training, but for optimization, 0 is often preferred.
    lora_dropout=0,

    # Specifies whether to include biases in LoRA training.
    # "none" disables bias training for better optimization and lower memory usage.
    bias="none",

    # Enables gradient checkpointing with Unsloth's implementation, which Unsloth reports cuts VRAM usage by roughly 30%.
    # This leaves room for larger batch sizes and longer context lengths.
    # Accepts `True` or `"unsloth"`.
    use_gradient_checkpointing="unsloth",

    # Random seed for reproducibility during training and fine-tuning.
    random_state=3407,

    # Enables Rank-Stabilized LoRA (rsLoRA), which scales the adapter by lora_alpha / sqrt(r) instead of lora_alpha / r.
    # Set to `True` to keep the update magnitude stable at higher ranks.
    use_rslora=False,

    # Configuration for LoftQ (LoRA-Fine-Tuning-aware Quantization), an alternative initialization for quantized LoRA.
    # `None` disables LoftQ.
    loftq_config=None,
)

Unsloth 2025.1.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
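
The LoRA settings above determine how many parameters are actually trained. Each adapted linear layer gains two small matrices, `A` (r × in_features) and `B` (out_features × r), so it adds `r * (in_features + out_features)` trainable parameters. The sketch below works this out for the seven target modules. The layer shapes are assumptions taken from the published Llama 3.2 3B configuration (hidden size 3072, key/value projection width 1024, MLP width 8192), and the helper names are illustrative.

```python
import math

# Assumed dimensions for Llama 3.2 3B (not printed anywhere in this notebook).
HIDDEN = 3072        # model hidden size
KV_DIM = 1024        # 8 key/value heads x head dimension 128
INTERMEDIATE = 8192  # feed-forward width
NUM_LAYERS = 28      # matches "patched 28 layers" in the log above

# (in_features, out_features) for each module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is r x in, B is out x r."""
    return r * (in_features + out_features)


def total_lora_params(r):
    """Trainable LoRA parameters summed over all modules and layers."""
    per_layer = sum(lora_params(i, o, r) for i, o in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Factor applied to the low-rank update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


print(f"{total_lora_params(16):,}")            # 24,313,856
print(lora_scaling(16, 16))                    # 1.0
print(lora_scaling(16, 16, use_rslora=True))   # 4.0
```

Under these assumed shapes, `r=16` gives 24,313,856 trainable parameters, which agrees with the count the trainer prints when training starts. Doubling `r` doubles that number, which is the memory trade-off mentioned in the rank comment.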


### Prepare the Dataset

Formats the dataset using a prompt template suitable for training. It loads the `Intel/orca_dpo_pairs` dataset, keeps a shuffled subset of 1,000 samples, and applies the `format_prompt` function to build each sample's prompt from its system message and question, with the chosen and rejected responses kept as separate fields.


In [4]:
# Define the prompt template for formatting data.
# This template structures the input, instruction, and response in a specific format
# for training or evaluating the model.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Retrieve the end-of-sequence token from the tokenizer.
# This token marks the end of a generated response, helping the model determine where to stop.
EOS_TOKEN = tokenizer.eos_token

# Define a function to format a single data sample using the prompt template.
# This function structures the data into the `alpaca_prompt` format.
def format_prompt(sample):
    # Extract the instruction and input text from the sample.
    # `sample["system"]` contains the instruction, and `sample["question"]` contains the input/context.
    instruction = sample["system"]
    input_text = sample["question"]

    # Format the instruction and input using the `alpaca_prompt` template.
    # The response slot is left empty because DPO reads the responses from the `chosen` and `rejected` fields.
    sample["prompt"] = alpaca_prompt.format(instruction, input_text, "")

    # Append the end-of-sequence token (EOS) to both the accepted and rejected responses.
    # This helps the model recognize the end of these sequences during training.
    sample["chosen"] = sample["chosen"] + EOS_TOKEN
    sample["rejected"] = sample["rejected"] + EOS_TOKEN

    return sample  # Return the formatted sample.

# Load and preprocess the dataset.
from datasets import load_dataset
# Load the "Intel/orca_dpo_pairs" dataset from Hugging Face Datasets library.
# This dataset contains pairs of accepted and rejected responses for preference modeling.
dataset = load_dataset("Intel/orca_dpo_pairs")["train"]

# Shuffle the dataset, select the first 1,000 samples, and apply formatting.
# Shuffling with a fixed seed keeps the subset reproducible, and limiting the size shortens training for experiments.
dataset = dataset.shuffle(seed=42).select(range(1000))

# Apply the `format_prompt` function to each sample in the dataset.
# This ensures all samples are structured according to the `alpaca_prompt` template.
dataset = dataset.map(format_prompt)

README.md:   0%|          | 0.00/196 [00:00<?, ?B/s]

orca_rlhf.jsonl:   0%|          | 0.00/36.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12859 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
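
To make the formatting step concrete without downloading the dataset, the sketch below applies the same transformation to a single hand-written record. The record contents and the `"<EOS>"` placeholder are made up for illustration; in the notebook the token comes from `tokenizer.eos_token`. Unlike the cell above, this version returns a new dictionary instead of modifying its argument.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<EOS>"  # Placeholder; the notebook uses tokenizer.eos_token.


def format_prompt(sample, eos_token=EOS_TOKEN):
    """Return a copy of `sample` with the three fields a DPO trainer reads."""
    formatted = dict(sample)
    # The response slot stays empty: the answers live in chosen/rejected.
    formatted["prompt"] = ALPACA_PROMPT.format(sample["system"], sample["question"], "")
    formatted["chosen"] = sample["chosen"] + eos_token
    formatted["rejected"] = sample["rejected"] + eos_token
    return formatted


# Hypothetical record with the same keys as the dataset rows used above.
toy_sample = {
    "system": "You are a helpful assistant.",
    "question": "What is the capital of France?",
    "chosen": "Paris.",
    "rejected": "I am not sure.",
}

formatted = format_prompt(toy_sample)
print(formatted["prompt"])
print(repr(formatted["chosen"]), repr(formatted["rejected"]))
```

Each training example therefore consists of one shared prompt and two candidate completions, both terminated by the end-of-sequence token.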

In [5]:
# Import the `pprint` module for pretty-printing complex objects.
# This is useful for displaying structured data in an easy-to-read format.
import pprint

# Retrieve the second sample (index 1) from the processed dataset.
# `dataset[1]` provides a single data sample as a dictionary with keys like "prompt", "chosen", and "rejected".
row = dataset[1]

# Print the instruction and input in the sample.
# The '=' * 50 creates a visual separator for better readability in the output.
print('INSTRUCTION: ' + '=' * 50)
pprint.pprint(row["prompt"])  # Pretty-print the "prompt" field, which contains the instruction and input.

# Print the accepted response in the sample.
print('ACCEPTED: ' + '=' * 50)
pprint.pprint(row["chosen"])  # Pretty-print the "chosen" field, which contains the preferred response.

# Print the rejected response in the sample.
print('REJECTED: ' + '=' * 50)
pprint.pprint(row["rejected"])  # Pretty-print the "rejected" field, which contains the less-preferred response.

('Below is an instruction that describes a task, paired with an input that '
 'provides further context. Write a response that appropriately completes the '
 'request.\n'
 '\n'
 '### Instruction:\n'
 'You are an AI assistant. User will you give you a task. Your goal is to '
 'complete the task as faithfully as you can. While performing the task think '
 'step-by-step and justify your steps.\n'
 '\n'
 '### Input:\n'
 'Extract the answer to the following question from the movie plot. If the '
 'question isn\'t answerable, please output "Can\'t answer". Question: What '
 "does Johanna cut from Katniss's arm? Title: The Hunger Games: Catching Fire "
 'Movie plot: After winning the 74th Hunger Games, Katniss Everdeen (Jennifer '
 'Lawrence) and Peeta Mellark (Josh Hutcherson) return home to District 12. '
 'President Snow visits Katniss at her home. The two make an agreement to not '
 'lie to one another, and Snow explains that her actions in the Games have '
 'inspired rebellions across th

### Configure the DPO Trainer

Sets up the Direct Preference Optimization (DPO) trainer with training arguments such as batch size, learning rate, mixed precision settings, and other hyperparameters. It also integrates reward modeling statistics.


In [6]:
# Import and enable reward modeling statistics.
# `PatchDPOTrainer()` patches the DPOTrainer to enable enhanced logging or optimizations for reward modeling.
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

# Import training utilities and configurations.
from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig  # DPOTrainer runs Direct Preference Optimization (DPO); DPOConfig holds its training arguments.
from unsloth import is_bfloat16_supported  # Utility to check if BFloat16 is supported on the current hardware.

# Initialize the DPOTrainer for training with reward modeling.
dpo_trainer = DPOTrainer(
    model=model,  # The model to fine-tune using DPO.
    ref_model=None,  # No separate reference model. With a PEFT model, TRL obtains reference log-probabilities by disabling the adapters.

    # Training configuration.
    args=DPOConfig(
        per_device_train_batch_size=2,  # Number of training samples per device (GPU/CPU) in a batch.
        gradient_accumulation_steps=4,  # Accumulates gradients over this many steps to simulate a larger batch size.
        warmup_ratio=0.1,  # Proportion of total training steps for learning rate warmup.
        num_train_epochs=1,  # Number of epochs (full dataset passes) to train the model.
        learning_rate=5e-6,  # Initial learning rate for the optimizer.

        # Mixed precision training.
        # Use FP16 (16-bit floating-point) if BFloat16 is not supported by the hardware.
        fp16=not is_bfloat16_supported(),
        # Use BFloat16 (optimized for Ampere GPUs) if the hardware supports it.
        bf16=is_bfloat16_supported(),

        logging_steps=1,  # Log metrics and stats every step for detailed tracking.
        optim="adamw_8bit",  # Use the 8-bit AdamW optimizer for memory efficiency.
        weight_decay=0.0,  # No weight decay (L2 regularization) is applied to model parameters.
        lr_scheduler_type="linear",  # Use a linear learning rate scheduler.
        seed=42,  # Random seed for reproducibility.
        output_dir="outputs",  # Directory to save model checkpoints and logs.
        report_to="none",  # Specify where to log metrics (e.g., WandB, TensorBoard). "none" disables external reporting.
    ),

    # Additional settings for DPO training.
    beta=0.5,  # Controls how strongly the policy is kept close to the reference model; higher values allow less deviation.
    train_dataset=dataset,  # The dataset to use for training.
    tokenizer=tokenizer,  # Tokenizer for processing input data.
    max_length=1024,  # Maximum token length for the prompt and response combined.
    max_prompt_length=512,  # Maximum token length for the prompt portion alone.
)

Extracting prompt from train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]
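
The batch settings in the configuration fix the effective batch size and the number of optimizer steps. The sketch below reproduces that arithmetic for this run (1,000 examples, batch size 2, 4 accumulation steps, 1 epoch, 1 GPU). The function name is illustrative, and rounding the warmup step count up is an assumption about how the trainer handles a fractional result.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs=1, num_devices=1, warmup_ratio=0.0):
    """Derive effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Number of micro-batches the dataloader yields per epoch.
    micro_batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer step happens every grad_accum_steps micro-batches.
    steps_per_epoch = max(micro_batches // grad_accum_steps, 1)
    total_steps = steps_per_epoch * num_epochs
    # Assumption: a fractional warmup step count is rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(1000, 2, 4, num_epochs=1, warmup_ratio=0.1)
print(schedule)
```

For these settings the effective batch size is 8 and the run takes 125 optimizer steps, matching the "Total batch size = 8 | Total steps = 125" line that the trainer prints when training starts.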

### Start Training

Initiates the training process using the configured DPO trainer.


In [7]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 125
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,2.7726,0.0,0.0,0.0,0.0,-210.69902,-224.944183,0.135454,0.314617
2,2.7726,0.0,0.0,0.0,0.0,-205.225037,-131.81427,0.552444,0.277569
3,2.7726,0.0,0.0,0.0,0.0,-134.38205,-267.941406,0.439246,0.110675
4,2.7726,0.0,0.0,0.0,0.0,-143.215698,-136.040436,0.279658,0.167424
5,2.7726,0.0,0.0,0.0,0.0,-230.381287,-200.881973,0.294192,-0.032083
6,2.7726,0.0,0.0,0.0,0.0,-202.901718,-149.189484,0.10047,0.394176
7,2.7552,0.003702,-0.005496,0.5,0.009199,-175.318359,-256.788208,-0.127532,-0.256275
8,2.8077,-0.010481,0.006698,0.5,-0.017178,-180.142944,-165.347626,0.205541,0.28907
9,2.7828,0.000171,0.004952,0.5,-0.004781,-214.309875,-158.584961,0.347475,0.223142
10,2.8229,-0.003302,0.020466,0.375,-0.023768,-206.729752,-282.199127,0.486418,0.554625


TrainOutput(global_step=125, training_loss=1.1928165204524994, metrics={'train_runtime': 1974.0563, 'train_samples_per_second': 0.507, 'train_steps_per_second': 0.063, 'total_flos': 0.0, 'train_loss': 1.1928165204524994, 'epoch': 1.0})
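
The columns in the training log follow from the DPO objective. As a rough sketch, assuming the standard sigmoid form of the loss, the implicit reward for a response is `beta` times the difference between the policy's and the reference model's log-probability, the margin is the chosen reward minus the rejected reward, and the loss is `-log(sigmoid(margin))`. The function below is an illustration of that formula on scalar log-probabilities, not TRL's implementation.

```python
import math


def dpo_sigmoid_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Return (loss, reward_chosen, reward_rejected, margin) for one pair."""
    reward_chosen = beta * (policy_chosen - ref_chosen)
    reward_rejected = beta * (policy_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, reward_chosen, reward_rejected, margin


# At the start of training the policy equals the reference model, so both
# rewards are zero whatever the log-probabilities are (values from step 1).
start_loss, r_chosen, r_rejected, margin = dpo_sigmoid_loss(
    policy_chosen=-224.944183, policy_rejected=-210.69902,
    ref_chosen=-224.944183, ref_rejected=-210.69902, beta=0.5,
)
print(round(start_loss, 4))      # 0.6931, i.e. ln 2
print(round(4 * start_loss, 4))  # 2.7726

# Once the policy prefers the chosen answer, the margin turns positive
# and the loss drops below ln 2.
later_loss, *_ = dpo_sigmoid_loss(-220.0, -215.0, -224.944183, -210.69902, beta=0.5)
print(later_loss < start_loss)   # True
```

This explains the zero rewards in the first rows of the table. The logged loss of 2.7726 in those rows equals 4 × ln 2, which would be consistent with the per-batch loss being summed over the 4 gradient accumulation steps; that reading is inferred from the numbers and is not stated in the output.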

### Inference: Generate Text

Prepares the model for inference with Unsloth's optimized settings and generates a response to a question about GPUs and machine learning, using the same Alpaca-style prompt as in training.


In [8]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "You are an AI assistant. You will be given a task. You must generate a detailed and long answer", # instruction
        "What are GPUs and why would I use them for machine learning tasks?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 150, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are an AI assistant. You will be given a task. You must generate a detailed and long answer\n\n### Input:\nWhat are GPUs and why would I use them for machine learning tasks?\n\n### Response:\n**GPU (Graphics Processing Unit) Overview for Machine Learning**\n\nIn recent years, the world of machine learning (ML) has experienced exponential growth, with applications ranging from image recognition and natural language processing to predictive analytics and more. At the heart of this growth are Graphics Processing Units (GPUs), specialized computer hardware designed to']
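
`tokenizer.batch_decode(outputs)` returns the prompt and the completion together, as the output above shows. If only the generated answer is needed, one simple option is to split the decoded string on the `### Response:` marker from the template. The helper below is a small sketch of that idea; its name and the sample string are illustrative, and the `"<|eot_id|>"` end token in the example is a stand-in rather than a value read from the tokenizer.

```python
RESPONSE_MARKER = "### Response:\n"


def extract_response(decoded, eos_token=None):
    """Return the text after the response marker, without a trailing EOS token."""
    _, marker, tail = decoded.partition(RESPONSE_MARKER)
    # If the marker is missing, fall back to the whole string.
    response = tail if marker else decoded
    if eos_token and response.endswith(eos_token):
        response = response[: -len(eos_token)]
    return response.strip()


# Shortened stand-in for one element of tokenizer.batch_decode(outputs).
decoded = (
    "<|begin_of_text|>Below is an instruction...\n\n"
    "### Instruction:\nYou are an AI assistant.\n\n"
    "### Input:\nWhat are GPUs?\n\n"
    "### Response:\nGPUs are parallel processors.<|eot_id|>"
)
print(extract_response(decoded, eos_token="<|eot_id|>"))
```

Splitting on a text marker is fragile if the input itself contains the marker; slicing the generated token IDs after the prompt length before decoding is a more robust alternative.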

### Inference with Streaming

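Streams the generated text in real time as it is produced. Note that this cell overwrites `alpaca_prompt` with a shorter, single-placeholder prompt, so the input no longer follows the template used during training.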


In [12]:
alpaca_prompt = "You are an AI assistant. You will be given a task. You must generate a detailed and long answer: {}"  # Overwrites the earlier template; {} is the placeholder for the question

# Adjusting the inputs to match the placeholder in alpaca_prompt
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "What are GPUs and why would I use them for machine learning tasks?"
        )
    ], return_tensors="pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=150)


<|begin_of_text|>You are an AI assistant. You will be given a task. You must generate a detailed and long answer: What are GPUs and why would I use them for machine learning tasks? 

**GPU stands for Graphics Processing Unit.**

A Graphics Processing Unit is a specialized electronic circuit designed to quickly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. These processing units were originally designed for 2D and 3D graphics, but have since been repurposed for other tasks such as scientific simulations, data compression, and, most relevantly, machine learning.

**The evolution of GPUs**

The first GPU was introduced in 1999, and was used primarily for 2D graphics rendering. Over the years, GPUs have undergone significant transformations, becoming more powerful and specialized for various applications. The introduction of CUDA in 2007 marked a significant shift, as it enabled developers


### Save the Model Locally

Saves the trained LoRA adapters and the tokenizer to a local directory named `lora_model`. Only the adapter weights are written here, not the merged base model.


In [13]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
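
The tuple printed above lists only the tokenizer files, because it is the return value of `tokenizer.save_pretrained`. To confirm that the adapter itself was written, the directory can be checked for the adapter files. The sketch below assumes the usual PEFT file names (`adapter_config.json` and `adapter_model.safetensors`); treat those names, and the helper itself, as assumptions to verify against your installed version.

```python
from pathlib import Path

# Assumption: file names PEFT typically writes for a LoRA adapter.
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(directory, expected=EXPECTED_ADAPTER_FILES):
    """Return the expected adapter files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_adapter_files("lora_model")
if missing:
    print("Missing adapter files:", ", ".join(missing))
else:
    print("Adapter files found in lora_model/")
```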

In [15]:
import gc
import torch

# Flush memory to free up resources after training.

# Safely delete instances if they exist
if 'dpo_trainer' in globals():
    del dpo_trainer

if 'model' in globals():
    del model

if 'ref_model' in globals():
    del ref_model

# Perform garbage collection to clean up any remaining unused objects in memory.
gc.collect()

# Clear the GPU memory cache to free up VRAM for other tasks or subsequent operations.
torch.cuda.empty_cache()

In [16]:
# Import FastLanguageModel from the unsloth library for loading and inference optimization
from unsloth import FastLanguageModel

# Load the pre-trained model and tokenizer with the specified settings
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # Path to the LoRA adapter directory saved in the previous step
    max_seq_length=max_seq_length,  # Set the maximum sequence length for input processing (limits token length)
    dtype=dtype,  # Specify the data type (e.g., float32 or float16) for model weights
    load_in_4bit=load_in_4bit,  # Load the model with 4-bit quantization to reduce memory usage
)

# Enable native optimizations for faster inference, improving the performance of the model
FastLanguageModel.for_inference(model)  # Apply native inference optimizations for speed

# Define a list of messages representing a conversation with the model
messages = [
    {"role": "user", "content": "What are GPUs and why would I use them for machine learning tasks"},  # User message asking about GPUs for machine learning
]

# Prepare the input by applying the chat template to the message list to format it for the model
inputs = tokenizer.apply_chat_template(
    messages,  # Pass the list of user messages to the tokenizer
    tokenize=True,  # Tokenize the message content so it can be processed by the model
    add_generation_prompt=True,  # Append the assistant turn header so the model starts a new reply
    return_tensors="pt",  # Return the tokenized data as PyTorch tensors, which is the format expected by the model
).to("cuda")  # Move the tokenized data to the GPU (CUDA) for faster processing

# Import the TextStreamer class from Hugging Face's transformers library to handle the generation output stream
from transformers import TextStreamer

# Initialize a TextStreamer object to stream the generated text output in real-time as it is produced
text_streamer = TextStreamer(tokenizer, skip_prompt=True)  # Set skip_prompt=True to exclude the prompt from output

# Generate output from the model based on the tokenized input data and stream the result
_ = model.generate(
    input_ids=inputs,  # Pass the tokenized input data to the model for text generation
    streamer=text_streamer,  # Use the text streamer to output the generated text as it is produced
    max_new_tokens=128,  # Limit the output to a maximum of 128 new tokens
    use_cache=True,  # Use the model cache to speed up inference by avoiding redundant computations
    temperature=1.5,  # Set the temperature to control the randomness of the output (higher means more randomness)
    min_p=0.1,  # Min-p sampling: drop tokens whose probability is below 10% of the most likely token's probability
)

==((====))==  Unsloth 2025.1.7: Fast Llama patching. Transformers: 4.48.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


**What are GPUs?**

Graphics Processing Units (GPUs) are specialized computer chips designed to handle graphics rendering and other high-performance tasks. Originally developed for gaming and 3D graphics, modern GPUs have evolved to excel in compute-intensive tasks, making them an ideal choice for many machine learning (ML) and deep learning (DL) applications.

GPUs are designed with the following characteristics:

1. **Massive parallelism**: Many cores and threads allow for simultaneous execution of multiple calculations.
2. **High-speed memory**: Specialized memory systems enable fast data transfer and storage.
3. **Vectorization**: Processing data using vectors of the
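
The `temperature` and `min_p` arguments used in this cell shape the distribution that tokens are sampled from. The sketch below illustrates the idea on a tiny made-up set of logits: temperature rescales the logits before the softmax, and min-p then discards every token whose probability is below `min_p` times the probability of the most likely token. It is a simplified illustration, not the `transformers` implementation, and applying temperature before the min-p filter is an assumption of this sketch.

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    scaled = [value / temperature for value in logits]
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Keep only tokens at least min_p times as likely as the top token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


toy_logits = [2.0, 1.0, -5.0]  # Hypothetical scores for three tokens.
sharp = min_p_filter(toy_logits, temperature=1.0, min_p=0.1)
flat = min_p_filter(toy_logits, temperature=1.5, min_p=0.1)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A higher temperature flattens the distribution, so the second-ranked token gets a larger share, while the min-p cut still removes the very unlikely third token in both cases. This pairing is why a fairly high temperature such as 1.5 can be used without the output degrading into noise.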


### Push the Model to the Hugging Face Hub in GGUF Format

In [None]:
# Push the trained model to the Hugging Face Model Hub using the GGUF format
model.push_to_hub_gguf(
    "SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning",  # Repository path on the Hugging Face Hub. Replace "SURESHBEEKHANI" with your own username.
    tokenizer,  # Pass the tokenizer associated with the model to ensure compatibility on the hub
    quantization_method=["q4_k_m"],  # Specify the quantization methods to apply for optimized model storage (e.g., q4_k_m, q8_0, q5_k_m)
    token="",  # Provide the Hugging Face token for authentication. Obtain a token at https://huggingface.co/settings/tokens
)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.36 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 21.53it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning/pytorch_model-00001-of-00002.bin...
Unsloth: Saving SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning/pytorch_model-00002-of-00002.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning into f16 GGUF format.
The output location will be /content/SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama_3_2_3B-dpo-rlhf-fine-tuning
INFO:gguf.gguf_writer:gguf: This GGUF file is for 

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/SURESHBEEKHANI/llama_3_2_3B-dpo-rlhf-fine-tuning
