# Finetuning Qwen 2.5 3B Parameter Model with GRPO


In [None]:
%%capture

# First install unsloth and vllm
!pip install unsloth vllm


In [None]:
import sys

# Drop cached PIL, torch, and unsloth modules so later imports reload them
modules = list(sys.modules.keys())
for module_name in modules:
    if "PIL" in module_name or "torch" in module_name or "unsloth" in module_name:
        sys.modules.pop(module_name)

In [1]:
!pip uninstall -y torch transformers xformers unsloth
# Note: torch 2.6.0 pairs with torchvision 0.21.0; pip may report a conflict with the 0.16.0 pin.
!pip install torch==2.6.0 torchvision==0.16.0 torchaudio==2.6.0
!pip install xformers==0.0.29.post3
!pip install unsloth

Found existing installation: torch 2.5.1+cu121
Uninstalling torch-2.5.1+cu121:
  Successfully uninstalled torch-2.5.1+cu121
Found existing installation: transformers 4.49.0
Uninstalling transformers-4.49.0:
  Successfully uninstalled transformers-4.49.0
Found existing installation: xformers 0.0.28.post3
Uninstalling xformers-0.0.28.post3:
  Successfully uninstalled xformers-0.0.28.post3
Found existing installation: unsloth 2025.3.9
Uninstalling unsloth-2025.3.9:
  Successfully uninstalled unsloth-2025.3.9
Collecting torch==2.6.0
  Downloading torch-2.6.0-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision==0.16.0
  Downloading torchvision-0.16.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting torchaudio==2.6.0
  Downloading torchaudio-2.6.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB

## Build a dataset from QASA (https://github.com/lgresearch/QASA/tree/main), an existing dataset made from scientific articles

In [None]:
import json

def transform_json(input_json, max_items=750):
    questions = []
    answers = []
    
    # Limit processing to max_items
    for key, value in list(input_json.items())[:max_items]:
        question = value.get("question", "")
        evidential_info_list = value.get("evidential_info", [])
        
        if evidential_info_list and isinstance(evidential_info_list, list):
            # Get the first item from evidential info list
            evidential_info = evidential_info_list[0]
        else:
            evidential_info = {}
        
        context = evidential_info.get("context", "")
        rationale = evidential_info.get("rationale", "")
        composition = value.get("composition", "")
        
        # Form the answer with context and composition
        answer = f"{context}\n\n#### {composition}"
        
        # Add to lists if question is not empty
        if question:
            questions.append(question)
            answers.append(answer)
    
    return {"questions": questions, "answers": answers}

# Example usage
# testset_answerable_1554_v1.1.json is in the repo
try:
    with open("testset_answerable_1554_v1.1.json", "r", encoding="utf-8") as file:
        input_data = json.load(file)

    output_data = transform_json(input_data, max_items=750)

    with open("data.json", "w", encoding="utf-8") as file:
        json.dump(output_data, file, indent=4, ensure_ascii=False)

    print(f"Transformed JSON with {len(output_data['questions'])} items saved to data.json")
except FileNotFoundError as e:
    print(f"Error: {e}. Please ensure the input JSON file exists.")
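To make the target format concrete, here is a minimal sketch of what one record looks like after the transformation. The sample record is hypothetical: its field names follow the ones `transform_json` reads, but the values are invented for illustration, and `build_answer` is a helper introduced only for this sketch. The answer string is the first context, a blank line, then `####` followed by the composition, so the short final answer can be recovered by splitting on `####`.

```python
# Hypothetical QASA-style record; the values are made up for illustration.
sample = {
    "0": {
        "question": "What optimizer was used?",
        "evidential_info": [
            {"context": "We train with Adam.", "rationale": "States the optimizer."}
        ],
        "composition": "Adam was used.",
    }
}

def build_answer(record: dict) -> str:
    """Mirror the answer format used above: first context, then '#### composition'."""
    info_list = record.get("evidential_info") or [{}]
    context = info_list[0].get("context", "")
    return f"{context}\n\n#### {record.get('composition', '')}"

answer = build_answer(sample["0"])

# The short final answer is whatever follows the '####' marker.
final_part = answer.split("####")[1].strip()

print(repr(answer))
print(final_part)
```

Note that the training data keeps the full string (context plus `####` part) as the reference answer unless it is split this way before use.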


In [2]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Load up `Qwen 2.5 3B Instruct`, and set parameters

In [3]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

INFO 03-09 17:19:45 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.53%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 4.88 GB. Also swap space = 5 GB.
INFO 03-09 17:20:00 config.py:549] This model supports multiple tasks: {'score', 'classify', 'reward', 'embed', 'gene

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 03-09 17:20:03 cuda.py:178] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-09 17:20:03 cuda.py:226] Using XFormers backend.
INFO 03-09 17:20:24 model_runner.py:1110] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 03-09 17:20:24 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 03-09 17:20:25 weight_utils.py:254] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

INFO 03-09 17:20:31 weight_utils.py:270] Time spent downloading weights for unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit: 5.924817 seconds


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 03-09 17:20:33 model_runner.py:1115] Loading model weights took 2.2160 GB
INFO 03-09 17:20:33 logger.py:57] Using PunicaWrapperGPU.
INFO 03-09 17:20:41 worker.py:267] Memory profiling takes 7.33 seconds
INFO 03-09 17:20:41 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.30GiB
INFO 03-09 17:20:41 worker.py:267] model weights take 2.22GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 4.01GiB.
INFO 03-09 17:20:42 executor_base.py:111] # cuda blocks: 7300, # CPU blocks: 9102
INFO 03-09 17:20:42 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 114.06x
INFO 03-09 17:20:50 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs duri

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:46<00:00,  1.72s/it]

INFO 03-09 17:21:36 model_runner.py:1562] Graph capturing finished in 46 secs, took 0.62 GiB
INFO 03-09 17:21:36 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 62.73 seconds





Unsloth 2025.3.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
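As a sanity check on the LoRA setup, the number of trainable parameters can be worked out by hand. The sketch below assumes the Qwen2.5-3B layer dimensions (hidden size 2048, MLP size 11008, 2 key/value heads of dimension 128); these come from the public model configuration, not from this notebook, so treat them as assumptions. Only the layer count (36), the rank (64), and the target modules are taken from the cells above. The result agrees with the `Trainable parameters` figure that the trainer banner reports later in this notebook.

```python
# Assumed Qwen2.5-3B dimensions (from the public model config, not from this notebook).
hidden = 2048
intermediate = 11008
kv_dim = 2 * 128   # 2 key/value heads x head_dim 128
num_layers = 36    # matches "patched 36 layers" in the output above
r = 64             # lora_rank used above

# (in_features, out_features) of each LoRA target module in one decoder layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * num_layers

print(f"{per_layer:,} per layer, {total:,} total")
```

Because the count scales linearly with `r`, halving the rank to 32 would halve the trainable parameters.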


# Data Prep
<a name="Data"></a>

In [None]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

import json
from datasets import Dataset

def get_json_questions(file_path: str, split: str = "train") -> Dataset:
    # Load JSON file
    with open(file_path, "r", encoding="utf-8") as file:
        data_json = json.load(file)

    prompts = [
        [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': q}
        ] for q in data_json["questions"]
    ]

    # Create dataset with prompts
    data = Dataset.from_dict({'prompt': prompts})

    # Add answers to the dataset using indices
    data = data.map(lambda x, idx: {'answer': data_json["answers"][idx]}, with_indices=True)

    return data

# Example usage: point this at the file written by transform_json above
json_file_path = "data.json"
dataset = get_json_questions(json_file_path)



# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

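def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the exact tag-and-newline layout.

    Note: without re.DOTALL, `.` does not match newlines, so reasoning or
    answers that span several lines never match this pattern.
    """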
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

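def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that loosely match the format (flexible whitespace).

    Note: re.match anchors at the start of the string and, without re.DOTALL,
    `.` does not match newlines, so content spanning lines never matches.
    """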
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]



Map:   0%|          | 0/750 [00:00<?, ? examples/s]
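One detail of the two format rewards is worth checking in isolation. Both patterns use `.*?`, and in Python's `re` module `.` does not match a newline unless `re.DOTALL` is set. The sketch below uses a hypothetical, well-formed completion to show that neither pattern matches as written, while both match once `flags=re.DOTALL` is passed. This is a suggested variant, not what the training run in this notebook used.

```python
import re

# Hypothetical well-formed completion with two lines of reasoning.
completion = (
    "<reasoning>\nstep one\nstep two\n</reasoning>\n"
    "<answer>\n3\n</answer>\n"
)

strict = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
soft = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Without re.DOTALL, '.' stops at newlines, so the content cannot span lines.
strict_plain = re.match(strict, completion) is not None
soft_plain = re.match(soft, completion) is not None

# With re.DOTALL, '.' also matches newlines.
strict_dotall = re.match(strict, completion, flags=re.DOTALL) is not None
soft_dotall = re.match(soft, completion, flags=re.DOTALL) is not None

print(strict_plain, soft_plain, strict_dotall, soft_dotall)
# -> False False True True
```

If the format rewards stay at 0.0 throughout training, this is one place to look.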

<a name="Train"></a>
# Train the model

Now set up the GRPO trainer and its configuration.

In [5]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    #num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 200,
    save_steps = 200,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
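The `num_generations = 8` setting matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The following is a minimal sketch of that group-relative normalization, assuming the common mean/standard-deviation form; the reward values are invented, the function name is hypothetical, and TRL's actual implementation may differ in details such as the epsilon or the standard-deviation estimator.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Invented total rewards for 8 completions of a single prompt.
rewards = [0.5, 0.0, 0.0, 2.5, 0.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)

# If every completion gets the same reward, all advantages are zero,
# so that group contributes no learning signal.
flat = group_relative_advantages([0.125] * 8)

print([round(a, 3) for a in advantages])
print(flat)
```

This is why reward functions that give partial credit are useful early on: they create differences within a group even when no completion is fully correct.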


Now let's run the trainer. The training output includes a table of rewards like the one below; the goal is to see the `reward` column increase.

You may have to wait 150 to 200 steps before the rewards start to move, and you will probably get 0 reward for the first 100 steps, so be patient.

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [6]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 750 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 16 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (16 x 1 x 1) = 16
 "-____-"     Trainable parameters = 119,734,272/1,919,856,640 (6.24% trained)


-------------------- Question:
What are different types of categories in the FashionMNIST dataset? 
Answer:
We use the front look thumbnail images of 70,000 unique products to build Fashion-MNIST. Those products come from different gender groups: men, women, kids and neutral. In particular, white-color products are not included in the dataset as they have low contrast to the background. The thumbnails (51\times 73) are then fed into the following conversion pipeline, which is visualized in Figure 1.

#### categories are men , women , kids and neutral.

composition: False 
Response:
<reasoning>
The FashionMNIST dataset is a subset of the MNIST dataset, which originally contains handwritten digits images, but here it focuses on small images (28x2 Fuk) of common clothing items. Since the FashionMNIST dataset does not describe specific categories within clothing items, we need to look at common types of clothing items to categorize the images. A common categorization for clothing items inc

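| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|------|---------------|--------|------------|-------------------|----|--------------------------------|-----------------------------------|-------------------------------------|---------------------------|-----------------------------------|
| 1 | -0.0 | -0.047438 | 0.120913 | 188.375 | 0.0 | -0.047438 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | -0.053375 | 0.108601 | 185.9375 | 0.0 | -0.053375 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | -0.583562 | 0.359997 | 196.625 | 2.5e-05 | -0.583562 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | -0.527562 | 0.251829 | 192.875 | 2.7e-05 | -0.527562 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | -0.582625 | 0.252981 | 196.375 | 2e-05 | -0.582625 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.02 | 0.194444 | 200.0 | 5.1e-05 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | -0.364062 | 0.280114 | 192.375 | 2e-05 | -0.364062 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | -0.12225 | 0.179289 | 195.1875 | 2.2e-05 | -0.12225 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | -0.263625 | 0.251043 | 178.4375 | 3.6e-05 | -0.263625 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | -0.497 | 0.387336 | 199.3125 | 2.2e-05 | -0.497 | 0.0 | 0.0 | 0.0 | 0.0 |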


-------------------- Question:
How would the loss function of YoloV3 look after changing Mean squared errors with the logistic regression cross-entropy error terms? 
Answer:
Linear x,y predictions instead of logistic. We tried using a linear activation to directly predict the x,y offset instead of the logistic activation. This led to a couple point drop in mAP.

#### Binary cross-entropy is used for the class predictions. Logistic activation is used and is better than the linear activation. 
Response:
<reasoning>
To understand the change from Mean Squared Error (MSE) in YOLOv3 to Logistic Regression (logistic loss) in terms of the cross-entropy error, let's first recall the loss function for YOLOv sorrounding variables.

YOLOv3's loss function combines the smooth L1 loss (a differentiable alternative to MSE for regression tasks) and binary cross-entropy loss (for classification).

1. **Regression Loss**: This part helps in predicting bounding box coordinates and corresponding class pro

TrainOutput(global_step=200, training_loss=0.002431846932232733, metrics={'train_runtime': 6263.9549, 'train_samples_per_second': 0.511, 'train_steps_per_second': 0.032, 'total_flos': 0.0, 'train_loss': 0.002431846932232733})
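In the training log above, the total reward equals the `xmlcount_reward_func` column and is often negative. One plausible reading, sketched below with a copy of `count_xml` and two invented completions, is that the penalty term dominates when a completion opens `<answer>` but never closes it (for example, when it is cut off at `max_completion_length`). In that case `text.split("\n</answer>\n")[-1]` is the whole text, so the penalty grows with the full completion length.

```python
def count_xml(text: str) -> float:
    """Copy of the reward helper defined earlier in this notebook."""
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        # If the closing tag is missing, split() returns the whole text here.
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count

# Invented examples: one complete, one truncated inside the answer.
complete = "<reasoning>\nabc\n</reasoning>\n<answer>\n42\n</answer>\n"
truncated = "<reasoning>\nabc\n</reasoning>\n<answer>\n" + "x" * 600

score_complete = count_xml(complete)
score_truncated = count_xml(truncated)

print(score_complete, round(score_truncated, 3))
```

A complete, well-formed completion earns the full 0.5, while the truncated one ends up below zero even though three of its four tags are correct.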

<a name="Inference"></a>
### Inference
Now let's try the model we just trained. First, let's generate without the GRPO-trained LoRA:

In [9]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s, est. speed input: 52.64 toks/s, output: 22.76 toks/s]


'There are no letters \'r\' in the word "strawberry."'

Now let's test with the LoRA we just trained with GRPO. First, save the LoRA:

In [10]:
model.save_lora("grpo_saved_lora")

config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Now we load the LoRA and test:

In [12]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.24s/it, est. speed input: 5.94 toks/s, output: 31.64 toks/s]


"<reasoning>\nTo determine how many 'r's are in the word 'strawberry', I will go through the word character by character and count each occurrence of the letter 'r'.\n\n1. The first character is 's', which is not 'r'.\n2. The second character is 't', which is not 'r'.\n3. The third character is 'r', which is 'r'.\n4. The fourth character is 'a', which is not 'r'.\n5. The fifth character is 'w', which is not 'r'.\n6. The sixth character is 'r', which is 'r'.\n7. The seventh character is 'a', which is not 'r'.\n8. The eighth character is 'r', which is 'r'.\n9. The ninth character is 'b', which is not 'r'.\n10. The tenth character is 'b', which is not 'r'.\n\nI have found 'r' three times in the word 'strawberry'.\n...\n</reasoning>\n<answer>\nThere are 3 r's in the word 'strawberry'.\n</answer>"

Our reasoning model does much better here. It is not always correct, since we trained it for only 200 steps (about 1.7 hours); extending the sequence length and training for longer should improve it.
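For a sense of how much training this run actually covered, the reported numbers can be combined with a little arithmetic. The values below are copied from the `TrainOutput` line, the trainer banner, and `GRPOConfig`; the assumption that each step covers `total_batch_size / num_generations` distinct prompts is a reading of how the batch is formed, not something the logs state directly.

```python
train_runtime_s = 6263.9549  # 'train_runtime' from TrainOutput
steps = 200                  # global_step from TrainOutput
total_batch_size = 16        # "Total batch size" from the trainer banner
num_generations = 8          # from GRPOConfig
num_examples = 750           # "Num examples" from the trainer banner

hours = train_runtime_s / 3600
completions = steps * total_batch_size
prompts = completions // num_generations  # assumes 8 completions per prompt
samples_per_second = completions / train_runtime_s

print(f"{hours:.2f} h, {completions} completions, "
      f"{prompts} of {num_examples} prompts, {samples_per_second:.3f} samples/s")
```

Under that assumption the run saw roughly half of the 750 prompts once, which is consistent with the suggestion to train for longer.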

### GGUF / llama.cpp Conversion
Save the model as a 4-bit quantized `GGUF` file:

In [8]:
# Save to q4_k_m GGUF
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

Cloning into 'llama.cpp'...
Submodule 'kompute' (https://github.com/nomic-ai/kompute.git) registered for path 'ggml/src/ggml-kompute/kompute'
Cloning into '/kaggle/working/llama.cpp/ggml/src/ggml-kompute/kompute'...
Submodule path 'ggml/src/ggml-kompute/kompute': checked out '4565194ed7c32d1d2efa32ceab4d3c6cae006306'
make: Entering directory '/kaggle/working/llama.cpp'
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAV

Unsloth: You have 2 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 11.7 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 36/36 [00:01<00:00, 32.29it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting qwen2 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model into f16 GGUF format.
The output location will be /kaggle/working/model/unsloth.F16.gguf
This might take 3 minutes...
2025-03-09 19:12:34.948355: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-09 19:12:34.972879: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-09 19:12