# I tried to follow Unsloth's approach for fine-tuning in this Colab notebook

In [1]:
# Do this only in Colab notebooks! Otherwise use pip install unsloth
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo --quiet
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer --quiet
!pip install transformers==4.51.3 --quiet
!pip install --no-deps unsloth --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth-zoo 2025.6.1 requires msgspec, which is not installed.
unsloth-zoo 2025.6.1 requires

In [2]:
from huggingface_hub import login
from google.colab import userdata

# Fetch the token securely from Colab secrets
hf_token = userdata.get("HF_TOKEN")

if hf_token is None:
    raise ValueError("HF_TOKEN not found. Please add it via Colab secrets.")

login(token=hf_token)

# Load LLaMA 3.2–1B Instruct Model

In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.1: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

# Apply LoRA

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2025.6.1 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
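
As a rough check on how few parameters LoRA trains, the adapter size can be worked out by hand. The sketch below assumes the published Llama 3.2 1B dimensions (hidden size 2048, MLP size 8192, 16 decoder layers, 8 key/value heads of dimension 64); these values are not printed anywhere in this notebook, so treat them as assumptions. Each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` trainable weights.

```python
# Assumed Llama 3.2 1B dimensions (taken from the public model config, not from this notebook).
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_LAYERS = 16
KV_DIM = 8 * 64  # 8 key/value heads x head dimension 64

R = 16  # LoRA rank passed to get_peft_model above

# (input features, output features) of every target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds matrix A (rank x in) and matrix B (out x rank) next to the frozen weight.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# prints "704,512 per layer, 11,272,192 in total"
```

Under these assumptions the total agrees with the `Trainable parameters = 11,272,192` figure that Unsloth prints in the training banner later in this notebook.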


# Load 50k from WizardLM Dataset


In [5]:
from datasets import load_dataset

# Load and take only 50k samples
dataset = load_dataset("Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped", split="train[:50000]")
dataset[0]

README.md:   0%|          | 0.00/539 [00:00<?, ?B/s]

(…)-00000-of-00001-004cd1ba9dc05e6c.parquet:   0%|          | 0.00/162M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/143000 [00:00<?, ? examples/s]

{'idx': 'heR0vZB',
 'conversations': [{'from': 'human',
   'value': 'As an online platform teacher named Aimee, you possess impeccable credentials which include a Bachelor of Science degree in Industrial and Labor Relations from Cornell University, expertise in the English language, and intermediate proficiency in both Chinese and Spanish. Additionally, your professional experience as a STEAM teacher at UN Women in Singapore has honed your skills in teaching children from the ages of 6-11 and working with students from all levels of education. Your exceptional teaching abilities in spoken English and pronunciation paired with your personal strengths of being informed, patient, and engaging make you an ideal teacher for students seeking to improve their English language skills. Can you provide a short, concise, and unique English self-introduction in bullet point form that would attract students to enroll in your course?'},
  {'from': 'gpt',
   'value': "Sure, here are some bullet point

<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. We take the first 50K samples of the [Wizardlm_Evol_Instruct_v2_196K_backuped](https://huggingface.co/datasets/Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped) dataset, which is in ShareGPT style, and convert it from `("from", "value")` pairs to Hugging Face's standard multi-turn format of `("role", "content")` pairs. Llama-3 renders multi-turn conversations like this:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
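
To make the layout above concrete, here is a minimal sketch that builds the same string by hand. `render_llama3` is a hypothetical helper written for illustration, not part of Unsloth or `transformers`; the real `llama-3.1` template does more (for example, it adds the default system block with the knowledge-cutoff date that shows up in `dataset[3]["text"]` further down).

```python
def render_llama3(messages, add_generation_prompt=False):
    """Simplified, hypothetical renderer for the Llama-3 chat layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(messages)
print(rendered)
```

In practice `tokenizer.apply_chat_template` should be used, as in the cells below, so that the template always matches the tokenizer.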

We use Unsloth's `get_chat_template` function to get the correct chat template. It supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

# Convert to LLaMA Chat Format

In [6]:
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Apply LLaMA-style chat template
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

In [7]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

The `standardize_sharegpt` call above converts ShareGPT-style datasets into Hugging Face's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
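
The conversion itself is a simple key and role rename. The sketch below shows the idea in plain Python; `ROLE_MAP` and `sharegpt_to_messages` are illustrative names, and this is not how `standardize_sharegpt` is implemented internally.

```python
# Assumed mapping from ShareGPT speaker tags to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(conversation):
    """Convert a list of {"from", "value"} turns into {"role", "content"} turns."""
    messages = []
    for turn in conversation:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            # Fail loudly instead of silently assigning the wrong role.
            raise ValueError(f"Unknown ShareGPT speaker: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return messages


example = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = sharegpt_to_messages(example)
print(converted)
```

Raising on unknown speakers is a design choice for this sketch: it avoids quietly labelling an unexpected tag as `assistant`.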

In [8]:
dataset[3]["conversations"]

[{'content': "Can you create a revised version of the sentence that focuses on the importance of cultural fit in candidate evaluation?\r\n\r\nCertainly, I comprehend the obligations of this position and am prepared to proficiently analyze and scrutinize candidates' technical proficiency and communication abilities, provide constructive feedback on their responses, and offer insightful recommendations on their overall suitability for the role.",
  'role': 'user'},
 {'content': 'While I understand the responsibilities of this role and can effectively assess candidates based on their technical skills and communication abilities, I also recognize the importance of evaluating cultural fit and can provide valuable feedback and recommendations in that regard.',
  'role': 'assistant'}]

In [9]:
dataset[3]["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you create a revised version of the sentence that focuses on the importance of cultural fit in candidate evaluation?\r\n\r\nCertainly, I comprehend the obligations of this position and am prepared to proficiently analyze and scrutinize candidates' technical proficiency and communication abilities, provide constructive feedback on their responses, and offer insightful recommendations on their overall suitability for the role.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWhile I understand the responsibilities of this role and can effectively assess candidates based on their technical skills and communication abilities, I also recognize the importance of evaluating cultural fit and can provide valuable feedback and recommendations in that regard.<|eot_id|>"

# Without the standardize_sharegpt helper

In [None]:
# from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# # Convert to the generic ("role", "content") format
# def convert_roles(example):
#     conversation = example["conversations"]
#     standardized = []
#     for turn in conversation:
#         role = {"system": "system", "human": "user", "gpt": "assistant"}[turn["from"]]
#         standardized.append({"role": role, "content": turn["value"]})
#     return {"conversations": standardized}

# dataset = dataset.map(convert_roles)

# # Apply LLaMA-style chat template
# tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# def formatting_prompts_func(examples):
#     convos = examples["conversations"]
#     texts = [
#         tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
#         for convo in convos
#     ]
#     return {"text": texts}

# dataset = dataset.map(formatting_prompts_func, batched=True)

# Fine-Tune with TRL

In [24]:
from transformers import TrainingArguments, DataCollatorForSeq2Seq, EarlyStoppingCallback
from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only

args = TrainingArguments(
    output_dir="outputs",                         # Directory for saving
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,                # Effective batch size = 8
    max_steps=2000,                               # Stop after 2000 steps (ignores num_train_epochs)
    warmup_steps=100,                             # Warmup learning rate (5% of total steps)
    learning_rate=2e-4,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_dir="logs",
    logging_strategy="steps",
    logging_steps=50,                             # Log every 50 steps
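    # save_strategy="steps",                      # Already the default
    save_steps=500,                               # Save every 500 steps
    save_total_limit=1,                           # Keep only the latest checkpoint
    # eval_strategy="steps",                      # Disabled: no eval_dataset is passed to the trainer
    eval_steps=500,                               # Only takes effect once eval_strategy is enabled
    # load_best_model_at_end=True,                # Needs evaluation; would reload the best checkpoint by eval_loss
    metric_for_best_model="eval_loss",            # Only used together with load_best_model_at_end
    greater_is_better=False,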
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    report_to="wandb",                            # Or "none" if you're not using Weights & Biases
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=args,
    # callbacks=[
    #     EarlyStoppingCallback(early_stopping_patience=2)  # Stop if no improvement over 2 evals
    # ],
)

# Initial configuration, kept for reference
# trainer = SFTTrainer(
#     model=model,
#     tokenizer=tokenizer,
#     train_dataset=dataset,
#     dataset_text_field="text",
#     max_seq_length=max_seq_length,
#     data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
#     dataset_num_proc=2,
#     packing=False,
#     args=TrainingArguments(
#         per_device_train_batch_size=2,
#         gradient_accumulation_steps=4,
#         warmup_steps=5,
#         num_train_epochs=1,
#         # max_steps=100,
#         learning_rate=2e-4,
#         fp16=not is_bfloat16_supported(),
#         bf16=is_bfloat16_supported(),
#         logging_steps=100,
#         optim="adamw_8bit",
#         weight_decay=0.01,
#         lr_scheduler_type="linear",
#         seed=3407,
#         output_dir="outputs",
#         report_to="wandb",
#     ),
# )

---

## 🧠 Why These Settings?

| Setting                       | Purpose                                                                                                  |
| ----------------------------- | -------------------------------------------------------------------------------------------------------- |
| `max_steps=2000`              | Training stops at exactly 2000 optimizer steps                                                           |
| `eval_steps=500`              | Evaluate every 500 steps; only applies once `eval_strategy="steps"` and an eval dataset are set (both are off above) |
| `save_steps=500`              | Save a checkpoint every 500 steps, in line with the evaluation interval                                  |
| `early_stopping_patience=2`   | Stop early if eval loss doesn't improve for 2 evals (i.e., 1000 steps); commented out above              |
| `load_best_model_at_end=True` | Automatically reload the best checkpoint for inference; commented out above                              |
| `save_total_limit=1`          | Saves space on Colab by keeping only the most recent checkpoint                                          |
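
The learning-rate settings in the arguments above (`learning_rate=2e-4`, `warmup_steps=100`, `lr_scheduler_type="linear"`, `max_steps=2000`) describe a ramp up followed by a straight-line decay to zero. The sketch below reproduces that shape in plain Python; `linear_schedule_lr` is an illustrative helper and only approximates what the trainer's scheduler does.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=100, total_steps=2000):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 50, 100, 1050, 2000):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")
```

The peak of `2e-4` is reached at step 100, and the rate is back to half of that by step 1050.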

---

## 📝 Tips

* With `gradient_accumulation_steps=4` and batch size 2, your **effective batch size is 8**, which is reasonable for LLaMA 1B on a Colab T4.
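* If training is **too slow**, raise `eval_steps` and `save_steps` rather than lowering them: every evaluation and checkpoint adds overhead, so values such as 200 or 250 give more frequent feedback at the cost of speed.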
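
The numbers behind these settings are easy to check. The short calculation below uses only values that appear in this notebook (batch size 2, 4 accumulation steps, 1 GPU, 2000 steps, 100 warmup steps, 50,000 examples).

```python
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 2000
warmup_steps = 100
dataset_size = 50_000

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Total examples processed before max_steps stops the run.
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / dataset_size
warmup_fraction = warmup_steps / max_steps

print(f"effective batch size: {effective_batch}")         # 8
print(f"examples seen: {examples_seen}")                   # 16000
print(f"fraction of one epoch: {epoch_fraction:.2f}")      # 0.32
print(f"warmup fraction: {warmup_fraction:.0%}")           # 5%
```

So with `max_steps=2000` the run stops roughly a third of the way through a single pass over the 50K samples.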

---

We also use Unsloth's `train_on_responses_only` function to train only on the
assistant outputs and ignore the loss on the user's inputs.

In [19]:
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/50000 [00:00<?, ? examples/s]

We verify masking is actually done:

In [20]:
tokenizer.decode(trainer.train_dataset[3]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you create a revised version of the sentence that focuses on the importance of cultural fit in candidate evaluation?\r\n\r\nCertainly, I comprehend the obligations of this position and am prepared to proficiently analyze and scrutinize candidates' technical proficiency and communication abilities, provide constructive feedback on their responses, and offer insightful recommendations on their overall suitability for the role.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWhile I understand the responsibilities of this role and can effectively assess candidates based on their technical skills and communication abilities, I also recognize the importance of evaluating cultural fit and can provide valuable feedback and recommendations in that regard.<|eot_id|>"

In [21]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[3]["labels"]])

'                                                                                                        While I understand the responsibilities of this role and can effectively assess candidates based on their technical skills and communication abilities, I also recognize the importance of evaluating cultural fit and can provide valuable feedback and recommendations in that regard.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
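
To show what this masking amounts to, here is a minimal pure-Python sketch of the idea; it is not Unsloth's actual implementation. Every label starts as `-100` (the index that PyTorch's cross-entropy loss ignores by default), and only the tokens between a response marker and the next instruction marker are copied back. The token ids and the `mask_non_responses` helper are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def find_subsequence(tokens, pattern, start=0):
    """Return the first index >= start where pattern occurs in tokens, or -1."""
    n = len(pattern)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == pattern:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        # The response runs until the next instruction marker or the end of the sequence.
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


# Toy ids: [10, 11] stands in for the user header, [20, 21] for the assistant header.
input_ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 9, 10, 11, 4, 20, 21, 3]
labels = mask_non_responses(input_ids, [10, 11], [20, 21])
print(labels)
```

Only the two assistant spans (`7, 8, 9` and `3`) keep their labels; everything else, including the headers themselves, is ignored by the loss.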

In [22]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.457 GB of memory reserved.


In [25]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 50,000 | Num Epochs = 1 | Total steps = 2,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
| ---- | ------------- |
| 50   | 1.3991        |
| 100  | 1.1002        |
| 150  | 1.0966        |
| 200  | 1.063         |
| 250  | 1.0143        |
| 300  | 1.0316        |
| 350  | 1.0272        |
| 400  | 1.0245        |
| 450  | 0.9891        |
| 500  | 1.0259        |


### Show final memory and time stats

In [26]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

4982.7581 seconds used for training.
83.05 minutes used for training.
Peak reserved memory = 2.123 GB.
Peak reserved memory for training = 0.666 GB.
Peak reserved memory % of max memory = 14.402 %.
Peak reserved memory for training % of max memory = 4.518 %.


# Test Your Fine-Tuned Model

In [27]:
from transformers import TextStreamer
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Write a Python function to sort a list of dictionaries by a key."},
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(input_ids=inputs, streamer=streamer, max_new_tokens=256, temperature=0.7)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Here's a Python function to sort a list of dictionaries by a key:

```python
def sort_dict_list(dict_list, key):
    """
    Sorts a list of dictionaries by a key.

    Args:
        dict_list (list): List of dictionaries.
        key (str): Key to sort by.

    Returns:
        list: Sorted list of dictionaries.
    """
    sorted_list = sorted(dict_list, key=lambda x: x[key])
    return sorted_list

# Example usage
dict_list = [
    {"name": "John", "age": 30},
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 40},
    {"name": "Charlie", "age": 35}
]

sorted_dict_list = sort_dict_list(dict_list, "age")
for i, dict_ in enumerate(sorted_dict_list):
    print(f"{i+1}. {dict_['name']} - {dict_['age']} years old")
```

In this example, we define a function `sort_dict_list` that takes a list of dictionaries and a key as arguments. We then use the `sorted` function to sort the list of dictionaries based on the key. Finally, we return the


# Save LoRA Adapters (Post-Training)

In [28]:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

We now reload the LoRA adapters we just trained and run inference with them:

In [30]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Implement binary search in python code. Give me only the code"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

```
def binary_search(arr, low, high, key):
    if low > high:
        return -1

    mid = (high + low) // 2

    if arr[mid] == key:
        return arr[mid]
    elif arr[mid] > key:
        return binary_search(arr, low, mid-1, key)
    else:
        return binary_search(arr, mid+1, high, key)

arr = [11, 23, 44, 56, 77, 88, 99, 110, 123]

key = 77

result = binary_search(arr, 0, len(arr)-1, key)
if result!= -1:
    print(f"Element {key} exists at index {result}")
else:
    print(f"Element {key} not found in the array")
```

This code p

# Merge LoRA Adapters into Full Model (Float16)

In [32]:
# This step produces a merged full model that you can convert to GGUF.
model.save_pretrained_merged(
    "merged_model_fp16",
    tokenizer=tokenizer,
    save_method="merged_16bit",
)

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:16<00:00, 76.24s/it]


In [40]:
from huggingface_hub import HfApi, upload_folder

HF_USERNAME = HfApi().whoami(token=hf_token)["name"]
REPO_NAME = "llama3-1b-coding-fp16"
FULL_REPO = f"{HF_USERNAME}/{REPO_NAME}"

HfApi().create_repo(
    repo_id=FULL_REPO,
    token=hf_token,
    private=False,        # Set to True if you want it private
    repo_type="model",
    exist_ok=True,        # Do not fail if the repo already exists
)

upload_folder(
    repo_id=FULL_REPO,
    folder_path="merged_model_fp16",
    token=hf_token,
    repo_type="model"
)

print(f"✅ Merged FP16 model pushed to: https://huggingface.co/{FULL_REPO}")

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

✅ Merged FP16 model pushed to: https://huggingface.co/mushfiqurrobin/llama3-1b-coding-fp16


# Convert to GGUF Format

In [33]:
# Save as GGUF format (recommended = q4_k_m for compatibility and performance)
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer=tokenizer,
    quantization_method=["q4_k_m", "q8_0"],  # Save multiple formats if needed
)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.74 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 24.63it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model_gguf/pytorch_model.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model_gguf into f16 GGUF format.
The output location will be /content/model_gguf/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_gguf
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model.bin'
INFO:hf-to-gguf:toke

# Push GGUF to Hugging Face Hub

In [35]:
model.push_to_hub_gguf(
    repo_id="mushfiqurrobin/llama3-1b-coding-lora-gguf",
    tokenizer=tokenizer,
    quantization_method=["q4_k_m", "q8_0"],
    token=hf_token,
)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.61 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 43.70it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving mushfiqurrobin/llama3-1b-coding-lora-gguf/pytorch_model.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at mushfiqurrobin/llama3-1b-coding-lora-gguf into f16 GGUF format.
The output location will be /content/mushfiqurrobin/llama3-1b-coding-lora-gguf/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3-1b-coding-lora-gguf
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.we

unsloth.Q4_K_M.gguf:   0%|          | 0.00/808M [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/mushfiqurrobin/llama3-1b-coding-lora-gguf
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q8_0.gguf:   0%|          | 0.00/1.32G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/mushfiqurrobin/llama3-1b-coding-lora-gguf
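
A quick way to sanity-check the exported files is to turn their sizes into average bits per weight. The sketch below uses the approximate sizes shown in the progress bars of this notebook and assumes that Llama 3.2 1B has roughly 1.24 billion parameters; that count is an assumption, not something printed above, and the progress-bar units are treated as decimal bytes.

```python
# Approximate file sizes in bytes, read off the progress bars in this notebook.
FILE_SIZES = {
    "merged fp16 safetensors": 2.47e9,
    "Q8_0 GGUF": 1.32e9,
    "Q4_K_M GGUF": 808e6,
}

# Assumption: Llama 3.2 1B has roughly 1.24 billion parameters.
NUM_PARAMS = 1.24e9


def bits_per_weight(num_bytes, num_params=NUM_PARAMS):
    """Average storage cost per parameter, in bits."""
    return num_bytes * 8 / num_params


for name, size in FILE_SIZES.items():
    print(f"{name}: {bits_per_weight(size):.1f} bits per weight")
```

The fp16 file comes out close to 16 bits per weight, as expected. Both GGUF files land somewhat above their nominal 8 and 4 bits, which is plausible because the files also carry scaling factors and metadata, and likely not every tensor is stored at the lowest precision.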


✅ If you save with `save_method="merged_4bit"`:

```python
model.save_pretrained_merged(
    "merged_model_4bit",
    tokenizer=tokenizer,
    save_method="merged_4bit",
)
```

You **cannot** convert that output to GGUF later using Unsloth or llama.cpp.

---

## 🚫 Why?

* `merged_4bit` uses **bitsandbytes 4-bit quantization**, which:

  * Is designed for Hugging Face inference using `transformers`
  * **Is not compatible** with GGUF or llama.cpp formats
* GGUF models must be derived from one of the following, because llama.cpp performs its **own quantization** during GGUF creation:

  * **Float16 (`merged_16bit`)** models, or
  * **Full-precision (fp32)** models

---

## ✅ So if you want GGUF:

Use this directly (no merged model needed at all):

```python
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer=tokenizer,
    quantization_method=["q4_k_m", "q8_0"],
)
```

> 🔁 It will:
>
> * Automatically **merge your LoRA** with the base model
> * Automatically **quantize to GGUF**
> * Save `.gguf` files ready for llama.cpp / Ollama

---

## 🔁 TL;DR

| Save Type              | Can Be Used For GGUF? | Can Be Used for HF Inference? | Further Training? |
| ---------------------- | --------------------- | ----------------------------- | ----------------- |
| `merged_16bit`         | ✅ Yes                 | ✅ Yes                         | ❌ No              |
| `merged_4bit`          | ❌ No                  | ✅ Yes                         | ❌ No              |
| `lora_model`           | ✅ Yes (via merge)     | ✅ Yes (with LoRA loading)     | ✅ Yes             |
| `save_pretrained_gguf` | ✅ Yes                 | ❌ (GGUF is for llama.cpp)     | ❌ No              |

---

For more details, see Unsloth's official notebook for a similar task: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb