To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

Loading logs.json from personal Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/Mindolife
!ls

Mounted at /content/drive
/content/drive/MyDrive/Mindolife
 huggingface_tokenizers_cache  'Qwen3_(14B)_Reasoning_Conversational.ipynb'
 log_conversations.jsonl        unsloth_compiled_cache
 logs.json		        unsloth_training_checkpoints


### Unsloth

In [3]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Pre-quantized Qwen3 models, 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.6: Fast Qwen3 patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/168k [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.59G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.56G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.5.6 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
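
To see where the "1 to 10%" figure comes from, here is a rough sketch that counts the LoRA weights by hand. Each adapted matrix of shape `in x out` gains `r * (in + out)` trainable parameters. The layer dimensions below are assumptions about Qwen3-14B (they are not stated in this notebook), and `lora_trainable_params` is a helper made up for this illustration.

```python
def lora_trainable_params(rank, num_layers, shapes):
    """Count LoRA weights: each adapted matrix adds rank * (in + out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed Qwen3-14B dimensions (not taken from this notebook)
hidden, intermediate = 5120, 17408
q_out, kv_out = 40 * 128, 8 * 128  # 40 query heads, 8 key/value heads, head size 128

shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

total = lora_trainable_params(rank=32, num_layers=40, shapes=shapes)
print(f"{total:,}")  # 128,450,560
```

With `r = 32` this gives 128,450,560 parameters, which is under 1% of a 14B model. Doubling `r` doubles the count.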


<a name="Data"></a>
### Data Prep


We now convert our dataset into conversational format:

In [None]:
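i = 1  # Sample index, reused later when inspecting the test set
# Preview the filtered messages. Requires `logs` and `filter_log_message`,
# which are defined in the cells below, so run those first.
for log in logs: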
    message = log.get("message", "")
    filtered = filter_log_message(message)
    print(filtered)
    print("--------------")

In [47]:
import re

def filter_log_message(message: str) -> str:
    if "com.mindolife" in message:
        # Search for the first exception class starting with com.mindolife and ending at space or :
        match = re.search(r"(com\.mindolife(?:\.\w+)+)", message)
        if match:
            return match.group(1)
    return message
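
As a quick check of what the pattern keeps, the sketch below applies the same regular expression to one of the error messages from this dataset. It only uses the standard `re` module; `PATTERN` and `message` are names chosen for this illustration.

```python
import re

# Same pattern as in filter_log_message
PATTERN = re.compile(r"(com\.mindolife(?:\.\w+)+)")

message = (
    "Failed translate state to feature: 6.1, state: 0b00b6ff00, for device General 29982\n"
    "Caused by: com.mindolife.basetype.exception.InvalidRealBaseType: "
    "com.mindolife.basetype.integerbt.exception.NoIntegerReferencedIntervalFound: "
    "Value 65462 is not legal, Doesn't fit any interval data"
)

first = PATTERN.search(message).group(1)  # what filter_log_message returns
every = PATTERN.findall(message)          # all class names in the message

print(first)       # com.mindolife.basetype.exception.InvalidRealBaseType
print(len(every))  # 2
```

The match stops at the first character that is neither a word character nor a dot, so the trailing `:` is dropped. Only the first (outermost) exception class becomes the training label; the nested cause is discarded.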

In [48]:
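def generate_convo(log_entries, max_tokens=1900, min_tokens=1500):
    # Builds (user, assistant) pairs: every ERROR log becomes its own sample,
    # and runs of non-error logs are batched up to roughly `max_tokens`.
    # Note: `min_tokens` is currently unused.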
    def count_tokens(text):
        return len(text) // 4  # Rough estimate: 1 token ≈ 4 characters

    user_messages = []
    assistant_messages = []

    buffer = ["Can you predict errors? If there's any?"]
    buffer_token_count = count_tokens(buffer[0])

    for log in log_entries:
        is_error = log['level'] == "ERROR"
        log_text = f"[{log['timestamp']}] {log['source']}@{log['level']}: {log['message']}"
        token_count = count_tokens(log_text)

        if is_error:
            # Flush current buffer as normal non-error block
            if len(buffer) > 1:  # Only flush if it contains actual logs
                group_text = "\n".join(buffer)
                assistant_messages.append("No Errors were predicted!")
                user_messages.append(group_text)

            # Start new buffer with question prompt + error log
            error_prompt = ["Can you predict errors? If there's any?", log_text]
            user_messages.append("\n".join(error_prompt))

            # Use the filtered error message for the assistant
            filtered_error = filter_log_message(log['message'])
            assistant_messages.append(filtered_error)

            # Reset buffer
            buffer = ["Can you predict errors? If there's any?"]
            buffer_token_count = count_tokens(buffer[0])
            continue

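        # Flush if too many tokens (skip when the buffer holds only the prompt,
        # otherwise an oversized log would emit a sample with no logs in it)
        if len(buffer) > 1 and buffer_token_count + token_count > max_tokens: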
            group_text = "\n".join(buffer)
            assistant_messages.append("No Errors were predicted!")
            user_messages.append(group_text)

            # Start fresh buffer
            buffer = ["Can you predict errors? If there's any?"]
            buffer_token_count = count_tokens(buffer[0])

        buffer.append(log_text)
        buffer_token_count += token_count

    # Final flush
    if len(buffer) > 1:
        group_text = "\n".join(buffer)
        assistant_messages.append("No Errors were predicted!")
        user_messages.append(group_text)

    return user_messages, assistant_messages
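
Once `generate_convo` has produced its two lists, it can be worth checking how many samples are errors versus "No Errors were predicted!", since a heavily skewed split affects what the model learns. The sketch below is one possible way to do that; `label_counts` and the toy labels (including the exception name) are hypothetical and only for illustration.

```python
from collections import Counter

def label_counts(solutions, no_error_label="No Errors were predicted!"):
    """Return (number of error samples, number of no-error samples)."""
    counts = Counter(
        "no_error" if s == no_error_label else "error" for s in solutions
    )
    return counts["error"], counts["no_error"]

# Toy labels; in the notebook you would pass the `solutions` list instead
toy = [
    "No Errors were predicted!",
    "com.mindolife.SomeException",  # hypothetical class name
    "No Errors were predicted!",
]
errors, clean = label_counts(toy)
print(errors, clean)  # 1 2
```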


In [49]:
import json

json_path = "/content/drive/MyDrive/Mindolife/logs.json"

with open(json_path, "r", encoding="utf-8") as file:
    logs = json.load(file)

print(logs[0])

{'timestamp': '2024-12-08T13:07:30.068000', 'source': 'Network', 'process_id': '3791500', 'level': 'DEBUG', 'message': 'Scheduling keepalive timer for B9A9DFDF: (General 29763)[Graph at 08/12/2024 03:30:00](last msg at :08/12/2024 13:07:30): Base  <(208~5)- B9A9DFDF will start after 320000ms', 'json_data': None}


In [50]:
problems, solutions = generate_convo(logs) # Returns two plain lists of strings

In [51]:
from datasets import Dataset

def create_hf_dataset(problems, solutions):
    data = {
        "problem": problems,
        "generated_solution": solutions
    }
    return Dataset.from_dict(data)

In [52]:
dataset = create_hf_dataset(problems, solutions)

In [53]:
print(type(dataset))

<class 'datasets.arrow_dataset.Dataset'>


In [54]:
def generate_conversation(examples):
    problems = examples["problem"]
    solutions = examples["generated_solution"]

    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution}
        ])

    return {"conversations": conversations}  # <- matches expected HF template structure


In [55]:
print(type(dataset))
print(dataset)

<class 'datasets.arrow_dataset.Dataset'>
Dataset({
    features: ['problem', 'generated_solution'],
    num_rows: 51642
})


In [56]:
print("Sample Problem:\n", dataset[1]["problem"])
print("\nSample Generated Solution:\n", dataset[1]["generated_solution"])

Sample Problem:
 Can you predict errors? If there's any?
[2024-12-08 12:00:00.219000] Device@ERROR: Failed translate state to feature: 6.1, state: 0b00b6ff00, for device General 29982
Caused by: com.mindolife.basetype.exception.InvalidRealBaseType: com.mindolife.basetype.integerbt.exception.NoIntegerReferencedIntervalFound: Value 65462 is not legal, Doesn't fit any interval data
Caused by: com.mindolife.basetype.integerbt.exception.NoIntegerReferencedIntervalFound: Value 65462 is not legal, Doesn't fit any interval data

Sample Generated Solution:
 com.mindolife.basetype.exception.InvalidRealBaseType


In [57]:
reasoning_conversations = tokenizer.apply_chat_template(
    dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False,
)

Map:   0%|          | 0/51642 [00:00<?, ? examples/s]

In [58]:
reasoning_conversations[0]

'<|im_start|>user\nCan you predict errors? If there\'s any?\n[2024-12-08T13:07:30.068000] Network@DEBUG: Scheduling keepalive timer for B9A9DFDF: (General 29763)[Graph at 08/12/2024 03:30:00](last msg at :08/12/2024 13:07:30): Base  <(208~5)- B9A9DFDF will start after 320000ms\n[2024-12-08T13:07:30.070000] Network@DEBUG: Handeling device driven message to device B9A9DFDF for handler NetworkManager\n[2024-12-08T13:07:30.070000] Network@DEBUG: message 00000000 on port A6, message:com.mindolife.q.c.a.f, OPCode:A3, Port:A6, dataSection1:1F, dataSection2:00, DataSection3:06, DataSection4:01, DataSection5:00 received for endpoint B9A9DFDF\n[2024-12-08T13:07:30.071000] Network@DEBUG: Updating RSSI (209) for device B9A9DFDF: (General 29763)[Graph at 08/12/2024 03:30:00](last msg at :08/12/2024 13:07:30): Base  <(209~5)- B9A9DFDF\n[2024-12-08T13:07:30.071000] Network@DEBUG: Handler NetworkManager Handeled device driven message B9A9DFDF took 1 ms\n[2024-12-08 12:00:00.012000] Policy@DEBUG: **che

In [59]:
print(len(reasoning_conversations))

51642
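
The string printed above follows the ChatML layout: each turn is wrapped in `<|im_start|>role` and `<|im_end|>`. The sketch below is a simplified approximation of that layout, not the tokenizer's actual template, which may add more (for example thinking tags or a generation prompt). `render_chatml` is a name made up for this illustration.

```python
def render_chatml(messages):
    """Approximate ChatML rendering: one <|im_start|>...<|im_end|> block per turn."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

conversation = [
    {"role": "user", "content": "Can you predict errors? If there's any?"},
    {"role": "assistant", "content": "No Errors were predicted!"},
]
rendered = render_chatml(conversation)
print(rendered)
```

For training, always use `tokenizer.apply_chat_template` as in the cell above so the text matches what the model expects.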


<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [60]:
import pandas as pd
from datasets import Dataset

# Convert reasoning_conversations only
data = pd.Series(reasoning_conversations)
data.name = "text"

# Convert to Hugging Face Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed=3407)


In [61]:
# Split the dataset
test_dataset = combined_dataset.select(range(500))  # First 500 samples
train_dataset = combined_dataset.select(range(500, len(combined_dataset)))  # Remaining samples

# Optional: verify sizes
print(f"Train size: {len(train_dataset)}, Test size: {len(test_dataset)}")


Train size: 51142, Test size: 500


In [62]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/51142 [00:00<?, ? examples/s]



In [63]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
11.898 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 51,142 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 128,450,560/14,000,000,000 (0.92% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.9757


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'start_gpu_memory' is not defined

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`
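
To give a feel for what these three settings do, here is a small pure-Python sketch of one common way to combine them: scale the logits by the temperature, keep the `top_k` most likely tokens, then keep the smallest set of those whose probabilities add up to at least `top_p`. It illustrates the idea only and is not the code path `model.generate` uses; `filter_probs` is a name made up for this example.

```python
import math

def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return per-token sampling probabilities after temperature, top-k and top-p."""
    scaled = [x / temperature for x in logits]

    # Top-k: indices of the k largest scaled logits, most likely first
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the kept tokens (subtract the max for numerical stability)
    peak = scaled[kept[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}

    # Top-p: walk from most to least likely and stop once the mass reaches top_p
    nucleus, mass = [], 0.0
    for i in kept:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise over the surviving tokens; everything else gets probability 0
    return [probs[i] / mass if i in nucleus else 0.0 for i in range(len(logits))]

p = filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.8)
print([round(x, 3) for x in p])
```

In this toy case only the two most likely tokens survive. A lower temperature sharpens the distribution, so fewer tokens are needed to reach `top_p`.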

In [None]:
sample = test_dataset[i]
print(sample)

{'text': '<|im_start|>user\n[2024-12-08 13:11:25.127000] Device@DEBUG: {\'text\': \'feature  Mode of device General 30640 value ({"value":"1"}) 1(0x01) hashed to 2(0x02) initialCRC=0, bytes:[1, 0]\', \'json_data\': {\'value\': \'1\'}}\n[2024-12-08 13:11:25.127000] Device@DEBUG: {\'text\': \'feature  Reset of device General 30640 value ({"value":"1"}) 1(0x01) hashed to 2(0x02) initialCRC=0, bytes:[1, 0]\', \'json_data\': {\'value\': \'1\'}}\n[2024-12-08 13:11:25.127000] Device@DEBUG: {\'text\': \'feature  On/Off of device General 30640 value ({"value":"true"}) 1(0x01) hashed to 2(0x02) initialCRC=0, bytes:[1, 0]\', \'json_data\': {\'value\': \'true\'}}\n[2024-12-08 13:11:25.128000] Device@DEBUG: {\'text\': \'feature  Set temperature of device General 30640 value ({"authenticValue":"23","value":"23"}) 23(0x17) hashed to 88(0x58) initialCRC=138, bytes:[23, 0]\', \'json_data\': {\'authenticValue\': \'23\', \'value\': \'23\'}}\n[2024-12-08 13:11:25.128000] Device@DEBUG: {\'text\': \'feature

In [None]:
from tqdm import tqdm  # Optional: shows a progress bar

# Marker that opens the assistant turn in Qwen's chat template
ASSISTANT_TAG = "<|im_start|>assistant\n"

# Loop through test samples
for i in tqdm(range(len(test_dataset))):
    text = test_dataset[i]["text"]
    # Each sample holds the whole conversation, so cut it before the assistant's
    # answer; otherwise the model is shown the label it is supposed to predict
    prompt, _, expected = text.partition(ASSISTANT_TAG)
    prompt += ASSISTANT_TAG

    # Tokenize and run on GPU
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,  # Lower this if you hit CUDA out-of-memory errors
        temperature=0.7,
        top_p=0.8,
        top_k=20,
        do_sample=True,
    )

    # Decode only the newly generated tokens (everything after the prompt)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    predicted = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Print comparison
    print(f"\n--- Sample #{i+1} ---")
    print(f"\n🔹 Model Prompt:\n{prompt}")
    print(f"\n🔹 Expected:\n{expected.strip()}")
    print(f"\n🔹 Model Prediction:\n{predicted}")

  0%|          | 0/500 [04:59<?, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 42.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 26.12 MiB is free. Process 10285 has 14.71 GiB memory in use. Of the allocated memory 14.26 GiB is allocated by PyTorch, and 298.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

To solve the equation (x + 2)^2 = 0, we can take the square root of both sides. This gives us:

x + 2 = 0

Subtracting 2 from both sides, we get:

x = -2

Therefore, the solution to the equation is x = -2.<|im_end|>


In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Enable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<think>
Okay, so I need to solve the equation (x + 2)^2 = 0. Hmm, let's see. I remember that when you have something squared equals zero, the only solution is when the inside part is zero. Because if you square any real number, it's either positive or zero. So if the square is zero, the original number must be zero. So maybe I can take the square root of both sides?

Wait, but the equation is (x + 2)^2 = 0. If I take the square root of both sides, that would give me x + 2 = 0, right? Because the square root of 0 is 0. Then solving for x would just be subtracting 2 from both sides, so x = -2. But wait, isn't that the only solution? Because squaring a number can't be negative, so the only way (x + 2)^2 is zero is if x + 2 is zero. So x must be -2. But since it's squared, does that mean there's a multiplicity here? Like, maybe x = -2 is a repeated root?

Let me think. If we expand the equation, (x + 2)^2 = x^2 + 4x + 4. So the original equation is x^2 + 4x + 4 = 0. To solve this quadratic

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
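
For intuition, merging folds the adapters back into the base weights so no separate adapter is needed at inference time. In the standard LoRA formulation the merged weight is `W + (alpha / r) * B @ A`. The sketch below shows that formula on tiny nested lists; it is not Unsloth's implementation, which presumably also has to handle the 4-bit base weights, and `merge_lora` is a name made up for this illustration.

```python
def merge_lora(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * B @ A using plain nested lists."""
    scale = alpha / rank
    out_dim, in_dim = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weights stay untouched
    for i in range(out_dim):
        for j in range(in_dim):
            merged[i][j] += scale * sum(B[i][k] * A[k][j] for k in range(rank))
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out x in)
A = [[1.0, 2.0]]              # LoRA down-projection (rank x in)
B = [[0.5], [0.0]]            # LoRA up-projection (out x rank)

merged = merge_lora(W, A, B, alpha=1, rank=1)
print(merged)  # [[1.5, 1.0], [0.0, 1.0]]
```

In this notebook `lora_alpha = 32` and `r = 32`, so the scale factor `alpha / r` is 1.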

### GGUF / llama.cpp Conversion
Unsloth now supports saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default, and other methods such as `q4_k_m` are supported too. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
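
As a rough guide to disk usage, file size is approximately the parameter count times the bits stored per weight. The sketch below uses two figures that follow from the formats themselves: 16 bits for `f16`, and 8.5 bits for `q8_0` (each block of 32 int8 weights carries one 16-bit scale). The parameter count is the rounded figure from the training banner, and `gguf_size_gb` is a helper made up for this illustration; real files differ slightly because of metadata and tensors kept at higher precision.

```python
def gguf_size_gb(num_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes) for a given bits-per-weight."""
    return num_params * bits_per_weight / 8 / 1e9

num_params = 14_000_000_000  # rounded figure from the training banner

f16_gb = round(gguf_size_gb(num_params, 16.0), 1)
q8_gb = round(gguf_size_gb(num_params, 8.5), 1)
print(f16_gb, q8_gb)  # 28.0 14.9
```

The k-quant methods such as `q4_k_m` mix several block types, so their average bits per weight depends on the model; pass your own estimate to `gguf_size_gb` if you need one.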

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to join projects and keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

