<a href="https://colab.research.google.com/github/BoJavs-svg/LLM_Lora_FineTunning/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
    !pip install requests

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 models, 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.6: Fast Qwen3 patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/168k [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.59G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.56G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/4.67k [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.5.6 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
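
To see where the trainable parameter count comes from, here is a rough sketch that counts the LoRA weights by hand. Each adapted linear layer of shape `d_in x d_out` gains two small matrices, `A` (`r x d_in`) and `B` (`d_out x r`), so it adds `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions taken from the published Qwen3-14B configuration (hidden size 5120, MLP size 17408, 40 query heads and 8 key/value heads of dimension 128); they are not printed anywhere in this notebook.

```python
# Assumed Qwen3-14B dimensions (not shown in this notebook).
HIDDEN = 5120          # model hidden size
INTERMEDIATE = 17408   # MLP intermediate size
Q_OUT = 40 * 128       # query heads * head dim
KV_OUT = 8 * 128       # key/value heads * head dim
LAYERS = 40            # matches "patched 40 layers" in the log above
RANK = 32              # r passed to get_peft_model


def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# (d_in, d_out) for every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total is 128,450,560, which is the same figure the trainer reports as trainable parameters when training starts (about 0.92% of the model).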


<a name="Data"></a>
### Data Prep
Qwen3 has both reasoning and a non reasoning mode. So, we should use 2 datasets:

1. We use the Open Math Reasoning dataset, which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of verifiable reasoning traces that used DeepSeek R1 and got > 95% accuracy.

2. We also leverage [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we need to convert it to HuggingFace's normal multiturn format as well.

In [6]:
from datasets import load_dataset
swe_bench_lite = load_dataset('SWE-bench/SWE-bench_Lite', split='dev')

Let's see the structure of the dataset:

In [7]:
swe_bench_lite

Dataset({
    features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
    num_rows: 23
})

Next we take the non reasoning dataset and convert it to conversational format as well.


We have to use Unsloth's `standardize_sharegpt` function to fix up the format of the dataset first.

In [8]:
import requests
import base64

def get_commit(repo_full_name, commit_sha, token=None):
    """Return the patches of all changed .py files in a commit as one string."""
    # Split repository into owner and name
    owner, repo_name = repo_full_name.split('/')

    # Set up request headers
    headers = {
        'Accept': 'application/vnd.github.v3+json',
        'X-GitHub-Api-Version': '2022-11-28'
    }

    # Add authentication if a token is available
    if token:
        headers['Authorization'] = f'Bearer {token}'

    try:
        # Fetch commit information
        response = requests.get(
            f'https://api.github.com/repos/{owner}/{repo_name}/commits/{commit_sha}',
            headers=headers,
            timeout=30,
        )

        # Check if request was successful
        if response.status_code == 200:
            commit_data = response.json()

            # Extract and format relevant information
            result = ""
            # Add changed files information
            for file in commit_data['files']:
                if file['filename'].endswith('.py'):
                    result += f"\nFile: {file['filename']} ({file['status']})\n"
                    # 'patch' can be absent (e.g. for very large diffs)
                    result += file.get('patch', '') + "\n"
                #   # Get the file contents
                #   file_url = file['contents_url']

                #   file_response = requests.get(file_url, headers=headers)
                #   if file_response.status_code == 200:
                #     file_data = file_response.json()
                #     content = base64.b64decode(file_data['content']).decode('utf-8')
                #     result += f"{content} \n"
            return result
        else:
            raise Exception(f"Failed to fetch commit: Status {response.status_code}")

    except requests.exceptions.RequestException as e:
        raise Exception(f"Error fetching commit: {str(e)}") from e

print(get_commit("sqlfluff/sqlfluff", "14e1a23a3166b9a645a16de96f694c77a5d4abb7"))



File: src/sqlfluff/dialects/dialect_snowflake.py (modified)
@@ -1158,12 +1158,12 @@ class ColumnConstraintSegment(BaseSegment):
         Sequence(
             OneOf("AUTOINCREMENT", "IDENTITY"),
             OneOf(
-                Bracketed(Delimited(Ref("NumericLiteralValue"))),
+                Bracketed(Delimited(Ref("NumericLiteralSegment"))),
                 Sequence(
                     "START",
-                    Ref("NumericLiteralValue"),
+                    Ref("NumericLiteralSegment"),
                     "INCREMENT",
-                    Ref("NumericLiteralValue"),
+                    Ref("NumericLiteralSegment"),
                 ),
             ),
         ),



In [9]:
def generate_conversation(batch):
    conversations = []

    for problem, patch, repo, base_commit in zip(
        batch["problem_statement"],
        batch["patch"],
        batch["repo"],
        batch["base_commit"]
    ):
        # Fetch commit content
        try:
            commit_context = get_commit(repo, base_commit)
        except Exception as e:
            commit_context = "Error fetching commit data."

        user_prompt = f"""\
You are a helpful software engineer assistant.

Here is a GitHub repository: **{repo}**
This is the base commit SHA: **{base_commit}**

Context from the commit:
{commit_context}

Problem:
{problem}

Please provide the fix that resolves the problem."""

        conversations.append([
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": patch}
        ])

    return {"conversations": conversations}


In [10]:
from unsloth.chat_templates import standardize_sharegpt

# Standardize the dataset first
dataset = standardize_sharegpt(swe_bench_lite)

# Apply the chat template with thinking mode disabled
swe_bench_lite_conversations = tokenizer.apply_chat_template(
    dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize=False,
    add_generation_prompt=False,  # Training text already ends with the assistant turn
    enable_thinking=False  # Explicitly disable thinking mode
)

Map:   0%|          | 0/23 [00:00<?, ? examples/s]

Let's look at one formatted row (index 1):

In [11]:
swe_bench_lite_conversations[1]


'<|im_start|>user\nYou are a helpful software engineer assistant.\n\nHere is a GitHub repository: **sqlfluff/sqlfluff**\nThis is the base commit SHA: **f1dba0e1dd764ae72d67c3d5e1471cf14d3db030**\n\nContext from the commit:\n\nFile: src/sqlfluff/rules/L041.py (modified)\n@@ -10,7 +10,7 @@\n \n @document_fix_compatible\n class Rule_L041(BaseRule):\n-    """``SELECT`` modifiers (e.g.``DISTINCT``) must be on the same line as ``SELECT``.\n+    """``SELECT`` modifiers (e.g. ``DISTINCT``) must be on the same line as ``SELECT``.\n \n     | **Anti-pattern**\n \n\nFile: src/sqlfluff/rules/L043.py (modified)\n@@ -30,7 +30,7 @@ class Rule_L043(BaseRule):\n             end as is_fab\n         from fancy_table\n \n-        -- This rule can also simplify ``CASE`` statements\n+        -- This rule can also simplify CASE statements\n         -- that aim to fill NULL values.\n \n         select\n@@ -60,13 +60,13 @@ class Rule_L043(BaseRule):\n             coalesce(fab > 0, false) as is_fab\n         fro
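
The `<|im_start|>` markers in the row above come from the model's chat template. As a simplified illustration of that layout, the sketch below renders a list of messages by hand. `render_chatml` is a hypothetical helper written for this example; the real template shipped with the tokenizer (`chat_template.jinja`) also handles system prompts, tool calls and `<think>` blocks, all of which are left out here.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages in a simplified ChatML-style layout."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Please provide the fix that resolves the problem."},
    {"role": "assistant", "content": "diff --git a/example.py b/example.py"},
]
print(render_chatml(example))
```

For training text the conversation already ends with a complete assistant turn, so a generation prompt is only useful when asking the model to produce a new reply.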

Now let's see how long the dataset is:

In [12]:
print(len(swe_bench_lite_conversations))

23


The non reasoning dataset is much longer. Let's assume we want the model to retain some reasoning capabilities, but we specifically want a chat model.

Let's define a ratio of chat only data. The goal is to define some mixture of both sets of data.

Let's select 25% reasoning and 75% chat based:

In [13]:
chat_percentage = 0.75
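
As a rough sketch of how such a ratio could translate into sample counts: if all reasoning examples are kept and should make up `1 - chat_percentage` of the mixture, the number of chat examples to draw follows from simple proportion. The helper name and the dataset size of 1,000 are hypothetical and only serve the illustration.

```python
def chat_samples_needed(num_reasoning, chat_percentage):
    """Chat examples to sample so they form `chat_percentage` of the final mix."""
    if not 0 <= chat_percentage < 1:
        raise ValueError("chat_percentage must be in [0, 1)")
    return round(num_reasoning * chat_percentage / (1 - chat_percentage))


num_reasoning = 1000  # hypothetical size of the reasoning set
num_chat = chat_samples_needed(num_reasoning, 0.75)
share = num_chat / (num_chat + num_reasoning)
print(num_chat, share)
```

With a 75% chat share, every reasoning example is paired with three chat examples.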

Finally combine both datasets:

In [14]:
import pandas as pd
swe_bench_lite_subset = pd.Series(swe_bench_lite_conversations)

In [15]:
data = pd.concat([
    pd.Series(swe_bench_lite_subset)
])
data.name = "text"

from datasets import Dataset
dataset = Dataset.from_pandas(pd.DataFrame(data))
dataset = dataset.shuffle(seed = 3407)

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 15 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and set `max_steps=None`.

In [16]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 15,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/23 [00:00<?, ? examples/s]
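
To make the effect of `warmup_steps = 5`, `max_steps = 15` and `lr_scheduler_type = "linear"` more concrete, here is a small sketch of a linear warmup followed by linear decay. `linear_lr` is a hypothetical helper that mirrors the usual shape of this schedule; the trainer's own scheduler is the authority on the exact values used during a run.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=15):
    """Learning rate at optimizer step `step` (0-based) for warmup + linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly so the rate reaches 0 at max_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 2, 5, 10, 14):
    print(step, linear_lr(step))
```

With only 15 steps, a third of the run is spent warming up, so the peak learning rate of `2e-4` is reached at step 5 and decays from there.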

In [17]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
11.898 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [18]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 23 | Num Epochs = 5 | Total steps = 15
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 128,450,560/14,000,000,000 (0.92% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.3287
2,1.5036
3,1.3198
4,1.5528
5,1.1149
6,1.2334
7,1.0235
8,1.1523
9,1.2727
10,0.9727
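
The banner above reports a total batch size of 8 and 5 epochs for 15 steps over 23 examples. The sketch below reproduces that arithmetic under the assumption that one optimizer step consumes `gradient_accumulation_steps` micro-batches and that a trailing partial accumulation group is dropped when counting steps per epoch. `estimate_epochs` is a hypothetical helper, and the exact rounding may differ between library versions.

```python
import math


def estimate_epochs(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Return (effective batch size, optimizer steps per epoch, epochs needed)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Micro-batches the dataloader yields in one pass over the data
    micro_batches = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Full accumulation groups per pass, with at least one step
    steps_per_epoch = max(micro_batches // grad_accum, 1)
    epochs = math.ceil(max_steps / steps_per_epoch)
    return effective_batch, steps_per_epoch, epochs


print(estimate_epochs(num_examples=23, per_device_batch=2, grad_accum=4, max_steps=15))
```

Because the dataset is so small, 15 steps already mean several passes over the same 23 examples, which is worth keeping in mind when reading the loss values.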


In [19]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1132.0603 seconds used for training.
18.87 minutes used for training.
Peak reserved memory = 14.508 GB.
Peak reserved memory for training = 2.61 GB.
Peak reserved memory % of max memory = 98.419 %.
Peak reserved memory for training % of max memory = 17.706 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`
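
For intuition about what these three settings do, here is a minimal pure-Python sketch of the filtering that is applied to the next-token scores before sampling: temperature rescales the logits, `top_k` keeps only the k highest-scoring tokens, and `top_p` keeps the smallest set of those whose cumulative probability reaches the threshold. `filtered_probs` is a hypothetical helper; real implementations work on tensors and may differ in details such as tie handling.

```python
import math


def filtered_probs(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # 1. Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep the indices of the k largest scaled logits
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the surviving tokens
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the remaining probabilities sum to 1
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
```

The lower `top_p` for chat-style inference trims more of the tail than the reasoning setting, which tends to make replies more focused.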

In [20]:
messages = [
    {"role" : "user", "content" : "Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels Consider the following model: ```python from astropy.modeling import models as m from astropy.modeling.separable import separability_matrix cm = m.Linear1D(10) & m.Linear1D(5) ``` It's separability matrix as you might expect is a diagonal: ```python >>> separability_matrix(cm) array([[ True, False], [False, True]]) ``` If I make the model more complex: ```python >>> separability_matrix(m.Pix2Sky_TAN() & m.Linear1D(10) & m.Linear1D(5)) array([[ True, True, False, False], [ True, True, False, False], [False, False, True, False], [False, False, False, True]]) ``` The output matrix is again, as expected, the outputs and inputs to the linear models are separable and independent of each other. If however, I nest these compound models: ```python >>> separability_matrix(m.Pix2Sky_TAN() & cm) array([[ True, True, False, False], [ True, True, False, False], [False, False, True, True], [False, False, True, True]]) ``` Suddenly the inputs and outputs are no longer separable? This feels like a bug to me, but I might be missing something?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

You're absolutely right to suspect this might be a bug. The issue you've identified is related to the `separability_matrix` function in `astropy.modeling` not correctly handling **nested compound models**. Let's break down what's happening and why it's a bug.

---

### 🔍 Problem Summary

You've observed that when using `separability_matrix` on a compound model with **nested sub-models**, the resulting separability matrix incorrectly assumes that the **inputs and outputs are separable**, even when they should not be.

---

### ✅ Expected Behavior

In the case of a nested model like:

```python
cm = m.Linear1D(10) & m.Linear1D(5)
nested_model = m.Pix2Sky_TAN() & cm
```

The model structure is:

```
Input (x1, x2)
     |
     v
Pix2Sky_TAN()
     |
     v
cm (Linear1D & Linear1D)
     |
     v
Output (y1, y2)
```

Here, the **inputs to the nested compound model** are not separable with the **outputs of the nested compound model** because the nested compound model acts as a black box


In [None]:
messages = [
    {"role" : "user", "content" : "Given two integers `dividend` and `divisor`, divide two integers **without** using multiplication, division, and mod operator. The integer division should truncate toward zero, which means losing its fractional part. For example, `8.345` would be truncated to `8`, and `-2.7335` would be truncated to `-2`. Return _the **quotient** after dividing_ `dividend` _by_ `divisor`. **Note:** Assume we are dealing with an environment that could only store integers within the **32-bit** signed integer range: `[-2^31, 2^31 - 1]`. For this problem, if the quotient is **strictly greater than** `2^31 - 1`, then return `2^31 - 1`, and if the quotient is **strictly less than** `-2^31`, then return `-2^31`. **Example 1:** **Input:** dividend = 10, divisor = 3 **Output:** 3 **Explanation:** 10/3 = 3.33333.. which is truncated to 3. **Example 2:** **Input:** dividend = 7, divisor = -3 **Output:** -2 **Explanation:** 7/-3 = -2.33333.. which is truncated to -2. **Constraints:** * `-2^31 <= dividend, divisor <= 2^31 - 1` * `divisor != 0`"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Enable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<think>

</think>

To solve this problem, we can use the fact that division is the same as repeated subtraction. We can subtract the divisor from the dividend repeatedly until the dividend becomes less than the divisor. The number of times we subtract the divisor from the dividend is the quotient.

We can optimize this process by using bit manipulation. We can find the largest power of 2 that is less than or equal to the dividend and subtract that power of 2 from the dividend. We can then do this again with the remainder until the remainder becomes less than the divisor.

Here is the implementation in Python:

```python
def divide(dividend, divisor):
    # Handle special cases
    if dividend == 0:
        return 0
    if divisor == 1:
        return dividend
    if divisor == -1:
        return -dividend

    # Handle overflow
    if dividend == -2**31 and divisor == -1:
        return 2**31 - 1

    # Initialize the quotient
    quotient = 0

    # Determine the sign of the quotient


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
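
Conceptually, a merged save folds each adapter back into its base weight, so the saved model no longer needs the separate LoRA matrices. The sketch below shows that update, `W + (lora_alpha / r) * B @ A`, on tiny hand-written matrices. `merge_lora` and `matmul` are hypothetical helpers for illustration and are not how Unsloth performs the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the weight with the adapter folded in."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight (d_out = d_in = 2)
B = [[1.0], [2.0]]            # toy LoRA B matrix, rank r = 1
A = [[3.0, 4.0]]              # toy LoRA A matrix
print(merge_lora(W, A, B, lora_alpha=2, r=1))
```

With the settings used earlier in this notebook (`lora_alpha = 32`, `r = 32`) the scale factor is 1, so the adapter product is added to the base weight unchanged.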

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

