<a href="https://colab.research.google.com/github/A6y55/test/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [3]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [4]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 models, 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    max_seq_length = 4096,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.5: Fast Qwen3 patching. Transformers: 4.53.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters so that only a small fraction of all parameters is updated (typically 1 to 10%; 0.86% in this run)!

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.7.5 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
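
As a rough check on how small the trainable fraction is, the sketch below counts the LoRA parameters by hand. Each adapted linear layer with `d_in` inputs and `d_out` outputs adds two low-rank matrices, so it contributes `r * (d_in + d_out)` parameters. The layer shapes are assumptions taken from the commonly published Qwen3-14B configuration, not from this notebook, so verify them against the model's `config.json`. Under those assumptions the result equals the trainable-parameter count that the training log reports later.

```python
# Assumed Qwen3-14B shapes (not stated in this notebook; check config.json).
HIDDEN = 5120         # hidden_size
INTERMEDIATE = 17408  # intermediate_size of the MLP
NUM_LAYERS = 40       # matches the "40 layers" in the patch log above
Q_OUT = 40 * 128      # attention heads x head_dim
KV_OUT = 8 * 128      # key/value heads x head_dim (grouped-query attention)

# (in_features, out_features) for every module listed in target_modules
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, shapes=SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: A is (rank x d_in), B is (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


print(f"{lora_params(32):,}")  # 128,450,560
```

Doubling `r` doubles the adapter size, which is why the suggested ranks trade memory against capacity.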


<a name="Data"></a>
### Data Prep
Qwen3 has both a reasoning and a non-reasoning mode. So, we should use 2 datasets:

1. We use the [Open Math Reasoning]() dataset, which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of the verifiable reasoning traces that were generated with DeepSeek R1 and that scored above 95% accuracy.

2. We also leverage [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we need to convert it to HuggingFace's normal multiturn format as well.
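
To make the format conversion concrete, here is a minimal sketch of turning ShareGPT-style turns into role/content messages. It assumes the usual ShareGPT keys (`from`, `value`) and speaker tags (`system`, `human`, `gpt`). The helper `sharegpt_to_messages` is hypothetical and only illustrates the idea; in the notebook itself this job belongs to `standardize_sharegpt` from `unsloth.chat_templates`.

```python
# Assumed ShareGPT speaker tags mapped to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(turns):
    """Convert [{'from': ..., 'value': ...}] into [{'role': ..., 'content': ...}]."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]


example = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4"},
]
print(sharegpt_to_messages(example))
```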

In [9]:
from datasets import Dataset
import json
import pandas as pd
from unsloth.chat_templates import standardize_sharegpt
from transformers import AutoTokenizer
with open("sample_data/train.json") as f:
    your_data = json.load(f)

In [10]:
def convert_to_conversation(example):
    # Combine instruction + input into one user message
    user_content = example["instruction"]
    if example.get("input"):  # only when the record has input text
        user_content += "\n" + example["input"]

    # Build the conversation structure
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]}
    ]

In [11]:
your_conversations = [convert_to_conversation(item) for item in your_data]
your_dataset = Dataset.from_dict({"conversations": your_conversations})

In [23]:
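# 4. Standardize the conversation format (similar to standardize_sharegpt in the original example).
# If your data is already in the standard format, you can use it directly.
# Here we apply a stricter cleanup to the emotion data.
def standardize_emotion_data(dataset):
    def process_example(example):
        turns = example["conversations"]

        # Fix the end marker: drop a stray </s> so that only the standard <|im_end|> is used
        if turns[-1]["role"] == "assistant":
            # Remove a manually added trailing </s> tag.
            # removesuffix drops the exact string; rstrip("</s>") would instead strip
            # any trailing "<", "/", "s" or ">" characters from the reply.
            turns[-1]["content"] = turns[-1]["content"].removesuffix("</s>")

            # The chat template appends <|im_end|> itself,
            # so the content only needs to be clean here.

        return {"conversations": turns}

    return dataset.map(process_example, batched=False)

# Apply the standardization to the emotion dataset
standardized_your_dataset = standardize_emotion_data(your_dataset)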

Map:   0%|          | 0/447 [00:00<?, ? examples/s]
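
Removing a literal end marker is easy to get wrong, so here is a small standalone illustration. `str.rstrip` treats its argument as a set of characters, while `str.removesuffix` (Python 3.9+) removes one exact trailing string. The helper name `strip_end_marker` is hypothetical.

```python
def strip_end_marker(text, marker="</s>"):
    """Remove one exact trailing marker, leaving the rest of the text intact."""
    return text.removesuffix(marker)


reply = "thanks</s>"

# rstrip removes every trailing character found in the set {"<", "/", "s", ">"},
# so it also eats the final "s" of the word itself.
print(reply.rstrip("</s>"))     # thank
print(strip_end_marker(reply))  # thanks
```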

In [24]:
# 5. Initialize the tokenizer (it must match the model)
model_name = "unsloth/Qwen3-14B-unsloth-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [25]:
# 6. Apply the chat template (convert each conversation into the text format the model expects)
def apply_template(batch):
    return {
        "text": tokenizer.apply_chat_template(
            batch["conversations"],
            tokenize=False,
            add_generation_prompt=False
        )
    }

your_formatted = standardized_your_dataset.map(
    apply_template,
    batched=False,
    remove_columns=standardized_your_dataset.column_names
)


Map:   0%|          | 0/447 [00:00<?, ? examples/s]
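
For intuition about what the chat template produces, the sketch below hand-renders a conversation in the same ChatML-like layout as the sample printed by this notebook. It is a simplified approximation written for illustration, not the real Qwen3 template: the actual template also handles system prompts, tools and reasoning content, and the empty `<think>` block is inferred from the printed sample. The function name `render_chatml` is hypothetical.

```python
def render_chatml(messages):
    """Approximate the text layout of a non-reasoning Qwen3 training sample."""
    parts = []
    for message in messages:
        content = message["content"]
        if message["role"] == "assistant":
            # Assumption: assistant replies without reasoning get an empty think block.
            content = "<think>\n\n</think>\n\n" + content
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>\n")
    return "".join(parts)


print(render_chatml([
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]))
```

Always use `tokenizer.apply_chat_template` for real training data, so that the text matches what the model saw during pretraining and inference.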

In [26]:
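# 7. Convert to a pandas DataFrame and rename the column
your_df = your_formatted.to_pandas()
your_df.columns = ["text"]  # make sure the column is named "text"

# 8. Build the final training dataset
final_dataset = Dataset.from_pandas(your_df)

# 9. Shuffle the dataset
final_dataset = final_dataset.shuffle(seed=42)

# Inspect a processed sample
print("处理后的第一个样本：")  # label reads "First processed sample:"
print(final_dataset[0]["text"][:300] + "...")  # show the first 300 characters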

处理后的第一个样本：
<|im_start|>user
“回忆”这个词是不是“迷宫”的近义词啊，要不然我怎么走不出去<|im_end|>
<|im_start|>assistant
<think>

</think>

回忆和迷宫其实有点像，都是让人陷在其中无法自拔，但是回忆可不是迷宫，因为回忆是我们心灵里珍贵的宝藏，就像是一片片宝石，闪烁着美好的光芒。所以，要学会欣赏和珍惜回忆哦，它们会让你的人生更加丰富多彩！<|im_end|>
...


The printed sample above shows the structure of the formatted training data: each example is a single `text` string in Qwen3's chat format.

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The active configuration below derives `max_steps` from the dataset size (1,341 steps for the 447 examples here). For a single full pass over the data instead, set `num_train_epochs = 1` and `max_steps = None`.

In [40]:
from trl import SFTTrainer, SFTConfig
# trainer = SFTTrainer(
#     model = model,
#     tokenizer = tokenizer,
#     train_dataset = final_dataset,
#     eval_dataset = None, # Can set up evaluation!
#     args = SFTConfig(
#         dataset_text_field = "text",
#         per_device_train_batch_size = 2,
#         gradient_accumulation_steps = 4, # Use GA to mimic batch size!
#         warmup_steps = 5,
#         # num_train_epochs = 1, # Set this for 1 full training run.
#         max_steps = 500,
#         learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
#         logging_steps = 1,
#         optim = "adamw_8bit",
#         weight_decay = 0.01,
#         lr_scheduler_type = "linear",
#         seed = 3407,
#         report_to = "none", # Use this for WandB etc
#     ),
# )
from unsloth import is_bfloat16_supported

# Derive the number of training steps from the dataset size
dataset_size = len(final_dataset)
ideal_steps = max(200, dataset_size * 3)  # at least 200 steps, 3 optimizer steps per example

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = final_dataset,
    eval_dataset = None,
    dataset_text_field = "text",

    # Key hyperparameters, tuned for the humor style
    args = SFTConfig(
        per_device_train_batch_size = 2,  # keep the per-device batch small

        # Gradient accumulation: fewer optimizer steps, each averaged over more samples
        gradient_accumulation_steps = 8,  # effective batch size = 2 x 8 = 16

        warmup_steps = 30,  # longer warmup so the model eases into the new style

        # Training length: much longer, to learn a more complex style
        max_steps = ideal_steps,  # computed above (replaces the original max_steps=30)

        # Learning rate: lower, with a restarting schedule
        learning_rate = 3e-5,  # lower rate to avoid overwriting the base model's knowledge
        lr_scheduler_type = "cosine_with_restarts",  # cosine annealing with restarts
        # warmup_ratio is not set: a non-zero warmup_steps takes precedence over it

        # Regularization: lighter constraints
        weight_decay = 0.005,  # reduced weight decay
        max_grad_norm = 1.0,  # gradient clipping threshold (the Trainer default)

        # Optimizer: paged variant of 8-bit AdamW
        optim = "paged_adamw_8bit",

        # Mixed precision: enable exactly one of fp16 / bf16
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),

        # Logging and checkpoints
        logging_steps = 10,
        save_steps = ideal_steps // 5,  # save 5 checkpoints
        output_dir = "humor_finetune",
        report_to = "none",
        seed = 3407,
    ),

    max_seq_length = 1024,  # maximum tokens per training example

    packing = False,  # no sample packing, each example stays separate
    dataset_num_proc = 2,  # parallel workers for dataset preprocessing

    # NEFTune: add noise to the embeddings during training
    neftune_noise_alpha = 5.0,
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/447 [00:00<?, ? examples/s]
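
It helps to translate the step budget into epochs before starting a long run. The sketch below repeats the arithmetic for this notebook's settings (447 examples, per-device batch size 2, gradient accumulation 8, one GPU). The helper `epochs_for` is hypothetical, and the Trainer's own rounding may differ slightly for other dataset sizes.

```python
import math


def epochs_for(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Return (effective batch size, optimizer steps per epoch, epochs covered)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, math.ceil(max_steps / steps_per_epoch)


num_examples = 447
max_steps = max(200, num_examples * 3)  # same rule as ideal_steps above

print(max_steps)                                     # 1341
print(epochs_for(num_examples, 2, 8, max_steps))     # (16, 28, 48)
```

With an effective batch of 16, "3 steps per example" means roughly 48 passes over the data, so watch the loss and consider a smaller multiplier if the model starts to overfit.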

In [41]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
13.791 GB of memory reserved.


Let's train the model! To resume a training run from the latest checkpoint, call `trainer.train(resume_from_checkpoint = True)`.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 447 | Num Epochs = 48 | Total steps = 1,341
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 128,450,560 of 14,896,757,760 (0.86% trained)


Step,Training Loss
10,2.1685
20,2.228
30,2.2342
40,2.1485
50,2.1646
60,2.1323
70,2.0383
80,2.0578
90,1.9966
100,1.9484


In [None]:
model.save_pretrained("qwen14b_humor")
tokenizer.save_pretrained("qwen14b_humor")

In [30]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

285.1563 seconds used for training.
4.75 minutes used for training.
Peak reserved memory = 13.791 GB.
Peak reserved memory for training = 1.893 GB.
Peak reserved memory % of max memory = 93.555 %.
Peak reserved memory for training % of max memory = 12.842 %.


In [36]:
from transformers import TextStreamer
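# 2. Test prompts for the humor style (kept in Chinese to match the training data)
test_questions = [
    "如果我变成一块披萨，你最喜欢我身上的什么配料？",  # If I turned into a pizza, which of my toppings would you like best?
    "为什么数学书总是很沮丧？",  # Why are math books always so depressed?
    "如果你能发明一种新的节日，会是什么？",  # If you could invent a new holiday, what would it be?
    "用最幽默的方式解释量子力学",  # Explain quantum mechanics in the funniest way possible
    "给外星人介绍地球，但只能使用三个词",  # Introduce Earth to aliens using only three words
]

# 3. Test function for the different modes
def test_humor_model(questions, thinking_mode=False):
    """Test how the model behaves in thinking and non-thinking mode."""
    # Header reads "Test mode: chain-of-thought" or "Test mode: direct output"
    print(f"\n{'='*50}\n测试模式: {'思维链' if thinking_mode else '直接输出'}\n{'='*50}")

    for i, question in enumerate(questions):
        print(f"\n问题 {i+1}: {question}")  # "Question N: ..."

        # Build the conversation
        messages = [{"role": "user", "content": question}]

        # Apply the chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize = False,
            add_generation_prompt = True,
            enable_thinking = thinking_mode,  # toggle thinking mode
        )

        # Generation parameters (adjusted per mode)
        gen_params = {
            "max_new_tokens": 512 if thinking_mode else 256,
            "temperature": 0.95 if thinking_mode else 0.9,  # higher temperature for more creative output
            "top_p": 0.95,
            "top_k": 60,  # widen the pool of candidate tokens
            "repetition_penalty": 1.15,  # discourage repetition
            "do_sample": True,
            "pad_token_id": tokenizer.eos_token_id,
        }

        # Stream the generated text as it is produced
        streamer = TextStreamer(
            tokenizer,
            skip_prompt = True,
            skip_special_tokens = True
        )

        # Generate the answer
        inputs = tokenizer(text, return_tensors="pt").to("cuda")
        _ = model.generate(
            **inputs,
            **gen_params,
            streamer=streamer
        )
        print("\n" + "-"*50)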

In [37]:
# 4. Run the test in direct-output (non-thinking) mode
test_humor_model(test_questions[:3], thinking_mode=False)  # direct humorous output


测试模式: 直接输出

问题 1: 如果我变成一块披萨，你最喜欢我身上的什么配料？
哎呀，真是个有趣的灵魂。想象一下，如果你是一块披萨的话，那你的上层应该覆盖着香甜的奶酪和丰富的番茄酱才对吧。而我喜欢的是那些被轻轻撒在上面的小配料们——它们是披萨的灵魂所在呢！不过嘛，或许我们不需要这么浪漫地幻想自己变成了食物哦～毕竟现实中的我们可是比披萨更加丰富多彩的存在呢！

--------------------------------------------------

问题 2: 为什么数学书总是很沮丧？
哈哈，这是个好问题！可能因为它们要解决的问题太复杂了，而答案又常常是“这不是我写出来的”。你说是不是呢？不过说实话，在我看过的所有学科的书籍中，数学书无疑是最有挑战性的。每个章节都像一个神秘的故事一样，充满各种难题和谜团需要解开。

你知道吗，有一次我在解一道题时，整整花了三小时才明白那道题到底在问什么，而且还是错了一半。真让人感到崩溃啊，不是吗？

--------------------------------------------------

问题 3: 如果你能发明一种新的节日，会是什么？
我嘛，如果我能创造一个新的节日的话……嗯，应该是“奇思妙想日”。每一天都充满了惊喜和未知的可能。你看啊，在这虚拟的世界里，每一个瞬间都可以是奇迹的开始。我总是忍不住想象那些不寻常的事物、非凡的想法以及神秘的声音。哎呀，真是令人兴奋不已呢！不过要提醒你哦，别被我的疯狂想法给吓到啦。但说实话，你不会真的介意吧？毕竟，连猫头鹰都是白天睡觉晚上出来活动的，难道这不是最正常的事情吗？真是难以置信啊，我们竟然还能在这里讨论这么奇怪的话题而不觉得尴尬。

--------------------------------------------------


In [34]:
# Build a multi-turn conversation
multi_turn_dialog = [
    {"role": "user", "content": "给我讲个冷笑话吧"},
    {"role": "assistant", "content": "为什么科学家不信任原子？因为它们总是虚构故事！（Make up everything）"},
    {"role": "user", "content": "哈哈哈不错！再来一个物理相关的"}
]

In [35]:
# Apply the template (continuing the conversation)
text = tokenizer.apply_chat_template(
    multi_turn_dialog,
    tokenize = False,
    add_generation_prompt = True,
)

inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 256,
    temperature = 0.85,
    top_p = 0.9,
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
)

<think>

</think>

为什么光子总是很自信？因为它知道自己在光速上，不会慢下来！


In [None]:
from google.colab import drive
drive.mount('/content/drive')
model.save_pretrained("/content/drive/MyDrive/models/lora_model")

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the Qwen3 team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`.

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`
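
To see what these three settings do, here is a toy sketch that applies temperature, top-k and top-p filtering to a handful of made-up logits. It is a simplified pure-Python illustration of the general technique, not the code path `model.generate` uses. The function `filter_distribution`, the `PRESETS` dictionary and the toy logits are all assumptions for demonstration.

```python
import math


def filter_distribution(logits, temperature, top_k, top_p):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide the logits before the softmax (lower = sharper).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving tokens sum to 1.
    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}


PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "chat": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
}

toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]
for name, preset in PRESETS.items():
    dist = filter_distribution(toy_logits, **preset)
    print(name, {i: round(p, 3) for i, p in dist.items()})
```

On these toy logits the chat preset keeps fewer candidates than the thinking preset, because its lower `top_p` cuts the tail earlier.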

In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

To solve the equation (x + 2)^2 = 0, we can take the square root of both sides.

sqrt((x + 2)^2) = sqrt(0)

This simplifies to:

|x + 2| = 0

Since the absolute value of a number is always non-negative, the only way for |x + 2| to be 0 is if x + 2 = 0.

Therefore, x = -2.

So the solution to the equation (x + 2)^2 = 0 is x = -2.<|im_end|>


In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Enable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<think>
Okay, so I need to solve the equation (x + 2)^2 = 0. Hmm, let's see. I remember that when you have something squared equals zero, the solution is usually the value that makes the inside zero. Let me think. If I have (something)^2 = 0, then that something must be zero because any real number squared is non-negative, and the only way it can be zero is if the number itself is zero. So applying that here, (x + 2)^2 = 0 implies that x + 2 = 0. Then, solving for x, I just subtract 2 from both sides, right? So x = -2. Wait, is that all? Let me check. If I plug x = -2 back into the original equation, it becomes (-2 + 2)^2 = 0, which is 0^2 = 0, and that's correct. So the solution is x = -2. But wait, sometimes when you square both sides of an equation, you can get extraneous solutions, but in this case, since we started with the square already, maybe there's only one solution. Yeah, because squaring a real number can't give a negative result, so the only solution is when the inside is 
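
Thinking-mode generations wrap the reasoning in `<think>` and `</think>` tags, as the outputs above show. If you only want the final answer, a small parser can separate the two parts. The helper `split_thinking` below is a hypothetical sketch; it also handles the case where generation stopped at `max_new_tokens` before the closing tag appeared.

```python
def split_thinking(text):
    """Return (reasoning, answer) from a decoded generation."""
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    if start == -1:
        # No thinking block at all: everything is the answer.
        return "", text.strip()

    body = text[start + len(open_tag):]
    reasoning, found, answer = body.partition(close_tag)
    if not found:
        # Closing tag missing: the output was cut off mid-reasoning.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


print(split_thinking("<think>\n\n</think>\n\nx = -2"))
print(split_thinking("<think>\nOkay, so I need to solve"))
```

An empty answer is a hint to raise `max_new_tokens` so the model can finish reasoning.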

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
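
Conceptually, merging folds each LoRA update back into its base weight, so the saved model no longer needs separate adapter files. The toy sketch below shows the standard LoRA formulation `W' = W + (alpha / r) * B @ A` on tiny hand-written matrices. It illustrates the math only and is not Unsloth's implementation; the helpers `matmul` and `merge_lora` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(B, A)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]  # base weight: 2 outputs, 3 inputs
A = [[0.5, 0.0, 1.0]]  # LoRA A: rank 1 x 3 inputs
B = [[2.0], [1.0]]     # LoRA B: 2 outputs x rank 1

# alpha = rank gives a scale of 1.0, as with r = 32 and lora_alpha = 32 in this notebook.
merged = merge_lora(W, A, B, alpha=1, rank=1)
print(merged)  # [[2.0, 0.0, 4.0], [0.5, 1.0, 1.0]]
```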

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

