To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News


Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [342]:
from unsloth import FastModel
import torch
max_seq_length = 2048
fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
    "unsloth/gemma-3-270m-it",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

==((====))==  Unsloth 2026.1.2: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 

We now add LoRA adapters so we only need to update a small number of parameters!
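
To give a rough sense of why LoRA trains so few parameters: for one linear layer with a weight of shape `(d_out, d_in)`, a rank-`r` adapter adds two small matrices of shapes `(r, d_in)` and `(d_out, r)`, so `r * (d_in + d_out)` trainable values instead of `d_in * d_out`. The sketch below just does that arithmetic. The helper names and the layer size are hypothetical and chosen for illustration; they are not Gemma-3's real dimensions.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable values added by one rank-r LoRA adapter on a (d_out x d_in) linear layer."""
    return r * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Values in the frozen dense weight itself."""
    return d_in * d_out


# Hypothetical layer size, used only to show the ratio.
d_in, d_out, r = 2048, 2048, 16
lora = lora_param_count(d_in, d_out, r)
full = full_param_count(d_in, d_out)
print(f"LoRA: {lora:,} trainable vs {full:,} frozen ({100 * lora / full:.2f}%)")
```

Raising `r` grows the adapter linearly, which is why a large rank such as `r = 128` costs noticeably more memory than `r = 8` or `r = 16`.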

In [None]:
model = FastModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation style finetunes. The training data is a game chat log (`/content/coralsentinel-filtered.log`) that we parse ourselves. Gemma-3 renders multi turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
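
As a small illustration of that layout, the sketch below renders a list of role/content messages into the same string. `render_gemma3` is a hypothetical helper written for this example; in practice the tokenizer's chat template does this work.

```python
def render_gemma3(messages):
    """Render [{'role': ..., 'content': ...}, ...] into the turn layout shown above."""
    parts = ["<bos>"]
    for m in messages:
        # Each turn opens with its role on its own line and closes with <end_of_turn>.
        parts.append(f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello!"},
    {"role": "model", "content": "Hey there!"},
]
print(render_gemma3(example))
```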

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma3",
)

We now parse the chat log, group the messages into overlapping sliding-window conversations, and render each conversation as Gemma-3 turns for finetuning!
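
Before the full cell, here is a toy version of the half-overlapping sliding window that `build_conversations` uses, run on plain integers so the overlap is easy to see. `sliding_windows` is a hypothetical stand-in with the same loop bounds; the `max(1, ...)` guard on the step is an addition for safety with very small windows.

```python
def sliding_windows(items, window_size=4, min_len=2):
    """Split a flat list into windows that overlap by half their size."""
    step = max(1, window_size // 2)  # guard against a zero step
    windows = []
    for start in range(0, len(items) - min_len, step):
        window = items[start:start + window_size]
        if len(window) >= min_len:
            windows.append(window)
    return windows


for w in sliding_windows(list(range(10))):
    print(w)
```

Because neighbouring windows share half their messages, most messages end up in two training examples.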

In [None]:
import hashlib
from collections import defaultdict
from datasets import Dataset
import re

def load_messages(filepath):
    """Load messages from text file with format [rank] username: message"""
    messages = []

    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue

            # Match pattern: [anything] username: message
            match = re.match(r'\[([^\]]*)\]\s*([^:]+):\s*(.+)', line)
            if match:
                rank = match.group(1).strip()
                username = match.group(2).strip()
                message = match.group(3).strip()

                if len(message) > 15:  # Filter short messages
                    messages.append({
                        "rank": rank,
                        "user": username,
                        "message": message
                    })

    return messages

def message_hash(msg):
    return hashlib.md5(msg.lower().strip().encode()).hexdigest()

def build_conversations(messages, window_size=20, min_turns=2, max_message_frequency=3):
    """
    Build conversations using sliding window since we don't have timestamps.

    Args:
        messages: List of message dicts
        window_size: Number of messages per conversation window
        min_turns: Minimum turns for a valid conversation
        max_message_frequency: Filter messages appearing more than this many times
    """
    conversations = []

    # Count message frequencies for dedup
    message_counts = defaultdict(int)
    for msg in messages:
        h = message_hash(msg["message"])
        message_counts[h] += 1

    # Filter out repeated messages
    filtered = []
    for msg in messages:
        h = message_hash(msg["message"])
        if message_counts[h] <= max_message_frequency:
            filtered.append(msg)

    # Sliding window to create conversations
    step = window_size // 2  # Overlap windows by half
    for i in range(0, len(filtered) - min_turns, step):
        window = filtered[i:i + window_size]
        if len(window) >= min_turns:
            conversations.append(window)

    return conversations

def format_gemma_alternating(conversations):
    """
    Format conversations to Gemma format with system prompt.
    """
    formatted = []

    system_prompt = "You are participating in a public game chat room. Messages are formatted as [username]: message. Respond naturally and casually as a chat participant."

    for convo in conversations:
        consolidated = []
        current_role = "user"
        current_messages = []
        last_user = None

        for turn in convo:
            if turn["user"] != last_user and last_user is not None:
                if current_messages:
                    consolidated.append({
                        "role": current_role,
                        "content": "\n".join(current_messages)
                    })
                    current_messages = []
                    current_role = "model" if current_role == "user" else "user"

            current_messages.append(f"[{turn['user']}]: {turn['message']}")
            last_user = turn["user"]

        if current_messages:
            consolidated.append({
                "role": current_role,
                "content": "\n".join(current_messages)
            })

        if len(consolidated) < 2:
            continue

        if consolidated[-1]["role"] == "user":
            consolidated = consolidated[:-1]

        if len(consolidated) < 2:
            continue

        # Build with system prompt
        turns = [f"<start_of_turn>system\n{system_prompt}<end_of_turn>"]
        for msg in consolidated:
            turns.append(f"<start_of_turn>{msg['role']}\n{msg['content']}<end_of_turn>")

        text = "\n".join(turns)
        formatted.append({"text": text})

    return formatted

# Usage
messages = load_messages("/content/coralsentinel-filtered.log")
print(f"Loaded {len(messages)} messages")

raw_convos = build_conversations(messages, window_size=20)
print(f"Built {len(raw_convos)} raw conversations")

formatted_dataset = format_gemma_alternating(raw_convos)
print(f"Formatted {len(formatted_dataset)} training examples")

dataset = Dataset.from_list(formatted_dataset)

print("\n=== Example ===")
print(dataset[0]["text"][:800])

In [267]:
messages = load_messages("/content/coralsentinel-filtered.log")
print(f"Loaded {len(messages)} messages")

if len(messages) > 0:
    print("\nFirst 5 messages:")
    for m in messages[:5]:
        print(f"  User: {m['user']}")
        print(f"  Message: {m['message'][:50]}...")
        print()
else:
    # Parsing failed - let's see raw lines
    print("\nParsing failed. Raw file content:")
    with open("/content/coralsentinel-filtered.log", "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            print(repr(line))
            if i >= 10:
                break

Loaded 213014 messages

First 5 messages:
  User: CHAMPION OverTheLimits
  Message: chi ti incula a te spostati tr 0 1 a...

  User: CHAMPION 4lice_10
  Message: bianco89 prima di tutto è mio...

  User: CHAMPION OverTheLimits
  Message: chi vince se lo prende...

  User: CHAMPION bianco89
  Message: NickAc abbi pieta...

  User: CHAMPION 4lice_10
  Message: ma no..è già mio...



Let's see what row 100 looks like!

In [301]:
dataset[100]

{'text': '<start_of_turn>system\nYou are participating in a public game chat room. Messages are formatted as [username]: message. Respond naturally and casually as a chat participant.<end_of_turn>\n<start_of_turn>user\n[CHAMPION Leogamer_317]: lafa ti devo aprlare vieni max 2<end_of_turn>\n<start_of_turn>model\n[PastaAlPiscio]: ZuccoAllaPeva zitto<end_of_turn>\n<start_of_turn>user\n[SossTaa_]: sdfdogòkghnlkjsfdg<end_of_turn>\n<start_of_turn>model\n[CHAMPION Snife_]: vuoi fare pure ds?<end_of_turn>\n<start_of_turn>user\n[YOUTUBER OcchiRosa]: OcchiViola no way<end_of_turn>\n<start_of_turn>model\n[VIP OcchiViola]: facciamo una duo OcchiRosa??<end_of_turn>\n<start_of_turn>user\n[YOUTUBER OcchiRosa]: Ciao OcchiPungenti<end_of_turn>\n<start_of_turn>model\n[OcchiPungenti]: Occhi rosa le tue stelle!! nahahahaahah ahahaha<end_of_turn>\n<start_of_turn>user\n[PastaAlPiscio]: OcchiPungenti ma ci vedi con i pungiglioni negli occhi?<end_of_turn>\n<start_of_turn>model\n[Vongola_98]: qualcuno per due 

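The `text` field already holds each conversation rendered in the `Gemma3` turn format, because `format_gemma_alternating` builds the turns by hand.

Let's inspect the `text` field directly! Judging by the cell counter, the output below comes from an earlier run, before the system turn was added.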


In [270]:
dataset[100]['text']

'<start_of_turn>user\n[CHAMPION Leogamer_317]: lafa ti devo aprlare vieni max 2<end_of_turn>\n<start_of_turn>model\n[PastaAlPiscio]: ZuccoAllaPeva zitto<end_of_turn>\n<start_of_turn>user\n[SossTaa_]: sdfdogòkghnlkjsfdg<end_of_turn>\n<start_of_turn>model\n[CHAMPION Snife_]: vuoi fare pure ds?<end_of_turn>\n<start_of_turn>user\n[YOUTUBER OcchiRosa]: OcchiViola no way<end_of_turn>\n<start_of_turn>model\n[VIP OcchiViola]: facciamo una duo OcchiRosa??<end_of_turn>\n<start_of_turn>user\n[YOUTUBER OcchiRosa]: Ciao OcchiPungenti<end_of_turn>\n<start_of_turn>model\n[OcchiPungenti]: Occhi rosa le tue stelle!! nahahahaahah ahahaha<end_of_turn>\n<start_of_turn>user\n[PastaAlPiscio]: OcchiPungenti ma ci vedi con i pungiglioni negli occhi?<end_of_turn>\n<start_of_turn>model\n[Vongola_98]: qualcuno per due decente<end_of_turn>\n<start_of_turn>user\n[firsterror8399]: chi duo forte no ds<end_of_turn>\n<start_of_turn>model\n[ValeIlBroCinese]: raga come si fa ad uscire da un clan? /msg per info grazie mi

<a name="Train"></a>
### Train the model
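Now let's train our model. We cap training at 1,000 steps (`max_steps = 1000`) to speed things up. Since `max_steps` takes precedence over `num_train_epochs`, the epoch setting has no effect while it is set; for a full run, set `num_train_epochs = 1` and remove `max_steps`.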

In [343]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 1, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 3, # Ignored while max_steps > 0; max_steps takes precedence
        max_steps = 1000, # Remove this line to train for num_train_epochs instead
        learning_rate = 1e-5, # Already below the 2e-5 suggested for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir="outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/16970 [00:00<?, ? examples/s]
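
As a back-of-the-envelope check of how far 1,000 steps get through the data, the sketch below uses the example count from the tokenization log and the batch settings from the `SFTConfig` cell. It assumes a single GPU and that no batches are dropped, so the trainer's own accounting may differ slightly.

```python
import math

num_examples = 16970      # from the tokenization progress bar
per_device_batch = 4      # per_device_train_batch_size
grad_accum = 1            # gradient_accumulation_steps
max_steps = 1000

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_examples / effective_batch)
fraction_of_epoch = max_steps / steps_per_epoch

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Fraction of one epoch covered: {fraction_of_epoch:.3f}")
```

Under these assumptions the run sees roughly a quarter of the dataset once.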

We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

In [344]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=5):   0%|          | 0/16970 [00:00<?, ? examples/s]
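
To make the idea concrete, here is a minimal sketch of response-only masking on toy token ids. It is not Unsloth's implementation: `mask_non_responses` and `find_sub` are hypothetical helpers, and the marker ids are made up. Every label is set to `-100` (the value the loss ignores) except the tokens between a response marker and the next instruction marker.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens following each response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        r = find_sub(input_ids, response_ids, pos)
        if r == -1:
            break
        start = r + len(response_ids)                    # first token of the reply
        nxt = find_sub(input_ids, instruction_ids, start)
        end = len(input_ids) if nxt == -1 else nxt       # reply ends at the next user turn
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels


# Toy ids: 1 = user marker, 2 = model marker, everything else = content.
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_non_responses(ids, [1], [2]))
```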

Let's verify that the instruction part is masked! Let's print the 100th row again.

In [304]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>system\nYou are participating in a public game chat room. Messages are formatted as [username]: message. Respond naturally and casually as a chat participant.<end_of_turn>\n<start_of_turn>user\n[CHAMPION Leogamer_317]: lafa ti devo aprlare vieni max 2<end_of_turn>\n<start_of_turn>model\n[PastaAlPiscio]: ZuccoAllaPeva zitto<end_of_turn>\n<start_of_turn>user\n[SossTaa_]: sdfdogòkghnlkjsfdg<end_of_turn>\n<start_of_turn>model\n[CHAMPION Snife_]: vuoi fare pure ds?<end_of_turn>\n<start_of_turn>user\n[YOUTUBER OcchiRosa]: OcchiViola no way<end_of_turn>\n<start_of_turn>model\n[VIP OcchiViola]: facciamo una duo OcchiRosa??<end_of_turn>\n<start_of_turn>user\n[YOUTUBER OcchiRosa]: Ciao OcchiPungenti<end_of_turn>\n<start_of_turn>model\n[OcchiPungenti]: Occhi rosa le tue stelle!! nahahahaahah ahahaha<end_of_turn>\n<start_of_turn>user\n[PastaAlPiscio]: OcchiPungenti ma ci vedi con i pungiglioni negli occhi?<end_of_turn>\n<start_of_turn>model\n[Vongola_98]: qualcuno per due dece

Now let's print the masked out example - you should see that only the answers are present:

In [305]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                 [PastaAlPiscio]: ZuccoAllaPeva zitto<end_of_turn>\n                         [CHAMPION Snife_]: vuoi fare pure ds?<end_of_turn>\n                       [VIP OcchiViola]: facciamo una duo OcchiRosa??<end_of_turn>\n                        [OcchiPungenti]: Occhi rosa le tue stelle!! nahahahaahah ahahaha<end_of_turn>\n                                [Vongola_98]: qualcuno per due decente<end_of_turn>\n                     [ValeIlBroCinese]: raga come si fa ad uscire da un clan? /msg per info grazie mille raga<end_of_turn>\n                      [Tartuvoh]: cerco 1 x duo decenteee\n[Tartuvoh]: cerco 1 x duo decenteee<end_of_turn>\n                     [merdasoffice12]: chi duo mi hanno resettato stats<end_of_turn>\n                                  [27kxLory]: la scatola dove sta il parkour<end_of_turn>'

In [345]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
14.57 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`

In [346]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 16,970 | Num Epochs = 1 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 262,307,840 of 4,562,387,312 (5.75% trained)


Step,Training Loss
1,3.0589
2,3.1983
3,3.1442
4,2.8233
5,2.8583
6,3.1525
7,2.9067
8,3.3732
9,3.4512
10,3.1338


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.47 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.05 GiB is free. Process 4303 has 13.69 GiB memory in use. Of the allocated memory 13.03 GiB is allocated by PyTorch, and 512.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
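
The error message above suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to reduce fragmentation. A minimal sketch of setting it from Python follows. It is an assumption that this alone is enough on a T4; the variable needs to be in place before PyTorch initializes CUDA, so the safe choice is to restart the runtime and run it before the first `import torch`. The second half shows the trade hinted at by the `gradient_accumulation_steps` comment: a smaller per-device batch with more accumulation keeps the effective batch size while holding fewer sequences in memory at once. `effective_batch_size` is a hypothetical helper.

```python
import os

# Run at the very top of the notebook, before PyTorch touches the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def effective_batch_size(per_device_batch, grad_accum, num_gpus=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum * num_gpus


# Same effective batch of 4, but only one sequence per forward pass.
assert effective_batch_size(4, 1) == effective_batch_size(1, 4)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```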

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
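
For readers unfamiliar with these three knobs, here is a simplified pure-Python sketch of how temperature, top-k, and top-p interact when picking one token. It is not the `transformers` implementation, and `sample_token` is a hypothetical helper: temperature rescales the logits, top-k keeps the `k` most likely tokens, and top-p then keeps the smallest set of those whose cumulative probability reaches `p`.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Sample one index from raw logits using temperature, top-k, then top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens, then renormalize over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Draw one token from the survivors, weighted by probability.
    weights = [probs[i] for i in kept]
    return random.choices(kept, weights=weights, k=1)[0]


random.seed(0)
print(sample_token([2.0, 1.0, 0.1, -3.0], top_k=2))  # always index 0 or 1
```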

In [333]:
messages = [
    {'role': 'system', 'content': 'You are participating in a public game chat room. Messages are formatted as [username]: message. Respond naturally and casually as a chat participant.'},
    {
        'role': 'user',
        'content': '[8hi]: afkara vuoi giocare con me?'
    }
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
).removeprefix('<bos>')

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 125,
    temperature = 1, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

[CHAMPION emocione]: come si fanno i pesci?
[CHAMPION emocione]: sono da qualche settimana mi sento parppo...<end_of_turn>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")
# model.push_to_hub("your_name/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/gemma-3", token = "...") # Online saving

('gemma-3/tokenizer_config.json',
 'gemma-3/special_tokens_map.json',
 'gemma-3/chat_template.jinja',
 'gemma-3/tokenizer.model',
 'gemma-3/added_tokens.json',
 'gemma-3/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "gemma-3", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = False,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_16bit")
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("gemma-3-finetune")
    tokenizer.save_pretrained("gemma-3-finetune")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/gemma-3-finetune", token = "")
    tokenizer.push_to_hub("hf/gemma-3-finetune", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!

In [None]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma-finetune-gguf",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "hf_...",
    )

Now, use the exported `.gguf` file in llama.cpp. A `Q4_K_M` file is not produced, since only `Q8_0`, `F16` and `BF16` are supported for now.

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

