If you'd rather not install all the packages yourself, Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is available. Start training with no setup or environment issues. [Read the guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

In [None]:
# !pip install -U tensorflow -q
# !pip install -U unsloth vllm -q
# !pip install bitsandbytes accelerate peft -q

In [2]:
import unsloth
from unsloth import FastModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template, train_on_responses_only
import argparse
import logging
import sys
from transformers import TrainingArguments, DataCollatorForSeq2Seq
import os, glob, shutil
import torch
from datasets import load_dataset
from huggingface_hub import login
from trl import SFTTrainer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-10-16 14:59:20.909712: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu128 for torchao version 0.14.0         Please see GitHub issue #2919 for more info


INFO 10-16 14:59:31 [__init__.py:216] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [4]:
OUTPUT_DIR = "gemma-3-finetuned"
MODEL_NAME = "unsloth/gemma-3-4b-it"

# Info about the system

In [5]:
# Log system info
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

PyTorch version: 2.8.0+cu128
CUDA available: True
GPU: NVIDIA L4
GPU memory: 23.6 GB


In [6]:
model, tokenizer = FastModel.from_pretrained(
    model_name = MODEL_NAME,
    max_seq_length = 2048,
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # A bit more accurate, uses 2x memory
    full_finetuning = False # Whether to fine-tune all model weights or just adapters (if available)
)

==((====))==  Unsloth 2025.10.3: Fast Gemma3 patching. Transformers: 4.56.1. vLLM: 0.11.0.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 21.951 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3 does not support SDPA - switching to fast eager.


Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

# Apply LoRA

In [7]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Should leave on!
    finetune_mlp_modules       = True,  # Should leave on!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model.language_model` require gradients
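To make the `r` and `lora_alpha` settings above more concrete, here is a minimal sketch of the standard LoRA arithmetic: a rank-`r` adapter on a `d_out x d_in` weight matrix adds `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / r`. The helper names and the 1024 x 1024 layer size are hypothetical and chosen only for illustration; they are not Gemma-3's actual dimensions, and this is not Unsloth's internal code.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scale applied to the low-rank update in standard (non rank-stabilized) LoRA."""
    return lora_alpha / r


# Hypothetical layer size, for illustration only.
d_in, d_out, r, lora_alpha = 1024, 1024, 8, 8

full_params = d_in * d_out
adapter_params = lora_extra_params(d_in, d_out, r)

print(f"Full matrix parameters:  {full_params:,}")
print(f"LoRA adapter parameters: {adapter_params:,}")
print(f"Adapter share:           {100 * adapter_params / full_params:.2f}%")
print(f"Update scaling:          {lora_scaling(lora_alpha, r)}")
```

With `lora_alpha == r`, as configured above, the scaling factor is 1, so the low-rank update is added to the frozen weights unscaled.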


<a name="Data"></a>
# Data Prep
We now use the `Gemma-3` format for conversation-style finetunes. We use the [rewoo/planner_instruction_tuning_2k](https://huggingface.co/datasets/rewoo/planner_instruction_tuning_2k) dataset, whose records consist of <**Instruction, Input, Output**> fields.

Gemma-3 renders multi-turn conversations like this:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
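The rendering above can be approximated in a few lines of plain Python. This is a simplified sketch, not the real chat template: the function name `render_gemma3` is hypothetical, and it assumes string-only message content with no system prompt, no role-alternation check, and no generation prompt.

```python
def render_gemma3(messages, bos_token="<bos>"):
    """Approximate Gemma-3 rendering for plain-text user/assistant turns."""
    parts = [bos_token]
    for message in messages:
        # The template labels assistant turns as "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_gemma3(conversation)
print(rendered)
```

In practice `tokenizer.apply_chat_template` should be used, since it also covers the cases this sketch leaves out.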

We use the `get_chat_template` function to get the correct chat template. Unsloth natively supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [9]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [13]:
print(tokenizer.chat_template)

{{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}


In [1]:
from datasets import load_dataset
dataset = load_dataset("rewoo/planner_instruction_tuning_2k", split = "train")

# To reduce the training time, we will use a smaller dataset. You can remove this line to use the full dataset.
dataset = dataset.select(range(100))

dataset = dataset.train_test_split(test_size=0.1, seed=3407)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

In [2]:
train_dataset[0]

{'instruction': 'For the following tasks, make plans that can solve the problem step-by-step. For each plan, indicate which external tool together with tool input to retrieve evidence. You can store the evidence into a variable #E that can be called by later tools. (Plan, #E1, Plan, #E2, Plan, ...)\n\nTools can be one of the following:\nWikipedia[input]: Worker that search for similar page contents from Wikipedia. Useful when you need to get holistic knowledge about people, places, companies, historical events, or other subjects. The response are long and might contain some irrelevant information. Input should be a search query.\nLLM[input]: A pretrained LLM like yourself. Useful when you need to act with general world knowledge and common sense. Prioritize it when you are confident in solving the problem yourself. Input can be any instruction.',
 'input': 'Who was the band\'s manager when Black Sabbath released the album that featured the song "Changes"?\n',
 'output': 'Plan: Search f

In [None]:
def formatting_prompts_func(examples):
    """Convert the dataset to the Gemma-3 conversational format."""
    texts = []
    
    for instr, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        # Build the user prompt
        if inp.strip():
            user_content = f"{instr}\n\nInput: {inp}"
        else:
            user_content = instr
        
        # Conversational format
        conversation = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": out}
        ]

        text = tokenizer.apply_chat_template(
            conversation,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)
    
    return {"text": texts}

train_dataset = train_dataset.map(formatting_prompts_func, batched=True, remove_columns=train_dataset.column_names)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True, remove_columns=eval_dataset.column_names)

In [16]:
print(train_dataset[0]['text'])

<bos><start_of_turn>user
For the following tasks, make plans that can solve the problem step-by-step. For each plan, indicate which external tool together with tool input to retrieve evidence. You can store the evidence into a variable #E that can be called by later tools. (Plan, #E1, Plan, #E2, Plan, ...)

Tools can be one of the following:
Wikipedia[input]: Worker that search for similar page contents from Wikipedia. Useful when you need to get holistic knowledge about people, places, companies, historical events, or other subjects. The response are long and might contain some irrelevant information. Input should be a search query.
LLM[input]: A pretrained LLM like yourself. Useful when you need to act with general world knowledge and common sense. Prioritize it when you are confident in solving the problem yourself. Input can be any instruction.

Input: Who was the band's manager when Black Sabbath released the album that featured the song "Changes"?<end_of_turn>
<start_of_turn>mode

# Start Training

In [28]:
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_ratio=0.1,
    num_train_epochs=1,
    learning_rate=1e-4,
    fp16=False,
    bf16=True,
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type='cosine',
    eval_strategy="steps",
    eval_steps=30,
    save_strategy='steps',
    save_steps=30,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    seed=3407,
    output_dir=OUTPUT_DIR,
    report_to="none",
    gradient_checkpointing=True,
)


In [29]:
# Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,  # match the max_seq_length used when loading the model
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, pad_to_multiple_of=8),
    dataset_num_proc=2,
    packing=True,
    args=training_args,
)

In [30]:
# TRAIN ON RESPONSES ONLY
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
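Conceptually, training on responses only means that every label outside a model turn is set to `-100`, which the loss ignores. The following is a minimal sketch of that idea on plain lists of token ids. It is not Unsloth's implementation; the function names and the small integer ids standing in for the `<start_of_turn>user\n` and `<start_of_turn>model\n` markers are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def find_all(seq, pattern):
    """Return every start index where `pattern` occurs in `seq`."""
    n, m = len(seq), len(pattern)
    return [i for i in range(n - m + 1) if seq[i:i + m] == pattern]


def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # The response runs until the next instruction marker or the end.
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels


# Hypothetical ids: [10, 11] marks a user turn, [20, 21] marks a model turn.
input_ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 9]
labels = mask_non_response(input_ids, [10, 11], [20, 21])
print(labels)
```

Only the three tokens after the model marker keep their ids as labels; the prompt and both markers are masked.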

Decoding `input_ids` shows the full sequence, with the user prompt followed by the model response:

In [31]:
print(tokenizer.decode(trainer.train_dataset[1]["input_ids"]))

<bos><bos><start_of_turn>user
For the following tasks, make plans that can solve the problem step-by-step. For each plan, indicate which external tool together with tool input to retrieve evidence. You can store the evidence into a variable #E that can be called by later tools. (Plan, #E1, Plan, #E2, Plan, ...)

Tools can be one of the following:
Wikipedia[input]: Worker that search for similar page contents from Wikipedia. Useful when you need to get holistic knowledge about people, places, companies, historical events, or other subjects. The response are long and might contain some irrelevant information. Input should be a search query.
LLM[input]: A pretrained LLM like yourself. Useful when you need to act with general world knowledge and common sense. Prioritize it when you are confident in solving the problem yourself. Input can be any instruction.

Input: Which team lost the 2009 Superbowl to the Pittsburgh Steelers?<end_of_turn>
<start_of_turn>model
Plan: Search for more informa
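Note that the decoded sample starts with `<bos><bos>`. A plausible explanation is that the chat template already writes `<bos>` into the `text` column and the tokenizer then prepends its own. One common remedy is to strip the leading `<bos>` from the rendered text before it reaches the trainer. The helper below is a hypothetical sketch of that step (the name `strip_leading_bos` is not part of any library) and relies on `str.removeprefix`, available from Python 3.9.

```python
def strip_leading_bos(text: str, bos_token: str = "<bos>") -> str:
    """Remove one leading BOS marker so the tokenizer can add its own."""
    return text.removeprefix(bos_token)


rendered = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"
cleaned = strip_leading_bos(rendered)
print(cleaned)
```

Inside `formatting_prompts_func`, this would amount to applying the helper to the string returned by `tokenizer.apply_chat_template` before appending it to `texts`.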

Decoding the labels shows that only the model response is kept; each masked prompt token is printed as `[MASK]`:

In [32]:
print(tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[1]["labels"]]).replace(tokenizer.pad_token, "[MASK]"))

[MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MASK][MAS

In [33]:
# Training
print("Starting training...")
trainer_stats = trainer.train()
print("Training completed successfully!")

Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 90 | Num Epochs = 1 | Total steps = 12
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 14,901,248 of 4,314,980,720 (0.35% trained)


Step,Training Loss,Validation Loss


Unsloth: Will smartly offload gradients to save VRAM!
Training completed successfully!
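The figures in the training banner can be reproduced with a little arithmetic. The sketch below is a rough reconstruction, not the trainer's own code: the ceiling-based step formula is an assumption that happens to match the logged `Total steps = 12`, and all input values are copied from the log and from `training_args`.

```python
import math

# Values taken from the training banner and TrainingArguments above.
num_examples = 90
per_device_batch = 1
grad_accum = 8
num_gpus = 1
num_epochs = 1
eval_steps = 30
trainable_params = 14_901_248
total_params = 4_314_980_720

effective_batch = per_device_batch * grad_accum * num_gpus
# Assumption: a partial final accumulation window still counts as one step.
total_steps = math.ceil(num_examples / effective_batch) * num_epochs
evals_triggered = total_steps // eval_steps
trainable_pct = 100 * trainable_params / total_params

print(f"Effective batch size: {effective_batch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Step-based evaluations triggered: {evals_triggered}")
print(f"Trainable share: {trainable_pct:.2f}%")
```

Because the run has fewer steps than `eval_steps` and `save_steps` (both 30), no step-based evaluation or checkpoint would be expected, which is consistent with the empty loss table above. Lowering both values below the total step count, or training on more data, would be one way to obtain validation losses.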


In [34]:
# Save model and artifacts
print("Saving model and artifacts...")

# Save the merged model
print("Merging LoRA weights into base model...")
model.save_pretrained_merged(OUTPUT_DIR, tokenizer)

# Export to GGUF (the llama.cpp format)
print("Saving model in GGUF format...")
model.save_pretrained_gguf(
    OUTPUT_DIR,              # Hugging Face folder containing config.json
    tokenizer,
    quantization_method="f16"  # e.g. "q4_k_m", "q8_0", "f16"
)

Saving model and artifacts...
Merging LoRA weights into base model...
Found HuggingFace hub cache directory: /home/sagemaker-user/.cache/huggingface/hub


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Checking cache directory for required files...


Unsloth: Copying 2 files from cache to `gemma-3-finetuned`: 100%|██████████| 2/2 [02:03<00:00, 61.75s/it]


Successfully copied all 2 files from cache to `gemma-3-finetuned`
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `gemma-3-finetuned`: 100%|██████████| 1/1 [00:00<00:00, 16.30it/s]


Successfully copied all 1 files from cache to `gemma-3-finetuned`


Unsloth: Preparing safetensor model files: 100%|██████████| 2/2 [00:00<00:00, 28149.69it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [02:18<00:00, 69.17s/it]


Unsloth: Merge process complete. Saved to `/home/sagemaker-user/finetuning/gemma-3-finetuned`


Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Saving model in GGUF format...
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /home/sagemaker-user/.cache/huggingface/hub


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Checking cache directory for required files...


Unsloth: Copying 2 files from cache to `gemma-3-finetuned`: 100%|██████████| 2/2 [02:02<00:00, 61.22s/it]


Successfully copied all 2 files from cache to `gemma-3-finetuned`
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `gemma-3-finetuned`: 100%|██████████| 1/1 [00:00<00:00, 16.61it/s]


Successfully copied all 1 files from cache to `gemma-3-finetuned`


Unsloth: Preparing safetensor model files: 100%|██████████| 2/2 [00:00<00:00, 36631.48it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [02:14<00:00, 67.48s/it]


Unsloth: Merge process complete. Saved to `/home/sagemaker-user/finetuning/gemma-3-finetuned`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF bf16 might take 3 minutes.
\        /    [2] Converting GGUF bf16 to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: Missing packages: cmake libcurl4-openssl-dev
Unsloth: Will attempt to install missing system packages.
Unsloth: Installing packages: cmake libcurl4-openssl-dev
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...


INFO:unsloth_zoo.llama_cpp: Unsloth: Identifying llama.cpp gguf supported architectures...
INFO:unsloth_zoo.llama_cpp: Unsloth: Applying patches...
INFO:unsloth_zoo.llama_cpp: Unsloth: Saving patched script to llama.cpp/unsloth_convert_hf_to_gguf.py
INFO:unsloth_zoo.llama_cpp: Unsloth: Parsing arguments from patched script...
INFO:unsloth_zoo.llama_cpp: Unsloth: Successfully processed convert_hf_to_gguf.py.


Unsloth: [1] Converting model into bf16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['gemma-3-4b-it.BF16.gguf', 'gemma-3-4b-it.BF16-mmproj.gguf']
Unsloth: [2] Converting GGUF bf16 into f16. This might take 10 minutes...
Unsloth: Model files cleanup...


Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Unsloth: All GGUF conversions completed successfully!
Generated files: ['gemma-3-4b-it.F16.gguf', 'gemma-3-4b-it.BF16-mmproj.gguf']


Unsloth: example usage for Multimodal LLMs: llama-mtmd-cli -m gemma-3-4b-it.F16.gguf --mmproj gemma-3-4b-it.BF16-mmproj.gguf
Unsloth: load image inside llama.cpp runner: /image test_image.jpg
Unsloth: Prompt model to describe the image
Unsloth: Saved Ollama Modelfile to gemma-3-finetuned/Modelfile
Unsloth: convert model to ollama format by running - ollama create model_name -f ./Modelfile - inside save directory.


{'save_directory': 'gemma-3-finetuned',
 'gguf_files': ['gemma-3-4b-it.F16.gguf', 'gemma-3-4b-it.BF16-mmproj.gguf'],
 'modelfile_location': 'gemma-3-finetuned/Modelfile',
 'want_full_precision': False,
 'is_vlm': True,
 'fix_bos_token': True}
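The returned dictionary can be used to assemble the `llama-mtmd-cli` command that the log suggests. This is a small sketch under stated assumptions: the helper name `build_mtmd_command` is hypothetical, the projector file is identified simply by `mmproj` appearing in its name, and the command is expected to be run from the directory that holds the GGUF files.

```python
import shlex

# Copied from the cell output above.
result = {
    "save_directory": "gemma-3-finetuned",
    "gguf_files": ["gemma-3-4b-it.F16.gguf", "gemma-3-4b-it.BF16-mmproj.gguf"],
    "modelfile_location": "gemma-3-finetuned/Modelfile",
}


def build_mtmd_command(gguf_files):
    """Pair the model file with its multimodal projector file."""
    model = next(f for f in gguf_files if "mmproj" not in f)
    mmproj = next(f for f in gguf_files if "mmproj" in f)
    return ["llama-mtmd-cli", "-m", model, "--mmproj", mmproj]


command = build_mtmd_command(result["gguf_files"])
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` once llama.cpp's binaries are on the `PATH`.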