### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

In [None]:
%%capture
# Install latest transformers for Gemma 3N
!pip install --no-deps --upgrade transformers # Only for Gemma 3N
!pip install --no-deps --upgrade timm # Only for Gemma 3N

In [None]:
import os
# Dataset causes too many recompiles
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"

In [None]:
DATASET_PATH = "formatted_training_dataset.jsonl"
VAL_DATASET_PATH = "formatted_validation_dataset.jsonl"
if "COLAB_" not in "".join(os.environ.keys()):
    hf_token = os.environ["HF_TOKEN"]
else:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    hf_token = user_secrets.get_secret("HF_TOKEN")
    DATASET_PATH = "/kaggle/input/example-train-data-jsonl/example_train_data.jsonl"

In [None]:
RUN_NAME="E2B-it-DCC-SFT-V2"

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision, Text and Audio models!

In [None]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit",
    "unsloth/gemma-3n-E2B-it-unsloth-bnb-4bit",
    # Pretrained models
    "unsloth/gemma-3n-E4B-unsloth-bnb-4bit",
    "unsloth/gemma-3n-E2B-unsloth-bnb-4bit",

    # Other Gemma 3 quants
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3n-E2B-it", # Or "unsloth/gemma-3n-E4B-it"
    dtype = None, # None for auto detection
    max_seq_length = 4000, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.5: Fast Gemma3N patching. Transformers: 4.54.1.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.254 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

# Let's finetune Gemma 3N!

For now, you can choose whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working to make it selectable as well!

We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 32,           # Larger = higher accuracy, but might overfit
    lora_alpha = 64,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model.language_model` require gradients
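To make the "small number of parameters" claim concrete, here is a minimal arithmetic sketch of what one LoRA adapter adds to a single linear layer. The layer shape is a hypothetical example, not a dimension read from the Gemma 3N checkpoint, and the helper names are made up for illustration. It assumes the standard LoRA formulation, where the update is scaled by `lora_alpha / r`.

```python
# Illustrative arithmetic only. The 2048 x 2048 layer shape is a hypothetical
# example, not a value read from the Gemma 3N checkpoint.

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scales the low-rank update by lora_alpha / r."""
    return lora_alpha / r


d_in, d_out = 2048, 2048
full = d_in * d_out
added = lora_param_count(d_in, d_out, r=32)

print(f"Frozen weight: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
print(f"Update scale with lora_alpha = 64, r = 32: {lora_scaling(64, 32)}")
```

With `r = 32` and `lora_alpha = 64` as in the cell above, the scale works out to 2, which is one reason raising `r` and `lora_alpha` together keeps the effective update magnitude comparable.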


<a name="Data"></a>
### Data Prep
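We now use the `Gemma-3` format for conversation-style finetunes. The template default, [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style, is left commented out in the loading cell; this run loads its own JSONL files from `DATASET_PATH` and `VAL_DATASET_PATH` instead. Gemma-3 renders multi-turn conversations like below: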

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
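As a rough mental model of what the template produces, the sketch below renders a list of messages into the turn format shown above. `render_gemma3` is a hypothetical helper written for illustration; it is not the template Unsloth installs, and it ignores multimodal content and whitespace trimming. It assumes a system message is folded into the next user turn, which matches the rendered sample shown later in this notebook.

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Hypothetical, simplified renderer for the Gemma-3 turn format."""
    out = "<bos>"
    system_prefix = ""
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            # Assumption: the system text is prepended to the next user turn.
            system_prefix = content + "\n\n"
            continue
        # Gemma-3 calls the assistant role "model".
        turn_role = "model" if role == "assistant" else "user"
        if turn_role == "user":
            content = system_prefix + content
            system_prefix = ""
        out += f"<start_of_turn>{turn_role}\n{content}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        out += "<start_of_turn>model\n"
    return out


print(render_gemma3([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]))
```

For real training and inference, rely on `tokenizer.apply_chat_template`; the sketch is only meant to show where the turn markers go.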

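We load the full training set from `DATASET_PATH` and the validation set from `VAL_DATASET_PATH`, then shuffle the validation set with a fixed seed.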

In [None]:
from datasets import load_dataset
# dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
dataset = load_dataset("json", data_files=DATASET_PATH, split = "train")

val_dataset = load_dataset("json", data_files=VAL_DATASET_PATH, split = "train")
val_dataset = val_dataset.shuffle(3407)
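The JSONL files are expected to hold one conversation per line under a `messages` key, as the sample row printed further down shows. The following is a minimal sketch, using only the standard library, of writing such a file and sanity-checking it before training. The file name, the example contents, and the `validate_row` helper are hypothetical.

```python
import json
import os
import tempfile

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_row(row):
    """Return True if the row looks like {"messages": [{"role", "content"}, ...]}."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )


# Hypothetical example row in the same shape as the training data.
rows = [
    {"messages": [
        {"role": "system", "content": "You explain programming error messages."},
        {"role": "user", "content": "Help me understand this error."},
        {"role": "assistant", "content": "# Error Message\nA short explanation."},
    ]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_train.jsonl")
    # JSONL: one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

all_valid = all(validate_row(r) for r in loaded)
print(f"{len(loaded)} row(s) loaded, all valid: {all_valid}")
```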

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [None]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)
val_dataset = standardize_data_formats(val_dataset)

Let's see what row 100 looks like!

In [None]:
dataset[100]

{'messages': [{'role': 'system',
   'content': 'You explain programming error messages to introductory programming students.\nKeep your response short, friendly, and without jargon.\nProvide debugging guidance, but do not give the solution in code.\nAddress your response to the student directly, using the pronoun "you".\nFollow the provided response structure.\n'},
  {'role': 'user',
   'content': 'This is my C program\n```C\n#include <assert.h>\n#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_LIST_LEN 100\n\nstruct node {\n    int data;\n    struct node *next;\n};\n\nstruct node *delete_last(struct node *head);\nvoid print_list(struct node *head);\nstruct node *array_to_list(int len, int array[]);\n\n// DO NOT CHANGE THE MAIN FUNCTION\nint main(void) {\n    // Get list size\n    int list_size;\n    printf("Total numbers: ");\n    scanf(" %d", &list_size);\n\n    // Read in numbers\n    int list[MAX_LIST_LEN] = {0};\n    int index_count = 0;\n    while (index_count < list_size &

We now have to apply the chat template for `Gemma-3` to the conversations and save the result to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.

In [None]:
def formatting_prompts_func(examples):
    convos = examples["messages"]
    # Render each conversation to text and drop the leading <bos>
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)
val_dataset = val_dataset.map(formatting_prompts_func, batched = True)
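The reason for stripping `<bos>` is easy to check with plain strings. The sketch below simulates a tokenizer that always prepends `<bos>`; `fake_tokenizer_prepend` is a stand-in written for this example, not a real tokenizer call. `str.removeprefix` needs Python 3.9 or newer.

```python
BOS = "<bos>"


def fake_tokenizer_prepend(text: str) -> str:
    """Stand-in for a tokenizer/processor that always adds one <bos> in front."""
    return BOS + text


rendered = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"

# Without stripping, the text ends up with two <bos> markers.
doubled = fake_tokenizer_prepend(rendered)

# removeprefix only drops a leading match, so the rest of the text is untouched.
single = fake_tokenizer_prepend(rendered.removeprefix(BOS))

print(doubled.count(BOS), single.count(BOS))
```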

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [None]:
dataset[100]["text"]

'<start_of_turn>user\nYou explain programming error messages to introductory programming students.\nKeep your response short, friendly, and without jargon.\nProvide debugging guidance, but do not give the solution in code.\nAddress your response to the student directly, using the pronoun "you".\nFollow the provided response structure.\n\n\nThis is my C program\n```C\n#include <assert.h>\n#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_LIST_LEN 100\n\nstruct node {\n    int data;\n    struct node *next;\n};\n\nstruct node *delete_last(struct node *head);\nvoid print_list(struct node *head);\nstruct node *array_to_list(int len, int array[]);\n\n// DO NOT CHANGE THE MAIN FUNCTION\nint main(void) {\n    // Get list size\n    int list_size;\n    printf("Total numbers: ");\n    scanf(" %d", &list_size);\n\n    // Read in numbers\n    int list[MAX_LIST_LEN] = {0};\n    int index_count = 0;\n    while (index_count < list_size && scanf(" %d", &list[index_count])) {\n        index_count++

In [None]:
epoch_steps = 1250
print(f"Est. epoch Steps: {epoch_steps}")
warmup_steps = 125
print(f"Warmup Steps: {warmup_steps}")


# epoch_steps // (100 // EVAL_STEPS_FACTOR)
eval_steps = 250
logging_steps = 50
print(f"eval Steps: {eval_steps}")
checkpoint_steps = 625

print(f"logging every {logging_steps} steps")

Est. epoch Steps: 1250
Warmup Steps: 125
eval Steps: 250
logging every 50 steps
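The hard-coded step counts above are consistent with the run reported in the training log: 40,000 examples at a per-device batch size of 4 with 8 gradient accumulation steps on 1 GPU. As a sketch, they could be derived rather than typed in. The fractions used for warmup, evaluation and checkpointing (10%, 20% and 50% of an epoch) are assumptions chosen to reproduce the values above, not something the notebook states.

```python
import math

# Values taken from the training log in this notebook.
num_examples = 40_000
per_device_batch = 4
grad_accum = 8
num_gpus = 1

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
epoch_steps = math.ceil(num_examples / effective_batch)

# Assumed fractions of an epoch, picked to match the hard-coded values.
warmup_steps = epoch_steps // 10
eval_steps = epoch_steps // 5
checkpoint_steps = epoch_steps // 2

print(f"effective batch = {effective_batch}, epoch steps = {epoch_steps}")
print(f"warmup = {warmup_steps}, eval every {eval_steps}, checkpoint every {checkpoint_steps}")
```

Deriving the numbers this way keeps the schedule in sync if the dataset size or batch settings change.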


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for one full epoch with `num_train_epochs = 1`. For a quick smoke test, uncomment `max_steps` to cap the run at a small number of steps.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = val_dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8, # Use GA to mimic batch size!
        warmup_steps = warmup_steps,
        num_train_epochs = 1, # Set this for 1 full training run.
        # max_steps = 20,
        learning_rate = 2e-5, # Lower rate, suited to long training runs
        logging_steps = logging_steps,
        optim = "paged_adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "mlflow",
        run_name = RUN_NAME,
        output_dir = "unsloth_training_checkpoints",
        save_strategy = "steps",
        save_steps = checkpoint_steps,
        eval_steps = eval_steps,
        eval_strategy = "steps",
        # eval_on_start = True,
    ),
)

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)
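Conceptually, response-only training sets the label of every non-response token to `-100`, which the cross-entropy loss ignores. The sketch below shows that idea on toy token ids. It is a simplified illustration under stated assumptions, not Unsloth's implementation: `mask_non_responses`, `find_sub` and all ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    n = len(sub)
    for i in range(start, len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


# Toy ids: 2 = <bos>, [10, 11] = "<start_of_turn>user\n",
# [10, 12] = "<start_of_turn>model\n", 9 = <end_of_turn>.
USER, MODEL = [10, 11], [10, 12]
ids = [2, 10, 11, 5, 6, 9, 10, 12, 7, 8, 9]
labels = mask_non_responses(ids, USER, MODEL)
print(labels)
```

Only the tokens of the model turn keep their ids as labels; the `<bos>`, the user turn and both turn markers are masked.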

Let's verify that the instruction part is masked. First, print row 100 again. Notice how the sample has only a single `<bos>`, as expected!

In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nYou explain programming error messages to introductory programming students.\nKeep your response short, friendly, and without jargon.\nProvide debugging guidance, but do not give the solution in code.\nAddress your response to the student directly, using the pronoun "you".\nFollow the provided response structure.\n\n\nThis is my C program\n```C\n#include <assert.h>\n#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_LIST_LEN 100\n\nstruct node {\n    int data;\n    struct node *next;\n};\n\nstruct node *delete_last(struct node *head);\nvoid print_list(struct node *head);\nstruct node *array_to_list(int len, int array[]);\n\n// DO NOT CHANGE THE MAIN FUNCTION\nint main(void) {\n    // Get list size\n    int list_size;\n    printf("Total numbers: ");\n    scanf(" %d", &list_size);\n\n    // Read in numbers\n    int list[MAX_LIST_LEN] = {0};\n    int index_count = 0;\n    while (index_count < list_size && scanf(" %d", &list[index_count])) {\n        index_co

Now let's print the masked out example - you should see only the answer is present:

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      # Error Message\nThe compiler is telling you that the way you defined the `delete_last` function does not match how you first des

In [None]:
# @title Show current memory stats
import gc
gc.collect()
torch.cuda.empty_cache()
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-80GB. Max memory = 79.254 GB.
7.766 GB of memory reserved.


# Let's train the model!

To resume a training run from the latest checkpoint, call `trainer.train(resume_from_checkpoint = True)`.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 40,000 | Num Epochs = 1 | Total steps = 1,250
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 42,270,720 of 5,481,708,992 (0.77% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | Validation Loss |
| --- | --- | --- |
| 250 | 7.2379 | 1.157944 |
| 500 | 6.2308 | 1.110996 |
| 750 | 5.805 | 1.11306 |
| 1000 | 5.5271 | 1.109712 |
| 1250 | 5.4268 | 1.118261 |


Unsloth: Not an error, but Gemma3nForConditionalGeneration does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


In [None]:
# @title Show final memory and time stats
gc.collect()
torch.cuda.empty_cache()
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

25078.5991 seconds used for training.
417.98 minutes used for training.
Peak reserved memory = 64.391 GB.
Peak reserved memory for training = 56.625 GB.
Peak reserved memory % of max memory = 81.246 %.
Peak reserved memory for training % of max memory = 71.447 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "You explain programming error messages to introductory programming students.\nKeep your response short, friendly, and without jargon.\nProvide debugging guidance, but do not give the solution in code.\nAddress your response to the student directly, using the pronoun \"you\".\nFollow the provided response structure.\nThis is my C program\n```C\n#include <stdio.h>\n\nint main(void) {\n\tchar tempChar;\n\n\tscanf(\"%c\", &tempChar);\n\n\twhile (tempChar != '\\0') {\n\t\tif (tempChar != 'a' && tempChar != 'e' && tempChar != 'i' && tempChar != 'o' && tempChar != 'u') {\n\t\t\tprintf(\"%c\", tempChar);\n\t\t}\n\n\t\tscanf(\"%c\", &tempChar);\n\t}\n\n}\n```\nHelp me understand this error:\n```\nExecution was interruptedExecution stopped in main() in no_vowels.c at line 13:\n\n\twhile (tempChar != '\\0') {\n\t\tif (tempChar != 'a' && tempChar != 'e' && tempChar != 'i' && tempChar != 'o' && tempChar != 'u') {\n\t\t\tprintf(\"%c\", tempChar);\n\t\t}\n\n-->\t\tscanf(\"%c\", &tempChar);\n\t}\n\n}\ntempChar = 10 = '\\n'\n\n```\nIt was given this input\n```\n\"Peter piper picked a pickled pepper\".\n```\nWithout providing the solution, give me guidance on how to resolve this error, in the following structure.\n# Response Structure\n1. Begin with the subheading (# Error Message)\n2. Follow with one short sentence, explaining the error message without programming jargon.\n3. Next, write the subheading (# Potential Causes)\n4. Follow with 1-2 short sentences, identifying and explaining potential issues in my code that may be causing this error.\n5. Next, write the subheading (# Hints/Guidance)\n6. Follow with 1-2 short sentences, giving debugging hints and guidance. This can include specific references to my code.\n",
    }]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 150, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><start_of_turn>user\nYou explain programming error messages to introductory programming students.\nKeep your response short, friendly, and without jargon.\nProvide debugging guidance, but do not give the solution in code.\nAddress your response to the student directly, using the pronoun "you".\nFollow the provided response structure.\nThis is my C program\n```C\n#include <stdio.h>\n\nint main(void) {\n\tchar tempChar;\n\n\tscanf("%c", &tempChar);\n\n\twhile (tempChar != \'\\0\') {\n\t\tif (tempChar != \'a\' && tempChar != \'e\' && tempChar != \'i\' && tempChar != \'o\' && tempChar != \'u\') {\n\t\t\tprintf("%c", tempChar);\n\t\t}\n\n\t\tscanf("%c", &tempChar);\n\t}\n\n}\n```\nHelp me understand this error:\n```\nExecution was interruptedExecution stopped in main() in no_vowels.c at line 13:\n\n\twhile (tempChar != \'\\0\') {\n\t\tif (tempChar != \'a\' && tempChar != \'e\' && tempChar != \'i\' && tempChar != \'o\' && tempChar != \'u\') {\n\t\t\tprintf("%c", tempChar);\n\t\t}\n\n-

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# messages = [{
#     "role": "user",
#     "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
# }]
# inputs = tokenizer.apply_chat_template(
#     messages,
#     add_generation_prompt = True, # Must add for generation
#     return_tensors = "pt",
#     tokenize = True,
#     return_dict = True,
# ).to("cuda")

# from transformers import TextStreamer
# _ = model.generate(
#     **inputs,
#     max_new_tokens = 64, # Increase for longer outputs!
#     # Recommended Gemma-3 settings!
#     temperature = 1.0, top_p = 0.95, top_k = 64,
#     streamer = TextStreamer(tokenizer, skip_prompt = True),
# )

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# !curl -fsSL https://ollama.com/install.sh | sh

In [None]:
model.save_pretrained("gemma-3n-E2B-it-DCC-SFT-V2")  # Local saving
tokenizer.save_pretrained("gemma-3n-E2B-it-DCC-SFT-V2")
model.push_to_hub("repo/gemma-3n-E2B-it-DCC-SFT-V2-lora", token = hf_token, private=True) # Online saving
tokenizer.push_to_hub("repo/gemma-3n-E2B-it-DCC-SFT-V2-lora", token = hf_token, private=True) # Online saving

README.md:   0%|          | 0.00/605 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/169M [00:00<?, ?B/s]

Saved model to https://huggingface.co/z5258621/gemma-3n-E2B-it-DCC-SFT-V2-lora


  0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/4.70M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:

In [None]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3n-E2B-it-DCC-SFT-V2", # Folder the LoRA adapters were saved to above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is Gemma-3N?",}]
}]
# inputs = tokenizer.apply_chat_template(
#     messages,
#     add_generation_prompt = True, # Must add for generation
#     return_tensors = "pt",
#     tokenize = True,
#     return_dict = True,
# ).to("cuda")

# from transformers import TextStreamer
# _ = model.generate(
#     **inputs,
#     max_new_tokens = 128, # Increase for longer outputs!
#     # Recommended Gemma-3 settings!
#     temperature = 1.0, top_p = 0.95, top_k = 64,
#     streamer = TextStreamer(tokenizer, skip_prompt = True),
# )

### Saving to float16 for vLLM

We also support saving to `float16` directly for deployment! We save it in the folder `gemma-3n-E2B-it-DCC-SFT-V2`. The cell below is already enabled; change `if True` to `if False` to skip it.

In [None]:
if True: # Set to False to skip saving the merged finetune
    model.save_pretrained_merged("gemma-3n-E2B-it-DCC-SFT-V2", tokenizer)

Found HuggingFace hub cache directory: /home/jovyan/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00003.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Downloading safetensors index for unsloth/gemma-3n-e2b-it...


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Unsloth: Merging weights into 16bit:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/3.08G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:  33%|███▎      | 1/3 [00:26<00:52, 26.34s/it]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:  67%|██████▋   | 2/3 [01:26<00:46, 46.41s/it]

model-00003-of-00003.safetensors:   0%|          | 0.00/2.82G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 3/3 [01:53<00:00, 37.79s/it]


To upload the merged model to your Hugging Face account, provide your Hugging Face token and upload location. The cell below is already enabled; change `if True` to `if False` to skip it.

In [None]:
if True: # Set to False to skip uploading the merged finetune
    model.push_to_hub_merged(
        "repo/gemma-3n-E2B-it-DCC-SFT-V2", tokenizer,
        token = hf_token,
        private=True,
    )

  0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.70M [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /home/jovyan/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00003.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Downloading safetensors index for unsloth/gemma-3n-e2b-it...


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Unsloth: Merging weights into 16bit:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/3.08G [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/3.08G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:  33%|███▎      | 1/3 [00:40<01:21, 40.72s/it]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:  67%|██████▋   | 2/3 [02:14<01:12, 72.13s/it]

model-00003-of-00003.safetensors:   0%|          | 0.00/2.82G [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/2.82G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 3/3 [03:13<00:00, 64.54s/it]


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively for all models! For now, you can convert easily to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4-bit will come later!

In [None]:
# if True: # Change to True to save to GGUF
#     model.save_pretrained_gguf(
#         "gemma-3n-E2B-it-DCC-SFT-V2",
#         quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
#     )
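As a rough guide to choosing between these precisions, the sketch below estimates how a `Q8_0` file compares in size with a 16-bit one. It assumes the usual `Q8_0` block layout in llama.cpp (32 one-byte weights plus one 2-byte scale per block) and that every tensor is quantized, which is a simplification since some small tensors may stay at higher precision. The 8.9 GB input is the 16-bit size reported in the conversion logs of this notebook.

```python
# Assumed Q8_0 block layout: 32 int8 weights + one float16 scale = 34 bytes.
Q8_0_BLOCK_WEIGHTS = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2
BYTES_PER_WEIGHT_16BIT = 2  # F16 and BF16 both use 2 bytes per weight

q8_bytes_per_weight = Q8_0_BLOCK_BYTES / Q8_0_BLOCK_WEIGHTS
ratio = q8_bytes_per_weight / BYTES_PER_WEIGHT_16BIT

size_16bit_gb = 8.9  # size reported for the BF16 / F16 GGUF in this notebook
estimated_q8_gb = size_16bit_gb * ratio

print(f"Q8_0 uses {q8_bytes_per_weight * 8:.1f} bits per weight")
print(f"Estimated Q8_0 size: {estimated_q8_gb:.1f} GB")
```

The estimate lands close to the `Q8_0` size reported by the conversion log, so roughly halving the file is a reasonable expectation when moving from 16-bit to `Q8_0`.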

Likewise, to push the GGUF file to your Hugging Face account, provide your Hugging Face token and upload location. The cells below are already enabled; change `if True` to `if False` to skip one.

In [None]:
if True: # Set to False to skip the Q8_0 GGUF upload
    model.push_to_hub_gguf(
        "gemma-3n-E2B-it-DCC-SFT-V2",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "repo/gemma-3n-E2B-it-DCC-SFT-V2-gguf",
        token = hf_token,
        # private=True,
    )

Unsloth GGUF:hf-to-gguf:Loading model: gemma-3n-E2B-it-DCC-SFT-V2
Unsloth GGUF:hf-to-gguf:Model architecture: Gemma3nForConditionalGeneration
Unsloth GGUF:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
Unsloth GGUF:hf-to-gguf:Exporting model...
Unsloth GGUF:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
Unsloth GGUF:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
Unsloth GGUF:hf-to-gguf:altup_proj.weight,                 torch.bfloat16 --> Q8_0, shape = {2048, 2048, 3}
Unsloth GGUF:hf-to-gguf:altup_unembd_proj.weight,          torch.bfloat16 --> Q8_0, shape = {2048, 2048, 3}
Unsloth GGUF:hf-to-gguf:token_embd.weight,                 torch.bfloat16 --> Q8_0, shape = {2048, 262144}
Unsloth GGUF:hf-to-gguf:gguf: loading model part 'model-00002-of-00003.safetensors'
Unsloth GGUF:hf-to-gguf:per_layer_token_embd.weight,       torch.bfloat16 --> Q8_0, shape = {7680, 262144}
Unsloth GGUF:hf-to-gguf:output_norm.we

Unsloth: GGUF conversion:   0%|          | 0/100 [00:00<?, ?it/s]

Unsloth GGUF:hf-to-gguf:Model successfully exported to ./
Unsloth: Converted to gemma-3n-E2B-it-DCC-SFT-V2.Q8_0.gguf with size = 4.7G
Unsloth: Successfully saved GGUF to:
gemma-3n-E2B-it-DCC-SFT-V2.Q8_0.gguf


No files have been modified since last commit. Skipping to prevent empty commit.


In [None]:
if True: # Set to False to skip the BF16 GGUF upload
    model.push_to_hub_gguf(
        "gemma-3n-E2B-it-DCC-SFT-V2",
        quantization_type = "BF16", # Only Q8_0, BF16, F16 supported
        repo_id = "repo/gemma-3n-E2B-it-DCC-SFT-V2-gguf",
        token = hf_token,
        # private=True,
    )

Unsloth GGUF:hf-to-gguf:Loading model: gemma-3n-E2B-it-DCC-SFT-V2
Unsloth GGUF:hf-to-gguf:Model architecture: Gemma3nForConditionalGeneration
Unsloth GGUF:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
Unsloth GGUF:hf-to-gguf:Exporting model...
Unsloth GGUF:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
Unsloth GGUF:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
Unsloth GGUF:hf-to-gguf:altup_proj.weight,                 torch.bfloat16 --> BF16, shape = {2048, 2048, 3}
Unsloth GGUF:hf-to-gguf:altup_unembd_proj.weight,          torch.bfloat16 --> BF16, shape = {2048, 2048, 3}
Unsloth GGUF:hf-to-gguf:token_embd.weight,                 torch.bfloat16 --> BF16, shape = {2048, 262144}
Unsloth GGUF:hf-to-gguf:gguf: loading model part 'model-00002-of-00003.safetensors'
Unsloth GGUF:hf-to-gguf:per_layer_token_embd.weight,       torch.bfloat16 --> BF16, shape = {7680, 262144}
Unsloth GGUF:hf-to-gguf:output_norm.we

Unsloth: GGUF conversion:   0%|          | 0/100 [00:00<?, ?it/s]

Unsloth GGUF:hf-to-gguf:Model successfully exported to ./
Unsloth: Converted to gemma-3n-E2B-it-DCC-SFT-V2.BF16.gguf with size = 8.9G


No files have been modified since last commit. Skipping to prevent empty commit.


Unsloth: Successfully saved GGUF to:
gemma-3n-E2B-it-DCC-SFT-V2.BF16.gguf


No files have been modified since last commit. Skipping to prevent empty commit.


In [None]:
if True: # Set to False to skip the F16 GGUF upload
    model.push_to_hub_gguf(
        "gemma-3n-E2B-it-DCC-SFT-V2",
        quantization_type = "F16", # Only Q8_0, BF16, F16 supported
        repo_id = "repo/gemma-3n-E2B-it-DCC-SFT-V2-gguf",
        token = hf_token,
        # private=True,
    )

Unsloth GGUF:hf-to-gguf:Loading model: gemma-3n-E2B-it-DCC-SFT-V2
Unsloth GGUF:hf-to-gguf:Model architecture: Gemma3nForConditionalGeneration
Unsloth GGUF:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
Unsloth GGUF:hf-to-gguf:Exporting model...
Unsloth GGUF:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
Unsloth GGUF:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
Unsloth GGUF:hf-to-gguf:altup_proj.weight,                 torch.bfloat16 --> F16, shape = {2048, 2048, 3}
Unsloth GGUF:hf-to-gguf:altup_unembd_proj.weight,          torch.bfloat16 --> F16, shape = {2048, 2048, 3}
Unsloth GGUF:hf-to-gguf:token_embd.weight,                 torch.bfloat16 --> F16, shape = {2048, 262144}
Unsloth GGUF:hf-to-gguf:gguf: loading model part 'model-00002-of-00003.safetensors'
Unsloth GGUF:hf-to-gguf:per_layer_token_embd.weight,       torch.bfloat16 --> F16, shape = {7680, 262144}
Unsloth GGUF:hf-to-gguf:output_norm.weight

Unsloth: GGUF conversion:   0%|          | 0/100 [00:00<?, ?it/s]

Unsloth GGUF:hf-to-gguf:Model successfully exported to ./
Unsloth: Converted to gemma-3n-E2B-it-DCC-SFT-V2.F16.gguf with size = 8.9G


No files have been modified since last commit. Skipping to prevent empty commit.


Unsloth: Successfully saved GGUF to:
gemma-3n-E2B-it-DCC-SFT-V2.F16.gguf


No files have been modified since last commit. Skipping to prevent empty commit.


In [None]:
# May need to save this to a file.
print(tokenizer._ollama_modelfile)


FROM {__FILE_LOCATION__}
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if or (eq .Role "user") (eq .Role "system") }}<start_of_turn>user
{{ .Content }}<end_of_turn>
{{ if $last }}<start_of_turn>model
{{ end }}
{{- else if eq .Role "assistant" }}<start_of_turn>model
{{ .Content }}{{ if not $last }}<end_of_turn>
{{ end }}
{{- end }}
{{- end }}"""
PARAMETER stop "<end_of_turn>"
PARAMETER stop "<eos>"
PARAMETER temperature 0.1
PARAMETER min_p 0.0
PARAMETER top_k 64
PARAMETER top_p 0.95
PARAMETER num_predict 32768
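Since the printed Modelfile still contains the `{__FILE_LOCATION__}` placeholder, it needs that value filled in before Ollama can use it. Here is a minimal sketch of doing so and writing the result to disk. `write_modelfile` is a hypothetical helper, the short template string is a stand-in for `tokenizer._ollama_modelfile`, and the GGUF path assumes the `Q8_0` file produced above sits in the working directory.

```python
import os
import tempfile
from pathlib import Path

PLACEHOLDER = "{__FILE_LOCATION__}"


def write_modelfile(template: str, gguf_path: str, out_path: str) -> str:
    """Fill in the GGUF location and write the Modelfile; returns the written text."""
    content = template.replace(PLACEHOLDER, gguf_path).lstrip("\n")
    Path(out_path).write_text(content, encoding="utf-8")
    return content


# Stand-in for tokenizer._ollama_modelfile, shortened for the example.
example_template = '\nFROM {__FILE_LOCATION__}\nPARAMETER stop "<end_of_turn>"\n'

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "Modelfile")
    content = write_modelfile(
        example_template,
        "./gemma-3n-E2B-it-DCC-SFT-V2.Q8_0.gguf",  # assumed local GGUF path
        out_path,
    )
    written = Path(out_path).read_text(encoding="utf-8")

print(content)
```

In the notebook, the same helper could be called with `tokenizer._ollama_modelfile` as the template and `"Modelfile"` as the output path.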
