### Notebook Explanation

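This notebook fine-tunes the `unsloth/gemma-3n-E2B-unsloth-bnb-4bit` model with LoRA adapters using Unsloth, then pushes both the adapter and a merged 16-bit model to the Hugging Face Hub.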

#### Cell 1: Installation
This cell installs the libraries the notebook needs: `unsloth`, `bitsandbytes`, `accelerate`, `xformers`, `peft`, `trl`, `triton`, `cut_cross_entropy`, `unsloth_zoo`, `sentencepiece`, `protobuf`, `datasets`, `huggingface_hub`, and `hf_transfer`, plus an upgraded `timm` for Gemma 3N. Most of them are installed with `--no-deps`, so pip does not pull in or replace their dependencies.

In [None]:
%%capture
import os
# Do this only in Colab notebooks! Otherwise use pip install unsloth
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
!pip install --no-deps unsloth
!pip install --no-deps --upgrade timm # Only for Gemma 3N
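
Because most packages above are installed with `--no-deps`, it can help to confirm what actually ended up in the environment. A minimal sketch using only the standard library; the helper name `report_versions` is hypothetical, and the package list simply mirrors a few names from the commands above.

```python
from importlib.metadata import PackageNotFoundError, version

def report_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in report_versions(["unsloth", "trl", "peft", "xformers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```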

### Load model
#### Cell 2: Load Model
This cell loads the pre-trained `unsloth/gemma-3n-E2B-unsloth-bnb-4bit` model and tokenizer from the Hugging Face Hub. The model is loaded in 4-bit precision to save memory.


In [None]:
from unsloth import FastModel
import torch
from google.colab import userdata

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3n-E2B-unsloth-bnb-4bit",
    dtype = None,
    max_seq_length = 1024,
    load_in_4bit = True,
    full_finetuning = False,
    token=userdata.get('HF_ACCESS_TOKEN')
)

==((====))==  Unsloth 2025.7.8: Fast Gemma3N patching. Transformers: 4.53.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!
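
As a rough sense of why 4-bit loading matters, weight memory scales with the number of bits per parameter. The sketch below is back-of-the-envelope arithmetic only: the parameter count is the total reported in the training log later in this notebook, the helper name `weight_gb` is hypothetical, and the 4-bit figure is an idealized lower bound that ignores quantization metadata and any tensors kept at higher precision.

```python
TOTAL_PARAMS = 5_450_005_952  # total parameter count reported in the training log

def weight_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

print(f"16-bit: {weight_gb(TOTAL_PARAMS, 16):.1f} GB")
print(f" 4-bit: {weight_gb(TOTAL_PARAMS, 4):.1f} GB (idealized)")
```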



### Untrained inference
#### Cell 3: Untrained Inference
This cell defines a helper that runs inference with the base model before any fine-tuning and streams the generated text. It is then called with the raw prompt "Donald trump" (no chat template is applied), so the model simply continues the text.


In [None]:
from transformers import TextStreamer

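def do_gemma_3n_inference(question, max_new_tokens = 50):
    # `tokenizer` wraps the text tokenizer, which is reached through `.tokenizer`.
    inputs = tokenizer.tokenizer(question, return_tensors = "pt").to("cuda")
    # Stream generated tokens to the output as they are produced, skipping the prompt.
    _ = model.generate(
        **inputs,
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )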

question = "Donald trump"
do_gemma_3n_inference(question)

 has been impeached, but the US president has not been charged with any crime.

The House of Representatives voted to impeach the Republican president on Wednesday, but the US Senate will not be able to convict him and remove him from office.

The


### Prepare for finetune
#### Cell 4: Prepare for Finetune
This cell wraps the model with LoRA (Low-Rank Adaptation) adapters for parameter-efficient fine-tuning (PEFT). The adapters use rank `r = 8` and `lora_alpha = 8` and are attached to the attention and MLP modules of the language layers; the vision layers are not fine-tuned.

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,

    r = 8,
    lora_alpha = 8,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model.language_model` require gradients
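
To make the `r` and `lora_alpha` settings concrete: LoRA adds two small matrices per adapted weight, of shapes `(r, d_in)` and `(d_out, r)`, and scales their product by `lora_alpha / r`. A short sketch of the resulting arithmetic; the 2048 x 2048 layer shape is made up for illustration, while the two parameter totals are the ones the training log reports later in this notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Extra trainable parameters LoRA adds to one (d_out x d_in) weight."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

# Hypothetical 2048 x 2048 projection with the notebook's r = 8.
print(lora_param_count(2048, 2048, r=8))  # 32768 extra parameters
print(lora_scaling(lora_alpha=8, r=8))    # 1.0

# Trainable fraction, using the counts from the training log.
trainable, total = 10_567_680, 5_450_005_952
print(f"{100 * trainable / total:.2f}% trained")
```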


### Prepare dataset
#### Cell 5: Prepare Dataset
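This cell loads a dataset from the Hugging Face Hub; `Use_YOUR_DATASET_FROM_HF` is a placeholder for your own dataset ID. The dataset needs a `conversations` column, which is rendered to text with the Gemma-3 chat template. The leading `<bos>` token is removed from each rendered string because it is added again at tokenization time, as the decoded sample in the debugging section shows.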

In [None]:
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

dataset = load_dataset("Use_YOUR_DATASET_FROM_HF", split = "train")

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

def formatting_prompts_func(examples):
    # Render each conversation to text; drop the leading <bos> (it is re-added at tokenization).
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>')
        for convo in examples["conversations"]
    ]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)
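
The chat template renders each conversation into a simple turn structure; a real rendered sample appears in the output of the next cell. The hand-written function below imitates that structure for text-only conversations so the format is easier to see. It is not the tokenizer's actual template, and it assumes each message is a dict with `role` and `content` keys, with the `assistant` role rendered as `model`.

```python
def render_gemma_turns(conversation):
    """Imitate the Gemma turn layout for text-only messages."""
    parts = []
    for message in conversation:
        # Assumption: "assistant" turns are labelled "model" in the rendered text.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    return "".join(parts)

example = [
    {"role": "user", "content": "But your mom was funny."},
    {"role": "assistant", "content": "She was funny, yes."},
]
print(render_gemma_turns(example))
```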

In [None]:
dataset[79]["text"]

"<start_of_turn>user\nYeah, they're Afraid.<end_of_turn>\n<start_of_turn>model\nThe Scottish people, they're tough people. They're good people, actually. They're very great people, but they're good fighters.<end_of_turn>\n<start_of_turn>user\nBut your mom was funny.<end_of_turn>\n<start_of_turn>model\nShe was funny, yes.<end_of_turn>\n"

### Train model
#### Cell 6: Train Model
This cell initializes the `SFTTrainer` from the TRL (Transformer Reinforcement Learning) library and sets the training arguments: a per-device batch size of 1 with 4 gradient accumulation steps, a learning rate of 2e-4 with a linear schedule and 5 warmup steps, and one training epoch.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 1, # Set this for 1 full training run.
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)
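
The arguments above determine the number of optimizer steps and the shape of the learning-rate curve. The sketch below reproduces that arithmetic in plain Python. The example count (1,725) is the one the training log reports, the function names are hypothetical, and the schedule is a plain restatement of the usual linear warmup-then-decay rule rather than code taken from TRL.

```python
import math

def total_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Optimizer steps for the given number of epochs."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return epochs * math.ceil(num_examples / effective_batch)

def linear_lr(step, base_lr, warmup_steps, num_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (num_steps - step) / max(1, num_steps - warmup_steps))

steps = total_steps(num_examples=1725, per_device_batch=1, grad_accum=4)
print(steps)  # 432, matching "Total steps" in the training log
for step in (1, 5, 200, steps):
    print(step, f"{linear_lr(step, 2e-4, warmup_steps=5, num_steps=steps):.2e}")
```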

#### Cell 7: Train on Responses Only
This cell wraps the trainer so that the loss is computed only on the model's responses; tokens belonging to the user turns are masked out of the labels.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)
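
Masked positions receive the label `-100`, which the loss ignores; the debugging cells further down show the effect on a real sample. The toy function below sketches the masking idea on a list of string tokens with single-token turn markers. The real `train_on_responses_only` works on token IDs and multi-token markers, so treat this as an illustration of the concept only.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_non_response(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = []
    in_response = False
    for token in tokens:
        if token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        elif token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a training target
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels

tokens = ["<user>", "hi", "<end>", "<model>", "hello", "<end>"]
print(mask_non_response(tokens, "<user>", "<model>"))
# [-100, -100, -100, -100, 'hello', '<end>']
```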

### Debugging
#### Cell 8: Debugging
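The first cell decodes a sample from the training dataset to verify that the data is formatted correctly. The second cell decodes the same sample's labels, with masked positions (label `-100`) shown as spaces, to confirm that only the model's response contributes to the loss.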


In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

"<bos><start_of_turn>user\nI think people agree and they want a strong border and I think, I don't think that's an issue that's like tearing the country apart. But I do think that there is something here.<end_of_turn>\n<start_of_turn>model\nThere are some people that want to open and, and they're either, but they're.<end_of_turn>\n"

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

"                                                    There are some people that want to open and, and they're either, but they're.<end_of_turn>\n"

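#### Cell 9: Start Training
This cell starts the training process. The log reports 432 optimizer steps for one epoch over 1,725 examples at an effective batch size of 4; only the first 10 logged losses are reproduced below.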

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,725 | Num Epochs = 1 | Total steps = 432
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 10,567,680 of 5,450,005,952 (0.19% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 4.7557        |
| 2    | 4.6285        |
| 3    | 5.5322        |
| 4    | 5.0814        |
| 5    | 7.7581        |
| 6    | 5.0053        |
| 7    | 4.1596        |
| 8    | 4.3548        |
| 9    | 3.9635        |
| 10   | 4.0532        |


#### Cell 10: Do Inference
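This cell performs inference with the fine-tuned model. Unlike the earlier raw-prompt test, the prompt "What do you think about Joe biden?" is formatted with the chat template and `add_generation_prompt = True`. As the output shows, generation continues past the first `<end_of_turn>` until `max_new_tokens` is reached, so the reply repeats.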

In [None]:
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What do you think about Joe biden?",}]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

from transformers import TextStreamer
_ = model.generate(
    **inputs,
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

I think he's a terrible president.<end_of_turn>
<start_of_turn>user
I think he's a terrible president.<end_of_turn>
<start_of_turn>user
I think he's a terrible president.<end_of_turn>
<start_of_turn>user
I think he's a terrible president.<end_of_turn>
<start_of_turn>user
I think he's a terrible president
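
The `temperature`, `top_k`, and `top_p` arguments shape which tokens can be sampled at each step. The function below is a conceptual sketch in plain Python, not the `transformers` implementation: it applies the temperature, keeps the `top_k` most likely tokens, then keeps the smallest prefix of those whose cumulative probability reaches `top_p`. The name `filter_top_k_top_p` and the toy logits are made up for illustration.

```python
import math

def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} for tokens that survive filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: indices of the k largest logits, most likely first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving logits (shifted by the max for stability).
    peak = scaled[order[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in order}
    norm = sum(weights.values())
    probs = {i: w / norm for i, w in weights.items()}
    # Top-p: keep the most likely tokens until their mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_top_k_top_p(toy_logits, top_k=3, top_p=0.9))
```

With these toy values, only the two most likely tokens survive, and their probabilities are renormalized to sum to 1 before sampling.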


### Save to Hugging Face

Save the LoRA adapter separately (useful for reloading it into Unsloth for further training).

In [None]:
model.push_to_hub(
    "HF_REPO",
    tokenizer,
    token = userdata.get('HF_ACCESS_TOKEN')
)

README.md:   0%|          | 0.00/599 [00:00<?, ?B/s]

Uploading...:   0%|          | 0.00/42.3M [00:00<?, ?B/s]

Saved model to https://huggingface.co/gonibeed/gemma-3n-LORA
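
The upload size reported above is consistent with the adapter being stored at 4 bytes per parameter. A quick check, assuming 32-bit storage and decimal megabytes; the trainable-parameter count is the one the training log reports.

```python
TRAINABLE_PARAMS = 10_567_680  # "Trainable parameters" from the training log
BYTES_PER_PARAM = 4            # assumption: adapter weights saved as 32-bit floats

adapter_mb = TRAINABLE_PARAMS * BYTES_PER_PARAM / 1e6
print(f"{adapter_mb:.1f} MB")  # 42.3 MB
```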


### Merge LoRA and LLM into a single model and convert to GGUF

In [None]:
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template
from google.colab import userdata

model, tokenizer = FastModel.from_pretrained(
    model_name = "HF_REPO",
    max_seq_length = 2048,
    load_in_4bit = True,
    token = userdata.get('HF_ACCESS_TOKEN'),
)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

==((====))==  Unsloth 2025.7.8: Fast Gemma3N patching. Transformers: 4.53.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/42.3M [00:00<?, ?B/s]

In [None]:
model.push_to_hub_merged(
    "HF_REPO", tokenizer,
    token = userdata.get('HF_ACCESS_TOKEN')
)

Uploading...:   0%|          | 0.00/38.1M [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00003.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Downloading safetensors index for unsloth/gemma-3n-e2b...


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Unsloth: Merging weights into 16bit:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/3.08G [00:00<?, ?B/s]

Uploading...:   0%|          | 0.00/3.08G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:  33%|███▎      | 1/3 [00:55<01:50, 55.49s/it]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Uploading...:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:  67%|██████▋   | 2/3 [02:32<01:19, 79.74s/it]

model-00003-of-00003.safetensors:   0%|          | 0.00/2.82G [00:00<?, ?B/s]

Uploading...:   0%|          | 0.00/2.82G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 3/3 [03:44<00:00, 74.93s/it]
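
Conceptually, merging folds each low-rank update into its base weight, `W_merged = W + (lora_alpha / r) * B @ A`, after which the adapter is no longer needed at inference time. The plain-Python sketch below shows this on a tiny made-up example. It illustrates the algebra only and is not how `push_to_hub_merged` is implemented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * lora_b @ lora_a."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy values: 2 x 2 base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x d_in
B = [[0.5], [1.0]]      # d_out x r
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```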


On Google Colab, I ran out of memory when trying to convert this model to GGUF. As a workaround, merge the LoRA adapter into the base model and push the merged model to Hugging Face (as done above), then convert it with the GGUF-my-repo Space:
https://huggingface.co/spaces/ggml-org/gguf-my-repo