<a href="https://colab.research.google.com/github/RokeMSE/YT_Downloader/blob/main/nb/Gemma3_(4B).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [3]:
from unsloth import FastModel
import torch
from unsloth.chat_templates import (
    get_chat_template,
    standardize_data_formats,
    train_on_responses_only,
)
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-07 11:04:15 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-07 11:04:15 [__init__.py:239] Automatically detected platform cuda.


In [4]:
model, tokenizer = FastModel.from_pretrained(
    # 4bit dynamic quants for superior accuracy and low memory use
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


We now add LoRA adapters so we only need to update a small number of parameters!

In [5]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 42, # just for the heck of it
)

Unsloth: Making `model.base_model.model.language_model.model` require gradients
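
To make the saving concrete, here is a minimal sketch of the parameter arithmetic behind LoRA. It assumes a single hypothetical 2560 x 2560 projection matrix (an illustrative size, not read from the model's config), and `lora_param_count` is a helper introduced here, not an Unsloth function.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two small matrices, A (r x d_in) and B (d_out x r),
    instead of updating the full matrix.
    """
    return r * d_in + d_out * r


# Hypothetical layer size, chosen only for illustration.
d_in, d_out, r = 2560, 2560, 8

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full matrix: {full:,} params")
print(f"LoRA (r={r}): {lora:,} params ({lora / full:.2%} of the full matrix)")
```

Doubling `r` doubles the adapter size, which is why a larger rank costs more memory and can overfit on small datasets.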


<a name="Data"></a>
### Data Prep
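We now use the `Gemma-3` format for conversation style finetunes. Instead of [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style, this notebook trains on the first 2,500 rows of the `rajpurkar/squad` question-answering dataset, loaded a few cells further down. Gemma-3 renders multi-turn conversations like this: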

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [6]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
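
As a rough illustration of the turn format shown above, the following sketch renders a message list by hand. `render_gemma3` is a hypothetical helper written for this note; it is a simplification and not the template that `get_chat_template` installs, which also covers cases such as image content.

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Render chat messages in the Gemma-3 turn format (simplified sketch)."""
    # Gemma-3 names the assistant role "model".
    role_names = {"user": "user", "assistant": "model"}
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues from here.
        text += "<start_of_turn>model\n"
    return text


chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(chat))
```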

In [7]:
from datasets import load_dataset
# dataset = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations", split = "train")
# dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split = "train") # All of it
# dataset = dataset.select(range(1000)) # For testing only!
dataset_name = "rajpurkar/squad"
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.select(range(2500))

In [8]:
''' import random
def generate_not_found_cases(dataset, nCases = None):
    # by default, nCases = None so that the new set will have the same size as the given set
    nContext = -1
    contextIds = []
    currentContext = ""
    for row in dataset:
        context = row["context"]
        if context != currentContext:
            nContext += 1
            currentContext = context
        contextIds.append(nContext)


    N = len(contextIds)
    new_items = []
    if nCases is None:
        nCases = int(N)
    for currentContextIdx in range(N):
        if nCases == 0:
            break
        currentContextId = contextIds[ currentContextIdx ]
        chosenContextIdx = random.randrange(N)
        while contextIds[ chosenContextIdx ] == currentContextId:
            chosenContextIdx = random.randrange(N)
        chosenQuestion = dataset[ chosenContextIdx ]['question']
        new_items.append({
            "context": dataset[ currentContextIdx ]['context'],
            "question": chosenQuestion,
            "answers": {
                "text": ["This information is not in the provided context."],
                "answer_start": [0]
            }
        })
        nCases -= 1
    return new_items
pass '''


We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [9]:
''' import pandas as pd
from datasets import Dataset
new_not_found_items = generate_not_found_cases(dataset, int( len(dataset) * 0.2))
dataset = pd.concat([dataset.to_pandas(), pd.DataFrame(new_not_found_items)], ignore_index=True)
dataset = dataset.astype({"id": str})
dataset = Dataset.from_pandas(dataset) '''

from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Let's see what row 100 looks like!

In [10]:
dataset[100]

{'id': '573387acd058e614000b5cb5',
 'title': 'University_of_Notre_Dame',
 'context': 'One of the main driving forces in the growth of the University was its football team, the Notre Dame Fighting Irish. Knute Rockne became head coach in 1918. Under Rockne, the Irish would post a record of 105 wins, 12 losses, and five ties. During his 13 years the Irish won three national championships, had five undefeated seasons, won the Rose Bowl in 1925, and produced players such as George Gipp and the "Four Horsemen". Knute Rockne has the highest winning percentage (.881) in NCAA Division I/FBS football history. Rockne\'s offenses employed the Notre Dame Box and his defenses ran a 7–2–2 scheme. The last game Rockne coached was on December 14, 1930 when he led a group of Notre Dame all-stars against the New York Giants in New York City.',
 'question': 'In what year did the team lead by Knute Rockne win the Rose Bowl?',
 'answers': {'text': ['1925'], 'answer_start': [354]}}

We now have to put each row into the `Gemma-3` chat format and save the result to `text`. Instead of calling the tokenizer's chat template, the cell below fills a hand-written prompt template that deliberately leaves out the `<bos>` token: the Processor adds this token before training and the model expects only one.

In [11]:
train_prompt_template = """<start_of_turn>user
Answer the following question or the requirement based on the provided context. If the context does not contain the answer, state "This information is not in the provided context.".

Context:
{}

Question:
{}<end_of_turn>
<start_of_turn>model
{}<end_of_turn>"""

def format_training_row(rows):
    """Render a batch of SQuAD rows into Gemma-3 prompt/response strings."""
    try:
        texts = []
        for context, question, answer in zip(
            rows['context'], rows['question'], rows['answers']
        ):
            # Quote every reference answer and join alternatives with ' or '.
            joined = ' or '.join(f'"{ans.strip()}"' for ans in answer['text'])
            texts.append(
                train_prompt_template.format(context.strip(), question.strip(), joined)
            )
    except KeyError as e:
        print(f"Error: Missing key {e}")
        raise

    return {"text": texts}

dataset = dataset.map(format_training_row, batched = True)

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [12]:
dataset[100]["text"]

'<start_of_turn>user\nAnswer the following question or the requirement based on the provided context. If the context does not contain the answer, state "This information is not in the provided context.".\n\nContext:\nOne of the main driving forces in the growth of the University was its football team, the Notre Dame Fighting Irish. Knute Rockne became head coach in 1918. Under Rockne, the Irish would post a record of 105 wins, 12 losses, and five ties. During his 13 years the Irish won three national championships, had five undefeated seasons, won the Rose Bowl in 1925, and produced players such as George Gipp and the "Four Horsemen". Knute Rockne has the highest winning percentage (.881) in NCAA Division I/FBS football history. Rockne\'s offenses employed the Notre Dame Box and his defenses ran a 7–2–2 scheme. The last game Rockne coached was on December 14, 1930 when he led a group of Notre Dame all-stars against the New York Giants in New York City.\n\nQuestion:\nIn what year did th
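
The answer-joining step is easiest to see in isolation. The sketch below reproduces only that part of the formatting; `join_answers` is a hypothetical helper, and the two-answer input is made up to show what a row with several reference answers would produce.

```python
def join_answers(answer_texts):
    """Quote each reference answer and join alternatives with ' or '."""
    return " or ".join(f'"{text.strip()}"' for text in answer_texts)


# One reference answer, as in row 100 above.
print(join_answers(["1925"]))

# Made-up example with two alternative answers and stray whitespace.
print(join_answers(["Denver Broncos ", "the Broncos"]))
```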

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [13]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
        dataset_num_proc=2,
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/2500 [00:00<?, ? examples/s]
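
It helps to work out how much data this configuration actually touches. The arithmetic below is a quick sketch using the values from the cell above; the single GPU is taken from the load log, and the variable names are only illustrative.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1          # one Tesla T4, per the load log
max_steps = 30
num_examples = 2500

# Gradient accumulation sums gradients over several small batches
# before each optimizer step, mimicking a larger batch.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.1%}")
print(f"optimizer steps for a full epoch: about {num_examples / effective_batch}")
```

So the short run covers under a tenth of the 2,500 rows, which is worth keeping in mind when reading the loss values.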

We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

In [14]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/2500 [00:00<?, ? examples/s]
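
Conceptually, the masking replaces every label up to the response marker with `-100`, the value the loss ignores. The following is a simplified single-turn sketch with made-up token ids; `mask_before_response` is a hypothetical helper and not Unsloth's implementation, which works on real tokenized markers and multi-turn conversations.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_before_response(input_ids, response_marker):
    """Return labels where everything up to and including the marker is ignored."""
    n = len(response_marker)
    labels = [IGNORE_INDEX] * len(input_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_marker:
            # Keep only the tokens that follow the response marker.
            labels[start + n:] = input_ids[start + n:]
            break
    return labels


# Made-up token ids: 1 = <bos>, [5, 6] = user marker, [7, 8] = model marker.
input_ids = [1, 5, 6, 10, 11, 12, 7, 8, 20, 21, 99]
labels = mask_before_response(input_ids, response_marker=[7, 8])
print(labels)
```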

Let's verify that the instruction part is masked! First, let's print row 100 again. Notice how the sample has only a single `<bos>`, as expected!

In [15]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nAnswer the following question or the requirement based on the provided context. If the context does not contain the answer, state "This information is not in the provided context.".\n\nContext:\nOne of the main driving forces in the growth of the University was its football team, the Notre Dame Fighting Irish. Knute Rockne became head coach in 1918. Under Rockne, the Irish would post a record of 105 wins, 12 losses, and five ties. During his 13 years the Irish won three national championships, had five undefeated seasons, won the Rose Bowl in 1925, and produced players such as George Gipp and the "Four Horsemen". Knute Rockne has the highest winning percentage (.881) in NCAA Division I/FBS football history. Rockne\'s offenses employed the Notre Dame Box and his defenses ran a 7–2–2 scheme. The last game Rockne coached was on December 14, 1930 when he led a group of Notre Dame all-stars against the New York Giants in New York City.\n\nQuestion:\nIn what year d

Now let's print the masked out example - you should see only the answer is present:

In [16]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                                                                                                                                                                   "1925"<end_of_turn>'

In [17]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.57 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)` instead.

In [18]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,500 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,901,248/4,000,000,000 (0.37% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 10.8027 |
| 2 | 10.4134 |
| 3 | 10.7562 |
| 4 | 5.9815 |
| 5 | 4.295 |
| 6 | 2.9172 |
| 7 | 2.3609 |
| 8 | 0.8155 |
| 9 | 0.1248 |
| 10 | 0.658 |


In [19]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

164.4703 seconds used for training.
2.74 minutes used for training.
Peak reserved memory = 5.57 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 37.786 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [20]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><start_of_turn>user\nContinue the sequence: 1, 1, 2, 3, 5, 8,<end_of_turn>\n<start_of_turn>model\n13, 21, 34...\n\n"Fibonacci Sequence"<end_of_turn>']
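
For readers unfamiliar with these three knobs, here is a small pure-Python sketch of what they do to a next-token distribution. It is a simplified illustration under the usual definitions of temperature, top-k and top-p, not the code path `model.generate` runs; `filter_probs` and the toy logits are made up for this note.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return renormalized probabilities after temperature, top-k and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely token ids, most likely first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_mass = sum(probs[i] for i in top)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in top:
        kept.append(i)
        cumulative += probs[i] / top_mass
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy five-token vocabulary with tighter settings so the effect is visible.
filtered = filter_probs([2.0, 1.0, 0.0, -1.0, -2.0], top_k=3, top_p=0.9)
print(filtered)
```

With the notebook's settings (`top_k = 64`, `top_p = 0.95`) far more candidates survive, so sampled outputs vary from run to run.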

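Because sampling is enabled, re-running the same prompt gives a different continuation, as the cell below shows. You can also pass a `TextStreamer` to `model.generate` for continuous inference, so you can see the generation token by token instead of waiting for the whole output. The loading cell in the next section uses one.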

In [21]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><start_of_turn>user\nContinue the sequence: 1, 1, 2, 3, 5, 8,<end_of_turn>\n<start_of_turn>model\n13, 21, 34\n\n**Explanation**\nThis is the Fibonacci sequence. Each number is the sum of the two preceding numbers.\n\n*   1 + 1 = 2\n*   1 + 2 = 3\n*   2 + 3 = 5\n*']

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [22]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")

['gemma-3/processor_config.json']

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [23]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3", # The directory the LoRA adapters were saved to above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "I'm looking for some new fashion brands to check out. Do you have any suggestions?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Okay, let's dive into some fantastic fashion brands! Here's a breakdown of suggestions, categorized by style and price point, with a little bit about what makes each brand unique:

**1. Affordable & Trendy (Under $100 per piece):**

* **Aritzia:** (Canadian


In [None]:
model.save_pretrained_merged("gemma-3-finetune", tokenizer)

Downloading safetensors index for unsloth/gemma-3-4b-it...


Unsloth: Merging weights into 16bit:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]
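
Merging folds the adapters back into the base weights so the result loads like an ordinary model. The sketch below shows the standard LoRA merge, `W + (lora_alpha / r) * B @ A`, on a toy matrix. It is an illustration with made-up values, not Unsloth's implementation, which also has to deal with the quantized base weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2 x 2 weight with rank-1 adapters (values chosen for illustration only).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape r x d_in  = 1 x 2
lora_b = [[0.5], [0.0]]  # shape d_out x r = 2 x 1
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=1, r=1)
print(merged)
```

In this notebook `r = 8` and `lora_alpha = 8`, so the scale factor is 1 and the adapter product is added unscaled.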

In [None]:
# --- Save to GGUF (F16) ---
model.save_pretrained_gguf("gemma-3", quantization_type = "F16") # For now only Q8_0, BF16, F16 supported

In [None]:
# Example of loading a model in 16-bit precision
# Path where you saved the merged model
saved_merged_model_path = "./gemma-3-finetune"

# Load the model in 16-bit precision
model, tokenizer = FastModel.from_pretrained(
    model_name = saved_merged_model_path,  # Load from your local saved directory
    max_seq_length = 2048,             # Or the sequence length you used
    load_in_4bit = False,              # DO NOT load in 4bit
    load_in_8bit = False,              # DO NOT load in 8bit
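    dtype = torch.float16,             # Explicitly request float16
    # NOTE: on the Tesla T4 used above, Unsloth logged that float16 won't work
    # for gemma3 and switched to float32, so this request may be overridden.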
    # Or use torch.bfloat16 if your GPU supports it well (Ampere+) and you prefer it
    # dtype = torch.bfloat16,
    # Or let Unsloth/Transformers choose automatically (often defaults to torch.float16)
    # dtype = None, # or "auto"
    device_map = "auto",               # Map model layers across available devices
)
