# **LIBRARIES**

In [None]:
from unsloth import FastModel
import torch
from unsloth.chat_templates import train_on_responses_only
from datasets import load_dataset

## Unsloth models
- Some other 4-bit models: 
    + `unsloth/gemma-3-1b-it-unsloth-bnb-4bit`
    + `unsloth/gemma-3-4b-it-unsloth-bnb-4bit`
    + `unsloth/gemma-3-12b-it-unsloth-bnb-4bit`
    + `unsloth/gemma-3-27b-it-unsloth-bnb-4bit`
> We will be using the **4B model** since it's not too big or too small.

In [None]:
model, tokenizer = FastModel.from_pretrained(
    # 4bit dynamic quants for superior accuracy and low memory use
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 42, # just for the heck of it
)
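
To get a feel for how few parameters LoRA trains, here is a minimal back-of-the-envelope sketch. For one linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` parameters in total. The layer size `2560` below is only an illustrative assumption and is not read from the model config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one rank-r LoRA adapter (A is r x d_in, B is d_out x r)."""
    return r * d_in + d_out * r


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix of the same layer."""
    return d_in * d_out


d = 2560  # hypothetical layer width, chosen for illustration only
r = 8     # same rank as in the cell above

lora_params = lora_param_count(d, d, r)
full_params = full_param_count(d, d)
ratio = lora_params / full_params

print(f"LoRA params: {lora_params:,}")
print(f"Full params: {full_params:,}")
print(f"Trainable fraction for this layer: {ratio:.3%}")
```

With `lora_alpha = 8` and `r = 8`, the adapter update is scaled by `lora_alpha / r = 1.0` in the standard LoRA formulation.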

# **DATA PRE-PROCESS**

## About the dataset:
- `HuggingFaceTB/smoltalk` is an official Hugging Face synthetic dataset designed for supervised finetuning (SFT) of LLMs. It includes roughly **1.1M records**.
- Dataset composition: The mix consists of:
    + `Smol-Magpie-Ultra`: the core component, consisting of 400K samples generated using the Magpie pipeline with Llama-3.1-405B-Instruct.
    + `Smol-constraints`: a 36K-sample dataset that trains models to follow specific constraints, such as generating responses with a fixed number of sentences or words, or incorporating specified words in the output. The dataset has been decontaminated against IFEval to prevent overlap.
    + `Smol-rewrite`: a 50K-sample collection focused on text rewriting tasks, such as adjusting tone to be more friendly or professional.
    + `Smol-summarize`: a 100K-sample dataset specialized in email and news summarization.
- Usage: There are subsets for different needs: `all`, `apigen-80k`, `everyday-conversations`, etc.

In [None]:
from datasets import load_dataset
dataset = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations", split = "train")
# dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split = "train") # All of it
# dataset = dataset.select(range(1000)) # For testing only!

- We now use the `Gemma-3` format for conversation style finetunes. 

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [None]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Let's see what row 100 looks like!

In [None]:
dataset[100]

We now have to apply the chat template for `Gemma-3` to the conversations and save the result to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.

In [None]:
def formatting_prompts_func(examples):
    all_messages = examples['messages'] # Get the list of conversations for the batch
    texts = []
    for conversation_messages in all_messages:
        # Apply the chat template configured in the tokenizer
        # tokenize=False returns the formatted string
        # add_generation_prompt=False formats for training, not inference
        formatted_text = tokenizer.apply_chat_template(
            conversation_messages,
            tokenize=False,
            add_generation_prompt=False
        )

        # Remove the leading <bos> token; removeprefix is a no-op if it is absent
        formatted_text = formatted_text.removeprefix('<bos>')

        texts.append(formatted_text)

    return {'text': texts}

# Apply the function to the dataset using map
# batched=True processes multiple rows at once for efficiency
# dataset = dataset.map(formatting_prompts_func, batched = True, remove_columns=dataset.column_names)
dataset = dataset.map(formatting_prompts_func, batched = True)

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [None]:
dataset[100]["text"]
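
To make the format concrete without loading the tokenizer, here is a minimal sketch of what the formatted text roughly looks like. The `render_gemma3` helper is hypothetical and only approximates the Gemma-3 turn layout (it is not the real chat template); it exists to show why `removeprefix('<bos>')` leaves the text starting at the first user turn.

```python
def render_gemma3(messages):
    """Simplified, hypothetical approximation of the Gemma-3 chat layout."""
    role_map = {"user": "user", "assistant": "model"}  # Gemma calls the assistant "model"
    parts = ["<bos>"]
    for message in messages:
        role = role_map[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]

with_bos = render_gemma3(conversation)
text = with_bos.removeprefix("<bos>")  # requires Python 3.9+
print(text)
```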

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but for a full run you can keep `num_train_epochs=1` and remove the `max_steps` argument.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 1, # Used only when max_steps is not set
        max_steps = 30, # Takes precedence over num_train_epochs; remove for a full run
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Set to "wandb" etc. to enable experiment logging
        dataset_num_proc=2,
    ),
)
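
The interaction between batch size, gradient accumulation, and the learning-rate schedule can be sketched with plain arithmetic. The values below are copied from the config above; `num_devices = 1` is an assumption (single GPU), and `linear_lr` only approximates the usual linear-warmup-then-linear-decay shape, so the trainer's actual scheduler may differ in details.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 30
warmup_steps = 5
learning_rate = 2e-4
num_devices = 1  # assumption: training on a single GPU

# Number of examples contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples consumed over the shortened run
examples_seen = effective_batch_size * max_steps


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to the peak rate, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return learning_rate * remaining


print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
for step in (0, 5, 15, 30):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```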

We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Let's verify that masking of the instruction part worked! Let's print the 100th row again. Notice how the sample has only a single `<bos>`, as expected!

In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

Now let's print the masked out example - you should see only the answer is present:

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
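
The masking idea can be illustrated with a toy example. The sketch below works on string tokens rather than token IDs and is not Unsloth's actual implementation; `mask_non_response` is a hypothetical helper. Positions labeled `-100` are the ones the loss ignores.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_non_response(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker (toy version)."""
    labels = []
    in_response = False
    for token in tokens:
        if token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not trained on
        elif token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels


tokens = [
    "<bos>",
    "<start_of_turn>user\n", "Hi", "<end_of_turn>\n",
    "<start_of_turn>model\n", "Hello", "!", "<end_of_turn>\n",
]
labels = mask_non_response(
    tokens,
    instruction_marker="<start_of_turn>user\n",
    response_marker="<start_of_turn>model\n",
)
print(labels)
```

Only the model's reply (and its closing turn token) keeps a label, which matches what the decoded `labels` column above shows.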

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)` instead.

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

### Inference
- Run the model via Unsloth native inference.
> Best settings are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False, # Return the formatted string, not token IDs
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 2048, # is about 1.5k words of text (i think)
    temperature = 1.0, top_p = 0.95, top_k = 64, # The best settings according to Gemma-3 paper (pls don't change)
)
tokenizer.batch_decode(outputs)
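
To show what `temperature`, `top_k`, and `top_p` do to the next-token distribution, here is a minimal pure-Python sketch. It is a simplified illustration, not the `transformers` implementation; `sample_token` is a hypothetical helper, and the order and details of the filters in the real library may differ.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Sample one token index after temperature, top-k, and top-p filtering."""
    # 1. Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [x / temperature for x in logits]

    # 2. Softmax (shifted by the max for numerical stability)
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-k: keep the k most probable tokens and renormalize
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # 5. Sample among the survivors, weighted by probability
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(sample_token(toy_logits, rng=random.Random(0)))
```

Setting `top_k = 1` reduces this to greedy decoding, since only the most probable token survives the filter.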

You can also use a `TextStreamer` for streaming inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
""" 
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "I'm looking for some new fashion brands to check out. Do you have any suggestions?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 2048, # is about 1.5k words of text (i think)
    temperature = 1.0, top_p = 0.95, top_k = 64, # The best settings according to Gemma-3 paper (pls don't change)
    streamer = TextStreamer(tokenizer, skip_prompt = True),
) 
"""

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Huggingface's `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model.

In [None]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3", # The directory the LoRA adapters were saved to above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "I'm looking for some new fashion brands to check out. Do you have any suggestions?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False, # Return the formatted string, not token IDs
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

In [None]:
model.save_pretrained_merged("gemma-3-finetune", tokenizer)
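
Conceptually, merging folds the adapter back into the base weights so that no separate LoRA files are needed. The sketch below shows the standard LoRA merge formula `W + (lora_alpha / r) * B @ A` on tiny hand-written matrices. It is an illustration of the math only, not Unsloth's internal code, and `merge_lora` and `matmul` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(W, A, B, lora_alpha, r):
    """Return the merged weight W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # shape d_out x d_in, same as W
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (d_out x d_in)
A = [[1.0, 2.0]]              # LoRA "down" matrix (r x d_in), r = 1
B = [[0.5], [0.0]]            # LoRA "up" matrix (d_out x r)

merged = merge_lora(W, A, B, lora_alpha=1, r=1)
print(merged)
```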

### Saving the finetuned model for usage
- Due to hardware limitations, the finetuned model can be saved in one of three formats: `Q8_0` (INT8), `F16` (FLOAT16), or `BF16`. Some notes on the formats:
    + `F16` : Fastest conversion and retains 100% accuracy, but the resulting model is slow and memory hungry.
    + `Q8_0` : Fast conversion. High resource use, but generally acceptable.

In [None]:
# --- Save to GGUF (F16) ---
model.save_pretrained_gguf("gemma-3-finetune", tokenizer, quantization_type = "F16") # For now only Q8_0, BF16, F16 supported
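
To illustrate why `Q8_0` trades a little accuracy for roughly half the size of `F16`, here is a minimal sketch of block-wise 8-bit quantization in the style of `Q8_0`: values are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is a simplified illustration, not the GGUF converter's code (for example, the real format stores the scale as a 16-bit float), and both helper names are hypothetical.

```python
BLOCK_SIZE = 32  # values per quantization block


def quantize_q8_0(values):
    """Quantize floats block by block into (scale, int8 list) pairs."""
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        scale = amax / 127 if amax > 0 else 1.0
        quantized = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, quantized))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from the quantized blocks."""
    return [scale * q for scale, quantized in blocks for q in quantized]


weights = [(i - 32) / 10 for i in range(64)]  # toy weights in [-3.2, 3.1]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"Blocks: {len(blocks)}, max reconstruction error: {max_error:.5f}")
```

The reconstruction error per value is bounded by half of that block's scale, which is why `Q8_0` is usually considered close to lossless for inference.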

## Using the Saved Models:
- GGUF : just load the generated .gguf file (`gemma-3-finetune.gguf`) using tools compatible with GGUF, like `llama-cpp-python`.
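
As a minimal sketch of the GGUF route, the helper below loads a `.gguf` file with `llama-cpp-python` and runs one chat turn. It assumes the package is installed (`pip install llama-cpp-python`) and that the file name matches what the export step actually wrote to disk, which may differ from `gemma-3-finetune.gguf`. The `chat_with_gguf` function name is hypothetical.

```python
def chat_with_gguf(model_path: str, prompt: str, max_tokens: int = 64) -> str:
    """Load a GGUF model with llama-cpp-python and return one chat reply."""
    # Imported lazily so this cell can be defined without the package installed
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        # Same sampling settings as recommended for Gemma-3 above
        temperature=1.0,
        top_p=0.95,
        top_k=64,
    )
    return result["choices"][0]["message"]["content"]


# Example call (assumed file name; adjust to the file produced by the export):
# print(chat_with_gguf("gemma-3-finetune.gguf", "Continue the sequence: 1, 1, 2, 3, 5, 8,"))
```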

In [None]:
""" # Example of loading a model in 16-bit precision
import torch
from unsloth import FastModel

# Path where you saved the merged model
saved_merged_model_path = "./gemma-3-finetune"

# Load the model in 16-bit precision
model, tokenizer = FastModel.from_pretrained(
    model_name = saved_merged_model_path,  # Load from your local saved directory
    max_seq_length = 2048,             # Or the sequence length you used
    load_in_4bit = False,              # DO NOT load in 4bit
    load_in_8bit = False,              # DO NOT load in 8bit
    dtype = torch.float16,             # Explicitly request float16
    # Or use torch.bfloat16 if your GPU supports it well (Ampere+) and you prefer it
    # dtype = torch.bfloat16,
    # Or let Unsloth/Transformers choose automatically (often defaults to torch.float16)
    # dtype = None, # or "auto"
    device_map = "auto",               # Map model layers across available devices
) 
"""

# REFERENCES
- [✨Gemma 3: How to Run & Fine-tune](https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-tune)