## Installation

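Install Unsloth and the packages needed for training. On Colab, the dependencies are installed individually, with the xformers version pinned to match the preinstalled torch release.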

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
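The version check in the Colab branch amounts to a small lookup from the torch release to an xformers pin. A minimal sketch of that logic as a standalone function follows; the name `pick_xformers_pin` is hypothetical, and the pins are the ones used in the cell above.

```python
import re

def pick_xformers_pin(torch_version: str) -> str:
    """Map a torch version string to the xformers pin used in the install cell."""
    # Keep only "major.minor", e.g. "2.8.0+cu126" -> "2.8"
    major_minor = re.match(r"[0-9]+\.[0-9]+", torch_version).group(0)
    pins = {"2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}
    # Any other torch release falls back to the oldest pin
    return "xformers==" + pins.get(major_minor, "0.0.29.post3")

print(pick_xformers_pin("2.8.0+cu126"))
```

Writing the mapping as a dictionary makes it easier to add a pin when a new torch release appears.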

## Load Model

Load the instruction-tuned Gemma-3 270M model (`unsloth/gemma-3-270m-it`) with Unsloth's `FastModel`, which is used for both training and inference in this notebook.

In [None]:
from unsloth import FastModel
import torch

max_seq_length = 2048  # Sufficient for GSM8K problems

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",
    max_seq_length=max_seq_length,
    load_in_4bit=False,   # No 4-bit quantization; the model is small enough without it
    load_in_8bit=False,
    full_finetuning=False,  # Use LoRA for efficiency
    # token = "hf_...",  # Uncomment if using gated models
)

## Configure LoRA Adapters

Add LoRA adapters for parameter-efficient fine-tuning. Only the low-rank adapter weights are trained; the base model's weights stay frozen.

In [None]:
model = FastModel.get_peft_model(
    model,
    r=128,  # LoRA rank - higher values = more capacity but more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=128,
    lora_dropout=0,  # No dropout for optimized training
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)
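To get a feel for what the rank controls, the size of one adapter can be worked out by hand. The sketch below counts the parameters LoRA adds to a single linear layer; the layer shape is a hypothetical example, not a value read from the Gemma-3 config, and `lora_param_count` is an illustrative helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r, lora_alpha = 128, 128
# Hypothetical layer shape, chosen only for illustration
d_in, d_out = 640, 2048

full = d_in * d_out
adapter = lora_param_count(d_in, d_out, r)
print(f"frozen weight: {full:,} params, adapter: {adapter:,} params")
# With use_rslora=False the adapter update is scaled by alpha / r
print(f"scaling alpha/r = {lora_alpha / r}")
```

The adapter size grows linearly with `r`, so halving the rank halves the trainable parameters for that layer.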

## Setup Chat Template

Configure the Gemma-3 chat template for proper conversation formatting.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma3",
)

## Load GSM8K Dataset

Load the GSM8K dataset which contains grade school math word problems with step-by-step solutions.

In [None]:
from datasets import load_dataset

# Load GSM8K training split
dataset = load_dataset("openai/gsm8k", "main", split="train")

print(f"Dataset size: {len(dataset)} examples")
print(f"\nSample problem:")
print(f"Question: {dataset[0]['question']}")
print(f"\nAnswer: {dataset[0]['answer']}")
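GSM8K answers interleave the reasoning with calculator annotations in `<<...>>` and end with the final answer after `####`. A minimal sketch of splitting such a string is shown below; the helper name `split_gsm8k_answer` and the sample text are made up for illustration.

```python
import re

def split_gsm8k_answer(answer: str) -> tuple[str, str]:
    """Split a GSM8K-style answer into (reasoning, final_answer)."""
    reasoning, _, final = answer.rpartition("####")
    # Drop calculator annotations such as <<48/2=24>>
    reasoning = re.sub(r"<<.*?>>", "", reasoning).strip()
    # Remove thousands separators so "1,000" becomes "1000"
    return reasoning, final.strip().replace(",", "")

# Made-up sample in the GSM8K style
sample = "She sold 48/2 = <<48/2=24>>24 clips in May.\n#### 24"
reasoning, final = split_gsm8k_answer(sample)
print(reasoning)
print(final)
```

This notebook trains on the answers unchanged, so the helper is only useful for inspection or for scoring later.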

## Format Dataset for Training

Convert each GSM8K example to a two-turn conversation for the Gemma-3 chat template. The math-tutor instructions are prepended to the user turn rather than sent as a separate system message.

In [None]:
SYSTEM_PROMPT = """You are a helpful math tutor. Solve the given math problem step by step.
Show your reasoning clearly and provide the final numerical answer at the end.
Format your final answer as: #### [number]"""

def convert_gsm8k_to_chat(example):
    """Convert GSM8K example to Gemma-3 chat format."""
    return {
        "conversations": [
            {"role": "user", "content": f"{SYSTEM_PROMPT}\n\nProblem: {example['question']}"},
            {"role": "assistant", "content": example['answer']}
        ]
    }

# Convert dataset
dataset = dataset.map(convert_gsm8k_to_chat)

# Preview converted example
print("Converted example:")
print(dataset[0]["conversations"])

## Apply Chat Template

Apply the Gemma-3 chat template to format conversations for training.

In [None]:
def formatting_prompts_func(examples):
    """Apply chat template to conversations."""
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

# Preview formatted text
print("Formatted training example:")
print(dataset[0]['text'][:500] + "...")
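For intuition about what the template produces, the layout can be approximated by hand. This is a rough sketch, not the tokenizer's actual template: the `<end_of_turn>` delimiter and the exact whitespace are assumptions, and `render_gemma_turns` is a hypothetical helper. The output of `tokenizer.apply_chat_template` remains the authority.

```python
def render_gemma_turns(conversation):
    """Approximate the text layout of a Gemma-style chat, without <bos>."""
    # The chat template names the assistant role "model"
    role_names = {"user": "user", "assistant": "model"}
    parts = []
    for message in conversation:
        role = role_names[message["role"]]
        # Assumed layout: each turn is wrapped in start/end markers
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    return "".join(parts)

demo = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4\n#### 4"},
]
print(render_gemma_turns(demo))
```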

## Configure Training

Set up the SFT trainer with optimized hyperparameters for math reasoning.

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=None,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,  # Effective batch size = 8
        warmup_steps=10,
        max_steps=200,  # Short demo run; for a full pass, remove max_steps and set num_train_epochs=1
        learning_rate=5e-5,
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs_gsm8k",
        report_to="none",  # Use "wandb" for Weights & Biases logging
        save_steps=100,
        save_total_limit=2,
    ),
)
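The batch and schedule settings above can be checked with a little arithmetic. The sketch below assumes a single GPU and a GSM8K train split of 7,473 examples (confirm with `len(dataset)`); `lr_at` is a hypothetical helper that mirrors the usual linear-warmup-then-cosine shape, not the trainer's own scheduler object.

```python
import math

per_device_batch, grad_accum, max_steps = 4, 2, 200
warmup_steps, peak_lr = 10, 5e-5
train_examples = 7473  # assumed GSM8K train size; confirm with len(dataset)

effective_batch = per_device_batch * grad_accum   # single GPU assumed
examples_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)

def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, examples_seen, steps_per_epoch)
for step in (0, 10, 105, 200):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions, 200 steps cover 1,600 examples, roughly a fifth of the train split, and one full epoch would take about 935 optimizer steps.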

## Enable Response-Only Training

Compute the loss only on the assistant's responses (the math solutions), not on the user's questions. The prompt tokens are still used as context, but the model is not trained to reproduce them.

In [None]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
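The idea behind response-only training can be shown on a toy sequence. This is a simplified single-turn sketch using string tokens; it is not Unsloth's implementation, which works on token IDs, and `mask_prompt_tokens` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_prompt_tokens(tokens, response_marker):
    """Return labels where everything up to and including the marker is ignored."""
    labels = list(tokens)
    n = len(response_marker)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == response_marker:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # No response marker found: ignore the whole sequence
    return [IGNORE_INDEX] * len(tokens)

tokens = ["<start_of_turn>", "user", "\n", "2+2?",
          "<start_of_turn>", "model", "\n", "4"]
labels = mask_prompt_tokens(tokens, ["<start_of_turn>", "model", "\n"])
print(labels)
```

Only the positions that keep their original token contribute to the loss; every `-100` position is skipped.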

## Verify Training Masking

Confirm that the prompt tokens are masked (label `-100`) and that only the model's responses remain as training targets.

In [None]:
# Show full input
print("Full input:")
print(tokenizer.decode(trainer.train_dataset[0]["input_ids"])[:500] + "...")

print("\n" + "="*50 + "\n")

# Show what the model is trained on (masked version)
print("Training target (only assistant response):")
masked_labels = [tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[0]["labels"]]
print(tokenizer.decode(masked_labels).replace(tokenizer.pad_token, "")[:500] + "...")

## Check Memory Usage

Display current GPU memory statistics before training.

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU: {gpu_stats.name}")
print(f"Max memory: {max_memory} GB")
print(f"Reserved memory: {start_gpu_memory} GB")

## Train the Model

Start the fine-tuning process. This will train the model on GSM8K math problems.

In [None]:
trainer_stats = trainer.train()

## Training Statistics

Display final memory usage and training time statistics.

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Training time: {trainer_stats.metrics['train_runtime']/60:.2f} minutes")
print(f"Peak reserved memory: {used_memory} GB")
print(f"Peak memory for training: {used_memory_for_lora} GB")
print(f"Peak memory % of max: {used_percentage}%")
print(f"Training memory % of max: {lora_percentage}%")
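The memory figures are simple conversions and differences, which can be tried without a GPU. The sketch below repeats the same arithmetic on hypothetical byte counts; `memory_report` is an illustrative helper, not part of torch or Unsloth.

```python
def memory_report(start_bytes: int, peak_bytes: int, total_bytes: int) -> dict:
    """Convert byte counts to GiB and derive the training-time share."""
    gib = 1024 ** 3
    start, peak, total = (round(b / gib, 3) for b in (start_bytes, peak_bytes, total_bytes))
    training = round(peak - start, 3)
    return {
        "peak_gb": peak,
        "training_gb": training,
        "peak_pct": round(peak / total * 100, 3),
        "training_pct": round(training / total * 100, 3),
    }

# Hypothetical numbers: 1 GiB reserved before training, 3 GiB peak, 16 GiB card
report = memory_report(1 * 1024**3, 3 * 1024**3, 16 * 1024**3)
print(report)
```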

## Inference - Test the Model

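Run the fine-tuned model on a GSM8K problem as a quick sanity check. The problem comes from the training split, so it shows whether the model picked up the answer format rather than how well it generalizes.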

In [None]:
# Test with a sample problem; note that it comes from the training split
test_problem = dataset[100]["question"]  # map() keeps the original columns

messages = [
    {
        "role": "user",
        "content": f"{SYSTEM_PROMPT}\n\nProblem: {test_problem}"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
).removeprefix('<bos>')

print("Input prompt:")
print(text)
print("\n" + "="*50 + "\n")
print("Model response:")

In [None]:
from transformers import TextStreamer

# Generate response with streaming
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    top_k=64,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
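The three sampling parameters each narrow or reshape the next-token distribution. The following is a simplified sketch on toy logits, applying temperature, then top-k, then top-p; it illustrates the idea and is not the code path inside `transformers`. The function name and the logits are hypothetical.

```python
import math

def filtered_distribution(logits, temperature=0.7, top_k=64, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Top-k: keep the k most likely token ids
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    top_k_total = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / top_k_total
        if cumulative >= top_p:
            break
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_distribution(toy_logits, temperature=1.0, top_k=2, top_p=1.0))
```

For checking arithmetic, a lower temperature or greedy decoding tends to give more repeatable answers than sampling.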

## Test with a Custom Problem

Try the model with a new math problem to evaluate generalization.

In [None]:
custom_problem = """A bakery sells cupcakes for $3 each and cookies for $2 each. 
On Monday, they sold 45 cupcakes and 80 cookies. 
On Tuesday, they sold 60 cupcakes and 55 cookies. 
How much total revenue did the bakery make over these two days?"""

messages = [
    {
        "role": "user",
        "content": f"{SYSTEM_PROMPT}\n\nProblem: {custom_problem}"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
).removeprefix('<bos>')

print("Custom problem response:")
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    top_k=64,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
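To judge the model's response, it helps to have the reference answer for the custom problem at hand. A short calculation using the numbers from the problem statement:

```python
prices = {"cupcake": 3, "cookie": 2}
sales = {
    "Monday": {"cupcake": 45, "cookie": 80},
    "Tuesday": {"cupcake": 60, "cookie": 55},
}

# Revenue per day = sum of price * quantity over both items
daily = {
    day: sum(prices[item] * count for item, count in sold.items())
    for day, sold in sales.items()
}
total = sum(daily.values())
print(daily, total)
```

A correct response should therefore end with `#### 585`.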

## Save the Model

Save the fine-tuned LoRA adapters locally. You can also push to Hugging Face Hub.

In [None]:
# Save LoRA adapters locally
model.save_pretrained("gemma3-270m-gsm8k-lora")
tokenizer.save_pretrained("gemma3-270m-gsm8k-lora")

print("Model saved to: gemma3-270m-gsm8k-lora/")

In [None]:
# Uncomment to push to Hugging Face Hub
# model.push_to_hub("your_username/gemma3-270m-gsm8k-lora", token="hf_...")
# tokenizer.push_to_hub("your_username/gemma3-270m-gsm8k-lora", token="hf_...")

## Save Merged Model (Optional)

Merge the LoRA adapters into the base model and save the result either as 16-bit weights or as a GGUF file.

In [None]:
# Merge to 16-bit (for vLLM deployment)
if False:  # Change to True to save
    model.save_pretrained_merged(
        "gemma3-270m-gsm8k-merged",
        tokenizer,
        save_method="merged_16bit"
    )

# Save as GGUF for llama.cpp
if False:  # Change to True to save
    model.save_pretrained_gguf(
        "gemma3-270m-gsm8k-gguf",
        tokenizer,
        quantization_method="Q8_0",  # Q8_0, BF16, or F16
    )
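Merging folds each adapter into the weight it modifies. In standard LoRA (`use_rslora=False`, as configured earlier) the merged weight is `W + (alpha / r) * B @ A`. The sketch below shows that update on tiny matrices with plain Python lists; it illustrates the formula and is not the code Unsloth runs. `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, with B of shape (d_out x r) and A of shape (r x d_in)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]

W = [[1, 0], [0, 1]]   # frozen base weight (toy 2x2)
B = [[1], [2]]         # d_out x r, with r = 1
A = [[3, 4]]           # r x d_in
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged formats can be served without the PEFT library.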

## Load Saved Model (Optional)

Load the saved LoRA adapters for inference.

In [None]:
if False:  # Change to True to load saved model
    from unsloth import FastModel
    
    model, tokenizer = FastModel.from_pretrained(
        model_name="gemma3-270m-gsm8k-lora",
        max_seq_length=2048,
        load_in_4bit=False,
    )
    print("Model loaded successfully!")

## Summary

This notebook demonstrated:

1. **Loading** the Gemma-3 270M model with Unsloth optimizations
2. **Configuring** LoRA adapters for efficient fine-tuning
3. **Preparing** the GSM8K dataset for math reasoning training
4. **Training** with response-only masking, so the loss covers only the solutions
5. **Testing** the model on math word problems
6. **Saving** the fine-tuned model in various formats

### Next Steps

- Increase `max_steps`, or remove it and set `num_train_epochs=1`, for a full pass over the training data
- Evaluate on the GSM8K test set for benchmark comparison
- Try GRPO training, which applies reinforcement learning with reward functions such as answer correctness
- Deploy with vLLM or llama.cpp for production inference
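Evaluating on the GSM8K test set comes down to comparing the number after `####` in each generated answer with the reference. A minimal sketch of that scoring step follows; the helper names and the sample strings are made up, and generating the predictions is left out.

```python
import re

def final_number(text: str):
    """Return the number after the last '####' as a float, or None if absent."""
    matches = re.findall(r"####\s*\$?\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions whose final number equals the reference's."""
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = final_number(pred), final_number(ref)
        correct += int(p is not None and p == r)
    return correct / len(references)

# Made-up outputs: one correct, one without the marker, one with a comma
predictions = ["48 + 24 = 72\n#### 72", "I think it is 10", "#### 1,000"]
references = ["#### 72", "#### 10", "#### 1000"]
accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.3f}")
```

A response that omits the `####` marker counts as wrong here, which is a strict choice; a more lenient scorer could fall back to the last number in the text.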