## Installation

Install required packages for Unsloth and training.

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

## Load Model

Load the Gemma-3 270M model using Unsloth's FastModel for efficient inference and training.

In [2]:
from unsloth import FastModel
import torch

max_seq_length = 2048  # Sufficient for GSM8K problems

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",
    max_seq_length=max_seq_length,
    load_in_4bit=False,   # Full precision for small model
    load_in_8bit=False,
    full_finetuning=False,  # Use LoRA for efficiency
    # token = "hf_...",  # Uncomment if using gated models
)

ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.4: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


model.safetensors:   0%|          | 0.00/536M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/233 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

## Configure LoRA Adapters

Add LoRA adapters for parameter-efficient fine-tuning. This allows us to update only a small fraction of the model's parameters.

In [3]:
model = FastModel.get_peft_model(
    model,
    r=128,  # LoRA rank - higher values = more capacity but more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=128,
    lora_dropout=0,  # No dropout for optimized training
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth: Making `model.base_model.model.model` require gradients


## Setup Chat Template

Configure the Gemma-3 chat template for proper conversation formatting.

In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma3",
)

## Load GSM8K Dataset

Load the GSM8K dataset which contains grade school math word problems with step-by-step solutions.

In [None]:
from datasets import load_dataset

# Load UltraChat conversation dataset
# Using train_sft split which is curated for supervised fine-tuning
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:10000]")

print(f"Dataset size: {len(dataset)} examples")
print(f"\nSample conversation:")
print(f"Messages: {dataset[0]['messages'][:2]}...")  # Show first 2 turns

README.md: 0.00B [00:00, ?B/s]

main/train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

main/test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Dataset size: 7473 examples

Sample problem:
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72


## Format Dataset for Training

Convert UltraChat conversations to the Gemma-3 chat format. The dataset already has proper user/assistant turn structure.

In [None]:
def convert_ultrachat_to_gemma(example):
    """Convert UltraChat messages to Gemma-3 conversation format."""
    conversations = []
    for msg in example['messages']:
        role = msg['role']
        # Map 'assistant' to 'model' for Gemma format if needed
        if role == 'system':
            # Prepend system message to first user message or skip
            continue
        conversations.append({
            "role": role if role == "user" else "assistant",
            "content": msg['content']
        })
    return {"conversations": conversations}

# Convert dataset
dataset = dataset.map(convert_ultrachat_to_gemma)

# Preview converted example

print("Converted example:")    print(turn['content'][:200] + "..." if len(turn['content']) > 200 else turn['content'])

for i, turn in enumerate(dataset[0]["conversations"][:4]):    print(f"\nTurn {i+1} ({turn['role']}):")

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Converted example:
[{'content': 'You are a helpful math tutor. Solve the given math problem step by step.\nShow your reasoning clearly and provide the final numerical answer at the end.\nFormat your final answer as: #### [number]\n\nProblem: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'role': 'user'}, {'content': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72', 'role': 'assistant'}]


## Apply Chat Template

Apply the Gemma-3 chat template to format conversations for training.

In [7]:
def formatting_prompts_func(examples):
    """Apply chat template to conversations."""
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

# Preview formatted text
print("Formatted training example:")
print(dataset[0]['text'][:500] + "...")

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Formatted training example:
<start_of_turn>user
You are a helpful math tutor. Solve the given math problem step by step.
Show your reasoning clearly and provide the final numerical answer at the end.
Format your final answer as: #### [number]

Problem: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<end_of_turn>
<start_of_turn>model
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 c...


## Configure Training

Set up the SFT trainer with optimized hyperparameters for conversational fine-tuning.

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=None,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,  # Effective batch size = 8
        warmup_steps=10,
        max_steps=200,  # Adjust for full training: num_train_epochs=1
        learning_rate=5e-5,
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs_conversation",
        report_to="none",  # Use "wandb" for Weights & Biases logging
        save_steps=100,
        save_total_limit=2,
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/7473 [00:00<?, ? examples/s]

## Enable Response-Only Training

Train only on the assistant's responses (math solutions), not on the user's questions. This improves training efficiency and model accuracy.

In [9]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)

Map (num_proc=6):   0%|          | 0/7473 [00:00<?, ? examples/s]

## Verify Training Masking

Confirm that only the model's responses are being trained on (unmasked).

In [10]:
# Show full input
print("Full input:")
print(tokenizer.decode(trainer.train_dataset[0]["input_ids"])[:500] + "...")

print("\n" + "="*50 + "\n")

# Show what the model is trained on (masked version)
print("Training target (only assistant response):")
masked_labels = [tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[0]["labels"]]
print(tokenizer.decode(masked_labels).replace(tokenizer.pad_token, "")[:500] + "...")

Full input:
<bos><start_of_turn>user
You are a helpful math tutor. Solve the given math problem step by step.
Show your reasoning clearly and provide the final numerical answer at the end.
Format your final answer as: #### [number]

Problem: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<end_of_turn>
<start_of_turn>model
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>...


Training target (only assistant response):
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<end_of_turn>
...


## Check Memory Usage

Display current GPU memory statistics before training.

In [11]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU: {gpu_stats.name}")
print(f"Max memory: {max_memory} GB")
print(f"Reserved memory: {start_gpu_memory} GB")

GPU: Tesla T4
Max memory: 14.741 GB
Reserved memory: 0.832 GB


## Train the Model

Start the fine-tuning process. This will train the model on conversational data.

In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 30,375,936 of 298,474,112 (10.18% trained)


Step,Training Loss
10,1.6819
20,1.0178
30,0.9191
40,0.7881
50,0.8156
60,0.8124
70,0.7862
80,0.7663
90,0.7274
100,0.7506


Unsloth: Will smartly offload gradients to save VRAM!


## Training Statistics

Display final memory usage and training time statistics.

In [13]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Training time: {trainer_stats.metrics['train_runtime']/60:.2f} minutes")
print(f"Peak reserved memory: {used_memory} GB")
print(f"Peak memory for training: {used_memory_for_lora} GB")
print(f"Peak memory % of max: {used_percentage}%")
print(f"Training memory % of max: {lora_percentage}%")

Training time: 182.16 seconds
Training time: 3.04 minutes
Peak reserved memory: 3.234 GB
Peak memory for training: 2.402 GB
Peak memory % of max: 21.939%
Training memory % of max: 16.295%


## Inference - Test the Model

Test the fine-tuned model on a conversation to see how it responds to various queries.

In [None]:
# Test with a sample conversation from the dataset
test_conversation = dataset[100]['conversations'][0]['content'] if dataset[100]['conversations'] else "Hello, how are you?"

messages = [
    {
        "role": "user",
        "content": test_conversation
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
).removeprefix('<bos>')

print("Input prompt:")
print(text)
print("\n" + "="*50 + "\n")
print("Model response:")

Input prompt:
<start_of_turn>user
You are a helpful math tutor. Solve the given math problem step by step.
Show your reasoning clearly and provide the final numerical answer at the end.
Format your final answer as: #### [number]

Problem: A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?<end_of_turn>
<start_of_turn>model



Model response:


In [15]:
from transformers import TextStreamer

# Generate response with streaming
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    top_k=64,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

The craft store made 36 x 12 = <<36*12=412>>412 sales in the fabric section.
The craft store made 412 / 3 = <<412/3=133>>133 sales in the jewelry section.
The craft store made 133 - 36 - 412 = <<133-36-412=1458>>1458 sales in the stationery section.
#### 1458<end_of_turn>


## Test with Custom Conversations

Try the model with various conversation topics to evaluate its conversational abilities.

In [None]:
# Test various conversation topics
test_prompts = [
    "What are some tips for learning a new programming language?",
    "Can you explain the difference between machine learning and deep learning?",
    "Write a short poem about artificial intelligence.",
]

for i, prompt in enumerate(test_prompts):
    print(f"\n{'='*50}")
    print(f"Test {i+1}: {prompt}")
    print("="*50)
    
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    ).removeprefix('<bos>')
    
    _ = model.generate(
        **tokenizer(text, return_tensors="pt").to("cuda"),
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.95,
        top_k=64,
        streamer=TextStreamer(tokenizer, skip_prompt=True),
    )
    print()

Custom problem response:
On Tuesday, the bakery sold 60 cupcakes - 45 cupcakes = <<60-45=15>>15 cupcakes.
The total number of cupcakes sold on Tuesday is 15 cupcakes + 80 cupcakes = <<15+80=95>>95 cupcakes.
The total number of cookies sold on Tuesday is 15 cupcakes + 55 cupcakes = <<15+55=70>>70 cupcakes.
The total number of cupcakes sold on Tuesday is 95 cupcakes + 70 cupcakes = <<95+70=165>>165 cupcakes.
The total number of cupcakes sold on Tuesday is 165 cupcakes + 3 cupcakes = <<165+3=168>>168 cupcakes.
The total number of cupcakes sold on Tuesday is 168 cupcakes + 3 cupcakes = <<168+3=171>>171 cupcakes.
The total number of cupcakes sold on Tuesday is 171 cupcakes + 2 cupcakes = <<171+2=173>>173 cupcakes.
The total number of cupcakes sold on Tuesday is 173 cupcakes + 2 cupcakes = <<173+2=175>>175 cupcakes.
The total number of cupcakes sold on Tuesday is 175 cupcakes + 2 cupcakes = <<175+2=177>>177 cupcakes.
The total number of cupcakes sold on Tuesday is 177 cupcakes + 2 cupcakes =

## Save the Model

Save the fine-tuned LoRA adapters locally. You can also push to Hugging Face Hub.

In [None]:
# Save LoRA adapters locally
model.save_pretrained("gemma3-270m-conversation-lora")
tokenizer.save_pretrained("gemma3-270m-conversation-lora")

print("Model saved to: gemma3-270m-conversation-lora/")

In [None]:
# Uncomment to push to Hugging Face Hub
# model.push_to_hub("your_username/gemma3-270m-conversation-lora", token="hf_...")
# tokenizer.push_to_hub("your_username/gemma3-270m-conversation-lora", token="hf_...")

## Save Merged Model (Optional)

Merge LoRA adapters with base model and save in different formats.

In [None]:
# Merge to 16bit (for VLLM deployment)
if False:  # Change to True to save
    model.save_pretrained_merged(
        "gemma3-270m-conversation-merged",
        tokenizer,
        save_method="merged_16bit"
    )

# Save as GGUF for llama.cpp
if False:  # Change to True to save
    model.save_pretrained_gguf(
        "gemma3-270m-conversation-gguf",
        tokenizer,
        quantization_method="Q8_0",  # Q8_0, BF16, or F16
    )

## Load Saved Model (Optional)

Load the saved LoRA adapters for inference.

In [None]:
if False:  # Change to True to load saved model
    from unsloth import FastModel
    
    model, tokenizer = FastModel.from_pretrained(
        model_name="gemma3-270m-conversation-lora",
        max_seq_length=2048,
        load_in_4bit=False,
    )
    print("Model loaded successfully!")

## Summary

This notebook demonstrated:

1. **Loading** the Gemma-3 270M model with Unsloth optimizations
2. **Configuring** LoRA adapters for efficient fine-tuning
3. **Preparing** the UltraChat conversation dataset for training
4. **Training** with response-only masking for improved accuracy
5. **Testing** the model on various conversation topics
6. **Saving** the fine-tuned model in various formats

### Dataset Used

- **HuggingFaceH4/ultrachat_200k** - A high-quality multi-turn conversation dataset
- Contains diverse topics and helpful AI assistant responses
- Great for training general-purpose conversational abilities


### Next Steps- Deploy with VLLM or llama.cpp for production inference

- Fine-tune on domain-specific conversations for specialized assistants

- Increase `max_steps` or use `num_train_epochs=1` for full training- Try other conversation datasets like `OpenAssistant/oasst1` or `databricks/dolly-15k`