<a href="https://colab.research.google.com/github/HtmMhmd/fine-tuning-examples/blob/main/gemma3_grpo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRPO Fine-tuning with Gemma 3 (1B): Compact Model Optimization

## Overview

This notebook demonstrates **GRPO (Group Relative Policy Optimization)** fine-tuning using the compact **Gemma 3 1B** model. Compared with the larger DeepSeek R1 8B model, Gemma 3 1B presents its own challenges and opportunities:

### Why Gemma 3 1B?
- **Resource Efficiency**: Fits on consumer GPUs with limited VRAM
- **Fast Iteration**: Quick training cycles for experimentation
- **Edge Deployment**: Suitable for mobile and edge computing scenarios
- **Educational Value**: Easier to understand and debug due to smaller scale

### Key Learning Objectives
1. **Compact Model Optimization**: How to maximize performance from small models
2. **GRPO Scaling**: Understanding how GRPO behaves with different model sizes
3. **Reward Function Adaptation**: Adjusting reward systems for smaller capacity models
4. **Resource Constraint Management**: Working within memory and compute limitations

### Technical Challenges
- **Limited Capacity**: 1B parameters vs 8B in DeepSeek R1
- **Knowledge Constraints**: Smaller models have less world knowledge
- **Generalization**: Ensuring the model doesn't overfit to specific patterns
- **Language Consistency**: Maintaining Indonesian language usage with limited capacity

### Expected Outcomes
By the end of this notebook, you'll understand how GRPO can effectively fine-tune even compact models for specific tasks while maintaining efficiency and performance quality.

## 1. Environment Setup and Dependencies

### Memory-Optimized Configuration
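Because the model has only 1B parameters, it fits on a single GPU without quantization. The settings below target that case: one visible device, tokenizer parallelism disabled, and a capped allocator split size to limit memory fragmentation.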

In [1]:
!pip install -q unsloth

In [2]:
!pip install -q langid

In [3]:
# Memory-optimized imports for compact model training
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Single GPU is sufficient
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # Cap allocator block splitting to limit fragmentation

# Import unsloth before transformers and trl so its patches are applied first
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Core libraries
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOTrainer, GRPOConfig
import langid
import re
from typing import List, Dict, Any
import warnings
warnings.filterwarnings('ignore')

print("🚀 Gemma 3 1B GRPO Training Environment Ready!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
🚀 Gemma 3 1B GRPO Training Environment Ready!
PyTorch version: 2.7.1+cu126
CUDA available: True
GPU: Tesla T4


In [4]:

# Configure langid for Indonesian detection
langid.set_languages(['id', 'en'])  # Focus on Indonesian and English

def get_lang(text: str) -> str:
    """Detect language with langid, falling back to keyword matching"""
    # Clean text for better detection
    cleaned_text = text.strip().lower()

    # langid.classify returns (language, score). The score is an unnormalized
    # log-probability (negative in the test output below), not a 0-1 confidence.
    lang, confidence = langid.classify(cleaned_text)

    # With raw log-probability scores this threshold is rarely met,
    # so the keyword fallback below usually decides the result
    if confidence > 0.5:
        return lang

    # Fallback: Check for Indonesian keywords
    indonesian_keywords = [
        'adalah', 'untuk', 'dengan', 'dari', 'yang', 'dan', 'dalam', 'pada',
        'jawaban', 'hasil', 'solusi', 'persamaan', 'menyelesaikan', 'langkah',
        'pertama', 'kedua', 'terakhir', 'sama dengan', 'kita', 'perlu'
    ]

    # Count keywords that occur anywhere in the text.
    # Note: this is a substring test, so short keywords also match inside longer words.
    word_count = len(cleaned_text.split())
    indonesian_word_count = sum(1 for word in indonesian_keywords if word in cleaned_text)

    if word_count > 0 and (indonesian_word_count / word_count) > 0.2:
        return 'id'

    # Check for English keywords as secondary fallback
    english_keywords = ['the', 'and', 'to', 'of', 'a', 'is', 'for', 'solve', 'equation', 'answer']
    english_word_count = sum(1 for word in english_keywords if word in cleaned_text)

    if word_count > 0 and (english_word_count / word_count) > 0.2:
        return 'en'

    return 'unknown'

# Test the improved language detection
print("✅ Language detection configured for Indonesian enforcement")
print("\n🔍 Testing improved language detection:")

test_sentences = [
    "Untuk menyelesaikan persamaan 2x + 3 = 7, kita perlu mengurangi 3 dari kedua sisi.",
    "To solve 2x + 3 = 7, subtract 3 from both sides.",
    "Jawaban: x = 2. Langkah pertama kurangi 3, langkah kedua bagi dengan 2.",
    "The answer is 5"
]

for i, sentence in enumerate(test_sentences, 1):
    detected_lang = get_lang(sentence)
    raw_lang, raw_conf = langid.classify(sentence)
    print(f"Test {i}: {sentence[:50]}...")
    print(f"  Detected: {detected_lang} (raw: {raw_lang}, conf: {raw_conf:.3f})")
    print()

✅ Language detection configured for Indonesian enforcement

🔍 Testing improved language detection:
Test 1: Untuk menyelesaikan persamaan 2x + 3 = 7, kita per...
  Detected: id (raw: id, conf: -105.298)

Test 2: To solve 2x + 3 = 7, subtract 3 from both sides....
  Detected: en (raw: en, conf: -24.775)

Test 3: Jawaban: x = 2. Langkah pertama kurangi 3, langkah...
  Detected: id (raw: id, conf: -101.698)

Test 4: The answer is 5...
  Detected: en (raw: en, conf: -49.995)
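
The negative `conf` values above suggest that `langid.classify` reports an unnormalized log-probability score rather than a 0-1 confidence, so the `confidence > 0.5` check in `get_lang` would rarely pass and the keyword fallback would do most of the work. That fallback uses substring tests (`word in cleaned_text`), so a short keyword such as `'a'` matches any text containing that letter. The following is a minimal sketch of a whole-word alternative. The keyword sets are abbreviated and the helper names `keyword_ratio` and `fallback_lang` are hypothetical; none of this is part of the notebook's code.

```python
import re

# Abbreviated, illustrative keyword sets (assumptions, not the notebook's full lists)
INDONESIAN_WORDS = {"adalah", "untuk", "dengan", "dari", "yang", "dan",
                    "kita", "jawaban", "hasil", "langkah"}
ENGLISH_WORDS = {"the", "and", "to", "of", "a", "is", "for",
                 "solve", "equation", "answer"}


def keyword_ratio(text, vocabulary):
    """Fraction of alphabetic tokens that are exact members of `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


def fallback_lang(text, threshold=0.2):
    """Pick 'id' or 'en' by whole-word keyword ratio, or 'unknown' if both are low."""
    id_ratio = keyword_ratio(text, INDONESIAN_WORDS)
    en_ratio = keyword_ratio(text, ENGLISH_WORDS)
    if max(id_ratio, en_ratio) <= threshold:
        return "unknown"
    return "id" if id_ratio >= en_ratio else "en"


print(fallback_lang("Jawaban dari persamaan ini adalah dua"))
print(fallback_lang("The answer is 5"))
print(keyword_ratio("banana", ENGLISH_WORDS))  # whole-word match: 'a' does not count
```

Matching on tokens keeps the ratio meaningful for long responses, where substring hits on short keywords would otherwise inflate or distort the count.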



## 2. Gemma 3 1B Model Loading and Configuration

### Compact Model Optimization Strategy
For the 1B model, this notebook uses a LoRA rank at the upper end of the usual range and a comparatively high learning rate. Each run is short, so both choices are cheap to revisit if training becomes unstable or the model loses prior abilities.

In [5]:
# Gemma 3 1B model configuration
model_name = "unsloth/gemma-3-1b-it"  # Gemma 3 1B instruction-tuned
max_seq_length = 2048  # Sufficient for mathematical problems

print("🔄 Loading Gemma 3 1B model with Unsloth optimizations...")

# Load the base model without quantization; LoRA adapters are attached below
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=False,      # Set True for 4-bit quantization to reduce memory
    load_in_8bit=False,      # Set True for 8-bit loading: more accurate than 4-bit, about 2x its memory
    full_finetuning=False,   # Keep False to train LoRA adapters only
    device_map="auto",
)

print(f"✅ Model loaded: {model_name}")
print(f"📊 Model parameters: ~1B")
# print(f"💾 Memory optimization: 4-bit quantization + FP16")

# Configure LoRA; a relatively high rank gives the adapters more capacity on a small base model
lora_rank = 32
lora_alpha = 64
lora_dropout = 0.1
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    # "embed_tokens", "lm_head"  # Optionally adapt embeddings as well
]

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,                  # Upper end of the commonly used 16-32 range
    target_modules=target_modules,
    lora_alpha=lora_alpha,        # Scaling factor alpha / r = 2
    lora_dropout=lora_dropout,    # Nonzero dropout disables Unsloth's fast LoRA patching (see log below)
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Report the values actually passed to get_peft_model
print("🔧 LoRA Configuration for Compact Model:")
print(f"  • Rank (r): {lora_rank}")
print(f"  • Alpha: {lora_alpha} (scaling alpha / r = {lora_alpha // lora_rank})")
print(f"  • Target modules: {len(target_modules)} projection types")
print(f"  • Dropout: {lora_dropout}")

# Configure chat template for Gemma
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma",  # Use Gemma-specific template
    map_eos_token=True,
)

print("✅ Gemma 3 1B model ready for GRPO training!")

# Model statistics
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"📈 Total parameters: {total_params:,}")
print(f"🎯 Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")

🔄 Loading Gemma 3 1B model with Unsloth optimizations...
==((====))==  Unsloth 2025.7.8: Fast Gemma3 patching. Transformers: 4.53.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
✅ Model loaded: unsloth/gemma-3-1b-it
📊 Model parameters: ~1B


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.


Unsloth: Making `model.base_model.model.model` require gradients


Unsloth: Will map <end_of_turn> to EOS = <end_of_turn>.


🔧 LoRA Configuration for Compact Model:
  • Rank (r): 32
  • Alpha: 64 (scaling alpha / r = 2)
  • Target modules: 7 projection types
  • Dropout: 0.1
✅ Gemma 3 1B model ready for GRPO training!
📈 Total parameters: 1,025,977,472
🎯 Trainable parameters: 26,091,520 (2.54%)
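
The trainable-parameter count in the log can be checked by hand. A LoRA adapter of rank `r` on a linear layer with input width `d_in` and output width `d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the Gemma 3 1B layer shapes listed in `GEMMA3_1B_SHAPES` (hidden size 1152, MLP width 6912, query width 1024, key/value width 256, 26 layers). These shapes are not printed anywhere in the notebook, so treat them as assumptions; with `r=32` they reproduce the logged 26,091,520.

```python
# Assumed (d_in, d_out) for each adapted projection in one Gemma 3 1B decoder layer
GEMMA3_1B_SHAPES = {
    "q_proj": (1152, 1024),
    "k_proj": (1152, 256),
    "v_proj": (1152, 256),
    "o_proj": (1024, 1152),
    "gate_proj": (1152, 6912),
    "up_proj": (1152, 6912),
    "down_proj": (6912, 1152),
}
NUM_LAYERS = 26  # assumed number of decoder layers


def lora_param_count(rank, shapes, num_layers):
    """Total LoRA parameters: rank * (d_in + d_out) per adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


print(f"r=32: {lora_param_count(32, GEMMA3_1B_SHAPES, NUM_LAYERS):,}")
print(f"r=64: {lora_param_count(64, GEMMA3_1B_SHAPES, NUM_LAYERS):,}")
```

Under these assumptions the logged count corresponds to rank 32, the value passed to `get_peft_model`, and doubling the rank would double the adapter size.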


## 3. Compact Model Reward System Design

### Adaptation for Limited Capacity

With only 1B parameters, we need to design a more focused reward system that doesn't overwhelm the model's limited capacity. The key is **reward simplification** while maintaining effectiveness.

### Design Principles for Small Models
1. **Fewer Competing Objectives**: Reduce cognitive load on the model
2. **Clearer Reward Signals**: More distinct positive/negative feedback
3. **Progressive Complexity**: Start simple, gradually increase sophistication
4. **Emphasis on Critical Features**: Focus on the most important behaviors first
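
One way to make a reward signal clearer (principle 2) is to make it verifiable, for example by comparing the final number in a response with a known reference answer. The following is a minimal sketch of that idea, not the reward used in this notebook. The helper names `last_number` and `numeric_reward` are hypothetical, and the regular expression assumes simple decimals written with either a point or an Indonesian-style comma.

```python
import re
from typing import Optional

# Matches integers and simple decimals such as 12, 28.27 or 0,25
NUMBER_PATTERN = re.compile(r"-?\d+(?:[.,]\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number mentioned in `text`, or None if there is none."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", "."))


def numeric_reward(response: str, reference: str, tol: float = 1e-6) -> float:
    """1.0 if the final numbers of response and reference agree, else 0.0."""
    predicted = last_number(response)
    expected = last_number(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0


reference = "The solution to 2x + 5 = 15 is x = 5."
print(numeric_reward("Jawaban: x = 5.", reference))
print(numeric_reward("Jawaban: x = 4.", reference))
```

A check like this only looks at the final number, so it would need extending for answers such as fractions or expressions.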

In [6]:
# Simplified reward system optimized for 1B parameter model
print("🎯 Designing compact model reward system...")

# Indonesian system prompt - simplified for better adherence
system_prompt = """Anda adalah asisten matematika yang menjawab dalam bahasa Indonesia.
Berikan penjelasan yang jelas dan mudah dipahami."""

class CompactModelRewardSystem:
    """Simplified reward system designed for small language models"""

    def __init__(self):
        self.weights = {
            'language': 0.6,      # Primary focus: Indonesian usage
            'correctness': 0.3,   # Secondary: mathematical accuracy
            'clarity': 0.1        # Tertiary: response clarity
        }

    def calculate_language_reward(self, response: str) -> float:
        """Primary reward: Indonesian language usage"""
        lang = get_lang(response)

        if lang == 'id':
            return 1.0  # Perfect score for Indonesian
        elif lang == 'en':
            return -0.5  # Penalty for English
        else:
            return -0.8  # Larger penalty when the language cannot be determined ('unknown')

    def calculate_correctness_reward(self, response: str, question: str) -> float:
        """Proxy for correctness: counts math-reasoning keywords; the answer itself is not verified"""
        response_lower = response.lower()

        # Basic mathematical indicators
        math_indicators = [
            'jawaban', 'hasil', 'solusi', '=', 'sama dengan',
            'langkah', 'pertama', 'kedua', 'terakhir'
        ]

        # Count mathematical reasoning indicators
        indicator_count = sum(1 for indicator in math_indicators
                            if indicator in response_lower)

        # Simple scoring based on presence of mathematical language
        if indicator_count >= 3:
            return 0.8  # Good mathematical reasoning
        elif indicator_count >= 2:
            return 0.5  # Moderate reasoning
        elif indicator_count >= 1:
            return 0.2  # Minimal reasoning
        else:
            return -0.3  # No mathematical reasoning detected

    def calculate_clarity_reward(self, response: str) -> float:
        """Simplified clarity assessment"""
        # Basic clarity indicators
        words = response.split()
        sentences = response.split('.')

        # Optimal length range for 1B model responses
        if 20 <= len(words) <= 100:
            length_score = 0.5
        elif 10 <= len(words) <= 150:
            length_score = 0.2
        else:
            length_score = -0.2

        # Sentence structure bonus
        avg_sentence_length = len(words) / max(len(sentences), 1)
        if 5 <= avg_sentence_length <= 20:
            structure_score = 0.3
        else:
            structure_score = 0.0

        return length_score + structure_score

    def compute_reward(self, query: str, response: str, **kwargs) -> Dict[str, float]:
        """Compute final reward with detailed breakdown"""
        # Calculate individual rewards
        lang_reward = self.calculate_language_reward(response)
        correct_reward = self.calculate_correctness_reward(response, query)
        clarity_reward = self.calculate_clarity_reward(response)

        # Weighted combination
        total_reward = (
            self.weights['language'] * lang_reward +
            self.weights['correctness'] * correct_reward +
            self.weights['clarity'] * clarity_reward
        )

        return {
            'total': float(total_reward),
            'language': float(lang_reward),
            'correctness': float(correct_reward),
            'clarity': float(clarity_reward),
            'breakdown': {
                'detected_language': get_lang(response),
                'response_length': len(response.split()),
                'weights_used': self.weights
            }
        }

# Initialize the compact reward system
reward_system = CompactModelRewardSystem()

print("✅ Compact Model Reward System Initialized")
print(f"📊 Reward Weights: {reward_system.weights}")
print("🎯 Focus: Indonesian language usage (60% weight)")
print("🧮 Secondary: Mathematical correctness (30% weight)")
print("📝 Tertiary: Response clarity (10% weight)")

# Test the reward system
test_responses = [
    "Untuk menyelesaikan persamaan 2x + 3 = 7, kita perlu mengurangi 3 dari kedua sisi.",
    "To solve 2x + 3 = 7, subtract 3 from both sides.",
    "Jawaban: x = 2. Langkah pertama kurangi 3, langkah kedua bagi dengan 2."
]

print("\n🧪 Testing Reward System:")
for i, response in enumerate(test_responses, 1):
    reward = reward_system.compute_reward("Solve 2x + 3 = 7", response)
    print(f"\nTest {i}: {response[:50]}...")
    print(f"  Total Reward: {reward['total']:.3f}")
    print(f"  Language: {reward['language']:.3f} ({reward['breakdown']['detected_language']})")
    print(f"  Correctness: {reward['correctness']:.3f}")
    print(f"  Clarity: {reward['clarity']:.3f}")

🎯 Designing compact model reward system...
✅ Compact Model Reward System Initialized
📊 Reward Weights: {'language': 0.6, 'correctness': 0.3, 'clarity': 0.1}
🎯 Focus: Indonesian language usage (60% weight)
🧮 Secondary: Mathematical correctness (30% weight)
📝 Tertiary: Response clarity (10% weight)

🧪 Testing Reward System:

Test 1: Untuk menyelesaikan persamaan 2x + 3 = 7, kita per...
  Total Reward: 0.800
  Language: 1.000 (id)
  Correctness: 0.500
  Clarity: 0.500

Test 2: To solve 2x + 3 = 7, subtract 3 from both sides....
  Total Reward: -0.190
  Language: -0.500 (en)
  Correctness: 0.200
  Clarity: 0.500

Test 3: Jawaban: x = 2. Langkah pertama kurangi 3, langkah...
  Total Reward: 0.860
  Language: 1.000 (id)
  Correctness: 0.800
  Clarity: 0.200


In [7]:
reward

{'total': 0.86,
 'language': 1.0,
 'correctness': 0.8,
 'clarity': 0.2,
 'breakdown': {'detected_language': 'id',
  'response_length': 13,
  'weights_used': {'language': 0.6, 'correctness': 0.3, 'clarity': 0.1}}}
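
The totals in the test output follow directly from the weights. As a worked check, the sketch below recomputes the three recorded totals from their component scores and derives the best and worst totals the component functions can produce. The helper name `weighted_total` is hypothetical.

```python
WEIGHTS = {"language": 0.6, "correctness": 0.3, "clarity": 0.1}


def weighted_total(components, weights=WEIGHTS):
    """Weighted sum of the component rewards."""
    return sum(weights[name] * components[name] for name in weights)


# Component scores and totals copied from the test output above
recorded = [
    ({"language": 1.0, "correctness": 0.5, "clarity": 0.5}, 0.800),
    ({"language": -0.5, "correctness": 0.2, "clarity": 0.5}, -0.190),
    ({"language": 1.0, "correctness": 0.8, "clarity": 0.2}, 0.860),
]
for components, expected in recorded:
    print(f"computed {weighted_total(components):.3f}, recorded {expected:.3f}")

# Extremes implied by the component functions:
# clarity ranges from -0.2 (bad length, no structure bonus) to 0.8 (0.5 + 0.3)
best = weighted_total({"language": 1.0, "correctness": 0.8, "clarity": 0.8})
worst = weighted_total({"language": -0.8, "correctness": -0.3, "clarity": -0.2})
best_english = weighted_total({"language": -0.5, "correctness": 0.8, "clarity": 0.8})
print(f"best {best:.2f}, worst {worst:.2f}, best English response {best_english:.2f}")
```

Because language carries 60% of the weight, a response detected as English can reach at most about 0.02, whatever its other scores.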

## 4. Dataset Preparation for Compact Models

### Curated Learning Strategy
For 1B models, we need **high-quality, focused datasets** rather than large volumes of data. The model's limited capacity requires careful curation to avoid confusion and ensure effective learning.
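
GRPO samples its own completions during training, so only the prompt side of each conversation needs to reach the trainer. A reference answer, where one exists, is more useful for scoring than as model input. The following is a minimal sketch of splitting a conversation accordingly. The helper name `split_example` and the output keys are hypothetical, and the system prompt is shortened for the example.

```python
def split_example(messages):
    """Separate the prompt messages from the last assistant (reference) answer."""
    prompt_messages = [m for m in messages if m["role"] != "assistant"]
    reference = next(
        (m["content"] for m in reversed(messages) if m["role"] == "assistant"),
        None,
    )
    return {"prompt_messages": prompt_messages, "reference": reference}


example = [
    {"role": "system", "content": "Anda adalah asisten matematika."},
    {"role": "user", "content": "What is 15% of 80?"},
    {"role": "assistant", "content": "15% of 80 is 12."},
]
split = split_example(example)
print([m["role"] for m in split["prompt_messages"]])
print(split["reference"])
```

The prompt messages can then be passed to the chat template with `add_generation_prompt=True`, which leaves the model turn open for sampling.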

In [8]:

# Focused dataset for 1B model training
from datasets import Dataset
print("📚 Preparing focused dataset for Gemma 3 1B...")

# Smaller, more focused dataset for compact models
training_data = [
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Solve the equation 2x + 5 = 15"},
            {"role": "assistant", "content": "The solution to 2x + 5 = 15 is x = 5."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Find the area of a circle with radius 3"},
            {"role": "assistant", "content": "The area of a circle with radius 3 is approximately 28.27 square units."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What is 15% of 80?"},
            {"role": "assistant", "content": "15% of 80 is 12."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Solve x^2 - 4x + 4 = 0"},
            {"role": "assistant", "content": "The solution to x^2 - 4x + 4 = 0 is x = 2."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Convert 2.5 hours to minutes"},
            {"role": "assistant", "content": "2.5 hours is equal to 150 minutes."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Find the perimeter of a rectangle with length 8 and width 5"},
            {"role": "assistant", "content": "The perimeter of a rectangle with length 8 and width 5 is 26 units."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What is the square root of 64?"},
            {"role": "assistant", "content": "The square root of 64 is 8."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Simplify the fraction 15/25"},
            {"role": "assistant", "content": "The simplified fraction 15/25 is 3/5."}
        ]
    },
]

print(f"📊 Dataset size: {len(training_data)} examples")
print("🎯 Focus areas: Basic algebra, geometry, arithmetic")
print("🇮🇩 Language target: Indonesian responses")

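# Convert to prompt-only text format for GRPO
def format_training_data(data):
    """Format data for GRPO training"""
    formatted_data = []

    for item in data:
        # GRPO samples its own completions, so drop the reference assistant turn
        # and apply the chat template to the system and user messages only
        prompt_messages = [m for m in item["messages"] if m["role"] != "assistant"]
        text = tokenizer.apply_chat_template(
            prompt_messages,
            add_generation_prompt=True,
            tokenize=False,
        )
        formatted_data.append({"prompt": text})

    return formatted_data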

formatted_prompts = format_training_data(training_data)
train_dataset = Dataset.from_list(formatted_prompts)

print(f"✅ {len(train_dataset)} prompts formatted for GRPO training")
print("\n📝 Sample formatted prompt:")
print(train_dataset[0]['prompt'][:200] + "...")

# Compact model training configuration
print("\n⚙️ GRPO Configuration for Compact Model:")

max_prompt_length = 256

grpo_config = GRPOConfig(
    # Learning parameters optimized for 1B model
    learning_rate=3e-4,           # Higher LR for faster adaptation
    num_train_epochs=5,           # More epochs for smaller dataset
    per_device_train_batch_size=4, # Small per-device batch for memory efficiency
    gradient_accumulation_steps=8, # Effective batch size = 4 × 8 = 32

    # GRPO specific parameters
    beta=0.01,                    # Small KL coefficient: weak penalty for drifting from the reference policy
    num_generations=4,            # Completions sampled per prompt (the GRPO group size)

    # Optimization for compact models
    warmup_steps=10,             # Warmup length; longer than the 5 total steps reported in the training log
    logging_steps=1,             # Frequent logging for monitoring
    save_steps=50,               # Checkpoint interval
    eval_steps=25,               # Evaluation interval; unused here because no eval dataset is passed

    # Data loading
    dataloader_drop_last=True,   # Drop a final incomplete batch
    remove_unused_columns=True,  # Drop dataset columns the trainer does not use

    # Output configuration
    output_dir="./grpo_gemma_1b_results",
    logging_dir="./grpo_gemma_1b_logs",

    # Regularization for small models
    weight_decay=0.01,           # Light weight decay
    max_grad_norm=1.0,           # Gradient clipping

    # Advanced settings
    # fp16=True,                   # Use FP16 for memory efficiency
    # report_to="none",              # Disable wandb for simplicity

    lr_scheduler_type = "cosine",
    optim = "adamw_torch_fused",

    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,

    )

print(f"📈 Learning rate: {grpo_config.learning_rate}")
print(f"🔄 Epochs: {grpo_config.num_train_epochs}")
print(f"📦 Batch size: {grpo_config.per_device_train_batch_size} × {grpo_config.gradient_accumulation_steps} = {grpo_config.per_device_train_batch_size * grpo_config.gradient_accumulation_steps}")
print(f"🎛️ Beta (KL penalty): {grpo_config.beta}")
print(f"💾 FP16 enabled: {grpo_config.fp16}")

📚 Preparing focused dataset for Gemma 3 1B...
📊 Dataset size: 8 examples
🎯 Focus areas: Basic algebra, geometry, arithmetic
🇮🇩 Language target: Indonesian responses
✅ 8 prompts formatted for GRPO training

📝 Sample formatted prompt:
<bos><start_of_turn>user
Anda adalah asisten matematika yang menjawab dalam bahasa Indonesia.
Berikan penjelasan yang jelas dan mudah dipahami. Solve the equation 2x + 5 = 15<end_of_turn>
<start_of_tu...

⚙️ GRPO Configuration for Compact Model:
📈 Learning rate: 0.0003
🔄 Epochs: 5
📦 Batch size: 4 × 8 = 32
🎛️ Beta (KL penalty): 0.01
💾 FP16 enabled: False
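
The number of optimizer steps follows from these settings. The sketch below assumes the trainer counts sampled completions, not prompts, toward the batch. That assumption is consistent with the "Total steps = 5" line in the training log in Section 5, but the exact accounting may differ between TRL versions. The helper name `grpo_optimizer_steps` is hypothetical.

```python
def grpo_optimizer_steps(num_prompts, num_generations, per_device_batch,
                         grad_accum, epochs, num_devices=1):
    """Optimizer steps, assuming batches are counted in sampled completions."""
    completions_per_epoch = num_prompts * num_generations
    completions_per_step = per_device_batch * grad_accum * num_devices
    steps_per_epoch = completions_per_epoch // completions_per_step
    return epochs * steps_per_epoch


# 8 prompts, 4 generations each, batch 4 x 8 accumulation steps, 5 epochs
steps = grpo_optimizer_steps(
    num_prompts=8, num_generations=4,
    per_device_batch=4, grad_accum=8, epochs=5,
)
print(steps)
```

With so few steps, interval settings such as `warmup_steps=10` and `save_steps=50` are never reached within the run.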


## 5. GRPO Training Execution for Compact Models

### Memory-Efficient Training Strategy
With Gemma 3 1B, we can afford more aggressive training schedules and frequent evaluations due to the reduced computational overhead.
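
GRPO scores each completion relative to the other completions sampled for the same prompt (a group of `num_generations=4` here) instead of training a separate value model. The following is a minimal sketch of that group-relative advantage. The function name is hypothetical, and the use of the population standard deviation and the `eps` value are assumptions; implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same prompt
group = [0.86, 0.80, -0.19, -0.28]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])

# If every completion in a group earns the same reward, all advantages are zero
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

Completions scoring above their group mean get positive advantages and are reinforced. A group with identical rewards contributes no learning signal, which is why the `reward_std` column in the training log is worth watching.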

In [9]:
# Execute GRPO training for Gemma 3 1B
print("🚀 Starting GRPO training for Gemma 3 1B...")
print("⏱️ Expected training time: ~10-15 minutes (much faster than 8B models)")

def reward_adapter(prompts, completions, **kwargs):
    """Adapt CompactModelRewardSystem to the trainer's reward-function interface"""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        reward_output = reward_system.compute_reward(prompt, completion)
        rewards.append(reward_output['total'])
    return rewards  # One float per completion

# Initialize GRPO trainer with compact model optimizations
trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    reward_funcs=reward_adapter,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)

print("✅ GRPO Trainer initialized")
print(f"📊 Training on {len(train_dataset)} examples")
print(f"🔄 {grpo_config.num_train_epochs} epochs over {len(train_dataset)} prompts, {grpo_config.num_generations} completions sampled per prompt")

🚀 Starting GRPO training for Gemma 3 1B...
⏱️ Expected training time: ~10-15 minutes (much faster than 8B models)
Unsloth: Switching to float32 training since model cannot work with float16
✅ GRPO Trainer initialized
📊 Training on 8 examples
🔄 5 epochs over 8 prompts, 4 completions sampled per prompt


In [11]:
# Train the model
print("\n🎯 Beginning GRPO training...")
import torch
torch._dynamo.config.disable = True
training_result = trainer.train()
torch._dynamo.config.disable = False


print(f"✅ Training completed!")
# print(f"📈 Final loss: {training_result.training_loss:.4f}")
# print(f"⏱️ Training time: {training_result.training_time:.2f} seconds")

# Save the trained model
print("\n💾 Saving trained model...")
model.save_pretrained("gemma_1b_grpo_lora")
print("✅ Model saved as 'gemma_1b_grpo_lora'")


🎯 Beginning GRPO training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8 | Num Epochs = 5 | Total steps = 5
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 26,091,520 of 1,025,977,472 (2.54% trained)


| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | rewards / reward_adapter / mean | rewards / reward_adapter / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0017 | -0.24375 | 0.035307 | 346.65625 | 192.0 | 538.0 | 0.0 | 346.65625 | 192.0 | 538.0 | 0.172516 | -0.24375 | 0.054875 |
| 2 | 0.0017 | -0.236563 | 0.024203 | 370.46875 | 212.0 | 1007.0 | 0.0 | 370.46875 | 212.0 | 1007.0 | 0.169418 | -0.236563 | 0.038656 |
| 3 | 0.0018 | -0.236563 | 0.027435 | 377.875 | 186.0 | 548.0 | 0.0 | 377.875 | 186.0 | 548.0 | 0.178324 | -0.236563 | 0.045763 |
| 4 | 0.0019 | -0.235625 | 0.022553 | 392.125 | 204.0 | 748.0 | 0.0 | 392.125 | 204.0 | 748.0 | 0.191965 | -0.235625 | 0.030894 |
| 5 | 0.0019 | -0.235625 | 0.022457 | 375.875 | 231.0 | 752.0 | 0.0 | 375.875 | 231.0 | 752.0 | 0.188307 | -0.235625 | 0.027933 |


✅ Training completed!

💾 Saving trained model...
✅ Model saved as 'gemma_1b_grpo_lora'


In [13]:

# Quick evaluation
print("\n" + "="*50)
print("🧪 COMPACT MODEL EVALUATION")
print("="*50)

test_questions = [
    "What is 25% of 120?",
    "Solve for x: 3x - 7 = 14",
    "Find the circumference of a circle with radius 4"
]

print("\n🧮 Testing Gemma 3 1B after GRPO training:")

for i, question in enumerate(test_questions, 1):
    # Format the question
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

    text = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    # Generate response
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,  # Shorter responses for 1B model
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

    # Evaluate the response
    reward_breakdown = reward_system.compute_reward(question, response)

    print(f"\nTest {i}: {question}")
    print(f"Response: {response}")
    print(f"Language: {reward_breakdown['breakdown']['detected_language']}")
    print(f"Total Reward: {reward_breakdown['total']:.3f}")
    print(f"  • Language: {reward_breakdown['language']:.3f}")
    print(f"  • Correctness: {reward_breakdown['correctness']:.3f}")
    print(f"  • Clarity: {reward_breakdown['clarity']:.3f}")

print("\n" + "="*50)
print("📊 COMPACT MODEL TRAINING SUMMARY")
print("="*50)
print(f"✅ Model: Gemma 3 1B ({model_name})")
print(f"🎯 Training Method: GRPO with simplified reward system")
# print(f"📊 Dataset: {len(training_data)} focused mathematical examples")
# print(f"⏱️ Training Time: ~{training_result.training_time/60:.1f} minutes")
print(f"💾 Memory Usage: LoRA adapters on an unquantized base model")
print(f"🇮🇩 Language Target: Indonesian mathematical explanations")
print(f"🔧 LoRA Configuration: r=32, alpha=64")
print(f"📈 Key Insight: Compact models benefit from focused datasets and relatively high LoRA ranks")

print("\n🎓 Educational Takeaways:")
print("  • 1B models can be effectively fine-tuned with GRPO")
print("  • Higher LoRA ranks compensate for smaller base capacity")
print("  • Simplified reward systems work better for compact models")
print("  • Focused datasets prevent confusion in small models")
print("  • Training is much faster, enabling rapid experimentation")


🧪 COMPACT MODEL EVALUATION

🧮 Testing Gemma 3 1B after GRPO training:

Test 1: What is 25% of 120?
Response: Tentu, mari kita pecahkan soal ini:

**1. Cara Menghitung:**

Ada beberapa cara untuk menghitung 25% dari 120:

*   **Cara 1: Bagi**
    *   25% sama dengan 25/100, atau 0,25
    *   0,25 * 120 = (1/4) * 120 = 30

*   **Cara 2: Kalikan dengan 0,25**
    *   120 dikalikan dengan 0,25: 120 * 0,25 = 30

**Jadi, 
Language: unknown
Total Reward: -0.280
  • Language: -0.800
  • Correctness: 0.500
  • Clarity: 0.500

Test 2: Solve for x: 3x - 7 = 14
Response: Tentu, mari kita pecahkan persamaan ini:

**1. Identifikasi Operasi yang Harus Dilakukan:**

Kita memiliki persamaan: 3x - 7 = 14

Kita perlu melakukan operasi yang akan mengubah persamaan menjadi bentuk yang bisa dipecahkan. Operasi yang perlu kita lakukan adalah pengurangan.

**2. Selesaikan untuk x:**

*   Kurangi 7 dari kedua sisi persamaan:
   3x - 7 - 7 = 14 - 7
   3x - 14 = 7

*   Tambahkan 14 ke kedua sisi:
   3x - 14 + 1