# 🤖 Gemma2 9B Fine-Tuning - ZANTARA Dataset

Fine-tuning Gemma2 9B on Indonesian, Javanese, and Italian conversations to make its replies sound more natural.

**Goal:** Raise the naturalness score from 67.1/100 to 85+/100

**Dataset:**
- Train: 6,000 conversations (79,769 messages)
- Validation: 750 conversations (9,751 messages)
- Test: 750 conversations (10,082 messages)
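
As a quick sanity check on the figures above, the following sketch recomputes the split proportions and the average conversation length from the listed counts. The dictionary layout is only illustrative and is not part of the training pipeline.

```python
# Counts copied from the dataset list above: (conversations, messages).
splits = {
    "train": (6_000, 79_769),
    "validation": (750, 9_751),
    "test": (750, 10_082),
}

total_conversations = sum(convs for convs, _ in splits.values())
total_messages = sum(msgs for _, msgs in splits.values())

for name, (convs, msgs) in splits.items():
    share = 100 * convs / total_conversations
    avg_len = msgs / convs
    print(f"{name:<10} {share:5.1f}% of conversations, {avg_len:.1f} messages each")

print(f"total      {total_conversations:,} conversations, {total_messages:,} messages")
```

The counts work out to an 80/10/10 split with roughly 13 messages per conversation in every split.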

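**Method:** QLoRA (Quantized Low-Rank Adaptation)
- 4-bit quantization to reduce VRAM usage
- LoRA rank 16 for efficient training
- Compatible with Colab Pro (A100 40GB) or Pro+ (V100 16GB); the V100 has no native bfloat16 support, so the `bf16=True` setting in Step 7 may need to become `fp16=True` there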
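
To make the VRAM argument concrete, here is a back-of-the-envelope sketch of the weight storage at 16 and 4 bits per parameter. It assumes a nominal 9 billion parameters and ignores activations, optimizer state, and framework overhead, so it is a lower bound rather than a measurement.

```python
# Rough weight-memory estimate; activations, optimizer state and CUDA overhead are excluded.
PARAMS = 9e9  # assumption: nominal "9B" parameter count


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(PARAMS, 16)
nf4_gb = weight_gb(PARAMS, 4)
print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

The real footprint tends to be somewhat higher than the 4-bit figure, because not every tensor is quantized (embeddings, for instance, typically stay in 16-bit).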

## 📋 Step 0: Setup Environment

In [None]:
# Check GPU
!nvidia-smi

# Install dependencies
!pip install -q -U transformers accelerate peft bitsandbytes datasets trl huggingface_hub

print("✅ Setup complete!")

## 📦 Step 1: Load Dataset from Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Dataset paths (MODIFY THESE based on your Drive structure)
DATASET_DIR = '/content/drive/MyDrive/GEMMA_FINETUNING/splits'
train_path = f'{DATASET_DIR}/train.jsonl'
val_path = f'{DATASET_DIR}/validation.jsonl'
test_path = f'{DATASET_DIR}/test.jsonl'

# Verify files exist
for path in [train_path, val_path, test_path]:
    if os.path.exists(path):
        size_mb = os.path.getsize(path) / 1024 / 1024
        print(f"✅ Found: {os.path.basename(path)} ({size_mb:.1f} MB)")
    else:
        print(f"❌ Missing: {path}")
        raise FileNotFoundError(f"Dataset file not found: {path}")

print("\n✅ All dataset files found!")
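
The check above only confirms that the files exist. A minimal sketch of a structural check follows, assuming each line is a JSON object with a `messages` list of `role`/`content` dictionaries (the shape the preview step reads). The function name `validate_jsonl` and the `max_errors` parameter are hypothetical.

```python
import json


def validate_jsonl(path, max_errors=5):
    """Return (valid_count, errors) for a JSONL file of {'messages': [...]} records."""
    valid, errors = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
            else:
                messages = record.get("messages") if isinstance(record, dict) else None
                well_formed = isinstance(messages, list) and messages and all(
                    isinstance(m, dict) and "role" in m and "content" in m
                    for m in messages
                )
                if well_formed:
                    valid += 1
                else:
                    errors.append(f"line {line_no}: missing or malformed 'messages'")
            if len(errors) >= max_errors:
                break  # stop early once enough problems have been collected
    return valid, errors


# Example: valid, errors = validate_jsonl(train_path)
```

Running it on each split before training surfaces malformed records early instead of partway through a long run.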

## 🔍 Step 2: Preview Dataset

In [None]:
from datasets import load_dataset
import json

# Load datasets
dataset = load_dataset('json', data_files={
    'train': train_path,
    'validation': val_path,
    'test': test_path
})

print("📊 Dataset Statistics:")
print(f"  Train:      {len(dataset['train']):,} conversations")
print(f"  Validation: {len(dataset['validation']):,} conversations")
print(f"  Test:       {len(dataset['test']):,} conversations")
print(f"  Total:      {len(dataset['train']) + len(dataset['validation']) + len(dataset['test']):,} conversations")

# Preview samples
print("\n📝 Sample Conversation (Train):")
sample = dataset['train'][0]
print(f"Messages: {len(sample['messages'])}")
for i, msg in enumerate(sample['messages'][:4], 1):
    role_emoji = "👤" if msg['role'] == 'user' else "🤖"
    print(f"  [{i}] {role_emoji} {msg['role']}: {msg['content'][:100]}...")

if len(sample['messages']) > 4:
    print(f"  ... ({len(sample['messages']) - 4} more messages)")

## 🔑 Step 3: Login to Hugging Face

Required if you want to:
- Save model to Hugging Face Hub
- Access gated models

`google/gemma-2-9b-it`, loaded in Step 4, is a gated model: accept its license on the model page and log in before downloading it.

Get token from: https://huggingface.co/settings/tokens

In [None]:
from huggingface_hub import login

# Uncomment and add your token. This can only be skipped when the runtime
# is already authenticated and you do not plan to push to the Hub.
# HF_TOKEN = "hf_..."
# login(token=HF_TOKEN)

print("✅ Ready to proceed")
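
Pasting a token directly into a notebook cell makes it easy to leak when the notebook is shared. One alternative, sketched here under the assumption that the token has been exported as an environment variable named `HF_TOKEN`, is to read it at runtime. The helper `read_hf_token` is hypothetical.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN"):
    """Return the token stored in env_var, or None when it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = read_hf_token()
if token is None:
    print("No token found; set HF_TOKEN before running the login cell.")
else:
    print(f"Token found ({len(token)} characters); pass it to login(token=...).")
```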

## 🤖 Step 4: Load Gemma2 9B with 4-bit Quantization

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Model configuration
MODEL_NAME = "google/gemma-2-9b-it"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

print("📥 Loading Gemma2 9B (4-bit)...")
print("⏳ This may take 3-5 minutes...")

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

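# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Gemma ships a dedicated <pad> token; fall back to EOS only if none is defined,
# so that padding does not mask the end-of-sequence token during training.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("✅ Model loaded successfully!")
print("📊 Model size: ~5GB (4-bit quantized from ~18GB)")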

## ⚙️ Step 5: Configure LoRA

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                              # LoRA rank (higher = more parameters, slower)
    lora_alpha=32,                     # LoRA scaling
    target_modules=[                   # Which layers to adapt
        "q_proj",
        "k_proj", 
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
trainable_params = 0
all_params = 0
for _, param in model.named_parameters():
    all_params += param.numel()
    if param.requires_grad:
        trainable_params += param.numel()

print(f"✅ LoRA configured!")
print(f"📊 Trainable params: {trainable_params:,} / {all_params:,} ({100 * trainable_params / all_params:.2f}%)")
print(f"💾 Memory footprint: ~5-8GB VRAM (compatible with Colab Pro)")
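
The trainable-parameter count can also be estimated by hand, which helps when comparing ranks before loading the model. The sketch below assumes the layer dimensions of the public Gemma 2 9B configuration (hidden size 3584, MLP size 14336, 42 layers, 16 query heads and 8 key/value heads of dimension 256). These numbers do not appear in this notebook and should be checked against the model's `config.json`.

```python
# Assumed Gemma 2 9B dimensions (from the public model config, not from this notebook).
HIDDEN = 3584
INTERMEDIATE = 14336
NUM_LAYERS = 42
Q_DIM = 16 * 256    # attention heads * head_dim
KV_DIM = 8 * 256    # key/value heads * head_dim
RANK = 16           # matches r=16 in lora_config

# (in_features, out_features) of each adapted projection in one decoder layer.
projections = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (rank x in) and B (out x rank) per projection: rank * (in + out) weights.
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in projections.values())
lora_params = per_layer * NUM_LAYERS
print(f"LoRA parameters: {lora_params:,}")
print(f"Adapter size in fp32: ~{lora_params * 4 / 1e6:.0f} MB")
```

Under these assumptions the adapters hold about 54 million weights. The percentage printed by the cell above may look higher than this suggests, since 4-bit weights are stored packed and `numel()` counts the packed elements.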

## 📝 Step 6: Prepare Dataset for Training

In [None]:
def format_conversation(example):
    """
    Format conversation into Gemma chat template
    
    Gemma format:
    <start_of_turn>user\n{message}<end_of_turn>\n<start_of_turn>model\n{response}<end_of_turn>\n
    """
    messages = example['messages']
    
    # Build formatted conversation
    formatted = ""
    for msg in messages:
        role = "user" if msg['role'] == 'user' else "model"
        formatted += f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n"
    
    return {"text": formatted}

# Apply formatting
print("🔄 Formatting datasets...")
train_dataset = dataset['train'].map(format_conversation, remove_columns=['messages'])
eval_dataset = dataset['validation'].map(format_conversation, remove_columns=['messages'])

print("✅ Datasets formatted!")
print(f"\n📝 Sample formatted text:")
print(train_dataset[0]['text'][:500] + "...")
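
To see the turn markup without loading the dataset, the same formatting logic can be run on a toy conversation. This is a standalone sketch: `to_gemma_turns` mirrors `format_conversation` but is a hypothetical name, and the two messages are invented for illustration.

```python
def to_gemma_turns(messages):
    """Render a list of {'role', 'content'} dicts as Gemma turn markup."""
    parts = []
    for msg in messages:
        # Gemma calls the assistant side "model"; every non-user role maps to it here.
        role = "user" if msg["role"] == "user" else "model"
        parts.append(f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n")
    return "".join(parts)


# Hypothetical two-message conversation, not taken from the ZANTARA dataset.
demo = [
    {"role": "user", "content": "Halo, apa kabar?"},
    {"role": "assistant", "content": "Baik, terima kasih!"},
]
print(to_gemma_turns(demo))
```

Note that any role other than `user`, including a `system` message if the data contains one, ends up as a `model` turn under this mapping.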

## 🚀 Step 7: Training Configuration

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Training arguments
training_args = TrainingArguments(
    output_dir="./gemma2-zantara-indonesian",
    num_train_epochs=3,                    # Number of epochs
    per_device_train_batch_size=1,         # Batch size per GPU (increase if VRAM allows)
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,         # Effective batch size = 4
    gradient_checkpointing=True,           # Reduce VRAM usage
    
    # Optimizer
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    
    # Logging
    logging_steps=10,
    logging_dir="./logs",
    
    # Evaluation
    eval_strategy="steps",
    eval_steps=200,
    
    # Saving
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,                    # Keep at most 2 checkpoints on disk (oldest are deleted first)
    
    # Performance
    fp16=False,
    bf16=True,                             # Use bfloat16 for A100
    max_grad_norm=0.3,
    
    # Other
    report_to="none",                      # Disable wandb/tensorboard
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

print("✅ Training arguments configured!")
print(f"📊 Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"⏱️  Estimated training time: ~2-4 hours on A100")
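
The settings above imply a fixed number of optimizer steps and a specific learning-rate curve. The sketch below works both out in plain Python, assuming a single GPU, the 6,000 training conversations from the dataset summary, and a standard linear-warmup plus cosine-decay schedule. It approximates the scheduler rather than calling it.

```python
import math

TRAIN_CONVERSATIONS = 6_000
EFFECTIVE_BATCH = 1 * 4          # per-device batch size * gradient accumulation
EPOCHS = 3
WARMUP_STEPS = 100
PEAK_LR = 2e-4

steps_per_epoch = math.ceil(TRAIN_CONVERSATIONS / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"evaluations at eval_steps=200: {total_steps // 200}")
for step in (0, 50, 100, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

With 4,500 total steps, the 100 warmup steps cover only about 2% of training, and each evaluation every 200 steps runs over the full 750-conversation validation set.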

## 🎓 Step 8: Initialize Trainer

In [None]:
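# Initialize SFTTrainer (Supervised Fine-Tuning)
# Note: these keyword arguments follow older TRL releases. Newer releases move
# dataset_text_field, the sequence-length limit and packing into SFTConfig and
# rename `tokenizer` to `processing_class`, so pin TRL or adapt the call if it fails.
trainer = SFTTrainer(
    model=model,                           # Already wrapped with LoRA in Step 5, so no peft_config here
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,                   # Maximum sequence length
    tokenizer=tokenizer,
    args=training_args,
    packing=False,                         # Don't pack multiple examples together
)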

print("✅ Trainer initialized!")
print(f"🚀 Ready to start training!")

## 🏋️ Step 9: Start Training

**⚠️ WARNING:** This will take 2-4 hours on A100 GPU!

In [None]:
import datetime

print(f"🚀 Starting training at {datetime.datetime.now().strftime('%H:%M:%S')}")
print("⏳ This will take approximately 2-4 hours...\n")

# Train!
trainer.train()

print(f"\n✅ Training completed at {datetime.datetime.now().strftime('%H:%M:%S')}!")

## 💾 Step 10: Save Fine-Tuned Model

In [None]:
# Save LoRA adapter
output_dir = "./gemma2-zantara-indonesian-final"
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✅ Model saved to {output_dir}")

# Copy to Google Drive for persistence
import shutil
drive_output = "/content/drive/MyDrive/GEMMA_FINETUNING/gemma2-zantara-indonesian-final"
shutil.copytree(output_dir, drive_output, dirs_exist_ok=True)

print(f"✅ Model backed up to Google Drive: {drive_output}")
print(f"📦 Size: ~100-200MB (LoRA adapters only)")
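
The size printed above is an estimate. To confirm that the Drive copy is complete, the actual size of both directories can be measured and compared. A minimal sketch using only the standard library follows; `dir_size_mb` is a hypothetical helper name.

```python
import os


def dir_size_mb(path: str) -> float:
    """Sum the sizes of all regular files under path, in MB (1024 * 1024 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 * 1024)


# Example, using the paths from the cell above:
# print(f"local: {dir_size_mb(output_dir):.1f} MB, Drive: {dir_size_mb(drive_output):.1f} MB")
```

If the two numbers differ, the copy was probably interrupted and is worth repeating before the Colab runtime is recycled.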

## 🧪 Step 11: Test Fine-Tuned Model

In [None]:
from transformers import pipeline

# Create text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.95
)

# Test with Indonesian conversation
test_prompt = """<start_of_turn>user
Halo! Gue mau tanya dong soal visa investor. Prosesnya gimana sih?<end_of_turn>
<start_of_turn>model
"""

print("🧪 Testing fine-tuned model...\n")
result = pipe(test_prompt, do_sample=True)[0]['generated_text']

# Extract only the assistant's response
response = result.split("<start_of_turn>model\n")[-1].split("<end_of_turn>")[0]
print(f"👤 User: Halo! Gue mau tanya dong soal visa investor. Prosesnya gimana sih?")
print(f"🤖 Assistant: {response}")
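
The extraction step in the cell above can be wrapped in a small function so it is reusable across prompts and testable without a GPU. The sketch below is one way to do that. `extract_reply` is a hypothetical name, and the sample string is invented rather than real model output.

```python
def extract_reply(generated_text: str) -> str:
    """Return the text of the last model turn, without turn markers."""
    marker = "<start_of_turn>model\n"
    # Keep what follows the last model marker, or everything if the marker is absent.
    _, _, tail = generated_text.rpartition(marker)
    # Drop anything after the closing marker, if the model emitted one.
    reply, _, _ = tail.partition("<end_of_turn>")
    return reply.strip()


# Invented example text standing in for pipeline output.
sample_output = (
    "<start_of_turn>user\nHalo<end_of_turn>\n"
    "<start_of_turn>model\nHai juga!<end_of_turn>\n"
)
print(extract_reply(sample_output))
```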

## 📊 Step 12: Evaluate on Test Set (Optional)

In [None]:
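# Evaluate on test set
test_dataset = dataset['test'].map(format_conversation, remove_columns=['messages'])

# SFTTrainer tokenizes only the datasets passed at construction, so tokenize
# the test split here with the same 2048-token limit before evaluating.
test_dataset = test_dataset.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=2048),
    batched=True,
    remove_columns=['text'],
)

print("🔍 Evaluating on test set...")
eval_results = trainer.evaluate(eval_dataset=test_dataset)

print("\n📊 Test Results:")
for key, value in eval_results.items():
    print(f"  {key}: {value:.4f}")

print("\n✅ Evaluation complete!")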
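
The `eval_loss` reported above is a mean cross-entropy, which is often easier to interpret as perplexity. The conversion is a single exponential, sketched here with a made-up loss value rather than a result from this notebook.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)


example_loss = 1.25  # hypothetical value, not a result from this notebook
print(f"loss {example_loss:.2f} -> perplexity {perplexity(example_loss):.2f}")
```

In practice the input would be `eval_results["eval_loss"]`. Perplexity measures how well the model predicts the test text, which is related to but not the same as the naturalness score targeted here.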

## 🚀 Step 13: Push to Hugging Face Hub (Optional)

In [None]:
# Uncomment to push to Hugging Face Hub
# HF_USERNAME = "your-username"
# HF_REPO = "gemma2-9b-zantara-indonesian"

# trainer.model.push_to_hub(f"{HF_USERNAME}/{HF_REPO}")
# tokenizer.push_to_hub(f"{HF_USERNAME}/{HF_REPO}")

# print(f"✅ Model pushed to https://huggingface.co/{HF_USERNAME}/{HF_REPO}")

print("ℹ️  Push to Hub skipped (uncomment to enable)")

## 📝 Summary

### What We Did
1. ✅ Loaded 7,500 Indonesian/Javanese/Italian conversations
2. ✅ Fine-tuned Gemma2 9B with QLoRA (4-bit)
3. ✅ Saved LoRA adapters (~100-200MB)
4. ✅ Tested model with natural conversations

### Next Steps
1. **Integrate into ZANTARA backend:**
   - Load fine-tuned model in `apps/backend-rag`
   - Replace current LLM with fine-tuned Gemma2

2. **Evaluate naturalness:**
   - Use quality analyzer from dataset generation
   - Measure particle usage, slang density
   - Target: 85+/100 naturalness score

3. **Generate more data (if needed):**
   - Current: 7,500 conversations
   - Target: 24,000 conversations
   - Re-train with expanded dataset
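
As an illustration of the particle-usage measurement mentioned in step 2, the sketch below counts how often colloquial Indonesian discourse particles appear in a reply. It is a hypothetical stand-in and not the project's quality analyzer: the particle list is an assumed, non-exhaustive sample, and the tokenization is deliberately crude.

```python
import re

# Assumed, non-exhaustive list of colloquial Indonesian discourse particles.
PARTICLES = {"dong", "sih", "deh", "kok", "nih", "lho", "kan", "ya"}


def particle_density(text: str) -> float:
    """Return the fraction of word tokens in text that are listed particles."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in PARTICLES)
    return hits / len(tokens)


sample = "Halo! Gue mau tanya dong soal visa investor. Prosesnya gimana sih?"
print(f"particle density: {particle_density(sample):.2f}")
```

Comparing this ratio between base-model and fine-tuned replies on the same prompts gives a rough signal, but it cannot replace the full naturalness score.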

### Model Files
- **Local:** `./gemma2-zantara-indonesian-final/`
- **Google Drive:** `/content/drive/MyDrive/GEMMA_FINETUNING/gemma2-zantara-indonesian-final/`

### Resources
- Training time: ~2-4 hours on A100
- VRAM usage: ~5-8GB (4-bit + LoRA)
- Disk space: ~100-200MB (adapters only)

---

**Created:** November 2025  
**Dataset:** ZANTARA Indonesian/Javanese/Italian  
**Model:** Gemma2 9B Instruct + QLoRA