# VAZHI SFT v3.4 - Using Base Model

**Critical Fix:** Use `Qwen3-0.6B-Base` (non-instruct) instead of `Qwen3-0.6B` (instruct)

**Why v3.3 failed:**
- Qwen3-0.6B is instruction-tuned with `<think>` reasoning tokens
- Our ChatML format conflicted with its native format
- Learning rate 1e-4 was too aggressive, causing catastrophic forgetting

**v3.4 Fixes:**
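1. **Use Qwen3-0.6B-Base** - clean slate, no instruction tuning
2. **Lower LR: 2e-5** - 5x below v3.3's 1e-4, to limit catastrophic forgetting
3. **3 epochs** (up from 2) - more passes over the Tamil data
4. **FP32 training** - fp16/bf16 mixed precision disabled for P100 compatibility (the P100 has no native bf16 support)
5. **LoRA rank 32** (up from 16)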

**Target:** Kaggle P100 (16GB)

## 1. Install Dependencies

**IMPORTANT:** After running this cell, **RESTART the session** (Runtime → Restart session)

In [None]:
# Install dependencies
!pip install -q -U \
  "transformers>=4.51.0" \
  "accelerate>=0.34.2" \
  "peft>=0.12.0" \
  "trl>=0.12.0" \
  "bitsandbytes>=0.43.3" \
  "datasets>=2.21.0" \
  "huggingface_hub>=0.24.7"

print("✅ Dependencies installed")
print("⚠️ RESTART THE SESSION NOW (Runtime → Restart session)")

## 2. Imports & Configuration

In [None]:
# Force single GPU BEFORE importing torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import json
import random
import re
import torch
from collections import defaultdict
from datasets import load_dataset, Dataset
from tqdm.auto import tqdm
from huggingface_hub import login, HfApi, dataset_info

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

# Config
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

# Repos - USING BASE MODEL!
BALANCED_DATASET = "CryptoYogi/vazhi-tamil-sft-v3_3"  # Reuse existing dataset
BASE_MODEL = "Qwen/Qwen3-0.6B-Base"  # BASE model, not instruct!
OUTPUT_MODEL = "CryptoYogi/vazhi-qwen3-v3_4"

# System prompt (Tamil). English gloss: "You are VAZHI (வழி), an AI assistant for Tamil people.
# Answer clearly and helpfully in Tamil. If you don't know, say 'I don't know'."
SYSTEM_PROMPT = "நீங்கள் VAZHI (வழி), தமிழ் மக்களுக்கான AI உதவியாளர். தமிழில் தெளிவாகவும் உதவியாகவும் பதிலளியுங்கள். தெரியாவிட்டால் \"தெரியவில்லை\" என்று சொல்லுங்கள்."

print("✅ Configuration loaded")
print(f"   PyTorch: {torch.__version__}")
print(f"   CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
print("\n🔥 KEY FIX: Using BASE model (not instruct)")
print(f"   Base model: {BASE_MODEL}")
print(f"   Dataset: {BALANCED_DATASET}")

In [None]:
# Login to HuggingFace
from kaggle_secrets import UserSecretsClient
secrets = UserSecretsClient()
hf_token = secrets.get_secret("HF_TOKEN")
login(token=hf_token)
print("✅ Logged in to HuggingFace")

## 3. Load Dataset (Reusing v3.3 dataset)

In [None]:
print("📚 Loading balanced dataset...")
balanced_ds = load_dataset(BALANCED_DATASET, split="train")
print(f"✅ Loaded {len(balanced_ds)} samples")

# Verify ChatML format
sample = balanced_ds[0]['text'][:300]
print(f"\n📝 Sample: {sample}...")
if "<|im_start|>" in sample:
    print("✅ ChatML format verified")
else:
    print("⚠️ No ChatML markers found in the first sample")
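
The check in this cell only looks for one marker in the first 300 characters of a single sample. A stricter check could be sketched as follows, assuming each record's `text` field is a sequence of complete ChatML turns that ends with an assistant turn. The helper names `chatml_roles` and `is_valid_chatml` are hypothetical and not part of the notebook.

```python
import re

# Matches one complete ChatML turn: <|im_start|>ROLE\nCONTENT<|im_end|>
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def chatml_roles(text: str) -> list[str]:
    """Return the role of every complete ChatML turn found in `text`."""
    return [m.group(1) for m in TURN_RE.finditer(text)]


def is_valid_chatml(text: str) -> bool:
    """Hypothetical check: balanced markers, known roles, assistant turn last."""
    roles = chatml_roles(text)
    balanced = text.count("<|im_start|>") == text.count("<|im_end|>") == len(roles)
    known = all(r in {"system", "user", "assistant"} for r in roles)
    return bool(roles) and balanced and known and roles[-1] == "assistant"


example = (
    "<|im_start|>system\nYou are VAZHI.<|im_end|>\n"
    "<|im_start|>user\nHello<|im_end|>\n"
    "<|im_start|>assistant\nHi!<|im_end|>"
)
print(chatml_roles(example))          # ['system', 'user', 'assistant']
print(is_valid_chatml(example))       # True
print(is_valid_chatml(example[:60]))  # False: the user turn is cut off
```

In the notebook this could be applied to every row, for example `sum(is_valid_chatml(t) for t in balanced_ds["text"])`, to count well-formed samples before training.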

## 4. Load BASE Model with 4-bit Quantization

In [None]:
print("\n📥 Loading BASE model and tokenizer...")
print(f"   Model: {BASE_MODEL}")

# Tokenizer from BASE model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.padding_side = "right"

# Ensure ChatML special tokens exist
special_tokens = ["<|im_start|>", "<|im_end|>"]
tokens_to_add = [t for t in special_tokens if t not in tokenizer.get_vocab()]
if tokens_to_add:
    print(f"   Adding special tokens: {tokens_to_add}")
    tokenizer.add_special_tokens({'additional_special_tokens': tokens_to_add})

# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"   Set pad_token = eos_token")

print(f"✅ Tokenizer ready: {len(tokenizer)} tokens")

In [None]:
# 4-bit quantization - use float16 compute dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model - MUST specify torch_dtype to avoid bf16 default
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map={"":0},
    trust_remote_code=True
)

# Resize embeddings if we added special tokens
if len(tokenizer) > model.config.vocab_size:
    print(f"   Resizing embeddings: {model.config.vocab_size} → {len(tokenizer)}")
    model.resize_token_embeddings(len(tokenizer))

# Prepare for training
model = prepare_model_for_kbit_training(model)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.use_cache = False

print(f"✅ Model loaded: {model.num_parameters():,} params")
print(f"   torch_dtype: float16")

## 5. Add LoRA Adapters

In [None]:
# LoRA config - slightly higher rank for base model
lora_config = LoraConfig(
    r=32,  # Increased from 16 for better learning on base model
    lora_alpha=64,  # 2x r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Verify no bf16 parameters
bf16_count = sum(1 for _, p in model.named_parameters() if p.dtype == torch.bfloat16)
if bf16_count > 0:
    print(f"⚠️ Found {bf16_count} bf16 parameters - converting to fp16")
    for name, param in model.named_parameters():
        if param.dtype == torch.bfloat16:
            param.data = param.data.to(torch.float16)
else:
    print("✅ No bf16 parameters")
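
To see what raising the rank from 16 to 32 costs, the trainable-parameter count can be estimated by hand: each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` LoRA weights. The sketch below is only an estimate. The layer shapes and layer count are assumptions about Qwen3-0.6B-Base and should be checked against `model.config`; `model.print_trainable_parameters()` remains the authoritative number.

```python
def lora_params(r: int, shapes: dict[str, tuple[int, int]], n_layers: int) -> int:
    """Each adapted Linear(in, out) gains A (r x in) and B (out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# ASSUMED (in_features, out_features) per target module; verify against model.config.
ASSUMED_SHAPES = {
    "q_proj": (1024, 2048),
    "k_proj": (1024, 1024),
    "v_proj": (1024, 1024),
    "o_proj": (2048, 1024),
    "gate_proj": (1024, 3072),
    "up_proj": (1024, 3072),
    "down_proj": (3072, 1024),
}
ASSUMED_LAYERS = 28  # assumed number of transformer blocks

for rank in (16, 32):
    print(rank, f"{lora_params(rank, ASSUMED_SHAPES, ASSUMED_LAYERS):,}")
# Under these assumed shapes:
# 16 10,092,544
# 32 20,185,088
```

The count scales linearly with `r`, so doubling the rank doubles the adapter size regardless of the exact shapes.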

## 6. Training (Lower LR, More Epochs)

In [None]:
# FP32 training with SAFER learning rate
sft_config = SFTConfig(
    output_dir="/kaggle/working/vazhi-v3_4",
    num_train_epochs=3,  # Increased from 2
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,  # MUCH lower: 2e-5 vs 1e-4 (5x reduction)
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=25,
    save_steps=200,
    save_total_limit=2,
    fp16=False,  # Disabled: fp16 AMP clashes with Qwen3's bf16-native weights
    bf16=False,  # Disabled: the P100 has no native bf16 support
    gradient_checkpointing=True,
    max_grad_norm=1.0,
    optim="paged_adamw_8bit",
    report_to="none",
    dataset_text_field="text",
    max_length=512,  # named max_seq_length in older TRL releases
    packing=False,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=balanced_ds,
    args=sft_config,
    processing_class=tokenizer,
)

print("✅ Trainer initialized (FP32 mode)")
print(f"   Epochs: 3 (was 2)")
print(f"   Learning rate: 2e-5 (was 1e-4)")
print(f"   LoRA rank: 32 (was 16)")
print(f"   Batch size: 1 x 16 = 16 effective")
print(f"   Mode: FP32 (P100 compatible)")
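
The batch size, accumulation, epoch, warmup and scheduler settings together determine how many optimizer updates happen and what learning rate each one sees. The following is a rough sketch of that arithmetic, not the Trainer's own code: the dataset size is a hypothetical placeholder, the step count is approximate, and `lr_at` assumes linear warmup followed by cosine decay to zero.

```python
import math


def optimizer_steps(n_samples: int, per_device_bs: int, grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer updates on a single GPU."""
    steps_per_epoch = math.ceil(n_samples / (per_device_bs * grad_accum))
    return steps_per_epoch * epochs


def lr_at(step: int, total: int, peak: float, warmup_ratio: float) -> float:
    """Linear warmup to `peak`, then cosine decay to zero (assumed schedule shape)."""
    warmup = math.ceil(total * warmup_ratio)
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


N_SAMPLES = 8000  # HYPOTHETICAL dataset size; use len(balanced_ds) in the notebook
total = optimizer_steps(N_SAMPLES, per_device_bs=1, grad_accum=16, epochs=3)
print(total)  # 1500

# Learning rate at the start, mid-warmup, end of warmup, mid-decay, and the end
for step in (0, 75, 150, 825, 1500):
    print(step, f"{lr_at(step, total, peak=2e-5, warmup_ratio=0.1):.2e}")
```

With `save_steps=200`, a run of this hypothetical length would write a checkpoint roughly every 200 updates and keep the latest two.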

In [None]:
# Train!
print("\n🚀 Starting training...")
trainer.train()
print("\n✅ Training complete!")

## 7. Save and Push to HuggingFace

In [None]:
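print("💾 Saving model...")
trainer.save_model("/kaggle/working/vazhi-v3_4-final")

print("🔀 Merging LoRA weights...")
# Caveat: the base weights here are 4-bit quantized, so the merged weights stay quantized.
# For a full-precision merge, reload the base in fp16 and attach the saved adapter first.
merged_model = model.merge_and_unload()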

# Push to HuggingFace
api = HfApi()
api.create_repo(OUTPUT_MODEL, exist_ok=True)

print(f"📤 Pushing to {OUTPUT_MODEL}...")
merged_model.push_to_hub(OUTPUT_MODEL, private=False)
tokenizer.push_to_hub(OUTPUT_MODEL, private=False)

print(f"\n✅ Model uploaded: https://huggingface.co/{OUTPUT_MODEL}")

## 8. Test the Model

In [None]:
merged_model.config.use_cache = True

test_prompts = [
    "வணக்கம்",  # "Hello"
    "தமிழ்நாட்டின் தலைநகரம் என்ன?",  # "What is the capital of Tamil Nadu?"
    "2+2 என்ன?",  # "What is 2+2?"
    "பொங்கல் எப்போது கொண்டாடப்படுகிறது?",  # "When is Pongal celebrated?"
    "திருக்குறளின் முதல் குறள் என்ன?",  # "What is the first kural of the Thirukkural?"
]

print("\n🧪 Testing model...\n")

for prompt in test_prompts:
    full_prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    
    inputs = tokenizer(full_prompt, return_tensors="pt").to(merged_model.device)
    
    with torch.no_grad():
        outputs = merged_model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    if "<|im_start|>assistant" in response:
        response = response.split("<|im_start|>assistant")[-1]
        response = response.split("<|im_end|>")[0].strip()
    
    print(f"Q: {prompt}")
    print(f"A: {response}")
    print("-" * 50)
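
The prompt formatting and reply extraction in the loop above are plain string operations, so they can be factored into helpers and checked without loading the model. This is a minimal sketch: `build_chatml_prompt` and `extract_reply` are hypothetical names, and the decoded output is simulated with a hand-written string.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Format one system + user exchange and open the assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def extract_reply(decoded: str) -> str:
    """Return the text of the last assistant turn, without the end marker."""
    _, found, tail = decoded.rpartition("<|im_start|>assistant")
    if not found:
        # No assistant marker at all: fall back to the raw text
        return decoded.strip()
    return tail.split("<|im_end|>")[0].strip()


prompt = build_chatml_prompt("You are VAZHI.", "Hello")
# Stand-in for tokenizer.decode(outputs[0], skip_special_tokens=False)
fake_output = prompt + "Hi there!<|im_end|><|endoftext|>"
print(extract_reply(fake_output))  # Hi there!
```

Keeping the same builder for training data and inference makes it easier to confirm that the test prompts match the ChatML layout the model was fine-tuned on.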

## Summary

**v3.4 Key Changes from v3.3:**

| Setting | v3.3 (failed) | v3.4 |
|---------|--------------|------|
| Base Model | Qwen3-0.6B (instruct) | Qwen3-0.6B-Base |
| Learning Rate | 1e-4 | 2e-5 |
| Epochs | 2 | 3 |
| LoRA Rank | 16 | 32 |

**Why v3.3 failed:**
- Qwen3-0.6B is instruction-tuned with `<think>` reasoning
- Our ChatML format conflicted with its native format
- High LR caused catastrophic forgetting