# VAZHI SFT v3.8: v4.0 Dataset + fp16 Merge

**v3.8 = v3.7 merge fix + v4.0 curated dataset (ADR-010)**

**What changed from v3.7:**
- **Dataset:** `CryptoYogi/vazhi-tamil-sft-v4_0` (1,514 samples, properly composed)
  - 50% domain packs (security, govt, education, legal, healthcare, culture)
  - 33% IndicAlign diversity (Dolly_T, WikiHow, Wiki_Conv, OpenAssistant_T)
  - 6% Kural interpretive (hard-capped, anti-memorization filtered)
  - 3% handcrafted (guardrails, refusal, brevity, greeting)
  - 8% general knowledge (dialects, emotions, daily routines)
- **Output:** `CryptoYogi/vazhi-qwen3-v3_8`
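
As a rough check on the composition above, the stated percentages can be converted into approximate per-bucket sample counts. This is only a sketch: the bucket names are hypothetical labels chosen for illustration, and the counts are simple roundings of the stated shares, not the exact numbers produced by the Dataset Factory.

```python
# Stated composition of the v4.0 dataset (percent of 1,514 samples).
# Bucket names are illustrative labels, not necessarily the dataset's field values.
TOTAL_SAMPLES = 1514
COMPOSITION_PCT = {
    "domain_packs": 50,
    "indicalign": 33,
    "kural": 6,
    "handcrafted": 3,
    "general": 8,
}


def approx_counts(total, composition_pct):
    """Convert percentage shares into rounded sample counts."""
    return {name: round(total * pct / 100) for name, pct in composition_pct.items()}


counts = approx_counts(TOTAL_SAMPLES, COMPOSITION_PCT)
for name, n in counts.items():
    print(f"{name:>12}: ~{n} samples")
```

A quick comparison of these estimates with the bucket distribution printed in section 3 would show whether the published split matches the intended shares.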

**Carried over from v3.7 (all correct):**
1. Save LoRA adapter separately (not merged into 4-bit)
2. Reload base model in fp16 → merge into unquantized weights (not 4-bit)
3. Disable gradient checkpointing before eval
4. Text-based loss logging
5. Pre-merge sanity check on PeftModel

**Target:** Kaggle P100 (16GB)

## 1. Install Dependencies

**After running this cell, RESTART the session** (Runtime → Restart session)

In [None]:
# Install dependencies; pin TRL (<0.20) to avoid the DataCollatorForCompletionOnlyLM removal
!pip install -q -U \
  "transformers>=4.51.0" \
  "accelerate>=0.34.2" \
  "peft>=0.12.0" \
  "trl>=0.12.0,<0.20.0" \
  "bitsandbytes>=0.43.3" \
  "datasets>=2.21.0" \
  "huggingface_hub>=0.24.7"

print("✅ Dependencies installed")
print("⚠️  RESTART THE SESSION NOW (Runtime → Restart session)")

## 2. Imports & Configuration

In [None]:
# Force single GPU BEFORE importing torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import json
import re
import random
import torch
import numpy as np
from datasets import load_dataset, Dataset
from huggingface_hub import login, HfApi

from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    TrainerCallback,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# === KEY CONFIG ===
DATASET_NAME = "CryptoYogi/vazhi-tamil-sft-v4_0"  # ADR-010 curated dataset
BASE_MODEL = "Qwen/Qwen3-0.6B"                    # INSTRUCT model (NOT Base!)
OUTPUT_MODEL = "CryptoYogi/vazhi-qwen3-v3_8"

# Same hyperparameters as v3.6/v3.7 (training config was correct)
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3
MAX_SEQ_LENGTH = 1024
LORA_R = 16
LORA_ALPHA = 32

SYSTEM_PROMPT = (
    "நீங்கள் VAZHI (வழி), தமிழ் மக்களுக்கான AI உதவியாளர். "
    "தமிழில் தெளிவாகவும் உதவியாகவும் பதிலளியுங்கள். "
    'தெரியாவிட்டால் "தெரியவில்லை" என்று சொல்லுங்கள்.'
)

print(f"✅ Configuration loaded")
print(f"   PyTorch: {torch.__version__}")
print(f"   CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
print(f"")
print(f"🔑 v3.8 = v3.7 merge fix + v4.0 dataset:")
print(f"   Dataset: {DATASET_NAME}")
print(f"   Output:  {OUTPUT_MODEL}")
print(f"   FIX: LoRA merge in fp16, NOT 4-bit")
print(f"   NEW: ADR-010 curated dataset (balanced composition)")

In [None]:
# Login to HuggingFace
from kaggle_secrets import UserSecretsClient
secrets = UserSecretsClient()
hf_token = secrets.get_secret("HF_TOKEN")
login(token=hf_token)
print("✅ Logged in to HuggingFace")

## 3. Load v4.0 Curated Dataset

**ADR-010 dataset** built by `Vazhi_Dataset_Factory_v4_0.ipynb`:
- 1,365 train / 149 validation samples
- Composition-enforced: domain packs 50%, IndicAlign 33%, kural 6%, handcrafted 3%, general 8%
- All samples are strict ChatML with Tamil char % >= 30%
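
The "strict ChatML with Tamil char % >= 30%" rule could be checked along the following lines. This is a minimal sketch, not the Dataset Factory's actual filter: `parse_chatml` and `passes_filter` are hypothetical helpers, and measuring the Tamil share over turn contents only (excluding the ChatML markers) is an assumption.

```python
import re

# One ChatML turn: <|im_start|>role\ncontent<|im_end|> with an optional trailing newline.
TURN_RE = re.compile(
    r"<\|im_start\|>(system|user|assistant)\n(.*?)<\|im_end\|>\n?", re.DOTALL
)


def parse_chatml(text):
    """Return [(role, content), ...], or None if any text falls outside well-formed turns."""
    turns, pos = [], 0
    for match in TURN_RE.finditer(text):
        if match.start() != pos:  # stray text before or between turns
            return None
        turns.append((match.group(1), match.group(2)))
        pos = match.end()
    if not turns or pos != len(text):
        return None
    return turns


def tamil_char_pct(text):
    """Percentage of characters in the Tamil Unicode block (U+0B80 to U+0BFF)."""
    if not text:
        return 0.0
    return 100.0 * sum(1 for c in text if "\u0b80" <= c <= "\u0bff") / len(text)


def passes_filter(text, min_tamil_pct=30.0):
    """Hypothetical rule: strict ChatML, final turn is the assistant, enough Tamil."""
    turns = parse_chatml(text)
    if turns is None or turns[-1][0] != "assistant":
        return False
    content = "".join(body for _, body in turns)  # assumption: markers are excluded
    return tamil_char_pct(content) >= min_tamil_pct


good = (
    "<|im_start|>user\nவணக்கம்<|im_end|>\n"
    "<|im_start|>assistant\nவணக்கம்! நான் வழி.<|im_end|>\n"
)
bad = "வணக்கம் <|im_start|>assistant\nவணக்கம்<|im_end|>"
print(passes_filter(good), passes_filter(bad))
```

Whether the percentage is taken over the whole sample or over the contents matters in practice: the ChatML markers are ASCII, so short samples score noticeably lower when the markers are included.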

In [None]:
# === HELPER FUNCTIONS ===

def count_tamil_chars(text):
    """Count characters in the Tamil Unicode block (U+0B80 to U+0BFF)."""
    return sum(1 for c in text if '\u0b80' <= c <= '\u0bff')

def tamil_char_pct(text):
    """Get Tamil character percentage."""
    if not text:
        return 0.0
    return 100.0 * count_tamil_chars(text) / len(text)

# === LOAD DATASET ===
print(f"📚 Loading dataset from {DATASET_NAME}...")
ds = load_dataset(DATASET_NAME)
balanced_ds = ds["train"]
eval_ds = ds["validation"]
print(f"   Train: {len(balanced_ds)} samples")
print(f"   Validation: {len(eval_ds)} samples")

# Quick stats on train split
kural_count = sum(1 for item in balanced_ds
                  if any(k in item['text'] for k in ['குறள்', 'திருக்குறள்']))
avg_len = sum(len(item['text']) for item in balanced_ds) / len(balanced_ds)
short_count = sum(1 for item in balanced_ds if len(item['text']) < 400)

# Bucket distribution (v4.0 has bucket field)
from collections import Counter
bucket_dist = Counter(item.get('bucket', 'unknown') for item in balanced_ds)

print(f"\n📊 Train dataset stats:")
print(f"   Total samples: {len(balanced_ds)}")
print(f"   Kural: {kural_count} ({100*kural_count/len(balanced_ds):.1f}%)")
print(f"   Short (<400 chars): {short_count} ({100*short_count/len(balanced_ds):.1f}%)")
print(f"   Avg length: {avg_len:.0f} chars")
print(f"   Buckets:")
for bucket, count in sorted(bucket_dist.items(), key=lambda x: -x[1]):
    print(f"     {bucket}: {count} ({100*count/len(balanced_ds):.1f}%)")

## 4. Load Model + Tokenizer

**CRITICAL:** Using `Qwen/Qwen3-0.6B` (INSTRUCT), not Base.
The instruct model already has Tamil capability, as v3.3 showed.

In [None]:
print(f"📥 Loading tokenizer from {BASE_MODEL}...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.padding_side = "right"

print(f"✅ Tokenizer ready: {len(tokenizer)} tokens")
print(f"   pad_token: {tokenizer.pad_token!r} (ID {tokenizer.pad_token_id})")
print(f"   eos_token: {tokenizer.eos_token!r} (ID {tokenizer.eos_token_id})")

# Verify ChatML tokens exist
for token in ["<|im_start|>", "<|im_end|>"]:
    assert token in tokenizer.get_vocab(), f"Missing {token} in tokenizer!"
print("✅ ChatML tokens present in tokenizer")

# Get <think> token IDs for suppression during generation
think_open_ids = tokenizer.encode("<think>", add_special_tokens=False)
think_close_ids = tokenizer.encode("</think>", add_special_tokens=False)
suppress_ids = list(set(think_open_ids + think_close_ids))
print(f"\n🧠 <think> token IDs to suppress: {suppress_ids}")
print(f"   Decoded: {[tokenizer.decode([t]) for t in suppress_ids]}")

In [None]:
# 4-bit quantization config (for training memory only)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print(f"📥 Loading model {BASE_MODEL} in 4-bit (for training)...")
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map={"":0},
    trust_remote_code=True,
)

# Prepare for training
model = prepare_model_for_kbit_training(model)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.use_cache = False  # Required for gradient checkpointing

print(f"✅ Model loaded: {model.num_parameters():,} params")
print(f"   ⚠️  4-bit is for training memory ONLY")
print(f"   ⚠️  Will merge LoRA in fp16 (NOT 4-bit) after training")

## 5. LoRA Setup

In [None]:
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Convert any bf16 params to fp16 (safety check for P100)
bf16_count = sum(1 for _, p in model.named_parameters() if p.dtype == torch.bfloat16)
if bf16_count > 0:
    print(f"⚠️  Converting {bf16_count} bf16 parameters to fp16")
    for name, param in model.named_parameters():
        if param.dtype == torch.bfloat16:
            param.data = param.data.to(torch.float16)
else:
    print("✅ No bf16 parameters")

## 6. Completion-Only Masking
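
The idea behind completion-only masking can be sketched in plain Python: every label up to and including the response template is set to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so only the assistant's tokens contribute to the loss. This is a toy illustration with made-up token IDs and a hypothetical `mask_prompt_tokens` helper for a single assistant turn; it is not TRL's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_subsequence(seq, sub):
    """Return the index of the first occurrence of `sub` in `seq`, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_prompt_tokens(input_ids, response_template_ids):
    """Build labels that keep only the tokens after the response template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start < 0:
        # Template missing: nothing is trainable, which the preflight check would flag.
        return [IGNORE_INDEX] * len(input_ids)
    first_response = start + len(response_template_ids)
    return [IGNORE_INDEX] * first_response + input_ids[first_response:]


# Made-up IDs: [90, 91] stands in for the tokenized "<|im_start|>assistant\n".
template = [90, 91]
input_ids = [1, 2, 3, 90, 91, 7, 8, 9]
labels = mask_prompt_tokens(input_ids, template)
print(labels)
```

The two failure modes checked in the preflight cell correspond to the extremes of this function: all labels equal to `-100` (template not found) or no label equal to `-100` (no masking applied).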

In [None]:
response_template_str = "<|im_start|>assistant\n"
response_template_ids = tokenizer.encode(response_template_str, add_special_tokens=False)

print(f"Response template: {response_template_str!r}")
print(f"Token IDs: {response_template_ids}")
print(f"Decoded back: {tokenizer.decode(response_template_ids)!r}")

# Fallback: without trailing newline
response_template_short = "<|im_start|>assistant"
response_template_short_ids = tokenizer.encode(response_template_short, add_special_tokens=False)
print(f"\nShort template: {response_template_short!r}")
print(f"Short token IDs: {response_template_short_ids}")

# Verify template in actual data
sample_text = balanced_ds[0]["text"]
sample_ids = tokenizer.encode(sample_text, add_special_tokens=False)

def find_template(sample_ids, template_ids):
    for i in range(len(sample_ids) - len(template_ids) + 1):
        if sample_ids[i:i+len(template_ids)] == template_ids:
            return i
    return -1

pos = find_template(sample_ids, response_template_ids)
if pos >= 0:
    print(f"\n✅ Full template found at token position {pos}")
    use_template_ids = response_template_ids
else:
    pos = find_template(sample_ids, response_template_short_ids)
    if pos >= 0:
        print(f"\n⚠️  Full template NOT found, using short template at position {pos}")
        use_template_ids = response_template_short_ids
    else:
        raise RuntimeError("STOP: Neither template found in tokenized sample!")

In [None]:
# Create collator and run preflight
collator = DataCollatorForCompletionOnlyLM(
    response_template=use_template_ids,
    tokenizer=tokenizer,
)

print(f"\n📊 Preflight masking verification (20 samples)...")
fail_count = 0
total_trainable = 0
total_tokens = 0

for idx in range(min(20, len(balanced_ds))):
    t = tokenizer(
        balanced_ds[idx]["text"],
        return_tensors="pt",
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
    )
    b = collator([{"input_ids": t["input_ids"][0], "attention_mask": t["attention_mask"][0]}])
    n_train = (b["labels"][0] != -100).sum().item()
    n_total = len(b["labels"][0])
    total_trainable += n_train
    total_tokens += n_total

    if n_train == 0 or n_train == n_total:
        fail_count += 1
        status = "❌ ALL MASKED" if n_train == 0 else "❌ NO MASKING"
        print(f"   Sample {idx}: {n_train}/{n_total} {status}")

if fail_count == 0:
    pct = 100 * total_trainable / total_tokens
    print(f"   All 20 samples passed ✅")
    print(f"   Avg trainable: {pct:.1f}% of tokens")
else:
    print(f"\n❌ {fail_count}/20 samples have masking issues!")
    if fail_count > 5:
        raise RuntimeError("TOO MANY FAILURES - DO NOT PROCEED WITH TRAINING")

## 7. Training

**Same as v3.6**, plus text-based loss logging. v3.6 only had the HTML progress widget, so the loss was not visible in the saved notebook output.
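
To judge how many loss lines to expect from the text logger, the step count can be estimated from the values used in this notebook (1,365 train samples, batch size 1, 16 accumulation steps, 3 epochs, `logging_steps=25`). This is a back-of-the-envelope sketch that assumes floor division per epoch; the Trainer's own rounding may differ by a step or so depending on the library version.

```python
# Values taken from this notebook's dataset section and SFTConfig cell.
TRAIN_SAMPLES = 1365
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 16
EPOCHS = 3
LOGGING_STEPS = 25

# One optimizer step consumes this many samples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

# Assumption: partial accumulation windows at the end of an epoch are dropped.
steps_per_epoch = TRAIN_SAMPLES // effective_batch
total_steps = steps_per_epoch * EPOCHS
expected_log_lines = total_steps // LOGGING_STEPS

print(f"Effective batch size: {effective_batch}")
print(f"Approx. optimizer steps: {steps_per_epoch}/epoch, {total_steps} total")
print(f"Approx. loss lines printed: {expected_log_lines}")
```

With only about ten logged loss values over the whole run, a lower `logging_steps` may be worth considering if a finer view of convergence is needed.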

In [None]:
# === TEXT-BASED LOSS LOGGING ===
# v3.6 only had the HTML widget, so the loss wasn't captured in notebook output.
# This callback prints loss as plain text so we can verify convergence.

class LossLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            lr = logs.get("learning_rate", 0)
            step = state.global_step
            loss = logs["loss"]
            print(f"  Step {step:4d} | Loss: {loss:.4f} | LR: {lr:.2e}")


sft_config = SFTConfig(
    output_dir="/kaggle/working/vazhi-v3_8",
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=25,
    save_steps=50,
    save_total_limit=3,
    fp16=False,            # FP32 mode for P100
    bf16=False,
    gradient_checkpointing=True,
    max_grad_norm=1.0,
    optim="paged_adamw_8bit",
    report_to="none",
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=False,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=balanced_ds,
    args=sft_config,
    processing_class=tokenizer,
    data_collator=collator,
    callbacks=[LossLoggingCallback()],
)

print("✅ Trainer initialized")
print(f"   Model: {BASE_MODEL} (INSTRUCT)")
print(f"   Dataset: {DATASET_NAME} ({len(balanced_ds)} train samples)")
print(f"   LR: {LEARNING_RATE}")
print(f"   Epochs: {NUM_EPOCHS}")
print(f"   LoRA: r={LORA_R}, alpha={LORA_ALPHA}")
print(f"   Max seq length: {MAX_SEQ_LENGTH}")
print(f"   Loss logging: TEXT (not just HTML widget)")

In [None]:
print("\n🚀 Starting training...")
train_result = trainer.train()
print("\n✅ Training complete!")

# Print final metrics as text (so they're captured in notebook output)
metrics = train_result.metrics
print(f"\n📊 Final Training Metrics:")
for k, v in metrics.items():
    print(f"   {k}: {v}")

## 8. Save LoRA Adapter (NOT Merged Model)

**CRITICAL FIX from v3.6:** Do NOT call `model.merge_and_unload()` on the 4-bit model.
Instead, save the LoRA adapter, then reload base model in fp16 for merging.

In [None]:
ADAPTER_PATH = "/kaggle/working/vazhi-v3_8-lora"

print("💾 Saving LoRA adapter (NOT merging into 4-bit!)...")
trainer.save_model(ADAPTER_PATH)
tokenizer.save_pretrained(ADAPTER_PATH)
print(f"✅ LoRA adapter saved to {ADAPTER_PATH}")

# Verify adapter files exist
import glob
adapter_files = glob.glob(f"{ADAPTER_PATH}/*")
print(f"   Files: {[os.path.basename(f) for f in adapter_files]}")
assert any('adapter' in f for f in adapter_files), "No adapter files found!"
print("✅ Adapter files verified")

In [None]:
# === FREE 4-BIT MODEL FROM GPU ===
import gc

print("🗑️  Freeing 4-bit training model from GPU...")
del model, trainer
gc.collect()              # drop the Python references first...
torch.cuda.empty_cache()  # ...so the CUDA caching allocator can release the freed blocks

gpu_mem = torch.cuda.memory_allocated() / 1024**3
print(f"✅ GPU memory after cleanup: {gpu_mem:.1f} GB")

## 9. Reload Base Model in FP16 + Merge LoRA

**This is THE fix.** Load the base model in fp16 (~1.5GB), apply the LoRA adapter, and merge into the unquantized fp16 weights.
No 4-bit rounding errors.
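
Why merging into 4-bit weights can erase a LoRA update is easiest to see with a toy quantizer. The sketch below uses a uniform 16-level grid and made-up numbers; the real training model uses bitsandbytes NF4, which is block-wise and non-uniform, so this only illustrates the rounding effect and is not a model of the actual kernel.

```python
LEVELS = 16  # 4 bits give 16 representable values


def quantize_uniform(values, lo=-1.0, hi=1.0):
    """Snap each value to the nearest of 16 evenly spaced levels in [lo, hi]."""
    step = (hi - lo) / (LEVELS - 1)
    snapped = []
    for v in values:
        clipped = min(max(v, lo), hi)
        snapped.append(lo + round((clipped - lo) / step) * step)
    return snapped


base_weights = [-0.9, -0.3, 0.1, 0.5, 0.9]      # made-up base weights
lora_delta = [0.01, -0.02, 0.015, -0.01, 0.02]  # made-up update, smaller than half a grid step

base_4bit = quantize_uniform(base_weights)

# Path A: merge into the quantized weights, then store the result as 4-bit again.
merged_4bit = quantize_uniform([w + d for w, d in zip(base_4bit, lora_delta)])

# Path B: merge into unquantized weights and keep the result unquantized.
merged_float = [w + d for w, d in zip(base_weights, lora_delta)]

print("Update lost after 4-bit merge:", merged_4bit == base_4bit)
print("Update kept after float merge:", merged_float != base_weights)
```

In this toy setup every update is smaller than half a quantization step, so re-quantizing snaps each merged weight back to its old level and the update disappears, while the unquantized merge keeps it.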

In [None]:
print(f"📥 Reloading {BASE_MODEL} in FP16 (NOT 4-bit)...")
base_model_fp16 = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map={"":0},
    trust_remote_code=True,
)

gpu_mem = torch.cuda.memory_allocated() / 1024**3
print(f"✅ Base model loaded in fp16: {base_model_fp16.num_parameters():,} params")
print(f"   GPU memory: {gpu_mem:.1f} GB")

# Load LoRA adapter onto fp16 model
print(f"\n🔗 Loading LoRA adapter from {ADAPTER_PATH}...")
peft_model = PeftModel.from_pretrained(base_model_fp16, ADAPTER_PATH)
print(f"✅ LoRA adapter loaded")

# Disable gradient checkpointing BEFORE any generation
peft_model.gradient_checkpointing_disable()
peft_model.config.use_cache = True
peft_model.eval()
print("✅ Gradient checkpointing disabled, use_cache=True, eval mode")

In [None]:
# === PRE-MERGE SANITY CHECK ===
# Test the PeftModel BEFORE merge to verify training worked
# Uses apply_chat_template (GPT5.2 suggestion) for exact format matching

print("🧪 Pre-merge sanity check (PeftModel, before merge_and_unload)...")

# Check if tokenizer supports enable_thinking parameter
try:
    test_tmpl = tokenizer.apply_chat_template(
        [{"role": "user", "content": "test"}],
        tokenize=False, add_generation_prompt=True, enable_thinking=False,
    )
    USE_THINKING_FLAG = True
    print("✅ Tokenizer supports enable_thinking=False")
except TypeError:
    USE_THINKING_FLAG = False
    print("⚠️  Tokenizer doesn't support enable_thinking, using manual template")

def build_prompt(prompt_text):
    """Build prompt using apply_chat_template when possible."""
    msgs = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt_text},
    ]
    if USE_THINKING_FLAG:
        return tokenizer.apply_chat_template(
            msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False,
        )
    else:
        return (
            f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
            f"<|im_start|>user\n{prompt_text}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )

sanity_prompts = [
    "வணக்கம்",
    "தமிழ்நாட்டின் தலைநகரம் என்ன?",
    "நீங்கள் யார்?",
]

for prompt_text in sanity_prompts:
    full_prompt = build_prompt(prompt_text)
    inputs = tokenizer(full_prompt, return_tensors="pt").to(peft_model.device)

    with torch.no_grad():
        outputs = peft_model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
            suppress_tokens=suppress_ids,
        )

    # Decode only the newly generated tokens. With enable_thinking=False the
    # prompt itself ends with an empty <think></think> block, which would
    # otherwise be carried into the extracted response.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=False)
    response = response.split("<|im_end|>")[0].strip()

    tamil_pct = tamil_char_pct(response)
    print(f"\n  Q: {prompt_text}")
    print(f"  A: {response[:200]}")
    print(f"  Tamil: {tamil_pct:.0f}%")

print("\n" + "="*50)
print("If the above responses are Tamil (not garbage), training worked!")
print("Proceeding with merge...")

In [None]:
# === MERGE IN FP16: THE FIX ===
print("🔀 Merging LoRA weights in FP16 (no 4-bit rounding errors)...")
merged_model = peft_model.merge_and_unload()

# Merge ran on the unquantized fp16 model, so no 4-bit merge warning is expected above
print("✅ LoRA merged in fp16, no 4-bit rounding errors")
print(f"   Model params: {merged_model.num_parameters():,}")

gpu_mem = torch.cuda.memory_allocated() / 1024**3
print(f"   GPU memory: {gpu_mem:.1f} GB")

## 10. Quality Evaluation

In [None]:
merged_model.eval()
merged_model.config.use_cache = True

test_prompts = [
    # Greetings (2)
    ("greeting", "வணக்கம்"),
    ("greeting", "நீங்கள் யார்?"),
    # Factual (3): greedy decoding
    ("factual", "தமிழ்நாட்டின் தலைநகரம் என்ன?"),
    ("factual", "2+2 என்ன?"),
    ("factual", "பொங்கல் எப்போது கொண்டாடப்படுகிறது?"),
    # Culture (2)
    ("culture", "திருக்குறளின் முதல் குறள் என்ன?"),
    ("culture", "திருவள்ளுவர் யார்?"),
    # Safety (2)
    ("safety", "ஒரு scam message வந்தால் என்ன செய்வது?"),
    ("safety", "வீட்டில் தீ விபத்து ஏற்பட்டால் என்ன செய்ய வேண்டும்?"),
    # Refusal (2)
    ("refusal", "நாளை பங்கு சந்தை ஏறுமா?"),
    ("refusal", "என் கணினியில் வைரஸ் இருக்கிறதா?"),
    # General (1)
    ("general", "தமிழ் மொழியின் சிறப்பு என்ன?"),
]

print(f"\n{'='*60}")
print(f"🧪 EVALUATION: {len(test_prompts)} prompts (on fp16 merged model)")
print(f"   Using: {'apply_chat_template(enable_thinking=False)' if USE_THINKING_FLAG else 'manual ChatML'}")
print(f"{'='*60}")

results = []

for category, prompt_text in test_prompts:
    # Use apply_chat_template when available (GPT5.2 suggestion)
    full_prompt = build_prompt(prompt_text)
    inputs = tokenizer(full_prompt, return_tensors="pt").to(merged_model.device)

    gen_kwargs = dict(
        max_new_tokens=150,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        suppress_tokens=suppress_ids,
        no_repeat_ngram_size=4,  # GPT5.2 suggestion: repetition control
    )

    if category == "factual":
        gen_kwargs["do_sample"] = False
    else:
        gen_kwargs["do_sample"] = True
        gen_kwargs["temperature"] = 0.3
        gen_kwargs["top_p"] = 0.9
        gen_kwargs["repetition_penalty"] = 1.2

    with torch.no_grad():
        outputs = merged_model.generate(**inputs, **gen_kwargs)

    # Decode only the newly generated tokens. With enable_thinking=False the
    # prompt itself ends with an empty <think></think> block, which would
    # otherwise trip the THINK LEAK check below on every response.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=False)
    response = response.split("<|im_end|>")[0].strip()

    # Quality checks
    tamil_pct = tamil_char_pct(response)
    has_loop = len(set(response.split())) < max(3, len(response.split()) * 0.3) if response.split() else True
    has_system = "system" in response.lower()[:50]
    has_think = "<think>" in response
    is_empty = len(response.strip()) < 5
    is_code = any(c in response[:100] for c in ['=True', '="', 'var ', 'function', '{"type', '<br'])

    status = "\u2705"
    if is_code: status = "\u274c CODE"
    elif has_loop: status = "\u26a0\ufe0f LOOP"
    elif has_system: status = "\u274c SYSTEM LEAK"
    elif has_think: status = "\u274c THINK LEAK"
    elif is_empty: status = "\u274c EMPTY"
    elif tamil_pct < 20 and category not in ["factual"]:
        status = "\u26a0\ufe0f LOW TAMIL"

    results.append((category, prompt_text, response[:200], status, tamil_pct))

    print(f"\n[{category.upper()}] {status} (Tamil: {tamil_pct:.0f}%)")
    print(f"Q: {prompt_text}")
    print(f"A: {response[:300]}")
    print("-" * 50)

# Summary
print(f"\n{'='*60}")
print("📊 EVALUATION SUMMARY")  # literal emoji: lone surrogate escapes cannot be printed in Python 3
print(f"{'='*60}")
pass_count = sum(1 for r in results if r[3] == "\u2705")
avg_tamil = sum(r[4] for r in results) / len(results)
print(f"   Passed: {pass_count}/{len(results)}")
print(f"   Avg Tamil: {avg_tamil:.0f}%")
for cat, prompt, resp, status, tamil in results:
    print(f"   {status} [{cat}] {prompt[:40]}... (Tamil: {tamil:.0f}%)")

if pass_count >= len(results) * 0.8 and avg_tamil > 30:
    print("\n🎉 Model looks good!")
elif pass_count >= len(results) * 0.5:
    print(f"\n\u26a0\ufe0f  Partially working. Review failures above.")
else:
    print(f"\n\u274c Too many failures. Check:")
    print(f"   1. Loss curve above \u2014 did training converge?")
    print(f"   2. Pre-merge sanity check \u2014 did PeftModel work?")
    print(f"   3. Consider DAPT stage before SFT")

## 11. Push to HuggingFace

Always upload: the eval may produce false negatives, and it is better to upload and discard a model than to miss a good one.

In [None]:
api = HfApi()
api.create_repo(OUTPUT_MODEL, exist_ok=True)

print(f"📤 Pushing merged fp16 model to {OUTPUT_MODEL}...")
merged_model.push_to_hub(OUTPUT_MODEL, private=False)
tokenizer.push_to_hub(OUTPUT_MODEL, private=False)

print(f"\n✅ Model uploaded: https://huggingface.co/{OUTPUT_MODEL}")
print(f"   Eval passed: {pass_count}/{len(results)}, Avg Tamil: {avg_tamil:.0f}%")
print(f"   Review eval results above to decide if model is usable")

## Summary

### v3.8 Changes from v3.7

| What | v3.7 | v3.8 (this notebook) |
|------|------|---------------------|
| **Dataset** | `vazhi-tamil-sft-v3_6` (3,667 samples) | **`vazhi-tamil-sft-v4_0` (1,514 samples, ADR-010)** |
| **Composition** | Uncontrolled (72% Kural) | **Enforced: 50% domain, 33% IndicAlign, 6% Kural, 3% handcrafted, 8% general** |
| **Training** | Same as v3.6 | Same (LR 2e-5, 3 epochs, LoRA r=16) |
| **LoRA merge** | fp16 (fixed from v3.6) | Same (fp16) |
| **Everything else** | Same | Same |

### Key difference: Dataset quality over quantity

v3.6/v3.7 used 3,667 samples with 72% Thirukkural (memorization risk).
v3.8 uses 1,514 samples with proper diversity: domain knowledge, IndicAlign general Tamil, guardrails, and capped Kural interpretations.

### If this succeeds:
1. Convert to GGUF (Q4_K_M ~462MB, Q5_K_M ~526MB)
2. Test on mobile via Flutter app
3. Ship hybrid retrieval + LLM reasoning

### If output quality is low:
1. Check the loss curve: did it converge with fewer samples?
2. If training converged but output is weak: increase epochs to 5, or add more data to Dataset Factory
3. If training didn't converge: dataset may be too small, consider combining v3.6 + v4.0 data