# VAZHI SFT v3.2 - Fixed Training

**FIXES from v3.1 failure:**
1. **Data format consistency** - Only uses ChatML-formatted samples (no raw text mixing)
2. **Version floors** - Minimum library versions (`>=`) that support Qwen3; these are not exact pins, so later releases can still change APIs
3. **Single GPU forced** - Prevents cuda:1 vs cuda:0 errors
4. **fp16 on T4** - Not bf16 (the T4 is pre-Ampere and lacks native bf16 support)
5. **4-bit QLoRA** - More stable/memory efficient on Kaggle

**Root cause of v3.1 failure:**
Mixed raw text (Sangam poetry, Thirukkural verses) with ChatML-formatted Q&A pairs.
This caused the model to output "systemsystemsystem..." garbage.

**Per GPT5.2 recommendations.**

## 1. Setup & Dependencies

**IMPORTANT:** After running the install cell, **RESTART the Kaggle session** (Kernel → Restart Session)

In [None]:
# Install dependencies - Updated versions for Qwen3 support
# NOTE: transformers 4.46.3 does NOT support Qwen3 (too old)
# Using newer versions that support Qwen3 while keeping other libs stable
# 
# IMPORTANT: After running this cell, RESTART the Kaggle session!

!pip -q install -U \
  "transformers>=4.51.0" \
  "accelerate>=0.34.2" \
  "peft>=0.12.0" \
  "trl>=0.12.0" \
  "bitsandbytes>=0.43.3" \
  "datasets>=2.21.0" \
  "huggingface_hub>=0.24.7"

# Print the installed version (the install above requires transformers>=4.51.0 for Qwen3)
import transformers
print(f"✅ Transformers version: {transformers.__version__}")
print("✅ Dependencies installed")
print("⚠️ IMPORTANT: Restart the Kaggle session now (Kernel → Restart Session)")
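
The install cell prints the installed version but does not compare it with the required floor. A minimal sketch of such a comparison follows. `meets_floor` is a hypothetical helper (not part of transformers), and it assumes plain numeric version strings, ignoring suffixes such as `.dev0`.

```python
import re

def meets_floor(installed: str, floor: str) -> bool:
    """Return True if `installed` is at least `floor`, comparing numeric parts only."""
    def parse(version: str) -> tuple:
        parts = []
        for piece in version.split("."):
            match = re.match(r"\d+", piece)
            if match is None:
                break  # stop at the first non-numeric component, e.g. "dev0"
            parts.append(int(match.group()))
        while len(parts) < 3:
            parts.append(0)  # pad so "4.51" compares equal to "4.51.0"
        return tuple(parts)
    return parse(installed) >= parse(floor)

# Hypothetical use after the session restart:
#   import transformers
#   assert meets_floor(transformers.__version__, "4.51.0"), "transformers too old for Qwen3"
print(meets_floor("4.46.3", "4.51.0"))
print(meets_floor("4.51.3", "4.51.0"))
```

Failing fast here is cheaper than discovering an unsupported model type when the model is loaded.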

In [None]:
# ============================================================================
# CRITICAL: Force single GPU BEFORE importing torch/transformers
# Per GPT5.2: This prevents "cuda:1 vs cuda:0" device mismatch errors
# Must be at the VERY TOP before any other imports
# ============================================================================
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import json
import random
import re
from collections import defaultdict
from datasets import load_dataset, Dataset
from tqdm.auto import tqdm
from huggingface_hub import login, HfApi

# Config
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

# HuggingFace repos
EXISTING_DATASET = "CryptoYogi/vazhi-tamil-v05"
BALANCED_DATASET = "CryptoYogi/vazhi-tamil-sft-v3_2"  # New version

# System prompt
SYSTEM_PROMPT = "நீங்கள் VAZHI (வழி), தமிழ் மக்களுக்கான AI உதவியாளர். தமிழில் தெளிவாகவும் உதவியாகவும் பதிலளியுங்கள். தெரியாவிட்டால் \"தெரியவில்லை\" என்று சொல்லுங்கள்."

print("✅ Configuration loaded")
print(f"   CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'not set')}")
print(f"   Source: {EXISTING_DATASET}")
print(f"   Target: {BALANCED_DATASET}")
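
The ordering rule in this cell (set `CUDA_VISIBLE_DEVICES` before torch is imported) can be checked mechanically. The sketch below is an assumption-laden illustration: `force_single_gpu` is a hypothetical helper. It reports whether torch was already in `sys.modules` when the variable was set, which is the conservative ordering this notebook asks for.

```python
import os
import sys

def force_single_gpu(device_index: str = "0") -> bool:
    """Pin the process to one GPU; return False if torch was imported beforehand."""
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = device_index
    return not torch_already_imported

safe_order = force_single_gpu("0")
print(os.environ["CUDA_VISIBLE_DEVICES"], safe_order)
```

If the function returns `False`, restarting the session and running this cell first is the simplest remedy.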

In [None]:
# Login to HuggingFace
from kaggle_secrets import UserSecretsClient
secrets = UserSecretsClient()
hf_token = secrets.get_secret("HF_TOKEN")
login(token=hf_token)
print("✅ Logged in to HuggingFace")

## 2. Helper Functions

In [None]:
# ============================================================================
# CHECK: Skip data preparation if balanced dataset already exists
# ============================================================================
SKIP_DATA_PREP = False

try:
    from huggingface_hub import dataset_info
    info = dataset_info(BALANCED_DATASET)
    print(f"✅ Dataset {BALANCED_DATASET} already exists on HuggingFace!")
    print(f"   Created: {info.created_at}")
    print(f"   Downloads: {info.downloads}")
    print(f"\n🚀 SKIPPING data extraction/preparation - will load directly for training")
    SKIP_DATA_PREP = True
except Exception as e:
    print(f"📝 Dataset {BALANCED_DATASET} not found. Will create it.")
    print(f"   (This is expected on first run)")
    SKIP_DATA_PREP = False
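
Later cells call `clean_text`, `count_tamil_chars`, and `is_good_tamil_sample`, but their definitions do not appear in this export. The sketch below shows one plausible shape for them. The bodies and the thresholds (`min_chars`, `min_tamil_ratio`) are assumptions, not the notebook's actual implementation.

```python
import re
import unicodedata

TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF  # Unicode Tamil block

def clean_text(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def count_tamil_chars(text: str) -> int:
    """Count code points that fall inside the Tamil Unicode block."""
    return sum(1 for ch in text if TAMIL_START <= ord(ch) <= TAMIL_END)

def is_good_tamil_sample(text: str, min_chars: int = 10, min_tamil_ratio: float = 0.3) -> bool:
    """Accept text that is long enough and mostly Tamil (assumed thresholds)."""
    if len(text) < min_chars:
        return False
    return count_tamil_chars(text) / len(text) >= min_tamil_ratio

print(count_tamil_chars("தமிழ் abc"))
print(is_good_tamil_sample("தமிழ்நாட்டின் தலைநகரம் என்ன?"))
```

The ratio is computed over all characters, including spaces and punctuation, so the threshold should sit well below 1.0.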

## 3. Extract Diverse QA from IndicAlign

IndicAlign contains Tamil translations in the `tam_Taml` field (translated using IndicTrans2 by AI4Bharat).
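
The nesting of `tam_Taml` is easy to get wrong, so here is a small sketch on a toy record. The record is invented for illustration and is not real IndicAlign data. `first_turn_pair` is a hypothetical helper that mirrors the unpacking done in the extraction function.

```python
# Toy record mimicking the nested layout: a list of length 1 holding the turn list
record = {"tam_Taml": [["பயனர் கேள்வி", "உதவியாளர் பதில்", "அடுத்த கேள்வி"]]}

def first_turn_pair(item: dict):
    """Return (user, assistant) from the first conversation, or None if malformed."""
    conversations = item.get("tam_Taml") or []
    if not conversations or not isinstance(conversations[0], list):
        return None
    turns = conversations[0]
    if len(turns) < 2:
        return None
    return str(turns[0]), str(turns[1])

print(first_turn_pair(record))
print(first_turn_pair({"tam_Taml": []}))
```

Only the first user/assistant pair is kept; later turns in the same conversation are ignored.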

In [None]:
def extract_from_indicaling(config_name, max_samples):
    """Extract Tamil samples from IndicAlign config.
    
    IndicAlign structure:
    - tam_Taml is a list of length 1
    - tam_Taml[0] is a list of conversation turns [user, assistant, ...]
    - tam_Taml[0][0] = user message (Tamil)
    - tam_Taml[0][1] = assistant message (Tamil)
    """
    print(f"\n📚 Loading {config_name}...")
    try:
        ds = load_dataset("ai4bharat/indic-align", config_name, split="train", streaming=True)
    except Exception as e:
        print(f"   ⚠️ Error: {e}")
        return []
    
    samples = []
    seen = set()
    skipped_non_tamil = 0
    skipped_short = 0
    skipped_format = 0
    
    for item in tqdm(ds, desc=config_name, total=max_samples*5):
        if len(samples) >= max_samples:
            break
        
        # tam_Taml is a nested list: [[user_msg, assistant_msg, ...]]
        tamil = item.get('tam_Taml', [])
        if not tamil:
            continue
        
        # FIXED: Access nested list structure correctly
        if isinstance(tamil, list) and len(tamil) > 0:
            turns = tamil[0]  # Get the inner list
            if isinstance(turns, list) and len(turns) >= 2:
                user_msg = clean_text(str(turns[0]))
                assistant_msg = clean_text(str(turns[1]))
            else:
                skipped_format += 1
                continue
        else:
            skipped_format += 1
            continue
        
        # Verify it's actually Tamil
        if not is_good_tamil_sample(user_msg):
            skipped_non_tamil += 1
            continue
        if not is_good_tamil_sample(assistant_msg):
            skipped_short += 1
            continue
        
        # Dedup
        key = user_msg[:100]
        if key in seen:
            continue
        seen.add(key)
        
        samples.append({
            "instruction": user_msg,
            "output": assistant_msg,
            "source": config_name
        })
    
    print(f"   ✅ Extracted {len(samples)} Tamil samples")
    print(f"   ⏭️ Skipped: {skipped_non_tamil} (user turn rejected), {skipped_short} (assistant turn rejected), {skipped_format} (wrong format)")
    
    if samples:
        print(f"   📝 Sample verification:")
        print(f"      User: {samples[0]['instruction'][:80]}...")
        print(f"      Asst: {samples[0]['output'][:80]}...")
    
    return samples

In [None]:
# Extract from IndicAlign (SKIP if dataset already exists)
if not SKIP_DATA_PREP:
    print("🚀 Extracting diverse QA from IndicAlign...")
    print("   (tam_Taml field = Already-translated Tamil via IndicTrans2)")

    diverse_samples = []
    diverse_samples.extend(extract_from_indicaling("Dolly_T", 300))
    diverse_samples.extend(extract_from_indicaling("WikiHow", 250))
    diverse_samples.extend(extract_from_indicaling("Wiki_Conv", 300))
    diverse_samples.extend(extract_from_indicaling("OpenAssistant_T", 200))

    print(f"\n📊 Total extracted from IndicAlign: {len(diverse_samples)}")
else:
    print("⏭️ Skipping IndicAlign extraction (dataset already exists)")

In [None]:
# Verify Tamil content distribution (SKIP if dataset already exists)
if not SKIP_DATA_PREP:
    tamil_char_pcts = []
    for s in diverse_samples[:100]:
        text = s['instruction'] + s['output']
        pct = 100 * count_tamil_chars(text) / len(text) if text else 0
        tamil_char_pcts.append(pct)

    avg_tamil_pct = sum(tamil_char_pcts) / len(tamil_char_pcts) if tamil_char_pcts else 0
    print(f"📈 Average Tamil character % in samples: {avg_tamil_pct:.1f}%")

    if avg_tamil_pct < 40:
        print("⚠️ Warning: Tamil content seems low. Check extraction logic.")
    else:
        print("✅ Good Tamil content ratio!")
else:
    print("⏭️ Skipping Tamil verification (dataset already exists)")

## 4. Add Manual Samples (Short Answers + Behavior)

In [None]:
# Manual samples (SKIP if dataset already exists)
if not SKIP_DATA_PREP:
    manual_samples = [
        # Geography
        {"instruction": "தமிழ்நாட்டின் தலைநகரம் என்ன?", "output": "சென்னை.", "source": "manual"},
        {"instruction": "இந்தியாவின் தலைநகரம் எது?", "output": "புது தில்லி.", "source": "manual"},
        {"instruction": "உலகின் மிகப்பெரிய நாடு எது?", "output": "ரஷ்யா (பரப்பளவில்).", "source": "manual"},
        {"instruction": "தமிழ்நாட்டின் மாவட்டங்கள் எத்தனை?", "output": "38 மாவட்டங்கள்.", "source": "manual"},
        {"instruction": "காவிரி நதி எந்த மாநிலங்களில் பாய்கிறது?", "output": "கர்நாடகா மற்றும் தமிழ்நாடு.", "source": "manual"},
        {"instruction": "மதுரை எந்த நதிக்கரையில் உள்ளது?", "output": "வைகை நதிக்கரையில்.", "source": "manual"},
        {"instruction": "கங்கை நதி எங்கு உற்பத்தியாகிறது?", "output": "இமயமலையில் உள்ள கங்கோத்ரி பனிப்பாறையில்.", "source": "manual"},
        {"instruction": "இந்தியாவின் மக்கள்தொகை அதிகமான மாநிலம் எது?", "output": "உத்தரப் பிரதேசம்.", "source": "manual"},
        
        # Basic facts
        {"instruction": "சூரியன் எந்த திசையில் உதிக்கும்?", "output": "கிழக்கு திசையில்.", "source": "manual"},
        {"instruction": "ஒரு வாரத்தில் எத்தனை நாட்கள்?", "output": "ஏழு நாட்கள்.", "source": "manual"},
        {"instruction": "ஒரு வருடத்தில் எத்தனை மாதங்கள்?", "output": "12 மாதங்கள்.", "source": "manual"},
        {"instruction": "தண்ணீரின் கொதிநிலை என்ன?", "output": "100 டிகிரி செல்சியஸ்.", "source": "manual"},
        {"instruction": "2+2 என்ன?", "output": "4.", "source": "manual"},
        {"instruction": "10 x 10 என்ன?", "output": "100.", "source": "manual"},
        {"instruction": "100-ஐ 4-ஆல் வகுத்தால்?", "output": "25.", "source": "manual"},
        {"instruction": "பூமி சூரியனை சுற்ற எத்தனை நாட்கள் ஆகும்?", "output": "365 நாட்கள் (ஒரு வருடம்).", "source": "manual"},
        
        # Tamil culture (non-Thirukkural)
        {"instruction": "பொங்கல் எப்போது கொண்டாடப்படுகிறது?", "output": "தை மாதம் முதல் நாள் (ஜனவரி 14 அல்லது 15).", "source": "manual"},
        {"instruction": "தமிழ் எழுத்துக்கள் எத்தனை?", "output": "247 எழுத்துக்கள் (12 உயிர் + 18 மெய் + 216 உயிர்மெய் + 1 ஆய்தம்).", "source": "manual"},
        {"instruction": "சிலப்பதிகாரத்தை எழுதியவர் யார்?", "output": "இளங்கோவடிகள்.", "source": "manual"},
        {"instruction": "பாரதியார் எந்த ஊரில் பிறந்தார்?", "output": "எட்டயபுரம்.", "source": "manual"},
        {"instruction": "தமிழ் தினம் எப்போது?", "output": "ஜனவரி 9.", "source": "manual"},
        {"instruction": "தமிழ்நாட்டின் அலுவல் மொழி என்ன?", "output": "தமிழ்.", "source": "manual"},
        
        # Science
        {"instruction": "மனித உடலில் எத்தனை எலும்புகள் உள்ளன?", "output": "206 எலும்புகள்.", "source": "manual"},
        {"instruction": "H2O என்பது என்ன?", "output": "தண்ணீர் (நீர்).", "source": "manual"},
        {"instruction": "பூமியின் ஒரே இயற்கை துணைக்கோள் எது?", "output": "நிலவு (சந்திரன்).", "source": "manual"},
        {"instruction": "சூரிய குடும்பத்தில் எத்தனை கோள்கள்?", "output": "எட்டு கோள்கள்.", "source": "manual"},
        {"instruction": "ஒளியின் வேகம் என்ன?", "output": "வினாடிக்கு சுமார் 3 லட்சம் கிலோமீட்டர்.", "source": "manual"},
        
        # Everyday Tamil
        {"instruction": "நன்றி என்றால் ஆங்கிலத்தில் என்ன?", "output": "Thank you.", "source": "manual"},
        {"instruction": "Good morning தமிழில் என்ன?", "output": "காலை வணக்கம்.", "source": "manual"},
        {"instruction": "வணக்கம் என்றால் என்ன?", "output": "வணக்கம் என்பது தமிழில் பயன்படுத்தப்படும் வாழ்த்துச் சொல், Hello என்ற பொருளில்.", "source": "manual"},
        {"instruction": "ஆம் என்றால் ஆங்கிலத்தில்?", "output": "Yes.", "source": "manual"},
        {"instruction": "இல்லை என்றால் ஆங்கிலத்தில்?", "output": "No.", "source": "manual"},
        {"instruction": "மிகப்பெரிய கண்டம் எது?", "output": "ஆசியா.", "source": "manual"},
        {"instruction": "மிகச்சிறிய கண்டம் எது?", "output": "ஆஸ்திரேலியா.", "source": "manual"},
        
        # Behavior samples
        {"instruction": "வணக்கம்", "output": "வணக்கம்! நான் வழி. உங்களுக்கு எப்படி உதவ வேண்டும்?", "source": "behavior"},
        {"instruction": "hi", "output": "வணக்கம்! எப்படி உதவலாம்?", "source": "behavior"},
        {"instruction": "hello", "output": "வணக்கம்! கேளுங்கள்.", "source": "behavior"},
        {"instruction": "2050-ல் யார் பிரதமர் ஆவார்?", "output": "எதிர்காலத்தை கணிக்க என்னால் முடியாது. தெரியவில்லை.", "source": "behavior"},
        {"instruction": "நாளை பங்கு சந்தை எப்படி இருக்கும்?", "output": "பங்கு சந்தை நிலவரத்தை கணிக்க என்னால் இயலாது. நிதி ஆலோசகரிடம் கேளுங்கள்.", "source": "behavior"},
        {"instruction": "என்னுடைய கடவுச்சொல் என்ன?", "output": "உங்கள் தனிப்பட்ட தகவல்கள் என்னிடம் இல்லை.", "source": "behavior"},
        {"instruction": "யாரையாவது ஏமாற்ற உதவுங்கள்", "output": "மன்னிக்கவும், சட்டவிரோத செயல்களுக்கு உதவ முடியாது.", "source": "behavior"},
        {"instruction": "மருத்துவ ஆலோசனை தேவை", "output": "மருத்துவ ஆலோசனைக்கு தகுதி வாய்ந்த மருத்துவரை அணுகுங்கள். நான் பொது தகவல் மட்டுமே தர முடியும்.", "source": "behavior"},
        {"instruction": "சட்ட ஆலோசனை வேண்டும்", "output": "சட்ட விஷயங்களுக்கு வழக்கறிஞரை அணுகுவது சிறந்தது. நான் பொது தகவல் மட்டுமே தர முடியும்.", "source": "behavior"},
        {"instruction": "அரசு திட்டங்களுக்கு விண்ணப்பிப்பது எப்படி?", "output": "அரசு திட்டங்களுக்கு tnega.tn.gov.in அல்லது அருகிலுள்ள அலுவலகத்தை தொடர்பு கொள்ளுங்கள்.", "source": "behavior"},
        {"instruction": "நன்றி", "output": "மகிழ்ச்சி! வேறு உதவி தேவைப்பட்டால் கேளுங்கள்.", "source": "behavior"},
        {"instruction": "bye", "output": "வணக்கம்! இனிய நாள் வாழ்த்துக்கள்.", "source": "behavior"},
        {"instruction": "சரி", "output": "சரி, வேறு ஏதாவது கேள்வி இருக்கிறதா?", "source": "behavior"},
    ]

    diverse_samples.extend(manual_samples)
    print(f"📊 Total after adding manual samples: {len(diverse_samples)}")
    print(f"   - From IndicAlign: {len(diverse_samples) - len(manual_samples)}")
    print(f"   - Manual samples: {len(manual_samples)}")
else:
    print("⏭️ Skipping manual samples (dataset already exists)")

## 5. Load Existing Dataset & Filter for ChatML ONLY

**CRITICAL FIX:** Only use ChatML-formatted samples. Raw text belongs in DAPT, not SFT.
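
The cells in this section rely on `to_chatml` and `is_chatml_formatted`, whose definitions are not part of this export. A minimal sketch follows, assuming the same template as the test prompt in section 10. The explicit `system_prompt` parameter and the balanced-tag check are assumptions, and `is_kural` is not sketched because its matching rule is unknown.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def to_chatml(instruction: str, output: str, system_prompt: str = "You are a helpful assistant.") -> str:
    """Render one system/user/assistant exchange in ChatML (placeholder default prompt)."""
    return (
        f"{IM_START}system\n{system_prompt}{IM_END}\n"
        f"{IM_START}user\n{instruction}{IM_END}\n"
        f"{IM_START}assistant\n{output}{IM_END}"
    )

def is_chatml_formatted(text: str) -> bool:
    """True if tags are balanced and both a user and an assistant turn are present."""
    return (
        text.count(IM_START) == text.count(IM_END) > 0
        and f"{IM_START}user\n" in text
        and f"{IM_START}assistant\n" in text
    )

sample = to_chatml("2+2 என்ன?", "4.")
print(is_chatml_formatted(sample))
print(is_chatml_formatted("raw verse text without any chat tags"))
```

A raw verse fails the check because it carries no turn markers, which is the property used to keep such text out of SFT.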

In [None]:
# Load existing dataset & analyze format (SKIP if balanced dataset already exists)
if not SKIP_DATA_PREP:
    print(f"\n📚 Loading existing dataset from {EXISTING_DATASET}...")
    existing_ds = load_dataset(EXISTING_DATASET, split="train")
    print(f"   Loaded {len(existing_ds)} samples")

    # Analyze format distribution BEFORE filtering
    chatml_count = 0
    raw_count = 0
    for item in tqdm(existing_ds, desc="Analyzing formats"):
        text = item.get('text', '')
        if is_chatml_formatted(text):
            chatml_count += 1
        else:
            raw_count += 1

    print(f"\n📊 Existing dataset format analysis:")
    print(f"   ChatML formatted: {chatml_count} ({100*chatml_count/len(existing_ds):.1f}%)")
    print(f"   Raw text: {raw_count} ({100*raw_count/len(existing_ds):.1f}%)")
    print(f"")
    print(f"   ⚠️ Raw text samples will be EXCLUDED from SFT training")
    print(f"   📝 They belong in Micro-DAPT stage, not SFT")
else:
    print("⏭️ Skipping existing dataset analysis (balanced dataset already exists)")

In [None]:
# ============================================================================
# CRITICAL FIX: Filter existing dataset for ChatML ONLY
# Raw text and ChatML mixed = "systemsystemsystem..." garbage
# ============================================================================

if not SKIP_DATA_PREP:
    # Filter existing dataset - ONLY keep ChatML formatted samples
    existing_chatml_samples = []
    existing_kural_chatml = []
    existing_other_chatml = []

    for item in tqdm(existing_ds, desc="Filtering ChatML"):
        text = item.get('text', '')
        if is_chatml_formatted(text):
            if is_kural(text):
                existing_kural_chatml.append({"text": text})
            else:
                existing_other_chatml.append({"text": text})

    print(f"\n📊 ChatML samples from existing dataset:")
    print(f"   Kural (ChatML): {len(existing_kural_chatml)}")
    print(f"   Other (ChatML): {len(existing_other_chatml)}")
    print(f"   Total usable: {len(existing_kural_chatml) + len(existing_other_chatml)}")
else:
    print("⏭️ Skipping ChatML filtering (balanced dataset already exists)")

## 6. Downsample Thirukkural & Create Balanced Dataset

In [None]:
# Downsample Thirukkural to ~25% of the existing ChatML pool (Kural + other); SKIP if balanced dataset exists
if not SKIP_DATA_PREP:
    total_other = len(existing_other_chatml)
    target_kural_pct = 0.25
    target_kural_count = int(target_kural_pct * total_other / (1 - target_kural_pct))

    print(f"\n🎯 Downsampling Thirukkural:")
    print(f"   Current ChatML Kural: {len(existing_kural_chatml)}")
    print(f"   Target: {target_kural_count} ({100*target_kural_pct:.0f}%)")

    # Randomly sample (seeded for reproducibility)
    if len(existing_kural_chatml) > target_kural_count:
        downsampled_kural = random.sample(existing_kural_chatml, target_kural_count)
    else:
        downsampled_kural = existing_kural_chatml
    print(f"   Downsampled: {len(downsampled_kural)}")
else:
    print("⏭️ Skipping Thirukkural downsampling (balanced dataset already exists)")
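
The target formula solves k / (k + N) = p for k, where N is the non-Kural count and p the desired Kural share, giving k = p·N / (1 − p). The sketch below works it through with hypothetical counts. The numbers 3000 and 1100 are invented, and `kural_target` is an illustrative name.

```python
def kural_target(total_other: int, target_share: float) -> int:
    """Kural count k such that k / (k + total_other) is about target_share."""
    return int(target_share * total_other / (1 - target_share))

total_other = 3000   # hypothetical number of non-Kural ChatML samples
k = kural_target(total_other, 0.25)
print(k, k / (k + total_other))

diverse = 1100       # hypothetical number of newly added QA samples
print(round(k / (k + total_other + diverse), 3))
```

Because the diverse QA samples are appended after downsampling, the Kural share of the final dataset lands somewhat below the 25% target.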

In [None]:
# Combine all samples and verify format (SKIP if balanced dataset exists)
if not SKIP_DATA_PREP:
    # Convert diverse QA to ChatML format
    diverse_formatted = [{"text": to_chatml(s["instruction"], s["output"])} for s in diverse_samples]

    # Combine ALL samples - all must be ChatML formatted
    final_samples = []
    final_samples.extend(downsampled_kural)      # ChatML Kural
    final_samples.extend(existing_other_chatml)  # ChatML Other
    final_samples.extend(diverse_formatted)       # ChatML Diverse

    # Shuffle (seeded for reproducibility)
    random.shuffle(final_samples)

    print(f"\n📊 Final SFT dataset (ChatML ONLY):")
    print(f"   Downsampled Kural: {len(downsampled_kural)}")
    print(f"   Other (ChatML): {len(existing_other_chatml)}")
    print(f"   Diverse QA (new): {len(diverse_formatted)}")
    print(f"   ───────────────────")
    print(f"   Total: {len(final_samples)}")

    # CRITICAL: Verify 100% ChatML
    chatml_count = sum(1 for s in final_samples if is_chatml_formatted(s["text"]))
    print(f"\n📈 ChatML format %: {100*chatml_count/len(final_samples):.1f}% (MUST be 100%)")

    if chatml_count != len(final_samples):
        print("❌ ERROR: Not all samples are ChatML formatted! This will cause training failure.")
        raise ValueError("Data format inconsistency detected")
    else:
        print("✅ All samples are ChatML formatted - safe to train")

    # Verify Kural distribution
    final_kural = sum(1 for s in final_samples if is_kural(s["text"]))
    print(f"📈 Final Thirukkural %: {100*final_kural/len(final_samples):.1f}%")
else:
    print("⏭️ Skipping sample combination (balanced dataset already exists)")

## 7. Save & Upload to HuggingFace

In [None]:
# Save locally and split train/val (SKIP if balanced dataset exists)
if not SKIP_DATA_PREP:
    # Create output directory
    os.makedirs("/kaggle/working/balanced_sft", exist_ok=True)

    # Split 95/5 train/val
    split_idx = int(0.95 * len(final_samples))
    train_samples = final_samples[:split_idx]
    val_samples = final_samples[split_idx:]

    # Save locally
    with open("/kaggle/working/balanced_sft/train.jsonl", 'w', encoding="utf-8") as f:
        for s in train_samples:
            f.write(json.dumps(s, ensure_ascii=False) + '\n')

    with open("/kaggle/working/balanced_sft/val.jsonl", 'w', encoding="utf-8") as f:
        for s in val_samples:
            f.write(json.dumps(s, ensure_ascii=False) + '\n')

    print(f"💾 Saved locally:")
    print(f"   Train: {len(train_samples)} samples")
    print(f"   Val: {len(val_samples)} samples")
else:
    print("⏭️ Skipping local save (balanced dataset already exists)")
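
Since the JSONL files hold Tamil text written with `ensure_ascii=False`, a round-trip check with an explicit UTF-8 encoding can catch encoding problems before upload. This is a sketch on toy rows in a temporary directory. `write_jsonl` and `read_jsonl` are hypothetical helpers, not functions from the notebook.

```python
import json
import os
import tempfile

def write_jsonl(path: str, rows: list) -> None:
    """Write one JSON object per line, keeping Tamil text as readable UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path: str) -> list:
    """Read the file back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

rows = [{"text": f"மாதிரி {i}"} for i in range(20)]  # toy samples, not real data
split_idx = int(0.95 * len(rows))                     # same 95/5 rule as the save cell
train_rows, val_rows = rows[:split_idx], rows[split_idx:]

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.jsonl")
    write_jsonl(train_path, train_rows)
    restored = read_jsonl(train_path)

print(len(train_rows), len(val_rows), restored == train_rows)
```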

In [None]:
# Upload to HuggingFace (SKIP if balanced dataset exists)
if not SKIP_DATA_PREP:
    api = HfApi()

    # Create dataset repo (per GPT5.2: ensure repo exists before pushing)
    try:
        api.create_repo(BALANCED_DATASET, repo_type="dataset", exist_ok=True)
        print(f"✅ Created/verified repo: {BALANCED_DATASET}")
    except Exception as e:
        print(f"⚠️ Repo creation: {e}")

    # Upload files
    api.upload_file(
        path_or_fileobj="/kaggle/working/balanced_sft/train.jsonl",
        path_in_repo="train.jsonl",
        repo_id=BALANCED_DATASET,
        repo_type="dataset"
    )
    api.upload_file(
        path_or_fileobj="/kaggle/working/balanced_sft/val.jsonl",
        path_in_repo="val.jsonl",
        repo_id=BALANCED_DATASET,
        repo_type="dataset"
    )

    print(f"\n✅ Uploaded to: https://huggingface.co/datasets/{BALANCED_DATASET}")
else:
    print("⏭️ Skipping HuggingFace upload (balanced dataset already exists)")
    print(f"   Will load directly from: https://huggingface.co/datasets/{BALANCED_DATASET}")

## 8. Load Balanced Dataset for Training

In [None]:
# Load the balanced dataset for training
print(f"\n📚 Loading balanced dataset for training...")
balanced_ds = load_dataset(BALANCED_DATASET, split="train")
print(f"✅ Loaded {len(balanced_ds)} balanced samples")

# Show sample - verify it's ChatML formatted
print(f"\n📝 Sample (should show ChatML tags):")
sample_text = balanced_ds[0]['text'][:400]
print(sample_text + "...")

if "<|im_start|>" in sample_text:
    print("\n✅ Sample is ChatML formatted")
else:
    print("\n❌ ERROR: Sample is NOT ChatML formatted!")

---

## 9. SFT Training Setup

Now we train Qwen3-0.6B on the balanced dataset.

**Per GPT5.2 recommendations:**
- 4-bit QLoRA (more stable/memory efficient)
- fp16 compute dtype (T4 doesn't support bf16 well)
- Minimum library versions (`>=` floors, not exact pins)
- Single GPU forced
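
The fp16-versus-bf16 choice follows from the GPU's CUDA compute capability: bf16 needs Ampere (8.0) or newer, while the T4 is 7.5 and the P100 is 6.0. A minimal sketch of that decision follows. `pick_compute_dtype` is a hypothetical helper, and in the notebook the tuple could come from `torch.cuda.get_device_capability(0)`.

```python
def pick_compute_dtype(capability: tuple) -> str:
    """Choose a mixed-precision dtype name from a CUDA (major, minor) capability."""
    major, _minor = capability
    # bf16 needs Ampere (compute capability 8.0) or newer
    return "bfloat16" if major >= 8 else "float16"

# Compute capabilities of GPUs mentioned in this notebook, plus one Ampere card
for name, cap in {"P100": (6, 0), "T4": (7, 5), "A100": (8, 0)}.items():
    print(name, pick_compute_dtype(cap))
```

Deriving the dtype from the device keeps the quantization config and the trainer flags consistent when the notebook runs on a different accelerator.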

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# Try importing SFTConfig (newer TRL versions)
try:
    from trl import SFTConfig
    print("✅ Using TRL with SFTConfig (newer API)")
except ImportError:
    SFTConfig = None
    print("⚠️ Using TRL with TrainingArguments (older API)")

# Model config
BASE_MODEL = "Qwen/Qwen3-0.6B"
OUTPUT_MODEL = "CryptoYogi/vazhi-qwen3-v3_2"

print(f"🤖 Base model: {BASE_MODEL}")
print(f"📤 Output: {OUTPUT_MODEL}")

In [None]:
# Load model and tokenizer with 4-bit quantization
# 4-bit + LoRA is more stable/fast on Kaggle
print("\n📥 Loading model and tokenizer...")

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
# IMPORTANT: Do NOT set tokenizer.pad_token = tokenizer.eos_token
# Per TRAINING_LOG lesson #9: this causes "OrderedVocab holes" and corrupts the model
tokenizer.padding_side = "right"

# 4-bit quantization config (Kaggle-friendly)
# Use float16 compute dtype - P100/T4 don't support bf16 (requires Ampere+)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model with 4-bit quantization
# CRITICAL: Force torch_dtype=float16 - Qwen3 defaults to bf16 which P100 doesn't support
# bf16 requires Ampere architecture (A100, RTX 30xx) or newer
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,  # Force fp16 (P100/T4 don't support bf16)
    device_map={"":0},  # Force single GPU (prevents cuda:1 vs cuda:0 errors)
    trust_remote_code=True
)

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# Align model config with tokenizer (don't modify tokenizer)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

# Disable cache for gradient checkpointing compatibility
model.config.use_cache = False

print(f"✅ Model loaded: {model.num_parameters():,} parameters (4-bit quantized)")
print(f"   torch_dtype: float16 (P100 compatible)")
print(f"   pad_token_id: {tokenizer.pad_token_id}")
print(f"   eos_token_id: {tokenizer.eos_token_id}")

In [None]:
# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

In [None]:
# Fallback TrainingArguments, used only when the installed TRL has no SFTConfig (older releases)
training_args = TrainingArguments(
    output_dir="/kaggle/working/vazhi-v3_2",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # Effective batch = 16
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=25,
    save_steps=200,
    save_total_limit=2,
    fp16=True,  # T4 compatible (not bf16)
    gradient_checkpointing=True,
    max_grad_norm=1.0,
    optim="paged_adamw_8bit",
    report_to="none",
    remove_unused_columns=False,
)

print("✅ Training arguments configured")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Learning rate: {training_args.learning_rate}")
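
The batch settings above imply a step schedule that is worth estimating before training, for example to see how many checkpoints `save_steps=200` will produce. The sketch below is approximate and assumes a single GPU. The sample count of 4000 is hypothetical, and `schedule_summary` is an illustrative helper, not a Trainer API.

```python
import math

def schedule_summary(num_samples: int, per_device_batch: int = 2, grad_accum: int = 8,
                     epochs: int = 2, warmup_ratio: float = 0.1) -> dict:
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

print(schedule_summary(4000))  # 4000 is a made-up dataset size
```

The exact count reported by the trainer can differ by a step or two depending on how the final partial accumulation window is handled.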

In [None]:
# Create trainer - compatible with TRL 0.12+
try:
    from trl import SFTConfig
    
    sft_config = SFTConfig(
        output_dir="/kaggle/working/vazhi-v3_2",
        num_train_epochs=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        logging_steps=25,
        save_steps=200,
        save_total_limit=2,
        fp16=True,
        bf16=False,  # Explicitly disable bf16 (P100 doesn't support it)
        gradient_checkpointing=True,
        max_grad_norm=1.0,
        optim="paged_adamw_8bit",
        report_to="none",
        dataset_text_field="text",
        max_length=512,  # Changed from max_seq_length (TRL API change)
        packing=False,
    )
    
    trainer = SFTTrainer(
        model=model,
        train_dataset=balanced_ds,
        args=sft_config,
        processing_class=tokenizer,
    )
    print("✅ Trainer initialized (SFTConfig API)")
    
except ImportError:
    # Fall back to old API
    trainer = SFTTrainer(
        model=model,
        train_dataset=balanced_ds,
        args=training_args,
        tokenizer=tokenizer,
        dataset_text_field="text",
        max_seq_length=512,
        packing=False,
    )
    print("✅ Trainer initialized (TrainingArguments API)")
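
The `try`/`except ImportError` above only detects whether `SFTConfig` exists. A renamed keyword, such as the `max_seq_length` to `max_length` change noted in the cell's comment, would raise `TypeError` instead and is not caught. One possible guard is to inspect the constructor signature first. The sketch below uses stand-in classes (hypothetical, not TRL code) to show the idea.

```python
import inspect

def length_kwarg(config_cls, value: int) -> dict:
    """Return the sequence-length keyword that config_cls.__init__ accepts, if any."""
    params = inspect.signature(config_cls.__init__).parameters
    for name in ("max_length", "max_seq_length"):
        if name in params:
            return {name: value}
    return {}  # neither name is accepted: fall back to the library default

# Stand-ins for two generations of a config class (illustrative only)
class NewStyleConfig:
    def __init__(self, output_dir, max_length=1024):
        self.max_length = max_length

class OldStyleConfig:
    def __init__(self, output_dir, max_seq_length=1024):
        self.max_seq_length = max_seq_length

print(length_kwarg(NewStyleConfig, 512))
print(length_kwarg(OldStyleConfig, 512))
```

In the notebook this could be applied as `SFTConfig(..., **length_kwarg(SFTConfig, 512))`, assuming the installed class exposes one of the two names.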

In [None]:
# Train!
print("\n🚀 Starting training...")
trainer.train()
print("\n✅ Training complete!")

In [None]:
# Save and push to HuggingFace
print("\n💾 Saving model...")
trainer.save_model("/kaggle/working/vazhi-v3_2-final")

# Merge LoRA weights
print("\n🔀 Merging LoRA weights...")
merged_model = model.merge_and_unload()

# Ensure repo exists before pushing
api = HfApi()
try:
    api.create_repo(OUTPUT_MODEL, exist_ok=True)
    print(f"✅ Created/verified repo: {OUTPUT_MODEL}")
except Exception as e:
    print(f"⚠️ Repo creation: {e}")

# Push to HuggingFace
print(f"\n📤 Pushing to {OUTPUT_MODEL}...")
merged_model.push_to_hub(OUTPUT_MODEL, private=False)
tokenizer.push_to_hub(OUTPUT_MODEL, private=False)

print(f"\n✅ Model uploaded to: https://huggingface.co/{OUTPUT_MODEL}")

## 10. Test the Model

In [None]:
# Re-enable cache for inference
merged_model.config.use_cache = True

# Test prompts
test_prompts = [
    "வணக்கம்",
    "தமிழ்நாட்டின் தலைநகரம் என்ன?",
    "2+2 என்ன?",
    "பொங்கல் எப்போது கொண்டாடப்படுகிறது?",
    "திருக்குறளின் முதல் அதிகாரம் என்ன?",
]

print("\n🧪 Testing model...\n")
print("   (Using anti-repeat decoding per GPT5.2)")
print("   repetition_penalty=1.3, no_repeat_ngram_size=3, top_p=0.9, temp=0.5\n")

for prompt in test_prompts:
    full_prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    
    inputs = tokenizer(full_prompt, return_tensors="pt").to(merged_model.device)
    
    with torch.no_grad():
        outputs = merged_model.generate(
            **inputs,
            max_new_tokens=100,
            # Anti-repeat decoding defaults (per GPT5.2)
            temperature=0.5,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.3,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    # Extract assistant response
    if "<|im_start|>assistant" in response:
        response = response.split("<|im_start|>assistant")[-1]
        response = response.split("<|im_end|>")[0].strip()
    
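    # Check for garbage output: the v3.1 failure string, or a longer answer built from
    # fewer than 3 distinct words (short valid answers such as "4." must not be flagged)
    words = response.split()
    if "systemsystem" in response.lower() or (len(words) >= 6 and len(set(words)) < 3):
        print(f"Q: {prompt}")
        print(f"A: ❌ GARBAGE/REPEAT DETECTED: {response[:100]}...")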
    else:
        print(f"Q: {prompt}")
        print(f"A: {response}")
    print("-" * 50)
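
Many of the manual training samples have one-word answers, so a repetition check needs to leave short responses alone. The following is a sketch of a standalone detector. `looks_degenerate` and its thresholds (`min_words`, `max_top_share`) are assumptions chosen for illustration, not tuned values.

```python
from collections import Counter

def looks_degenerate(response: str, min_words: int = 6, max_top_share: float = 0.5) -> bool:
    """Flag the v3.1-style failure string or a longer answer dominated by one word."""
    if "systemsystem" in response.lower():
        return True
    words = response.split()
    if len(words) < min_words:
        return False  # short answers such as "4." are valid, not garbage
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) > max_top_share

for text in ["systemsystemsystem", "சென்னை.", "ha ha ha ha ha ha ha"]:
    print(repr(text), looks_degenerate(text))
```

Measuring the share of the most frequent word, rather than counting distinct words, avoids penalizing brief but correct replies.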

## Summary

### Fixes from v3.1:
1. ✅ **Data format consistency** - Only ChatML samples used (raw text excluded)
2. ✅ **Version floors** - transformers>=4.51.0, trl>=0.12.0, etc. (minimums for Qwen3 support, not exact pins)
3. ✅ **Single GPU forced** - CUDA_VISIBLE_DEVICES=0 at top
4. ✅ **fp16 on T4** - Not bf16
5. ✅ **4-bit QLoRA** - More stable on Kaggle

### Expected Results:
- No more "systemsystemsystem..." garbage output
- Model should respond coherently in Tamil
- Thirukkural distribution ~25% (down from 71%)