# VAZHI DAPT v1.0: Tamil Language Adaptation Training

**Pipeline Step 2 of 3:** Train DAPT on pre-built packed Tamil data.

```
Step 1: Data Prep (DONE: Vazhi_DAPT_Data_v1_0.ipynb)
  → Produced: CryptoYogi/vazhi-dapt-tamil-v1_0 (packed 1024-token blocks)

Step 2 (THIS NOTEBOOK): DAPT Training on a Kaggle P100 GPU
  → Input:  Packed dataset from HF + Qwen3-0.6B-Base
  → Output: CryptoYogi/qwen3-0.6b-tamil (reusable Tamil base)
           CryptoYogi/qwen3-0.6b-tamil-lora (adapter backup)

Step 3: SFT (NEXT: Vazhi_SFT_v3_9_OnDAPT.ipynb)
  → Input:  DAPT'd model + ChatML instruction pairs
  → Output: CryptoYogi/vazhi-qwen3-v3_9
```

**Key design (incorporating GPT5.2 review):**
1. Base model, not Instruct (cleaner DAPT)
2. Token-budgeted training (max_steps from token count)
3. Data already packed into 1024-token blocks (no padding waste)
4. LoRA r=16 on the fp16 base, no 4-bit quantization (conservative for 0.6B)
5. Eval: perplexity on held-out blocks + Tamil generation quality
6. Adapter + merged model saved separately for recovery

**Target:** Kaggle P100 (16GB); the run recorded below used a Tesla T4 (15 GB) | Est. 2-4 hours

## 1. Install Dependencies

**After running this cell, RESTART the session** (Runtime → Restart session)

In [1]:
!pip install -q -U \
  "transformers>=4.45.0,<5.0.0" \
  "accelerate>=0.34.2" \
  "peft>=0.12.0" \
  "bitsandbytes>=0.43.3" \
  "datasets>=2.21.0" \
  "huggingface_hub>=0.24.7"

print("\u2705 Dependencies installed")
print("\u26a0\ufe0f  RESTART THE SESSION NOW (Runtime \u2192 Restart session)")

✅ Dependencies installed
⚠️  RESTART THE SESSION NOW (Runtime → Restart session)


## 2. Configuration

In [2]:
# Force single GPU BEFORE importing torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import json
import random
import glob
import gc
import torch
import numpy as np
from dataclasses import dataclass
from datasets import load_dataset
from huggingface_hub import login, HfApi

from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    TrainerCallback, Trainer, TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# === KEY CONFIG ===
BASE_MODEL = "Qwen/Qwen3-0.6B-Base"  # Base model for DAPT (GPT5.2 #1)
DATASET_NAME = "CryptoYogi/vazhi-dapt-tamil-v1_0"  # Pre-built by Data Prep notebook
OUTPUT_MODEL = "CryptoYogi/qwen3-0.6b-tamil"  # Reusable Tamil base
ADAPTER_REPO = "CryptoYogi/qwen3-0.6b-tamil-lora"  # Adapter backup (GPT5.2 #9)

# Training config
MAX_SEQ_LENGTH = 1024        # Must match data prep notebook
LEARNING_RATE = 2e-5         # Low LR for gentle adaptation
LORA_R = 16                  # Conservative rank (GPT5.2 #4)
LORA_ALPHA = 32
BATCH_SIZE = 4               # batch 8 OOMs on T4 (Qwen3's 151K vocab = huge logits tensor)
GRADIENT_ACCUMULATION = 8    # Effective batch = 32
WARMUP_RATIO = 0.05
MAX_STEPS_CAP = 500          # Cap to fit compute budget (~16M tokens)

print(f"\u2705 Configuration loaded")
print(f"   PyTorch: {torch.__version__}")
print(f"   CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"   GPU: {gpu_name} ({gpu_mem:.0f} GB)")
    print(f"   fp16: ENABLED")
print()
print(f"\U0001f4cb DAPT Training v1.0:")
print(f"   Base model:  {BASE_MODEL}")
print(f"   Dataset:     {DATASET_NAME}")
print(f"   Output:      {OUTPUT_MODEL}")
print(f"   LR:          {LEARNING_RATE}")
print(f"   LoRA:        r={LORA_R}, alpha={LORA_ALPHA}")
print(f"   Batch:       {BATCH_SIZE} x {GRADIENT_ACCUMULATION} = {BATCH_SIZE * GRADIENT_ACCUMULATION} effective")
print(f"   Max steps:   {MAX_STEPS_CAP} (~{MAX_STEPS_CAP * BATCH_SIZE * GRADIENT_ACCUMULATION * MAX_SEQ_LENGTH / 1e6:.0f}M tokens)")
print(f"   fp16:        True")
print(f"   Grad ckpt:   True (needed for T4 15GB)")

✅ Configuration loaded
   PyTorch: 2.8.0+cu126
   CUDA: True
   GPU: Tesla T4 (15 GB)
   fp16: ENABLED

📋 DAPT Training v1.0:
   Base model:  Qwen/Qwen3-0.6B-Base
   Dataset:     CryptoYogi/vazhi-dapt-tamil-v1_0
   Output:      CryptoYogi/qwen3-0.6b-tamil
   LR:          2e-05
   LoRA:        r=16, alpha=32
   Batch:       4 x 8 = 32 effective
   Max steps:   500 (~16M tokens)
   fp16:        True
   Grad ckpt:   True (needed for T4 15GB)
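
The `BATCH_SIZE` comment in the configuration attributes the batch-8 OOM to the size of the logits tensor. A back-of-the-envelope sketch of that tensor's footprint follows. `logits_gib` is a hypothetical helper, the vocabulary size is the tokenizer length this notebook reports (the model's output layer may be padded larger), and the fp32 case assumes the loss path upcasts the logits, which is not confirmed here.

```python
def logits_gib(batch_size: int, seq_len: int, vocab_size: int, bytes_per_value: int) -> float:
    """Size in GiB of one [batch, seq, vocab] logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 1024**3

VOCAB = 151_669  # tokenizer length reported by this notebook (assumed to approximate the output dim)

fp16_batch4 = logits_gib(4, 1024, VOCAB, 2)  # the configuration used here
fp32_batch8 = logits_gib(8, 1024, VOCAB, 4)  # assumed worst case: batch 8, logits upcast to fp32

print(f"batch 4, fp16: {fp16_batch4:.2f} GiB")
print(f"batch 8, fp32: {fp32_batch8:.2f} GiB")
```

These figures cover a single copy of the logits only; gradients and intermediate buffers for the loss come on top, so the real peak is likely higher.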


In [3]:
# Login to HuggingFace
from kaggle_secrets import UserSecretsClient
secrets = UserSecretsClient()
hf_token = secrets.get_secret("HF_TOKEN")
login(token=hf_token)
print("\u2705 Logged in to HuggingFace")

✅ Logged in to HuggingFace


## 3. Load Pre-Built Dataset

Dataset was created by `Vazhi_DAPT_Data_v1_0.ipynb`:
- Sangraha verified Tamil, filtered (Tamil >= 40%, dedup, no repetition)
- Packed into 1024-token blocks
- Already split into train/validation
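
The packing itself happens in the data prep notebook and is not shown here. As a rough sketch of what packing into fixed-size blocks usually means: `pack_into_blocks` is a hypothetical helper, and both the EOS separator between documents and the dropping of the trailing partial block are assumptions, not confirmed behaviour of the data prep notebook.

```python
from typing import Iterable, List

def pack_into_blocks(docs: Iterable[List[int]], block_size: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized documents (EOS-separated) and cut the stream into full blocks."""
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)            # assumed document separator
    n_full = len(stream) // block_size   # assumed: trailing partial block is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Toy example with block_size=4 instead of 1024
blocks = pack_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4, eos_id=0)
print(blocks)
```

Every block comes out exactly `block_size` tokens long, which is why the training collator in this notebook needs no padding logic.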

In [4]:
print(f"\U0001f4e5 Loading pre-built dataset from {DATASET_NAME}...")
ds = load_dataset(DATASET_NAME)

train_dataset = ds["train"]
eval_dataset = ds["validation"]

print(f"\u2705 Dataset loaded:")
print(f"   Train:      {len(train_dataset):,} blocks")
print(f"   Validation: {len(eval_dataset):,} blocks")
print(f"   Block size: {len(train_dataset[0]['input_ids'])} tokens")
print(f"   Columns:    {train_dataset.column_names}")

total_train_tokens = len(train_dataset) * MAX_SEQ_LENGTH
print(f"   Total train tokens: {total_train_tokens:,}")

# Verify block size matches our config
assert len(train_dataset[0]["input_ids"]) == MAX_SEQ_LENGTH, \
    f"Block size mismatch: dataset has {len(train_dataset[0]['input_ids'])}, config has {MAX_SEQ_LENGTH}"
print("\u2705 Block size verified")

📥 Loading pre-built dataset from CryptoYogi/vazhi-dapt-tamil-v1_0...

✅ Dataset loaded:
   Train:      31,599 blocks
   Validation: 645 blocks
   Block size: 1024 tokens
   Columns:    ['input_ids', 'attention_mask', 'labels']
   Total train tokens: 32,357,376
✅ Block size verified


## 4. Load Tokenizer

In [5]:
print(f"\U0001f4e5 Loading tokenizer from {BASE_MODEL}...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.padding_side = "right"

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"\u2705 Tokenizer ready: {len(tokenizer)} tokens")
print(f"   eos_token: {tokenizer.eos_token!r} (ID {tokenizer.eos_token_id})")
print(f"   pad_token: {tokenizer.pad_token!r} (ID {tokenizer.pad_token_id})")

# Quick sanity: decode a sample from the dataset
def count_tamil_chars(text):
    return sum(1 for c in text if '\u0B80' <= c <= '\u0BFF')

def tamil_char_pct(text):
    if not text:
        return 0.0
    return 100.0 * count_tamil_chars(text) / len(text)

sample_text = tokenizer.decode(train_dataset[0]["input_ids"][:100])
print(f"\n\U0001f50d Sample from dataset (first 100 tokens):")
print(f"   Tamil%: {tamil_char_pct(sample_text):.0f}%")
print(f"   Text:   {sample_text[:200]}...")

📥 Loading tokenizer from Qwen/Qwen3-0.6B-Base...

✅ Tokenizer ready: 151669 tokens
   eos_token: '<|endoftext|>' (ID 151643)
   pad_token: '<|endoftext|>' (ID 151643)

🔍 Sample from dataset (first 100 tokens):
   Tamil%: 85%
   Text:   �தால் தந்தையைக் காணாத குழந்தை அழ ஆரம்பித்தது. பசியும் வாட்டியது. குழந்தையின் அழுகுரல் திர...


## 5. Load Model + LoRA Setup

**Using Base model** (GPT5.2 #1): DAPT from Base is cleaner.
Instruction-following will be restored in SFT stage.

In [6]:
# No 4-bit quantization needed! Qwen3-0.6B in fp16 = ~1.2GB
# Fits easily on T4 (15GB) or P100 (16GB)
# 4-bit was causing bitsandbytes dequantization overhead that
# bypassed Tensor Cores, the root cause of the 0.03 it/s speed

print(f"\U0001f4e5 Loading {BASE_MODEL} in fp16 (no quantization)...")
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map={"": 0},
    trust_remote_code=True,
)

model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.use_cache = False

# Enable gradient checkpointing for memory safety
model.gradient_checkpointing_enable()

mem_gb = torch.cuda.memory_allocated() / 1024**3
print(f"\u2705 Model loaded in fp16: {model.num_parameters():,} params")
print(f"   GPU memory used: {mem_gb:.1f} GB")
print(f"   No 4-bit = full Tensor Core speed on T4")

📥 Loading Qwen/Qwen3-0.6B-Base in fp16 (no quantization)...

`torch_dtype` is deprecated! Use `dtype` instead!

✅ Model loaded in fp16: 596,049,920 params
   GPU memory used: 1.1 GB
   No 4-bit = full Tensor Core speed on T4


In [7]:
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

mem_gb = torch.cuda.memory_allocated() / 1024**3
print(f"\u2705 LoRA applied | GPU: {mem_gb:.1f} GB")

trainable params: 10,092,544 || all params: 606,142,464 || trainable%: 1.6650
✅ LoRA applied | GPU: 1.1 GB
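
The trainable-parameter count printed above can be reproduced by hand: each adapted linear layer of shape (in, out) gains `r * (in + out)` LoRA weights. The sketch below assumes the Qwen3-0.6B projection shapes and layer count (they are not printed anywhere in this notebook), and `lora_param_count` is a hypothetical helper.

```python
# Assumed Qwen3-0.6B dimensions (not shown in this notebook)
HIDDEN, Q_OUT, KV_OUT, MLP, LAYERS = 1024, 2048, 1024, 3072, 28

# (in_features, out_features) for each LoRA target module in one decoder layer
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(r: int, module_shapes: dict, num_layers: int) -> int:
    """A is (r x in) and B is (out x r), so each module adds r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in module_shapes.values())
    return per_layer * num_layers

total = lora_param_count(16, shapes, LAYERS)
print(f"{total:,} trainable LoRA parameters")
```

Under these assumed shapes the result agrees with the `trainable params` figure reported by `print_trainable_parameters()`, and doubling `r` to 32 would double the count.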


## 6. Compute Training Steps

**GPT5.2 #3:** Control by token budget / max_steps, not arbitrary epoch count.
Steps are capped at `MAX_STEPS_CAP` and never exceed one epoch, to fit the compute budget and limit catastrophic forgetting.

In [8]:
tokens_per_step = BATCH_SIZE * MAX_SEQ_LENGTH * GRADIENT_ACCUMULATION
steps_per_epoch = len(train_dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION)

# Cap steps to fit compute budget
max_steps = min(steps_per_epoch, MAX_STEPS_CAP)
total_tokens_trained = max_steps * tokens_per_step

# Save/log intervals
save_steps = max(max_steps // 4, 50)
log_steps = max(max_steps // 40, 10)
eval_steps = max(max_steps // 8, 25)

print(f"\U0001f4ca Training Plan:")
print(f"   Dataset tokens:      {len(train_dataset) * MAX_SEQ_LENGTH:,}")
print(f"   Tokens/step:         {tokens_per_step:,}")
print(f"   Steps/epoch:         {steps_per_epoch:,}")
print(f"   Max steps (capped):  {max_steps:,}")
print(f"   Tokens to train on: {total_tokens_trained:,}")
print(f"   Coverage:            {100 * max_steps / steps_per_epoch:.0f}% of dataset")
print(f"   Save every:          {save_steps} steps")
print(f"   Log every:           {log_steps} steps")
print(f"   Eval every:          {eval_steps} steps")

📊 Training Plan:
   Dataset tokens:      32,357,376
   Tokens/step:         32,768
   Steps/epoch:         987
   Max steps (capped):  500
   Tokens to train on: 16,384,000
   Coverage:            51% of dataset
   Save every:          125 steps
   Log every:           12 steps
   Eval every:          62 steps


## 7. Train

In [9]:
# === LOSS LOGGING ===
class LossLoggingCallback(TrainerCallback):
    def __init__(self):
        self.losses = []
        self.eval_losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            if "loss" in logs:
                step = state.global_step
                loss = logs["loss"]
                lr = logs.get("learning_rate", 0)
                self.losses.append((step, loss))
                print(f"  Step {step:4d}/{max_steps} | Loss: {loss:.4f} | LR: {lr:.2e}")
            if "eval_loss" in logs:
                eval_loss = logs["eval_loss"]
                ppl = np.exp(min(eval_loss, 20))
                self.eval_losses.append((state.global_step, eval_loss))
                print(f"  \U0001f4ca Eval Loss: {eval_loss:.4f} | Perplexity: {ppl:.1f}")

loss_callback = LossLoggingCallback()

# === DATA COLLATOR ===
@dataclass
class PackedDataCollator:
    """Collator for pre-packed, pre-tokenized sequences."""
    def __call__(self, features):
        return {
            "input_ids": torch.tensor([f["input_ids"] for f in features], dtype=torch.long),
            "attention_mask": torch.tensor([f["attention_mask"] for f in features], dtype=torch.long),
            "labels": torch.tensor([f["labels"] for f in features], dtype=torch.long),
        }

# === TRAINER ===
training_args = TrainingArguments(
    output_dir="/kaggle/working/vazhi-dapt-v1_0",
    max_steps=max_steps,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    warmup_ratio=WARMUP_RATIO,
    logging_steps=log_steps,
    save_steps=save_steps,
    eval_steps=eval_steps,
    eval_strategy="steps",
    save_total_limit=3,
    fp16=True,                    # T4 Tensor Cores for real fp16 speedup
    bf16=False,
    gradient_checkpointing=True,  # Needed for memory safety
    gradient_checkpointing_kwargs={"use_reentrant": False},
    max_grad_norm=1.0,
    optim="adamw_torch",          # Standard AdamW (no bitsandbytes needed for fp16 model)
    report_to="none",
    seed=RANDOM_SEED,
    load_best_model_at_end=False,
    dataloader_pin_memory=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=PackedDataCollator(),
    callbacks=[loss_callback],
)

print("\u2705 Trainer ready")
print(f"   Steps: {max_steps} | LR: {LEARNING_RATE} | Effective BS: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"   fp16: True | grad_ckpt: True | optimizer: AdamW (torch)")

The model is already on multiple devices. Skipping the move to device specified in `args`.


✅ Trainer ready
   Steps: 500 | LR: 2e-05 | Effective BS: 32
   fp16: True | grad_ckpt: True | optimizer: AdamW (torch)


In [10]:
print("\U0001f680 Starting DAPT training...")
print(f"   {max_steps} steps, fp16=True, gradient checkpointing enabled")
print(f"   Tokens: ~{max_steps * BATCH_SIZE * GRADIENT_ACCUMULATION * MAX_SEQ_LENGTH / 1e6:.0f}M")
print()

train_result = trainer.train()

print("\n\u2705 Training complete!")
metrics = train_result.metrics
for k, v in metrics.items():
    print(f"   {k}: {v}")

# Final eval
print("\n\U0001f4ca Final eval on held-out blocks...")
eval_metrics = trainer.evaluate()
eval_loss = eval_metrics.get("eval_loss", float("inf"))
eval_ppl = np.exp(min(eval_loss, 20))
print(f"   Eval Loss:       {eval_loss:.4f}")
print(f"   Eval Perplexity: {eval_ppl:.1f}")

# Loss summary
if loss_callback.losses:
    start_loss = loss_callback.losses[0][1]
    end_loss = loss_callback.losses[-1][1]
    print(f"\n\U0001f4c8 Loss: {start_loss:.4f} \u2192 {end_loss:.4f} ({100*(start_loss - end_loss)/start_loss:.1f}% drop)")

🚀 Starting DAPT training...
   500 steps, fp16=True, no gradient checkpointing
   Tokens: ~16M



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 62 | 1.0596 | 1.044861 |
| 124 | 1.0442 | 1.033765 |
| 186 | 1.0428 | 1.02566 |
| 248 | 1.0424 | 1.019685 |
| 310 | 1.0273 | 1.015504 |
| 372 | 1.0273 | 1.013053 |


  Step   12/500 | Loss: 1.0799 | LR: 8.80e-06
  Step   24/500 | Loss: 1.0665 | LR: 1.84e-05
  Step   36/500 | Loss: 1.0844 | LR: 2.00e-05
  Step   48/500 | Loss: 1.0698 | LR: 1.99e-05
  Step   60/500 | Loss: 1.0596 | LR: 1.97e-05
  📊 Eval Loss: 1.0449 | Perplexity: 2.8
  Step   72/500 | Loss: 1.0514 | LR: 1.95e-05
  Step   84/500 | Loss: 1.0427 | LR: 1.93e-05
  Step   96/500 | Loss: 1.0523 | LR: 1.89e-05
  Step  108/500 | Loss: 1.0446 | LR: 1.86e-05
  Step  120/500 | Loss: 1.0442 | LR: 1.81e-05
  📊 Eval Loss: 1.0338 | Perplexity: 2.8
  Step  132/500 | Loss: 1.0571 | LR: 1.76e-05
  Step  144/500 | Loss: 1.0553 | LR: 1.71e-05
  Step  156/500 | Loss: 1.0660 | LR: 1.65e-05
  Step  168/500 | Loss: 1.0549 | LR: 1.59e-05
  Step  180/500 | Loss: 1.0428 | LR: 1.52e-05
  📊 Eval Loss: 1.0257 | Perplexity: 2.8
  Step  192/500 | Loss: 1.0567 | LR: 1.46e-05
  Step  204/500 | Loss: 1.0326 | LR: 1.38e-05
  Step  216/500 | Loss: 1.0640 | LR: 1.31e-05
  Step  228/500 | Loss: 1.0412 | LR: 1.23e-

KeyboardInterrupt: 

## 8. Save & Upload LoRA Adapter

In [11]:
ADAPTER_PATH = "/kaggle/working/vazhi-dapt-v1_0-lora"

print("\U0001f4be Saving LoRA adapter...")
trainer.save_model(ADAPTER_PATH)
tokenizer.save_pretrained(ADAPTER_PATH)

adapter_files = glob.glob(f"{ADAPTER_PATH}/*")
print(f"   Files: {[os.path.basename(f) for f in adapter_files]}")
assert any('adapter' in f for f in adapter_files), "No adapter files!"
print("\u2705 Adapter saved")

# Upload adapter backup (GPT5.2 #9)
api = HfApi()
api.create_repo(ADAPTER_REPO, exist_ok=True)
print(f"\U0001f4e4 Uploading adapter to {ADAPTER_REPO}...")
api.upload_folder(
    folder_path=ADAPTER_PATH,
    repo_id=ADAPTER_REPO,
    commit_message=f"DAPT v1.0 adapter: Sangraha Tamil, r={LORA_R}, lr={LEARNING_RATE}",
)
print(f"\u2705 Adapter uploaded: https://huggingface.co/{ADAPTER_REPO}")

💾 Saving LoRA adapter...
   Files: ['tokenizer_config.json', 'merges.txt', 'chat_template.jinja', 'tokenizer.json', 'adapter_config.json', 'vocab.json', 'adapter_model.safetensors', 'added_tokens.json', 'README.md', 'training_args.bin', 'special_tokens_map.json']
✅ Adapter saved
📤 Uploading adapter to CryptoYogi/qwen3-0.6b-tamil-lora...

✅ Adapter uploaded: https://huggingface.co/CryptoYogi/qwen3-0.6b-tamil-lora


In [12]:
# No need to free and reload: the model is already in fp16.
# (We removed 4-bit quantization, so merge can happen directly)
print("\u2705 Model already in fp16: no reload needed for merge")

✅ Model already in fp16: no reload needed for merge


## 9. Merge LoRA in FP16

**Hard rule (Lesson #39):** NEVER merge into 4-bit. Reload base in fp16.
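
For intuition about what the merge step computes: LoRA stores two small matrices per target layer, and merging folds their scaled product into the base weight, `W' = W + (alpha / r) * B @ A`. The toy sketch below does this with nested lists. `matmul` and `merge_lora` are hypothetical helpers for illustration only, not PEFT's implementation, and the 2x2 values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A); B is (out x r), A is (r x in)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # toy base weight (out=2, in=2)
A = [[0.5, -0.5]]              # toy LoRA A with r=1
B = [[0.2], [0.4]]             # toy LoRA B with r=1

# alpha / r = 2.0, the same ratio as LORA_ALPHA=32 over LORA_R=16
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

The update is a small additive delta on every weight, so it is plausible that adding it to coarsely quantized weights would lose part of it, which is consistent with the rule above.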

In [13]:
# Load a clean fp16 copy of the base model, attach the saved adapter, then merge
print(f"\U0001f517 Loading LoRA adapter for merge...")
base_model_fp16 = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map={"": 0},
    trust_remote_code=True,
)

peft_model = PeftModel.from_pretrained(base_model_fp16, ADAPTER_PATH)
peft_model.gradient_checkpointing_disable()
peft_model.config.use_cache = True
peft_model.eval()

print("\U0001f500 Merging LoRA in fp16...")
merged_model = peft_model.merge_and_unload()
print(f"\u2705 Merged: {merged_model.num_parameters():,} params")

🔗 Loading LoRA adapter for merge...
🔀 Merging LoRA in fp16...
✅ Merged: 596,049,920 params


## 10. DAPT Evaluation

**GPT5.2 #7:** Proper eval, not just a quick test.

This is a Base model after DAPT, so it won't follow instructions.
It should generate coherent Tamil text continuations.

In [14]:
merged_model.eval()
merged_model.config.use_cache = True

eval_prompts = [
    ("prose", "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd\u0ba8\u0bbe\u0b9f\u0bc1 \u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe\u0bb5\u0bbf\u0ba9\u0bcd \u0ba4\u0bc6\u0ba9\u0bcd \u0baa\u0b95\u0bc1\u0ba4\u0bbf\u0baf\u0bbf\u0bb2\u0bcd \u0b85\u0bae\u0bc8\u0ba8\u0bcd\u0ba4\u0bc1\u0bb3\u0bcd\u0bb3 \u0b92\u0bb0\u0bc1 \u0bae\u0bbe\u0ba8\u0bbf\u0bb2\u0bae\u0bcd."),
    ("prose", "\u0baa\u0bca\u0b99\u0bcd\u0b95\u0bb2\u0bcd \u0ba4\u0bae\u0bbf\u0bb4\u0bb0\u0bcd\u0b95\u0bb3\u0bbf\u0ba9\u0bcd \u0bae\u0bc1\u0b95\u0bcd\u0b95\u0bbf\u0baf \u0ba4\u0bbf\u0bb0\u0bc1\u0ba8\u0bbe\u0bb3\u0bcd."),
    ("literature", "\u0bb5\u0bb3\u0bcd\u0bb3\u0bc1\u0bb5\u0bb0\u0bcd \u0b95\u0bc2\u0bb1\u0bbf\u0baf \u0b85\u0bb1\u0bae\u0bcd, \u0baa\u0bca\u0bb0\u0bc1\u0bb3\u0bcd, \u0b87\u0ba9\u0bcd\u0baa\u0bae\u0bcd \u0b8e\u0ba9\u0bcd\u0bb1 \u0bae\u0bc2\u0ba9\u0bcd\u0bb1\u0bc1"),
    ("knowledge", "\u0b9a\u0bbf\u0ba4\u0bcd\u0ba4 \u0bae\u0bb0\u0bc1\u0ba4\u0bcd\u0ba4\u0bc1\u0bb5\u0bae\u0bcd \u0b8e\u0ba9\u0bcd\u0baa\u0ba4\u0bc1 \u0ba4\u0bae\u0bbf\u0bb4\u0bcd \u0bae\u0b95\u0bcd\u0b95\u0bb3\u0bbf\u0ba9\u0bcd \u0baa\u0bbe\u0bb0\u0bae\u0bcd\u0baa\u0bb0\u0bbf\u0baf"),
    ("daily", "\u0b95\u0bbe\u0bb2\u0bc8\u0baf\u0bbf\u0bb2\u0bcd \u0b8e\u0bb4\u0bc1\u0ba8\u0bcd\u0ba4\u0ba4\u0bc1\u0bae\u0bcd \u0bae\u0bc1\u0ba4\u0bb2\u0bbf\u0bb2\u0bcd"),
    ("short", "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"),
    ("short", "\u0ba8\u0ba9\u0bcd\u0bb1\u0bbf"),
    ("mixed", "India has many languages. \u0ba4\u0bae\u0bbf\u0bb4\u0bcd is one of the"),
]

print(f"\n{'='*60}")
print(f"\U0001f9ea DAPT EVAL: {len(eval_prompts)} Tamil text continuations")
print(f"   (Base model \u2014 expect text continuation, not chat)")
print(f"{'='*60}")

eval_results = []

for category, prompt_text in eval_prompts:
    inputs = tokenizer(prompt_text, return_tensors="pt").to(merged_model.device)

    with torch.no_grad():
        outputs = merged_model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2,
            no_repeat_ngram_size=4,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True)

    t_pct = tamil_char_pct(response)
    words = response.split()
    unique_ratio = len(set(words)) / max(len(words), 1)
    is_repetitive = unique_ratio < 0.3 and len(words) > 10
    is_empty = len(response.strip()) < 10
    is_code = any(kw in response[:100] for kw in ['def ', 'class ', 'import ', '{"', 'var '])

    status = "\u2705"
    if is_empty: status = "\u274c EMPTY"
    elif is_code: status = "\u274c CODE"
    elif is_repetitive: status = "\u26a0\ufe0f LOOP"
    elif t_pct < 20 and category != "mixed": status = "\u26a0\ufe0f LOW TAMIL"

    eval_results.append((category, prompt_text, response[:200], status, t_pct, unique_ratio))

    print(f"\n[{category.upper()}] {status} (Tamil: {t_pct:.0f}%, Unique: {unique_ratio:.0%})")
    print(f"  Prompt: {prompt_text[:60]}")
    print(f"  Output: {response[:300]}")
    print("-" * 50)

# Summary
print(f"\n{'='*60}")
print(f"\U0001f4ca DAPT EVAL SUMMARY")
print(f"{'='*60}")
pass_count = sum(1 for r in eval_results if r[3] == "\u2705")
avg_tamil = np.mean([r[4] for r in eval_results])
avg_unique = np.mean([r[5] for r in eval_results])
print(f"   Passed:      {pass_count}/{len(eval_results)}")
print(f"   Avg Tamil%:  {avg_tamil:.0f}%")
print(f"   Avg Unique:  {avg_unique:.0%}")
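# eval_ppl only exists if the training cell ran to completion
if "eval_ppl" in globals():
    print(f"   Eval PPL:    {eval_ppl:.1f}")
else:
    print("   Eval PPL:    n/a (training cell did not finish)")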

for cat, prompt, resp, status, tamil, uniq in eval_results:
    print(f"   {status} [{cat}] Tamil:{tamil:.0f}% Uniq:{uniq:.0%}")

if pass_count >= len(eval_results) * 0.7 and avg_tamil > 30:
    print(f"\n\U0001f389 DAPT successful! Proceed to SFT.")
elif pass_count >= len(eval_results) * 0.4:
    print(f"\n\u26a0\ufe0f  Partial. Try more tokens or check loss curve.")
else:
    print(f"\n\u274c DAPT failed. Check loss curve, data quality, try r=32.")


🧪 DAPT EVAL: 8 Tamil text continuations
   (Base model - expect text continuation, not chat)

[PROSE] ✅ (Tamil: 68%, Unique: 100%)
  Prompt: தமிழ்நாடு இந்தியாவின் தென் பகுதியில் அமைந்துள்ள ஒரு மாநிலம்.
  Output:  1952-ல், சபிடி கணக்ஷங்கசிராம வழிகாட்சி (Census of India) என்ற உறுப்பினர் ஜோஸ்ஸூடாய்(Jossuatai), அங்஗்ரீச் பொருளாதார ஓருட்டு பக்ணம் ஏற்படுத்தின.
இந்நிலையில
--------------------------------------------------

[PROSE] ✅ (Tamil: 69%, Unique: 94%)
  Prompt: பொங்கல் தமிழர்களின் முக்கிய திருநாள்.
  Output:  . .
இனி, இதற்படிச்செய்வோம்!
அவசர ஒழுங

NameError: name 'eval_ppl' is not defined

## 11. Upload Merged Model

In [15]:
api = HfApi()
api.create_repo(OUTPUT_MODEL, exist_ok=True)

print(f"\U0001f4e4 Pushing merged fp16 model to {OUTPUT_MODEL}...")
merged_model.push_to_hub(
    OUTPUT_MODEL,
    private=False,
    commit_message=f"DAPT v1.0: Tamil-adapted Qwen3-0.6B (Sangraha, LoRA r={LORA_R})",
)
tokenizer.push_to_hub(OUTPUT_MODEL)

print(f"\n\u2705 Model: https://huggingface.co/{OUTPUT_MODEL}")
print(f"\u2705 Adapter: https://huggingface.co/{ADAPTER_REPO}")
print(f"\n\U0001f449 Next: Run SFT notebook with BASE_MODEL = \"{OUTPUT_MODEL}\"")

📤 Pushing merged fp16 model to CryptoYogi/qwen3-0.6b-tamil...

✅ Model: https://huggingface.co/CryptoYogi/qwen3-0.6b-tamil
✅ Adapter: https://huggingface.co/CryptoYogi/qwen3-0.6b-tamil-lora

👉 Next: Run SFT notebook with BASE_MODEL = "CryptoYogi/qwen3-0.6b-tamil"


## Summary

| Artifact | Repo | Purpose |
|----------|------|---------|
| Packed DAPT data | `CryptoYogi/vazhi-dapt-tamil-v1_0` | Reusable training data |
| Merged fp16 model | `CryptoYogi/qwen3-0.6b-tamil` | Reusable Tamil base for SFT |
| LoRA adapter | `CryptoYogi/qwen3-0.6b-tamil-lora` | Recovery backup |

### Next: SFT (Stage 3)
```python
BASE_MODEL = "CryptoYogi/qwen3-0.6b-tamil"  # THIS model
DATASET = "CryptoYogi/vazhi-tamil-sft-v4_0"  # or combined v3.6 + v4.0
```

### If DAPT failed
1. Loss didn't decrease → data may be too noisy, check filters
2. Tamil% low → increase TARGET_TOKENS in data prep, re-run
3. Repetitive output → try r=32 in this notebook (just change LORA_R)
4. All else fails → try Instruct model with very low LR (1e-5)
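
For the first check in the list above, one quick way to judge whether the loss decreased is to look at the validation losses rather than the noisier training loss. The sketch below uses the (step, validation loss) pairs from the training log in section 7; `perplexity` and `is_strictly_decreasing` are hypothetical helpers, and what counts as "enough" of a drop is left open.

```python
import math

# (step, validation loss) pairs copied from the training log in section 7
eval_log = [
    (62, 1.044861), (124, 1.033765), (186, 1.02566),
    (248, 1.019685), (310, 1.015504), (372, 1.013053),
]

def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean token-level cross-entropy."""
    return math.exp(loss)

def is_strictly_decreasing(values) -> bool:
    return all(later < earlier for earlier, later in zip(values, values[1:]))

losses = [loss for _, loss in eval_log]
first, last = losses[0], losses[-1]
rel_drop = (first - last) / first

print(f"Decreasing at every eval: {is_strictly_decreasing(losses)}")
print(f"Relative drop:            {100 * rel_drop:.1f}%")
print(f"Perplexity:               {perplexity(first):.2f} -> {perplexity(last):.2f}")
```

Printing perplexity to two decimals makes the trend visible; at the one-decimal precision used in the training log, every evaluation rounds to 2.8.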