# VAZHI DAPT Data Prep v1.0 — Tamil Corpus for DAPT

**Pipeline Step 1 of 3:** Prepare filtered, packed Tamil training data.

```
Step 1 (THIS NOTEBOOK): Data Prep — CPU only, no GPU needed
  → Input:  AI4Bharat Sangraha verified Tamil corpus (streaming)
  → Output: CryptoYogi/vazhi-dapt-tamil-v1_0 (packed 1024-token blocks on HF)

Step 2: DAPT Training — Kaggle P100 GPU
  → Input:  Packed dataset from Step 1
  → Output: CryptoYogi/qwen3-0.6b-tamil (reusable Tamil base model)

Step 3: SFT — Kaggle P100 GPU
  → Input:  DAPT'd model + ChatML instruction pairs
  → Output: CryptoYogi/vazhi-qwen3-v3_9
```

**Why separate data prep?**
- No GPU needed — runs on local machine or Kaggle CPU
- If DAPT training fails (OOM, disconnect, wrong LR), data is already on HF
- Can experiment with different token budgets without re-downloading

**What this notebook does:**
1. Loads Qwen3-0.6B-Base tokenizer (CPU only)
2. Measures actual tokens/doc on 200 Sangraha samples (GPT5.2 #2)
3. Streams & filters Sangraha verified Tamil (GPT5.2 #6)
4. Packs into 1024-token blocks for causal LM training (GPT5.2 #5)
5. Uploads packed dataset to HuggingFace

**Runtime:** ~30-60 min on CPU (streaming + tokenization)

## 1. Install Dependencies

In [None]:
# Only need transformers (tokenizer) + datasets + huggingface_hub
# No torch/peft/bitsandbytes/trl needed — this is CPU-only
# Pin transformers<5.0 — transformers 5.x pulls scipy/sklearn deps
# that conflict with Colab's pre-installed numpy/scipy versions
!pip install -q -U \
  "transformers>=4.45.0,<5.0.0" \
  "datasets>=2.21.0" \
  "huggingface_hub>=0.24.7"

print("\u2705 Dependencies installed (CPU-only — no GPU packages needed)")

## 2. Configuration

In [None]:
import os
import json
import random
import hashlib
import numpy as np
from collections import Counter
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer
from huggingface_hub import login, HfApi

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# === TOKENIZER SOURCE ===
# Must match the model that will be trained in the DAPT notebook
BASE_MODEL = "Qwen/Qwen3-0.6B-Base"

# === OUTPUT ===
OUTPUT_DATASET = "CryptoYogi/vazhi-dapt-tamil-v1_0"

# === CORPUS SOURCE ===
SANGRAHA_CONFIG = "verified"   # Cleanest Tamil data
SANGRAHA_SPLIT = "tam"         # Tamil split

# === TOKEN BUDGET ===
# GPT5.2 #3: Control by token count, not epochs
TARGET_TOKENS = 30_000_000     # 30M tokens — sweet spot for 0.6B model
MAX_SEQ_LENGTH = 1024          # Pack into 1024-token blocks

# === DATA QUALITY FILTERS (GPT5.2 #6) ===
MIN_TAMIL_PCT = 50             # Minimum Tamil character percentage (verified corpus min is 51%)
MIN_DOC_CHARS = 200            # Drop very short docs
MAX_DOC_CHARS = 8000           # Drop very long docs (boilerplate risk)
MAX_REPETITION_RATIO = 0.5     # Drop docs where one line makes up >50% of non-empty lines

# === EVAL SPLIT ===
EVAL_PCT = 0.02                # 2% held out for perplexity eval during training

print("\U0001f4cb DAPT Data Prep v1.0 Config:")
print(f"   Tokenizer from:  {BASE_MODEL}")
print(f"   Output dataset:  {OUTPUT_DATASET}")
print(f"   Token budget:    {TARGET_TOKENS:,}")
print(f"   Block size:      {MAX_SEQ_LENGTH} tokens")
print(f"   Source:          Sangraha {SANGRAHA_CONFIG}/{SANGRAHA_SPLIT}")
print(f"   Filters:         Tamil>={MIN_TAMIL_PCT}%, chars {MIN_DOC_CHARS}-{MAX_DOC_CHARS}, dedup, no repetition")
print(f"   Eval holdout:    {EVAL_PCT:.0%}")

In [None]:
# Login to HuggingFace
# On Kaggle:
# from kaggle_secrets import UserSecretsClient
# secrets = UserSecretsClient()
# hf_token = secrets.get_secret("HF_TOKEN")

# On local machine:
# hf_token = os.environ.get("HF_TOKEN") or input("Enter HF token: ")

# Uncomment the appropriate method above, then:
# login(token=hf_token)

# Or if you've already run `huggingface-cli login`:
login()
print("\u2705 Logged in to HuggingFace")

## 3. Load Tokenizer & Measure Token Counts

**GPT5.2 #2:** Don't estimate tokens from chars — measure with the actual tokenizer.
Tamil tokenization varies widely by content type.
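
As a minimal sketch of how measured token counts turn into a document estimate: the counts in `sample_token_counts` are invented for illustration, and `estimate_docs_needed` is a hypothetical helper rather than part of this notebook. It rounds up, whereas the measurement cell below truncates with `int(...)`.

```python
import math
import statistics

def estimate_docs_needed(token_counts, target_tokens):
    """Estimate how many docs are needed to reach target_tokens,
    based on the mean of measured per-doc token counts."""
    if not token_counts:
        raise ValueError("need at least one measured document")
    mean_tokens = statistics.mean(token_counts)
    return math.ceil(target_tokens / mean_tokens)

# Invented token counts for five sampled docs (illustration only).
sample_token_counts = [400, 600, 500, 1500, 2000]

mean_tokens = statistics.mean(sample_token_counts)
median_tokens = statistics.median(sample_token_counts)
docs_needed = estimate_docs_needed(sample_token_counts, 30_000_000)

print(f"mean={mean_tokens}, median={median_tokens}, docs_needed={docs_needed:,}")
```

With a skewed length distribution like this one, the mean sits well above the median. The mean is the right basis for a total-token estimate, while the median better describes a typical document.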

In [None]:
print(f"\U0001f4e5 Loading tokenizer from {BASE_MODEL} (CPU only)...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

print(f"\u2705 Tokenizer ready: {len(tokenizer)} tokens")
print(f"   eos_token: {tokenizer.eos_token!r} (ID {tokenizer.eos_token_id})")
print(f"   pad_token: {tokenizer.pad_token!r} (ID {tokenizer.pad_token_id})")

# If pad_token is None, set to eos_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    print(f"   \u26a0\ufe0f  Set pad_token = eos_token")

In [None]:
# === HELPER FUNCTIONS ===

def count_tamil_chars(text):
    """Count Tamil Unicode characters (U+0B80 to U+0BFF)."""
    return sum(1 for c in text if '\u0B80' <= c <= '\u0BFF')

def tamil_char_pct(text):
    """Tamil character percentage of total text length."""
    if not text:
        return 0.0
    return 100.0 * count_tamil_chars(text) / len(text)

def has_excessive_repetition(text, threshold=MAX_REPETITION_RATIO):
    """Check whether the single most frequent line exceeds `threshold` of all non-empty lines (boilerplate/headers)."""
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if len(lines) < 3:
        return False
    line_counts = Counter(lines)
    most_common_count = line_counts.most_common(1)[0][1]
    return most_common_count / len(lines) > threshold

def text_hash(text):
    """Fast MD5 hash for dedup."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

print("\u2705 Helper functions defined")
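
A quick way to see how the two content filters behave is to run them on tiny inputs. The sketch below redefines compact versions of `tamil_char_pct` and `has_excessive_repetition` so it runs on its own; the test strings are invented for illustration.

```python
from collections import Counter

def tamil_char_pct(text):
    """Share of characters in the Tamil Unicode block (U+0B80 to U+0BFF), in percent."""
    if not text:
        return 0.0
    tamil = sum(1 for c in text if "\u0B80" <= c <= "\u0BFF")
    return 100.0 * tamil / len(text)

def has_excessive_repetition(text, threshold=0.5):
    """True if the most frequent non-empty line exceeds `threshold` of all non-empty lines."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if len(lines) < 3:
        return False
    top_count = Counter(lines).most_common(1)[0][1]
    return top_count / len(lines) > threshold

# The word "Tamil" written in Tamil script: 5 code points, all in the Tamil block.
tamil_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
mixed = tamil_word + " abc"   # 5 Tamil code points out of 9 characters

pct_pure = tamil_char_pct(tamil_word)
pct_mixed = tamil_char_pct(mixed)
boilerplate = has_excessive_repetition("menu\nmenu\nmenu\nbody")  # one line is 3 of 4
varied = has_excessive_repetition("a\nb\nc\nd")
too_short = has_excessive_repetition("x\nx")  # fewer than 3 lines is never flagged

print(pct_pure, round(pct_mixed, 1), boilerplate, varied, too_short)
```

Note that the denominator is the full text length, so spaces, digits and punctuation all lower the Tamil percentage. The repetition check looks only at the single most frequent line, not at the combined share of all repeated lines.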

In [None]:
# === MEASURE ACTUAL TOKENIZATION ON 200 SAMPLE DOCS ===
# GPT5.2 #2: "Sample 200 docs, run tokenizer, decide N docs from actual tokens"

print(f"\U0001f50d Sampling 200 docs from Sangraha {SANGRAHA_CONFIG}/{SANGRAHA_SPLIT}...")
ds_stream = load_dataset(
    "ai4bharat/sangraha", SANGRAHA_CONFIG, split=SANGRAHA_SPLIT, streaming=True
)

# Note: this takes the first 200 streamed docs, not a random sample
sample_docs = []
for item in ds_stream:
    sample_docs.append(item["text"])
    if len(sample_docs) >= 200:
        break

# Tokenize each and measure
doc_tokens = []
doc_tamil_pcts = []
doc_lengths = []

for doc in sample_docs:
    tokens = tokenizer.encode(doc, add_special_tokens=False)
    doc_tokens.append(len(tokens))
    doc_tamil_pcts.append(tamil_char_pct(doc))
    doc_lengths.append(len(doc))

avg_tokens = np.mean(doc_tokens)
median_tokens = np.median(doc_tokens)
avg_tamil = np.mean(doc_tamil_pcts)
avg_chars = np.mean(doc_lengths)
tokens_per_char = np.mean([t / max(c, 1) for t, c in zip(doc_tokens, doc_lengths)])

print(f"\n\U0001f4ca Tokenization Analysis (200 docs):")
print(f"   Avg tokens/doc:  {avg_tokens:.0f}")
print(f"   Median tokens:   {median_tokens:.0f}")
print(f"   Min/Max tokens:  {min(doc_tokens)} / {max(doc_tokens)}")
print(f"   Avg chars/doc:   {avg_chars:.0f}")
print(f"   Tokens/char:     {tokens_per_char:.2f}")
print(f"   Avg Tamil %:     {avg_tamil:.1f}%")
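# isascii() also counts spaces, digits and punctuation, so this is an ASCII share, not a Latin-letter share
print(f"   Avg ASCII %:     {np.mean([100 * sum(c.isascii() for c in d) / max(len(d), 1) for d in sample_docs]):.1f}%")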

# Estimate docs needed
docs_needed_estimate = int(TARGET_TOKENS / avg_tokens)
print(f"\n\U0001f3af For {TARGET_TOKENS:,} token budget:")
print(f"   Estimated docs needed (pre-filter): ~{docs_needed_estimate:,}")
print(f"   (Actual count depends on filtering — next cell)")

## 4. Stream, Filter & Collect Sangraha Tamil Docs

**GPT5.2 #6:** Even "verified" corpora need filtering:
- Tamil character ratio >= 50% (`MIN_TAMIL_PCT`)
- Length bounds: 200-8000 chars
- Drop docs with extreme line repetition
- Hash-based exact dedup

Streams until token budget is met (with 10% buffer for EOS overhead).
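
The control flow of the cell below can be sketched on an in-memory iterator. This is a simplified illustration, not the notebook's code: `collect_until_budget` and `whitespace_tokens` are hypothetical names, whitespace splitting stands in for the real tokenizer, and only the length and dedup filters are shown.

```python
import hashlib

def collect_until_budget(docs, count_tokens, budget, min_chars=1):
    """Keep unique, long-enough docs from an iterable until `budget` tokens are collected."""
    kept, seen, total = [], set(), 0
    for text in docs:
        text = text.strip()
        if len(text) < min_chars:
            continue  # length filter
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        kept.append(text)
        total += count_tokens(text)
        if total >= budget:
            break  # stop pulling from the stream once the budget is met
    return kept, total

def whitespace_tokens(text):
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())

stream = iter(["a b c", "a b c", "", "d e", "f g h i", "never reached"])
kept, total = collect_until_budget(stream, whitespace_tokens, budget=8)
remaining = list(stream)  # items the loop never pulled from the iterator

print(kept, total, remaining)
```

Because the budget check runs after a document is added, the total can overshoot the budget by up to one document. Breaking out of the loop leaves the rest of the stream unread, which is what keeps a streaming run from downloading the whole corpus.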

In [None]:
print(f"\U0001f4e5 Streaming Sangraha {SANGRAHA_CONFIG}/{SANGRAHA_SPLIT} with filters...")
print(f"   Tamil >= {MIN_TAMIL_PCT}% | Chars {MIN_DOC_CHARS}-{MAX_DOC_CHARS} | Dedup | No repetition")
print(f"   Token budget: {TARGET_TOKENS:,} (+ 10% buffer)")
print()

ds_stream = load_dataset(
    "ai4bharat/sangraha", SANGRAHA_CONFIG, split=SANGRAHA_SPLIT, streaming=True
)

clean_texts = []
seen_hashes = set()
total_tokens = 0
stats = {
    "total_seen": 0,
    "dropped_short": 0,
    "dropped_long": 0,
    "dropped_tamil": 0,
    "dropped_repetition": 0,
    "dropped_dedup": 0,
    "kept": 0,
}

BUFFER_FACTOR = 1.1
effective_budget = int(TARGET_TOKENS * BUFFER_FACTOR)

for item in ds_stream:
    stats["total_seen"] += 1
    text = (item.get("text") or "").strip()  # tolerate a missing or None text field

    # Length filter
    if len(text) < MIN_DOC_CHARS:
        stats["dropped_short"] += 1
        continue
    if len(text) > MAX_DOC_CHARS:
        stats["dropped_long"] += 1
        continue

    # Tamil character ratio
    t_pct = tamil_char_pct(text)
    if t_pct < MIN_TAMIL_PCT:
        stats["dropped_tamil"] += 1
        continue

    # Repetition filter
    if has_excessive_repetition(text):
        stats["dropped_repetition"] += 1
        continue

    # Exact dedup
    h = text_hash(text)
    if h in seen_hashes:
        stats["dropped_dedup"] += 1
        continue
    seen_hashes.add(h)

    # Count tokens
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))

    clean_texts.append(text)
    total_tokens += n_tokens
    stats["kept"] += 1

    if stats["kept"] % 2000 == 0:
        pct = 100 * total_tokens / TARGET_TOKENS
        print(f"   ...{stats['kept']:,} docs kept, {total_tokens:,} tokens ({pct:.0f}% of budget)")

    # Stop when token budget is met
    if total_tokens >= effective_budget:
        print(f"\n\u2705 Token budget reached!")
        break

print(f"\n\U0001f4ca Filtering Results:")
print(f"   Total scanned:      {stats['total_seen']:,}")
print(f"   Dropped (short):    {stats['dropped_short']:,}")
print(f"   Dropped (long):     {stats['dropped_long']:,}")
print(f"   Dropped (Tamil%):   {stats['dropped_tamil']:,}")
print(f"   Dropped (repeat):   {stats['dropped_repetition']:,}")
print(f"   Dropped (dedup):    {stats['dropped_dedup']:,}")
print(f"   \u2705 Kept:           {stats['kept']:,} docs")
print(f"   \u2705 Total tokens:   {total_tokens:,}")

if total_tokens < TARGET_TOKENS * 0.9:
    print(f"\n\u26a0\ufe0f  Only got {total_tokens:,} tokens ({100 * total_tokens / TARGET_TOKENS:.0f}% of budget).")
    print(f"   Consider: relaxing filters, adding unverified/synthetic configs, or reducing TARGET_TOKENS.")

## 5. Pack Into 1024-Token Blocks

**GPT5.2 #5:** Don't pad each doc to max_length — concatenate all docs with EOS separators
into a continuous token stream, then split into fixed-length blocks.
This is the standard causal LM training format.
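
A toy version of the packing step, using small integer lists in place of real token IDs. `pack_into_blocks` is a hypothetical helper written for this illustration, and `eos_id=0` and `block_size=4` are arbitrary values chosen to keep the output readable.

```python
def pack_into_blocks(tokenized_docs, eos_id, block_size):
    """Concatenate docs with an EOS separator, then cut into fixed-size blocks.

    Returns (blocks, tail_length); the partial tail is discarded.
    """
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(eos_id)  # mark the document boundary
    n_blocks = len(stream) // block_size
    blocks = [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    tail_length = len(stream) - n_blocks * block_size
    return blocks, tail_length

# Three toy "documents" of token IDs (illustration only).
docs = [[1, 2, 3], [4, 5], [6, 7, 8]]
blocks, tail_length = pack_into_blocks(docs, eos_id=0, block_size=4)

print(blocks, tail_length)
```

The second block here starts in one document and ends in the next, with the EOS token marking the boundary. No padding tokens are needed, and at most `block_size - 1` tokens are lost in the tail.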

In [None]:
print(f"\U0001f4e6 Packing {len(clean_texts):,} docs into {MAX_SEQ_LENGTH}-token blocks...")

# Step 1: Tokenize all texts, concatenate with EOS separators
all_token_ids = []
eos_id = tokenizer.eos_token_id

for i, text in enumerate(clean_texts):
    tokens = tokenizer.encode(text, add_special_tokens=False)
    all_token_ids.extend(tokens)
    all_token_ids.append(eos_id)

    if (i + 1) % 5000 == 0:
        print(f"   ...tokenized {i + 1:,}/{len(clean_texts):,} docs")

print(f"   Total token stream: {len(all_token_ids):,} tokens")

# Step 2: Split into fixed-length blocks (discard partial tail)
n_blocks = len(all_token_ids) // MAX_SEQ_LENGTH
trimmed = all_token_ids[: n_blocks * MAX_SEQ_LENGTH]
blocks = [trimmed[i * MAX_SEQ_LENGTH : (i + 1) * MAX_SEQ_LENGTH] for i in range(n_blocks)]

print(f"\n\u2705 Packed into {len(blocks):,} blocks of {MAX_SEQ_LENGTH} tokens")
print(f"   Total training tokens: {len(blocks) * MAX_SEQ_LENGTH:,}")
print(f"   Discarded tail:        {len(all_token_ids) - len(trimmed):,} tokens")

# Step 3: Quick sanity check: decode a sample block
assert blocks, "No full blocks were produced; collect more text or lower MAX_SEQ_LENGTH"
sample_decoded = tokenizer.decode(blocks[0][:100])
print(f"\n\U0001f50d Sample block (first 100 tokens):")
print(f"   Tamil%: {tamil_char_pct(sample_decoded):.0f}%")
print(f"   Text:   {sample_decoded[:200]}...")

## 6. Create Dataset with Train/Eval Split & Upload to HF

In [None]:
# Create HF Dataset from packed blocks
packed_dataset = Dataset.from_dict({
    "input_ids": blocks,
    "attention_mask": [[1] * MAX_SEQ_LENGTH for _ in blocks],
    "labels": [list(b) for b in blocks],
})

# Train/eval split (2% for perplexity eval during training)
split = packed_dataset.train_test_split(test_size=EVAL_PCT, seed=RANDOM_SEED)

dataset_dict = DatasetDict({
    "train": split["train"],
    "validation": split["test"],
})

print(f"\U0001f4ca Dataset created:")
print(f"   Train:      {len(dataset_dict['train']):,} blocks ({len(dataset_dict['train']) * MAX_SEQ_LENGTH:,} tokens)")
print(f"   Validation: {len(dataset_dict['validation']):,} blocks ({len(dataset_dict['validation']) * MAX_SEQ_LENGTH:,} tokens)")
print(f"   Block size: {MAX_SEQ_LENGTH} tokens")
print(f"   Columns:    {list(dataset_dict['train'].column_names)}")
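
The idea behind a seeded holdout can be sketched in plain Python. This is not how `train_test_split` is implemented, and its rounding of the eval size may differ; `seeded_split` is a hypothetical helper that only illustrates that a fixed seed gives the same split on every run.

```python
import random

def seeded_split(items, eval_fraction, seed):
    """Shuffle indices with a fixed seed and hold out a fraction for evaluation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_eval = max(1, round(len(items) * eval_fraction))
    eval_idx = set(indices[:n_eval])
    train = [x for i, x in enumerate(items) if i not in eval_idx]
    evaluation = [x for i, x in enumerate(items) if i in eval_idx]
    return train, evaluation

# 100 toy blocks of 4 tokens each (illustration only).
toy_blocks = [[i] * 4 for i in range(100)]
train_a, eval_a = seeded_split(toy_blocks, 0.02, seed=42)
train_b, eval_b = seeded_split(toy_blocks, 0.02, seed=42)

print(len(train_a), len(eval_a), eval_a == eval_b)
```

Because the split is made over packed blocks rather than documents, a document that straddles a block boundary can end up partly in train and partly in validation.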

In [None]:
# Upload to HuggingFace
print(f"\U0001f4e4 Uploading to {OUTPUT_DATASET}...")

dataset_dict.push_to_hub(
    OUTPUT_DATASET,
    private=False,
    commit_message=(
        f"DAPT data v1.0: {len(blocks):,} packed blocks of {MAX_SEQ_LENGTH} tokens | "
        f"Source: Sangraha {SANGRAHA_CONFIG}/{SANGRAHA_SPLIT} | "
        f"Filters: Tamil>={MIN_TAMIL_PCT}%, chars {MIN_DOC_CHARS}-{MAX_DOC_CHARS}, dedup, no repetition | "
        f"Tokenizer: {BASE_MODEL}"
    ),
)

print(f"\n\u2705 Dataset uploaded: https://huggingface.co/datasets/{OUTPUT_DATASET}")

## 7. Summary & Verification

In [None]:
# Verify the uploaded dataset can be loaded back
print(f"\U0001f50d Verifying upload by loading back from HF...")
verify_ds = load_dataset(OUTPUT_DATASET)

print(f"   Train:      {len(verify_ds['train']):,} blocks")
print(f"   Validation: {len(verify_ds['validation']):,} blocks")

# Verify a sample block
sample = verify_ds["train"][0]
assert len(sample["input_ids"]) == MAX_SEQ_LENGTH, f"Block size mismatch: {len(sample['input_ids'])}"
assert len(sample["labels"]) == MAX_SEQ_LENGTH, f"Labels size mismatch: {len(sample['labels'])}"
assert sample["input_ids"] == sample["labels"], "input_ids and labels should match for causal LM"

sample_text = tokenizer.decode(sample["input_ids"][:50])
print(f"   Sample text: {sample_text[:150]}...")
print(f"   Tamil%: {tamil_char_pct(sample_text):.0f}%")

print(f"\n\u2705 Verification passed!")
print(f"\n{'='*60}")
print(f"\U0001f4cb SUMMARY")
print(f"{'='*60}")
print(f"   Dataset:    {OUTPUT_DATASET}")
print(f"   Source:     Sangraha {SANGRAHA_CONFIG}/{SANGRAHA_SPLIT}")
print(f"   Docs kept:  {stats['kept']:,} (scanned {stats['total_seen']:,})")
print(f"   Tokens:     {total_tokens:,} (budget: {TARGET_TOKENS:,})")
print(f"   Blocks:     {len(blocks):,} x {MAX_SEQ_LENGTH} tokens")
print(f"   Tokenizer:  {BASE_MODEL}")
print(f"\n\U0001f449 Next step: Run Vazhi_DAPT_v1_0_Tamil.ipynb on Kaggle GPU")
print(f"   It will load this dataset with:")
print(f'   ds = load_dataset("{OUTPUT_DATASET}")')