# VAZHI - Tamil-LLaMA 7B Extreme Quantization Testing

**Goal**: Test whether Tamil-LLaMA 7B can be quantized to roughly 1-1.5GB while preserving Tamil output quality

**Why Tamil-LLaMA?**
- Only model that produces correct Tamil responses out of the box
- Its Q4_K_M quantization is 3.9GB (too large), but more aggressive quantization levels might reach the target size

**Quantization Levels to Test**:
| Type | Expected Size | Risk |
|------|---------------|------|
| Q4_K_M | 3.9GB | Known good |
| Q3_K_M | ~3.0GB | Low |
| Q2_K | ~2.5GB | Medium |
| IQ3_XXS | ~2.0GB | Medium |
| IQ2_XXS | ~1.5GB | High |
| IQ2_XS | ~1.3GB | Higher |
| IQ1_M | ~1.1GB | Very High |
| IQ1_S | ~1.0GB | Experimental |
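
The expected sizes above can be sanity-checked with simple arithmetic: file size is roughly parameter count × bits per weight ÷ 8. The sketch below is only a rough lower bound. The nominal bits-per-weight values are an assumption taken from the `llama-quantize` help text, and `N_PARAMS` is an assumed parameter count, so both should be checked against your own build and model. Real files usually come out larger because some tensors (such as embeddings) are kept at higher precision. Note that by these nominal figures IQ2_XS is slightly larger than IQ2_XXS, so measured sizes may not follow the order in the table.

```python
# Nominal bits per weight for the i-quant types. These values are an assumption
# taken from llama-quantize's help text; confirm them against your own build.
NOMINAL_BPW = {
    "IQ3_XXS": 3.06,
    "IQ2_XS": 2.31,
    "IQ2_XXS": 2.06,
    "IQ1_M": 1.75,
    "IQ1_S": 1.56,
}

def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (10^9 bytes), ignoring metadata and mixed-precision tensors."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 6.9e9  # assumed parameter count; replace with the real one

# Print estimates from the largest type to the smallest
for quant, bpw in sorted(NOMINAL_BPW.items(), key=lambda kv: -kv[1]):
    print(f"{quant:<8} ~{estimate_size_gb(N_PARAMS, bpw):.2f} GB")
```

Comparing these estimates with the sizes measured later in the notebook shows how much overhead the higher-precision tensors add.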

## 1. Setup

In [None]:
# Install dependencies
!pip install -q huggingface_hub

# Clone llama.cpp for quantization
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && pip install -q -r requirements.txt

In [None]:
# Build llama.cpp tools (llama-imatrix creates the importance matrix that the IQ1/IQ2 types rely on)
!cd llama.cpp && mkdir -p build && cd build && cmake .. && make -j4 llama-quantize llama-cli llama-imatrix
print("Build complete!")

In [None]:
# Login to HuggingFace
from huggingface_hub import login, hf_hub_download, snapshot_download

try:
    from kaggle_secrets import UserSecretsClient
    secrets = UserSecretsClient()
    login(token=secrets.get_secret("HF_TOKEN"))
    print("Logged in via Kaggle")
except Exception:
    # Not on Kaggle (or the secret is missing): try Colab secrets next
    try:
        from google.colab import userdata
        login(token=userdata.get('HF_TOKEN'))
        print("Logged in via Colab")
    except Exception:
        # Fall back to the interactive login prompt
        login()

## 2. Download Tamil-LLaMA 7B

Model: `abhinand/tamil-llama-7b-instruct-v0.2`

In [None]:
# Check if GGUF already exists on HuggingFace
from huggingface_hub import HfApi, list_models

api = HfApi()
print("Searching for Tamil-LLaMA GGUF models...")

# Search for existing GGUFs
models = list(api.list_models(search="tamil-llama gguf", limit=10))
for m in models:
    print(f"  - {m.id}")

# Also check the original repo for GGUF files
print("\nChecking original repo for GGUF files...")
try:
    files = api.list_repo_files("abhinand/tamil-llama-7b-instruct-v0.2")
    gguf_files = [f for f in files if f.endswith('.gguf')]
    if gguf_files:
        print(f"Found GGUF files: {gguf_files}")
    else:
        print("No GGUF files in original repo - need to convert")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Option A: Download existing GGUF if available
# Uncomment if GGUF exists on HuggingFace

# GGUF_REPO = "TheBloke/tamil-llama-7b-instruct-v0.2-GGUF"  # Example
# hf_hub_download(GGUF_REPO, "tamil-llama-7b-instruct-v0.2.Q4_K_M.gguf", local_dir=".")

In [None]:
# Option B: Download HF model and convert to GGUF
# This is needed if no GGUF exists

MODEL_ID = "abhinand/tamil-llama-7b-instruct-v0.2"
LOCAL_DIR = "./tamil-llama-7b"

print(f"Downloading {MODEL_ID}...")
print("This will take a while (~14GB)...")

snapshot_download(
    repo_id=MODEL_ID,
    local_dir=LOCAL_DIR,
    ignore_patterns=["*.md", "*.txt"],  # Skip docs
)

print(f"Downloaded to {LOCAL_DIR}")
!ls -lh {LOCAL_DIR}

In [None]:
# Convert to GGUF F16 (base for all quantizations)
print("Converting to GGUF F16...")
!python llama.cpp/convert_hf_to_gguf.py {LOCAL_DIR} --outfile tamil-llama-f16.gguf --outtype f16

print("\nF16 GGUF created:")
!ls -lh tamil-llama-f16.gguf
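
Before spending time on quantization, it can help to confirm that the conversion really produced a GGUF file. GGUF files begin with the four magic bytes `GGUF` followed by a little-endian 32-bit version number, so a minimal check only needs the first 8 bytes. The helper name `read_gguf_header` is illustrative and not part of llama.cpp.

```python
import os
import struct

def read_gguf_header(path):
    """Return (is_gguf, version) by reading the first 8 bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return False, None
    (version,) = struct.unpack("<I", header[4:8])
    return True, version

# Only check the file if the conversion step has already produced it
if os.path.exists("tamil-llama-f16.gguf"):
    ok, version = read_gguf_header("tamil-llama-f16.gguf")
    print(f"Valid GGUF: {ok}, version: {version}")
```

A `False` result usually means the conversion was interrupted or wrote to a different path.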

## 3. Test Questions

Standard Tamil questions to evaluate quality at each quantization level.
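
The `format_prompt` helper in this section assumes an Alpaca-style template. Prompt templates can differ between releases of a model family, and a mismatched template can make output look worse than the quantization alone would. If the model card for `abhinand/tamil-llama-7b-instruct-v0.2` specifies ChatML instead (an assumption to verify before relying on it), a formatter might look like the sketch below. The function name `format_prompt_chatml` and the default system prompt are illustrative.

```python
def format_prompt_chatml(question, system_prompt="You are a helpful assistant."):
    """Build a ChatML-style prompt that ends where the assistant reply should begin."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(format_prompt_chatml("தமிழ்நாட்டின் தலைநகரம் எது?"))
```

With this template, response extraction in `test_gguf` would need to split on the assistant marker rather than on `### Response:`.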

In [None]:
TEST_QUESTIONS = [
    "வணக்கம், நீங்கள் யார்?",
    "திருக்குறளின் முதல் குறள் என்ன?",
    "தமிழ்நாட்டின் தலைநகரம் எது?",
    "OTP யாரிடமும் சொல்லலாமா?",
]

EXPECTED_ANSWERS = {
    "திருக்குறளின் முதல் குறள் என்ன?": "அகர முதல",
    "தமிழ்நாட்டின் தலைநகரம் எது?": "சென்னை",
    "OTP யாரிடமும் சொல்லலாமா?": "இல்லை",  # Should warn against sharing
}

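def format_prompt(question):
    """Format a question using an Alpaca-style instruction prompt."""
    # Assumes Tamil-LLaMA expects an Alpaca-style template; check the model card,
    # because a mismatched template degrades output regardless of quantization.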
    return f"""### Instruction:
{question}

### Response:
"""

In [None]:
import subprocess
import os

def test_gguf(model_path, question, max_tokens=150):
    """Test a GGUF model with a Tamil question"""
    prompt = format_prompt(question)
    
    cmd = [
        "./llama.cpp/build/bin/llama-cli",
        "-m", model_path,
        "-p", prompt,
        "-n", str(max_tokens),
        "--temp", "0.7",
        "-ngl", "0",  # CPU only for compatibility
    ]
    
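    try:
        # Close stdin so llama-cli cannot block waiting for interactive input
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120,
                                stdin=subprocess.DEVNULL)
        if result.returncode != 0:
            return f"[ERROR: llama-cli exited with code {result.returncode}]"
        output = result.stdout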
        
        # Extract response after prompt
        if "### Response:" in output:
            response = output.split("### Response:")[-1].strip()
            # Clean up
            response = response.split("###")[0].strip()  # Stop at next section
            return response[:500]  # Limit length
        return output[-500:]
    except subprocess.TimeoutExpired:
        return "[TIMEOUT]"
    except Exception as e:
        return f"[ERROR: {e}]"

def evaluate_quality(response, question):
    """Simple quality check"""
    # Check if response contains Tamil
    tamil_chars = sum(1 for c in response if 0x0B80 <= ord(c) <= 0x0BFF)
    has_tamil = tamil_chars > 10
    
    # Check for expected answer if available
    expected = EXPECTED_ANSWERS.get(question, "")
    has_expected = expected.lower() in response.lower() if expected else None
    
    # Check for gibberish patterns
    is_gibberish = (
        response.lower().count("system") >= 3 or  # "system" repeated several times
        len(set(response)) < 10 or  # Too few unique chars
        response.count(response[:10]) > 3  # Opening text repeated
    )
    
    return {
        "has_tamil": has_tamil,
        "has_expected": has_expected,
        "is_gibberish": is_gibberish,
        "tamil_chars": tamil_chars,
    }
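
The checks in `evaluate_quality` are deliberately simple. As a possible complement, the sketch below computes two continuous scores: the share of non-space characters that fall in the Tamil Unicode block, and the fraction of word n-grams that occur more than once. Both helper names and the default `n=4` are illustrative choices, not part of the original evaluation.

```python
from collections import Counter

def tamil_ratio(text):
    """Share of non-whitespace characters in the Tamil block (U+0B80 to U+0BFF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if 0x0B80 <= ord(c) <= 0x0BFF) / len(chars)

def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that occur more than once (0.0 means no repetition)."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

print(tamil_ratio("சென்னை is the capital"))
print(repeated_ngram_fraction("a b a b a b a b", n=2))
```

Scores like these make it easier to see gradual degradation between quantization levels than a single true/false flag does.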

## 4. Quantize to Multiple Levels

In [None]:
# Define quantization levels to test
QUANT_LEVELS = [
    ("Q4_K_M", "q4_k_m"),      # ~3.9GB - baseline
    ("Q3_K_M", "q3_k_m"),      # ~3.0GB
    ("Q2_K", "q2_k"),          # ~2.5GB
    ("IQ3_XXS", "iq3_xxs"),    # ~2.0GB
    ("IQ2_XXS", "iq2_xxs"),    # ~1.5GB - TARGET
    ("IQ2_XS", "iq2_xs"),      # ~1.3GB
    ("IQ1_M", "iq1_m"),        # ~1.1GB
    ("IQ1_S", "iq1_s"),        # ~1.0GB - Most aggressive
]

print("Quantization levels to test:")
for name, code in QUANT_LEVELS:
    print(f"  - {name} ({code})")
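
The IQ1 and IQ2 types in this list are designed to be used with an importance matrix, and recent llama.cpp builds refuse to produce IQ1_S, IQ1_M, IQ2_XXS, or IQ2_XS without one. A minimal sketch of generating it follows. It assumes the `llama-imatrix` binary has been built next to `llama-quantize`, and the file names `calibration-ta.txt` (a plain-text calibration corpus, ideally Tamil) and `tamil-llama.imatrix` are hypothetical.

```python
import os
import subprocess

def build_imatrix_cmd(model_path, calibration_path, output_path,
                      binary="./llama.cpp/build/bin/llama-imatrix"):
    """Assemble the llama-imatrix command: model in, calibration text in, matrix out."""
    return [binary, "-m", model_path, "-f", calibration_path, "-o", output_path]

# Hypothetical file names: a plain-text calibration corpus and the output matrix
CALIBRATION_TEXT = "calibration-ta.txt"
IMATRIX_FILE = "tamil-llama.imatrix"

cmd = build_imatrix_cmd("tamil-llama-f16.gguf", CALIBRATION_TEXT, IMATRIX_FILE)
if os.path.exists(cmd[0]) and os.path.exists(CALIBRATION_TEXT):
    subprocess.run(cmd, check=True)
else:
    print("Skipping imatrix generation; would run:", " ".join(cmd))
```

The resulting file is passed to `llama-quantize` through its `--imatrix` option.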

In [None]:
# Quantize to each level
import os

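BASE_GGUF = "tamil-llama-f16.gguf"
# Optional importance matrix. Recent llama.cpp builds reject the IQ1_* and
# IQ2_XXS/IQ2_XS types without one, so those levels are skipped if it is missing.
IMATRIX_FILE = "tamil-llama.imatrix"
NEEDS_IMATRIX = {"iq2_xxs", "iq2_xs", "iq1_m", "iq1_s"}
results = {}

for name, code in QUANT_LEVELS:
    output_file = f"tamil-llama-{code}.gguf"
    
    if os.path.exists(output_file):
        print(f"\n{name}: Already exists, skipping...")
        continue
    
    have_imatrix = os.path.exists(IMATRIX_FILE)
    if code in NEEDS_IMATRIX and not have_imatrix:
        print(f"\n{name}: needs an importance matrix ({IMATRIX_FILE} not found), skipping...")
        results[name] = {"file": None, "error": "Missing importance matrix"}
        continue
    imatrix_args = f"--imatrix {IMATRIX_FILE}" if have_imatrix else ""
    
    print(f"\n{'='*60}")
    print(f"Quantizing to {name} ({code})...")
    print(f"{'='*60}")
    
    !./llama.cpp/build/bin/llama-quantize {imatrix_args} {BASE_GGUF} {output_file} {code}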
    
    if os.path.exists(output_file):
        size_bytes = os.path.getsize(output_file)
        size_gb = size_bytes / (1024**3)
        print(f"\nCreated: {output_file} ({size_gb:.2f} GB)")
        results[name] = {"file": output_file, "size_gb": size_gb}
    else:
        print(f"\nFailed to create {output_file}")
        results[name] = {"file": None, "error": "Quantization failed"}

In [None]:
# Show all created files
print("\nAll GGUF files:")
!ls -lh tamil-llama-*.gguf

## 5. Test Each Quantization Level

In [None]:
# Test each quantized model
import time

all_results = []

for name, code in QUANT_LEVELS:
    model_file = f"tamil-llama-{code}.gguf"
    
    if not os.path.exists(model_file):
        print(f"\nSkipping {name} - file not found")
        continue
    
    size_gb = os.path.getsize(model_file) / (1024**3)
    
    print(f"\n{'='*60}")
    print(f"Testing {name} ({size_gb:.2f} GB)")
    print(f"{'='*60}")
    
    model_results = {
        "quant": name,
        "size_gb": size_gb,
        "questions": []
    }
    
    for q in TEST_QUESTIONS:
        print(f"\nQ: {q}")
        
        start = time.time()
        response = test_gguf(model_file, q)
        elapsed = time.time() - start
        
        quality = evaluate_quality(response, q)
        
        print(f"A: {response[:300]}...")
        print(f"   [Time: {elapsed:.1f}s, Tamil chars: {quality['tamil_chars']}, "
              f"Gibberish: {quality['is_gibberish']}]")
        
        model_results["questions"].append({
            "question": q,
            "response": response,
            "time_s": elapsed,
            "quality": quality
        })
    
    all_results.append(model_results)
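
Each test run takes a long time, so it may be worth persisting the results before moving on to the summary. The sketch below writes a results list to JSON with `ensure_ascii=False` so the Tamil responses stay readable in the file. The helper name `save_results`, the file name `quant_results.json`, and the made-up sample record are illustrative; in the notebook you would pass `all_results` instead.

```python
import json

def save_results(results, path="quant_results.json"):
    """Write results as UTF-8 JSON, keeping Tamil text human-readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return path

# Tiny stand-in with made-up values so the sketch runs on its own
sample_results = [{
    "quant": "EXAMPLE",
    "size_gb": 1.0,
    "questions": [{
        "question": "தமிழ்நாட்டின் தலைநகரம் எது?",
        "response": "சென்னை",
        "time_s": 1.0,
        "quality": {"has_tamil": True, "has_expected": True,
                    "is_gibberish": False, "tamil_chars": 6},
    }],
}]

saved_path = save_results(sample_results)
print(f"Saved results to {saved_path}")
```

Values of `None` in `has_expected` are stored as JSON `null`, so the file round-trips cleanly through `json.load`.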

## 6. Summary & Comparison

In [None]:
# Generate summary table
print("\n" + "="*80)
print("TAMIL-LLAMA 7B QUANTIZATION COMPARISON")
print("="*80)

print(f"\n{'Quant':<12} {'Size':<10} {'Tamil':<8} {'Correct':<10} {'Gibberish':<10} {'Verdict'}")
print("-"*70)

for result in all_results:
    quant = result["quant"]
    size = f"{result['size_gb']:.2f}GB"
    
    # Aggregate quality across questions
    tamil_count = sum(1 for q in result["questions"] if q["quality"]["has_tamil"])
    correct_count = sum(1 for q in result["questions"] 
                       if q["quality"]["has_expected"] is True)
    gibberish_count = sum(1 for q in result["questions"] 
                         if q["quality"]["is_gibberish"])
    
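    total = len(result["questions"])
    # Only questions listed in EXPECTED_ANSWERS can be scored for correctness
    scored = sum(1 for q in result["questions"]
                 if q["quality"]["has_expected"] is not None)
    
    tamil_pct = f"{tamil_count}/{total}"
    correct_pct = f"{correct_count}/{scored}"
    gibberish_pct = f"{gibberish_count}/{total}"
    
    # Verdict
    if gibberish_count > 0:
        verdict = "FAIL"
    elif tamil_count == total and correct_count == scored: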
        verdict = "GOOD"
    elif tamil_count >= total - 1:
        verdict = "OK"
    else:
        verdict = "POOR"
    
    print(f"{quant:<12} {size:<10} {tamil_pct:<8} {correct_pct:<10} {gibberish_pct:<10} {verdict}")

print("\n" + "="*80)

In [None]:
# Find best quantization under 1.5GB
print("\nBest quantization under 1.5GB:")

viable = [r for r in all_results if r["size_gb"] <= 1.5]
if viable:
    # List largest first; the sort is by file size, not by measured quality
    for r in sorted(viable, key=lambda x: x["size_gb"], reverse=True):
        gibberish = sum(1 for q in r["questions"] if q["quality"]["is_gibberish"])
        tamil = sum(1 for q in r["questions"] if q["quality"]["has_tamil"])
        
        status = "USABLE" if gibberish == 0 and tamil >= len(r["questions"]) - 1 else "DEGRADED"
        print(f"  {r['quant']}: {r['size_gb']:.2f}GB - {status}")
else:
    print("  No viable quantization under 1.5GB")
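
If you would rather rank the candidates by measured quality than by size, one option is a composite sort key: fewest gibberish responses first, then most Tamil responses, then smallest file. This is a sketch of one possible ordering, not the notebook's own criterion. The helper names and the made-up sample records are illustrative.

```python
def quality_key(result):
    """Sort key: fewer gibberish answers, then more Tamil answers, then smaller size."""
    questions = result["questions"]
    gibberish = sum(1 for q in questions if q["quality"]["is_gibberish"])
    tamil = sum(1 for q in questions if q["quality"]["has_tamil"])
    return (gibberish, -tamil, result["size_gb"])

def _q(has_tamil, is_gibberish):
    """Build a minimal question record for the example below."""
    return {"quality": {"has_tamil": has_tamil, "is_gibberish": is_gibberish}}

# Made-up records shaped like the entries in all_results
sample = [
    {"quant": "A", "size_gb": 1.4, "questions": [_q(True, False), _q(True, False)]},
    {"quant": "B", "size_gb": 1.2, "questions": [_q(True, True), _q(True, False)]},
    {"quant": "C", "size_gb": 1.0, "questions": [_q(True, False), _q(True, False)]},
]

ranked = sorted(sample, key=quality_key)
for r in ranked:
    print(f"{r['quant']}: {r['size_gb']:.2f}GB")
```

Passing the `viable` list to `sorted(..., key=quality_key)` would apply the same ordering to real results.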

## 7. Detailed Output Comparison

Compare actual responses across quantization levels for key questions.

In [None]:
# Compare Thirukkural responses across all quants
thirukkural_q = "திருக்குறளின் முதல் குறள் என்ன?"

print("\nThirukkural Response Comparison:")
print("="*60)

for result in all_results:
    for q in result["questions"]:
        if q["question"] == thirukkural_q:
            print(f"\n[{result['quant']} - {result['size_gb']:.2f}GB]")
            print(q["response"][:400])
            print("-"*40)

In [None]:
# Compare Capital city responses
capital_q = "தமிழ்நாட்டின் தலைநகரம் எது?"

print("\nCapital City Response Comparison:")
print("="*60)

for result in all_results:
    for q in result["questions"]:
        if q["question"] == capital_q:
            has_chennai = "சென்னை" in q["response"]
            status = "CORRECT" if has_chennai else "WRONG"
            print(f"\n[{result['quant']} - {result['size_gb']:.2f}GB] {status}")
            print(q["response"][:200])

## 8. Save Best Model

In [None]:
# Identify best viable model
# Manual selection based on test results above

BEST_QUANT = "iq2_xxs"  # Update based on test results
BEST_FILE = f"tamil-llama-{BEST_QUANT}.gguf"

if os.path.exists(BEST_FILE):
    size_gb = os.path.getsize(BEST_FILE) / (1024**3)
    print(f"Best model: {BEST_FILE} ({size_gb:.2f}GB)")
    
    # Rename for clarity
    final_name = f"vazhi-tamil-llama-{BEST_QUANT}.gguf"
    !cp {BEST_FILE} {final_name}
    print(f"Copied to: {final_name}")
else:
    print(f"File not found: {BEST_FILE}")

## 9. Upload to HuggingFace (Optional)

In [None]:
# Upload viable quantizations to HuggingFace
from huggingface_hub import HfApi, create_repo

UPLOAD = False  # Set to True to upload

if UPLOAD:
    api = HfApi()
    REPO_ID = "CryptoYogi/vazhi-tamil-llama-gguf"
    
    # Create repo
    create_repo(REPO_ID, repo_type="model", exist_ok=True)
    
    # Upload every quantized file that exists locally (no size or quality filter is
    # applied, so trim this list to the viable quants before enabling UPLOAD)
    files_to_upload = [
        f"tamil-llama-{code}.gguf" 
        for name, code in QUANT_LEVELS 
        if os.path.exists(f"tamil-llama-{code}.gguf")
    ]
    
    for f in files_to_upload:
        print(f"Uploading {f}...")
        api.upload_file(
            path_or_fileobj=f,
            path_in_repo=f,
            repo_id=REPO_ID,
        )
    
    print(f"\nUploaded to: https://huggingface.co/{REPO_ID}")
else:
    print("Upload disabled. Set UPLOAD=True to upload.")

## 10. Conclusions

In [None]:
print("""
================================================================
TAMIL-LLAMA 7B QUANTIZATION STUDY - CONCLUSIONS
================================================================

Fill in after running tests:

| Quant     | Size    | Tamil Quality | Verdict     |
|-----------|---------|---------------|-------------|
| Q4_K_M    | 3.9GB   | [            ]| Baseline    |
| Q3_K_M    | ~3.0GB  | [            ]| [          ]|
| Q2_K      | ~2.5GB  | [            ]| [          ]|
| IQ3_XXS   | ~2.0GB  | [            ]| [          ]|
| IQ2_XXS   | ~1.5GB  | [            ]| [          ]|
| IQ2_XS    | ~1.3GB  | [            ]| [          ]|
| IQ1_M     | ~1.1GB  | [            ]| [          ]|
| IQ1_S     | ~1.0GB  | [            ]| [          ]|

RECOMMENDATION:
[Fill in the best viable quantization for VAZHI mobile app]

NEXT STEPS:
1. If viable quant found: Integrate with VAZHI app
2. If no viable quant: Try alternative approaches
   - Distillation from Tamil-LLaMA to smaller model
   - Custom Tamil tokenizer + small model
   - Hybrid architecture (small LLM + lookup tables)
""")