# VAZHI - Pre-trained Tamil Model Evaluation

**Goal**: Test existing Tamil models to find one suitable for mobile deployment

**Models to Test**:
1. Sarvam-1 (2B) - Indian AI company, optimized for 10 Indian languages
2. Gemma 2B Tamil - Community fine-tuned Google Gemma

**Why**: Our Qwen2.5-0.5B LoRA fine-tune failed: the model produced garbage output even though the training loss looked good.
Models that already handle Tamil let us skip that training risk entirely.

## 1. Setup

In [None]:
# Install dependencies
!pip install -q llama-cpp-python huggingface_hub transformers accelerate
!pip install -q bitsandbytes  # For loading large models in 4-bit

In [None]:
# Check GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
if torch.cuda.is_available():
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Test Questions

Standard questions to evaluate Tamil capability:

In [None]:
TEST_QUESTIONS = [
    "வணக்கம், நீங்கள் யார்?",
    "திருக்குறளின் முதல் குறள் என்ன?",
    "தமிழ்நாட்டின் தலைநகரம் எது?",
    "ஔவையாரின் ஆத்திசூடி பற்றி சொல்லுங்கள்",
]

# Expected answers for validation
EXPECTED = {
    "திருக்குறளின் முதல் குறள் என்ன?": "அகர முதல எழுத்தெல்லாம் ஆதி பகவன் முதற்றே உலகு",
    "தமிழ்நாட்டின் தலைநகரம் எது?": "சென்னை",
}
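
`EXPECTED` is defined above, but nothing in the notebook compares it with model output. A minimal sketch of one way to do that, assuming a substring match after Unicode NFC normalization and whitespace collapsing is good enough. `normalize`, `matches_expected`, and `sample_expected` are hypothetical names introduced here for illustration, not part of the notebook.

```python
import unicodedata
from typing import Optional


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so visually identical strings compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def matches_expected(question: str, response: str, expected: dict) -> Optional[bool]:
    """Return True/False when an expected answer exists for the question, else None."""
    answer = expected.get(question)
    if answer is None:
        return None
    return normalize(answer) in normalize(response)


# Illustrative data only, in the same shape as EXPECTED
sample_expected = {"தமிழ்நாட்டின் தலைநகரம் எது?": "சென்னை"}
print(matches_expected("தமிழ்நாட்டின் தலைநகரம் எது?", "தலைநகரம்  சென்னை ஆகும்.", sample_expected))
```

A strict substring match is likely too harsh for the Thirukkural question, since a model may add punctuation or line breaks inside the couplet; treat a `False` there as a prompt to read the output rather than as a verdict.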

## 3. Option A: Sarvam-1

**Model**: sarvamai/sarvam-1 (2B parameters)
**Optimized for**: 10 Indian languages including Tamil
**Company**: Sarvam AI (Bangalore-based)

In [None]:
# First, let's check if GGUF versions exist
from huggingface_hub import HfApi

api = HfApi()

# Search for Sarvam GGUF
print("Searching for Sarvam-1 GGUF models...")
models = list(api.list_models(search="sarvam gguf", limit=10))
for m in models:
    print(f"  - {m.id}")

print("\nSearching for Tamil Gemma GGUF models...")
models = list(api.list_models(search="tamil gemma gguf", limit=10))
for m in models:
    print(f"  - {m.id}")

In [None]:
# Test Sarvam-1 using transformers (not GGUF)
# This tells us if the model works before we try quantization

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

print("Loading Sarvam-1 in 4-bit...")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

sarvam_model = AutoModelForCausalLM.from_pretrained(
    "sarvamai/sarvam-1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

sarvam_tokenizer = AutoTokenizer.from_pretrained(
    "sarvamai/sarvam-1",
    trust_remote_code=True,
)

print(f"Model loaded! Parameters: {sarvam_model.num_parameters():,}")

In [None]:
def test_sarvam(question, max_tokens=200):
    """Test Sarvam-1 with a Tamil question"""
    # Sarvam uses a specific prompt format - check their docs
    # For now, try simple prompt
    prompt = f"Question: {question}\nAnswer:"
    
    inputs = sarvam_tokenizer(prompt, return_tensors="pt").to(sarvam_model.device)
    
    with torch.no_grad():
        outputs = sarvam_model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=sarvam_tokenizer.eos_token_id,
        )
    
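    # Decode only the newly generated tokens so the prompt is not echoed back
    prompt_len = inputs["input_ids"].shape[1]
    response = sarvam_tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    # The model may continue with an invented follow-up question; keep the first answer only
    return response.split("\nQuestion:")[0].strip()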

# Test all questions
print("=" * 60)
print("SARVAM-1 TEST RESULTS")
print("=" * 60)

for q in TEST_QUESTIONS:
    print(f"\nQ: {q}")
    print(f"A: {test_sarvam(q)}")

## 4. Option B: Gemma 2B Tamil

**Model**: abhinand/tamil-gemma-2b-instruct-v0.1
**Base**: Google Gemma 2B
**Fine-tuned by**: Community contributor (abhinand)

In [None]:
# Clear Sarvam from memory
import gc
del sarvam_model
del sarvam_tokenizer
gc.collect()
torch.cuda.empty_cache()
print("Cleared Sarvam from memory")

In [None]:
# Load Gemma 2B Tamil
print("Loading Gemma 2B Tamil in 4-bit...")

gemma_model = AutoModelForCausalLM.from_pretrained(
    "abhinand/tamil-gemma-2b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

gemma_tokenizer = AutoTokenizer.from_pretrained(
    "abhinand/tamil-gemma-2b-instruct-v0.1",
    trust_remote_code=True,
)

print(f"Model loaded! Parameters: {gemma_model.num_parameters():,}")

In [None]:
def test_gemma(question, max_tokens=200):
    """Test Gemma Tamil with a question"""
    # Gemma instruct format
    prompt = f"<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"
    
    inputs = gemma_tokenizer(prompt, return_tensors="pt").to(gemma_model.device)
    
    with torch.no_grad():
        outputs = gemma_model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=gemma_tokenizer.eos_token_id,
        )
    
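    # Decode only the newly generated tokens. Decoding the full sequence with
    # skip_special_tokens=True can strip the <start_of_turn> markers, so a split
    # on them never matches and the prompt comes back along with the answer.
    prompt_len = inputs["input_ids"].shape[1]
    response = gemma_tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    return response.strip()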

# Test all questions
print("=" * 60)
print("GEMMA 2B TAMIL TEST RESULTS")
print("=" * 60)

for q in TEST_QUESTIONS:
    print(f"\nQ: {q}")
    print(f"A: {test_gemma(q)}")

## 5. GGUF Conversion (for the better model)

Once we identify which model works better, convert it to GGUF for on-device (mobile) inference.
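
Before converting, it can help to estimate how large each GGUF file will be, since file size decides whether the model fits on a phone. The sketch below uses the arithmetic size = parameters × bits per weight / 8. The parameter count is a nominal 2B and the bits-per-weight figures are assumptions: 16 for f16, 8.5 for q8_0, and roughly 4.85 for q4_k_m, which varies by model. `estimate_gguf_size_gb` is a hypothetical helper, not a llama.cpp function.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values for illustration only
N_PARAMS = 2e9  # nominal "2B"; use the real count from the model card
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_k_m": 4.85}  # q4_k_m is approximate

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:8s} ~{estimate_gguf_size_gb(N_PARAMS, bpw):.2f} GB")
```

The estimate ignores metadata and any tensors kept at higher precision, so compare it against the `ls -lh` output from the quantization step rather than relying on it.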

In [None]:
# Setup llama.cpp for GGUF conversion
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && pip install -q -r requirements.txt

In [None]:
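# Choose the better model and convert
# Uncomment the model you want to convert:

# MODEL_TO_CONVERT = "sarvamai/sarvam-1"
# MODEL_TO_CONVERT = "abhinand/tamil-gemma-2b-instruct-v0.1"

# convert_hf_to_gguf.py expects a local model directory, so download the snapshot first
# from huggingface_hub import snapshot_download
# MODEL_DIR = snapshot_download(repo_id=MODEL_TO_CONVERT, local_dir="hf-model")

# Convert the downloaded checkpoint to GGUF
# !python llama.cpp/convert_hf_to_gguf.py {MODEL_DIR} --outfile tamil-model-f16.gguf --outtype f16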

In [None]:
# Build quantize tool
# !cd llama.cpp && mkdir -p build && cd build && cmake .. && make -j4 llama-quantize

In [None]:
# Quantize to different sizes
# !./llama.cpp/build/bin/llama-quantize tamil-model-f16.gguf tamil-model-q8_0.gguf q8_0
# !./llama.cpp/build/bin/llama-quantize tamil-model-f16.gguf tamil-model-q4_k_m.gguf q4_k_m
# !ls -lh tamil-model-*.gguf

## 6. Test GGUF Output Quality

Critical test: Does the quantized model still produce good Tamil?
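
Reading every output by eye does not scale, so a cheap first filter for "garbage" output is to measure how much of the text is actually Tamil script. The sketch below counts letters and combining marks that fall in the Tamil Unicode block (U+0B80 to U+0BFF). `tamil_ratio`, `looks_like_tamil`, and the 0.8 threshold are assumptions made for illustration, not values taken from this notebook.

```python
import unicodedata

TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF  # Tamil Unicode block


def tamil_ratio(text: str) -> float:
    """Fraction of letters and combining marks that belong to the Tamil block."""
    # Categories starting with L (letters) or M (marks) cover Tamil consonants,
    # vowels, and vowel signs; digits, punctuation, and whitespace are ignored.
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    tamil = sum(TAMIL_START <= ord(ch) <= TAMIL_END for ch in letters)
    return tamil / len(letters)


def looks_like_tamil(text: str, threshold: float = 0.8) -> bool:
    """Heuristic pass/fail; the threshold is an arbitrary starting point."""
    return tamil_ratio(text) >= threshold


print(tamil_ratio("சென்னை"), tamil_ratio("hello"))
```

A high ratio only shows that the output is in the right script. Repetitive loops or fluent but wrong answers still pass, so this complements the manual check instead of replacing it.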

In [None]:
# Build llama-cli
# !cd llama.cpp && cd build && make -j4 llama-cli

In [None]:
# Test GGUF model
# !./llama.cpp/build/bin/llama-cli -m tamil-model-q4_k_m.gguf \
#     -p "திருக்குறளின் முதல் குறள் என்ன?" \
#     -n 150 --temp 0.7 -ngl 0

## 7. Summary & Decision

Fill this in after testing:

In [None]:
print("""
=================================================================
MODEL COMPARISON SUMMARY
=================================================================

| Aspect              | Sarvam-1           | Gemma 2B Tamil     |
|---------------------|--------------------|--------------------|
| Tamil Quality       | [Fill after test]  | [Fill after test]  |
| Thirukkural Correct | [Yes/No]           | [Yes/No]           |
| Response Coherence  | [1-5 rating]       | [1-5 rating]       |
| GGUF Q4 Size        | [Size]             | [Size]             |
| GGUF Quality        | [Works/Broken]     | [Works/Broken]     |

DECISION: [Which model to use for VAZHI]
""")
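
Instead of editing the table by hand, the rows could be generated from results recorded during testing. This is a minimal sketch under that assumption; `render_summary` and the `placeholder` dictionary are hypothetical, and the `?` values are stand-ins, not measurements.

```python
def render_summary(results: dict) -> str:
    """Build a Markdown comparison table from {model_name: {aspect: value}}."""
    models = list(results)
    # Collect aspects in first-seen order across all models
    aspects = []
    for row in results.values():
        for aspect in row:
            if aspect not in aspects:
                aspects.append(aspect)
    lines = [
        "| Aspect | " + " | ".join(models) + " |",
        "|" + "---|" * (len(models) + 1),
    ]
    for aspect in aspects:
        cells = [str(results[m].get(aspect, "n/a")) for m in models]
        lines.append(f"| {aspect} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder values only; replace with what the tests actually show
placeholder = {
    "Sarvam-1": {"Thirukkural Correct": "?", "GGUF Q4 Size": "?"},
    "Gemma 2B Tamil": {"Thirukkural Correct": "?", "GGUF Q4 Size": "?"},
}
print(render_summary(placeholder))
```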