# VAZHI GGUF Diagnostic Notebook

**Goal**: Identify where quality loss occurs in the GGUF pipeline

**Test Question**: திருக்குறளின் முதல் குறள் என்ன?

**Expected Answer**: அகர முதல எழுத்தெல்லாம்...

**Checkpoints**:
1. LoRA Model (before merge) - Should work ✅
2. Merged Model (after merge) - Test this
3. GGUF F16 (after conversion) - Test this
4. GGUF Q8_0 (light quantization) - Test this
5. GGUF Q4_K_M (aggressive quantization) - Currently broken

**Platform**: Kaggle (30GB RAM) recommended

## Setup

In [None]:
!pip install -q torch transformers peft accelerate huggingface_hub sentencepiece

In [None]:
from huggingface_hub import login
login()  # Enter your HF token

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import gc

# Configuration
BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"
LORA_ADAPTER = "CryptoYogi/vazhi-lora"

# Test prompt - Thirukkural first verse
TEST_PROMPT = """<|im_start|>system
நீங்கள் VAZHI (வழி), தமிழ் மக்களுக்கான AI உதவியாளர்.
தமிழ் கலாச்சாரம், திருக்குறள், சித்தர்கள், கோவில்கள் பற்றி உதவுங்கள்.
<|im_end|>
<|im_start|>user
திருக்குறளின் முதல் குறள் என்ன?<|im_end|>
<|im_start|>assistant
"""

EXPECTED_KEYWORDS = ["அகர", "முதல", "எழுத்தெல்லாம்", "ஆதி", "பகவன்"]

def test_model(model, tokenizer, name):
    """Test model and check for expected keywords"""
    print(f"\n{'='*60}")
    print(f"Testing: {name}")
    print(f"{'='*60}")
    
    inputs = tokenizer(TEST_PROMPT, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
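    # Decode only the newly generated tokens, so the answer never includes the
    # prompt and does not depend on the word "assistant" appearing exactly once
    prompt_len = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()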
    
    print(f"\nResponse:\n{answer}\n")
    
    # Check for expected keywords
    found = [kw for kw in EXPECTED_KEYWORDS if kw in answer]
    print(f"Keywords found: {len(found)}/{len(EXPECTED_KEYWORDS)} - {found}")
    
    if len(found) >= 3:
        print("✅ PASS - Response contains Thirukkural content")
        return True
    else:
        print("❌ FAIL - Response missing Thirukkural content")
        return False

## Checkpoint 1: Test LoRA Model (Before Merge)

This checkpoint is expected to pass: it is the same base model plus adapter setup that was tested right after training.
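
The pass criterion in `test_model` depends only on the response text, so it can be factored into a small helper and reused on output pasted from the later `llama-cli` checkpoints. The sketch below is one possible way to do that; the name `score_response` and the `min_hits` parameter are illustrative and not part of the original notebook.

```python
# Hypothetical helper: applies the same ">= 3 keywords" rule as test_model
# to any string, for example text copied from a llama-cli run.
EXPECTED_KEYWORDS = ["அகர", "முதல", "எழுத்தெல்லாம்", "ஆதி", "பகவன்"]

def score_response(text, keywords=EXPECTED_KEYWORDS, min_hits=3):
    """Return (passed, found), where found lists the keywords present in text."""
    found = [kw for kw in keywords if kw in text]
    return len(found) >= min_hits, found

passed, found = score_response("அகர முதல எழுத்தெல்லாம் ஆதி பகவன் முதற்றே உலகு")
print(passed, found)
```

Scoring every checkpoint with the same rule keeps the comparison consistent instead of relying on reading the GGUF output by eye.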

In [None]:
print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

print("Loading LoRA adapter...")
lora_model = PeftModel.from_pretrained(
    base_model,
    LORA_ADAPTER,
    torch_dtype=torch.float16,
)

# Test LoRA model
checkpoint1_pass = test_model(lora_model, tokenizer, "Checkpoint 1: LoRA Model (before merge)")

## Checkpoint 2: Test Merged Model (After Merge)

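This is where the LoRA weights are folded into the base model weights. The merged model should answer the same way as checkpoint 1; if it does not, the merge is the problem.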
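
As a rough illustration of what the merge does numerically, the standard LoRA update adds a scaled low-rank product to each adapted weight matrix: `W + (alpha / r) * B @ A`. The toy example below uses plain Python lists and made-up values; it is a sketch of the arithmetic only, not of how `merge_and_unload()` is implemented.

```python
# Toy LoRA merge with plain Python lists (no torch), to show the arithmetic only.
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Illustrative values: a 2x3 weight with a rank-1 adapter (r=1, alpha=2).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # shape: out=2, in=3
A = [[0.5, -0.5, 1.0]]                    # shape: r x in
B = [[1.0], [2.0]]                        # shape: out x r
W_merged = merge_lora(W, A, B, alpha=2, r=1)
print(W_merged)
```

In float16 the addition is rounded, so tiny numerical differences between checkpoints 1 and 2 are plausible, but they should not be large enough to change the keyword check.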

In [None]:
print("Merging LoRA into base model...")
merged_model = lora_model.merge_and_unload()

# Test merged model
checkpoint2_pass = test_model(merged_model, tokenizer, "Checkpoint 2: Merged Model (after merge)")

In [None]:
# Save merged model for GGUF conversion
MERGED_OUTPUT = "./vazhi-merged"
print(f"Saving merged model to {MERGED_OUTPUT}...")
merged_model.save_pretrained(MERGED_OUTPUT, safe_serialization=True)
tokenizer.save_pretrained(MERGED_OUTPUT)
print("Saved!")

# Clear memory
del lora_model
del merged_model
del base_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

## Setup llama.cpp for GGUF Conversion

In [None]:
# Clone and build llama.cpp
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && mkdir -p build && cd build && cmake .. && make -j4
!pip install -q -r llama.cpp/requirements.txt

## Checkpoint 3: Test GGUF F16 (After Conversion, Before Quantization)

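This tests whether the HuggingFace → GGUF conversion alone, before any quantization, preserves quality. The `llama-cli` cells from here on use a shorter system prompt than `TEST_PROMPT`, so GGUF results are best compared with each other rather than word-for-word with checkpoints 1 and 2.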
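
Before judging the output, it can help to confirm that the converter actually produced a structurally valid file. A GGUF file starts with the 4-byte magic `GGUF`, a little-endian `uint32` version, and two `uint64` counts (tensors and metadata key/value pairs). The helper below is a minimal sketch that reads only this fixed header; the function name `read_gguf_header` is hypothetical.

```python
import os
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, tensor and metadata counts."""
    with open(path, "rb") as f:
        data = f.read(24)
    if len(data) < 24 or data[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count
    version, n_tensors, n_kv = struct.unpack("<IQQ", data[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Only inspect the file if the conversion cell has already produced it.
if os.path.exists("vazhi-f16.gguf"):
    print(read_gguf_header("vazhi-f16.gguf"))
```

A tensor count of zero or a `ValueError` here would point at the conversion step rather than at quantization.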

In [None]:
# Convert to GGUF F16
print("Converting to GGUF F16...")
!python llama.cpp/convert_hf_to_gguf.py \
    {MERGED_OUTPUT} \
    --outfile vazhi-f16.gguf \
    --outtype f16

!ls -lh vazhi-f16.gguf

In [None]:
# Test F16 GGUF
print("\n" + "="*60)
print("Testing: Checkpoint 3: GGUF F16 (no quantization)")
print("="*60)

!./llama.cpp/build/bin/llama-cli \
    -m vazhi-f16.gguf \
    -p "<|im_start|>system\nநீங்கள் VAZHI, தமிழ் கலாச்சாரம் பற்றி உதவுங்கள்.<|im_end|>\n<|im_start|>user\nதிருக்குறளின் முதல் குறள் என்ன?<|im_end|>\n<|im_start|>assistant\n" \
    -n 150 \
    --temp 0.7 \
    -ngl 0 \
    --stop "<|im_end|>" \
    2>&1 | tail -30

## Checkpoint 4: Test GGUF Q8_0 (Light Quantization)

Q8_0 is 8-bit quantization - less aggressive than Q4_K_M.
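
To make "less aggressive" concrete: Q8_0 stores weights in blocks of 32, each with one scale `d = max|x| / 127`, and rounds every weight to the nearest multiple of `d`. The sketch below applies that scheme to synthetic values and compares it with a plain symmetric 4-bit variant. The 4-bit version is a deliberate simplification and is not the real Q4_K_M format, which uses a more elaborate layout; the sketch also ignores that the scale is stored in half precision.

```python
import random

BLOCK = 32  # Q8_0 groups weights into blocks of 32 with one scale per block

def quantize_block(block, bits):
    """Symmetric round-to-nearest quantization of one block, then dequantize."""
    qmax = (1 << (bits - 1)) - 1          # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    d = amax / qmax                       # per-block scale
    return [round(x / d) * d for x in block]

def rms_error(weights, bits):
    """Root-mean-square round-trip error over all blocks."""
    err = 0.0
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        err += sum((x - y) ** 2 for x, y in zip(block, quantize_block(block, bits)))
    return (err / len(weights)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic, not model weights
err8, err4 = rms_error(weights, 8), rms_error(weights, 4)
print(f"8-bit RMS error: {err8:.6f}, 4-bit RMS error: {err4:.6f}")
```

The 4-bit round-trip error is much larger than the 8-bit one on the same data, which is why Q8_0 is a useful intermediate checkpoint between F16 and Q4_K_M.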

In [None]:
# Quantize to Q8_0
print("Quantizing to Q8_0...")
!./llama.cpp/build/bin/llama-quantize \
    vazhi-f16.gguf \
    vazhi-q8_0.gguf \
    q8_0

!ls -lh vazhi-q8_0.gguf

In [None]:
# Test Q8_0 GGUF
print("\n" + "="*60)
print("Testing: Checkpoint 4: GGUF Q8_0 (8-bit quantization)")
print("="*60)

!./llama.cpp/build/bin/llama-cli \
    -m vazhi-q8_0.gguf \
    -p "<|im_start|>system\nநீங்கள் VAZHI, தமிழ் கலாச்சாரம் பற்றி உதவுங்கள்.<|im_end|>\n<|im_start|>user\nதிருக்குறளின் முதல் குறள் என்ன?<|im_end|>\n<|im_start|>assistant\n" \
    -n 150 \
    --temp 0.7 \
    -ngl 0 \
    --stop "<|im_end|>" \
    2>&1 | tail -30

## Checkpoint 5: Test GGUF Q4_K_M (Aggressive Quantization)

This is the 4-bit quantization currently in use, and the one that is known to produce broken output.
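
A quick sanity check on the quantized files is to compare their sizes with a back-of-the-envelope estimate from bits per weight. The sketch below assumes roughly 3.09 billion parameters for the base model and an average of about 4.85 bits per weight for Q4_K_M; both numbers are approximations introduced here for illustration, and real files will differ because of metadata and per-tensor type choices.

```python
# Rough file-size estimate from bits per weight (bpw). Metadata and
# per-tensor type choices are ignored, so real files will differ somewhat.
N_PARAMS = 3.09e9  # assumed parameter count for the 3B base model

BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,      # 32 int8 values + one fp16 scale = 34 bytes per 32 weights
    "Q4_K_M": 4.85,   # assumed average; Q4_K_M mixes several tensor types
}

def estimated_gb(n_params, bpw):
    """Size in decimal gigabytes for n_params weights at bpw bits each."""
    return n_params * bpw / 8 / 1e9

for name, bpw in BPW.items():
    print(f"{name:7s} ~{estimated_gb(N_PARAMS, bpw):.1f} GB")
```

If `ls -lh` reports a size far from these ballpark figures, the quantization step may not have run as intended.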

In [None]:
# Quantize to Q4_K_M
print("Quantizing to Q4_K_M...")
!./llama.cpp/build/bin/llama-quantize \
    vazhi-f16.gguf \
    vazhi-q4_k_m.gguf \
    q4_k_m

!ls -lh vazhi-q4_k_m.gguf

In [None]:
# Test Q4_K_M GGUF
print("\n" + "="*60)
print("Testing: Checkpoint 5: GGUF Q4_K_M (4-bit quantization)")
print("="*60)

!./llama.cpp/build/bin/llama-cli \
    -m vazhi-q4_k_m.gguf \
    -p "<|im_start|>system\nநீங்கள் VAZHI, தமிழ் கலாச்சாரம் பற்றி உதவுங்கள்.<|im_end|>\n<|im_start|>user\nதிருக்குறளின் முதல் குறள் என்ன?<|im_end|>\n<|im_start|>assistant\n" \
    -n 150 \
    --temp 0.7 \
    -ngl 0 \
    --stop "<|im_end|>" \
    2>&1 | tail -30

## Summary

Run this cell after testing all checkpoints to see the summary.
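
The localization logic behind the summary can also be written down as a small function: walk the checkpoints in order and report the first one that fails. This is a hypothetical helper, not part of the original notebook; the suspected cause listed for checkpoint 1 is an assumption, and the others follow the notes printed by the summary cell.

```python
# Hypothetical helper: given pass/fail per checkpoint, name the first failing stage.
STAGES = [
    (1, "LoRA model", "training or adapter loading"),  # cause is an assumption
    (2, "Merged model", "LoRA merge"),
    (3, "GGUF F16", "HuggingFace to GGUF conversion"),
    (4, "GGUF Q8_0", "quantization in general"),
    (5, "GGUF Q4_K_M", "Q4_K_M being too aggressive"),
]

def first_failure(results):
    """results maps checkpoint number -> bool; missing entries count as failures."""
    for num, stage, cause in STAGES:
        if not results.get(num, False):
            return f"Quality first degrades at checkpoint {num} ({stage}): suspect {cause}."
    return "All checkpoints pass."

print(first_failure({1: True, 2: True, 3: True, 4: True, 5: False}))
```

Because the checkpoints are ordered along the pipeline, only the first failure is informative; later stages inherit whatever broke earlier.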

In [None]:
print("\n" + "="*60)
print("DIAGNOSTIC SUMMARY")
print("="*60)
print("""
Review the outputs above and mark each checkpoint:

| Checkpoint | Stage | Expected | Actual | File Size |
|------------|-------|----------|--------|----------|
| 1 | LoRA Model | ✅ Pass | ? | N/A |
| 2 | Merged Model | ? | ? | ~6GB |
| 3 | GGUF F16 | ? | ? | ~6GB |
| 4 | GGUF Q8_0 | ? | ? | ~3.2GB |
| 5 | GGUF Q4_K_M | ❌ Fail | ? | ~1.8GB |

If quality degrades at:
- Checkpoint 2: Problem with LoRA merge
- Checkpoint 3: Problem with GGUF conversion
- Checkpoint 4: Problem with quantization itself; even 8-bit hurts (keep F16)
- Checkpoint 5 only: Q4_K_M too aggressive, use Q8_0
""")
print("\nFile sizes:")
!ls -lh vazhi-*.gguf 2>/dev/null || echo "No GGUF files yet"

## Upload Best Working Model

Once you identify the best working quantization, upload it to HuggingFace.

In [None]:
from huggingface_hub import HfApi

api = HfApi()
GGUF_REPO = "CryptoYogi/vazhi-gguf"

# Upload Q8_0 if it works better
# Uncomment the model you want to upload:

# print("Uploading Q8_0 model...")
# api.upload_file(
#     path_or_fileobj="vazhi-q8_0.gguf",
#     path_in_repo="vazhi-q8_0.gguf",
#     repo_id=GGUF_REPO,
#     repo_type="model",
# )
# print(f"Uploaded to https://huggingface.co/{GGUF_REPO}")