# VAZHI GGUF Quantization

**Goal**: Create a ~1.7GB quantized model for offline mobile inference

**Steps**:
1. Merge LoRA adapter with Qwen 2.5 3B base model
2. Convert merged model to GGUF format
3. Quantize to Q4_K_M
4. Test the quantized model

**Requirements**: Colab with ~12GB RAM (free tier should work)

## Step 0: Setup Environment

In [None]:
# Install dependencies
!pip install -q torch transformers peft accelerate huggingface_hub sentencepiece

# Check available RAM
!free -h

In [None]:
# Login to HuggingFace (needed to download models)
from huggingface_hub import login
login()  # Enter your HF token when prompted

## Step 1: Merge LoRA with Base Model

We'll load the base model and LoRA adapter, merge them, and save the result.
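
To make the merge step concrete, here is a minimal NumPy sketch of what folding a LoRA adapter into a weight matrix amounts to. The shapes, the rank `r`, and `lora_alpha` are illustrative assumptions, not values read from the VAZHI adapter; the point is only that the merged weight gives the same output as the base weight plus the adapter path.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the real adapter config)
d_out, d_in, r, lora_alpha = 8, 6, 2, 4
scaling = lora_alpha / r

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((r, d_in))      # LoRA down-projection
B = rng.standard_normal((d_out, r))     # LoRA up-projection
x = rng.standard_normal(d_in)           # one input vector

# Unmerged: base path plus scaled low-rank adapter path
y_adapter = W @ x + scaling * (B @ (A @ x))

# Merged: fold the low-rank update into a single dense weight
W_merged = W + scaling * (B @ A)
y_merged = W_merged @ x

max_diff = float(np.max(np.abs(y_adapter - y_merged)))
print(f"Max difference between unmerged and merged outputs: {max_diff:.2e}")
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint can be converted to GGUF as an ordinary model.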

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import gc

# Model configuration
BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"
LORA_ADAPTER = "CryptoYogi/vazhi-lora"
MERGED_OUTPUT = "./vazhi-merged"

print(f"Base model: {BASE_MODEL}")
print(f"LoRA adapter: {LORA_ADAPTER}")

In [None]:
# Load tokenizer first (small memory footprint)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
print(f"Tokenizer loaded. Vocab size: {tokenizer.vocab_size}")

In [None]:
# Load base model in float16 to save memory
print("Loading base model in float16...")
print("This may take a few minutes...")

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="cpu",  # Keep on CPU to save GPU memory
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

print(f"Base model loaded. Parameters: {base_model.num_parameters():,}")
!free -h

In [None]:
# Load and merge LoRA adapter
print("Loading LoRA adapter...")

model = PeftModel.from_pretrained(
    base_model,
    LORA_ADAPTER,
    torch_dtype=torch.float16,
)

print("LoRA adapter loaded. Merging...")

# Merge LoRA weights into base model
model = model.merge_and_unload()

print("Merge complete!")
!free -h

In [None]:
# Save merged model
print(f"Saving merged model to {MERGED_OUTPUT}...")
print("This may take a few minutes...")

model.save_pretrained(MERGED_OUTPUT, safe_serialization=True)
tokenizer.save_pretrained(MERGED_OUTPUT)

print("Merged model saved!")
!ls -lh {MERGED_OUTPUT}

In [None]:
# Clear memory before next step
print("Clearing memory...")
del model
del base_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Memory cleared.")
!free -h

## Step 2: Quick Test of Merged Model

Before converting to GGUF, let's verify the merged model works.
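
The test prompts in this notebook are written by hand in the ChatML layout used by Qwen 2.5. As a rough sketch, the same layout can be produced from a list of messages; `build_chatml_prompt` is a hypothetical helper introduced here for illustration and is not part of any library.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "நீங்கள் VAZHI (வழி), தமிழ் மக்களுக்கான AI உதவியாளர்."},
    {"role": "user", "content": "திருக்குறளின் முதல் குறள் என்ன?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

With `transformers`, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should give an equivalent string from the model's own template, which is usually the safer route.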

In [None]:
# Quick test of merged model
print("Loading merged model for quick test...")

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

test_model = AutoModelForCausalLM.from_pretrained(
    MERGED_OUTPUT,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
test_tokenizer = AutoTokenizer.from_pretrained(MERGED_OUTPUT)

print("Model loaded for testing.")

In [None]:
# Test with Thirukkural question
test_prompt = """<|im_start|>system
நீங்கள் VAZHI (வழி), தமிழ் மக்களுக்கான AI உதவியாளர்.
<|im_end|>
<|im_start|>user
திருக்குறளின் முதல் குறள் என்ன?<|im_end|>
<|im_start|>assistant
"""

inputs = test_tokenizer(test_prompt, return_tensors="pt").to(test_model.device)

print("Generating response...")
with torch.no_grad():
    outputs = test_model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=test_tokenizer.eos_token_id,
    )

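# Decode only the newly generated tokens so the prompt is not echoed back
prompt_length = inputs["input_ids"].shape[1]
response = test_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print("\n" + "="*50)
print("MERGED MODEL TEST")
print("="*50)
print(response.strip())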

In [None]:
# Clear test model
del test_model
del test_tokenizer
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("Test model cleared.")

## Step 3: Install llama.cpp and Convert to GGUF

In [None]:
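# Clone and build llama.cpp
# Recent llama.cpp versions build with CMake; the older `make` build has been removed.
!git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
!cmake -S llama.cpp -B llama.cpp/build -DCMAKE_BUILD_TYPE=Release
!cmake --build llama.cpp/build -j4 --target llama-quantize llama-cli
# Link the binaries into llama.cpp/ so the ./llama.cpp/llama-* paths used later resolve
!ln -sf build/bin/llama-quantize build/bin/llama-cli llama.cpp/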

In [None]:
# Install llama.cpp Python dependencies
!pip install -q -r llama.cpp/requirements.txt

In [None]:
# Convert merged model to GGUF format (F16)
print("Converting to GGUF format...")
print("This may take 5-10 minutes...")

!python llama.cpp/convert_hf_to_gguf.py \
    {MERGED_OUTPUT} \
    --outfile vazhi-f16.gguf \
    --outtype f16

print("\nConversion complete!")
!ls -lh vazhi-f16.gguf

## Step 4: Quantize to Q4_K_M

Q4_K_M provides a good balance of quality and size (~1.7GB for a 3B model).
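
As a rough sanity check on that size figure, the file size can be estimated from the parameter count and the average bits per weight. The numbers below are assumptions: 3.09 billion parameters is the count listed for Qwen 2.5 3B, and about 4.8 bits per weight is an approximate average for Q4_K_M, since the real value depends on which tensors get which quantization type.

```python
def estimate_size_gib(n_params, bits_per_weight):
    """Approximate model file size in GiB, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 3.09e9  # assumption: parameter count listed for Qwen 2.5 3B

# Assumed average bits per weight; Q4_K_M mixes several block formats
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q4_k_m": 4.8,
}

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gib(N_PARAMS, bpw):.1f} GiB")
```

The estimate is in GiB, which is also what `ls -lh` reports, so it can be compared directly with the listing printed after quantization.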

In [None]:
# Quantize to Q4_K_M
print("Quantizing to Q4_K_M...")
print("This provides best quality/size balance for mobile...")

!./llama.cpp/llama-quantize \
    vazhi-f16.gguf \
    vazhi-q4_k_m.gguf \
    q4_k_m

print("\nQuantization complete!")
!ls -lh vazhi-*.gguf

In [None]:
# Optional: Create smaller Q4_0 version for very low-end devices
print("Creating Q4_0 version (smaller but lower quality)...")

!./llama.cpp/llama-quantize \
    vazhi-f16.gguf \
    vazhi-q4_0.gguf \
    q4_0

print("\nAll GGUF files:")
!ls -lh vazhi-*.gguf

## Step 5: Test Quantized Model

In [None]:
# Test the Q4_K_M model with llama.cpp CLI
print("Testing quantized model...")
print("="*50)

!./llama.cpp/llama-cli \
    -m vazhi-q4_k_m.gguf \
    -p "<|im_start|>system\nநீங்கள் VAZHI, தமிழ் உதவியாளர்.<|im_end|>\n<|im_start|>user\nதிருக்குறளின் முதல் குறள் என்ன?<|im_end|>\n<|im_start|>assistant\n" \
    -n 150 \
    --temp 0.7 \
    -ngl 0

In [None]:
# Test with a scam detection question
print("\nTesting scam detection...")
print("="*50)

!./llama.cpp/llama-cli \
    -m vazhi-q4_k_m.gguf \
    -p "<|im_start|>system\nநீங்கள் VAZHI, தமிழ் உதவியாளர்.<|im_end|>\n<|im_start|>user\nஇது மோசடியா: 'நீங்கள் 50 லட்சம் lottery வென்றீர்கள், உங்கள் bank details அனுப்புங்கள்'<|im_end|>\n<|im_start|>assistant\n" \
    -n 150 \
    --temp 0.7 \
    -ngl 0

## Step 6: Upload to HuggingFace

In [None]:
# Upload GGUF files to HuggingFace
from huggingface_hub import HfApi, create_repo

# Create or use existing repo
GGUF_REPO = "CryptoYogi/vazhi-gguf"

try:
    create_repo(GGUF_REPO, repo_type="model", exist_ok=True)
    print(f"Repository ready: {GGUF_REPO}")
except Exception as e:
    print(f"Could not create or access repository: {e}")

In [None]:
# Upload the quantized models
api = HfApi()

print("Uploading Q4_K_M model (recommended)...")
api.upload_file(
    path_or_fileobj="vazhi-q4_k_m.gguf",
    path_in_repo="vazhi-q4_k_m.gguf",
    repo_id=GGUF_REPO,
    repo_type="model",
)
print("Q4_K_M uploaded!")

print("\nUploading Q4_0 model (smaller)...")
api.upload_file(
    path_or_fileobj="vazhi-q4_0.gguf",
    path_in_repo="vazhi-q4_0.gguf",
    repo_id=GGUF_REPO,
    repo_type="model",
)
print("Q4_0 uploaded!")

print(f"\nModels available at: https://huggingface.co/{GGUF_REPO}")

In [None]:
# Create README for the GGUF repo
readme_content = """---
license: apache-2.0
language:
- ta
- en
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- tamil
- gguf
- llama.cpp
- mobile
- offline
---

# VAZHI GGUF - Tamil AI Assistant (Quantized)

Quantized versions of VAZHI for offline mobile inference.

## Models

| File | Size | Quality | Use Case |
|------|------|---------|----------|
| vazhi-q4_k_m.gguf | ~1.7GB | Best | Recommended for most devices |
| vazhi-q4_0.gguf | ~1.5GB | Good | Low-memory devices |

## Usage with llama.cpp

```bash
./llama-cli -m vazhi-q4_k_m.gguf \
    -p "<|im_start|>user\nதிருக்குறளின் முதல் குறள்?<|im_end|>\n<|im_start|>assistant\n" \
    -n 150
```

## Base Model

- Base: Qwen/Qwen2.5-3B-Instruct
- Fine-tuned: [CryptoYogi/vazhi-lora](https://huggingface.co/CryptoYogi/vazhi-lora)
- Training: 3,007 Tamil Q&A pairs across 6 domains

## Domains

- Culture (Thirukkural, temples)
- Education (scholarships, exams)
- Security (scam detection)
- Legal (RTI, consumer rights)
- Government (schemes)
- Healthcare (Siddha medicine)

## License

Apache 2.0
"""

with open("GGUF_README.md", "w", encoding="utf-8") as f:
    f.write(readme_content)

api.upload_file(
    path_or_fileobj="GGUF_README.md",
    path_in_repo="README.md",
    repo_id=GGUF_REPO,
    repo_type="model",
)
print("README uploaded!")

## Summary

### Created Files:
- `vazhi-q4_k_m.gguf` - Recommended (~1.7GB)
- `vazhi-q4_0.gguf` - Smaller (~1.5GB)

### Next Steps:
1. Download GGUF from HuggingFace
2. Integrate into Flutter app using llama.cpp bindings
3. Test on actual mobile devices

### Download Command:
```bash
# Using huggingface-cli
huggingface-cli download CryptoYogi/vazhi-gguf vazhi-q4_k_m.gguf

# Or direct URL
wget https://huggingface.co/CryptoYogi/vazhi-gguf/resolve/main/vazhi-q4_k_m.gguf
```
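
After downloading, it can be worth checking that the file is intact before bundling it into an app. The sketch below reads only the fixed-size GGUF header, assuming the version 2 or later layout (little-endian, 4-byte magic, 32-bit version, 64-bit tensor and metadata counts). `read_gguf_header` is a hypothetical helper written for this notebook, not a llama.cpp API.

```python
import os
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    # A valid file starts with the ASCII magic "GGUF"
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # uint32 version, then uint64 tensor count and uint64 metadata key count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count


GGUF_PATH = "vazhi-q4_k_m.gguf"  # file name used throughout this notebook
if os.path.exists(GGUF_PATH):
    version, tensor_count, kv_count = read_gguf_header(GGUF_PATH)
    print(f"GGUF v{version}: {tensor_count} tensors, {kv_count} metadata keys")
else:
    print(f"{GGUF_PATH} not found; download it first")
```

A truncated download still passes this check as long as the header is present, so comparing the file size against the Hugging Face listing remains a useful second step.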

In [None]:
# Final summary
print("="*60)
print("VAZHI GGUF QUANTIZATION COMPLETE!")
print("="*60)
print("\nFiles created:")
!ls -lh vazhi-*.gguf
print(f"\nUploaded to: https://huggingface.co/{GGUF_REPO}")
print("\nReady for mobile integration!")