# Lab 3.2.6: GGUF Conversion

**Module:** 3.2 - Model Quantization & Optimization  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the GGUF format and its advantages
- [ ] Convert a Hugging Face model to GGUF
- [ ] Apply different quantization levels (Q2_K to Q8_0)
- [ ] Run inference with llama.cpp

---

## 📚 Prerequisites

- Completed: Labs 3.2.1-3.2.5
- Tools: llama.cpp (built from source)
- Hardware: Works on CPU and GPU

---

## 🌍 Real-World Context

**Why GGUF?**
- **llama.cpp**: One of the most widely used local LLM inference engines
- **Universal**: Works on CPU, CUDA, Metal, ROCm, Vulkan
- **Efficient**: Hand-optimized kernels for each platform
- **Portable**: Single-file models, easy to share and deploy

**GGUF = GGML Universal Format** - successor to GGML, designed for:
- Extensibility (add new features without breaking compatibility)
- Metadata storage (tokenizer, model config in the same file)
- Mixed quantization types in one file (different tensors can use different types)
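
To make the single-file layout concrete, here is a minimal sketch that parses the fixed-size header at the start of a GGUF file. It assumes the little-endian layout described in the GGUF specification for version 2 and later (4-byte magic `GGUF`, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count). The helper name `read_gguf_header` is made up for illustration, and the demo writes a synthetic header rather than reading a real model.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"
# Little-endian: 4-byte magic, uint32 version, uint64 tensor count, uint64 metadata KV count
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def read_gguf_header(path):
    """Return the fixed header fields of a GGUF file as a dict (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError(f"{path}: too small to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"{path}: not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Demo on a synthetic header (the counts are arbitrary, not taken from a real model)
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.gguf"
    demo_file.write_bytes(struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 10, 5))
    header = read_gguf_header(demo_file)

print(header)
```

The metadata key/value pairs (tokenizer, architecture, hyperparameters) follow this header, which is why a GGUF file can be loaded without any side-car config files.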

---

## 🧒 ELI5: What is GGUF?

> **Think of model formats like movie file formats...**
>
> **PyTorch (.bin/.safetensors):** Like a professional movie master file.
> - High quality, but needs special software (PyTorch) to play
> - Large file size
>
> **GGUF:** Like an MP4 file.
> - Plays anywhere (CPU, GPU, phone, browser)
> - Compressed but still looks great
> - Everything in one file (video + audio = model + tokenizer)
>
> **In AI terms:**
> - GGUF packages model weights + tokenizer + config in one file
> - Works with llama.cpp, Ollama, LM Studio, and many more
> - Supports 2-bit to 8-bit quantization (Q2_K to Q8_0)

---

## Part 1: Understanding GGUF Quantization Types

In [None]:
import os
import subprocess
from pathlib import Path

print("=" * 60)
print("GGUF Conversion Lab")
print("=" * 60)

# GGUF quantization types and their properties
GGUF_QUANT_TYPES = {
    'Q2_K': {'bits': 2.5, 'quality': 'Low', 'use_case': 'Maximum compression'},
    'Q3_K_S': {'bits': 3.4, 'quality': 'Low-Medium', 'use_case': 'Small model, okay quality'},
    'Q3_K_M': {'bits': 3.9, 'quality': 'Medium', 'use_case': 'Balanced for small'},
    'Q4_0': {'bits': 4.5, 'quality': 'Medium', 'use_case': 'Legacy, use Q4_K_M instead'},
    'Q4_K_S': {'bits': 4.5, 'quality': 'Medium-Good', 'use_case': 'Good compression'},
    'Q4_K_M': {'bits': 4.8, 'quality': 'Good', 'use_case': 'Recommended default'},
    'Q5_0': {'bits': 5.5, 'quality': 'Good', 'use_case': 'Legacy, use Q5_K_M instead'},
    'Q5_K_S': {'bits': 5.5, 'quality': 'Very Good', 'use_case': 'Higher quality'},
    'Q5_K_M': {'bits': 5.7, 'quality': 'Very Good', 'use_case': 'Best quality/size'},
    'Q6_K': {'bits': 6.6, 'quality': 'Excellent', 'use_case': 'Near FP16 quality'},
    'Q8_0': {'bits': 8.5, 'quality': 'Excellent', 'use_case': 'Maximum quality'},
    'F16': {'bits': 16.0, 'quality': 'Perfect', 'use_case': 'No quantization'},
}

print("\nGGUF Quantization Types:")
print("=" * 70)
print(f"{'Type':<10} {'Bits':>8} {'Quality':<15} {'Use Case'}")
print("-" * 70)

for qtype, info in GGUF_QUANT_TYPES.items():
    print(f"{qtype:<10} {info['bits']:>8.1f} {info['quality']:<15} {info['use_case']}")

print("\n" + "=" * 70)
print("Recommendation: Q4_K_M for most users (best balance)")
print("              Q5_K_M for quality-critical applications")
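
The fractional "bits" values in the table come from per-block overhead. As a rough sketch: the legacy `_0` types store weights in blocks of 32 that share one 16-bit scale, so the effective cost per weight is slightly above the nominal bit width. The function name `block_bits_per_weight` is hypothetical, and K-quants use a more elaborate super-block layout that this sketch does not model.

```python
def block_bits_per_weight(weight_bits, block_size=32, scale_bits=16):
    """Effective bits per weight when each block shares one scale value."""
    total_bits = block_size * weight_bits + scale_bits
    return total_bits / block_size


for name, weight_bits in [("Q4_0", 4), ("Q5_0", 5), ("Q8_0", 8)]:
    print(f"{name}: {block_bits_per_weight(weight_bits):.1f} bits/weight")
```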

In [None]:
# Calculate model sizes for different quantization types

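def calculate_gguf_size(params_b, quant_type):
    """Estimate GGUF file size in GB for a model with params_b billion parameters."""
    bits = GGUF_QUANT_TYPES[quant_type]['bits']
    # (billions of weights) x (bits per weight) / (8 bits per byte) = billions of bytes = GB
    return params_b * bits / 8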

# Common model sizes
model_sizes = [7, 13, 34, 70]
quant_types = ['Q2_K', 'Q4_K_M', 'Q5_K_M', 'Q8_0', 'F16']

print("\nGGUF File Sizes (GB) by Model and Quantization:")
print("=" * 70)
print(f"{'Model':<10}", end="")
for qt in quant_types:
    print(f"{qt:>12}", end="")
print()
print("-" * 70)

for size in model_sizes:
    print(f"{size}B{'':<7}", end="")
    for qt in quant_types:
        gb = calculate_gguf_size(size, qt)
        print(f"{gb:>12.1f}", end="")
    print()

print("\nNote: Actual sizes vary with metadata overhead and because some tensors are stored at a different precision.")
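
The same size estimate can be turned around to answer "which quantization fits my memory?". The sketch below picks the highest-precision type whose estimated file size fits a budget. The bits-per-weight values are copied from the table in Part 1; the helper names and the 1 GB `overhead_gb` reserve (standing in for KV cache and runtime buffers) are assumptions for illustration, not measured values.

```python
# Approximate bits per weight, copied from the Part 1 table
BITS_PER_WEIGHT = {"Q2_K": 2.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}


def estimate_size_gb(params_b, bits):
    """Billions of parameters x bits per weight / 8 = size in GB."""
    return params_b * bits / 8


def largest_quant_that_fits(params_b, budget_gb, overhead_gb=1.0):
    """Return the highest-precision type that fits the budget, or None."""
    usable_gb = budget_gb - overhead_gb  # placeholder reserve for runtime buffers
    fitting = [
        (bits, name)
        for name, bits in BITS_PER_WEIGHT.items()
        if estimate_size_gb(params_b, bits) <= usable_gb
    ]
    return max(fitting)[1] if fitting else None


print(largest_quant_that_fits(7, budget_gb=8))
print(largest_quant_that_fits(70, budget_gb=48))
print(largest_quant_that_fits(70, budget_gb=16))
```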

---

## Part 2: Setting Up llama.cpp

In [None]:
# Check if llama.cpp is available

LLAMA_CPP_PATH = Path.home() / "llama.cpp"  # Default location

# You can also set this to your llama.cpp installation
# LLAMA_CPP_PATH = Path("/path/to/your/llama.cpp")

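convert_script = LLAMA_CPP_PATH / "convert_hf_to_gguf.py"
# CMake places compiled binaries under build/bin (matches the build steps shown below)
quantize_binary = LLAMA_CPP_PATH / "build" / "bin" / "llama-quantize"
main_binary = LLAMA_CPP_PATH / "build" / "bin" / "llama-cli"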

print("Checking llama.cpp installation...")
print(f"Path: {LLAMA_CPP_PATH}")
print()

if LLAMA_CPP_PATH.exists():
    print("  llama.cpp directory: Found")
    print(f"  convert_hf_to_gguf.py: {'Found' if convert_script.exists() else 'Missing'}")
    print(f"  llama-quantize: {'Found' if quantize_binary.exists() else 'Missing'}")
    print(f"  llama-cli: {'Found' if main_binary.exists() else 'Missing'}")
    
    HAS_LLAMA_CPP = convert_script.exists()
else:
    print("  llama.cpp not found!")
    HAS_LLAMA_CPP = False

if not HAS_LLAMA_CPP:
    print("\nTo install llama.cpp:")
    print("""
    # Clone repository
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    
    # Build with CUDA support (for DGX Spark)
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j
    
    # Install Python dependencies for conversion
    pip install -r requirements.txt
    """)

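Where the compiled binaries end up depends on how llama.cpp was built or installed. The sketch below checks a few likely locations before falling back to `PATH`. The function name `find_llama_tool` and the candidate list are assumptions for illustration; adjust them to your own setup.

```python
import shutil
from pathlib import Path


def find_llama_tool(name, llama_cpp_path):
    """Return the path to a llama.cpp executable, or None if it cannot be found."""
    root = Path(llama_cpp_path)
    # Candidate locations: CMake build output first, then the repository root
    candidates = [root / "build" / "bin" / name, root / name]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # Fall back to anything installed on PATH
    on_path = shutil.which(name)
    return Path(on_path) if on_path else None


for tool in ("llama-quantize", "llama-cli"):
    print(f"{tool}: {find_llama_tool(tool, Path.home() / 'llama.cpp')}")
```
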
---

## Part 3: Converting a Model to GGUF

In [None]:
# Model to convert
MODEL_NAME = "microsoft/phi-2"  # Small model for demo
OUTPUT_DIR = Path("./gguf_models")
OUTPUT_DIR.mkdir(exist_ok=True)

print(f"Model: {MODEL_NAME}")
print(f"Output directory: {OUTPUT_DIR}")

In [None]:
# Step 1: Download the model (if not already downloaded)
from huggingface_hub import snapshot_download

print("Step 1: Downloading model from Hugging Face...")

try:
    # Download a full copy of the repository into OUTPUT_DIR/hf_model
    model_path = snapshot_download(
        MODEL_NAME,
        local_dir=OUTPUT_DIR / "hf_model",
    )
    print(f"Model downloaded to: {model_path}")
except Exception as e:
    print(f"Download failed: {e}")
    print("Falling back to the local HF cache...")
    # The converter needs a local directory, not a repo ID, so resolve the
    # cached snapshot path (this raises if the model was never cached).
    model_path = snapshot_download(MODEL_NAME, local_files_only=True)
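
Before converting, it can save time to confirm that the download directory looks like a complete Hugging Face checkpoint. The following is a heuristic sketch only: the helper name `check_hf_model_dir` and the list of tokenizer file names are assumptions, and some models use other file names.

```python
from pathlib import Path


def check_hf_model_dir(model_dir):
    """Return a list of problems found in a local Hugging Face model directory."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    if not (model_dir / "config.json").is_file():
        problems.append("missing config.json")

    # Weights are usually stored as *.safetensors or legacy *.bin shards
    has_weights = any(model_dir.glob("*.safetensors")) or any(model_dir.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight files")

    # Common tokenizer file names (assumed list, not exhaustive)
    tokenizer_files = ("tokenizer.json", "tokenizer.model", "vocab.json")
    if not any((model_dir / name).is_file() for name in tokenizer_files):
        problems.append("no tokenizer file found")

    return problems


print(check_hf_model_dir("./gguf_models/hf_model"))
```

An empty list means the directory passed all three checks.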

In [None]:
# Step 2: Convert to GGUF (FP16 first)
import sys

if HAS_LLAMA_CPP:
    print("\nStep 2: Converting to GGUF (FP16)...")
    
    fp16_output = OUTPUT_DIR / "model-f16.gguf"
    
    cmd = [
        sys.executable, str(convert_script),  # use the notebook's own interpreter
        str(model_path),
        "--outfile", str(fp16_output),
        "--outtype", "f16",
    ]
    
    print(f"Running: {' '.join(cmd)}")
    
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print("Conversion successful!")
        print(f"Output: {fp16_output}")
        
        if fp16_output.exists():
            size_gb = fp16_output.stat().st_size / 1e9
            print(f"File size: {size_gb:.2f} GB")
    except subprocess.CalledProcessError as e:
        print(f"Conversion failed: {e.stderr}")
else:
    print("\nllama.cpp not available. Showing expected command:")
    print(f"""
    python {LLAMA_CPP_PATH}/convert_hf_to_gguf.py \\
        {model_path} \\
        --outfile {OUTPUT_DIR}/model-f16.gguf \\
        --outtype f16
    """)

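Conversion of larger models can take several minutes, and `capture_output=True` hides all progress until the process exits. One option is to stream the output line by line instead. This is a minimal sketch; `run_streaming` is a hypothetical helper, and the demo runs a trivial Python command in place of the real converter.

```python
import subprocess
import sys


def run_streaming(cmd):
    """Run cmd, echoing merged stdout/stderr as it arrives.

    Returns (returncode, list_of_output_lines).
    """
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so progress messages are shown too
        text=True,
        bufsize=1,  # line-buffered
    ) as proc:
        for line in proc.stdout:
            line = line.rstrip("\n")
            print(line)
            lines.append(line)
    # Leaving the with-block waits for the process, so returncode is set here
    return proc.returncode, lines


# Demo with a stand-in command instead of the conversion script
code, output = run_streaming([sys.executable, "-c", "print('step 1'); print('step 2')"])
print(f"exit code: {code}")
```
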
In [None]:
# Step 3: Quantize to different types

QUANTIZATION_TYPES = ['Q4_K_M', 'Q5_K_M', 'Q8_0']

if HAS_LLAMA_CPP and quantize_binary.exists() and fp16_output.exists():
    print("\nStep 3: Quantizing to different types...")
    
    for qtype in QUANTIZATION_TYPES:
        output_file = OUTPUT_DIR / f"model-{qtype.lower()}.gguf"
        
        print(f"\nQuantizing to {qtype}...")
        
        cmd = [
            str(quantize_binary),
            str(fp16_output),
            str(output_file),
            qtype,
        ]
        
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            size_gb = output_file.stat().st_size / 1e9
            print(f"  Output: {output_file}")
            print(f"  Size: {size_gb:.2f} GB")
        except subprocess.CalledProcessError as e:
            print(f"  Failed: {e.stderr}")
else:
    print("\nQuantization command format:")
    print(f"""
    {quantize_binary} model-f16.gguf model-q4_k_m.gguf Q4_K_M
    """)

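Once the quantized files exist, their sizes can be converted back into an effective bits-per-weight figure and a compression ratio relative to F16. The sketch below uses made-up file sizes for a hypothetical 1B-parameter model; substitute `Path.stat().st_size` values and your model's real parameter count.

```python
def effective_bits_per_weight(file_size_bytes, n_params):
    """Average bits stored per parameter, including metadata overhead."""
    return file_size_bytes * 8 / n_params


def compression_ratio(baseline_bytes, quantized_bytes):
    """How many times smaller the quantized file is than the baseline."""
    return baseline_bytes / quantized_bytes


# Hypothetical numbers for illustration only
n_params = 1_000_000_000
f16_bytes = 2_000_000_000
q4_bytes = 600_000_000

print(f"F16: {effective_bits_per_weight(f16_bytes, n_params):.1f} bits/weight")
print(f"Q4:  {effective_bits_per_weight(q4_bytes, n_params):.1f} bits/weight")
print(f"Compression vs F16: {compression_ratio(f16_bytes, q4_bytes):.2f}x")
```
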
---

## Part 4: Running Inference with llama.cpp

In [None]:
# Test inference with the quantized model

if HAS_LLAMA_CPP and main_binary.exists():
    print("Testing inference with llama.cpp...")
    
    # Use Q4_K_M model
    model_file = OUTPUT_DIR / "model-q4_k_m.gguf"
    
    if model_file.exists():
        prompt = "The key to machine learning is"
        
        cmd = [
            str(main_binary),
            "-m", str(model_file),
            "-p", prompt,
            "-n", "50",  # Generate 50 tokens
            "-ngl", "99",  # Offload all layers to GPU
            "--no-display-prompt",
        ]
        
        print(f"\nPrompt: '{prompt}'")
        print("\nGenerating...\n")
        
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
            if result.returncode != 0:
                print(f"llama-cli exited with code {result.returncode}")
            print(f"Output: {result.stdout}")
            
            if result.stderr:
                # Extract performance info from stderr
                for line in result.stderr.split('\n'):
                    if 'eval time' in line.lower() or 'tokens per second' in line.lower():
                        print(line)
        except subprocess.TimeoutExpired:
            print("Inference timed out.")
        except Exception as e:
            print(f"Error: {e}")
    else:
        print(f"Model file not found: {model_file}")
else:
    print("\nllama.cpp inference command:")
    print(f"""
    {main_binary} \\
        -m model-q4_k_m.gguf \\
        -p "The key to machine learning is" \\
        -n 50 \\
        -ngl 99  # Offload to GPU
    """)

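The timing lines printed by llama.cpp can also be parsed into numbers for later comparison. The sketch below assumes timing lines of the form `... eval time = ... ( ... ms per token, N tokens per second)`; the exact wording differs between llama.cpp versions, so treat the pattern and the synthetic sample text as assumptions to check against your own output.

```python
import re

SPEED_PATTERN = re.compile(r"([\d.]+)\s+tokens per second")


def parse_tokens_per_second(stderr_text):
    """Extract prompt and generation speeds from llama.cpp-style timing lines."""
    speeds = {}
    for line in stderr_text.splitlines():
        match = SPEED_PATTERN.search(line)
        if not match:
            continue
        # Check "prompt eval time" first because it also contains "eval time"
        if "prompt eval time" in line:
            speeds["prompt"] = float(match.group(1))
        elif "eval time" in line:
            speeds["generation"] = float(match.group(1))
    return speeds


# Synthetic sample, not real benchmark output
sample = """\
perf: prompt eval time =   50.00 ms /  8 tokens (  6.25 ms per token, 160.00 tokens per second)
perf:        eval time = 1000.00 ms / 49 runs   ( 20.41 ms per token,  49.00 tokens per second)
"""

print(parse_tokens_per_second(sample))
```
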
In [None]:
# Using GGUF models with Python (llama-cpp-python)

print("Using GGUF with Python (llama-cpp-python)")
print("=" * 50)

try:
    from llama_cpp import Llama
    HAS_LLAMA_CPP_PYTHON = True
    print("llama-cpp-python: Available")
except ImportError:
    HAS_LLAMA_CPP_PYTHON = False
    print("llama-cpp-python: Not installed")
    print("  Install with: pip install llama-cpp-python")
    print("  For CUDA: CMAKE_ARGS='-DGGML_CUDA=on' pip install llama-cpp-python")

if HAS_LLAMA_CPP_PYTHON:
    model_file = OUTPUT_DIR / "model-q4_k_m.gguf"
    
    if model_file.exists():
        print(f"\nLoading {model_file}...")
        
        llm = Llama(
            model_path=str(model_file),
            n_gpu_layers=-1,  # Offload all layers to GPU
            n_ctx=2048,       # Context length
            verbose=False,
        )
        
        # Generate
        prompt = "The key to machine learning is"
        print(f"\nPrompt: '{prompt}'")
        
        output = llm(
            prompt,
            max_tokens=50,
            echo=False,
        )
        
        print(f"\nGenerated: {output['choices'][0]['text']}")
        
        # Cleanup
        del llm
    else:
        print(f"Model file not found: {model_file}")
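
For interactive use, llama-cpp-python can also stream tokens: calling the model with `stream=True` returns an iterator of chunks shaped like the non-streaming response. The sketch below collects such a stream; `collect_stream` is a hypothetical helper, and the demo feeds it hand-written chunks so it runs without a model file.

```python
def collect_stream(chunks, on_text=print):
    """Pass each streamed text piece to on_text and return the full text."""
    pieces = []
    for chunk in chunks:
        text = chunk["choices"][0]["text"]
        on_text(text)
        pieces.append(text)
    return "".join(pieces)


# With a loaded model this would be:
#   collect_stream(llm(prompt, max_tokens=50, stream=True))
# Here we use fake chunks with the same shape.
fake_chunks = [{"choices": [{"text": t}]} for t in ("data", ", ", "and more data")]
result = collect_stream(fake_chunks, on_text=lambda t: print(t, end="", flush=True))
print()
```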

---

## Part 5: Quality Comparison

In [None]:
# Compare file sizes and expected quality

print("\nQuantization Quality vs Size Trade-off")
print("=" * 60)

# List all GGUF files
gguf_files = list(OUTPUT_DIR.glob("*.gguf"))

if gguf_files:
    print(f"\n{'File':<30} {'Size (GB)':>12} {'Bits/Weight':>12}")
    print("-" * 60)
    
    for f in sorted(gguf_files):
        size_gb = f.stat().st_size / 1e9
        # Estimate bits from filename
        name = f.stem.upper()
        bits = "Unknown"
        for qt, info in GGUF_QUANT_TYPES.items():
            if qt.replace('_', '') in name.replace('_', '').replace('-', ''):
                bits = f"{info['bits']:.1f}"
                break
        
        print(f"{f.name:<30} {size_gb:>12.2f} {bits:>12}")
else:
    print("\nNo GGUF files found.")
    print("Approximate quality comparison (illustrative; actual values depend on the model):")
    print("""
    | Quantization | Perplexity Increase | Recommendation |
    |--------------|---------------------|----------------|
    | Q8_0 | +0.01 | Highest quality |
    | Q6_K | +0.02 | Near-lossless |
    | Q5_K_M | +0.05 | Excellent |
    | Q4_K_M | +0.10 | Great (default) |
    | Q3_K_M | +0.20 | Good |
    | Q2_K | +0.50 | Acceptable |
    """)

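The quality table is expressed in perplexity, which is the exponential of the average negative log-probability the model assigns to each token of a reference text. A minimal sketch of that formula follows; the function name is hypothetical and the input is a toy example, not output from a real model. For real measurements, llama.cpp provides a `llama-perplexity` tool.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))) over natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy example: a model that assigns probability 0.25 to every token
uniform = [math.log(0.25)] * 8
print(f"Perplexity: {perplexity(uniform):.2f}")
```

A model that is always choosing uniformly among four options has perplexity 4, so lower is better, and quantization typically raises the value slightly.
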
---

## ✋ Try It Yourself

### Exercise 1: Convert a Different Model

Convert a model of your choice to GGUF and test all quantization levels.

### Exercise 2: Benchmark Inference Speed

Compare inference speed across quantization types:
```bash
for q in q2_k q4_k_m q5_k_m q8_0; do
    echo "Testing $q:"
    ./llama-cli -m model-$q.gguf -p "Hello" -n 100 -ngl 99 2>&1 | grep -E "eval time|tokens per second"
done
```

In [None]:
# Exercise: Your code here

# TODO: Convert a different model to GGUF
# TODO: Compare sizes and test inference

# Your code here...

---

## Common Mistakes

### Mistake 1: Not Using GPU Offloading

```bash
# Wrong: Running entirely on CPU (slow!)
./llama-cli -m model.gguf -p "Hello"

# Right: Offload layers to GPU
./llama-cli -m model.gguf -p "Hello" -ngl 99
```
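
When the model does not fit entirely in GPU memory, `-ngl` can be set to a smaller number so that only some layers are offloaded. The sketch below gives a rough starting value. It assumes all layers are the same size and reserves a fixed amount of memory for the KV cache and buffers; both are simplifications, and the function name and example numbers are hypothetical.

```python
def estimate_gpu_layers(model_size_gb, n_layers, free_vram_gb, reserve_gb=1.0):
    """Rough guess for -ngl: how many equally sized layers fit in free VRAM."""
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, int(usable_gb / per_layer_gb))


# Hypothetical examples
print(estimate_gpu_layers(model_size_gb=4.2, n_layers=32, free_vram_gb=24))
print(estimate_gpu_layers(model_size_gb=42.0, n_layers=80, free_vram_gb=24))
```

If generation fails with an out-of-memory error, lower the value and retry.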

### Mistake 2: Wrong Quantization Choice

```bash
# Wrong: Using Q4_0 (legacy, worse quality)
./llama-quantize model-f16.gguf model-q4.gguf Q4_0

# Right: Use Q4_K_M (better quality at a slightly larger size)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M
```

### Mistake 3: Forgetting to Build with CUDA

```bash
# Wrong: Building without CUDA (CPU only)
cmake -B build

# Right: Enable CUDA for DGX Spark
cmake -B build -DGGML_CUDA=ON
```

---

## Checkpoint

You've learned:

- **GGUF format**: Universal format for llama.cpp ecosystem
- **Quantization types**: Q2_K to Q8_0; K-quant variants (e.g. Q4_K_M) generally give better quality than the legacy `_0` types at a similar size
- **Conversion workflow**: HF -> F16 GGUF -> Quantized GGUF
- **GPU offloading**: Use `-ngl` for fast inference on DGX Spark

---

## Further Reading

- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)

---

## Cleanup

In [None]:
# Optional: Clean up generated files
# Uncomment to delete GGUF files

# import shutil
# if OUTPUT_DIR.exists():
#     shutil.rmtree(OUTPUT_DIR)
#     print(f"Cleaned up {OUTPUT_DIR}")

print("Notebook complete! Ready for Lab 3.2.7: Quality Benchmark Suite")

---

## Next Steps

In the next notebook, we'll create a **Quality Benchmark Suite** to compare all quantization methods!

➡️ Continue to: [Lab 3.2.7: Quality Benchmark Suite](lab-3.2.7-quality-benchmark-suite.ipynb)