# Task 11.4: GGUF Conversion

**Module:** 11 - Model Quantization & Optimization  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the GGUF format and its advantages
- [ ] Convert models to GGUF format
- [ ] Apply various quantization levels (Q2 to Q8)
- [ ] Run inference with llama.cpp
- [ ] Compare GGUF variants on quality and speed

---

## 📚 Prerequisites

- Completed: Tasks 11.1-11.3
- Knowledge of: Basic quantization concepts
- Hardware: DGX Spark with 128GB unified memory

---

## 🌍 Real-World Context

**The Problem:** GPTQ and AWQ are great for GPUs, but what if you want to:
- Run models on CPUs?
- Deploy to edge devices?
- Use a simple, dependency-light inference engine?

**Enter GGUF (GPT-Generated Unified Format)**:
- Created by Georgi Gerganov for llama.cpp
- Single-file format with embedded metadata
- Works on CPU, GPU, Apple Silicon, Android
- Supports 2-8 bit quantization
- Widely used for local LLM deployments

---

## 🧒 ELI5: What is GGUF?

> **Imagine you're sending a recipe to a friend...**
>
> **Old way (PyTorch format):** Send them:
> - The ingredient list (weights)
> - Cooking instructions (model architecture)
> - Equipment needed (dependencies)
> - 10 separate files they need to organize
>
> **GGUF way:** Send ONE file that contains:
> - Everything they need
> - Instructions on how to read it
> - Works with any cooking style (CPU/GPU/Metal)
>
> **In AI terms:** GGUF is a self-contained format that packages the model, its architecture info, and quantization details into a single portable file that works everywhere.

---
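
Because a GGUF file is self-describing, its first bytes can be inspected without any model library. The sketch below reads the fixed-size header, assuming the GGUF version 2 or later layout (a 4-byte magic `GGUF`, a little-endian `uint32` version, then `uint64` tensor and metadata counts). `read_gguf_header` is an illustrative helper name, not part of llama.cpp.

```python
import struct

def read_gguf_header(path):
    """Return the fixed-size fields at the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Assumed layout (GGUF v2+), little-endian:
        # uint32 version, uint64 tensor count, uint64 metadata key/value count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }
```

Once the conversion cell in Part 3 has produced a file, calling `read_gguf_header(gguf_f16_path)` should report how many tensors and metadata entries it contains.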

## Part 1: Understanding GGUF Quantization Types

GGUF supports many quantization variants, each with different tradeoffs:

| Type | Bits/Weight | Quality | Speed | Use Case |
|------|-------------|---------|-------|----------|
| F16 | 16 | Best | Baseline | Full precision |
| Q8_0 | 8 | Excellent | Fast | Quality-critical |
| Q6_K | 6.6 | Very Good | Fast | Balanced |
| Q5_K_M | 5.5 | Good | Faster | Recommended default |
| Q4_K_M | 4.8 | Good | Faster | Popular choice |
| Q4_0 | 4 | OK | Fast | Memory constrained |
| Q3_K_M | 3.4 | Fair | Fastest | Very small |
| Q2_K | 2.6 | Poor | Fastest | Extreme compression |

### The "K" Variants

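The `K` suffix marks "k-quants", a smarter quantization scheme that:
- Groups weights into super-blocks whose per-block scales are themselves quantized
- Mixes precisions in the `_S`/`_M`/`_L` variants, giving sensitive tensors (such as parts of attention) more bits
- Gives better quality than the older uniform types (e.g., `Q4_0`) at a similar size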
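
For contrast with k-quants, the simpler non-K types can be pictured as plain block quantization. The sketch below is a simplified NumPy illustration, not llama.cpp's actual kernel: it assumes blocks of 32 weights, symmetric absmax scaling, and one float16 scale per block. The function names are hypothetical. It also shows why the effective bits per weight sit slightly above the nominal bit width: the scale has to be stored too.

```python
import numpy as np

BLOCK_SIZE = 32  # weights per block (assumption for this sketch)

def quantize_blocks(weights, bits=8):
    """Symmetric absmax quantization with one float16 scale per block."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, BLOCK_SIZE)
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit codes
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0   # avoid dividing by zero for all-zero blocks
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    return scales.astype(np.float16), q

def dequantize_blocks(scales, q):
    """Reconstruct approximate float32 weights from scales and integer codes."""
    return (scales.astype(np.float32) * q.astype(np.float32)).reshape(-1)

def effective_bits_per_weight(bits=8, scale_bits=16, block_size=BLOCK_SIZE):
    """Nominal bits plus the per-block scale amortized over the block."""
    return bits + scale_bits / block_size

rng = np.random.default_rng(0)
w = rng.normal(size=4 * BLOCK_SIZE).astype(np.float32)
scales, q = quantize_blocks(w)
max_err = np.abs(w - dequantize_blocks(scales, q)).max()
print(f"max reconstruction error: {max_err:.4f}")
print(f"effective bits/weight: {effective_bits_per_weight():.2f}")
```

K-quants refine this idea by quantizing the scales as well and by choosing different precisions for different tensors, which is where their quality advantage comes from.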

In [None]:
import os
import subprocess
import time
import gc

print("="*60)
print("DGX Spark Environment Check")
print("="*60)

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Visualize quantization type comparison
import matplotlib.pyplot as plt
import numpy as np

# Note: Quality and speed values are approximate estimates.
# Actual performance varies by model architecture, hardware, and workload.
# Always benchmark on your specific use case!
quant_types = {
    'F16':     {'bits': 16.0, 'quality': 100, 'speed': 50},
    'Q8_0':    {'bits': 8.0,  'quality': 98,  'speed': 70},
    'Q6_K':    {'bits': 6.6,  'quality': 95,  'speed': 75},
    'Q5_K_M':  {'bits': 5.5,  'quality': 92,  'speed': 80},
    'Q4_K_M':  {'bits': 4.8,  'quality': 88,  'speed': 85},
    'Q4_0':    {'bits': 4.0,  'quality': 82,  'speed': 88},
    'Q3_K_M':  {'bits': 3.4,  'quality': 75,  'speed': 92},
    'Q2_K':    {'bits': 2.6,  'quality': 60,  'speed': 95},
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

names = list(quant_types.keys())
bits = [quant_types[n]['bits'] for n in names]
quality = [quant_types[n]['quality'] for n in names]
speed = [quant_types[n]['speed'] for n in names]

# Model size (relative to F16)
sizes = [b/16 * 100 for b in bits]

colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(names)))[::-1]

# Bits per weight
axes[0].barh(names, bits, color=colors)
axes[0].set_xlabel('Bits per Weight')
axes[0].set_title('Storage Efficiency')
axes[0].invert_yaxis()

# Quality score
axes[1].barh(names, quality, color=colors)
axes[1].set_xlabel('Quality Score (100=best)')
axes[1].set_title('Model Quality')
axes[1].set_xlim(50, 105)
axes[1].invert_yaxis()

# Size reduction
axes[2].barh(names, sizes, color=colors)
axes[2].set_xlabel('Size (% of F16)')
axes[2].set_title('Model Size')
axes[2].invert_yaxis()

plt.tight_layout()
plt.savefig('gguf_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)  # Free memory from figure

print("\n💡 Recommendation: Q4_K_M or Q5_K_M for best balance!")
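
The bits-per-weight figures translate directly into approximate file sizes: parameters times bits, divided by eight. The sketch below applies that arithmetic to a hypothetical 7-billion-parameter model using the approximate values from the table in Part 1. It ignores metadata and any tensors kept at higher precision, so real files will differ somewhat.

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model at a few levels from the table above
for name, bpw in [("F16", 16.0), ("Q8_0", 8.0), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name:<8} ~{estimate_gguf_size_gb(7e9, bpw):.1f} GB")
```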

---

## Part 2: Setting Up llama.cpp

llama.cpp is a C/C++ inference engine that:
- Runs on CPU (with AVX/AVX2/AVX512 optimizations)
- Supports CUDA GPU acceleration
- Has Apple Metal support
- Is incredibly fast and efficient

In [None]:
# Clone and build llama.cpp
# This needs to be done once

import os
import subprocess

# Allow override via environment variable
LLAMA_CPP_DIR = os.environ.get("LLAMA_CPP_DIR", os.path.expanduser("~/llama.cpp"))

print(f"llama.cpp directory: {LLAMA_CPP_DIR}")

if os.path.exists(LLAMA_CPP_DIR):
    print("✅ Directory already exists")

    # Check git status
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=LLAMA_CPP_DIR,
            capture_output=True,
            text=True
        )
        if result.returncode == 0:
            print(f"  Current commit: {result.stdout.strip()}")

        # Check for updates
        result = subprocess.run(
            ["git", "fetch", "--dry-run"],
            cwd=LLAMA_CPP_DIR,
            capture_output=True,
            text=True
        )
        if result.stderr:
            print(f"  ℹ️  Updates may be available. Run: cd {LLAMA_CPP_DIR} && git pull")
    except OSError:
        # git is missing or could not be run; the status check is optional
        pass

    print("\n💡 To use a different location, set LLAMA_CPP_DIR environment variable")
    print("   Example: export LLAMA_CPP_DIR=/path/to/llama.cpp")
else:
    print("Cloning llama.cpp...")
    result = subprocess.run(
        ["git", "clone", "https://github.com/ggerganov/llama.cpp.git", LLAMA_CPP_DIR],
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        print("✅ Clone successful!")
    else:
        print(f"❌ Clone failed: {result.stderr}")
        raise RuntimeError("Failed to clone llama.cpp")

In [None]:
# Build llama.cpp with CUDA support (for DGX Spark)
print("Building llama.cpp with CUDA support...")
print("(This may take a few minutes)")

build_cmd = f"""
cd {LLAMA_CPP_DIR} && \
cmake -B build -DGGML_CUDA=ON && \
cmake --build build --config Release -j$(nproc)
"""

import time as _time
_build_start = _time.time()

result = subprocess.run(build_cmd, shell=True, capture_output=True, text=True)

_build_time = _time.time() - _build_start

if result.returncode == 0:
    print(f"✅ Build successful! (took {_build_time:.1f}s)")
else:
    print("❌ BUILD FAILED")
    print("\nBuild errors:")
    # Show last 2000 chars of error output
    error_output = result.stderr[-2000:] if len(result.stderr) > 2000 else result.stderr
    print(error_output)
    print("\nPossible solutions:")
    print("  1. Ensure CUDA toolkit is installed: nvcc --version")
    print("  2. Ensure cmake is installed: cmake --version")
    print("  3. Check that you're running in an NGC container with CUDA support")
    raise RuntimeError("llama.cpp build failed - cannot continue without compiled binaries")

In [None]:
# Verify build succeeded before continuing
# This cell will fail early with a clear error if binaries are missing

quantize_bin = os.path.join(LLAMA_CPP_DIR, "build", "bin", "llama-quantize")
main_bin = os.path.join(LLAMA_CPP_DIR, "build", "bin", "llama-cli")

print("Checking for compiled binaries...")

# Check primary paths first
if not os.path.exists(quantize_bin):
    # Check alternative paths (older llama.cpp versions)
    alt_quantize = os.path.join(LLAMA_CPP_DIR, "quantize")
    alt_main = os.path.join(LLAMA_CPP_DIR, "main")
    
    if os.path.exists(alt_quantize):
        quantize_bin = alt_quantize
        main_bin = alt_main
        print("  Using legacy binary paths (older llama.cpp version)")

# Final verification - cannot continue without binaries
if not os.path.exists(quantize_bin):
    print("❌ ERROR: llama.cpp binaries not found!")
    print("\nChecked the following locations:")
    print(f"  - {os.path.join(LLAMA_CPP_DIR, 'build', 'bin', 'llama-quantize')}")
    print(f"  - {os.path.join(LLAMA_CPP_DIR, 'quantize')}")
    print("\nPossible solutions:")
    print("  1. Run the build cell above successfully first")
    print("  2. Check that cmake and CUDA toolkit are installed")
    print("  3. Ensure you're using an NGC container with build tools")
    raise FileNotFoundError("llama.cpp binaries not found - please run the build cell first")

print(f"  llama-quantize: ✓ Found at {quantize_bin}")
print(f"  llama-cli:      {'✓ Found' if os.path.exists(main_bin) else '⚠ Not found (optional)'}")

---

## Part 3: Converting Models to GGUF

The conversion process:
1. **Download/load the HuggingFace model**
2. **Convert to GGUF format (F16)**
3. **Quantize to desired precision**

In [None]:
# Install required packages for conversion
!pip install sentencepiece gguf --quiet

print("Conversion dependencies installed!")

In [None]:
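# Choose a model to convert
# Using a small model for this demo
# Note: the converter only handles architectures that llama.cpp implements.
# If it rejects this OPT checkpoint as unsupported, switch to a small
# Llama-family model such as "TinyLlama/TinyLlama-1.1B-Chat-v1.0".
model_id = "facebook/opt-350m"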

# For larger models:
# model_id = "meta-llama/Llama-2-7b-hf"  # Requires HF login
# model_id = "mistralai/Mistral-7B-v0.1"

output_dir = "./gguf_models"
os.makedirs(output_dir, exist_ok=True)

print(f"Model: {model_id}")
print(f"Output directory: {output_dir}")

In [None]:
# First, download the model from HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

print("Downloading model from HuggingFace...")

# Create a local directory for the HF model
hf_model_dir = os.path.join(output_dir, "hf_model")
os.makedirs(hf_model_dir, exist_ok=True)

# Download with error handling for network issues
try:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
except Exception as e:
    print(f"❌ Failed to download model: {e}")
    print("\nPossible solutions:")
    print("  1. Check your internet connection")
    print("  2. Verify the model ID is correct")
    print("  3. For gated models, run: huggingface-cli login")
    raise

tokenizer.save_pretrained(hf_model_dir)
model.save_pretrained(hf_model_dir)

print(f"✅ Model saved to {hf_model_dir}")

# Free memory
del model
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Convert to GGUF F16 format
print("Converting to GGUF F16 format...")

gguf_f16_path = os.path.join(output_dir, "model-f16.gguf")

# Check for different script names (llama.cpp has renamed this several times)
possible_scripts = [
    os.path.join(LLAMA_CPP_DIR, "convert_hf_to_gguf.py"),
    os.path.join(LLAMA_CPP_DIR, "convert-hf-to-gguf.py"),
    os.path.join(LLAMA_CPP_DIR, "convert.py"),
]

convert_script = None
for script in possible_scripts:
    if os.path.exists(script):
        convert_script = script
        break

if convert_script is None:
    print("❌ ERROR: Conversion script not found!")
    print("\nChecked the following locations:")
    for script in possible_scripts:
        print(f"  - {script}")
    print("\nPossible solutions:")
    print(f"  1. Update llama.cpp: cd {LLAMA_CPP_DIR} && git pull")
    print("  2. Check that the repository was cloned correctly")
    raise FileNotFoundError("Conversion script not found in llama.cpp directory")

print(f"Using conversion script: {os.path.basename(convert_script)}")

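import sys

# Pass arguments as a list (no shell) so paths with spaces work, and use the
# notebook's own interpreter so the freshly installed `gguf` package is found
convert_cmd = [
    sys.executable, convert_script,
    hf_model_dir,
    "--outfile", gguf_f16_path,
    "--outtype", "f16",
]

result = subprocess.run(convert_cmd, capture_output=True, text=True)

# Check the exit code too: a stale file from an earlier run must not count as success
if result.returncode == 0 and os.path.exists(gguf_f16_path):
    size_mb = os.path.getsize(gguf_f16_path) / 1e6
    print(f"✓ F16 GGUF created: {gguf_f16_path}")
    print(f"  Size: {size_mb:.1f} MB")
else:
    print("❌ Conversion failed!")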
    print("\nConversion output:")
    print(result.stdout[-1500:] if len(result.stdout) > 1500 else result.stdout)
    print("\nError output:")
    print(result.stderr[-1500:] if len(result.stderr) > 1500 else result.stderr)
    raise RuntimeError("GGUF conversion failed - see output above for details")

In [None]:
# Quantize to different precisions
quant_types_to_create = ['Q8_0', 'Q5_K_M', 'Q4_K_M', 'Q4_0', 'Q2_K']

quantized_models = {}

print("Quantizing to different precision levels...")
print("=" * 60)

for qtype in quant_types_to_create:
    output_path = os.path.join(output_dir, f"model-{qtype}.gguf")
    
    print(f"\nCreating {qtype}...")
    
    start_time = time.time()
    
    # Pass arguments as a list (no shell) so paths with spaces work
    quant_cmd = [quantize_bin, gguf_f16_path, output_path, qtype]
    result = subprocess.run(quant_cmd, capture_output=True, text=True)
    
    # Require a zero exit code so a leftover file from an earlier run is not reported as success
    if result.returncode == 0 and os.path.exists(output_path):
        size_mb = os.path.getsize(output_path) / 1e6
        quant_time = time.time() - start_time
        quantized_models[qtype] = {
            'path': output_path,
            'size_mb': size_mb,
            'quant_time': quant_time
        }
        print(f"  ✅ Size: {size_mb:.1f} MB (took {quant_time:.1f}s)")
    else:
        print(f"  ❌ Failed: {result.stderr}")

print("\n" + "=" * 60)
print("Quantization complete!")

In [None]:
# Summary of created files
print("\nGGUF Files Created:")
print("=" * 60)
print(f"{'Type':<12} {'Size (MB)':>12} {'vs F16':>12} {'Compression':>12}")
print("-" * 60)

f16_size = os.path.getsize(gguf_f16_path) / 1e6
print(f"{'F16':<12} {f16_size:>12.1f} {'baseline':>12} {'1.0x':>12}")

for qtype, data in quantized_models.items():
    ratio = f16_size / data['size_mb']
    reduction_pct = (1 - data['size_mb'] / f16_size) * 100
    reduction_str = f"{reduction_pct:.0f}% less"
    print(f"{qtype:<12} {data['size_mb']:>12.1f} {reduction_str:>12} {ratio:>11.1f}x")

print("=" * 60)

---

## Part 4: Running Inference with llama.cpp

Now let's test our GGUF models!

In [None]:
import re

def run_llama_cpp_inference(model_path, prompt, n_tokens=50, n_gpu_layers=99):
    """
    Run inference with llama.cpp.
    
    Args:
        model_path: Path to GGUF file
        prompt: Input prompt
        n_tokens: Number of tokens to generate
        n_gpu_layers: Number of layers to offload to GPU (99 = all)
    
    Returns:
        dict: Contains output, total_time, and tokens_per_sec
    """
    # Pass arguments as a list (no shell) so quotes in the prompt are safe
    cmd = [
        main_bin,
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_tokens),
        "-ngl", str(n_gpu_layers),
        "--temp", "0",
    ]
    
    start_time = time.time()
    # Merge stderr into stdout (the timing summary is logged there) and close
    # stdin so the CLI cannot block waiting for interactive input
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    total_time = time.time() - start_time
    
    output = result.stdout
    
    # Parse generation speed from the timing summary. The "eval time" line ends
    # with "(X ms per token, Y tokens per second)"; we want Y, and we skip the
    # "prompt eval time" line, which reports prompt processing instead.
    tokens_per_sec = None
    for line in output.split('\n'):
        lowered = line.lower()
        if 'eval time' in lowered and 'prompt eval' not in lowered:
            match = re.search(r'([\d.]+)\s+tokens per second', lowered)
            if match:
                tokens_per_sec = float(match.group(1))
    
    return {
        'output': output,
        'total_time': total_time,
        # Fall back to a wall-clock estimate (this includes model load time)
        'tokens_per_sec': tokens_per_sec or n_tokens / total_time
    }

print("Inference function defined!")

In [None]:
# Test inference with different quantization levels
test_prompt = "The future of artificial intelligence is"
n_tokens = 50

print(f"Prompt: '{test_prompt}'")
print(f"Generating {n_tokens} tokens...\n")
print("=" * 60)

inference_results = {}

# Test F16 first
print("\nTesting F16...")
result = run_llama_cpp_inference(gguf_f16_path, test_prompt, n_tokens)
inference_results['F16'] = result
print(f"  Speed: {result['tokens_per_sec']:.1f} tok/s")

# Test quantized versions
for qtype, data in quantized_models.items():
    print(f"\nTesting {qtype}...")
    result = run_llama_cpp_inference(data['path'], test_prompt, n_tokens)
    inference_results[qtype] = result
    print(f"  Speed: {result['tokens_per_sec']:.1f} tok/s")

In [None]:
# Compare results
print("\n" + "=" * 70)
print("GGUF Inference Comparison")
print("=" * 70)
print(f"{'Type':<12} {'Size (MB)':>12} {'Tok/s':>12} {'Speedup':>12}")
print("-" * 70)

baseline_speed = inference_results['F16']['tokens_per_sec']

print(f"{'F16':<12} {f16_size:>12.1f} {baseline_speed:>12.1f} {'1.0x':>12}")

for qtype, data in quantized_models.items():
    speed = inference_results[qtype]['tokens_per_sec']
    speedup = speed / baseline_speed
    print(f"{qtype:<12} {data['size_mb']:>12.1f} {speed:>12.1f} {speedup:>11.2f}x")

print("=" * 70)

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

types = ['F16'] + list(quantized_models.keys())
sizes = [f16_size] + [quantized_models[t]['size_mb'] for t in quantized_models.keys()]
speeds = [inference_results[t]['tokens_per_sec'] for t in types]

colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(types)))

# Size comparison
axes[0].barh(types, sizes, color=colors)
axes[0].set_xlabel('Size (MB)')
axes[0].set_title('Model Size')
axes[0].invert_yaxis()
for i, v in enumerate(sizes):
    axes[0].text(v + 5, i, f'{v:.0f}', va='center')

# Speed comparison
axes[1].barh(types, speeds, color=colors)
axes[1].set_xlabel('Tokens/second')
axes[1].set_title('Inference Speed')
axes[1].invert_yaxis()
for i, v in enumerate(speeds):
    axes[1].text(v + 0.5, i, f'{v:.1f}', va='center')

plt.tight_layout()
plt.savefig('gguf_inference_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)  # Free memory from figure
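
One way to reason about why smaller files tend to generate faster: during single-stream decoding, each new token requires reading roughly all of the weights once, so memory bandwidth sets a ceiling on tokens per second. The sketch below is a rule-of-thumb upper bound only. It ignores compute, KV-cache reads, and fixed per-token overheads (which can dominate for tiny models like the demo one), and the 100 GB/s bandwidth and 7B model sizes are hypothetical inputs, not measurements.

```python
def max_decode_tokens_per_sec(model_size_gb, memory_bandwidth_gb_s):
    """Upper bound if every generated token must read all weights once."""
    return memory_bandwidth_gb_s / model_size_gb

# Hypothetical numbers: a 100 GB/s memory system and a 7B model at two precisions
for name, size_gb in [("F16", 14.0), ("Q4_K_M", 4.2)]:
    bound = max_decode_tokens_per_sec(size_gb, 100.0)
    print(f"{name:<8} <= {bound:.1f} tok/s")
```

If the measured speedups in the table above are much smaller than the size ratios suggest, that usually indicates the run is not bandwidth-bound, so benchmark on the model you actually plan to deploy.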

---

## Part 5: Using llama-cpp-python for Python Integration

If you want to use GGUF models in Python, `llama-cpp-python` provides a convenient wrapper.

In [None]:
# Install llama-cpp-python with CUDA support
# Note: On DGX Spark (ARM64), this compiles from source - may take several minutes

import subprocess
import os
import sys

print("Installing llama-cpp-python with CUDA support...")
print("⚠️  On ARM64 (DGX Spark), this compiles from source.")
print("   This may take 5-10 minutes. Please be patient...")

# Set environment for CUDA build
env = os.environ.copy()
env["CMAKE_ARGS"] = "-DGGML_CUDA=on"

# Use the notebook's own interpreter so the package lands in this kernel's environment
result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "llama-cpp-python", "--no-cache-dir", "--force-reinstall"],
    env=env,
    capture_output=True,
    text=True
)

if result.returncode != 0:
    print("❌ Installation failed!")
    print("Error output (last 1500 chars):")
    print(result.stderr[-1500:] if len(result.stderr) > 1500 else result.stderr)
    print("\n💡 Possible solutions:")
    print("  1. Ensure you're in an NGC container with build tools")
    print("  2. Try: apt-get update && apt-get install -y cmake build-essential")
    print("  3. Use a pre-built container with llama-cpp-python installed")
    raise RuntimeError("llama-cpp-python installation failed")
else:
    print("✅ Installation successful!")

In [None]:
from llama_cpp import Llama

# Load a quantized model
model_path = quantized_models['Q4_K_M']['path']

print(f"Loading {model_path}...")

llm = Llama(
    model_path=model_path,
    n_ctx=2048,       # Context window
    n_gpu_layers=99,  # Offload all layers to GPU
    verbose=False
)

print("Model loaded!")

In [None]:
# Run inference with Python API
prompt = "Explain machine learning in simple terms:"

print(f"Prompt: {prompt}\n")
print("Response:")
print("-" * 40)

start_time = time.time()

output = llm(
    prompt,
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
    echo=False
)

inference_time = time.time() - start_time

response_text = output['choices'][0]['text']
print(response_text)

print("-" * 40)
print(f"\nGenerated {output['usage']['completion_tokens']} tokens in {inference_time:.2f}s")
print(f"Speed: {output['usage']['completion_tokens']/inference_time:.1f} tok/s")
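
For interactive use, it is often nicer to print tokens as they arrive rather than waiting for the full completion. The sketch below assumes that calling the model with `stream=True` yields chunk dictionaries shaped like the non-streaming result (text under `choices[0]["text"]`), which is how `llama-cpp-python` behaves as far as we know. `stream_text` is a hypothetical helper name.

```python
def stream_text(chunks):
    """Yield non-empty text pieces from an iterator of completion chunks."""
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        if piece:
            yield piece

# Usage with the model loaded above (requires llama-cpp-python and a GGUF file):
# for piece in stream_text(llm("Explain GGUF briefly:", max_tokens=64, stream=True)):
#     print(piece, end="", flush=True)
```

Keeping the chunk handling in a small generator also makes it easy to reuse in a web server that streams responses.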

In [None]:
# Clean up Python model
del llm
gc.collect()

---

## ✋ Try It Yourself

### Exercise 1: Convert a Different Model

Try converting a different model (e.g., Mistral-7B or Llama-2-7B) to GGUF format.

<details>
<summary>💡 Hint</summary>

```python
model_id = "mistralai/Mistral-7B-v0.1"
# Follow the same conversion steps
```
</details>

In [None]:
# TODO: Convert a different model
# YOUR CODE HERE

### Exercise 2: Perplexity Evaluation

Use llama.cpp's built-in perplexity tool to evaluate quality across quantization levels.

<details>
<summary>💡 Hint</summary>

```bash
./build/bin/llama-perplexity -m model.gguf -f wiki.test.txt
```
</details>

In [None]:
# TODO: Evaluate perplexity
# YOUR CODE HERE
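
As background for this exercise, perplexity is the exponential of the average negative log-probability the model assigns to each token of a reference text, so lower is better and a quantized model that drifts from the original will score higher. The sketch below shows only the arithmetic; it assumes you already have natural-log token probabilities from some source, and `perplexity` is an illustrative helper rather than part of llama.cpp.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 1/4 to every token is "as confused as"
# a uniform choice among four options
uniform_quarter = [math.log(0.25)] * 8
print(f"perplexity: {perplexity(uniform_quarter):.2f}")
```

When comparing quantization levels, what matters is the difference relative to the F16 baseline on the same text, not the absolute value.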

---

## ⚠️ Common Mistakes

### Mistake 1: Not Using GPU Acceleration

```bash
# ❌ Wrong: No GPU layers
./build/bin/llama-cli -m model.gguf -p "Hello"

# ✅ Right: Offload to GPU
./build/bin/llama-cli -m model.gguf -p "Hello" -ngl 99
```

**Why:** Without `-ngl`, everything runs on CPU. Use `-ngl 99` to offload all layers to GPU.

### Mistake 2: Wrong Quantization Type for Use Case

```python
# ❌ Wrong: Q2_K for critical applications
model = "model-Q2_K.gguf"  # Quality is too low!

# ✅ Right: Use Q4_K_M or higher for important tasks
model = "model-Q4_K_M.gguf"
```

**Why:** Q2_K has significant quality loss. Use Q4_K_M for the best balance.
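
A simple way to pick a type is to take the highest-precision variant whose weights fit in your memory budget, leaving headroom for the KV cache and runtime buffers. The sketch below uses the approximate bits-per-weight values from the Part 1 table. The 80% headroom factor and the `pick_quant_type` helper are assumptions for illustration, not a llama.cpp feature.

```python
# Approximate bits per weight from the Part 1 table, best quality first
QUANT_BITS = [
    ("Q8_0", 8.0), ("Q6_K", 6.6), ("Q5_K_M", 5.5),
    ("Q4_K_M", 4.8), ("Q3_K_M", 3.4), ("Q2_K", 2.6),
]

def pick_quant_type(n_params, budget_gb, headroom=0.8):
    """Return the highest-precision type whose weights fit in headroom * budget."""
    for name, bpw in QUANT_BITS:
        size_gb = n_params * bpw / 8 / 1e9
        if size_gb <= budget_gb * headroom:
            return name
    return None  # nothing fits; consider a smaller model

# Hypothetical 7B-parameter model on devices with different memory budgets
for budget in (16, 8, 6):
    print(f"{budget:>2} GB budget -> {pick_quant_type(7e9, budget)}")
```

Treat the result as a starting point and confirm quality with a perplexity run or task-specific evaluation.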

### Mistake 3: Not Setting Context Size

```python
# ❌ Wrong: Default context may be too small
llm = Llama(model_path=path)

# ✅ Right: Set appropriate context size
llm = Llama(model_path=path, n_ctx=4096)
```

**Why:** `llama-cpp-python` defaults `n_ctx` to 512 tokens, which longer prompts will exceed. Set it based on your needs.
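
A larger context is not free: the KV cache grows linearly with `n_ctx`. The sketch below estimates its size with the usual formula (keys plus values, for every layer, position, and KV head). The Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128) and the 16-bit cache precision are assumptions for the example; check your model's config for the real values.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, position, and KV head."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head dimension 128
for n_ctx in (512, 4096):
    gib = kv_cache_bytes(32, n_ctx, 32, 128) / 1024**3
    print(f"n_ctx={n_ctx:>5}: ~{gib:.2f} GiB of KV cache at 16-bit precision")
```

Models that use grouped-query attention have fewer KV heads and therefore a proportionally smaller cache.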

---

## 🎉 Checkpoint

You've completed the learning objectives:

- [x] **Understand the GGUF format and its advantages** - Single-file, portable, works everywhere
- [x] **Convert models to GGUF format** - Using llama.cpp's conversion scripts
- [x] **Apply various quantization levels (Q2 to Q8)** - K-quants for smart quantization
- [x] **Run inference with llama.cpp** - Fast inference on CPU and GPU
- [x] **Compare GGUF variants on quality and speed** - Q4_K_M recommended for best balance

### Key Takeaways

- ✅ **GGUF format**: Single-file, portable, works everywhere
- ✅ **K-quants**: Smart quantization that protects important layers
- ✅ **Q4_K_M recommended**: Best balance of size and quality
- ✅ **llama.cpp**: Fast inference on CPU and GPU
- ✅ **Python integration**: llama-cpp-python for easy use

---

## 🚀 Challenge (Optional)

**Build a GGUF Model Server**

Create a simple FastAPI server that:
1. Loads a GGUF model
2. Exposes a `/generate` endpoint
3. Supports streaming responses

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 100):
    # YOUR CODE HERE
    pass
```

---

## 📖 Further Reading

- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- [TheBloke GGUF Models](https://huggingface.co/TheBloke) (Pre-converted models!)

---

## 🧹 Cleanup

In [None]:
# Clean up GGUF files (optional - comment out to keep them)
import shutil

# Uncomment to delete:
# shutil.rmtree(output_dir, ignore_errors=True)

gc.collect()
torch.cuda.empty_cache()

print("Cleanup complete!")
print(f"GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

---

## Next Steps

In the next notebook, we'll explore **Blackwell FP4 quantization**, a 4-bit floating-point format with native hardware support on the DGX Spark's Blackwell GPU.

➡️ Continue to: [05-fp4-deep-dive.ipynb](05-fp4-deep-dive.ipynb)