# Lab 3.1.7: Ollama Integration - Deploy Your Fine-Tuned Model

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Merge LoRA weights with the base model
- [ ] Convert the merged model to GGUF format
- [ ] Import your fine-tuned model into Ollama
- [ ] Test the model locally
- [ ] Benchmark performance

---

## Why Ollama?

**Ollama** runs LLMs locally behind a command-line tool and a REST API. After fine-tuning, deploying to Ollama gives you:

- **Easy API access**: Simple REST API for integration
- **Optimized inference**: Uses llama.cpp under the hood
- **Model management**: Easy to load, switch, and share models
- **Community integration**: Works with many tools and UIs

---

## ELI5: From Training to Production

> **Imagine you just trained a new chef (fine-tuning).** Now you want them to work in your restaurant.
>
> **Step 1 - Merge the Knowledge:**  
> Your chef learned new skills (LoRA) but still uses their original training (base model). Merging combines everything into one complete chef.
>
> **Step 2 - Pack Their Tools:**  
> Converting to GGUF is like packing their cooking tools into a standardized, efficient kit that works anywhere.
>
> **Step 3 - Open the Restaurant:**  
> Ollama is like opening your restaurant - customers can now place orders (send prompts) and get dishes (responses).
>
> **The result:** Your fine-tuned model is ready to serve, accessible via simple commands or API calls!

---

## Prerequisites

Before starting, ensure you have:
- A fine-tuned LoRA adapter from Task 10.2 or 10.3
- Ollama installed on your DGX Spark
- llama.cpp for GGUF conversion

In [None]:
# Setup and imports
import os
import json
import subprocess
import shutil
import time  # Required for benchmarking
from pathlib import Path
from typing import Optional, Dict, List
import requests

# Check Ollama installation
def check_ollama():
    """Check if Ollama is installed and running."""
    try:
        result = subprocess.run(['ollama', '--version'], capture_output=True, text=True)
        print(f"Ollama version: {result.stdout.strip()}")
        
        # Check if Ollama server is running
        try:
            response = requests.get('http://localhost:11434/api/tags', timeout=5)
            if response.status_code == 200:
                print("Ollama server is running")
                models = response.json().get('models', [])
                print(f"Installed models: {len(models)}")
                return True
            # The server answered, but not with the expected status
            print(f"Ollama server returned HTTP {response.status_code}")
            return False
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            print("Ollama server is not reachable. Start with: ollama serve")
            return False
    except FileNotFoundError:
        print("Ollama is not installed.")
        print("Install with: curl -fsSL https://ollama.com/install.sh | sh")
        return False

check_ollama()

In [None]:
# Installation instructions
installation_guide = """
INSTALLATION GUIDE
==================

1. INSTALL OLLAMA
-----------------
# Linux (including DGX Spark)
curl -fsSL https://ollama.com/install.sh | sh

# Verify
ollama --version

# Start the server
ollama serve

2. INSTALL LLAMA.CPP (for GGUF conversion)
-----------------------------------------
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

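# Install the Python dependencies used by the conversion script
pip install -r requirements.txt

# Build with CUDA support (for DGX Spark)
# Recent llama.cpp versions build with CMake; binaries such as
# llama-quantize land in build/bin/
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Older checkouts used "make LLAMA_CUDA=1" and placed binaries in the repo root

# Optional: Python bindings for running GGUF models (not needed for conversion)
pip install llama-cpp-python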

3. VERIFY SETUP
---------------
# Check Ollama
ollama list

# Check llama.cpp
ls llama.cpp/convert*.py
"""

print(installation_guide)
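
Before moving on, it can help to confirm that the tools from the guide above are actually reachable. The following is a minimal sketch of such a check; the function name, the `./llama.cpp` location, and the two candidate binary paths are assumptions about a typical checkout, not fixed conventions.

```python
import shutil
from pathlib import Path

def check_prerequisites(llama_cpp_dir: str = "./llama.cpp") -> dict:
    """Report which deployment tools can be found (all paths are assumptions)."""
    root = Path(llama_cpp_dir)
    # CMake builds usually place binaries in build/bin; older Makefile builds used the repo root
    quantize_candidates = [
        root / "build" / "bin" / "llama-quantize",
        root / "llama-quantize",
    ]
    return {
        "ollama_cli": shutil.which("ollama") is not None,
        "convert_script": (root / "convert_hf_to_gguf.py").is_file(),
        "quantize_binary": any(p.is_file() for p in quantize_candidates),
    }

for tool, found in check_prerequisites().items():
    print(f"{tool:16s} {'found' if found else 'MISSING'}")
```

A missing `quantize_binary` only matters if you plan to produce K-quants; the conversion script alone does not need a compiled binary.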

---

## Part 1: Merging LoRA Weights

First, we need to merge the LoRA adapter with the base model.
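
Conceptually, merging a standard LoRA adapter folds the low-rank update into each adapted weight matrix: `W_merged = W + (alpha / r) * B @ A`. The toy sketch below checks this on tiny hand-written matrices in plain Python. The values and helper names are made up for illustration; in the lab, PEFT's `merge_and_unload()` performs the real merge on the model's tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A for list-of-rows matrices."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# Toy 2x2 base weight with a rank-1 adapter (r = 1, alpha = 2)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (r, in_features)
B = [[0.5], [0.25]]    # shape (out_features, r)
x = [[1.0], [1.0]]     # one input column vector

W_merged = merge_lora(W, A, B, alpha=2, r=1)

# Adapter path: W x + scale * B (A x); merged path: W_merged x
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[bo[0] + 2.0 * lo[0]] for bo, lo in zip(base_out, lora_out)]
merged_out = matmul(W_merged, x)

print("Merged weight:", W_merged)
print("Outputs match:", adapter_out == merged_out)
```

Because the merged matrix has the same shape as the original, the merged model needs no PEFT code at inference time, which is what makes the GGUF conversion in Part 2 possible.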

In [None]:
import torch
import gc

# Configuration
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # Base model
ADAPTER_PATH = "./llama3-8b-lora-adapter"  # Your LoRA adapter
MERGED_OUTPUT_PATH = "./llama3-8b-merged"  # Output path

print(f"Base model: {BASE_MODEL}")
print(f"Adapter path: {ADAPTER_PATH}")
print(f"Output path: {MERGED_OUTPUT_PATH}")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def merge_lora_weights(
    base_model_path: str,
    adapter_path: str,
    output_path: str,
    push_to_hub: bool = False,
):
    """
    Merge LoRA adapter weights into the base model.
    
    Args:
        base_model_path: HuggingFace model ID or local path
        adapter_path: Path to LoRA adapter
        output_path: Where to save merged model
        push_to_hub: Whether to upload to HuggingFace Hub
    """
    print("Step 1: Loading base model...")
    
    # Load base model in float16 for merging
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    
    print(f"  Base model loaded: {sum(p.numel() for p in base_model.parameters()):,} parameters")
    
    print("\nStep 2: Loading LoRA adapter...")
    
    # Load adapter
    model = PeftModel.from_pretrained(
        base_model,
        adapter_path,
        torch_dtype=torch.float16,
    )
    
    print("  Adapter loaded")
    
    print("\nStep 3: Merging weights...")
    
    # Merge and unload adapter
    model = model.merge_and_unload()
    
    print("  Weights merged")
    
    print("\nStep 4: Saving merged model...")
    
    # Save merged model
    model.save_pretrained(output_path, safe_serialization=True)
    
    # Save tokenizer too
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    tokenizer.save_pretrained(output_path)
    
    print(f"  Model saved to {output_path}")
    
    # Calculate size
    total_size = sum(
        f.stat().st_size for f in Path(output_path).glob('**/*') if f.is_file()
    )
    print(f"  Total size: {total_size / 1e9:.2f} GB")
    
    # Cleanup
    del model, base_model
    gc.collect()
    torch.cuda.empty_cache()
    
    return output_path

print("Merge function ready!")
print("\nTo merge, run:")
print(f"merge_lora_weights('{BASE_MODEL}', '{ADAPTER_PATH}', '{MERGED_OUTPUT_PATH}')")

In [None]:
# Uncomment to actually merge (requires adapter from Task 10.2)
# merged_path = merge_lora_weights(BASE_MODEL, ADAPTER_PATH, MERGED_OUTPUT_PATH)

# For demo purposes we only record the output path here;
# the directory exists only after the merge above has been run
merged_path = MERGED_OUTPUT_PATH

---

## Part 2: Converting to GGUF Format

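GGUF (commonly expanded as GPT-Generated Unified Format) is the single-file model format used by llama.cpp and Ollama. It stores the weights together with the tokenizer and model metadata.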

In [None]:
# GGUF conversion options
gguf_quantization_options = """
GGUF QUANTIZATION OPTIONS
=========================

Approximate file sizes for an 8B-parameter model:

| Type     | Size    | Quality | Speed | Use Case                       |
|----------|---------|---------|-------|--------------------------------|
| F32      | ~32GB   | Best    | Slow  | Debugging only                 |
| F16      | ~16GB   | Best    | Good  | When you have lots of RAM      |
| Q8_0     | ~8.5GB  | Great   | Fast  | Highest-quality quantized file |
| Q6_K     | ~6.6GB  | Great   | Fast  | Good balance                   |
| Q5_K_M   | ~5.7GB  | Good    | Fast  | Recommended for most users     |
| Q4_K_M   | ~4.9GB  | Good    | Fast  | When memory is tight           |
| Q4_0     | ~4.7GB  | Fair    | Fast  | Legacy 4-bit format            |
| Q3_K_M   | ~4.0GB  | Fair    | Fast  | Heavy compression              |
| Q2_K     | ~3.2GB  | Poor    | Fast  | Emergency use only             |

FOR DGX SPARK (128GB):
- F16 or Q8_0 recommended for best quality
- You have plenty of memory, prioritize quality

FOR 8B MODELS:
- Q8_0: ~8.5GB - Excellent quality, fits easily
- Q5_K_M: ~5.7GB - Great quality, smaller

FOR 70B MODELS:
- Q4_K_M: ~40GB - Good quality, fits in 128GB
- Q5_K_M: ~50GB - Better quality, still fits
"""

print(gguf_quantization_options)
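
The sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below applies that to the formats whose block layout is easy to state. It is a rough estimate only: it ignores file metadata and the fact that mixed schemes such as Q5_K_M keep some tensors at higher precision, and the function name is illustrative.

```python
# Bits per weight for formats with a simple, uniform block layout
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 int8 weights + one fp16 scale per block = 34 bytes / 32 weights
    "Q4_0": 4.5,  # 16 bytes of 4-bit weights + one fp16 scale = 18 bytes / 32 weights
}

def estimate_gguf_size_gb(n_params: float, quant: str) -> float:
    """Estimate file size in GB (1e9 bytes) from parameter count and quantization type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# Example: an 8-billion-parameter model
for quant in BITS_PER_WEIGHT:
    print(f"{quant:5s} ~{estimate_gguf_size_gb(8e9, quant):.1f} GB")
```

The same function with `n_params=70e9` gives a quick sanity check on whether a 70B model fits in memory at a given quantization level.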

In [None]:
def convert_to_gguf(
    model_path: str,
    output_path: str,
    quantization: str = "Q5_K_M",
    llama_cpp_path: str = "./llama.cpp",
):
    """
    Convert a HuggingFace model to GGUF format.
    
    Args:
        model_path: Path to merged HF model
        output_path: Where to save GGUF file
        quantization: Quantization type (e.g., Q5_K_M)
        llama_cpp_path: Path to llama.cpp directory
    """
    print("Step 1: Converting to GGUF (FP16)...")
    
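    # First, convert to GGUF FP16
    # Derive the intermediate name from the file stem so it can never
    # collide with output_path, even if output_path lacks a .gguf suffix
    out = Path(output_path)
    fp16_output = str(out.with_name(f"{out.stem}-f16.gguf"))
    
    convert_script = f"{llama_cpp_path}/convert_hf_to_gguf.py"
    
    cmd = [
        "python", convert_script,
        model_path,
        "--outfile", fp16_output,
        "--outtype", "f16",
    ]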
    
    print(f"  Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        return None
    
    print(f"  FP16 GGUF created: {fp16_output}")
    
    if quantization.upper() != "F16":
        print(f"\nStep 2: Quantizing to {quantization}...")
        
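        # CMake builds place the binary in build/bin/; older Makefile builds used the repo root
        quantize_bin = f"{llama_cpp_path}/build/bin/llama-quantize"
        if not os.path.exists(quantize_bin):
            quantize_bin = f"{llama_cpp_path}/llama-quantize"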
        
        cmd = [
            quantize_bin,
            fp16_output,
            output_path,
            quantization,
        ]
        
        print(f"  Running: {' '.join(cmd)}")
        result = subprocess.run(cmd, capture_output=True, text=True)
        
        if result.returncode != 0:
            print(f"Error: {result.stderr}")
            return fp16_output  # Return FP16 if quantization fails
        
        print(f"  Quantized GGUF created: {output_path}")
        
        # Cleanup FP16 intermediate
        os.remove(fp16_output)
    else:
        output_path = fp16_output
    
    # Check file size
    size = os.path.getsize(output_path) / 1e9
    print(f"\nFinal GGUF size: {size:.2f} GB")
    
    return output_path

print("GGUF conversion function ready!")
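
After conversion, a quick way to confirm that the output really is a GGUF file is to read its fixed-size header. The sketch below assumes a little-endian file in GGUF version 2 or later, where the 4-byte magic `GGUF` is followed by a `uint32` version and two `uint64` counts. The function name is illustrative and not part of llama.cpp.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the GGUF magic, version, tensor count, and metadata key count."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor_count, uint64 metadata_kv_count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}

# Example (uncomment once the conversion has produced a file):
# print(read_gguf_header("./llama3-8b-finetuned.gguf"))
```

A `ValueError` here usually means the conversion step failed part-way or the path points at the Hugging Face checkpoint rather than the GGUF output.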

In [None]:
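# Alternative: conversion routes that do not need a compiled llama.cpp
python_conversion = """
PYTHON-BASED CONVERSION (Alternative)
=====================================

If you don't have llama.cpp compiled, these routes still work:

1. Run only the conversion script (pure Python, no build step):
   pip install -r llama.cpp/requirements.txt
   python llama.cpp/convert_hf_to_gguf.py ./merged-model \\
       --outfile ./model-q8_0.gguf \\
       --outtype q8_0

   The script writes F32, F16, BF16, or Q8_0 directly.
   K-quants such as Q5_K_M still need the compiled llama-quantize binary.

2. Download a pre-quantized GGUF, if one has been published:
   pip install huggingface_hub
   
   from huggingface_hub import snapshot_download
   
   snapshot_download(
       repo_id="your-username/your-model-GGUF",
       local_dir="./model-gguf"
   )

3. Merge with LLaMA Factory export, then convert as in route 1.
   The export step writes a Hugging Face checkpoint, not a GGUF file:
   llamafactory-cli export \\
       --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \\
       --adapter_name_or_path ./llama3-8b-lora-adapter \\
       --template llama3 \\
       --export_dir ./merged-model \\
       --export_legacy_format false

Note: llama-cpp-python (pip install llama-cpp-python) loads GGUF files
for inference from Python; it does not perform the conversion.
"""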

print(python_conversion)

---

## Part 3: Importing to Ollama

Once you have the GGUF file, importing to Ollama is straightforward.

In [None]:
def create_modelfile(gguf_path: str, model_name: str, output_path: str = "./Modelfile"):
    """
    Create an Ollama Modelfile for importing a GGUF model.
    
    Args:
        gguf_path: Path to GGUF file
        model_name: Name for the model in Ollama
        output_path: Where to save Modelfile
    """
    
    # Get absolute path
    gguf_abs_path = os.path.abspath(gguf_path)
    
    modelfile_content = f'''# Modelfile for {model_name}
# Fine-tuned model imported from GGUF

# Base model from GGUF file
FROM {gguf_abs_path}

# Model parameters (adjust as needed)
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512

# Stop generating at the Llama 3 end-of-turn token
PARAMETER stop "<|eot_id|>"

# System prompt (customize for your fine-tuned model)
SYSTEM """You are a helpful AI assistant that has been fine-tuned for specific tasks. 
You provide accurate, helpful, and concise responses."""

# Chat template (for Llama 3.1 style)
TEMPLATE """{{{{ if .System }}}}<|start_header_id|>system<|end_header_id|>

{{{{ .System }}}}<|eot_id|>{{{{ end }}}}{{{{ if .Prompt }}}}<|start_header_id|>user<|end_header_id|>

{{{{ .Prompt }}}}<|eot_id|>{{{{ end }}}}<|start_header_id|>assistant<|end_header_id|>

{{{{ .Response }}}}<|eot_id|>"""

# License (update as appropriate)
LICENSE """This model is based on Llama 3.1 and subject to Meta's license.
Fine-tuning performed by [Your Name]."""
'''
    
    with open(output_path, 'w') as f:
        f.write(modelfile_content)
    
    print(f"Modelfile created: {output_path}")
    print(f"\nTo import into Ollama, run:")
    print(f"  ollama create {model_name} -f {output_path}")
    
    return output_path

# Example
# create_modelfile("./model.gguf", "my-finetuned-llama")
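
The quadruple braces in the `TEMPLATE` section above are there because the Modelfile text is built with an f-string. In an f-string, `{{` renders as a single `{`, so Ollama's Go-template markers `{{ ... }}` have to be written as `{{{{ ... }}}}`. A small illustration, using a hypothetical model name:

```python
model_name = "my-finetuned-llama"  # hypothetical name, for illustration only

# "{model_name}" is substituted; "{{{{" and "}}}}" collapse to "{{" and "}}"
line = f"# {model_name}: {{{{ .Prompt }}}}"
print(line)
```

If the braces were written as `{{ .Prompt }}` inside the f-string, the Modelfile would contain `{ .Prompt }`, which Ollama would treat as literal text rather than a template variable.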

In [None]:
# Complete import workflow
# Note: the outer string uses ''' because the Modelfile itself contains """
import_workflow = '''
COMPLETE OLLAMA IMPORT WORKFLOW
================================

OPTION 1: Using Modelfile (Recommended)
----------------------------------------

# Step 1: Create Modelfile
cat > Modelfile << 'EOF'
FROM ./my-model-q5.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a helpful assistant."

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
EOF

# Step 2: Create model in Ollama
ollama create my-finetuned-model -f Modelfile

# Step 3: Verify import
ollama list

# Step 4: Test the model
ollama run my-finetuned-model "Hello, test prompt"


OPTION 2: Minimal Modelfile (Simpler)
-------------------------------------

# A Modelfile with only a FROM line is enough; Ollama uses default settings
echo "FROM ./my-model.gguf" > Modelfile
ollama create my-model -f Modelfile


OPTION 3: Push to Ollama Library
--------------------------------

# If you want to share your model
# First, create account at ollama.com

# Tag your model
ollama cp my-model username/my-model

# Push to library
ollama push username/my-model

# Others can then use:
ollama pull username/my-model
'''

print(import_workflow)
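
The import step can also be driven from Python instead of the shell. The following is a minimal sketch that wraps `ollama create` with `subprocess`; the function names are hypothetical helpers, and the only CLI behavior relied on is the `ollama create <name> -f <Modelfile>` form shown in the workflow above.

```python
import shutil
import subprocess

def build_create_command(model_name: str, modelfile: str = "./Modelfile") -> list:
    """Build the argument list for `ollama create`."""
    return ["ollama", "create", model_name, "-f", modelfile]

def import_model(model_name: str, modelfile: str = "./Modelfile") -> bool:
    """Run `ollama create` and report whether it succeeded."""
    if shutil.which("ollama") is None:
        print("ollama CLI not found on PATH")
        return False
    result = subprocess.run(
        build_create_command(model_name, modelfile),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"Import failed: {result.stderr.strip()}")
        return False
    print(f"Imported {model_name}")
    return True

# Example (requires a Modelfile and a running Ollama server):
# import_model("my-finetuned-model", "./Modelfile")
```

Passing the command as a list avoids shell quoting problems when the model name or path contains spaces.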

---

## Part 4: Testing the Imported Model

In [None]:
import requests
import json

class OllamaClient:
    """
    Simple client for interacting with Ollama API.
    
    Args:
        base_url: Ollama server URL (default: http://localhost:11434)
        timeout: Request timeout in seconds (default: 120)
                 Note: Large model generation can take 30-60+ seconds
    """
    
    def __init__(self, base_url: str = "http://localhost:11434", timeout: int = 120):
        self.base_url = base_url
        self.timeout = timeout  # Default timeout in seconds
    
    def list_models(self) -> list:
        """List all available models."""
        try:
            response = requests.get(f"{self.base_url}/api/tags", timeout=self.timeout)
            response.raise_for_status()
            return response.json().get('models', [])
        except requests.exceptions.Timeout:
            print(f"Error: Request timed out after {self.timeout}s while listing models.")
            print("  Hint: If Ollama is loading a large model, try increasing timeout.")
            return []
        except requests.exceptions.ConnectionError:
            print(f"Error: Cannot connect to Ollama at {self.base_url}")
            print("  Hint: Make sure Ollama is running with 'ollama serve'")
            return []
        except requests.exceptions.RequestException as e:
            print(f"Error listing models: {e}")
            return []
    
    def generate(
        self,
        model: str,
        prompt: str,
        system: Optional[str] = None,
        stream: bool = False,
    ) -> str:
        """
        Generate a response from a model.
        
        Args:
            model: Model name
            prompt: User prompt
            system: Optional system prompt
            stream: Whether to stream response
        
        Returns:
            Generated text or error message
        
        Note:
            Generation can take 30-60+ seconds for large models or long responses.
            Increase timeout in __init__ if you encounter timeout errors.
        """
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream,
        }
        
        if system:
            payload["system"] = system
        
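        try:
            response = requests.post(
                f"{self.base_url}/api/generate",
                json=payload,
                timeout=self.timeout,
                stream=stream,
            )
            response.raise_for_status()
            if stream:
                # Streamed replies arrive as one JSON object per line; join the text pieces
                chunks = (json.loads(line) for line in response.iter_lines() if line)
                return ''.join(chunk.get('response', '') for chunk in chunks)
            return response.json().get('response', '')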
        except requests.exceptions.Timeout:
            return (
                f"Error: Request timed out after {self.timeout}s.\n"
                f"Possible causes:\n"
                f"  1. Model '{model}' is still loading (first request after model change)\n"
                f"  2. Response generation is taking longer than expected\n"
                f"  3. System is under heavy load\n"
                f"Solutions:\n"
                f"  - Create client with higher timeout: OllamaClient(timeout=300)\n"
                f"  - Use a smaller/faster model\n"
                f"  - Reduce prompt complexity or lower num_predict"
            )
        except requests.exceptions.ConnectionError:
            return (
                f"Error: Cannot connect to Ollama at {self.base_url}\n"
                f"Make sure Ollama is running with 'ollama serve'"
            )
        except requests.exceptions.RequestException as e:
            return f"Error: {str(e)}"
    
    def chat(
        self,
        model: str,
        messages: list,
        stream: bool = False,
    ) -> str:
        """
        Chat with a model.
        
        Args:
            model: Model name
            messages: List of {"role": ..., "content": ...}
            stream: Whether to stream response
        
        Returns:
            Assistant's response or error message
        """
        payload = {
            "model": model,
            "messages": messages,
            "stream": stream,
        }
        
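        try:
            response = requests.post(
                f"{self.base_url}/api/chat",
                json=payload,
                timeout=self.timeout,
                stream=stream,
            )
            response.raise_for_status()
            if stream:
                # Streamed replies arrive as one JSON object per line; join the message pieces
                chunks = (json.loads(line) for line in response.iter_lines() if line)
                return ''.join(chunk.get('message', {}).get('content', '') for chunk in chunks)
            return response.json().get('message', {}).get('content', '')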
        except requests.exceptions.Timeout:
            return (
                f"Error: Chat request timed out after {self.timeout}s.\n"
                f"Try creating client with higher timeout: OllamaClient(timeout=300)"
            )
        except requests.exceptions.ConnectionError:
            return f"Error: Cannot connect to Ollama at {self.base_url}"
        except requests.exceptions.RequestException as e:
            return f"Error: {str(e)}"


# Create client with 120 second timeout (appropriate for large model inference)
# Increase timeout if you experience timeout errors with large models
client = OllamaClient(timeout=120)

# List models
try:
    models = client.list_models()
    print("Available Ollama models:")
    for m in models:
        print(f"  - {m['name']}: {m.get('size', 'unknown size')}")
    if not models:
        print("  (No models found - try 'ollama pull llama3.2' to download a model)")
except Exception as e:
    print(f"Could not connect to Ollama: {e}")
    print("Make sure Ollama is running: ollama serve")

In [None]:
# Test prompts for evaluating fine-tuned model
test_prompts = [
    "What is machine learning?",
    "Explain the difference between supervised and unsupervised learning.",
    "Write a Python function to calculate the factorial of a number.",
    "What are the benefits of using LoRA for fine-tuning?",
]

def test_model(model_name: str, prompts: list, client: OllamaClient):
    """
    Test a model with multiple prompts.
    """
    print(f"Testing model: {model_name}")
    print("=" * 60)
    
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("-" * 40)
        
        try:
            response = client.generate(model_name, prompt)
            # Truncate long responses for display
            preview = f"{response[:500]}..." if len(response) > 500 else response
            print(f"Response: {preview}")
        except Exception as e:
            print(f"Error: {e}")
        
        print()

# Uncomment to test (requires model to be loaded in Ollama)
# test_model("my-finetuned-model", test_prompts, client)
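
For interactive testing it is often nicer to print text as it is generated. With `"stream": true`, Ollama's `/api/generate` endpoint returns newline-delimited JSON, where each line carries a `response` fragment and the final line has `done` set to true. The sketch below parses such lines; the helper name is illustrative, and the sample lines are hand-written stand-ins for real server output.

```python
import json
from typing import Iterable, Iterator

def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break

# Hand-written sample chunks in the shape of a streamed reply
sample_lines = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true}',
]
print("".join(iter_stream_text(sample_lines)))
```

Against a live server, the same function can consume `response.iter_lines()` from `requests.post(..., json=payload, stream=True)`, printing each fragment with `end=""` as it arrives.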

---

## Part 5: Performance Benchmarking

In [None]:
import time
from typing import Dict, List

def benchmark_model(
    model_name: str,
    prompts: List[str],
    client: OllamaClient,
    num_runs: int = 3,
) -> Dict:
    """
    Benchmark model performance.
    
    Returns:
        Dictionary with timing statistics
    """
    results = {
        "model": model_name,
        "num_prompts": len(prompts),
        "num_runs": num_runs,
        "times": [],
        "tokens": [],
    }
    
    print(f"Benchmarking {model_name}...")
    
    for run in range(num_runs):
        print(f"  Run {run + 1}/{num_runs}")
        
        for prompt in prompts:
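            # perf_counter is a monotonic clock, better suited to timing than time.time()
            start = time.perf_counter()
            response = client.generate(model_name, prompt)
            elapsed = time.perf_counter() - start
            
            # Rough token estimate (words * 1.3). Note that generate() returns
            # error text on failure, which would be counted here as well.
            tokens = len(response.split()) * 1.3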
            
            results["times"].append(elapsed)
            results["tokens"].append(tokens)
    
    # Calculate statistics
    results["avg_time"] = sum(results["times"]) / len(results["times"])
    results["avg_tokens"] = sum(results["tokens"]) / len(results["tokens"])
    results["tokens_per_second"] = results["avg_tokens"] / results["avg_time"]
    
    print(f"\nResults for {model_name}:")
    print(f"  Average response time: {results['avg_time']:.2f}s")
    print(f"  Average tokens: {results['avg_tokens']:.0f}")
    print(f"  Tokens/second: {results['tokens_per_second']:.1f}")
    
    return results

print("Benchmark function ready!")
print("\nUsage: results = benchmark_model('my-model', test_prompts, client)")
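
The word-count estimate above is only approximate. A non-streaming `/api/generate` reply also includes the server's own counters, `eval_count` (generated tokens) and `eval_duration` (nanoseconds spent generating), which give a more direct throughput figure that excludes model load time. The sketch below computes tokens per second from such a reply; the helper name and the sample numbers are made up for illustration.

```python
def tokens_per_second(result: dict) -> float:
    """Generation speed from Ollama's counters (eval_duration is in nanoseconds)."""
    duration_ns = result.get("eval_duration", 0)
    if not duration_ns:
        return 0.0
    return result.get("eval_count", 0) / (duration_ns / 1e9)

# Made-up reply fields: 100 tokens generated in 2 seconds
sample_result = {"eval_count": 100, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample_result):.1f} tokens/s")
```

To use this with the client above, the raw JSON reply would need to be kept (for example by calling `requests.post(...).json()` directly), since `generate()` returns only the text.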

In [None]:
# Comparison template
comparison_template = """
MODEL COMPARISON TEMPLATE
=========================

| Metric             | Base Model | Fine-tuned | Improvement |
|--------------------|------------|------------|-------------|
| Response Time (s)  |            |            |             |
| Tokens/Second      |            |            |             |
| Quality (1-5)      |            |            |             |
| Relevance (1-5)    |            |            |             |
| Accuracy (%)       |            |            |             |

QUALITATIVE ASSESSMENT
----------------------

Strengths of fine-tuned model:
1. 
2. 
3. 

Areas for improvement:
1. 
2. 

Recommendations:
1. 
2. 
"""

print(comparison_template)
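
When filling in the "Improvement" column, keep in mind that the direction differs by metric: lower is better for response time, higher is better for tokens per second. A small helper along these lines keeps the sign consistent; the function name and sample values are illustrative only.

```python
def improvement_pct(base: float, tuned: float, higher_is_better: bool = True) -> float:
    """Percent improvement of tuned over base; positive always means 'better'."""
    change = (tuned - base) / base * 100
    return change if higher_is_better else -change

# Made-up measurements for illustration
print(f"Tokens/second: {improvement_pct(40.0, 50.0):+.1f}%")
print(f"Response time: {improvement_pct(2.0, 1.5, higher_is_better=False):+.1f}%")
```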

---

## Part 6: Complete Pipeline Script

In [None]:
# Complete end-to-end deployment script
deployment_script = '''#!/bin/bash
# deploy_finetuned_model.sh
# Complete pipeline to deploy a fine-tuned model to Ollama

set -e  # Exit on error

# Configuration (the first three are exported so the Python step can read them)
export BASE_MODEL="meta-llama/Llama-3.1-8B-Instruct"
export ADAPTER_PATH="./llama3-8b-lora-adapter"
export MERGED_PATH="./llama3-8b-merged"
GGUF_PATH="./llama3-8b-finetuned.gguf"
F16_PATH="${GGUF_PATH%.gguf}-f16.gguf"
MODEL_NAME="my-finetuned-llama"
QUANTIZATION="Q5_K_M"
# CMake builds place the binary in build/bin; older Makefile builds used ./llama.cpp/llama-quantize
QUANTIZE_BIN="./llama.cpp/build/bin/llama-quantize"

echo "====================================="
echo "Fine-tuned Model Deployment Pipeline"
echo "====================================="

# Step 1: Merge LoRA weights
echo
echo "Step 1: Merging LoRA weights..."
# The quoted heredoc disables shell expansion, so paths are read from the environment
python << 'PYEOF'
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = os.environ["BASE_MODEL"]
merged_path = os.environ["MERGED_PATH"]

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, os.environ["ADAPTER_PATH"])
model = model.merge_and_unload()
model.save_pretrained(merged_path)
AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_path)
print("Merge complete!")
PYEOF

# Step 2: Convert to GGUF
echo
echo "Step 2: Converting to GGUF..."
python llama.cpp/convert_hf_to_gguf.py "$MERGED_PATH" --outfile "$F16_PATH" --outtype f16
"$QUANTIZE_BIN" "$F16_PATH" "$GGUF_PATH" "$QUANTIZATION"
rm "$F16_PATH"  # Cleanup

# Step 3: Create Modelfile
echo
echo "Step 3: Creating Modelfile..."
cat > Modelfile << EOF
FROM $GGUF_PATH

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a helpful AI assistant."
EOF

# Step 4: Import to Ollama
echo
echo "Step 4: Importing to Ollama..."
ollama create "$MODEL_NAME" -f Modelfile

# Step 5: Verify
echo
echo "Step 5: Verifying..."
ollama list | grep "$MODEL_NAME"

echo
echo "====================================="
echo "Deployment complete!"
echo "Test with: ollama run $MODEL_NAME"
echo "====================================="
'''

print(deployment_script)
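
To run the pipeline rather than just print it, the script text has to be written to a file and marked executable. The sketch below shows one way to do that; the helper name is hypothetical, and the default file name is taken from the comment at the top of the script. Leading whitespace is stripped because the `#!/bin/bash` line only takes effect when it is the very first line of the file.

```python
import stat
from pathlib import Path

def save_script(script_text: str, path: str = "./deploy_finetuned_model.sh") -> Path:
    """Write a shell script to disk and set the owner-executable bit."""
    target = Path(path)
    # The shebang only works if it is the first line, so drop leading blank lines
    target.write_text(script_text.lstrip())
    target.chmod(target.stat().st_mode | stat.S_IXUSR)
    return target

# Example (uncomment to write the script defined above):
# script_path = save_script(deployment_script)
# print(f"Run with: {script_path}")
```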

---

## Checkpoint

You've learned:
- ✅ How to merge LoRA weights with base model
- ✅ How to convert to GGUF format with various quantization levels
- ✅ How to create Modelfiles and import to Ollama
- ✅ How to test and benchmark deployed models
- ✅ Complete deployment pipeline automation

---

## Congratulations!

You've completed Module 3.1: Large Language Model Fine-Tuning!

**Your achievements:**
- Understood LoRA theory and implemented it from scratch
- Fine-tuned an 8B model with LoRA
- Fine-tuned a **70B model with QLoRA** on DGX Spark!
- Created professional instruction datasets
- Trained with DPO for preference optimization
- Explored LLaMA Factory GUI
- Deployed your model to Ollama

---

## Further Reading

- [Ollama Documentation](https://github.com/ollama/ollama)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)

---

## Next Steps

Continue to **[Module 3.2: Quantization & Optimization](../module-3.2-quantization/)** to learn advanced techniques for making your models faster and smaller!

---

## Cleanup

In [None]:
# Cleanup
import gc

# Clear any loaded models from memory
if 'model' in dir():
    del model
if 'tokenizer' in dir():
    del tokenizer

gc.collect()

# Clear GPU cache if available
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("GPU cache cleared")
except ImportError:
    pass

print("Cleanup complete!")