<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Deploy GGUF Checkpoints with NIM

This notebook shows you how to deploy memory-efficient quantized models in GGUF format with NVIDIA NIM. This approach suits running large models on consumer GPUs or fitting more models on each server.

## What You'll Build

By the end of this notebook, you'll be able to:
- ✅ Deploy quantized models that use 50-75% less memory
- ✅ Run large models on consumer GPUs (8GB-16GB VRAM)
- ✅ Choose the right quantization level for your needs
- ✅ Handle GGUF's special configuration requirements

## When to Use This Approach

**Choose this notebook if you:**
- Have limited GPU memory (8GB-16GB VRAM)
- Want to run larger models on smaller GPUs
- Need to deploy multiple models on one GPU
- Can accept slight quality trade-offs for efficiency

**Consider other notebooks if you:**
- Have plenty of GPU memory (→ See Notebook 1: HuggingFace)
- Need maximum quality/performance (→ See Notebook 2: TensorRT-LLM)

## Understanding GGUF Quantization

Quantization reduces model size by using fewer bits for weights:

| Format | Model Size | Quality | Use Case |
|--------|------------|---------|----------|
| Full Precision | 100% (baseline) | Perfect | Research, fine-tuning |
| Q8_0 | ~33% | Near-perfect | Quality-focused deployment |
| Q5_K_M | ~22% | Excellent | Balanced deployment |
| Q4_K_M | ~18% | Very Good | **Recommended** - best balance |
| Q3_K_M | ~14% | Good | Memory-constrained |

**Example**: Llama-3.2-3B
- Full precision (FP32, about 3.2B parameters × 4 bytes): ~13GB → Won't fit in the 12GB of an RTX 3060
- Q4_K_M: ~2.1GB → Runs comfortably on 8GB GPUs
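
These sizes follow roughly from parameter count times bits per weight. Here is a minimal sketch of that arithmetic. It assumes about 3.2 billion parameters for Llama-3.2-3B and approximate bits-per-weight figures; the K-quant values are rough averages that vary by model. `estimate_size_gb` is a hypothetical helper, not part of NIM.

```python
# Approximate average bits per weight; assumptions for illustration only.
BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "Q8_0": 8.5,    # 32 int8 weights plus one fp16 scale per block
    "Q5_K_M": 5.7,  # rough average, varies by model
    "Q4_K_M": 4.9,  # rough average, varies by model
}


def estimate_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    num_params = 3.2e9  # assumed parameter count for Llama-3.2-3B
    for name, bpw in BITS_PER_WEIGHT.items():
        print(f"{name:>7}: ~{estimate_size_gb(num_params, bpw):.1f} GB")
```

Real GGUF files differ somewhat from these estimates because some tensors are stored at a different precision and the file also carries metadata.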

## The GGUF Challenge

GGUF repositories typically don't ship the Hugging Face `config.json` (and tokenizer files) that NIM needs, so we need to:
1. Download the GGUF model file
2. Get the config.json from the original model
3. Organize them correctly for NIM

Don't worry - we'll walk through this step-by-step!
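
The three steps above can be sketched in plain Python. This is an illustration only: `assemble_model_dir` and its arguments are hypothetical names, and the layout (one GGUF file plus the JSON config files in a single directory) is the one this notebook uses later.

```python
import shutil
from pathlib import Path


def assemble_model_dir(gguf_file, config_dir, target_dir):
    """Copy one GGUF file plus all JSON config files into target_dir."""
    gguf_file, config_dir, target_dir = Path(gguf_file), Path(config_dir), Path(target_dir)
    if not (config_dir / "config.json").is_file():
        raise FileNotFoundError(f"config.json not found in {config_dir}")

    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(gguf_file, target_dir / gguf_file.name)

    # config.json, tokenizer.json, tokenizer_config.json, if present
    for json_file in sorted(config_dir.glob("*.json")):
        shutil.copy2(json_file, target_dir / json_file.name)

    # Return the resulting file names so the layout can be inspected
    return sorted(p.name for p in target_dir.iterdir())
```

The function returns the names of the files in the assembled directory, which makes it easy to confirm that both the `.gguf` file and `config.json` are present before mounting the directory into the container.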

## What's Covered

This tutorial includes:
* **Setup**: Understanding GGUF requirements
* **Example 1**: Deploying pre-downloaded GGUF models locally
* **Example 2**: Comparing different quantization levels
* **Example 3**: Custom deployment configurations
* **Bonus**: Quick reference for all quantization options

## Prerequisites

### Hardware Requirements

GGUF deployment is more resource-friendly than full-precision models:

- **GPU**: NVIDIA GPU with at least 8GB VRAM (for Llama-3.2-3B with Q4_K_M quantization)
  - Recommended: RTX 4070, RTX 3080, or higher
  - Supported: RTX 3060 12GB, RTX 4060 Ti 16GB for smaller models
- **Driver**: NVIDIA Driver version 535 or higher
- **CUDA**: CUDA 12.0 or higher
- **System Memory**: At least 16GB RAM recommended
- **Storage**: 5-15GB free space depending on quantization level

**Model Size Estimates (Llama-3.2-3B):**
- Q4_K_M: ~2.1GB (recommended balance of quality/size)
- Q5_K_M: ~2.6GB (higher quality)
- Q8_0: ~3.2GB (highest quality quantized)

For detailed hardware specifications, refer to the [NIM LLM Documentation](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html).

### System Setup

First, let's verify your GPU setup and install necessary dependencies:


In [None]:
# Check GPU availability
!nvidia-smi

### Install Required Software


In [None]:
# Update system and install required packages
!sudo apt-get update && echo "✓ System packages updated successfully"
!sudo apt-get install git-lfs wget -y && echo "✓ Git LFS and wget installed successfully"
!git lfs install && echo "✓ Git LFS initialized successfully"

In [None]:
# Install Python dependencies
!pip install docker requests huggingface-hub && echo "✓ Python dependencies installed successfully"

In [None]:
# Import modules after installation so that docker and requests are available
import getpass
import os
import time

import docker
import requests

### Get API Keys

#### NVIDIA NGC API Key

An NVIDIA NGC API key is required to access the NVIDIA container registry (nvcr.io) and pull the NIM container image.
Refer to [Generating NGC API Keys](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) for more information.


In [None]:
# Prompt for the NGC API key unless a valid one is already set in the environment
if not os.environ.get("NGC_API_KEY", "").startswith("nvapi-"):
    ngc_api_key = getpass.getpass("Enter your NGC API Key: ")
    assert ngc_api_key.startswith("nvapi-"), "Not a valid key"
    os.environ["NGC_API_KEY"] = ngc_api_key
    print("✓ NGC API Key set successfully")

### Docker Login


In [None]:
# Login to NGC registry
!echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin && echo "✓ Docker login successful"

#### Hugging Face Token

You'll also need a [Hugging Face token](https://huggingface.co/settings/tokens) to download models.


In [None]:
if not os.environ.get("HF_USERNAME", ""):
    hf_username = getpass.getpass("Enter your Hugging Face Username: ")
    os.environ["HF_USERNAME"] = hf_username
    print("✓ Hugging Face username set successfully")

In [None]:
if not os.environ.get("HF_TOKEN", "").startswith("hf_"):
    hf_token = getpass.getpass("Enter your Hugging Face Token: ")
    assert hf_token.startswith("hf_"), "Not a valid token"
    os.environ["HF_TOKEN"] = hf_token
    print("✓ Hugging Face token set successfully")

### Setup NIM Container

Choose your NIM container image and pull it:


In [None]:
# Set the NIM image - using universal NIM for GGUF support
os.environ['NIM_IMAGE'] = "nvcr.io/nvidian/nim-llm-dev/universal-nim:1.11.0.rc6"
print(f"Using NIM image: {os.environ['NIM_IMAGE']}")

In [None]:
# Pull the NIM container image
!docker pull $NIM_IMAGE && echo "✓ NIM container image pulled successfully"

### Setup Common Variables


In [None]:
os.environ["CONTAINER_NAME"] = "GGUF-NIM"
os.environ["LOCAL_NIM_CACHE"] = os.path.expanduser("~/.cache/nim")
os.environ["GGUF_WORK_DIR"] = os.path.expanduser("~/gguf_models")

# Create necessary directories
os.makedirs(os.environ["LOCAL_NIM_CACHE"], exist_ok=True)
os.makedirs(os.environ["GGUF_WORK_DIR"], exist_ok=True)

print("✓ Directories created successfully")

### Utility Functions

Below are some utility functions we'll use in this notebook. These are for simplifying the process of deploying and monitoring NIMs in a notebook environment, and aren't required in general.


In [None]:
def check_service_ready_from_logs(container_name, print_logs=False, timeout=600):
    """
    Check if NIM service is ready by monitoring Docker logs for 'Application startup complete' message.

    Args:
        container_name (str): Name of the Docker container
        print_logs (bool): Whether to print logs while monitoring (default: False)
        timeout (int): Maximum time to wait in seconds (default: 600)

    Returns:
        bool: True if service is ready, False if timeout reached
    """
    print("Waiting for NIM service to start...")
    start_time = time.time()

    try:
        client = docker.from_env()
        container = client.containers.get(container_name)

        # Stream logs in real-time using the blocking generator
        log_buffer = ""
        for log_chunk in container.logs(stdout=True, stderr=True, follow=True, stream=True):
            # Check timeout (only evaluated when a new log chunk arrives)
            if time.time() - start_time > timeout:
                print(f"❌ Timeout reached ({timeout}s). Service may not have started properly.")
                return False

            # Decode chunk and add to buffer
            chunk = log_chunk.decode('utf-8', errors='ignore')
            log_buffer += chunk

            # Process complete lines
            while '\n' in log_buffer:
                line, log_buffer = log_buffer.split('\n', 1)
                line = line.strip()

                if print_logs and line:
                    print(f"[LOG] {line}")

                # Check for startup complete message
                if "Application startup complete" in line:
                    print("✓ Application startup complete! Service is ready.")
                    return True

    except Exception as e:
        print(f"❌ Error: {e}")
        return False

    print("❌ Log stream ended before startup completed. The container may have exited.")
    return False

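def check_service_ready(timeout=600, poll_interval=30):
    """Fallback health check using the HTTP endpoint. Returns True once ready, False on timeout."""
    url = 'http://localhost:8000/v1/health/ready'
    print("Checking service health endpoint...")
    deadline = time.time() + timeout

    while time.time() < deadline:
        try:
            response = requests.get(url, headers={'accept': 'application/json'}, timeout=10)
            if response.status_code == 200 and response.json().get("message") == "Service is ready.":
                print("✓ Service ready!")
                return True
        except (requests.RequestException, ValueError):
            # Service not reachable yet, or the response body was not valid JSON
            pass
        print("⏳ Still starting...")
        time.sleep(poll_interval)

    print(f"❌ Timeout reached ({timeout}s). Service may not have started properly.")
    return False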

def generate_text(model, prompt, max_tokens=250, temperature=0.7):
    """Generate text using the NIM service"""
    try:
        response = requests.post(
            "http://localhost:8000/v1/completions",
            json={
                "model": model,
                "prompt": prompt,
                "max_tokens": max_tokens,
                "temperature": temperature
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
        return None

print("✓ Utility functions loaded successfully")

## GGUF Deployment Examples

Let's explore how to deploy GGUF models locally using NIM.

## Example 1: Pre-download and Local GGUF Deployment

This example shows how to pre-download GGUF models and deploy them locally. This approach provides reliable offline usage and faster startup times since models are already available locally.

### Download External Config File

GGUF repositories don't include the config.json file needed by NIM. We need to download it from the original Llama-3.2-3B-Instruct repository:
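
Access to the meta-llama repositories on Hugging Face is gated, so requests for these files generally need to carry your access token. The sketch below shows how the download URL and the authorization header fit together. `build_download_request` is a hypothetical helper, and reading the token from `HF_TOKEN` assumes the variable was set earlier in this notebook.

```python
import os

HF_BASE_URL = "https://huggingface.co"


def build_download_request(repo_id, filename, token=None, revision="main"):
    """Return (url, headers) for fetching one file from a Hugging Face repository."""
    url = f"{HF_BASE_URL}/{repo_id}/resolve/{revision}/{filename}"
    # Gated repositories expect the token as a Bearer credential.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return url, headers


url, headers = build_download_request(
    "meta-llama/Llama-3.2-3B-Instruct",
    "config.json",
    token=os.environ.get("HF_TOKEN"),
)
print(url)
```

The resulting pair can be passed to `requests.get(url, headers=headers)`, or the same header can be given to `wget` through its `--header` option.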


In [None]:
# Create a temporary directory for the config file
config_temp_dir = os.path.expanduser("~/gguf_config_temp")
os.makedirs(config_temp_dir, exist_ok=True)
os.environ["CONFIG_TEMP_DIR"] = config_temp_dir

# Download config.json from the original Llama-3.2-3B-Instruct repository.
# The repository is gated, so the request must include your Hugging Face token.
print("Downloading config.json from original model repository...")
!wget --header="Authorization: Bearer $HF_TOKEN" -O "$CONFIG_TEMP_DIR/config.json" https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/resolve/main/config.json && echo "✓ Config file downloaded successfully"

# Also download tokenizer files that may be needed
!wget --header="Authorization: Bearer $HF_TOKEN" -O "$CONFIG_TEMP_DIR/tokenizer.json" https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/resolve/main/tokenizer.json 2>/dev/null || echo "tokenizer.json not found - continuing"
!wget --header="Authorization: Bearer $HF_TOKEN" -O "$CONFIG_TEMP_DIR/tokenizer_config.json" https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/resolve/main/tokenizer_config.json 2>/dev/null || echo "tokenizer_config.json not found - continuing"

print("✓ Configuration files downloaded successfully")

In [None]:
# Verify the config file was downloaded
!ls -la "$CONFIG_TEMP_DIR"
!head -5 "$CONFIG_TEMP_DIR/config.json"

### Download GGUF Model Locally


In [None]:
# Create separate directories for different quantizations
q4_model_path = os.path.join(os.environ["GGUF_WORK_DIR"], "Llama-3.2-3B-Instruct-Q4_K_M")
q8_model_path = os.path.join(os.environ["GGUF_WORK_DIR"], "Llama-3.2-3B-Instruct-Q8_0")

os.makedirs(q4_model_path, exist_ok=True)
os.makedirs(q8_model_path, exist_ok=True)

os.environ["Q4_MODEL_PATH"] = q4_model_path
os.environ["Q8_MODEL_PATH"] = q8_model_path

# Download specific GGUF model files to their respective directories
print("Downloading GGUF model files locally...")
print("This may take several minutes depending on your internet connection...")

# Download the Q4_K_M quantization
!wget -O "$Q4_MODEL_PATH/Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
  https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf && \
  echo "✓ Q4_K_M quantization downloaded successfully"

# Download the Q8_0 quantization
!wget -O "$Q8_MODEL_PATH/Llama-3.2-3B-Instruct-Q8_0.gguf" \
  https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q8_0.gguf && \
  echo "✓ Q8_0 quantization downloaded successfully"

print("✓ GGUF model files downloaded successfully")

In [None]:
# Verify the download and check available quantization files
!ls -la "$Q4_MODEL_PATH"/*.gguf
!ls -la "$Q8_MODEL_PATH"/*.gguf

In [None]:
# Copy configuration files to both quantization directories
!cp "$CONFIG_TEMP_DIR/config.json" "$Q4_MODEL_PATH/" && echo "✓ Config file copied to Q4_K_M directory"
!cp "$CONFIG_TEMP_DIR/tokenizer"*.json "$Q4_MODEL_PATH/" 2>/dev/null || echo "Some tokenizer files not found - continuing"

!cp "$CONFIG_TEMP_DIR/config.json" "$Q8_MODEL_PATH/" && echo "✓ Config file copied to Q8_0 directory"
!cp "$CONFIG_TEMP_DIR/tokenizer"*.json "$Q8_MODEL_PATH/" 2>/dev/null || echo "Some tokenizer files not found - continuing"

print("✓ Configuration files copied to all quantization directories")

In [None]:
# Verify the complete model setup for both quantizations
!echo "Q4_K_M model directory:"
!ls -la "$Q4_MODEL_PATH"
!echo
!echo "Q8_0 model directory:"
!ls -la "$Q8_MODEL_PATH"

### Deploy Local GGUF Model


In [None]:
# Deploy Q4_K_M quantization locally
print("Deploying Q4_K_M quantization locally...")

!docker run -it --rm \
  --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME="/opt/models/q4_model" \
  -e NIM_SERVED_MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" \
  -v "$Q4_MODEL_PATH:/opt/models/q4_model" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  -d \
  $NIM_IMAGE

In [None]:
# Use the log-based check (set print_logs=True to see detailed logs)
if not check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True):
    print("Falling back to health endpoint check...")
    check_service_ready()

### Test Local GGUF Deployment


In [None]:
# Test the Q4_K_M model deployment
result = generate_text(
    model="meta-llama/Llama-3.2-3B-Instruct",
    prompt="Explain the concept of machine learning in simple terms:",
    max_tokens=200
)
print("Q4_K_M Quantization Result:")
print("=" * 50)
print(result['choices'][0]['text'] if result else "Failed to generate text")

In [None]:
# Stop the current deployment
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"

## Available Quantization Levels

The bartowski/Llama-3.2-3B-Instruct-GGUF repository includes multiple quantization levels. Each quantization requires its own directory with the GGUF file and configuration files.

### Directory Structure for Each Quantization:

Each quantization needs to be organized as follows:
```
quantization_directory/
├── config.json                    # From original model repo
├── tokenizer.json                 # From original model repo
├── tokenizer_config.json          # From original model repo
└── model_name-QUANTIZATION.gguf   # The quantized model file
```
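
Before mounting a directory into the container, it can help to check that it matches this layout. The following is a small sketch under stated assumptions: `check_model_dir` is a hypothetical helper, and the rule of exactly one `.gguf` file per directory reflects this notebook's convention rather than a documented NIM check.

```python
from pathlib import Path

TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json")


def check_model_dir(model_dir):
    """Return a list of problems found in a quantization directory (empty if none)."""
    model_dir = Path(model_dir)
    problems = []

    gguf_files = sorted(model_dir.glob("*.gguf"))
    if len(gguf_files) != 1:
        problems.append(f"expected exactly one .gguf file, found {len(gguf_files)}")

    if not (model_dir / "config.json").is_file():
        problems.append("missing config.json")

    # Tokenizer files may be needed; report them separately from hard requirements
    for name in TOKENIZER_FILES:
        if not (model_dir / name).is_file():
            problems.append(f"missing {name} (may be needed)")

    return problems
```

For example, `check_model_dir(os.environ["Q4_MODEL_PATH"])` should return an empty list once the Q4_K_M directory from Example 1 has been populated.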

### Downloaded Quantizations:

**Q4_K_M (Recommended - Best Balance)**
- Directory: `$Q4_MODEL_PATH`
- Size: ~2.1GB
- Quality: Good balance of quality and efficiency
- Memory: ~4GB VRAM required

**Q8_0 (Highest Quality Quantized)**
- Directory: `$Q8_MODEL_PATH`
- Size: ~3.2GB
- Quality: Near full-precision quality
- Memory: ~6GB VRAM required

### Additional Quantizations Available:

**Q5_K_M (Higher Quality)**

In [None]:
# Create directory and download
os.environ["Q5_MODEL_PATH"] = os.path.join(os.environ["GGUF_WORK_DIR"], "Llama-3.2-3B-Instruct-Q5_K_M")
!mkdir -p "$Q5_MODEL_PATH"
!wget -O "$Q5_MODEL_PATH/Llama-3.2-3B-Instruct-Q5_K_M.gguf" \
  https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q5_K_M.gguf

# Copy configuration files
!cp "$CONFIG_TEMP_DIR"/*.json "$Q5_MODEL_PATH/"

# Deploy by passing these flags to `docker run` in place of the Q4 ones:
#   -e NIM_MODEL_NAME="/opt/models/q5_model" \
#   -v "$Q5_MODEL_PATH:/opt/models/q5_model" \

- Size: ~2.6GB
- Quality: Better quality than Q4_K_M
- Memory: ~5GB VRAM required
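
The VRAM figures quoted for Q4_K_M, Q5_K_M, and Q8_0 can be turned into a simple selection rule. This is a sketch only: `pick_quantization` is a hypothetical helper, the table holds this notebook's approximate numbers for Llama-3.2-3B, and the 1GB headroom is an arbitrary assumption.

```python
# Approximate VRAM requirements quoted in this notebook for Llama-3.2-3B, in GB.
VRAM_REQUIRED_GB = {"Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 6}


def pick_quantization(available_vram_gb, headroom_gb=1.0):
    """Return the highest-quality quantization that fits, or None if none fits."""
    budget = available_vram_gb - headroom_gb
    fitting = [q for q, need in VRAM_REQUIRED_GB.items() if need <= budget]
    # In this table, a higher VRAM requirement corresponds to higher quality.
    return max(fitting, key=VRAM_REQUIRED_GB.get) if fitting else None


if __name__ == "__main__":
    for vram in (4, 6, 8):
        print(f"{vram}GB available -> {pick_quantization(vram)}")
```

Actual memory use also depends on context length and concurrent requests, so treat the result as a starting point rather than a guarantee.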

**Other Available Quantizations:**
- Q2_K: Ultra-compressed (~1.3GB)
- Q3_K_M: Small size (~1.7GB)
- Q6_K: High quality (~2.9GB)
- IQ4_XS: Experimental quantization

To set up any additional quantization:

In [None]:
# Create directory for the quantization (replace the placeholder names before running)
os.environ["QUANT_DIR"] = os.path.join(os.environ["GGUF_WORK_DIR"], "Model-QUANTIZATION_NAME")
!mkdir -p "$QUANT_DIR"

# Download the GGUF file
!wget -O "$QUANT_DIR/FILENAME.gguf" \
  https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/FILENAME.gguf

# Copy configuration files
!cp "$CONFIG_TEMP_DIR"/*.json "$QUANT_DIR/"

## Example 2: Performance Comparison with Different Quantization Levels

Let's deploy different quantizations to understand the trade-offs. Since each quantization needs its own directory, we'll deploy them separately:

### Deploy Q8_0 Quantization (Higher Quality)

The Q8_0 directory prepared in Example 1 already holds the GGUF file and configuration files, so it can be mounted directly:


In [None]:
# Deploy Q8_0 quantization for comparison
print("Deploying Q8_0 quantization (higher quality, larger size)...")

!docker run -it --rm \
  --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME="/opt/models/q8_model" \
  -e NIM_SERVED_MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct-Q8" \
  -v "$Q8_MODEL_PATH:/opt/models/q8_model" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  -d \
  $NIM_IMAGE

In [None]:
# Use the log-based check
if not check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True):
    print("Falling back to health endpoint check...")
    check_service_ready()

In [None]:
# Test Q8_0 quantization
result = generate_text(
    model="meta-llama/Llama-3.2-3B-Instruct-Q8",
    prompt="Write a brief story about a robot learning to paint:",
    max_tokens=150
)
print("Q8_0 Quantization Result:")
print("=" * 50)
print(result['choices'][0]['text'] if result else "Failed to generate text")

In [None]:
# Stop the current deployment
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"

## Example 3: Custom Configuration Deployment

This example shows how to deploy with custom NIM parameters:


In [None]:
# Deploy Q4_K_M with custom parameters
print("Deploying with custom configuration...")

!docker run -it --rm \
  --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME="/opt/models/q4_model" \
  -e NIM_SERVED_MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" \
  -e NIM_MAX_INPUT_LENGTH=4096 \
  -e NIM_MAX_OUTPUT_LENGTH=1024 \
  -e NIM_MODEL_PROFILE="vllm" \
  -v "$Q4_MODEL_PATH:/opt/models/q4_model" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  -d \
  $NIM_IMAGE

In [None]:
# Use the log-based check
check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

In [None]:
# Test with longer input/output capabilities
result = generate_text(
    model="meta-llama/Llama-3.2-3B-Instruct",
    prompt="Create a detailed plan for building a web application using modern technologies. Include frontend, backend, database, and deployment considerations:",
    max_tokens=500
)
print("Custom Configuration Result:")
print("=" * 50)
print(result['choices'][0]['text'] if result else "Failed to generate text")

## Performance Testing


In [None]:
# Performance comparison test
test_prompts = [
    "What is artificial intelligence?",
    "Write a simple Python loop:",
    "Explain quantum computing:",
    "Create a recipe for chocolate cake:"
]

print("Performance testing with GGUF Q4_K_M:")
print("=" * 50)

for i, prompt in enumerate(test_prompts, 1):
    start_time = time.time()
    result = generate_text(
        model="meta-llama/Llama-3.2-3B-Instruct",
        prompt=prompt,
        max_tokens=100
    )
    end_time = time.time()

    print(f"Test {i}: {end_time - start_time:.2f}s ({'ok' if result else 'failed'}) - {prompt}")

## Cleanup


In [None]:
# Final cleanup
!docker stop $CONTAINER_NAME 2>/dev/null && echo "✓ Container stopped successfully" || echo "Container already stopped"

# Optional: Clean up downloaded models (uncomment if you want to save disk space)
# !rm -rf "$GGUF_WORK_DIR"
# !rm -rf "$CONFIG_TEMP_DIR"
# print("✓ Downloaded models cleaned up")

## Summary

This notebook demonstrated deploying GGUF checkpoints with NIM:

1. **Local Deployment**: Pre-downloading models with separate directories per quantization
2. **Quantization Options**: Understanding different quantization levels and their trade-offs
3. **Custom Configuration**: Deploying with custom NIM parameters

**Key Points:**
- GGUF models require external config.json files from the original model repository
- Each quantization level must be in its own directory with the GGUF file and configuration files
- Different quantization levels offer trade-offs between model size, quality, and memory usage
- Q4_K_M provides the best balance for most use cases
- Local deployment enables offline usage and faster startup times

**Next Steps:**
- Experiment with different quantization levels for your specific use case
- Try other GGUF models from the community
- Compare performance across different quantizations
- Set up additional quantization directories as needed
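
When comparing quantizations, wall-clock time per request depends on how many tokens were generated, so tokens per second is usually a fairer metric. The sketch below assumes the completions response follows the OpenAI-style schema with a `usage.completion_tokens` field, which this notebook does not confirm for the NIM image used here. `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(result, elapsed_seconds):
    """Return generated tokens per second, or None if it cannot be computed."""
    if not result or elapsed_seconds <= 0:
        return None
    # Assumed OpenAI-style response: {"usage": {"completion_tokens": N, ...}}
    completion_tokens = result.get("usage", {}).get("completion_tokens")
    if not completion_tokens:
        return None
    return completion_tokens / elapsed_seconds
```

Pass it the dictionary returned by `generate_text` together with the measured elapsed time, and run the same prompts against each quantization to compare throughput.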

For more information about GGUF format and quantization techniques, refer to the community documentation and model cards on Hugging Face.