<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Deploy Any LLM with NIM

This notebook demonstrates how to deploy almost any Large Language Model (LLM) using NVIDIA NIM. NIM provides a streamlined way to deploy and serve LLMs with optimized performance and flexibility.

## Introduction

Deploying various LLMs often involves working with multiple inference frameworks and manual optimization, which can be time-consuming. NIM simplifies this by providing a consistent interface and automatically handling model analysis, backend selection, and configuration.

This tutorial covers:
*   Understanding how NIM handles different model formats.
*   Deploying models directly from Hugging Face.
*   Listing available backend options for a model.
*   Customizing deployments.
*   Deploying models from local storage.

## Prerequisites

### Hardware Requirements

Before proceeding, ensure your system meets the following requirements:

- **GPU**: NVIDIA GPU with at least 24GB VRAM (for Codestral-22B) or 8GB VRAM (for smaller models like Qwen2.5-0.5B)
- **Driver**: NVIDIA Driver version 535 or higher
- **CUDA**: CUDA 12.0 or higher
- **System Memory**: At least 32GB RAM recommended
- **Storage**: Sufficient disk space for model downloads and caching

For detailed hardware specifications, please refer to the [NIM LLM Documentation](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html).
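
As a quick preflight, the GPU memory and driver requirements listed above can be checked programmatically. The following is a minimal sketch that parses `nvidia-smi --query-gpu` output. The helper names are hypothetical, and the default thresholds are simply the Codestral-22B and driver figures from the list above.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, without header or units
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader,nounits"]

def parse_gpu_report(report):
    """Parse CSV rows of 'name, memory.total (MiB), driver_version'."""
    gpus = []
    for row in report.strip().splitlines():
        name, memory_mib, driver = [field.strip() for field in row.split(",")]
        gpus.append({"name": name, "memory_mib": int(memory_mib), "driver": driver})
    return gpus

def meets_requirements(gpus, min_memory_gb=24, min_driver_major=535):
    """Return True if at least one GPU satisfies both thresholds.

    Memory is compared in decimal gigabytes because a card marketed as
    24 GB typically reports slightly less than 24 * 1024 MiB.
    """
    return any(
        gpu["memory_mib"] * 1024**2 >= min_memory_gb * 10**9
        and int(gpu["driver"].split(".")[0]) >= min_driver_major
        for gpu in gpus
    )

def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_report(result.stdout)
```

On a machine with an NVIDIA driver installed, `meets_requirements(query_gpus())` gives a yes/no answer; pass `min_memory_gb=8` when only the smaller Qwen2.5-0.5B model is needed.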

### System Setup

First, let's verify your GPU setup and install necessary dependencies:


In [None]:
!nvidia-smi

### Install Required Software


In [None]:
# Update system and install git-lfs for model downloads
!sudo apt-get update && echo "✓ System packages updated successfully"
!sudo apt-get install git-lfs -y && echo "✓ Git LFS installed successfully"
!git lfs install && echo "✓ Git LFS initialized successfully"

In [None]:
# Install Python dependencies for Docker management
!pip install docker requests && echo "✓ Python Docker SDK and requests installed successfully"

### Get API Keys

#### NVIDIA NGC API Key

An NVIDIA NGC API key is required to access the NVIDIA container registry (`nvcr.io`) and pull the NIM container image.
Refer to [Generating NGC API Keys](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) for more information.


In [3]:
import getpass
import os

# Uses NGC_API_KEY from the environment if it is already set; otherwise prompts for it below.

if not os.environ.get("NGC_API_KEY", "").startswith("nvapi-"):
    ngc_api_key = getpass.getpass("Enter your NGC API Key: ")
    assert ngc_api_key.startswith("nvapi-"), "Not a valid key"
    os.environ["NGC_API_KEY"] = ngc_api_key
    print("✓ NGC API Key set successfully")

In [None]:
%%bash
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

#### Hugging Face Token

You'll also need a [Hugging Face token](https://huggingface.co/settings/tokens) to download models.


In [None]:
if not os.environ.get("HF_USERNAME", ""):
    os.environ["HF_USERNAME"] = input("Enter your Hugging Face username: ").strip()
    print("✓ Hugging Face username set successfully")

In [None]:
if not os.environ.get("HF_TOKEN", "").startswith("hf_"):
    hf_token = getpass.getpass("Enter your Hugging Face token: ")
    assert hf_token.startswith("hf_"), "Not a valid key"
    os.environ["HF_TOKEN"] = hf_token
    print("✓ Hugging Face token set successfully")

### Setup NIM Container

Choose your NIM container image and pull it:


In [None]:
# Available NIM container options (choose one):
# nim_images = {
#     "universal": "nvcr.io/nvidian/nim-llm-dev/universal-nim:1.11.0.rc4",
#     "latest": "nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"
# }

# Set the NIM image - you can change this to your preferred version
os.environ['NIM_IMAGE'] = "nvcr.io/nvidian/nim-llm-dev/universal-nim:1.11.0.rc6"
print(f"Using NIM image: {os.environ['NIM_IMAGE']}")

In [None]:
# Pull the NIM container image (the success message prints only if the pull succeeds)
!docker pull $NIM_IMAGE && echo "✓ NIM container image pulled successfully"

### Download Model to Local Storage

We'll download Qwen2.5-0.5B, a lightweight LLM, for use in Example 4:


In [10]:
!mkdir -p ~/models/Qwen2.5-0.5B

<div class="alert alert-block alert-warning">
    <b>Note:</b>  NVIDIA cannot guarantee the security of any models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks up to and including full remote code execution. We strongly recommend that before attempting to load it you manually verify the safety of any model not provided by NVIDIA, through such mechanisms as a) ensuring that the model weights are serialized using the safetensors format, b) conducting a manual review of any model or inference code to ensure that it is free of obfuscated or malicious code, and c) validating the signature of the model, if available, to ensure that it comes from a trusted source and has not been modified.
 </div>


In [None]:
!git clone https://$HF_USERNAME:$HF_TOKEN@huggingface.co/Qwen/Qwen2.5-0.5B \
    ~/models/Qwen2.5-0.5B && echo "✓ Qwen2.5-0.5B model downloaded successfully"
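
The note above recommends checking that downloaded weights use the safetensors format and reviewing any bundled code before loading. As a rough first pass, the sketch below inventories a model directory by file type. The function name and the list of pickle-style extensions are illustrative assumptions, and this check does not replace a manual review or signature validation.

```python
from pathlib import Path

# Extensions commonly used for pickle-based checkpoints, which can run
# arbitrary code when loaded. Illustrative, not exhaustive.
PICKLE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".ckpt"}

def audit_model_dir(model_dir):
    """Group files under model_dir into safetensors, pickle-based, and Python code."""
    root = Path(model_dir).expanduser()
    files = [
        path.relative_to(root)
        for path in root.rglob("*")
        # Skip git metadata left behind by `git clone`
        if path.is_file() and ".git" not in path.relative_to(root).parts
    ]
    return {
        "safetensors": sorted(p.as_posix() for p in files if p.suffix == ".safetensors"),
        "pickle_based": sorted(p.as_posix() for p in files if p.suffix in PICKLE_SUFFIXES),
        "python_code": sorted(p.as_posix() for p in files if p.suffix == ".py"),
    }
```

For example, `audit_model_dir("~/models/Qwen2.5-0.5B")` returns the three lists; any entries under `pickle_based` or `python_code` are candidates for closer inspection.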

## Deployment Examples

Let's explore different ways to deploy models using NIM.

### Example 1: Basic Deployment from Hugging Face

This example shows how to deploy Codestral-22B directly from Hugging Face.

<div class="alert alert-block alert-info">
<b>Important:</b> You must accept the model's license agreement at https://huggingface.co/mistralai/Codestral-22B-v0.1 before using this model.
</div>


In [12]:
os.environ["CONTAINER_NAME"] = "LLM-NIM"
os.environ["LOCAL_NIM_CACHE"] = os.path.expanduser("~/.cache/nim")
os.makedirs(os.environ["LOCAL_NIM_CACHE"], exist_ok=True)

In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE
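
Because the Docker SDK for Python was installed earlier, the same launch can also be expressed in Python. The sketch below maps each CLI flag from the command above onto a `containers.run` keyword argument. The helper names `build_run_kwargs` and `launch` are hypothetical, and the mapping is an illustration rather than an official NIM launcher.

```python
import os

def build_run_kwargs(image, model_name, served_name, cache_dir, hf_token,
                     container_name="LLM-NIM", host_port=8000):
    """Translate the docker CLI flags used above into docker-py keyword arguments."""
    return {
        "image": image,
        "name": container_name,                  # --name
        "runtime": "nvidia",                     # --runtime=nvidia
        "shm_size": "16g",                       # --shm-size=16GB
        "environment": {                         # -e KEY=VALUE
            "HF_TOKEN": hf_token,
            "NIM_MODEL_NAME": model_name,
            "NIM_SERVED_MODEL_NAME": served_name,
        },
        # -v "$LOCAL_NIM_CACHE:/opt/nim/.cache"
        "volumes": {cache_dir: {"bind": "/opt/nim/.cache", "mode": "rw"}},
        "user": str(os.getuid()),                # -u $(id -u)
        "ports": {"8000/tcp": host_port},        # -p 8000:8000
        "detach": True,                          # -d
        "remove": True,                          # --rm
    }

def launch(run_kwargs):
    """Start the container. Needs the `docker` package and a running Docker daemon."""
    import docker
    from docker.types import DeviceRequest

    client = docker.from_env()
    # Equivalent of --gpus all: count=-1 requests every available GPU
    all_gpus = DeviceRequest(count=-1, capabilities=[["gpu"]])
    return client.containers.run(device_requests=[all_gpus], **run_kwargs)
```

Keeping the argument mapping in a separate function makes it easy to inspect or tweak the settings before any container is started.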

Run the following cell to confirm that the `LLM-NIM` container is running.



In [None]:
!docker ps  # Check container is running

While the LLM NIM service is getting ready, you may run the following cell to see live logs.

<div class="alert alert-block alert-success">
<b>Note:</b> The NIM service takes a few minutes to initialize. Monitor the logs if needed.
</div>


In [15]:
# Optional: monitor logs during startup. The helper is defined in the next cell, so run that cell first.
# check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

In [None]:
import requests
import time
import docker
import os

def check_service_ready_from_logs(container_name, print_logs=False, timeout=600):
    """
    Check if NIM service is ready by monitoring Docker logs for 'Application startup complete' message.

    Args:
        container_name (str): Name of the Docker container
        print_logs (bool): Whether to print logs while monitoring (default: False)
        timeout (int): Maximum time to wait in seconds (default: 600)

    Returns:
        bool: True if service is ready, False if timeout reached
    """
    print("Waiting for NIM service to start...")
    start_time = time.time()

    try:
        client = docker.from_env()
        container = client.containers.get(container_name)

        # Stream logs in real time using the blocking generator
        log_buffer = ""
        for log_chunk in container.logs(stdout=True, stderr=True, follow=True, stream=True):
            # Check timeout
            if time.time() - start_time > timeout:
                print(f"❌ Timeout reached ({timeout}s). Service may not have started properly.")
                return False

            # Decode chunk and add to buffer
            chunk = log_chunk.decode('utf-8', errors='ignore')
            log_buffer += chunk

            # Process complete lines
            while '\n' in log_buffer:
                line, log_buffer = log_buffer.split('\n', 1)
                line = line.strip()

                if print_logs and line:
                    print(f"[LOG] {line}")

                # Check for startup complete message
                if "Application startup complete" in line:
                    print("✓ Application startup complete! Service is ready.")
                    return True

    except Exception as e:
        print(f"❌ Error: {e}")
        return False

    # The log stream ended without the startup message (for example, the container exited)
    print("❌ Log stream ended before startup completed. Service may not have started properly.")
    return False

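def check_service_ready(timeout=600, poll_interval=30):
    """Fallback health check using the HTTP readiness endpoint.

    Returns True once the service reports ready, False if `timeout` seconds pass first.
    """
    url = 'http://localhost:8000/v1/health/ready'
    print("Checking service health endpoint...")
    deadline = time.time() + timeout

    while time.time() < deadline:
        try:
            response = requests.get(url, headers={'accept': 'application/json'}, timeout=10)
            if response.status_code == 200 and response.json().get("message") == "Service is ready.":
                print("✓ Service ready!")
                return True
        except (requests.RequestException, ValueError):
            # Service not reachable yet, or the response body was not valid JSON
            pass
        print("⏳ Still starting...")
        time.sleep(poll_interval)

    print(f"❌ Service not ready after {timeout}s.")
    return False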

# Use the log-based check first, fallback to health endpoint if needed
container_name = os.environ.get("CONTAINER_NAME", "LLM-NIM")
if not check_service_ready_from_logs(container_name, print_logs=True):
    print("Falling back to health endpoint check...")
    check_service_ready()

Now let's test the deployed model:


In [None]:
import requests

def generate_text(model, prompt, max_tokens=1000, temperature=0.7):
    """Generate text using the NIM service"""
    try:
        response = requests.post(
            "http://localhost:8000/v1/completions",
            json={
                "model": model,
                "prompt": prompt,
                "max_tokens": max_tokens,
                "temperature": temperature
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
        return None

# Example usage
result = generate_text(
    model="mistralai/Codestral-22B-v0.1",
    prompt="Write a complete function that computes fibonacci numbers in Rust:"
)
print("Generated Code:")
print("=" * 50)
if result:  # generate_text returns None when the request fails
    print(result['choices'][0]['text'])

Before we move on to the next example, let's stop the LLM NIM service.



In [None]:
!docker stop $CONTAINER_NAME

### Example 2: Deployment Using Different Backend Options

NIM supports multiple backends for model deployment. Let's explore TensorRT-LLM and vLLM backends:
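
To see which profiles are available for a model before picking one, NIM LLM containers ship a `list-model-profiles` utility. The sketch below builds and runs that command from Python. It assumes the image used in this notebook supports the utility and reads `NIM_MODEL_NAME` the same way the server does; the helper names are hypothetical, and the output format is not parsed here because it may vary between releases.

```python
import subprocess

def profile_listing_command(image, model_name, cache_dir):
    """Build a `docker run` command that invokes list-model-profiles in the container."""
    return [
        "docker", "run", "--rm", "--runtime=nvidia", "--gpus", "all",
        "-e", "HF_TOKEN",                        # forward HF_TOKEN from the caller's environment
        "-e", f"NIM_MODEL_NAME={model_name}",
        "-v", f"{cache_dir}:/opt/nim/.cache",
        image, "list-model-profiles",
    ]

def list_profiles(image, model_name, cache_dir):
    """Run the command and return its raw text output (requires Docker and a GPU)."""
    command = profile_listing_command(image, model_name, cache_dir)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

In this notebook the call would look like `print(list_profiles(os.environ["NIM_IMAGE"], "hf://mistralai/Codestral-22B-v0.1", os.environ["LOCAL_NIM_CACHE"]))`.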

#### TensorRT-LLM Backend


In [None]:
# Using TensorRT-LLM backend by specifying the NIM_MODEL_PROFILE parameter
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_MODEL_PROFILE="tensorrt_llm" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# Use the log-based check (set print_logs=True to see detailed logs)
if not check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True):
    print("Falling back to health endpoint check...")
    check_service_ready()

Test the TensorRT-LLM backend:


In [None]:
result = generate_text(
    model="mistralai/Codestral-22B-v0.1",
    prompt="Write a complete Python function that computes fibonacci numbers with memoization:"
)
print("TensorRT-LLM Backend Result:")
print("=" * 50)
if result:  # generate_text returns None when the request fails
    print(result['choices'][0]['text'])

Before we move on to the next example, let's stop the LLM NIM service.



In [None]:
!docker stop $CONTAINER_NAME

#### vLLM Backend


In [None]:
# Using vLLM backend by specifying the NIM_MODEL_PROFILE parameter
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_MODEL_PROFILE="vllm" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# Use the log-based check (set print_logs=True to see detailed logs)
if not check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True):
    print("Falling back to health endpoint check...")
    check_service_ready()

Test the vLLM backend:



In [None]:
result = generate_text(
    model="mistralai/Codestral-22B-v0.1",
    prompt="Write a complete C++ function that computes fibonacci numbers efficiently:"
)
print("vLLM Backend Result:")
print("=" * 50)
if result:  # generate_text returns None when the request fails
    print(result['choices'][0]['text'])

Before we move on to the next example, let's stop the LLM NIM service.



In [None]:
!docker stop $CONTAINER_NAME

### Example 3: Customizing Model Parameters

This example demonstrates how custom parameters affect model behavior. We'll deploy with specific constraints and test them:

**Key Parameters:**
* `NIM_TENSOR_PARALLEL_SIZE=2`: Shards the model across 2 GPUs (tensor parallelism), so the host needs at least 2 GPUs
* `NIM_MAX_INPUT_LENGTH=2048`: Limits input to 2048 tokens
* `NIM_MAX_OUTPUT_LENGTH=512`: Limits output to 512 tokens
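
Since a tensor parallel size of 2 needs two GPUs, it is worth checking the GPU count before launching. The following is a minimal sketch that counts devices from `nvidia-smi --list-gpus` output; the helper names are hypothetical and not part of NIM.

```python
import subprocess

def count_gpus(listing):
    """Count GPUs in `nvidia-smi --list-gpus` output (one 'GPU <n>: ...' line per device)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))

def check_tensor_parallel(tp_size, gpu_count):
    """Raise ValueError if tp_size cannot be satisfied with gpu_count GPUs."""
    if tp_size < 1:
        raise ValueError("Tensor parallel size must be at least 1")
    if tp_size > gpu_count:
        raise ValueError(
            f"NIM_TENSOR_PARALLEL_SIZE={tp_size} needs {tp_size} GPUs, found {gpu_count}"
        )

def visible_gpu_count():
    """Query nvidia-smi for the number of GPUs (requires an NVIDIA driver)."""
    result = subprocess.run(["nvidia-smi", "--list-gpus"],
                            capture_output=True, text=True, check=True)
    return count_gpus(result.stdout)
```

Calling `check_tensor_parallel(2, visible_gpu_count())` before the `docker run` cell surfaces a clear error on single-GPU machines instead of a failure during container startup.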


In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_TENSOR_PARALLEL_SIZE=2 \
 -e NIM_MAX_INPUT_LENGTH=2048 \
 -e NIM_MAX_OUTPUT_LENGTH=512 \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# Use the log-based check (set print_logs=True to see detailed logs)
if not check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True):
    print("Falling back to health endpoint check...")
    check_service_ready()

Test with custom parameters:



In [None]:
result = generate_text(model="mistralai/Codestral-22B-v0.1",
                       prompt="Write me a function that computes fibonacci in Javascript")
if result:  # generate_text returns None when the request fails
    print(result['choices'][0]['text'])

Before we move on to the next example, let's stop the LLM NIM service.



In [None]:
!docker stop $CONTAINER_NAME

### Example 4: Deployment from Local Model

This example shows how to deploy Qwen2.5-0.5B from the locally downloaded model:


In [None]:
# Verify model files exist
!ls ~/models/Qwen2.5-0.5B

In [35]:
os.environ["LOCAL_MODEL_DIR"] = os.path.expanduser("~/models/Qwen2.5-0.5B")

In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus '"device=0"' \
 --shm-size=16GB \
 -e NIM_MODEL_NAME="/opt/models/Qwen2.5-0.5B" \
 -e NIM_SERVED_MODEL_NAME="Qwen/Qwen2.5-0.5B" \
 -v "$LOCAL_MODEL_DIR:/opt/models/Qwen2.5-0.5B" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# Use the log-based check (set print_logs=True to see detailed logs)
if not check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True):
    print("Falling back to health endpoint check...")
    check_service_ready()

Test the local model deployment:



In [None]:
result = generate_text(model="Qwen/Qwen2.5-0.5B",
                       prompt="Once upon a time ")
if result:  # generate_text returns None when the request fails
    print(result['choices'][0]['text'])

In [None]:
# Final cleanup
!docker stop $CONTAINER_NAME && echo "✓ All containers stopped successfully"