<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Deploy TensorRT-LLM Checkpoints and Engines with NIM

This notebook demonstrates how to deploy your own TensorRT-LLM checkpoints and engines with NIM. For
demonstration purposes, we download and convert weights from Hugging Face manually, but you can also
deploy Hugging Face weights directly without any manual conversion, as shown in [the previous notebook](./1_HuggingFace_Safetensors.ipynb).

⚠️ This notebook assumes familiarity with TensorRT-LLM and TensorRT-LLM optimizations. Consider starting with Notebook 1
unless you specifically need custom TensorRT-LLM optimizations.

## What You'll Build

By the end of this notebook, you'll be able to:
- ✅ Convert Hugging Face models to TensorRT-LLM format
- ✅ Deploy TensorRT-LLM checkpoints for development
- ✅ Compile highly optimized TensorRT-LLM engines
- ✅ Deploy production-ready engines with maximum performance

## When to Use This Approach

**Choose this notebook if you:**
- Need the absolute best inference performance
- Have production workloads requiring low latency
- Want to optimize for specific hardware configurations
- Can invest time in the conversion process (15-45 minutes)

## The Process

```mermaid
graph LR
    A[HuggingFace Model] --> B[TensorRT-LLM Checkpoint]
    B --> C[TensorRT-LLM Engine]
    C --> D[Deploy with NIM]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#9f9,stroke:#333,stroke-width:2px
```

1. **Download**: Get a Hugging Face model (5 minutes)
2. **Convert**: Create TensorRT-LLM checkpoint (5-15 minutes)
3. **Compile**: Build optimized engine (10-30 minutes)
4. **Deploy**: Serve with NIM (no further conversion needed)

## What's Covered

This tutorial includes:
* **Setup**: Preparing your environment and downloading models
* **Example 1**: Converting Safetensors to TensorRT-LLM checkpoints
* **Example 2**: Deploying checkpoints for testing
* **Example 3**: Compiling optimized engines for production
* **Example 4**: Deploying engines and running a test generation

## Prerequisites

### Hardware Requirements

TensorRT-LLM conversion and deployment requires significant resources:

- **GPU**: NVIDIA GPU with at least 24GB VRAM (for Llama-3-8B)
- **System Memory**: At least 64GB RAM recommended for conversion process
- **Storage**:
  - 50GB+ free space for model downloads and conversion artifacts
  - SSD recommended for faster I/O during conversion

**Conversion Time Estimates:**
- Checkpoint conversion: 5-15 minutes depending on hardware
- Engine compilation: 10-30 minutes depending on optimization settings

For detailed hardware specifications, refer to the [TensorRT-LLM documentation](https://nvidia.github.io/TensorRT-LLM/).
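
The requirements above can be checked before starting. The following is a minimal sketch rather than part of the NIM tooling: `parse_gpu_memory_mib`, `free_disk_gb`, and `preflight` are hypothetical helper names, and the thresholds are assumptions derived from the figures listed above. It assumes `nvidia-smi` is on the `PATH`; if it is not, the check reports that no GPU was detected.

```python
import shutil
import subprocess

# Assumed thresholds based on the requirements above. Cards marketed as 24 GB
# usually report slightly less than 24576 MiB, so the VRAM floor is set lower.
MIN_VRAM_MIB = 22_000
MIN_FREE_DISK_GB = 50


def parse_gpu_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse per-GPU total memory (MiB) from the output of
    `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def free_disk_gb(path: str) -> float:
    """Return free space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path: str = ".") -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        gpus = parse_gpu_memory_mib(out)
    except (OSError, subprocess.CalledProcessError, ValueError):
        # nvidia-smi missing, failing, or printing something unexpected
        gpus = []
    if not gpus:
        problems.append("no NVIDIA GPU detected via nvidia-smi")
    elif max(gpus) < MIN_VRAM_MIB:
        problems.append(f"largest GPU reports {max(gpus)} MiB; at least {MIN_VRAM_MIB} MiB suggested")
    if free_disk_gb(path) < MIN_FREE_DISK_GB:
        problems.append(f"less than {MIN_FREE_DISK_GB} GB free at {path}")
    return problems


print(preflight(".") or "All preflight checks passed")
```

Pass the directory you plan to use for model downloads as `path` so the disk check looks at the right filesystem.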

### System Setup

First, let's verify your GPU setup and install necessary dependencies:



In [None]:
!nvidia-smi

### Install Required Software



In [None]:
# Install Python dependencies for Docker management
%pip install docker requests huggingface-hub && echo "✓ Python dependencies installed successfully"

### Get API Keys

#### NVIDIA NGC API Key

An NVIDIA NGC API key is required to access the NVIDIA container registry (nvcr.io) and pull the NIM container image.
Refer to [Generating NGC API Keys](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) for more information.



In [None]:
import getpass
import os

if not os.environ.get("NGC_API_KEY", "").startswith("nvapi-"):
    ngc_api_key = getpass.getpass("Enter your NGC API Key: ")
    assert ngc_api_key.startswith("nvapi-"), "Not a valid key"
    os.environ["NGC_API_KEY"] = ngc_api_key
    print("✓ NGC API Key set successfully")

In [None]:
!echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

#### Hugging Face Token

You'll also need a [Hugging Face token](https://huggingface.co/settings/tokens) to download models.



In [None]:
if not os.environ.get("HF_TOKEN", "").startswith("hf_"):
    hf_token = getpass.getpass("Enter your Hugging Face token: ")
    assert hf_token.startswith("hf_"), "Not a valid token"
    os.environ["HF_TOKEN"] = hf_token
    print("✓ Hugging Face token set successfully")

### Setup NIM Container

Choose your NIM container image and pull it:



In [None]:
# Set the NIM image
os.environ['NIM_IMAGE'] = "nvcr.io/nvidia/nim/nim-llm:latest"
print(f"Using NIM image: {os.environ['NIM_IMAGE']}")

In [None]:
# Pull the NIM container image
!docker pull $NIM_IMAGE && echo "✓ NIM container image pulled successfully"

### Utility Functions

Below are some utility functions we'll use in this notebook. These are for simplifying the process of deploying and monitoring NIMs in a notebook environment, and aren't required in general.


In [None]:
import requests
import time
import docker
import os

def check_service_ready_from_logs(container_name, print_logs=False, timeout=600):
    """
    Check if NIM service is ready by monitoring Docker logs for 'Application startup complete' message.

    Args:
        container_name (str): Name of the Docker container
        print_logs (bool): Whether to print logs while monitoring (default: False)
        timeout (int): Maximum time to wait in seconds (default: 600)

    Returns:
        bool: True if service is ready, False if timeout reached
    """
    print("Waiting for NIM service to start...")
    start_time = time.time()

    try:
        client = docker.from_env()
        container = client.containers.get(container_name)

        # Stream logs in real-time using the blocking generator
        log_buffer = ""
        for log_chunk in container.logs(stdout=True, stderr=True, follow=True, stream=True):
            # Check timeout (only evaluated when a new log chunk arrives)
            if time.time() - start_time > timeout:
                print(f"❌ Timeout reached ({timeout}s). Service may not have started properly.")
                return False

            # Decode chunk and add to buffer
            chunk = log_chunk.decode('utf-8', errors='ignore')
            log_buffer += chunk

            # Process complete lines
            while '\n' in log_buffer:
                line, log_buffer = log_buffer.split('\n', 1)
                line = line.strip()

                if print_logs and line:
                    print(f"[LOG] {line}")

                # Check for startup complete message
                if "Application startup complete" in line:
                    print("✓ Application startup complete! Service is ready.")
                    return True

    except Exception as e:
        print(f"❌ Error: {e}")
        return False

    # The log stream ended without the startup message, so the container has exited
    print("❌ Log stream ended before startup completed. The container may have exited.")
    return False

def check_service_ready():
    """Fallback health check using HTTP endpoint"""
    url = 'http://localhost:8000/v1/health/ready'
    print("Checking service health endpoint...")

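    while True:
        try:
            # A request timeout keeps a hung connection from blocking the loop
            response = requests.get(url, headers={'accept': 'application/json'}, timeout=10)
            if response.status_code == 200 and response.json().get("message") == "Service is ready.":
                print("✓ Service ready!")
                break
        except (requests.ConnectionError, requests.Timeout):
            # Service is not accepting connections yet; keep polling
            pass
        print("⏳ Still starting...")
        time.sleep(30)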

def generate_text(model, prompt, max_tokens=1000, temperature=0.7):
    """Generate text using the NIM service"""
    try:
        response = requests.post(
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": temperature
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()['choices'][0]['message']['content']
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
        return None

print("✓ Utility functions loaded successfully")
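
Note that `check_service_ready` above polls until the service responds and has no upper bound on the wait. The following is a minimal sketch of a bounded alternative. `wait_until` is a hypothetical helper, not part of NIM or the notebook's utilities, and the example predicate in the comments assumes the same health endpoint used above.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Call `predicate` every `interval` seconds until it returns True.

    Returns False once another sleep would run past `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)


# Example (assumes the NIM container from this notebook is listening on port 8000):
#   import requests
#   def service_is_ready():
#       try:
#           return requests.get("http://localhost:8000/v1/health/ready", timeout=5).status_code == 200
#       except requests.RequestException:
#           return False
#   wait_until(service_is_ready, timeout=600, interval=30)
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline unaffected by system clock adjustments.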

### Download Base Model

We'll download Llama-3-8B-Instruct as our base model for TensorRT-LLM conversion.

<div class="alert alert-block alert-info">
<b>Note:</b> You can modify the `base_work_dir` variable below to use a different directory for storing models and conversion artifacts.
</div>



In [None]:
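# Set base directory for all files - you can modify this path as needed
# Examples: ".", "~", "/tmp", "/scratch", etc.
# The path is expanded to an absolute one because Docker bind mounts (-v) expect an absolute host path
base_work_dir = os.path.abspath(os.path.expanduser("/ephemeral"))
os.environ["BASE_WORK_DIR"] = base_work_dir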

# Set up model download location
model_save_location = os.path.join(base_work_dir, "models")

os.environ["MODEL_SAVE_LOCATION"] = model_save_location
os.environ["LOCAL_MODEL_DIR"] = os.path.join(model_save_location, "llama3-8b-instruct-hf")

# Create model directory
os.makedirs(os.environ["LOCAL_MODEL_DIR"], exist_ok=True)

<div class="alert alert-block alert-warning">
    <b>Note:</b> NVIDIA cannot guarantee the security of any models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks up to and including full remote code execution. We strongly recommend that before attempting to load it you manually verify the safety of any model not provided by NVIDIA, through such mechanisms as a) ensuring that the model weights are serialized using the safetensors format, b) conducting a manual review of any model or inference code to ensure that it is free of obfuscated or malicious code, and c) validating the signature of the model, if available, to ensure that it comes from a trusted source and has not been modified.
</div>

<div class="alert alert-block alert-info">
<b>Important:</b> You must accept the model's license agreement at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct before using this model.
</div>



In [None]:
!huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir "$LOCAL_MODEL_DIR" && echo "✓ Model downloaded successfully"

## TensorRT-LLM Conversion Examples

Let's explore the complete workflow from Hugging Face models to optimized TensorRT-LLM engines.

### Setup Common Variables



In [None]:
os.environ["CONTAINER_NAME"] = "TRTLLM-NIM"
os.environ["LOCAL_NIM_CACHE"] = os.path.join(base_work_dir, ".cache/nim")
os.environ["TRTLLM_CKPT_DIR"] = os.path.join(model_save_location, "llama3-8b-instruct-ckpt")
os.environ["TRTLLM_ENGINE_DIR"] = os.path.join(model_save_location, "llama3-8b-instruct-engine")

# Create necessary directories
os.makedirs(os.environ["LOCAL_NIM_CACHE"], exist_ok=True)
os.makedirs(os.path.join(os.environ["TRTLLM_CKPT_DIR"], "trtllm_ckpt"), exist_ok=True)
os.makedirs(os.path.join(os.environ["TRTLLM_ENGINE_DIR"], "trtllm_engine"), exist_ok=True)

print("✓ Directories created successfully")

## Example 1: Convert Safetensors to TensorRT-LLM Checkpoint

First, we'll convert the Hugging Face safetensors model to a TensorRT-LLM checkpoint format.



In [None]:
# Verify the source model files
!ls -Rlh $LOCAL_MODEL_DIR

### Convert to TensorRT-LLM Checkpoint

We'll use the TensorRT-LLM tools inside the NIM container to perform the conversion.

> INFO
> For more information on TensorRT-LLM Checkpoints and the available options, refer to the [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html)



In [None]:
# Convert safetensors to TensorRT-LLM checkpoint
# This uses the convert_checkpoint.py script inside the NIM container
print("Starting conversion to TensorRT-LLM checkpoint...")
print("This process may take a few minutes depending on your hardware.")

!docker run --rm \
  --runtime=nvidia \
  --gpus '"device=0,1"' \
  --shm-size=16GB \
  -v $LOCAL_MODEL_DIR:/input_model \
  -v $TRTLLM_CKPT_DIR:/output_dir \
  -u $(id -u) \
  $NIM_IMAGE \
  python3 /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir /input_model \
  --output_dir /output_dir/trtllm_ckpt \
  --dtype bfloat16

In [None]:
# Copy the required files from the Hugging Face model directory to the TensorRT-LLM checkpoint directory
!cp -r $LOCAL_MODEL_DIR/config.json $TRTLLM_CKPT_DIR/config.json
!cp -r $LOCAL_MODEL_DIR/generation_config.json $TRTLLM_CKPT_DIR/generation_config.json
!cp -r $LOCAL_MODEL_DIR/tokenizer.json $TRTLLM_CKPT_DIR/tokenizer.json
!cp -r $LOCAL_MODEL_DIR/tokenizer_config.json $TRTLLM_CKPT_DIR/tokenizer_config.json
!cp -r $LOCAL_MODEL_DIR/special_tokens_map.json $TRTLLM_CKPT_DIR/special_tokens_map.json

In [None]:
# Verify the directory structure of the checkpoint folder
!ls -Rlh $TRTLLM_CKPT_DIR
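
Instead of reading the `ls` output by eye, the expected layout can be checked in code. This is a minimal sketch: `missing_entries` is a hypothetical helper, and the list of required files simply mirrors the files copied in the cells above rather than any official NIM specification.

```python
from pathlib import Path

# Files copied next to the converted weights in the cells above
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_entries(model_dir: str, weights_subdir: str = "trtllm_ckpt") -> list[str]:
    """Return the expected files or subdirectories that are missing from `model_dir`."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    sub = root / weights_subdir
    # The subdirectory must exist and contain at least one entry written by the conversion
    if not sub.is_dir() or not any(sub.iterdir()):
        missing.append(weights_subdir + "/")
    return missing


# Example: missing_entries(os.environ["TRTLLM_CKPT_DIR"]) returns [] when the layout is complete
```

The same helper can be pointed at the engine directory later by passing `weights_subdir="trtllm_engine"`.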

## Example 2: Deploy TensorRT-LLM Checkpoint with NIM

Now let's deploy the TensorRT-LLM checkpoint using NIM:



In [None]:
# Deploy TensorRT-LLM checkpoint with NIM
print("Deploying TensorRT-LLM checkpoint with NIM...")

!docker run -it --rm \
  --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=16GB \
  -e NIM_MODEL_NAME="/opt/models/my_model" \
  -e NIM_SERVED_MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct" \
  -e NIM_MODEL_PROFILE="tensorrt_llm" \
  -v "$TRTLLM_CKPT_DIR:/opt/models/my_model" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  -d \
  $NIM_IMAGE

In [None]:
!docker ps  # Check container is running

In [None]:
check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

### Test TensorRT-LLM Checkpoint Deployment



In [None]:
# Test the deployed TensorRT-LLM checkpoint
result = generate_text(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Explain the benefits of TensorRT-LLM optimization"
)
print("TensorRT-LLM Checkpoint Result:")
print("=" * 50)
print(result if result else "Failed to generate text")

In [None]:
# Stop the checkpoint deployment before moving to engine compilation
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"

## Example 3: Compile TensorRT-LLM Engine

Now let's compile the checkpoint into a fully optimized TensorRT-LLM engine.

For more information on TensorRT-LLM Engines and the available options, refer to the [trtllm-build documentation](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-build.html).

For detailed optimization guidance, refer to the [TensorRT-LLM Performance Guide](https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html).



In [None]:
# Compile TensorRT-LLM checkpoint to engine
print("Compiling TensorRT-LLM checkpoint to engine...")
print("This process may take several minutes depending on your hardware and optimization settings.")

!docker run --rm \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -v $TRTLLM_CKPT_DIR:/input_checkpoints \
  -v $TRTLLM_ENGINE_DIR:/output_engines \
  -w /output_engines \
  -u $(id -u) \
  $NIM_IMAGE \
  trtllm-build --checkpoint_dir /input_checkpoints/trtllm_ckpt \
  --output_dir /output_engines/trtllm_engine
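
The build above relies on the default `trtllm-build` settings. As a hedged illustration of how build options could be layered on, the sketch below assembles an argument list in Python. `trtllm_build_args` is a hypothetical helper, and the size-limit flags (`--max_batch_size`, `--max_input_len`, `--max_seq_len`) and the value used are assumptions based on recent `trtllm-build` releases. Confirm them with `trtllm-build --help` inside your NIM image before relying on them.

```python
import shlex
from typing import Optional


def trtllm_build_args(
    checkpoint_dir: str,
    output_dir: str,
    max_batch_size: Optional[int] = None,
    max_input_len: Optional[int] = None,
    max_seq_len: Optional[int] = None,
) -> list[str]:
    """Assemble a trtllm-build argument list, adding size limits only when given."""
    args = ["trtllm-build", "--checkpoint_dir", checkpoint_dir, "--output_dir", output_dir]
    for flag, value in (
        ("--max_batch_size", max_batch_size),
        ("--max_input_len", max_input_len),
        ("--max_seq_len", max_seq_len),
    ):
        if value is not None:
            args += [flag, str(value)]
    return args


# Illustrative value only; choose limits that match your workload
print(shlex.join(trtllm_build_args(
    "/input_checkpoints/trtllm_ckpt",
    "/output_engines/trtllm_engine",
    max_batch_size=8,
)))
```

The printed command can replace the `trtllm-build ...` portion of the `docker run` cell above.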

In [None]:
# Copy the required files from the Hugging Face model directory to the TensorRT-LLM engine directory
!cp -r $LOCAL_MODEL_DIR/config.json $TRTLLM_ENGINE_DIR/config.json
!cp -r $LOCAL_MODEL_DIR/generation_config.json $TRTLLM_ENGINE_DIR/generation_config.json
!cp -r $LOCAL_MODEL_DIR/tokenizer.json $TRTLLM_ENGINE_DIR/tokenizer.json
!cp -r $LOCAL_MODEL_DIR/tokenizer_config.json $TRTLLM_ENGINE_DIR/tokenizer_config.json
!cp -r $LOCAL_MODEL_DIR/special_tokens_map.json $TRTLLM_ENGINE_DIR/special_tokens_map.json

In [None]:
# Verify the engine was created
!ls -Rlh $TRTLLM_ENGINE_DIR

## Example 4: Deploy TensorRT-LLM Engine with NIM

Finally, let's deploy the fully optimized TensorRT-LLM engine:



In [None]:
# Deploy TensorRT-LLM engine with NIM
print("Deploying optimized TensorRT-LLM engine with NIM...")

!docker run -it --rm \
  --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=16GB \
  -e NIM_MODEL_NAME="/opt/models/my_model" \
  -e NIM_SERVED_MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct" \
  -e NIM_MODEL_PROFILE="tensorrt_llm" \
  -v $TRTLLM_ENGINE_DIR:/opt/models/my_model \
  -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
  -u $(id -u) \
  -p 8000:8000 \
  -d \
  $NIM_IMAGE

In [None]:
check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

### Test TensorRT-LLM Engine Deployment



In [None]:
# Test the deployed TensorRT-LLM engine

# Warm up the engine
print("Warming up the TensorRT-LLM engine...")
generate_text(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Hello",
    max_tokens=10
)

result = generate_text(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Write a Python function to implement binary search",
)

print("TensorRT-LLM Engine Result:")
print("=" * 50)
print(result if result else "Failed to generate text")
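
To get a rough sense of response latency for the deployed engine, a few requests can be timed after the warm-up. This is a minimal sketch, not a rigorous benchmark: `measure_latency` is a hypothetical helper, and the commented example assumes the `generate_text` utility defined earlier in this notebook and a running NIM container.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 5) -> dict:
    """Time `call` `runs` times and return min, median, and max wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example with the notebook's helper (requires the NIM container to be running):
#   stats = measure_latency(lambda: generate_text(
#       model="meta-llama/Meta-Llama-3-8B-Instruct", prompt="Hello", max_tokens=64))
#   print(stats)
```

Because responses vary in length, fix `max_tokens` and reuse the same prompt when comparing the checkpoint and engine deployments.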


## Cleanup



In [None]:
# Final cleanup
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"
print("✓ Container stopped successfully")

## Summary

This notebook demonstrated the complete TensorRT-LLM workflow:

1. **Checkpoint Conversion**: Converting Hugging Face safetensors to TensorRT-LLM checkpoint format
2. **Checkpoint Deployment**: Deploying checkpoints with NIM for development and testing
3. **Engine Compilation**: Creating optimized TensorRT-LLM engines for production
4. **Engine Deployment**: Deploying optimized engines for maximum performance

**Key Benefits of TensorRT-LLM:**
- **Performance**: Up to 4x faster inference compared to standard frameworks, depending on the model, hardware, and build settings
- **Memory Efficiency**: Optimized memory usage and KV-cache management
- **Flexibility**: Support for various optimization techniques and hardware configurations

**Next Steps:**
- Experiment with different optimization settings for your use case
- Try quantization techniques (INT8, FP8) for further performance gains
- Explore multi-GPU deployments for larger models

For more advanced optimization techniques, refer to the [TensorRT-LLM documentation](https://nvidia.github.io/TensorRT-LLM/).