# Deploy Any LLM with NIM

This notebook demonstrates how to deploy almost any Large Language Model (LLM) using NVIDIA NIM. NIM provides a streamlined way to deploy and serve LLMs with optimized performance and flexibility.

## Introduction

Deploying various LLMs often involves working with multiple inference frameworks and manual optimization, which can be time-consuming. NIM simplifies this by providing a consistent interface and automatically handling model analysis, backend selection, and configuration.

This tutorial covers:
*   Understanding how NIM handles different model formats.
*   Deploying models directly from Hugging Face.
*   Listing available backend options for a model.
*   Customizing deployments.
*   Deploying models from local storage.


## Prerequisites

### Clone repository and install software

1. **Clone** the Universal-LLM-NIM Git repository

In [None]:
!git clone git@github.com:NVIDIA-AI-Blueprints/Universal-LLM-NIM.git

2. Verify that the driver and CUDA versions match the following:
- Driver Version: 535.x.x
- CUDA Version 12.2

In [None]:
!nvidia-smi
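
The version check can also be done programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names (`parse_driver_major`, `driver_is_expected`, `query_driver_version`) are hypothetical and introduced only for illustration.

```python
import subprocess


def parse_driver_major(version: str) -> int:
    """Return the major component of a driver version string such as '535.129.03'."""
    return int(version.strip().split(".")[0])


def driver_is_expected(version: str, expected_major: int = 535) -> bool:
    """True when the driver's major version equals the expected one."""
    return parse_driver_major(version) == expected_major


def query_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU it lists."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On a machine with a GPU, `driver_is_expected(query_driver_version())` returns `True` when the installed driver belongs to the 535 series.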

If the driver version reported above doesn't match:
- Update the driver to 535
- Reboot the system
- Set `NGC_API_KEY` again

In [None]:
# !sudo apt install nvidia-driver-535 -y
# !sudo reboot now

### Get API Keys

#### Let's start by logging in to the NVIDIA Container Registry.

An NVIDIA NGC API key is required to use this blueprint. It is needed to log in to the NVIDIA container registry, nvcr.io, and to pull the secure container images used in this NVIDIA NIM Blueprint.
Refer to [Generating NGC API Keys](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) in the NVIDIA NGC User Guide for more information.



Authenticate with the NVIDIA Container Registry with the following command:

In [None]:
import os

os.environ["NGC_API_KEY"] = "*****" # Replace with your key

In [None]:
%%bash
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

You'll also need a [Hugging Face token](https://huggingface.co/settings/tokens) to download the models in this notebook.

In [None]:
import getpass
import os

os.environ["HF_USERNAME"] = "*****" # Replace with your HuggingFace username

In [None]:
if not os.environ.get("HF_TOKEN", "").startswith("hf_"):
    hf_token = getpass.getpass("Enter your Hugging Face token: ")
    assert hf_token.startswith("hf_"), "Not a valid key"
    os.environ["HF_TOKEN"] = hf_token
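
The prompt-if-missing pattern used for `HF_TOKEN` could also be applied to `NGC_API_KEY`, so that neither key has to be typed into a cell in plain text. A minimal sketch follows; `ensure_secret` is a hypothetical helper, and treating a value made only of `*` characters as "unset" is an assumption based on the placeholders used in this notebook.

```python
import getpass
import os


def ensure_secret(name, prompt=getpass.getpass, env=os.environ):
    """Return env[name], prompting only when it is unset or still a '*' placeholder."""
    value = env.get(name, "")
    if not value or set(value) == {"*"}:
        # Hidden input: the key is not echoed or stored in the notebook source.
        value = prompt(f"Enter {name}: ")
        env[name] = value
    return value
```

Calling `ensure_secret("NGC_API_KEY")` before the `docker login` cell would prompt once and reuse the stored value afterwards.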

Update the Docker storage path to ephemeral storage so that images are stored under `/ephemeral/cache/docker`:

In [None]:
import json, os, subprocess, time

storage_path = "/ephemeral/cache/docker"
daemon_file = "/etc/docker/daemon.json"  # update the path if required

# Load the existing daemon configuration, if any, so other settings are preserved.
config = {}
try:
    if os.path.exists(daemon_file):
        with open(daemon_file) as f:
            config = json.load(f)
except PermissionError:
    print("Cannot read the file. Try running with elevated privileges or check the Docker daemon file path.")

config["data-root"] = storage_path
config_str = json.dumps(config, indent=4)

# Write through `sudo tee`, passing the JSON on stdin to avoid shell-quoting issues.
subprocess.run(["sudo", "tee", daemon_file], input=config_str, text=True,
               stdout=subprocess.DEVNULL, check=True)
subprocess.run(["sudo", "systemctl", "restart", "docker"], check=True)

time.sleep(5)  # give the daemon a moment to come back up

# Verify new storage location
print(subprocess.run("docker info | grep 'Docker Root Dir'", shell=True, capture_output=True, text=True).stdout)

### Downloading model to local storage

You will use Qwen2.5-0.5B, a lightweight LLM, later in Example 4.

In [None]:
!mkdir -p /ephemeral/models/Qwen2.5-0.5B

In [None]:
!git clone https://$HF_USERNAME:$HF_TOKEN@huggingface.co/Qwen/Qwen2.5-0.5B \
    /ephemeral/models/Qwen2.5-0.5B

## Deployment Examples

Let's explore different ways to deploy models using NIM.

### Example 1: Basic Deployment from Hugging Face

This example shows how to deploy Codestral-22B, a powerful code generation model, directly from Hugging Face. Note that you need to accept the model's access agreement before you can use this model. To accept the agreement, you may visit https://huggingface.co/mistralai/Codestral-22B-v0.1.

In [None]:
!sudo chown -R $(whoami) /ephemeral/cache

In [None]:
os.environ["CONTAINER_NAME"] = "Universal-LLM-NIM"
os.environ['NIM_IMAGE'] = "***" # TODO: Need to change to public URL
os.environ["LOCAL_NIM_CACHE"] = os.path.expanduser("/ephemeral/cache/nim")
os.makedirs(os.environ["LOCAL_NIM_CACHE"], exist_ok=True)

In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

After running the following cell, you should be able to see the `Universal-LLM-NIM` container running.

In [None]:
!docker ps

While the LLM NIM service is getting ready, you may run the following cell to see live logs.

<div class="alert alert-block alert-success">
    <b>Note:</b>  LLM NIM service could take several minutes to pull the model from Hugging Face and to get ready.
 </div>

In [None]:
# # Uncomment the entire cell to see live logs if interested. Manually stop the cell once LLM NIM service is ready.
# !docker logs -f $CONTAINER_NAME

The cell below waits until the LLM NIM service is ready before proceeding.

In [None]:
import time

import requests

def check_service_ready():
    url = 'http://localhost:8000/v1/health/ready'  # make sure the LLM NIM port is correct
    headers = {'accept': 'application/json'}
    
    while True:
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200 and response.json().get("message") == "Service is ready.":
                print("Service is ready.")
                break
            else:
                print("Service is not ready. Waiting for 30 seconds...")
        except requests.ConnectionError:
            print("Service is not ready. Waiting for 30 seconds...")
        time.sleep(30)

check_service_ready()
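
`check_service_ready()` loops forever if the container fails to start. A bounded variant is sketched below using only the standard library. It assumes the same health endpoint and ready message as above; `probe_ready` and `wait_until_ready` are hypothetical names, and the 30-minute default timeout is an arbitrary choice.

```python
import json
import time
import urllib.error
import urllib.request


def probe_ready(url):
    """Return True when the health endpoint answers HTTP 200 with the ready message."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return (resp.status == 200 and isinstance(body, dict)
                    and body.get("message") == "Service is ready.")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, HTTP error status, or a non-JSON body: not ready yet.
        return False


def wait_until_ready(url, timeout_s=1800, interval_s=30,
                     probe=probe_ready, sleep=time.sleep):
    """Poll probe(url) until it succeeds or timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_until_ready("http://localhost:8000/v1/health/ready")` returns `False` instead of blocking indefinitely, which is a cue to inspect the output of `docker logs`.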

Once your model is deployed, you can interact with it using the REST API. Here's an example of how to make requests:

In [None]:
import requests

def generate_text(model, prompt, max_tokens=250):
    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": model,
            "prompt": prompt, 
            "max_tokens": max_tokens
        }
    )
    return response.json()

# Example usage
result = generate_text(model="mistralai/Codestral-22B-v0.1",
                       prompt="Write me a function that computes fibonacci in Rust")
print(result['choices'][0]['text'])
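
If the service returns something other than a completion, indexing `result['choices']` fails with a bare `KeyError` that hides the server's message. A minimal sketch of a more defensive accessor, assuming the response follows the `choices[0].text` shape used above; `extract_completion_text` is a hypothetical helper, not part of NIM.

```python
def extract_completion_text(payload):
    """Return the first completion's text, or raise with the full payload for debugging."""
    choices = payload.get("choices") if isinstance(payload, dict) else None
    if not choices:
        raise RuntimeError(
            f"Unexpected response from the completions endpoint: {payload!r}"
        )
    return choices[0].get("text", "")
```

With this helper, `print(extract_completion_text(result))` prints the generated text or surfaces whatever the server sent back.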

Before we move on to the next example, let's stop the LLM NIM service.

In [None]:
!docker stop $CONTAINER_NAME

### Example 2: Deployment Using Available Backend Options

NIM supports multiple backends for model deployment. Let's see how to specify different backends:

In [None]:
# Using TensorRT-LLM backend by specifying the NIM_MODEL_PROFILE parameter
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_MODEL_PROFILE="tensorrt_llm" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# # For live logs
# !docker logs -f $CONTAINER_NAME

The cell below waits until the LLM NIM service is ready before proceeding.

<div class="alert alert-block alert-success">
    <b>Note:</b>  LLM NIM service could take several minutes to get ready.
 </div>

In [None]:
check_service_ready()

Let's try out the LLM NIM service backed by TRT-LLM.

In [None]:
result = generate_text(model="mistralai/Codestral-22B-v0.1",
                       prompt="Write me a function that computes fibonacci in Python")
print(result['choices'][0]['text'])

Before we move on to the next example, let's stop the LLM NIM service.

In [None]:
!docker stop $CONTAINER_NAME

In [None]:
# Using vLLM backend by specifying the NIM_MODEL_PROFILE parameter
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_MODEL_PROFILE="vllm" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# # For live logs
# !docker logs -f $CONTAINER_NAME

The cell below waits until the LLM NIM service is ready before proceeding.

<div class="alert alert-block alert-success">
    <b>Note:</b>  LLM NIM service could take several minutes to get ready.
 </div>

In [None]:
check_service_ready()

Let's try out the LLM NIM service backed by vLLM.

In [None]:
result = generate_text(model="mistralai/Codestral-22B-v0.1",
                       prompt="Write me a function that computes fibonacci in C++")
print(result['choices'][0]['text'])
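
Because the same model has now been served by two backends, a rough wall-clock timing of a single request can give a first impression of how they compare. The sketch below is only illustrative: `timed` is a hypothetical helper, and a single request is not a benchmark.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with perf_counter."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping the request as `result, seconds = timed(generate_text, model=..., prompt=...)` while each backend is running yields two numbers to compare, provided the prompt and `max_tokens` are identical.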

Before we move on to the next example, let's stop the LLM NIM service.

In [None]:
!docker stop $CONTAINER_NAME

### Example 3: Customizing Model Parameters

You can customize various model parameters to optimize performance and resource usage. Here are some common parameters you might adjust:

* `NIM_TENSOR_PARALLEL_SIZE`: Number of GPUs across which the model is split using tensor parallelism. Increasing this can improve performance but requires that many GPUs to be available.
* `NIM_MAX_BATCH_SIZE`: Maximum number of samples to process in a single batch. Larger batch sizes can improve throughput but will also require more memory.
* `NIM_MAX_INPUT_LENGTH`: Maximum length of input sequences. Adjusting this can help manage memory usage and processing time, especially for very long inputs.
* `NIM_MAX_OUTPUT_LENGTH`: Maximum length of output sequences. This helps control the length of generated outputs, which can be important for tasks like text generation.
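
Each of these parameters reaches the container as a `-e NAME=value` flag on `docker run`. The sketch below shows one way to keep the settings in a dictionary and expand them into flags; `docker_env_flags` is a hypothetical helper, and the values are the ones used in this example.

```python
def docker_env_flags(settings):
    """Turn {"NAME": value} into a flat ["-e", "NAME=value", ...] argument list."""
    flags = []
    for name, value in settings.items():
        flags += ["-e", f"{name}={value}"]
    return flags


tuning = {
    "NIM_TENSOR_PARALLEL_SIZE": 2,
    "NIM_MAX_BATCH_SIZE": 16,
    "NIM_MAX_INPUT_LENGTH": 2048,
    "NIM_MAX_OUTPUT_LENGTH": 512,
}

# Print the flags in the form they would take on the docker run command line.
print(" ".join(docker_env_flags(tuning)))
```

Keeping the settings in one place makes it easier to vary a single parameter between runs without editing the full command.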

In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_TENSOR_PARALLEL_SIZE=2 \
 -e NIM_MAX_BATCH_SIZE=16 \
 -e NIM_MAX_INPUT_LENGTH=2048 \
 -e NIM_MAX_OUTPUT_LENGTH=512 \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# # For live logs
# !docker logs -f $CONTAINER_NAME

The cell below waits until the LLM NIM service is ready before proceeding.

<div class="alert alert-block alert-success">
    <b>Note:</b>  LLM NIM service could take several minutes to get ready.
 </div>

In [None]:
check_service_ready()

Let's try out the LLM NIM service with custom parameters.

In [None]:
result = generate_text(model="mistralai/Codestral-22B-v0.1",
                       prompt="Write me a function that computes fibonacci in JavaScript")
print(result['choices'][0]['text'])

Before we move on to the next example, let's stop the LLM NIM service.

In [None]:
!docker stop $CONTAINER_NAME

### Example 4: Deployment from Local Model

This example shows how to deploy Qwen2.5-0.5B, a lightweight language model, from the local copy downloaded in the Prerequisites section.

Check that we have the model files in the correct directory.

In [None]:
!ls /ephemeral/models/Qwen2.5-0.5B
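
A directory listing alone does not show whether the weight files were actually downloaded: when a repository is cloned without Git LFS installed, large files are left as small text pointers. The sketch below flags such files by checking for the Git LFS pointer header; `find_lfs_pointers` is a hypothetical helper and only inspects the top level of the directory.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return names of files that are Git LFS pointer stubs rather than real content."""
    stubs = []
    for path in sorted(Path(model_dir).iterdir()):
        if path.is_file():
            with path.open("rb") as fh:
                if fh.read(len(LFS_HEADER)) == LFS_HEADER:
                    stubs.append(path.name)
    return stubs
```

An empty list from `find_lfs_pointers("/ephemeral/models/Qwen2.5-0.5B")` suggests the files are real; any names returned would need to be fetched with `git lfs pull` before deployment.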

In [None]:
os.environ["LOCAL_MODEL_DIR"] = "/ephemeral/models/Qwen2.5-0.5B"

In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e NIM_MODEL_NAME="/opt/models/Qwen2.5-0.5B" \
 -e NIM_SERVED_MODEL_NAME="Qwen/Qwen2.5-0.5B" \
 -v "$LOCAL_MODEL_DIR:/opt/models/Qwen2.5-0.5B" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# # For live logs
# !docker logs -f $CONTAINER_NAME

The cell below waits until the LLM NIM service is ready before proceeding.

<div class="alert alert-block alert-success">
    <b>Note:</b>  LLM NIM service could take several minutes to get ready.
 </div>

In [None]:
check_service_ready()

Let's try out the LLM NIM service deployed with a local model.

In [None]:
result = generate_text(model="Qwen/Qwen2.5-0.5B",
                       prompt="Once upon a time ")
print(result['choices'][0]['text'])

Before we finish, let's stop the LLM NIM service.

In [None]:
!docker stop $CONTAINER_NAME