# Deploy Any LLM with NIM

This notebook demonstrates how to deploy almost any Large Language Model (LLM) using NVIDIA NIM. NIM provides a streamlined way to deploy and serve LLMs with optimized performance and flexibility.

## Introduction

Deploying various LLMs often involves working with multiple inference frameworks and manual optimization, which can be time-consuming. NIM simplifies this by providing a consistent interface and automatically handling model analysis, backend selection, and configuration.

This tutorial covers:
*   Understanding how NIM handles different model formats.
*   Deploying models directly from Hugging Face.
*   Listing available backend options for a model.
*   Deploying models from local storage.
*   Customizing deployments.


# Getting Started
>[Prerequisites](#Prerequisites)  
>[Deployment Examples](#Deployment-Examples)  
>[Using the Deployed Model](#Using-the-Deployed-Model)  
>[API Reference](#API-Reference)  
>[Conclusion](#Conclusion)  
>[Further Reading](#Further-Reading)  
________________________


## Prerequisites

### Clone repository and install software

1. **Clone** the Universal-LLM-NIM Git repository

In [None]:
!git clone git@github.com:NVIDIA-AI-Blueprints/Universal-LLM-NIM.git

2. Install **[Docker](https://docs.docker.com/engine/install/ubuntu/)**

3. Install **[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-the-nvidia-container-toolkit)** to configure Docker for GPU-accelerated containers, such as NVIDIA NIM.
 If you are using a system deployed with Brev, you can skip this step because Brev systems come with the NVIDIA Container Toolkit preinstalled.



<div class="alert alert-block alert-info">
    <b>Note:</b> After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.
</div>
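
Before moving on, it can help to confirm that the tools from the steps above are visible from the notebook. The following is a minimal sketch, not part of the blueprint: the helper names `count_gpus` and `check_prerequisites` are made up for illustration. It checks only that the `docker` and `nvidia-smi` executables are on `PATH` and counts the GPUs that `nvidia-smi -L` lists. It does not verify that the NVIDIA Container Toolkit is configured for Docker.

```python
import shutil
import subprocess


def count_gpus(listing: str) -> int:
    """Count GPUs in the text printed by `nvidia-smi -L` (one "GPU N: ..." line per GPU)."""
    return sum(1 for line in listing.splitlines() if line.strip().startswith("GPU "))


def check_prerequisites() -> dict:
    """Report which tools are on PATH and how many GPUs the driver can see."""
    report = {
        "docker": shutil.which("docker") is not None,
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "gpus": 0,
    }
    if report["nvidia-smi"]:
        # check=False: a failing driver should show up as 0 GPUs, not as an exception.
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
        report["gpus"] = count_gpus(result.stdout)
    return report


print(check_prerequisites())
```

The GPU count is also a useful upper bound when choosing a tensor parallel size later in this notebook.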

### Get API Keys

#### Let's start by logging in to the NVIDIA Container Registry
 
An NVIDIA NGC API key is required to use this blueprint. You need it to log in to the NVIDIA Container Registry, nvcr.io, and to pull the secure container images used in this NVIDIA NIM Blueprint.
Refer to [Generating NGC API Keys](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) in the NVIDIA NGC User Guide for more information.



Authenticate with the NVIDIA Container Registry with the following command:

In [None]:
!docker login nvcr.io

<div class="alert alert-block alert-info">
    <b>Note:</b> Use <code>$oauthtoken</code> as the username and your API key as the password. The <code>$oauthtoken</code> username is a special name that indicates that you will authenticate with an API key and not a user name and password.
</div>
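
The interactive prompt from `docker login` may not work well in every notebook front end. As an alternative, the sketch below feeds the API key to Docker on standard input using the `--password-stdin` flag of `docker login`. The helper names and the `NGC_API_KEY` environment variable are assumptions made for this example, not something the blueprint defines.

```python
import subprocess


def docker_login_command(registry: str = "nvcr.io") -> list:
    """Build a `docker login` argument list that reads the password from stdin."""
    # "$oauthtoken" is passed literally: no shell is involved, so it is never expanded.
    return ["docker", "login", registry, "--username", "$oauthtoken", "--password-stdin"]


def docker_login(api_key: str, registry: str = "nvcr.io") -> int:
    """Run the login with the API key on stdin and return Docker's exit code."""
    result = subprocess.run(
        docker_login_command(registry), input=api_key, text=True, check=False
    )
    return result.returncode


# Show the command that would be executed; nothing is run at this point.
print(" ".join(docker_login_command()))
```

To log in, call `docker_login(os.environ["NGC_API_KEY"])` after exporting the key under that (assumed) variable name. Passing the key on stdin keeps it out of the process list and the notebook output.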

You'll also need a [Hugging Face token](https://huggingface.co/settings/tokens) to download some of the models in this notebook.

In [None]:
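import getpass
import os

# Prompt for a token only if HF_TOKEN is not already set to a plausible value.
if not os.environ.get("HF_TOKEN", "").startswith("hf_"):
    hf_token = getpass.getpass("Enter your Hugging Face token: ")
    # Raise instead of assert so the check still runs when Python is started with -O.
    if not hf_token.startswith("hf_"):
        raise ValueError("Not a valid token: Hugging Face tokens start with 'hf_'")
    os.environ["HF_TOKEN"] = hf_token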

## Deployment Examples

Let's explore different ways to deploy models using NIM.


### Example 1: Basic Model Deployment from Hugging Face or Local Filesystem

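This example deploys Codestral-22B, a code generation model, directly from Hugging Face. You must accept the model's access agreement on Hugging Face before you can download it. The commands in this notebook read the container image name from `NIM_IMAGE`, so set that variable to the NIM image you intend to run before executing them.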

In [None]:
# Deploying directly from Hugging Face
!docker run --rm --gpus all \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  $NIM_IMAGE nim-run --model "hf://mistralai/Codestral-22B-v0.1"

If your model is already downloaded locally, you can simply point nim-run to where it exists on your filesystem:

In [None]:
# Deploying a model that is available on the local filesystem
!docker run --rm --gpus all \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -v /path/to/model/dir:/path/to/model/dir \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  $NIM_IMAGE nim-run --model "/path/to/model/dir/mistralai-Codestral-22B-v0.1"

Once the model is downloaded or loaded from the local filesystem, NIM identifies it as a full-precision Mistral model, selects a backend (typically TensorRT-LLM for best performance), and configures the server for your hardware. The commands above set `NIM_TENSOR_PARALLEL_SIZE=1`; increase this value if you intend to use multiple GPUs for deployment. To see the full list of supported arguments, run `nim-run --help` in the container. The deployed model is served at http://localhost:8000.
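
Downloading and loading a 22B-parameter model can take a while, so requests sent right after `docker run` starts are likely to fail. The sketch below polls the server until it answers. It assumes the server exposes an OpenAI-style `GET /v1/models` route next to `/v1/completions`; that route is an assumption here, so substitute whichever health or readiness URL your NIM version documents. The helper names `http_probe` and `wait_until_ready` are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all mean "not ready yet".
        return False


def wait_until_ready(url, probe=http_probe, attempts=60, delay=10.0):
    """Call `probe(url)` until it succeeds; return the attempt number, or None on give-up."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None
```

For example, `wait_until_ready("http://localhost:8000/v1/models")` waits up to roughly ten minutes with the default settings. The `probe` argument is injectable so the retry loop can be exercised without a running server.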

### Example 2: Exploring Available Backend Options

NIM supports multiple backends for model deployment. Pass `--backend` to `nim-run` to choose one explicitly instead of relying on automatic selection:

In [None]:
# Using TensorRT-LLM backend
!docker run --rm --gpus all \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  $NIM_IMAGE nim-run --model "hf://mistralai/Codestral-22B-v0.1" --backend tensorrt-llm

In [None]:
# Using vLLM backend
!docker run --rm --gpus all \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  $NIM_IMAGE nim-run --model "hf://mistralai/Codestral-22B-v0.1" --backend vllm

### Example 3: Customizing Model Parameters

You can customize various model parameters to optimize performance and resource usage:

In [None]:
# Deploying with custom parameters
!docker run --rm --gpus all \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  -e NIM_MAX_BATCH_SIZE=32 \
  -e NIM_MAX_INPUT_LENGTH=2048 \
  -e NIM_MAX_OUTPUT_LENGTH=512 \
  $NIM_IMAGE nim-run --model "hf://mistralai/Codestral-22B-v0.1"
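
The `docker run` commands in Examples 1 through 3 differ only in the `--backend` flag and the extra `-e` settings. One way to avoid copy-paste drift is to assemble the argument list in Python. This is a sketch under stated assumptions: `build_nim_command` is a hypothetical helper, `<your-nim-image>` is a placeholder rather than a real image name, and only flags and variables that already appear in the commands above are used. Unlike those commands, it passes `-e HF_TOKEN` without a value, which tells Docker to copy the variable from the calling environment so the token does not appear on the command line.

```python
import os
import shlex


def build_nim_command(image, model, backend=None, extra_env=None, cache_dir=None):
    """Assemble a `docker run` argument list mirroring the examples in this notebook."""
    cwd = os.getcwd()
    cache_dir = cache_dir or os.path.join(cwd, "nim_cache")
    env = {"NIM_TENSOR_PARALLEL_SIZE": "1"}
    env.update(extra_env or {})  # e.g. NIM_MAX_BATCH_SIZE from Example 3
    argv = [
        "docker", "run", "--rm", "--gpus", "all",
        "--network=host",
        "-u", str(os.getuid()),
        "-v", f"{cache_dir}:/opt/nim/.cache",
        "-v", f"{cwd}:{cwd}",
        "-e", "HF_TOKEN",  # name only: Docker takes the value from the caller's environment
    ]
    for name, value in env.items():
        argv += ["-e", f"{name}={value}"]
    argv += [image, "nim-run", "--model", model]
    if backend is not None:
        argv += ["--backend", backend]
    return argv


# Placeholder image name; replace it with the NIM image you actually use.
image = os.environ.get("NIM_IMAGE", "<your-nim-image>")
command = build_nim_command(
    image,
    "hf://mistralai/Codestral-22B-v0.1",
    backend="vllm",
    extra_env={"NIM_MAX_BATCH_SIZE": 32, "NIM_MAX_INPUT_LENGTH": 2048},
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` directly, which sidesteps shell quoting issues with paths that contain spaces.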

## Using the Deployed Model

Once your model is deployed, you can interact with it using the REST API. Here's an example of how to make requests:

In [None]:
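import requests

def generate_text(prompt, max_tokens=100):
    """Send a completion request to the local NIM endpoint and return the parsed JSON."""
    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "top_p": 0.9
        },
        timeout=120,  # do not hang indefinitely if the server is unreachable
    )
    response.raise_for_status()  # surface HTTP errors instead of failing later on a missing key
    return response.json()

# Example usage
result = generate_text("Write a Python function to calculate fibonacci numbers:")
print(result['choices'][0]['text'])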
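
Indexing `result['choices'][0]['text']` directly raises a `KeyError` when the server returns something other than a successful completion. The sketch below reads the response more defensively. The `choices` and `text` fields come from the example above. The top-level `error` field is an assumption based on OpenAI-style error bodies, and `extract_completions` is a hypothetical helper name.

```python
def extract_completions(payload: dict) -> list:
    """Return the `text` of every choice, or raise if the payload carries an error."""
    if "error" in payload:
        # Assumed error shape; adjust to whatever your server actually returns.
        raise RuntimeError(f"server returned an error: {payload['error']}")
    return [choice.get("text", "") for choice in payload.get("choices", [])]


# Hand-written sample payload, used only to illustrate the expected shape.
sample = {"choices": [{"index": 0, "text": "def fib(n): ..."}]}
print(extract_completions(sample))
```

With a live server, the same helper can be applied to the dictionary returned by `generate_text`.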

## API Reference

For detailed API references, please refer to the following locations in the Blueprint repository:
- Summary & Conversation APIs:
`./docs/api_references/analytics_server.json`

- Generate API:
`./docs/api_references/agent_server.json`


## Conclusion

NIM significantly simplifies deploying a wide variety of LLMs by automating model analysis, backend selection, and optimization. It provides a consistent and efficient workflow for AI builders, enabling rapid experimentation and deployment.

## Further Reading

*   [NIM documentation](https://docs.nvidia.com/nim/)
*   [Supported model architectures](https://docs.nvidia.com/nim/supported-models)
*   [Backend selection details](https://docs.nvidia.com/nim/backends)
*   [NVIDIA AI forums](https://forums.developer.nvidia.com/c/ai-deep-learning/nemo-and-generative-ai/)