## Deploy NVIDIA NIM to GCP Vertex AI

### Objective

NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI models across the cloud, data center, and workstations. NIMs are categorized by model family and on a per-model basis. For example, NVIDIA NIM for large language models (LLMs) brings the power of state-of-the-art LLMs to enterprise applications, providing natural language processing and understanding capabilities.

In this notebook, you learn how to run an NVIDIA NIM container on Google Cloud Vertex AI, send inference requests to get customized responses, and deploy the model to a Vertex AI endpoint.

This tutorial uses the following NVIDIA NIM and Vertex AI services:

- NVIDIA NIM Container
- Vertex AI Model resource
- Vertex AI Model Registry
- Vertex AI Endpoint resource
- Vertex AI Prediction
- Artifact Registry
- Cloud Storage

The steps performed include:

- Pull NVIDIA NIM container from NGC.
- Push NVIDIA NIM container to Artifact Registry.
- Run the NIM container and send inference requests from within the notebook interface.
- Upload NIM container as a Vertex AI Model resource.
- Create a Vertex AI Endpoint resource.
- Deploy the Model resource to the Endpoint resource.
- Generate prediction responses from Endpoint resource.


### Install and Import packages

In [None]:
# ! pip3 install --upgrade --user google-cloud-aiplatform

In [None]:
# ! pip3 install -r requirements.txt

In [None]:
# Restart kernel after installs so that the environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [None]:
import google.cloud.aiplatform as aiplatform
import google.cloud.aiplatform_v1beta1 as aip_beta

from google.cloud.aiplatform import Endpoint, Model
from google.api_core.exceptions import InvalidArgument

### Authenticate to Google Cloud
Depending on your Jupyter environment, follow the instructions below to authenticate to Google Cloud.

* Vertex AI Workbench

In [None]:
! gcloud auth login
! gcloud auth application-default login

* Colab

In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

### Set Up

This example deploys the `llama3-8b-instruct` NIM from a Vertex AI Workbench notebook running on a `g2-standard-24` instance with NVIDIA L4 GPUs.

IAM role requirements:
* Artifact Registry Repository Administrator `(roles/artifactregistry.repoAdmin)` 
* Storage Admin `(roles/storage.admin)`

In [None]:
# Get account name
import requests
gcloud_token = !gcloud auth print-access-token
gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
account_email = gcloud_tokeninfo['email']
account_name = gcloud_tokeninfo['email'].split('@')[0]
print(account_email)
print(account_name)
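
The account name above is reused later as the bucket name and the private repository name. A minimal sketch of a sanitizer follows, assuming that those resource names should contain only lowercase letters, digits, and hyphens; the function name `to_resource_safe_name` is hypothetical, and the exact naming rules should be checked against the Cloud Storage and Artifact Registry documentation.

```python
import re


def to_resource_safe_name(local_part: str, max_len: int = 63) -> str:
    """Lowercase the name and replace unsupported characters with hyphens.

    Assumption: the target resource accepts lowercase letters, digits,
    and hyphens, and names of 3 to 63 characters.
    """
    name = re.sub(r"[^a-z0-9-]", "-", local_part.lower())
    # Collapse runs of hyphens and trim them from both ends.
    name = re.sub(r"-+", "-", name).strip("-")
    if len(name) < 3:
        raise ValueError(f"{local_part!r} is too short to use as a resource name")
    return name[:max_len].rstrip("-")


# Hypothetical email local part, for illustration only.
print(to_resource_safe_name("Jane.Doe+nim"))
```

If the sanitized value differs from `account_name`, it may be safer to use the sanitized value when setting `private_repository` and `bucket_url`.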

In [None]:
# NIM: llama3-8b-instruct
region = "us-central1" # please set here
project_id = None # please set here
public_repository = None # please set here
private_repository = account_name
bucket_url = f"gs://{account_name}"

nim_model = "nim:llama3-8b-instruct-1.0.0"
# NIM in NGC
ngc_nim_image = "nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"
# NIM in Artifact Registry
public_nim_image = f"{region}-docker.pkg.dev/{project_id}/{public_repository}/{nim_model}"
private_nim_image = f"{region}-docker.pkg.dev/{project_id}/{private_repository}/{nim_model}"

va_model_name = "nim-llama3-8b-instruct"

selected_profile = "vllm-fp16-tp2"
machine_type = "g2-standard-24"
accelerator_type = "NVIDIA_L4"
accelerator_count = 2

endpoint_name = va_model_name + "_endpoint"
payload_model = "meta/llama3-8b-instruct"
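
Because `project_id` and `public_repository` start out as `None`, the image paths above can silently contain the string `None` if a value is left unset. The sketch below shows one way to build the Artifact Registry path with a check for missing values; `artifact_registry_image` is a hypothetical helper, and `my-project` and `my-repo` are placeholder values.

```python
def artifact_registry_image(region, project_id, repository, image):
    """Build an Artifact Registry image path, failing early on unset values."""
    values = {
        "region": region,
        "project_id": project_id,
        "repository": repository,
        "image": image,
    }
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError("Set these values first: " + ", ".join(missing))
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}"


# Placeholder project and repository names, for illustration only.
print(
    artifact_registry_image(
        "us-central1", "my-project", "my-repo", "nim:llama3-8b-instruct-1.0.0"
    )
)
```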

If the Cloud Storage bucket or Artifact Registry repositories don't already exist, run the following cell to create them.

In [None]:
! gsutil mb -l {region} -p  {project_id} {bucket_url}
! gcloud artifacts repositories create {public_repository} --repository-format=docker --location={region}
! gcloud artifacts repositories add-iam-policy-binding {public_repository} --location={region} --member=allUsers --role=roles/artifactregistry.repoAdmin
! gcloud artifacts repositories create {private_repository} --repository-format=docker --location={region}

Initialize Vertex AI SDK for Python

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=project_id, location=region, staging_bucket=bucket_url)

GCP Configuration

In [None]:
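def run_bash_cmd(cmd):
    """Run a shell string or an argument list, print its output, and raise on failure."""
    import subprocess

    if not isinstance(cmd, (str, list)):
        raise TypeError("cmd must be a shell string or a list of arguments")

    # A string is run through the shell; a list is executed directly.
    # stderr is merged into stdout, so failures are detected via the exit code.
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        shell=isinstance(cmd, str),
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}")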

In [None]:
bash_cmd = f"""
    export region={region}
    gcloud config set ai_platform/region {region}
    gcloud config set project {project_id}
    gcloud auth configure-docker {region}-docker.pkg.dev
    """
run_bash_cmd(bash_cmd)

### NIM Container

* **NGC_API_KEY**

To access NIM container from NGC catalog, `NGC_API_KEY` is required.

The credential is passed to Vertex AI as an environment variable during model upload and is shown in the Model Registry Version Details UI. **Attention: the credential will be visible to all Vertex AI users in the same project.**

To use the `read_key()` function below, upload a JSON file to your Cloud Storage bucket in the format `{"NGC_API_KEY": "<your key>"}`.

Reference: [NGC User Guide](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html)
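
As an illustration of the key file, the sketch below writes a JSON object with a single `NGC_API_KEY` field to a local temporary file and reads it back. The helper names `write_key_file` and `load_key_file` and the file name `ngc_key.json` are hypothetical; uploading the file to Cloud Storage is a separate step that is not shown here.

```python
import json
import tempfile
from pathlib import Path


def write_key_file(path, api_key):
    """Write the key as a one-field JSON object."""
    Path(path).write_text(json.dumps({"NGC_API_KEY": api_key}))


def load_key_file(path, key_name="NGC_API_KEY"):
    """Read the JSON file and return the value stored under key_name."""
    data = json.loads(Path(path).read_text())
    if not data.get(key_name):
        raise KeyError(f"{key_name} is missing or empty in {path}")
    return data[key_name]


# Round trip with a placeholder value; never commit a real key to source control.
with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / "ngc_key.json"
    write_key_file(key_path, "placeholder-key")
    print(load_key_file(key_path))
```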

* **Artifact Registry**

We first pull the NIM image from NGC, then push it to a public AR repository. This lets all accounts in the project access the NIM image.

(Optional) We then pull the NIM image from the public AR repository and push it to a private AR repository. This allows the NIM image to be modified without affecting the original.


In [None]:
from google.cloud import storage
import json

def read_key(bucket_name, blob_name, key_name):
    """Read a JSON blob from GCS using file-like IO and return the value stored under key_name."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"

    # The ID of your new GCS object
    # blob_name = "storage-object-name"
    
    # The ID of your NGC key
    # key_name = "NGC_API_KEY"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    with blob.open("r") as f:
        data = json.loads(f.read())

    return data[key_name]

In [None]:
# Set NGC API KEY
import json
json_file = None # please set here
NGC_API_KEY = read_key(account_name, json_file, "NGC_API_KEY")

assert NGC_API_KEY is not None, "NGC API KEY is not set. Please set the NGC_API_KEY variable. It's required for running NIM."

Pull NIM from NGC and Push to GCP AR

In [None]:
# Login to NGC
from pathlib import Path
container_name="llama3-8B-Instruct"
local_nim_cache=str(Path(".cache/nim").absolute())

bash_cmd = f"""
    sudo apt-get install -y nvidia-docker2
    export NGC_API_KEY={NGC_API_KEY}
    echo "export NGC_API_KEY={NGC_API_KEY}" >> ~/.bashrc
    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

    export LOCAL_NIM_CACHE={local_nim_cache}
    mkdir -p "$LOCAL_NIM_CACHE"
    echo "Local NIM cache created"
    """

run_bash_cmd(bash_cmd)

# Pull the NIM image from NGC (if not already present) and start the container
docker_cmd = [
    "docker", "run", "-d", "--rm",
    f"--name={container_name}",
    "--gpus", "all",
    "-e", f"NGC_API_KEY={NGC_API_KEY}",  # pass the key as NAME=value
    "-v", f"{local_nim_cache}:/opt/nim/.cache",
    "-p", "8000:8000",
    ngc_nim_image
]

run_bash_cmd(docker_cmd)
print(f"NIM image {ngc_nim_image} pulled from NGC successfully and container {container_name} started")

# Push NIM image to public AR repository
bash_cmd = f"""
    docker tag {ngc_nim_image} {public_nim_image}

    docker push {public_nim_image}
    """

run_bash_cmd(bash_cmd)
print(f"NIM image {ngc_nim_image} pushed to Artifact Registry {public_nim_image} successfully")

# Optional
# Push NIM image to private AR repository
bash_cmd = f"""
    docker tag {public_nim_image} {private_nim_image}

    docker push {private_nim_image}
    """

run_bash_cmd(bash_cmd)
print(f"NIM image {public_nim_image} pushed to Artifact Registry {private_nim_image} successfully")

### Run NIM Container Within Interface

Run the NIM container in a **Terminal** or in another notebook and keep it active. Then send inference requests with the OpenAI Python API or CLI commands to get model responses in the notebook interface.

In [None]:
# Run NIM container
! docker run -it --rm --name={container_name} \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY={NGC_API_KEY} \
  -e NIM_MODEL_PROFILE={selected_profile} \
  -v "{local_nim_cache}:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  {private_nim_image}

In [None]:
! docker images

In [None]:
! docker ps 
! echo ""
CONTAINER_ID = !docker ps | awk 'NR>1 {print $1}'
CONTAINER_ID = CONTAINER_ID[0]
! echo 'Running Container is' $CONTAINER_ID
! echo 'IP Address'
# ! docker inspect $CONTAINER_ID
IPAddress= !docker exec $CONTAINER_ID sh -c "hostname --ip-address" 
IPAddress=IPAddress[0]
! echo $IPAddress
! echo ""
! echo "NIM Model and Profile"
! docker inspect $CONTAINER_ID |grep -i model

#### Make Inference within Interface
After the NIM container is running, you can send inference requests to the model and read its responses. NIM exposes an OpenAI-compatible API, so both the OpenAI Python client and command-line tools such as `curl` work from Vertex AI Workbench.

With the `/v1/completions` endpoint, set `prompt` to an input string that instructs the model. With the `/v1/chat/completions` endpoint, pass `messages` with roles and contents for multi-turn conversation. Other parameters adjust output length (`max_tokens`), temperature, and so on.

*Note: You may need to change the IP address in the URL when making requests (e.g. http://172.18.0.2:8000/v1/completions).*
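
The container can take a while to load the model before it accepts requests. The sketch below polls a readiness check until it succeeds, assuming the `/v1/health/ready` route returns HTTP 200 once the model is loaded. The helpers `build_base_url`, `http_probe`, and `wait_until_ready` are hypothetical names, and the retry counts are arbitrary.

```python
import time
import urllib.error
import urllib.request


def build_base_url(host: str, port: int = 8000) -> str:
    """Return the base URL for the given container host and port."""
    return f"http://{host}:{port}/v1"


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 10.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Swap in the container IP address if 0.0.0.0 does not respond.
print(build_base_url("172.18.0.2"))
```

A call such as `wait_until_ready(lambda: http_probe(build_base_url("0.0.0.0") + "/health/ready"))` could then be run before the request cells below.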

In [None]:
! curl -X 'POST' \
        'http://0.0.0.0:8000/v1/completions' \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{ "model": "meta/llama3-8b-instruct", \
              "prompt": "Once upon a time","max_tokens": 100}'

In [None]:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
prompt = "Once upon a time"
response = client.completions.create(
    model=payload_model,
    prompt=prompt,
    max_tokens=100,
    stream=False
)
completion = response.choices[0].text
print(completion)

In [None]:
! curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta/llama3-8b-instruct", \
        "messages": [ \
            {"role":"user", \
            "content":"Hello! How are you?"}, \
            {"role":"assistant", \
            "content":"Hi! I am quite well, how can I help you today?"}, \
            {"role":"user", \
            "content":"Write a short limerick about the wonders of GPU computing."} \
            ], \
        "max_tokens": 512 \
        }'

In [None]:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU computing."}
]
chat_response = client.chat.completions.create(
    model=payload_model,
    messages=messages,
    max_tokens=512,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
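
When the request is sent with `curl` instead of the Python client, the reply arrives as raw JSON. The sketch below pulls the assistant text out of such a reply, assuming the OpenAI-style layout in which the text sits at `choices[0].message.content`. The sample body is made up for illustration and is not real model output.

```python
import json


def extract_chat_text(response_json: str) -> str:
    """Return the assistant text from an OpenAI-style chat completion body."""
    body = json.loads(response_json)
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    return choices[0]["message"]["content"]


# Hypothetical response body, trimmed to the fields used here.
sample = json.dumps(
    {
        "model": "meta/llama3-8b-instruct",
        "choices": [
            {"index": 0, "message": {"role": "assistant", "content": "Hello there."}}
        ],
    }
)
print(extract_chat_text(sample))
```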

Stop NIM container

In [None]:
! docker stop $CONTAINER_ID

### Endpoint Deployment

Next, we proceed to endpoint deployment, which makes the model endpoint available on Vertex AI Online Prediction.

Steps are as follows:

* Upload NIM container as a Vertex AI Model resource.
* Create a Vertex AI Endpoint resource.
* Deploy the Model resource to the Endpoint resource.
* Generate raw prediction requests and get responses.

#### Upload NIM as a Vertex AI Model resource

First, we upload the NIM image as a Vertex AI model resource using the `upload()` method, with the following parameters:

*  `display_name`: The human-readable name for the Model resource.

*  `artifact_uri`: The Cloud Storage location of the model artifacts. If the container image includes the model artifacts that you need to serve predictions, there is no need to load files from Cloud Storage.

*  `serving_container_image_uri`: The serving container image to use when the model is deployed to a Vertex AI Endpoint resource.

*  `serving_container_command`: The serving binary (HTTP Server) to start up.

*  `serving_container_shared_memory_size_mb`: Shared memory is an inter-process communication (IPC) mechanism that allows multiple processes to access and manipulate a common block of memory. The default shared memory size is 64MB. Model servers such as vLLM or NVIDIA Triton use shared memory to cache internal data during model inference. Because shared memory can also be used for cross-GPU communication, a larger shared memory size can improve performance on accelerators without NVLink (for example, L4) when the model container requires communication across GPUs. NIM generally requires a larger shared memory size than the default.

*  `serving_container_environment_variables`: The environment variables specify container required settings such as authentication key. 

*  `serving_container_args`: The arguments to pass to the serving binary. For example:

      -- `model_name`: The human readable name to assign to the model.

      -- `model_base_name`: Where to store the model artifacts in the container. The Vertex service sets the variable `AIP_STORAGE_URI` to where the service installed the model artifacts in the container.

      -- `rest_api_port`: The port to which to send REST based prediction requests. NIM uses `8000`.

      -- `port`: The port to which to send gRPC based prediction requests. NIM uses `8000`.

*  `serving_container_health_route`: The URL for the service to periodically ping for a response to verify that the serving binary is running. For NIM, this will be `/v1/health/ready`.

*  `serving_container_predict_route`: The URL for the service to route REST-based prediction requests to. For NIM, this will be `/v1/chat/completions` or `/v1/completions`.

*  `serving_container_ports`: A list of ports for the HTTP server to listen for requests. 

*  `sync`: Whether to wait for the process to complete, or return immediately (async).

Uploading a model to a Vertex AI Model resource may take a few moments. After completion, the model appears in Vertex AI Model Registry.

Reference: [NIM API](https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html) 
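
The container settings described above can be collected and sanity-checked before the upload call. The sketch below assembles them into a dictionary keyed by the `Model.upload()` parameter names used in this notebook; `build_upload_kwargs` is a hypothetical helper, the checks are illustrative rather than the rules Vertex AI enforces, and the image path and key are placeholders.

```python
def build_upload_kwargs(display_name, image_uri, predict_route, health_route,
                        port, shm_mb, env):
    """Validate container settings and return them as upload keyword arguments."""
    for route in (predict_route, health_route):
        if not route.startswith("/"):
            raise ValueError(f"route must start with '/': {route!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    if shm_mb <= 0:
        raise ValueError("shared memory size must be positive")
    return {
        "display_name": display_name,
        "serving_container_image_uri": image_uri,
        "serving_container_predict_route": predict_route,
        "serving_container_health_route": health_route,
        # Environment variable values are passed as strings.
        "serving_container_environment_variables": {k: str(v) for k, v in env.items()},
        "serving_container_shared_memory_size_mb": shm_mb,
        "serving_container_ports": [port],
    }


# Placeholder image path and key, for illustration only.
kwargs = build_upload_kwargs(
    "nim-llama3-8b-instruct",
    "us-central1-docker.pkg.dev/my-project/my-repo/nim:llama3-8b-instruct-1.0.0",
    "/v1/chat/completions",
    "/v1/health/ready",
    8000,
    16000,
    {"NGC_API_KEY": "placeholder-key", "PORT": 8000},
)
print(sorted(kwargs))
```

The resulting dictionary could be passed as `aiplatform.Model.upload(**kwargs, sync=True)`.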

In [None]:
from google.api_core.future.polling import DEFAULT_POLLING
from google.cloud.aiplatform import Endpoint, Model
DEFAULT_POLLING._timeout = 360000

models = Model.list(filter=f'displayName="{va_model_name}"')

if models:
    model = models[0]
else:
    model = aiplatform.Model.upload(
                display_name=va_model_name,
                serving_container_image_uri=private_nim_image,
                serving_container_predict_route="/v1/chat/completions",
                serving_container_health_route="/v1/health/ready",
                serving_container_environment_variables={"NGC_API_KEY": NGC_API_KEY, "PORT": "8000", "shm-size":"16GB"},
                serving_container_shared_memory_size_mb=16000,
                serving_container_ports=[8000],
                sync=True,
            )
model.wait()

print("Model:")
print(f"\tDisplay name: {model.display_name}")
print(f"\tResource name: {model.resource_name}")

In [None]:
! gcloud ai models list --region=$region --filter="DISPLAY_NAME ~ .*nim.*"

In [None]:
MODEL_ID = !gcloud ai models list --region=$region --filter="DISPLAY_NAME ~ .*nim.*" | awk 'NR>1 {print $1}'
MODEL_ID = MODEL_ID[1]
MODEL_ID

#### Create a Vertex AI Endpoint resource

In [None]:
endpoints = Endpoint.list(filter=f'displayName="{endpoint_name}"')
if endpoints:
    endpoint = endpoints[0]
else:
    print(f"Endpoint {endpoint_name} doesn't exist, creating...")
    endpoint = aiplatform.Endpoint.create(display_name=endpoint_name)
print("Endpoint:")
print(f"\tDisplay name: {endpoint.display_name}")
print(f"\tResource name: {endpoint.resource_name}")

In [None]:
! gcloud ai endpoints list --region=$region --filter="DISPLAY_NAME ~ .*nim.*"

In [None]:
ENDPOINT_ID = !gcloud ai endpoints list --region=$region --filter="DISPLAY_NAME ~ .*nim.*" | awk 'NR>1 {print $1}'
ENDPOINT_ID = ENDPOINT_ID[1]
ENDPOINT_ID
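
Parsing `gcloud` output by line position can break when the listing contains more than one match. As an alternative sketch, the numeric ID can be taken from the full resource name that the SDK objects already expose through `resource_name`. The helper `resource_id` is hypothetical, and the resource name in the example is a placeholder.

```python
def resource_id(resource_name: str, collection: str) -> str:
    """Return the trailing ID of a name like projects/P/locations/L/<collection>/ID."""
    parts = resource_name.strip("/").split("/")
    # A well-formed name alternates collection/ID pairs, so its length is even.
    if len(parts) < 2 or len(parts) % 2 != 0 or parts[-2] != collection:
        raise ValueError(f"not a {collection} resource name: {resource_name!r}")
    return parts[-1]


# Placeholder resource name, for illustration only.
print(resource_id("projects/123/locations/us-central1/endpoints/456", "endpoints"))
```

With the objects from this notebook, that would be `resource_id(endpoint.resource_name, "endpoints")` and `resource_id(model.resource_name, "models")`.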

#### Deploy the Vertex AI model resource to a Vertex AI endpoint resource

Next, deploy the Vertex AI model resource to the endpoint resource with the following parameters:

* `deployed_model_display_name`: The human-readable name for the deployed model.

* `traffic_split`: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs.
    * If only one model, then specify `{ "0": 100 }`, where "0" refers to this model being uploaded and 100 means 100% of the traffic.
    * If there are existing models on the endpoint, for which the traffic is split, then use model_id to specify `{ "0": percent, model_id: percent, ... }`, where model_id is the ID of an existing deployed model on the endpoint. The percentages must add up to 100.

* `machine_type`: The machine type for each VM node instance.

* `min_replica_count`: The minimum number of nodes to provision for auto-scaling.

* `max_replica_count`: The maximum number of nodes to provision for auto-scaling.

* `accelerator_type`: The type, if any, of GPU accelerators per provisioned node.

* `accelerator_count`: The number, if any, of GPU accelerators per provisioned node.

After successful deployment, the endpoint and the associated deployed model are available on Vertex AI Online Prediction.
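
The `traffic_split` rule described above (percentages that add up to 100) can be checked before calling `deploy()`. This is a minimal sketch with a hypothetical helper name; the deployed model ID `1234567890` is a placeholder.

```python
def validate_traffic_split(split: dict) -> dict:
    """Check that every share is an integer from 0 to 100 and that they total 100."""
    if not split:
        raise ValueError("traffic split must not be empty")
    for model_key, percent in split.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(f"invalid percentage for {model_key!r}: {percent!r}")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"percentages add up to {total}, expected 100")
    return split


# "0" stands for the model being deployed; the other key is a placeholder ID.
print(validate_traffic_split({"0": 80, "1234567890": 20}))
```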

In [None]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=va_model_name,
    traffic_percentage=100,
    machine_type=machine_type,
    min_replica_count=1,
    max_replica_count=1,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    enable_access_logging=True,
    sync=True,
)
print(f"Model {model.display_name} deployed at endpoint {endpoint.display_name}.")

In [None]:
print(endpoint.gca_resource)
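# Use a separate variable so the display name in endpoint_name is not overwritten
endpoint_resource_name = endpoint.resource_name
print(endpoint_resource_name)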
print(endpoint.list_models())

#### Endpoint Inference

Send requests to the deployed model through the Vertex AI `rawPredict` method, which forwards the request body to the NIM container. The body follows the same OpenAI-compatible schema used in the local tests earlier in this notebook:

* `model`: The model name served by NIM, here `meta/llama3-8b-instruct`.
* `messages`: A list of chat messages, each with a `role` and `content`.
* Generation parameters such as `temperature`, `max_tokens`, `top_p`, and `stream`.

Requests can be sent with the Vertex AI Python SDK or from the CLI. Streaming can be switched on or off with the `stream` parameter.

In [None]:
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU Computing."}
]

payload = {
  "model": payload_model,
  "messages": messages,
  "temperature": 0.2,  # Temperature controls the degree of randomness in token selection.
  "max_tokens": 512,  # Token limit determines the maximum amount of text output.
  "top_p": 0.8,  # Tokens are selected from most probable to least until the sum of their probabilities equals the top_p value.
}

with open("request.json", "w") as outfile: 
    json.dump(payload, outfile)

# Streaming
payload_s = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 512,
  "stream": True
}

with open("request_stream.json", "w") as outfile: 
    json.dump(payload_s, outfile)


Python SDK

In [None]:
import json
from pprint import pprint
from google.api import httpbody_pb2
from google.cloud import aiplatform_v1

http_body = httpbody_pb2.HttpBody(
    data=json.dumps(payload).encode("utf-8"),
    content_type="application/json",
)

req = aiplatform_v1.RawPredictRequest(
    http_body=http_body, endpoint=endpoint.resource_name
)

print('Request')
print(req)
pprint(json.loads(req.http_body.data))
print()

API_ENDPOINT = "{}-aiplatform.googleapis.com".format(region)
client_options = {"api_endpoint": API_ENDPOINT}

pred_client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

response = pred_client.raw_predict(req)
print("--------------------------------------------------------------------------------------")
print('Response')
pprint(json.loads(response.data))

In [None]:
# Streaming

import json
from pprint import pprint
from google.api import httpbody_pb2
from google.cloud import aiplatform_v1

http_body = httpbody_pb2.HttpBody(
    data=json.dumps(payload_s).encode("utf-8"),
    content_type="application/json",
)

req = aiplatform_v1.RawPredictRequest(
    http_body=http_body, endpoint=endpoint.resource_name
)

print('Request')
print(req)
pprint(json.loads(req.http_body.data))
print()

API_ENDPOINT = "{}-aiplatform.googleapis.com".format(region)
client_options = {"api_endpoint": API_ENDPOINT}

pred_client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

response = pred_client.raw_predict(req)
print("--------------------------------------------------------------------------------------")
print('Response')
print(response.data.decode('utf-8'))
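
The streamed reply printed above is a sequence of `data:` lines rather than a single JSON object. The sketch below joins the text pieces, assuming the OpenAI-style streaming layout in which each chunk carries `choices[].delta.content` and the stream ends with `data: [DONE]`. The helper name and the sample stream are hypothetical.

```python
import json


def join_stream_chunks(raw: str) -> str:
    """Concatenate the content pieces from an OpenAI-style event stream."""
    pieces = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                pieces.append(piece)
    return "".join(pieces)


# Made-up stream, for illustration only.
sample_stream = (
    'data: {"choices":[{"delta":{"role":"assistant"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":" GPU"}}]}\n\n'
    "data: [DONE]\n"
)
print(join_stream_chunks(sample_stream))
```

With the response from the cell above, the call would be `join_stream_chunks(response.data.decode("utf-8"))`.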

CLI

In [None]:
! curl \
    --request POST \
    --header "Authorization: Bearer $(gcloud auth print-access-token)" \
    --header "Content-Type: application/json" \
    https://{region}-aiplatform.googleapis.com/v1/projects/$project_id/locations/$region/endpoints/$ENDPOINT_ID:rawPredict \
    --data "@request.json"

In [None]:
# Streaming
! curl \
    --request POST \
    --header "Authorization: Bearer $(gcloud auth print-access-token)" \
    --header "Content-Type: application/json" \
    https://{region}-aiplatform.googleapis.com/v1/projects/$project_id/locations/$region/endpoints/$ENDPOINT_ID:rawPredict \
    --data "@request_stream.json"
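
The request URL used by the `curl` cells can also be built in Python, which keeps the region, project, and endpoint ID in one place. This is a minimal sketch that uses the same regional host as the Python SDK cells (`{region}-aiplatform.googleapis.com`); `raw_predict_url` is a hypothetical helper and the project and endpoint values are placeholders.

```python
def raw_predict_url(region: str, project_id: str, endpoint_id: str) -> str:
    """Build the rawPredict REST URL for an endpoint."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/"
        f"endpoints/{endpoint_id}:rawPredict"
    )


# Placeholder project and endpoint ID, for illustration only.
print(raw_predict_url("us-central1", "my-project", "456"))
```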

### Clean Up

In [None]:
delete_endpoint = True
delete_model = True
delete_image = True
delete_art_repo = False
delete_bucket = False

# Undeploy model and delete endpoint
try:
    if delete_endpoint:
        endpoint.undeploy_all(sync=True)
        endpoint.delete()
        print(f"Deleted endpoint {endpoint.display_name}")
except Exception as e:
    print(e)

# Delete the model resource
try:
    if delete_model:
        model.delete()
        print(f"Deleted model {model.display_name}")
except Exception as e:
    print(e)

# Delete the container image from Artifact Registry
if delete_image:
    !gcloud artifacts docker images delete --quiet --delete-tags {private_nim_image}

# Delete the Artifact Repository
if delete_art_repo:
    ! gcloud artifacts repositories delete {private_repository} --location={region} -q

# Delete the Cloud Storage bucket
if delete_bucket:
    ! gsutil rm -rf {bucket_url}