In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# **NVIDIA NIM on Google Cloud Vertex AI**

<table align="left">
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fgenerative_ai%2Fnvidia_nim_vertexai.ipynb">
      <img width="32px" src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/nvidia_nim_vertexai.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>  
</table>

[Vertex AI](https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform) is Google Cloud's unified machine learning platform. It streamlines the process of building, training, and deploying AI models, making it easier to bring your AI projects to life.

[NVIDIA Inference Microservices (NIM)](https://www.nvidia.com/en-us/ai/) are pre-trained and optimized AI models packaged as microservices. They're designed to simplify the deployment of high-performance, production-ready AI into applications.

This Colab notebook demonstrates how to deploy a meta/llama-3.1-8b-instruct NIM on Vertex AI with NVIDIA GPUs, and how to run inference in both batch and streaming modes. You can run it in Colab Enterprise within Vertex AI. NVIDIA NIM is distributed as container images, which you pull into your Google Cloud environment and then deploy to a Vertex AI endpoint. The endpoint is accessible via REST and can be integrated into your applications.


## Prerequisites

### Hardware
<a name="hardware"></a>
To run the meta/llama-3.1-8b-instruct NIM as configured in this notebook, you need one `g2-standard-24` VM from the Google Cloud G2 machine family, which provides the two [NVIDIA L4 GPU](https://cloud.google.com/compute/docs/gpus#l4-gpus) accelerators used for the deployment.

### Software
<a name="software"></a>
1. [Google Cloud Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project) with a billing ID
2. [Vertex AI](https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform)
 - [Vertex AI Model Resource](https://cloud.google.com/vertex-ai/docs/model-registry/introduction)
 - [Vertex AI Endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment)
3. [Colab Enterprise](https://cloud.google.com/colab/docs/create-console-quickstart) using a default Runtime
4. [Artifact Registry](https://cloud.google.com/artifact-registry)
5. NVIDIA NGC API Key

  **Note:** Sign up for the [NVIDIA Developer Program](https://developer.nvidia.com/developer-program), which provides developers with tools and resources for building with NVIDIA technology. Membership gives you the NGC API key that is required to access NIM.

### Security roles and permissions
<a name="security"></a>

To run this Colab notebook successfully, your user account needs specific permissions. Request the following roles from your administrator:
 - Colab Enterprise Admin (*roles/aiplatform.colabEnterpriseAdmin*)
 - Vertex AI Platform User (*roles/aiplatform.user*)

Additionally, the Vertex AI Workbench instance operates under the default Compute Engine service account [<PROJECT_NUMBER>-compute@developer.gserviceaccount.com]. To ensure proper functionality, ask your administrator to assign the following role(s) to this service account:
 - Artifact Registry Writer (*roles/artifactregistry.writer*)
 - Compute Network Admin (*roles/compute.networkAdmin*) [Optional, if there is no default network]
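
The role assignments above are normally made by a project administrator. As a minimal sketch, the helper below only builds the corresponding `gcloud projects add-iam-policy-binding` command strings so they can be reviewed before anyone runs them. The function name `iam_binding_commands`, the example project ID, user email, and project number are hypothetical placeholders.

```python
# Sketch: build (but do not run) gcloud commands that grant the roles listed above.
USER_ROLES = [
    "roles/aiplatform.colabEnterpriseAdmin",
    "roles/aiplatform.user",
]
SERVICE_ACCOUNT_ROLES = [
    "roles/artifactregistry.writer",
    # Optional: only needed when the project has no default network.
    "roles/compute.networkAdmin",
]


def iam_binding_commands(project_id, user_email, project_number):
    """Return one gcloud command string per role binding."""
    service_account = f"{project_number}-compute@developer.gserviceaccount.com"
    bindings = [(f"user:{user_email}", role) for role in USER_ROLES]
    bindings += [
        (f"serviceAccount:{service_account}", role) for role in SERVICE_ACCOUNT_ROLES
    ]
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f'--member="{member}" --role="{role}"'
        for member, role in bindings
    ]


# Hypothetical values, for illustration only.
for command in iam_binding_commands("my-project", "dev@example.com", "123456789012"):
    print(command)
```

Printing the commands instead of executing them keeps the sketch side-effect free; an administrator can paste the ones that apply.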


## Outline
<a name="outline"></a>

1. [Getting started](#step1): Enable the required Google Cloud APIs, connect to a Colab Enterprise runtime, install the Python packages, and create the Artifact Registry repository.

2. [Create a Vertex AI Workbench instance](#step2): Vertex AI Workbench instances come with Docker pre-installed, which simplifies pulling, tagging, and pushing images to repositories like Artifact Registry.

3. [Push the image to Artifact Registry](#step3): Configure the NIM model, then pull the NIM container image and push it to Artifact Registry from the Workbench instance.

 *If you already have the NIM image in Google's Artifact Registry, you can skip [Step 2](#step2) and [Step 3b](#step3b).*

4. [Upload](#step4): Configure parameters such as the GPU accelerator type and machine type, and upload the NIM as a Vertex AI Model.

5. [Create endpoint](#step5): Create the Vertex AI endpoint that will serve the model.

6. [Deploy](#step6): Deploy the uploaded model to the Vertex AI endpoint.

7. [Test inference](#step7): Use sample prompts to test model inference in [batch](#step7a) and [streaming](#step7b) mode.

8. [Teardown](#step8): Tear down all the resources, such as the Vertex AI endpoint, the model, the Artifact Registry repository, and the Workbench instance.

## Step 1: Getting Started
<a name="step1"></a>

### a. Enable APIs
Enable the APIs listed below from the Google Cloud Console.

  - [Artifact Registry](https://console.cloud.google.com/flows/enableapi?apiid=artifactregistry.googleapis.com&redirect=https://console.cloud.google.com)
  - [Compute Engine](https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com&redirect=https://console.cloud.google.com)
  - [Dataform](https://console.cloud.google.com/flows/enableapi?apiid=dataform.googleapis.com&redirect=https://console.cloud.google.com)
  - [Notebooks](https://console.cloud.google.com/flows/enableapi?apiid=notebooks.googleapis.com&redirect=https://console.cloud.google.com)
  - [Vertex AI](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com&redirect=https://console.cloud.google.com)
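
As an alternative to opening each link, the same APIs can usually be enabled from a terminal with `gcloud services enable`. The sketch below only assembles that command. The service names are taken from the `apiid` values in the links above; the helper name `enable_apis_command` and the example project ID are hypothetical.

```python
# Sketch: assemble a single `gcloud services enable` command for the APIs above.
REQUIRED_SERVICES = [
    "artifactregistry.googleapis.com",
    "compute.googleapis.com",
    "dataform.googleapis.com",
    "notebooks.googleapis.com",
    "aiplatform.googleapis.com",
]


def enable_apis_command(project_id):
    """Return the gcloud command string; nothing is executed here."""
    services = " ".join(REQUIRED_SERVICES)
    return f"gcloud services enable {services} --project={project_id}"


# "my-project" is a placeholder project ID.
print(enable_apis_command("my-project"))
```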

### b. Colab Notebook

Follow [this link](https://cloud.google.com/colab/docs/create-runtime#create) to create a runtime in Colab Enterprise. Make sure to [connect to the runtime](https://cloud.google.com/colab/docs/connect-to-runtime#existing).

### c. Setup environment
Set up your environment by installing the required Python packages as detailed below.
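
The configuration cell further down derives the region (`LOCATION`) from the zone by dropping the trailing zone letter. A small sketch of that derivation, with an added guard against an unedited placeholder project ID, might look like this. `region_from_zone` and `check_project_id` are hypothetical helpers, not part of any Google library.

```python
def region_from_zone(zone):
    """Drop the final '-<letter>' segment, e.g. 'us-central1-a' -> 'us-central1'."""
    return "-".join(zone.split("-")[:-1])


def check_project_id(project_id):
    """Reject empty values and bracketed placeholders such as '[your-project-id]'."""
    if not project_id or project_id.startswith("["):
        raise ValueError("Set PROJECT_ID to a real Google Cloud project ID.")
    return project_id


print(region_from_zone("us-central1-a"))  # us-central1
```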

<sub><p align="right">[go to top](#outline)</p></sub>


In [None]:
! pip install --upgrade --user --quiet \
    google-cloud-aiplatform \
    google-cloud-artifact-registry

In [None]:
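import sys

# In Colab, restart the kernel so the newly installed packages are picked up.
# Wait for the restart to finish before running the next cell.
if "google.colab" in sys.modules:
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)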

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
ZONE = "us-central1-a"  # @param ["us-central1-a"] {"allow-input":true}
LOCATION = "-".join(ZONE.split("-")[:-1])

REPOSITORY_NAME = "nim"

WORKBENCH_NAME = "wb-nim"  # @param {"type":"string"}
MACHINE_TYPE = "e2-standard-4"  # @param {"type":"string"}

In [None]:
! gcloud artifacts repositories create {REPOSITORY_NAME} \
  --repository-format=docker \
  --location={LOCATION} \
  --project={PROJECT_ID}

In [None]:
import google.cloud.aiplatform as aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

## Step 2: Create Vertex AI Workbench instance
<a name="step2"></a>

A Vertex AI Workbench instance is created temporarily to pull the Docker image from the NVIDIA NGC catalog using the NGC API key. If your project has no default network, you can optionally create one and specify it for the instance.

**Note:**
After the instance is created, click the link printed below to open the new Vertex AI Workbench instance.
<sub><p align="right">[go to top](#outline)</p></sub>

In [None]:
# Optional: create a network only if the project has no default network
NETWORK_NAME = f"vpc-{WORKBENCH_NAME}"

! gcloud compute networks create {NETWORK_NAME} \
  --project={PROJECT_ID} \
  --subnet-mode=auto

In [None]:
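# If you created the optional network above, end the last flag with `\`
# and add `--network={NETWORK_NAME}` on the next line.
! gcloud workbench instances create {WORKBENCH_NAME} \
  --project={PROJECT_ID} \
  --location={ZONE} \
  --machine-type={MACHINE_TYPE}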

from IPython.display import HTML, display

url = !gcloud workbench instances describe {WORKBENCH_NAME} \
  --project={PROJECT_ID} \
  --location={ZONE} \
  --format="value(gceSetup.metadata.proxy-url)"

url = url[0]
display(HTML(f'<a href="https://{url}" target="_blank">{url}</a>'))

## Step 3: Push image to Artifact Registry
<a name="step3"></a>

This step has two parts:
 - [Configure NIM model](#step3a): Run this in both **this Colab notebook** and **the Vertex AI Workbench instance**.
 - [Pull and push Docker image](#step3b): Run this **only in the Vertex AI Workbench instance**; it fails if run in Colab. You will need your NGC API key. To run it in the Workbench instance:
   - Navigate to Notebooks -> Python3 (ipykernel)
   - Paste the code from both cells
   - If not already defined, provide values for `PROJECT_ID` and `LOCATION`
   - Execute the code

**Note:** The image size will determine how long this step takes, which may range from 20 to 30 minutes.

<sub><p align="right">[go to top](#outline)</p></sub>

### a. Configure NIM model
<a name="step3a"></a>

**Note:** Execute in both Colab and the Vertex AI Workbench notebook.
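
To make the naming explicit, the sketch below reproduces how the configuration cell composes the source image on `nvcr.io` and the target image in Artifact Registry. The helper name `build_image_uris` and the example project ID are hypothetical; the string layout follows the cell in this section.

```python
def build_image_uris(location, project_id, nim_model, repository="nim"):
    """Return (source image on nvcr.io, target image in Artifact Registry)."""
    source = f"nvcr.io/{repository}/{nim_model}"
    target = f"{location}-docker.pkg.dev/{project_id}/{repository}/{nim_model}"
    return source, target


# "my-project" is a placeholder project ID.
source, target = build_image_uris(
    "us-central1", "my-project", "meta/llama-3.1-8b-instruct:1.2.2"
)
print(source)  # nvcr.io/nim/meta/llama-3.1-8b-instruct:1.2.2
print(target)  # us-central1-docker.pkg.dev/my-project/nim/meta/llama-3.1-8b-instruct:1.2.2
```

Both names keep the same `repository/model:tag` suffix, which is why a single `docker tag` is enough to retarget the image.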

In [None]:
LOCATION = "[GCP region from Step 1c]"
PROJECT_ID = "[GCP project_id from Step 1c]"

NVCR_REGISTRY = "nvcr.io"
REPOSITORY_NAME = "nim"
NIM_MODEL = "meta/llama-3.1-8b-instruct:1.2.2"  # @param ["meta/llama-3.1-8b-instruct:1.2.2","meta/llama-3.1-70b-instruct:1.2.1","meta/llama-3.1-405b-instruct:1.2.0","meta/llama3-8b-instruct:1.0.3","meta/llama3-70b-instruct:1.0.3"] {"allow-input":true}
NGC_API_KEY = "[nvidia-ngc-api-key]"  # @param {"type":"string"}

NIM_IMAGE = NVCR_REGISTRY + "/" + REPOSITORY_NAME + "/" + NIM_MODEL
NIM_IMAGE_GAR = (
    LOCATION + "-docker.pkg.dev/" + PROJECT_ID + "/" + REPOSITORY_NAME + "/" + NIM_MODEL
)

### b. Pull and push Docker image
<a name="step3b"></a>

**Note:** Execute only in the Vertex AI Workbench notebook.

In [None]:
! gcloud auth configure-docker {LOCATION}-docker.pkg.dev --quiet

! docker login -u '$oauthtoken' --password-stdin nvcr.io <<< "$NGC_API_KEY"
! docker pull $NIM_IMAGE
! docker tag $NIM_IMAGE $NIM_IMAGE_GAR
! docker push $NIM_IMAGE_GAR

## Step 4. Upload NIM to Vertex AI
<a name="step4"></a>
Uploading the NIM registers its container image as a Vertex AI Model resource, which can then be deployed to an endpoint and managed like any other Vertex AI model.

**Note:** The image size will determine how long this step takes, which may range from 20 to 30 minutes.
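
The machine type, GPU count, and NIM profile chosen for the deployment need to agree with each other. As a hedged sketch, the check below assumes that the `tpN` suffix of a profile name such as `vllm-fp16-tp2` denotes the tensor-parallel degree, meaning the number of GPUs the profile expects. It also uses a small lookup of L4 GPU counts per G2 machine type. The helper name `check_deploy_config` is hypothetical.

```python
import re

# Number of NVIDIA L4 GPUs attached to each G2 machine type.
G2_L4_GPUS = {
    "g2-standard-4": 1,
    "g2-standard-8": 1,
    "g2-standard-12": 1,
    "g2-standard-16": 1,
    "g2-standard-24": 2,
    "g2-standard-32": 1,
    "g2-standard-48": 4,
    "g2-standard-96": 8,
}


def check_deploy_config(machine_type, accelerator_count, profile):
    """Return a list of human-readable problems; an empty list means no mismatch found."""
    problems = []
    expected = G2_L4_GPUS.get(machine_type)
    if expected is not None and expected != accelerator_count:
        problems.append(
            f"{machine_type} has {expected} L4 GPU(s), not {accelerator_count}"
        )
    # Assumption: 'tpN' in the profile name is the tensor-parallel degree.
    match = re.search(r"tp(\d+)", profile)
    if match and int(match.group(1)) != accelerator_count:
        problems.append(
            f"profile {profile} expects {match.group(1)} GPU(s), not {accelerator_count}"
        )
    return problems


print(check_deploy_config("g2-standard-24", 2, "vllm-fp16-tp2"))  # []
```

Machine types outside the lookup are simply not checked, so the function reports only mismatches it can actually detect.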

<sub><p align="right">[go to top](#outline)</p></sub>

In [None]:
MACHINE_TYPE = "g2-standard-24"  # @param {type:"string"}
GPU_ACCELERATOR_TYPE = "NVIDIA_L4"  # @param {type:"string"}
GPU_ACCELERATOR_COUNT = 2  # @param {"type":"number"}

SELECTED_PROFILE = "vllm-fp16-tp2"  # @param {type:"string"}
API_ENDPOINT = "{}-aiplatform.googleapis.com".format(LOCATION)

endpoint_name = NIM_MODEL.replace(":", "_")
model_wo_tag = NIM_MODEL.split(":")[0]

In [None]:
from google.api_core.future.polling import DEFAULT_POLLING
from google.cloud.aiplatform import Endpoint, Model

# Raise the client-side polling timeout (in seconds) so a long-running upload is not cut short.
DEFAULT_POLLING._timeout = 360000
model = None
models = Model.list(filter=f'displayName="{NIM_MODEL}"')

if models:
    model = models[0]
else:
    try:
        model = aiplatform.Model.upload(
            display_name=f"{NIM_MODEL}",
            serving_container_image_uri=f"{NIM_IMAGE_GAR}",
            serving_container_predict_route="/v1/chat/completions",
            serving_container_health_route="/v1/health/ready",
            serving_container_environment_variables={
                "NGC_API_KEY": f"{NGC_API_KEY}",
                "PORT": "8000",
                "shm-size": "16GB",
            },
            serving_container_shared_memory_size_mb=16000,
            serving_container_ports=[8000],
            sync=True,
        )
        model.wait()

    except Exception as e:
        print(f"An error occurred: {str(e)}")

if model:
    print("Model:")
    print(f"\tDisplay name: {model.display_name}")
    print(f"\tResource name: {model.resource_name}")
    MODEL_ID = model.resource_name

## Step 5. Create Vertex Endpoint
<a name="step5"></a>
A Vertex AI Endpoint resource exposes a deployed Vertex AI Model for online predictions. This step reuses an endpoint with the expected display name if one exists, and creates it otherwise.
<sub><p align="right">[go to top](#outline)</p></sub>

In [None]:
endpoints = Endpoint.list(filter=f'displayName="{endpoint_name}"')
print(endpoints)
if endpoints:
    endpoint = endpoints[0]
else:
    print(f"Endpoint {endpoint_name} doesn't exist, creating...")
    endpoint = aiplatform.Endpoint.create(display_name=endpoint_name)

if endpoint:
    print("Endpoint:")
    print(f"\tDisplay name: {endpoint.display_name}")
    print(f"\tResource name: {endpoint.resource_name}")

    ENDPOINT_ID = endpoint.resource_name

## Step 6. Deploy NIM
<a name="step6"></a>

To use models for online predictions, they need to be deployed to an endpoint.

**Note:** This step can take 20-30 minutes.
<sub><p align="right">[go to top](#outline)</p></sub>


In [None]:
try:
    model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=f"{NIM_MODEL}",
        traffic_percentage=100,
        machine_type=f"{MACHINE_TYPE}",
        min_replica_count=1,
        max_replica_count=1,
        accelerator_type=f"{GPU_ACCELERATOR_TYPE}",
        accelerator_count=GPU_ACCELERATOR_COUNT,
        enable_access_logging=True,
        sync=True,
    )
    print(f"Model {model.display_name} deployed at endpoint {endpoint.display_name}.")
except Exception as e:
    print(f"An error occurred: {str(e)}")

## Step 7. Run Inference
<a name="step7"></a>
To execute a sample inference, structure your input instance in JSON format.
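
The request body follows the OpenAI-style chat completions schema served at `/v1/chat/completions`. A minimal sketch of assembling such a payload is shown here. `build_chat_payload` is a hypothetical helper, and the field names mirror the ones used in this notebook.

```python
import json


def build_chat_payload(
    model, system_prompt, user_prompt, max_tokens=4096, top_p=1, stream=False
):
    """Build a chat-completions request body as a plain dict."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "top_p": top_p,
    }
    # Only add the flag when streaming is requested.
    if stream:
        payload["stream"] = True
    return payload


payload_example = build_chat_payload(
    "meta/llama-3.1-8b-instruct",
    "You are a polite and respectful chatbot helping people plan a vacation.",
    "What should I do for a 4 day vacation in Spain?",
)
# The raw prediction request expects UTF-8 encoded JSON bytes.
body = json.dumps(payload_example).encode("utf-8")
print(len(payload_example["messages"]))  # 2
```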

<sub><p align="right">[go to top](#outline)</p></sub>

### a. Create Payload

In [None]:
import json

messages = [
    {
        "content": "You are a polite and respectful chatbot helping people plan a vacation.",
        "role": "system",
    },
    {"content": "What should I do for a 4 day vacation in Spain?", "role": "user"},
]

payload = {"model": model_wo_tag, "messages": messages, "max_tokens": 4096, "top_p": 1}

with open("request.json", "w") as outfile:
    json.dump(payload, outfile)

# Streaming
payload_s = {
    "model": model_wo_tag,
    "messages": messages,
    "max_tokens": 4096,
    "top_p": 1,
    "stream": True,
}

with open("request_stream.json", "w") as outfile:
    json.dump(payload_s, outfile)

### b. Test Inference
<a name="step7a"></a>
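
The non-streaming response is a single JSON document. Assuming it follows the OpenAI-compatible chat completions layout (a `choices` list whose entries carry a `message`), the generated text can be pulled out as sketched below. The helper name `extract_reply` and the sample response are illustrative, not captured output.

```python
def extract_reply(response_json):
    """Return the assistant text of the first choice, or '' if none is present."""
    choices = response_json.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return message.get("content") or ""


# Illustrative response shape, not real model output.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Day 1: Madrid."},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample_response))  # Day 1: Madrid.
```

In this notebook the input would be `json.loads(response.data)` from the cell in this section.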

In [None]:
import json
from pprint import pprint

from google.api import httpbody_pb2
from google.cloud import aiplatform_v1

client_options = {"api_endpoint": API_ENDPOINT}

http_body = httpbody_pb2.HttpBody(
    data=json.dumps(payload).encode("utf-8"),
    content_type="application/json",
)

try:
    req = aiplatform_v1.RawPredictRequest(
        http_body=http_body, endpoint=endpoint.resource_name
    )

    print("Request:")
    pprint(json.loads(req.http_body.data))

    pred_client = aiplatform.gapic.PredictionServiceClient(
        client_options=client_options
    )

    response = pred_client.raw_predict(req)

    print(
        "--------------------------------------------------------------------------------------"
    )
    print("Response:")
    pprint(json.loads(response.data))
except Exception as e:
    print(f"An error occurred: {str(e)}")

### c. Test Inference (Streaming)
<a name="step7b"></a>
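
With `"stream": True`, the body is returned as server-sent events rather than one JSON document. The sketch below assumes the OpenAI-compatible framing, where each event is a `data: {...}` line carrying a `delta` and the stream ends with `data: [DONE]`. `collect_stream_text` is a hypothetical helper and the sample text is illustrative, not captured output.

```python
import json


def collect_stream_text(raw):
    """Concatenate the 'delta.content' pieces from an SSE-formatted response body."""
    parts = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)


# Illustrative event stream.
sample = (
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "Hola"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " Madrid"}}]}\n\n'
    "data: [DONE]\n\n"
)
print(collect_stream_text(sample))  # Hola Madrid
```

In this notebook the helper would be applied to `response.data.decode("utf-8")` from the streaming cell.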

In [None]:
# Streaming

import json
from pprint import pprint

from google.api import httpbody_pb2
from google.cloud import aiplatform_v1

client_options = {"api_endpoint": API_ENDPOINT}

http_body = httpbody_pb2.HttpBody(
    data=json.dumps(payload_s).encode("utf-8"),
    content_type="application/json",
)

try:
    req = aiplatform_v1.RawPredictRequest(
        http_body=http_body, endpoint=endpoint.resource_name
    )

    print("Request:")
    pprint(json.loads(req.http_body.data))

    pred_client = aiplatform.gapic.PredictionServiceClient(
        client_options=client_options
    )

    response = pred_client.raw_predict(req)
    print(
        "--------------------------------------------------------------------------------------"
    )
    print("Response:")
    print(response.data.decode("utf-8"))
except Exception as e:
    print(f"An error occurred: {str(e)}")

## Step 8. Teardown (optional)
<a name="step8"></a>

The infrastructure provisioned in the previous steps can be deleted in this optional step.
<sub><p align="right">[go to top](#outline)</p></sub>


In [None]:
# Undeploy every model from the endpoint before deleting the endpoint itself.
if endpoint:
    print(f"Deleting endpoint {endpoint.display_name}")
    endpoint.undeploy_all()
    endpoint.delete()

if model:
    print(f"Deleting model {model.display_name}")
    model.delete()

! gcloud artifacts docker images delete \
  --delete-tags {NIM_IMAGE_GAR} \
  --quiet

! gcloud artifacts repositories delete {REPOSITORY_NAME} \
  --location={LOCATION} \
  --quiet

! gcloud workbench instances delete {WORKBENCH_NAME} \
  --project={PROJECT_ID} \
  --location={ZONE}

! gcloud compute networks delete {NETWORK_NAME} \
  --project={PROJECT_ID} \
  --quiet