# Deploy NVIDIA NIM on Amazon SageMaker from S3 Storage

NVIDIA NIM, a component of NVIDIA AI Enterprise, enhances your applications with the power of state-of-the-art large language models (LLMs), providing unmatched natural language processing and understanding capabilities. Whether you're developing chatbots, content analyzers, or any application that needs to understand and generate human language, NVIDIA NIM for LLMs has you covered.

To deploy an NVIDIA NIM, the NIM profiles are typically downloaded from [NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/). A model profile typically includes the model weights and the optimizations specific to the GPU hardware the NIM is deployed on. When the VPC is private with no internet connectivity, the NIM assets can instead be stored in Amazon S3 and retrieved at deployment time through S3 VPC endpoints, rather than fetched directly from NGC. This can also improve latency, since the traffic stays within the AWS network.
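
As a rough illustration of the private-VPC setup described above, the sketch below builds the arguments for an S3 gateway endpoint request. The helper name `s3_gateway_endpoint_request` and the VPC and route table IDs are hypothetical; the actual `create_vpc_endpoint` call is left commented out because it needs AWS credentials and your own network identifiers.

```python
def s3_gateway_endpoint_request(vpc_id, region, route_table_ids):
    """Build keyword arguments for ec2.create_vpc_endpoint (S3 gateway endpoint)."""
    if not route_table_ids:
        # Without a route table, no subnet traffic is routed to the endpoint.
        raise ValueError("pass at least one route table ID")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


# Hypothetical identifiers; replace them with values from your own VPC.
request = s3_gateway_endpoint_request(
    "vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"]
)
print(request["ServiceName"])

# To create the endpoint (requires credentials and ec2:CreateVpcEndpoint):
# import boto3
# boto3.client("ec2", region_name="us-east-1").create_vpc_endpoint(**request)
```

Keeping the request as a plain dictionary makes it easy to review before anything is created in the account.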

The steps here show how to use NIM profile assets stored on Amazon S3 to deploy a NIM on Amazon SageMaker.

Please check out the [NIM docs](https://docs.nvidia.com/nim/index.html) for more information.

## Setup

Install the dependencies and set up the roles required to package the model and create the SageMaker endpoint.

In [None]:
import boto3, json, sagemaker, time, os
from sagemaker import get_execution_role
from pathlib import Path

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
region = sess.region_name
sts_client = sess.client('sts')
account_id = sts_client.get_caller_identity()['Account']

### Define Arguments

Examples are provided below for NIMs to be tested on SageMaker. Remove the model you do **not** want to deploy.

In [55]:
public_nim_image = "public.ecr.aws/nvidia/nim:llama3.2-nv-embedqa-1b-v2-1.3.0"
nim_model = "llama-3.2-nv-embedqa-1b-v2"
sm_model_name = "llama-3-2-nv-embedqa-1b-v2"
instance_type = "ml.g5.12xlarge"

Since the NIM artifacts should already have been uploaded to S3 as part of the prerequisites, we specify the S3 prefix where the model files are stored.

In [56]:
s3_uri = 's3://<ENTER S3 BUCKET NAME>/llama3.2-nv-embedqa-1b-v2-1.3.0/'
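
Before creating the model, it can help to sanity-check the S3 URI. The sketch below is one possible check, not part of the original notebook: `split_s3_uri` and `prefix_has_objects` are hypothetical helpers, the bucket name is a placeholder, and the trailing-slash rule is an assumption about how SageMaker expects an uncompressed `S3Prefix` location to be written.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix/' into (bucket, prefix), requiring a trailing slash."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    if not uri.endswith("/"):
        # Assumption: the prefix for uncompressed model data ends with '/'.
        raise ValueError("expected the prefix to end with '/'")
    return parsed.netloc, parsed.path.lstrip("/")


def prefix_has_objects(s3_client, bucket, prefix):
    """Return True if at least one object exists under the prefix."""
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0


# Placeholder bucket name, for illustration only.
bucket, prefix = split_s3_uri("s3://example-bucket/llama3.2-nv-embedqa-1b-v2-1.3.0/")
print(bucket, prefix)

# With credentials available, the prefix could be checked like this:
# import boto3
# print(prefix_has_objects(boto3.client("s3"), bucket, prefix))
```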

### NIM Container

We first pull the NIM image from public ECR and then push it to a private ECR repository in your account for deployment on a SageMaker endpoint. Note:
  - The NIM ECR image is currently available only in the `us-east-1` region
  - You must have `ecr:CreateRepository` and the appropriate push permissions associated with your execution role

In [None]:
import subprocess

# Get AWS account ID
result = subprocess.run(['aws', 'sts', 'get-caller-identity', '--query', 'Account', '--output', 'text'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

if result.returncode != 0:
    # Stop here: the script below cannot be built without the account ID.
    raise RuntimeError(f"Error getting AWS account ID: {result.stderr}")

account = result.stdout.strip()
print(f"AWS account ID: {account}")

bash_script = f"""
echo "Public NIM Image: {public_nim_image}"
docker pull {public_nim_image}


echo "Resolved account: {account}"
echo "Resolved region: {region}"

nim_image="{account}.dkr.ecr.{region}.amazonaws.com/{nim_model}"

# Ensure the repository name adheres to AWS constraints
repository_name=$(echo "{nim_model}" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]._/-')

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "$repository_name" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "$repository_name" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin "{account}.dkr.ecr.{region}.amazonaws.com"

docker tag {public_nim_image} $nim_image
docker push $nim_image
echo -n $nim_image
"""
nim_image=f"{account}.dkr.ecr.{region}.amazonaws.com/{nim_model}"
# Run the bash script and stream its output in real time.
# stderr is merged into stdout so a full stderr pipe cannot block the script.
process = subprocess.Popen(bash_script, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

while True:
    output = process.stdout.readline()
    if output == b'' and process.poll() is not None:
        break
    if output:
        print(output.decode().strip())


We print the private ECR NIM image URI in your account that we will use for the SageMaker deployment.
- It should look similar to `"<ACCOUNT ID>.dkr.ecr.<REGION>.amazonaws.com/<NIM_MODEL>"`. Because no tag is specified, Docker uses the `latest` tag.

In [None]:
print(nim_image)
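
The repository-name clean-up done with `tr` in the script above can also be expressed in Python, which makes it easier to build the image URI from the sanitized name. This is a minimal sketch: `ecr_repository_name` and `ecr_image_uri` are hypothetical helpers, and the account ID is a placeholder.

```python
import re


def ecr_repository_name(name):
    """Lowercase the name and drop characters outside [a-z0-9._/-]."""
    return re.sub(r"[^a-z0-9._/-]", "", name.lower())


def ecr_image_uri(account, region, name, tag="latest"):
    """Build a private ECR image URI from a sanitized repository name."""
    repo = ecr_repository_name(name)
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-east-1", "Llama-3.2-NV-EmbedQA-1B-v2"))
```

Deriving the URI from the same sanitized name that is used to create the repository keeps the two from drifting apart if the model name contains uppercase or unsupported characters.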

### Create SageMaker Endpoint

We define the SageMaker model from the NIM container. We also configure S3 as the model data source, which prompts SageMaker to download the NIM files from the provided S3 prefix when it sets up the environment for the NIM. In addition, we set the NIM cache location to `/opt/ml/model/`, because this is the directory where SageMaker stores the NIM files it fetches from S3.

In [58]:
sm_model_name = sm_model_name + "-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_model_response = sm.create_model(
    ModelName=sm_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": nim_image,
        "Environment": {"NIM_CACHE_PATH": "/opt/ml/model/"},
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": s3_uri,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        },
    }
)

In [None]:
print("Model Arn: " + create_model_response["ModelArn"])

Next we create the endpoint configuration. Here we are deploying the `llama-3.2-nv-embedqa-1b-v2` embedding model on the specified instance type.

In [None]:
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
            "ModelDataDownloadTimeoutInSeconds": 3600,  # Model download timeout in seconds
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,  # Container health check timeout in seconds
            "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-2"
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment is successful.

In [None]:
endpoint_name = sm_model_name

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
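
The loop above stops as soon as the status is no longer `Creating`, so a failed deployment and a successful one look the same to the code that follows. The sketch below is one way to make the difference explicit. `wait_for_endpoint` is a hypothetical helper, and `fake_describe` stands in for `sm.describe_endpoint` so the example runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no failure reason reported")
            raise RuntimeError(f"endpoint {endpoint_name} is {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still Creating after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for sm.describe_endpoint: reports Creating once, then InService.
_responses = iter([{"EndpointStatus": "Creating"}, {"EndpointStatus": "InService"}])


def fake_describe(EndpointName):
    return next(_responses)


result = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda s: None)
print(result["EndpointStatus"])

# In the notebook this would be called as:
# wait_for_endpoint(sm.describe_endpoint, endpoint_name)
```

boto3 also provides a built-in waiter, `sm.get_waiter("endpoint_in_service")`, which may be sufficient if custom error messages are not needed.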

### Run Inference

Once the endpoint's status is `InService`, we can send a sample text as an inference request. For the request format, NIM on SageMaker currently supports the OpenAI API inference protocol. For an explanation of the supported parameters, see [this link](https://docs.api.nvidia.com/nim/reference/nvidia-llama-3_2-nv-embedqa-1b-v2-infer).

<div class="alert alert-block alert-info">
<b>IMPORTANT:</b> The model name in the inference request payload must be the name of the NIM model. Please DON'T change it below.
</div>

In [None]:
payload_model = "nvidia/llama-3.2-nv-embedqa-1b-v2"

messages = ["Hello world"]

payload = {
    "input": messages,
    "model": payload_model,
    "input_type": "query"
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))
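
To work with the result, the embedding vectors can be pulled out of the response and compared. The sketch below assumes the response follows the OpenAI embeddings shape (a `data` list whose items carry `index` and `embedding` fields). The `sample` dictionary is made up with tiny vectors for illustration, and both helper names are hypothetical.

```python
import math


def extract_embeddings(response_json):
    """Return embedding vectors ordered by their 'index' field."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up response with two-dimensional vectors, for illustration only.
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.0, 1.0]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ],
}
vectors = extract_embeddings(sample)
print(len(vectors), cosine_similarity(vectors[0], vectors[1]))

# In the notebook, the same helper would be applied to the parsed response:
# vectors = extract_embeddings(output)
```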

### Terminate endpoint and clean up artifacts

In [None]:
# Delete the endpoint first, then the configuration and model it depends on.
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)