# Deploy NVIDIA NIM from AWS Marketplace

NVIDIA NIM, a component of NVIDIA AI Enterprise, lets you add state-of-the-art large language models (LLMs) to your applications for natural language processing and understanding. Whether you're developing chatbots, content analyzers, or any other application that needs to understand and generate human language, NVIDIA NIM for LLMs can serve the model behind it.

In this example we show how to deploy Nemotron-4 15B with NIM on Amazon SageMaker. NVIDIA Nemotron-4 15B is a 15-billion-parameter multilingual large language model trained on 8 trillion text tokens that demonstrates strong performance on English, multilingual, and coding tasks. It outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves performance competitive with the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.

The NIM is built on inference engines such as Triton Inference Server, TensorRT, TensorRT-LLM, and PyTorch. It provides low latency, high throughput, function calling, metrics export, a standard API, optimized profiles, and enterprise support.

Please check out the [NIM LLM docs](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) for more information.

## Prerequisites
1. **Note**: This notebook contains elements that render correctly only in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account already has a subscription to this model package.
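
The three AWS Marketplace permissions listed above could be granted with an IAM policy along the following lines. This is a minimal sketch, not an official policy: the single-statement layout and the `"Resource": "*"` scope are assumptions, so adjust them to your organization's rules before attaching anything to a role.

```python
import json

# Hypothetical minimal policy granting the three AWS Marketplace actions
# named in the prerequisites. The "*" resource scope is an assumption.
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}

# Serialize to the JSON form that the IAM console and CLI accept.
policy_json = json.dumps(marketplace_policy, indent=2)
print(policy_json)
```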


## Subscribe to the model package
To subscribe to the model package:
1. Open the model package listing page, for example [Nemotron-15B NIM](https://aws.amazon.com/marketplace/pp/prodview-cjge44tau4g36)
1. On the AWS Marketplace listing, click on the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model. Copy the ARN corresponding to your region and specify it in the following cell.

In [1]:
import json

import boto3
import sagemaker
from sagemaker import get_execution_role, ModelPackage

sess = boto3.Session()
region = sess.region_name
sm = sess.client("sagemaker")  # control-plane client (create/delete resources)
client = sess.client("sagemaker-runtime")  # runtime client (invoke endpoints)
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()

# Replace with the model package name (the last segment of the Product ARN) you want to deploy
nim_package = "nvidia-nemotron-4-15b-nim-65097a83ca3c3246be10f8f04fb75749"

# Mapping for Model Packages
model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{nim_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{nim_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{nim_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{nim_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{nim_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{nim_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{nim_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{nim_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{nim_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{nim_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{nim_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{nim_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{nim_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{nim_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{nim_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{nim_package}",
}

# `region` was already read from the boto3 session above
if region not in model_package_map:
    raise ValueError(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]
model_package_arn

'arn:aws:sagemaker:us-east-1:865070037744:model-package/nvidia-nemotron-4-15b-nim-65097a83ca3c3246be10f8f04fb75749'
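
If you paste a Product ARN copied from the Marketplace page instead of using the map, it can help to check that its region matches your session before deploying. The helper below is a small sketch that relies only on the general ARN layout (`arn:partition:service:region:account-id:resource`); the function name `parse_model_package_arn` is hypothetical and not part of boto3 or the SageMaker SDK.

```python
def parse_model_package_arn(arn: str) -> dict:
    """Split a SageMaker model package ARN into its named parts."""
    # General ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts

    # For model packages the resource has the form "model-package/<name>"
    resource_type, _, package_name = resource.partition("/")
    if service != "sagemaker" or resource_type != "model-package" or not package_name:
        raise ValueError(f"Not a SageMaker model package ARN: {arn!r}")

    return {
        "partition": partition,
        "region": region,
        "account_id": account_id,
        "package_name": package_name,
    }


example_arn = (
    "arn:aws:sagemaker:us-east-1:865070037744:model-package/"
    "nvidia-nemotron-4-15b-nim-65097a83ca3c3246be10f8f04fb75749"
)
arn_info = parse_model_package_arn(example_arn)
print(arn_info["region"], arn_info["package_name"])
```

In the notebook you could compare `arn_info["region"]` with `region` and stop early if they differ.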

## Create the SageMaker Model
Use the `ModelPackage` class to create a model from the model package ARN selected above.

In [3]:
model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker_session
)

### Configure the Endpoint

Choose an endpoint name and an instance type (for example, `ml.g5.12xlarge`), then call `deploy`, which creates the endpoint configuration and the endpoint.

In [5]:
endpoint_name = "nim-nemotron-15b"  # Choose a unique name for your endpoint
instance_type = "ml.g5.12xlarge"  # or ml.g5.24xlarge

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name
)

---------------!
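
`deploy` waits for the endpoint by default. If you need to check readiness separately, for example from another session, a polling loop over `describe_endpoint` is one option. The sketch below assumes a boto3 SageMaker client is passed in; the function name, polling interval, and timeout are illustrative choices, not values from this notebook.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30,
                          timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if the endpoint reports Failed and TimeoutError
    if it is still not ready after roughly `timeout_seconds`.
    """
    waited = 0
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in the notebook (assumes `sm` and `endpoint_name` from earlier cells):
# wait_until_in_service(sm, endpoint_name)
```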

### Run Inference

Once the model is deployed, we can send a sample chat completion request with a JSON payload. NIM on SageMaker currently supports the OpenAI API chat completions protocol as its inference request format. For an explanation of the supported parameters, see [this link](https://platform.openai.com/docs/api-reference/chat).

In [10]:
payload_model = "nvidia/nemotron-4-15b-instruct-128k"
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU Computing."}
]
payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 100
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))

{
  "id": "chat-5abee8504e384d44942cab2a2db8a345",
  "object": "chat.completion",
  "created": 1731038760,
  "model": "nvidia/nemotron-4-15b-instruct-128k",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "There once was a GPU, so grand,\n Speeding computation across the land.\n With floating point charm, and parallel might,\n It works through the night, guiding us through the light.\n\nComplex equations danced, as if by magic,\nOn this marvel of silicon, started by NVIDIA.\nWith GDDR for memory, and cores galore,\n GPU computing, we're forever in store.\n\nNo need for fear, of long, long delays,\nOn GPUs,"
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 51,
    "total_tokens": 151,
    "completion_tokens": 100
  }
}
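
In the response above, `finish_reason` is `"length"`, which means generation stopped at the `max_tokens` limit and the limerick is cut off. A small helper can pull out the generated text and flag this case. This is a sketch based only on the response fields shown above; the name `summarize_completion` is hypothetical.

```python
def summarize_completion(output: dict) -> dict:
    """Extract the assistant text and basic usage from a chat completion dict."""
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "text": choice["message"]["content"],
        # "length" means generation stopped because max_tokens was reached
        "truncated": choice.get("finish_reason") == "length",
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Trimmed-down stand-in for the response printed above
sample_output = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "There once was a GPU, so grand,"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 51, "total_tokens": 151, "completion_tokens": 100},
}

summary = summarize_completion(sample_output)
if summary["truncated"]:
    print("Response hit max_tokens; consider raising the limit.")
print(summary["text"])
```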


### Try streaming inference

NIM on SageMaker also supports streaming inference. To enable it, set **`"stream"` to `True`** in the payload and call the [`invoke_endpoint_with_response_stream`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint_with_response_stream.html) method.

In [11]:
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU Computing."}
]
payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 100,
  "stream": True
}


response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

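The following code post-processes the streaming output: it accumulates the bytes of each event, extracts every complete `data:` chunk, and prints the content delta it contains.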

In [12]:
event_stream = response['Body']
accumulated_data = ""
start_marker = 'data:'
# Intermediate chunks end with a null finish_reason. The final chunk, whose
# finish_reason is set, does not match this marker and is left unprocessed.
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        # Use a separate name so the request `payload` dict is not overwritten
        chunk = event.get('PayloadPart', {}).get('Bytes', b'')
        if chunk:
            data_str = chunk.decode('utf-8')

            accumulated_data += data_str

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    # Skip fragments that are not valid JSON
                    continue
    except Exception as e:
        print(f"\nError processing event: {e}", flush=True)
        continue

There once was a GPU, so grand,
 Speeding computation across the land.
 With floating point charm, and parallel might,
 It works through the night, guiding us through the light.

Complex equations danced, as if by magic,
On this marvel of silicon, started by NVIDIA.
With GDDR for memory, and cores galore,
 GPU computing, we're forever in store.

No need for fear, of long, long delays,
On GPUs
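
The marker-based loop above only emits chunks whose `finish_reason` is null, so any text carried by the final chunk is dropped. An alternative sketch is to split the stream on newlines and parse each `data:` line on its own. It assumes the endpoint emits server-sent-event style lines (`data: {json}` separated by newlines, possibly ending with a `data: [DONE]` sentinel); that framing is an assumption here, and the function name `iter_stream_text` is hypothetical.

```python
import json


def _content_from_line(line: bytes):
    """Return the content delta carried by one `data:` line, or None."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    data = text[len("data:"):].strip()
    if not data or data == "[DONE]":
        return None
    try:
        event = json.loads(data)
    except json.JSONDecodeError:
        return None
    choices = event.get("choices") or [{}]
    return (choices[0].get("delta") or {}).get("content") or None


def iter_stream_text(byte_chunks):
    """Yield content deltas from an iterable of raw byte chunks.

    Chunks may split a line anywhere; only complete lines are parsed.
    """
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            content = _content_from_line(line)
            if content:
                yield content
    # Handle a last line that arrived without a trailing newline
    content = _content_from_line(buffer)
    if content:
        yield content


# Usage in the notebook (assumes `response` from the streaming call above):
# parts = (e.get("PayloadPart", {}).get("Bytes", b"") for e in response["Body"])
# for text in iter_stream_text(parts):
#     print(text, end="", flush=True)
```

Splitting on the newline byte before decoding keeps multi-byte UTF-8 characters intact even when a chunk boundary falls inside one.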

### Terminate endpoint and clean up artifacts

In [13]:
sm.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '49897845-0773-40f8-bd1b-6dc694a56a13',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '49897845-0773-40f8-bd1b-6dc694a56a13',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 06 Nov 2024 02:43:43 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}
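
Deleting the endpoint leaves the endpoint configuration and the model in the account. As an alternative to the single `delete_endpoint` call above, the sketch below looks up both from the endpoint and removes all three resources. It assumes a boto3 SageMaker client and an endpoint that still exists (`describe_endpoint` fails once the endpoint is gone); the function name `delete_endpoint_resources` is hypothetical.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and model(s)."""
    # Look up the config and model names before anything is deleted
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config.get("ProductionVariants", [])
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}


# Usage in the notebook (assumes `sm` and `endpoint_name` from earlier cells):
# delete_endpoint_resources(sm, endpoint_name)
```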