# Deploy NVIDIA NIM on Amazon SageMaker

NVIDIA NIM, a component of NVIDIA AI Enterprise, brings state-of-the-art large language models (LLMs) to your applications, providing natural language processing and understanding capabilities. Whether you're developing chatbots, content analyzers, or any application that needs to understand and generate human language, NVIDIA NIM for LLMs can serve as the inference layer.

In this example we show how to deploy `NVIDIA Nemotron Nano 9b v2` with NIM on Amazon SageMaker.

<div class="alert alert-block alert-info">
<b>IMPORTANT:</b> To run NIM on SageMaker you need an NGC API key, which is required to access NGC resources. See <a href="https://build.nvidia.com/meta/llama3-70b?signin=true">this link</a> to learn how to get an NGC API key.
</div>

Please check out the [NIM LLM docs](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) for more information.

**⚠️ Disclaimer**

Reasoning models often require longer inference times, which may exceed the default 60-second timeout for **Amazon SageMaker non-streaming endpoints**. This notebook shows examples for both streaming and non-streaming endpoints.

To avoid inference failures due to timeout:
- It is **recommended** to use a **SageMaker streaming endpoint** for this model.
- If your use case **requires** using a **non-streaming endpoint**, you must first contact **AWS Support** to request an increased timeout limit for your **AWS Account and Region** to avoid unexpected errors.


## Setup

Install the dependencies and set up the roles required to package the model and create the SageMaker endpoint.

In [1]:
import boto3, json, sagemaker, time, os
from sagemaker import get_execution_role
from pathlib import Path

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
region = sess.region_name
sts_client = sess.client('sts')
account_id = sts_client.get_caller_identity()['Account']

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### Define Arguments

In [2]:
public_nim_image = "public.ecr.aws/nvidia/nim:nvidia-nemotron-nano-9b-v2"
nim_model = "nvidia-nemotron-nano-9b-v2"
sm_model_name = "nvidia-nemotron-nano-9b-v2"
instance_type = "ml.p4d.24xlarge"
payload_model = "nvidia/nvidia-nemotron-nano-9b-v2"

### NIM Container

We first pull the NIM image from public ECR and then push it to a private ECR repository in your account so that it can be deployed on a SageMaker endpoint. Note:
  - The NIM ECR image is currently available only in the `us-east-1` region.
  - Your execution role must have `ecr:CreateRepository` and the appropriate push permissions.

In [3]:
import subprocess

# Get AWS account ID
result = subprocess.run(['aws', 'sts', 'get-caller-identity', '--query', 'Account', '--output', 'text'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

if result.returncode != 0:
    # Stop here: the script below cannot be built without the account ID
    raise RuntimeError(f"Error getting AWS account ID: {result.stderr}")

account = result.stdout.strip()
print(f"AWS account ID: {account}")

bash_script = f"""
echo "Public NIM Image: {public_nim_image}"
docker pull {public_nim_image}


echo "Resolved account: {account}"
echo "Resolved region: {region}"

nim_image="{account}.dkr.ecr.{region}.amazonaws.com/{nim_model}"

# Ensure the repository name adheres to AWS constraints
repository_name=$(echo "{nim_model}" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]._/-')

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "$repository_name" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "$repository_name" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin "{account}.dkr.ecr.{region}.amazonaws.com"

docker tag {public_nim_image} $nim_image
docker push $nim_image
echo -n $nim_image
"""
nim_image=f"{account}.dkr.ecr.{region}.amazonaws.com/{nim_model}"
# Run the bash script and capture real-time output
process = subprocess.Popen(bash_script, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

while True:
    output = process.stdout.readline()
    if output == b'' and process.poll() is not None:
        break
    if output:
        print(output.decode().strip())

stderr = process.stderr.read().decode()
if stderr:
    print("Errors:", stderr)


AWS account ID: 492681118881
Public NIM Image: public.ecr.aws/nvidia/nim:nvidia-nemotron-nano-9b-v2
nvidia-nemotron-nano-9b-v2: Pulling from nvidia/nim
65d6848aa6be: Pulling fs layer
ddc9da18b513: Pulling fs layer
4a39b63a208f: Pulling fs layer
8378c496babf: Pulling fs layer
ed0e2082d1bb: Pulling fs layer
b61659d9f609: Pulling fs layer
efaeba21701f: Pulling fs layer
d0ef6a820a7a: Pulling fs layer
b53078d42f1b: Pulling fs layer
9188cf7c8d41: Pulling fs layer
0154c8c7b419: Pulling fs layer
09693755eb54: Pulling fs layer
53afbb9356e9: Pulling fs layer
adedb551814d: Pulling fs layer
d502bdcaf3c6: Pulling fs layer
a0bda0fbe791: Pulling fs layer
36958f672d5a: Pulling fs layer
d5f8005d7dbc: Pulling fs layer
d95e964e9e83: Pulling fs layer
9671628a37fb: Pulling fs layer
a54a593b2866: Pulling fs layer
312f91995407: Pulling fs layer
b400957fb4c3: Pulling fs layer
161b59c42a08: Pulling fs layer
26738a387089: Pulling fs layer
ed0e2082d1bb: Waiting
b6504d77d244: Pulling fs layer
b61659d9f609: Waitin

We print the private ECR NIM image in your account that will be used for the SageMaker deployment.
- It should look similar to `"<ACCOUNT ID>.dkr.ecr.<REGION>.amazonaws.com/<NIM_MODEL>"` (Docker applies the `latest` tag when none is given).

In [4]:
print(nim_image)

492681118881.dkr.ecr.us-east-1.amazonaws.com/nvidia-nemotron-nano-9b-v2


### Create SageMaker Endpoint

**Before proceeding further, please set your NGC API Key.**

In [5]:
# SET ME
NGC_API_KEY = None

In [6]:
assert NGC_API_KEY is not None, "NGC API KEY is not set. Please set the NGC_API_KEY variable. It's required for running NIM."
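As an alternative to pasting the key into the notebook, the key could be read from the environment or typed at a hidden prompt. The following is a minimal sketch under that assumption. The helper `load_ngc_api_key` and the use of an `NGC_API_KEY` environment variable on the notebook host are assumptions, not something the notebook itself defines.

```python
import os
from getpass import getpass


def load_ngc_api_key(env=os.environ, prompt=getpass) -> str:
    """Return the NGC API key from `env`, falling back to a hidden prompt."""
    key = env.get("NGC_API_KEY") or prompt("NGC API key: ")
    key = key.strip()
    if not key:
        raise ValueError("NGC API key is empty")
    return key


# Usage in the notebook (left commented so the sketch does not prompt on import):
# NGC_API_KEY = load_ngc_api_key()
```

Keeping the key out of the saved notebook reduces the chance of committing it by accident.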

We define the SageMaker model from the NIM container, making sure to pass in **NGC_API_KEY**.

In [7]:
container = {
    "Image": nim_image,
    "Environment": {"NGC_API_KEY": NGC_API_KEY}
}
create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Model Arn: arn:aws:sagemaker:us-east-1:492681118881:model/nvidia-nemotron-nano-9b-v2


Next we create the endpoint configuration. Here we deploy the NVIDIA Nemotron Nano 9B v2 model on the specified instance type.

In [8]:
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
            "ContainerStartupHealthCheckTimeoutInSeconds": 1800,
            "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-2"
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-east-1:492681118881:endpoint-config/nvidia-nemotron-nano-9b-v2


Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment succeeds.

In [9]:
endpoint_name = sm_model_name

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:492681118881:endpoint/nvidia-nemotron-nano-9b-v2


In [10]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:492681118881:endpoint/nvidia-nemotron-nano-9b-v2
Status: InService
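The polling loop above stops as soon as the status is no longer `Creating`, so a failed deployment falls through silently. The snippet below is a minimal sketch of a stricter wait that raises on any other status and gives up after a fixed number of polls. The helper `wait_for_endpoint` and its parameters are hypothetical. It takes the describe call as an argument so it can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Poll `describe` until the endpoint is InService.

    `describe` is expected to behave like `sm.describe_endpoint`.
    """
    for _ in range(max_polls):
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            # Any other status (for example "Failed") will not recover by waiting
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {max_polls} polls")


# Usage in the notebook (assumes `sm` and `endpoint_name` from the cells above):
# resp = wait_for_endpoint(sm.describe_endpoint, endpoint_name)
```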


## Run Inference

Once the endpoint's status is `InService`, we can send a sample chat completion request using JSON as the payload format. NIM on SageMaker currently supports the OpenAI API chat completions inference protocol. For an explanation of the supported parameters, see [this link](https://platform.openai.com/docs/api-reference/chat).

<div class="alert alert-block alert-info">
<b>IMPORTANT:</b> The model name in the inference request payload must be the name of the NIM model. Please DON'T change it below.
</div>
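Because the response follows the OpenAI chat completion shape, the generated text can be pulled out of the parsed JSON rather than printing the whole object. The snippet below is a minimal sketch of that step. The helper `extract_reply` is hypothetical, and the sample response is a shortened, made-up object in the same shape as the outputs in this notebook.

```python
from typing import Optional, Tuple


def extract_reply(output: dict) -> Tuple[str, Optional[str]]:
    """Return (assistant text, finish_reason) from a chat completion response."""
    choice = output["choices"][0]
    text = choice["message"].get("content") or ""
    return text, choice.get("finish_reason")


# Made-up sample in the OpenAI chat completion shape
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Transformers use attention."},
            "finish_reason": "length",
        }
    ],
}

text, finish_reason = extract_reply(sample)
print(text)
if finish_reason == "length":
    # The reply was cut off by max_tokens; a larger budget may be needed
    print("(truncated)")
```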

### Non Reasoning Mode

In [11]:
payload = {
  "model": payload_model,
  "messages": [
    {   
      "role": "system",
      "content": "detailed thinking off"
    },
    {
      "role":"user",
      "content":"Explain how a transformer neural network works."
    }
  ],
  "max_tokens": 3000

}


response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))

{
  "id": "chatcmpl-bb91a457-1780-461e-9785-cf9a18646bc1",
  "object": "chat.completion",
  "created": 1758063086,
  "model": "nvidia/nvidia-nemotron-nano-9b-v2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Okay, I need to explain how a transformer neural network works. Let me start by recalling what I know about transformers. They're a type of neural network architecture introduced in the paper \"Attention Is All You Need\" by Vaswani et al. in 2017. Unlike previous models like RNNs or CNNs, transformers use attention mechanisms instead of recurrence or convolutions to process input data.\n\nFirst, I should explain the overall structure. Transformers consist of an encoder and a decoder. The encoder processes the input sequence, and the decoder generates the output sequence. Both parts have multiple layers, each containing self-attention and feed-forward neural networks.\n\nNow, the key component here is the attention mecha

### Reasoning Mode

In [12]:
payload = {
  "model": payload_model,
  "messages": [
    {   
      "role": "system",
      "content": "detailed thinking on"
    },
    {
      "role":"user",
      "content": "Explain how a transformer neural network works"
    }
  ],
  "max_tokens": 3000
}


response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))

{
  "id": "chatcmpl-6d13a11a-ac24-493a-be6f-fa551a251d9f",
  "object": "chat.completion",
  "created": 1758063108,
  "model": "nvidia/nvidia-nemotron-nano-9b-v2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Okay, so I need to explain how a transformer neural network works. Hmm, I remember that transformers are a type of neural network architecture, right? They're really important in natural language processing now, like in models such as BERT or GPT. But how exactly do they work? Let me think step by step.\n\nFirst, I know that traditional NLP models like RNNs and LSTMs process data sequentially, which can be slow and have issues with long-term dependencies. Transformers were introduced to handle this better. The key idea must be something about self-attention. I think that's the core component of a transformer. Self-attention allows the model to weigh the importance of different words in a sentence relative to each other. 

## Streaming inference

NIM on SageMaker also supports streaming inference. Enable it by setting **`"stream"` to `True`** in the payload and calling the [`invoke_endpoint_with_response_stream`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint_with_response_stream.html) method.
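A streamed response arrives as a sequence of byte fragments that do not necessarily line up with message boundaries. The snippet below is a minimal sketch of a line-based parser, assuming the body is a server-sent-event style stream in which each chunk is a single `data: {...}` line and the stream ends with `data: [DONE]`. The generator `iter_stream_text` is hypothetical, and the fake events only imitate the `PayloadPart`/`Bytes` structure returned by boto3.

```python
import json


def iter_stream_text(event_stream):
    """Yield content deltas from a streamed chat completion response."""
    buffer = b""  # raw bytes, so a multi-byte character split across events stays intact
    for event in event_stream:
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        # Only complete lines are parsed; a trailing partial line stays buffered
        while b"\n" in buffer:
            raw_line, buffer = buffer.split(b"\n", 1)
            line = raw_line.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue
            data = line[len("data:"):].strip()
            if data == "[DONE]":
                return
            choices = json.loads(data).get("choices") or [{}]
            text = (choices[0].get("delta") or {}).get("content")
            if text:
                yield text


# Fake events that split one chunk in the middle of a line
fake_events = [
    {"PayloadPart": {"Bytes": b'data: {"choices":[{"delta":{"content":"Hel'}},
    {"PayloadPart": {"Bytes": b'lo"},"finish_reason":null}]}\n\n'
                              b'data: {"choices":[{"delta":{"content":" world"},"finish_reason":"stop"}]}\n\n'
                              b'data: [DONE]\n\n'}},
]

for piece in iter_stream_text(fake_events):
    print(piece, end="", flush=True)
```

Splitting on newlines means the final chunk, whose `finish_reason` is no longer `null`, is parsed like every other chunk.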

### Non Reasoning Mode

In [13]:
payload = {
  "model": payload_model,
  "messages": [
    {   
      "role": "system",
      "content": "detailed thinking off"
    },
    {
      "role":"user",
      "content":"Explain how a transformer neural network works."
    }
  ],
  "max_tokens": 3000,
  "stream": True

}

response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

The following code post-processes the streaming output: it buffers the incoming bytes and prints each content delta once a complete chunk has arrived.

In [14]:
event_stream = response['Body']
accumulated_data = ""
start_marker = 'data:'
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        payload = event.get('PayloadPart', {}).get('Bytes', b'')
        if payload:
            data_str = payload.decode('utf-8')

            accumulated_data += data_str

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    continue
    except Exception as e:
        print(f"\nError processing event: {e}", flush=True)
        continue

Okay, so I need to explain how a transformer neural network works. Hmm, I remember that transformers are a type of neural network architecture, right? They were introduced in a paper called "Attention Is All You Need" in 2017. That's what made me think of them when I was trying to remember. But how exactly do they work compared to other models like RNNs?

Alright, let me start from the basics. Traditional models like RNNs process sequences step by step, maintaining a hidden state that gets updated with each input. But transformers don't use recurrence or convolutions; they rely entirely on attention mechanisms. That must be the key point. So the main idea is using attention to weigh the importance of different words in a sentence when predicting the next one.

Wait, but how does attention work exactly? Attention, in this context, probably refers to the self-attention mechanism. Self-attention allows the model to relate different positions of the input sequence. For example, in the sent

### Reasoning Mode

In [15]:
payload = {
  "model": payload_model,
  "messages": [
    {   
      "role": "system",
      "content": "detailed thinking on"
    },
    {
      "role":"user",
      "content":"Explain how a transformer neural network works."
    }
  ],
  "max_tokens": 3000,
  "stream": True

}

response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

The same post-processing code is used for the streaming output in reasoning mode.

In [16]:
event_stream = response['Body']
accumulated_data = ""
start_marker = 'data:'
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        payload = event.get('PayloadPart', {}).get('Bytes', b'')
        if payload:
            data_str = payload.decode('utf-8')

            accumulated_data += data_str

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    continue
    except Exception as e:
        print(f"\nError processing event: {e}", flush=True)
        continue

Okay, so I need to explain how a transformer neural network works. Hmm, I remember that transformers are a type of neural network architecture used in NLP, right? They were introduced in that paper "Attention Is All You Need" from 2017. But I'm a bit fuzzy on the exact details. Let me start by recalling what I know.

First, transformers don't use recurrent layers like LSTMs or GRUs. Instead, they rely entirely on attention mechanisms to process input data. That's a key point. The transformer architecture processes all tokens in a sequence simultaneously rather than sequentially, which makes training faster because it can leverage parallel processing. But wait, how exactly does the attention mechanism work?

Attention allows the model to weigh different parts of the input differently when making predictions. For example, in a sentence, certain words might be more relevant than others when predicting the next word. The self-attention mechanism computes relationships between all words in 

## Agent implementation with Tool Calling

### Introduction

Based on the user input, the agent invokes **a tool from the pool of available tools**. The agent will decide and invoke the required **tool(s)** to get the response back to the user.

In [17]:
class ToolError(Exception):
    pass

### Define tools for the agents

In [18]:
def get_current_weather(input) -> int:
    """Get the current temperature from a city, in Fahrenheit"""
    
    city = input["city"].lower()
    country = input["country"].lower()
    
    # Hardcoded temperature data
    weather = {
        "us": {
            "new york": 82,
            "los angeles": 76,
            "chicago": 70
        },
        "gb": {
            "london": 68
        },
        "ca": {
            "toronto": 65,
            "vancouver": 60
        }
    }

    # Look up the temperature
    try:
        return weather[country][city]
    except KeyError:
        raise ValueError(f"No temperature found for {city.title()}, {country.upper()}")

def get_difference(input) -> int:
    """Get the difference between two numbers"""
    
    minuend = input["minuend"]
    subtrahend = input["subtrahend"]
    return minuend - subtrahend


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current temperature from a city, in Fahrenheit",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City"
                    },
                    "country": {
                        "type": "string",
                        "description": "Country Code (e.g. US, GB, CA)"
                    }
                },
                "required": ["city", "country"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_difference",
            "description": "Get the difference between two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "minuend": {
                        "type": "integer",
                        "description": "The number from which another number is to be subtracted"
                    },
                    "subtrahend": {
                        "type": "integer",
                        "description": "The number to be subtracted"
                    }
                },
                "required": ["minuend", "subtrahend"]
            }
        }
    }
]

### Define helper function to call SageMaker

In [19]:
def call_sagemaker(messages, tools):
    """Call SageMaker endpoint with OpenAI API format"""
    
    payload = {
        "model": payload_model,
        "messages": messages,
        "max_tokens": 2000,
        "temperature": 0,
        "tool_choice": "auto",
        "tools": tools
    }
    
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload)
    )
    
    # Parse response
    response_body = json.loads(response['Body'].read().decode('utf-8'))
    return response_body

### Define helper function to invoke a given tool

In [20]:
def get_tool_result(tool_call):
    """Execute a tool call and return the result"""
    tool_name = tool_call['function']['name']
    tool_args = json.loads(tool_call['function']['arguments'])
    
    print(f"Using tool `{tool_name}` with args `{tool_args}`")
    func = globals()[tool_name]
    try:
        return func(tool_args)  # Pass the full args dict, not just 'query'
    except Exception as e:
        raise ToolError(f"Something went wrong: {e}")
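Looking tools up through `globals()` lets the model name any global callable in the notebook. The snippet below is a minimal sketch of an alternative that dispatches through an explicit registry, so only listed functions can be invoked. The names `TOOL_REGISTRY`, `run_tool` and the stand-in tool `echo_city` are hypothetical. In the notebook, the registry would map `"get_current_weather"` and `"get_difference"` to the functions defined earlier.

```python
import json


class ToolError(Exception):
    """Raised when a tool cannot be resolved or fails while running."""


def echo_city(args):
    """Stand-in tool used only to make this sketch self-contained."""
    return args["city"].title()


# Only functions listed here can be invoked by the model
TOOL_REGISTRY = {"echo_city": echo_city}


def run_tool(tool_call, registry=TOOL_REGISTRY):
    """Resolve the tool by name, parse its JSON arguments and execute it."""
    name = tool_call["function"]["name"]
    if name not in registry:
        raise ToolError(f"Unknown tool: {name}")
    try:
        args = json.loads(tool_call["function"]["arguments"])
        return registry[name](args)
    except Exception as e:
        raise ToolError(f"Tool `{name}` failed: {e}") from e


print(run_tool({"function": {"name": "echo_city", "arguments": '{"city": "london"}'}}))
```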

### Define function to handle the raw responses from SageMaker

In [21]:
def handle_model_response(response):
    """Handle tool calls in the model response"""
    
    message = response['choices'][0]['message']
    
    # Check if there are tool calls
    if not message.get('tool_calls'):
        return None, message
    
    tool_messages = []
    
    for tool_call in message['tool_calls']:
        try:
            tool_result = get_tool_result(tool_call)
            
            tool_message = {
                "role": "tool",
                "tool_call_id": tool_call['id'],
                "content": json.dumps(tool_result)
            }
            tool_messages.append(tool_message)
            
        except ToolError as e:
            tool_message = {
                "role": "tool",
                "tool_call_id": tool_call['id'],
                "content": f"Error: {str(e)}"
            }
            tool_messages.append(tool_message)
    
    return tool_messages, message

### Define function that implements the Agent Loop till final response is received

In [22]:
def run_agent(messages, tools):
    """Run the agent loop until completion"""
    MAX_LOOPS = 10
    loop_count = 0

    while loop_count < MAX_LOOPS:
        loop_count += 1
        
        # Call the model
        response = call_sagemaker(messages, tools)
        
        # Handle the response
        tool_messages, assistant_message = handle_model_response(response)
        
        # Add assistant message to conversation
        messages.append(assistant_message)
        
        # If no tool calls, we're done
        if tool_messages is None:
            final_output = assistant_message.get('content', '')
            break
        
        # Add tool results to conversation
        messages.extend(tool_messages)
    
    else:
        final_output = "Maximum loops reached"
    
    return messages, final_output

### Define agent executor

In [23]:
def weather_agent_executor(input_prompt):
    
    system_prompt = """You are a helpful weather assistant. You have access to tools that return:
    - get_current_weather: Returns temperature in Fahrenheit for a given city
    - get_difference: Returns the numerical difference between two numbers
    
    When you make tool calls and receive results, use those results directly in your answer. 
    The tool results correspond exactly to the tool calls you made in the same order.

    Make the answer brief, do not mention tool calls in your answer
    """
    detailed_thinking = "off"
    messages = [
        {
            "role": "system",
            "content": f"detailed thinking {detailed_thinking}"
        },
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": input_prompt
        }
    ]

    workflow, output = run_agent(messages, tools)
    return workflow, output

### Test agent

In [24]:
workflow, output = weather_agent_executor("Where is it warmest: New york, london or Toronto? And by how much is it warmer than the other cities?")
print("\n========Output=============\n")
print(output)
print("\n========Tool calling details=============\n")
print(workflow)

Using tool `get_current_weather` with args `{'city': 'New york', 'country': 'US'}`
Using tool `get_current_weather` with args `{'country': 'US', 'city': 'New York'}`
Using tool `get_current_weather` with args `{'country': 'GB', 'city': 'London'}`


Okay, let me try to figure out how to answer the user's question. They want to know which city is the warmest among New York, London, and Toronto, and by how much it's warmer than the others.

First, I need to get the current temperatures for all three cities. The user already provided some tool responses. Let me check the history. 

In the first tool call, the assistant asked for New York, US, and the response was 82°F. Then, the next tool call was for London, GB, which returned 68°F. Now, the user's last message shows a tool response of 68, which I assume is for Toronto, CA. Wait, but the user hasn't explicitly mentioned Toronto yet. Let me make sure. 

Wait, looking at the conversation, the user's initial question includes Toronto. The as

## Terminate endpoint and clean up artifacts

In [25]:
sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '2eed26ab-078d-433c-b9b4-8106f9fbd280',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '2eed26ab-078d-433c-b9b4-8106f9fbd280',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 16 Sep 2025 22:53:21 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}
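If one of the delete calls above raises, the remaining resources are left behind and the endpoint may keep incurring cost. The snippet below is a minimal sketch of a cleanup helper that attempts every deletion, starting with the endpoint, and reports what failed. The helper `delete_endpoint_resources` is hypothetical. It only assumes a client exposing `delete_endpoint`, `delete_endpoint_config` and `delete_model`, as the SageMaker client used in this notebook does.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config and model; return a dict of failures."""
    steps = [
        ("endpoint", sm_client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint config", sm_client.delete_endpoint_config, {"EndpointConfigName": endpoint_config_name}),
        ("model", sm_client.delete_model, {"ModelName": model_name}),
    ]
    failures = {}
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
        except Exception as exc:
            # Keep going so one failure does not leave the other resources behind
            failures[label] = str(exc)
    return failures


# Usage in the notebook (assumes the variables defined in the cells above):
# print(delete_endpoint_resources(sm, endpoint_name, endpoint_config_name, sm_model_name))
```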