In [1]:
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_merlin_getting-started-movielens-01-download-convert/nvidia_logo.png" style="width: 90px; float: right;">

# Training and Serving Merlin on AWS SageMaker

This notebook was created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container.
Note that the AWS libraries used in this notebook require AWS credentials. If you are running this notebook in a container, you might need to restart the container with the AWS credentials mounted, e.g., `-v $HOME/.aws:$HOME/.aws`.
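
As a quick sanity check before running the AWS cells, the following sketch looks for static credentials in the two most common places: environment variables and the shared credentials file that the `-v $HOME/.aws:$HOME/.aws` mount exposes. The helper name `find_aws_credential_sources` is hypothetical, and the check is deliberately incomplete: AWS SDKs can also obtain credentials from other sources (for example, instance roles or SSO), which this sketch does not detect.

```python
import os
from pathlib import Path


def find_aws_credential_sources(home=None, environ=None):
    """Return the places where static AWS credentials appear to be configured.

    Hypothetical helper for illustration; it only checks two common sources.
    """
    home = Path(home) if home is not None else Path.home()
    environ = os.environ if environ is None else environ

    sources = []
    # Access keys exported as environment variables.
    if environ.get("AWS_ACCESS_KEY_ID") and environ.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment")
    # Shared credentials file, i.e. what mounting $HOME/.aws makes visible.
    if (home / ".aws" / "credentials").is_file():
        sources.append("shared-credentials-file")
    return sources


print(find_aws_credential_sources())
```

An empty list suggests that the container may need to be restarted with the credentials mounted.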


With Amazon SageMaker, you can package your own models and then train and deploy them in the SageMaker environment. This notebook shows you how to use Merlin for training and inference in the SageMaker environment.

To run this notebook, you need a working [AWS CLI](https://aws.amazon.com/cli/) and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) installed.

In [2]:
! python3 -m pip -q install sagemaker

## Part 1: Preparing your Merlin model

### Testing your algorithm on your local machine

In this notebook, we build our recommender system ranking models on synthetic train and test datasets that mimic the real [Ali-CCP](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408#1) (Alibaba Click and Conversion Prediction) dataset. Ali-CCP is a dataset gathered from real-world traffic logs of the recommender system in Taobao, the largest online retail platform in the world.

If you would like to use the real Ali-CCP dataset instead, you can download the training and test datasets from [tianchi.aliyun.com](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408#1). You can then use the [get_aliccp()](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/datasets/ecommerce/aliccp/dataset.py#L43) function to curate the raw CSV files and save them as Parquet files.

In [3]:
import os

from merlin.datasets.synthetic import generate_data

DATA_FOLDER = os.environ.get("DATA_FOLDER", "/workspace/data/")
NUM_ROWS = int(os.environ.get("NUM_ROWS", 1000000))
# Parse the flag explicitly instead of calling eval() on an environment variable
SYNTHETIC_DATA = os.environ.get("SYNTHETIC_DATA", "True").lower() in ("true", "1")
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", 512))

if SYNTHETIC_DATA:
    train, valid = generate_data("aliccp-raw", int(NUM_ROWS), set_sizes=(0.7, 0.3))
    # save the datasets as parquet files
    train.to_ddf().to_parquet(os.path.join(DATA_FOLDER, "train"))
    valid.to_ddf().to_parquet(os.path.join(DATA_FOLDER, "valid"))



Before you run your algorithm on SageMaker, you probably want to train and test it locally first to make sure that it works correctly.
The training script [train.py](./train.py) in this example starts with the synthetic dataset we created in the previous cell and produces a ranking model by performing the following tasks:
- Perform feature engineering and preprocessing with [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular). NVTabular implements common feature engineering and preprocessing operators in easy-to-use, high-level APIs.
- Use [Merlin Models](https://github.com/NVIDIA-Merlin/models/) to train [Facebook's DLRM model](https://arxiv.org/pdf/1906.00091.pdf) in TensorFlow.
- Prepare [ensemble models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) for serving on [Triton Inference Server](https://github.com/triton-inference-server/server).

The training script outputs the final ensemble models to `model_dir`. Make sure that your script writes all of its artifacts to `model_dir`, since SageMaker packages the files in this directory into a compressed tar archive and makes it available at an S3 location. The ensemble models uploaded to S3 are used later in this notebook to serve predictions from Triton Inference Server.
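
To make the role of `model_dir` concrete, here is a minimal sketch that mimics the packaging step locally with Python's `tarfile` module. SageMaker performs this step itself during a training job; the helper names `package_model_dir` and `list_archive`, as well as the placeholder files, are assumptions made for illustration. The point is that only files under `model_dir` end up in the archive.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(model_dir, archive_path):
    """Compress everything under model_dir into a gzip tar archive (hypothetical helper)."""
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.iterdir()):
            # arcname keeps member paths relative to model_dir.
            tar.add(path, arcname=path.name)
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of a gzip tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the output of train.py; the file contents are placeholders.
    model_dir = Path(tmp) / "model"
    (model_dir / "ensemble_model" / "1").mkdir(parents=True)
    (model_dir / "ensemble_model" / "config.pbtxt").write_text("# placeholder\n")
    # A file written outside model_dir is not packaged.
    (Path(tmp) / "outside.txt").write_text("not packaged\n")

    archive = package_model_dir(model_dir, Path(tmp) / "model.tar.gz")
    members = list_archive(archive)

print(members)
```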

In [4]:
! python3 train.py \
    --train_dir={DATA_FOLDER}/train/ \
    --valid_dir={DATA_FOLDER}/valid/ \
    --model_dir=/tmp/ \
    --batch_size=512 \
    --epochs=1

2022-10-20 17:49:52.902606: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-20 17:49:53.887953: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-20 17:49:53.888286: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-20 17:49:53.888376: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-20 17:49:53.897977: I tensorflow/core/

### The `Dockerfile`

The `Dockerfile` describes the image that will be used on SageMaker for training and inference.
We start from the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) docker image and install the [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit) library, which makes the image compatible with SageMaker for training models.

In [5]:
! cat container/Dockerfile

FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.09

RUN pip3 install sagemaker-training


### Building and registering the container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. This code is available as the shell script `build_and_push_image.sh`. If you are running this notebook inside the [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) docker container, you probably want to execute the script outside the container.

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this is the region where the notebook instance was created). If the repository doesn't exist, the script will create it.

Note that running the following script requires permissions to create new repositories on Amazon ECR.

In [6]:
! cat ./build_and_push_image.sh

#!/bin/bash

set -euo pipefail

# The name of our algorithm
ALGORITHM_NAME=sagemaker-merlin-tensorflow
REGION=us-east-1

cd container

ACCOUNT=$(aws sts get-caller-identity --query Account --output text --region ${REGION})

# Build the ECR registry and image URIs from the account and region above

REPOSITORY="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com"
IMAGE_URI="${REPOSITORY}/${ALGORITHM_NAME}:latest"

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${REPOSITORY}

# If the repository doesn't exist in ECR, create it.
# The check runs inside `if` so that `set -e` does not abort the script when it fails.

if ! aws ecr describe-repositories --repository-names "${ALGORITHM_NAME}" --region ${REGION} > /dev/null 2>&1
then
    aws ecr create-repository --repository-name "${ALGORITHM_NAME}" --region ${REGION} > /dev/null
fi

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${ALGORI

## Part 2: Training your Merlin model on Sagemaker

Once you have tested the script that creates the Merlin ensemble graph, you can use it to train the model on SageMaker.

Here, we create a SageMaker session that we will use to perform our SageMaker operations, set the S3 key prefix for the uploaded data, and look up the execution role for working with SageMaker.

In [7]:
import sagemaker

sess = sagemaker.Session()

# S3 prefix
prefix = "DEMO-merlin-tensorflow-aliccp"

role = sagemaker.get_execution_role()

print(role)

We can use the SageMaker Python SDK to upload the Ali-CCP synthetic data to our S3 bucket.

In [None]:
# DATA_FOLDER is the directory where the synthetic data was saved earlier in this notebook
data_location = sess.upload_data(DATA_FOLDER, key_prefix=prefix)

print(data_location)

### Training on Sagemaker using the Python SDK

SageMaker provides a Python SDK for training models on SageMaker.

Here, we start by building the ECR image URL of the image we pushed in the previous section.

In [8]:
import boto3

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]

my_session = boto3.session.Session()
region = my_session.region_name

algorithm_name = "sagemaker-merlin-tensorflow"

ecr_image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, algorithm_name)

print(ecr_image)

843263297212.dkr.ecr.us-east-1.amazonaws.com/sagemaker-merlin-tensorflow:latest


We can call `Estimator.fit()` to start training on SageMaker. Here, we use a `g4dn` GPU instance; this instance family is equipped with NVIDIA T4 GPUs.
Our training script `train.py` is passed to the Estimator through the `entry_point` parameter, and we can adjust our hyperparameters through the `hyperparameters` parameter.
We uploaded our training dataset to our S3 bucket in a previous code cell, and the S3 URLs of our training and validation sets are passed to the `fit()` method.
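
The following sketch illustrates how a training script can pick up these values inside the container. It assumes the behavior of the sagemaker-training-toolkit: hyperparameters arrive as command-line arguments, and each channel passed to `fit()` is exposed through an `SM_CHANNEL_<NAME>` environment variable, with `SM_MODEL_DIR` pointing at the model output directory. The flag names mirror the local `train.py` invocation shown earlier, but the actual `train.py` may be written differently.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, falling back to SageMaker's environment variables."""
    parser = argparse.ArgumentParser()
    # The channel names used in fit() ("train", "valid") map to
    # SM_CHANNEL_TRAIN and SM_CHANNEL_VALID inside the training container.
    parser.add_argument(
        "--train_dir", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    )
    parser.add_argument(
        "--valid_dir", default=os.environ.get("SM_CHANNEL_VALID", "/opt/ml/input/data/valid")
    )
    # Files written here are packaged into model.tar.gz after training.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Hyperparameters are passed as command-line arguments.
    parser.add_argument("--batch_size", type=int, default=512)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


# Simulate the arguments that a training job would pass to the script.
args = parse_args(["--batch_size", "1024", "--epochs", "10"])
print(args.batch_size, args.epochs, args.model_dir)
```

Falling back to environment variables keeps the same script usable both locally, where the directories are given explicitly, and on SageMaker.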

In [9]:
import os
from sagemaker.estimator import Estimator


training_instance_type = "ml.g4dn.xlarge"  # GPU instance, T4

estimator = Estimator(
    role=role,
    instance_count=1,
    instance_type=training_instance_type,
    image_uri=ecr_image,
    entry_point="train.py",
    hyperparameters={
        "batch_size": 1_024,
        "epochs": 10,  # matches the --epochs flag used in the local train.py run
    },
)

estimator.fit(
    {
        "train": f"{data_location}/train/",
        "valid": f"{data_location}/valid/",
    }
)

2022-10-18 23:05:16 Starting - Starting the training job...
2022-10-18 23:05:41 Starting - Preparing the instances for trainingProfilerReport-1666134315: InProgress
......
2022-10-18 23:06:52 Downloading - Downloading input data...
== Triton Inference Server Base ==
NVIDIA Release 22.08 (build 42766143)
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.7 driver version 515.65.01 with kernel driver version 510.47.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for deta

In [10]:
print(estimator.model_data)

s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-10-18-23-05-14-440/output/model.tar.gz


In [11]:
! aws s3 cp {estimator.model_data} /tmp/ensemble/

download: s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-10-18-23-05-14-440/output/model.tar.gz to ../../../../tmp/ensemble/model.tar.gz


In [12]:
! tar xvzf /tmp/ensemble/model.tar.gz -C /tmp/ensemble/

1_predicttensorflow/
1_predicttensorflow/1/
1_predicttensorflow/1/model.savedmodel/
1_predicttensorflow/1/model.savedmodel/saved_model.pb
1_predicttensorflow/1/model.savedmodel/variables/
1_predicttensorflow/1/model.savedmodel/variables/variables.index
1_predicttensorflow/1/model.savedmodel/variables/variables.data-00000-of-00001
1_predicttensorflow/1/model.savedmodel/keras_metadata.pb
1_predicttensorflow/1/model.savedmodel/assets/
1_predicttensorflow/config.pbtxt
ensemble_model/
ensemble_model/1/
ensemble_model/config.pbtxt
0_transformworkflow/
0_transformworkflow/1/
0_transformworkflow/1/model.py
0_transformworkflow/1/workflow/
0_transformworkflow/1/workflow/metadata.json
0_transformworkflow/1/workflow/workflow.pkl
0_transformworkflow/1/workflow/categories/
0_transformworkflow/1/workflow/categories/unique.user_consumption_2.parquet
0_transformworkflow/1/workflow/categories/unique.item_category.parquet
0_transformworkflow/1/workflow/categories/unique.user_profile.parquet
0_transformwo

## Part 3: Retrieving Recommendations from Triton Inference Server

Although we used the SageMaker Python SDK to train our model, here we use `boto3` to launch our inference endpoint, as it offers more low-level control than the Python SDK.

The model artifact `model.tar.gz` uploaded to S3 by the SageMaker training job contains three directories: `0_transformworkflow` for the NVTabular workflow, `1_predicttensorflow` for the TensorFlow model, and `ensemble_model` for the ensemble graph that we can use in Triton.

```shell
/tmp/ensemble/
├── 0_transformworkflow
│   ├── 1
│   │   ├── model.py
│   │   └── workflow
│   │       ├── categories
│   │       │   ├── unique.item_brand.parquet
│   │       │   ├── unique.item_category.parquet
│   │       │   ├── unique.item_id.parquet
│   │       │   ├── unique.item_shop.parquet
│   │       │   ├── unique.user_age.parquet
│   │       │   ├── unique.user_brands.parquet
│   │       │   ├── unique.user_categories.parquet
│   │       │   ├── unique.user_consumption_2.parquet
│   │       │   ├── unique.user_gender.parquet
│   │       │   ├── unique.user_geography.parquet
│   │       │   ├── unique.user_group.parquet
│   │       │   ├── unique.user_id.parquet
│   │       │   ├── unique.user_intentions.parquet
│   │       │   ├── unique.user_is_occupied.parquet
│   │       │   ├── unique.user_profile.parquet
│   │       │   └── unique.user_shops.parquet
│   │       ├── metadata.json
│   │       └── workflow.pkl
│   └── config.pbtxt
├── 1_predicttensorflow
│   ├── 1
│   │   └── model.savedmodel
│   │       ├── assets
│   │       ├── keras_metadata.pb
│   │       ├── saved_model.pb
│   │       └── variables
│   │           ├── variables.data-00000-of-00001
│   │           └── variables.index
│   └── config.pbtxt
├── ensemble_model
│   ├── 1
│   └── config.pbtxt
└── model.tar.gz
```
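
Before deploying, it can be useful to confirm that the extracted archive has this shape. The sketch below is a hypothetical check, not part of Merlin or Triton: it assumes that every model directory should contain a `config.pbtxt` file and at least one numeric version directory, as in the tree above. It runs against a temporary placeholder directory here; point it at `/tmp/ensemble/` to check the real artifacts.

```python
import tempfile
from pathlib import Path


def check_model_repository(repo_dir):
    """Map each model directory to a list of layout problems (empty if none)."""
    problems = {}
    for model_dir in sorted(p for p in Path(repo_dir).iterdir() if p.is_dir()):
        issues = []
        if not (model_dir / "config.pbtxt").is_file():
            issues.append("missing config.pbtxt")
        # Version directories have numeric names, e.g. "1".
        versions = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
        if not versions:
            issues.append("missing numeric version directory")
        problems[model_dir.name] = issues
    return problems


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder layout that mirrors the directory names in the tree above.
    repo = Path(tmp)
    for name in ("0_transformworkflow", "1_predicttensorflow", "ensemble_model"):
        (repo / name / "1").mkdir(parents=True)
        (repo / name / "config.pbtxt").write_text("# placeholder\n")
    (repo / "model.tar.gz").write_bytes(b"")  # plain files are ignored
    report = check_model_repository(repo)

print(report)
```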

We specify that we only want to use `ensemble_model` in Triton by passing the environment variable `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`.

In [13]:
import time

import boto3

sm_client = boto3.client(service_name="sagemaker")

container = {
    "Image": ecr_image,
    "ModelDataUrl": estimator.model_data,
    "Environment": {
        "SAGEMAKER_TRITON_TENSORFLOW_VERSION": "2",
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble_model",
    },
}

model_name = "model-triton-merlin-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_model_response = sm_client.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

model_arn = create_model_response["ModelArn"]

print(f"Model Arn: {model_arn}")

Model Arn: arn:aws:sagemaker:us-east-1:843263297212:model/model-triton-merlin-ensemble-2022-10-18-23-17-19


We again use a `g4dn` GPU instance, equipped with NVIDIA T4 GPUs, to launch the Triton Inference Server.

In [14]:
endpoint_instance_type = "ml.g4dn.xlarge"

endpoint_config_name = "endpoint-config-triton-merlin-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": endpoint_instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

endpoint_config_arn = create_endpoint_config_response["EndpointConfigArn"]

print(f"Endpoint Config Arn: {endpoint_config_arn}")

Endpoint Config Arn: arn:aws:sagemaker:us-east-1:843263297212:endpoint-config/endpoint-config-triton-merlin-ensemble-2022-10-18-23-17-20


In [15]:
endpoint_name = "endpoint-triton-merlin-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

endpoint_arn = create_endpoint_response["EndpointArn"]

print(f"Endpoint Arn: {endpoint_arn}")

Endpoint Arn: arn:aws:sagemaker:us-east-1:843263297212:endpoint/endpoint-triton-merlin-ensemble-2022-10-18-23-17-21


In [16]:
rv = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = rv["EndpointStatus"]
print(f"Endpoint Creation Status: {status}")

# Poll once a minute until the endpoint leaves the "Creating" state
while status == "Creating":
    time.sleep(60)
    rv = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = rv["EndpointStatus"]
    print(f"Endpoint Creation Status: {status}")

endpoint_arn = rv["EndpointArn"]

print(f"Endpoint Arn: {endpoint_arn}")
print(f"Endpoint Status: {status}")

Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: InService
Endpoint Arn: arn:aws:sagemaker:us-east-1:843263297212:endpoint/endpoint-triton-merlin-ensemble-2022-10-18-23-17-21
Endpoint Status: InService
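
The polling loop above waits for as long as the endpoint reports `Creating`. A variant with an upper bound on the waiting time could look like the following sketch. The helper `wait_for_status` and its parameters are hypothetical, and the status source is a stand-in for `sm_client.describe_endpoint(...)["EndpointStatus"]` so that the example runs without AWS access.

```python
import time


def wait_for_status(get_status, pending="Creating", poll_seconds=60,
                    timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it stops returning `pending` or the timeout elapses."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {pending!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Stand-in for repeated describe_endpoint calls; no real waiting happens here.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

With a real client, `get_status` would wrap the `describe_endpoint` call. boto3 also provides waiters (for example, `endpoint_in_service` on the SageMaker client) that may be a better fit in production code.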


### Send a Request to Triton Inference Server to Transform a Raw Dataset

Once the endpoint is running, we can test it by sending requests.
Here, we use the raw validation set, which the endpoint transforms using the saved NVTabular workflow. We load the copy of the workflow that we downloaded from S3 in the previous section to look up the input columns.

In [17]:
from merlin.schema.tags import Tags
from merlin.core.dispatch import get_lib
from nvtabular.workflow import Workflow

df_lib = get_lib()

original_data_path = DATA_FOLDER
workflow = Workflow.load("/tmp/ensemble/0_transformworkflow/1/workflow/")

label_columns = workflow.output_schema.select_by_tag(Tags.TARGET).column_names
workflow.remove_inputs(label_columns)

# read in data for request
batch = df_lib.read_parquet(
    os.path.join(original_data_path, "valid", "part.0.parquet"),
    columns=workflow.input_schema.column_names
).head(3)  # keep three rows, matching the three-row request shown in the output below
print(batch)

RuntimeError: Failed to dlopen libcuda.so

Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInit'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3553, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 424, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so


RuntimeError: Function "cuInit" not found

Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInit'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3556, in cuda._cuda.ccuda._cuInit
RuntimeError: Function "cuInit" not found


                     user_id  item_id  item_category  item_shop  item_brand  \
__null_dask_index__                                                           
700000                    23       23             66       4590        1581   
700001                    11       10             27       1878         647   
700002                    30       25             72       5007        1725   

                     user_shops  user_profile  user_group  user_gender  \
__null_dask_index__                                                      
700000                     1024             1           1            1   
700001                      466             1           1            1   
700002                     1349             2           1            1   

                     user_age  user_consumption_2  user_is_occupied  \
__null_dask_index__                                                   
700000                      1                   1                 1   
700001              

In the following code cell, we use a utility function provided by [Merlin Systems](https://github.com/NVIDIA-Merlin/systems) to convert our dataframe into the payload format that Triton expects for inference requests.

In [18]:
from merlin.systems.triton import convert_df_to_triton_input
import tritonclient.http as httpclient

inputs = convert_df_to_triton_input(workflow.input_schema, batch, httpclient.InferInput)

request_body, header_length = httpclient.InferenceServerClient.generate_request_body(inputs)

print(request_body)

b'{"inputs":[{"name":"user_id","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"item_id","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"item_category","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"item_shop","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"item_brand","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"user_shops","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"user_profile","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"user_group","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"user_gender","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"user_age","shape":[3,1],"datatype":"INT32","parameters":{"binary_data_size":12}},{"name":"user_consumption_2","shape":[3,1],"datatype":"INT32","paramet

Triton uses the [KServe community standard inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/protocol/README.md).
Here, we use the [binary+json format](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) for optimal performance in the inference request.

In order for Triton to correctly parse the binary payload, we have to specify the length of the request metadata in the header `json-header-size`.
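
To show what the header length refers to, here is a minimal sketch that assembles a binary+json body by hand for `INT32` columns of shape `[n, 1]`, mirroring the request printed above: a JSON header describing each input, followed by the raw little-endian tensor bytes in the same order. The function `build_binary_json_request` is hypothetical and only covers this simple case; in practice, `generate_request_body` from `tritonclient` does this for you.

```python
import json
import struct


def build_binary_json_request(columns):
    """Build a binary+json body for INT32 columns of shape [n, 1].

    Returns (body, json_header_length). Hypothetical helper for illustration.
    """
    inputs = []
    payload = b""
    for name, values in columns.items():
        raw = struct.pack(f"<{len(values)}i", *values)  # little-endian int32
        inputs.append(
            {
                "name": name,
                "shape": [len(values), 1],
                "datatype": "INT32",
                "parameters": {"binary_data_size": len(raw)},
            }
        )
        payload += raw
    header = json.dumps({"inputs": inputs}, separators=(",", ":")).encode("utf-8")
    # The JSON header comes first; the raw tensor bytes follow in input order.
    return header + payload, len(header)


# Values taken from the first two columns of the batch printed earlier.
body, header_length = build_binary_json_request(
    {"user_id": [23, 11, 30], "item_id": [23, 10, 25]}
)

# The receiver uses the header length to split the JSON part from the binary part.
header = json.loads(body[:header_length])
first_column = struct.unpack("<3i", body[header_length : header_length + 12])
print(header["inputs"][0], first_column)
```

Without the header length, the receiver could not tell where the JSON ends and the binary tensor data begins.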

In [19]:
runtime_sm_client = boto3.client("sagemaker-runtime")

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}",
    Body=request_body,
)

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]

# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)
output_data = result.as_numpy("click/binary_classification_task")
print("predicted sigmoid result:\n", output_data)

[[0.5257045]
 [0.5127169]
 [0.4523193]]
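
The cell above extracts the response header length by slicing off a fixed prefix. A slightly more defensive variant, sketched below, validates the media type and looks up the `json-header-size` parameter by name. The function name `parse_json_header_size` and the value `172` are made up for illustration.

```python
def parse_json_header_size(content_type):
    """Extract json-header-size from a SageMaker Triton binary+json content type."""
    media_type, _, params = content_type.partition(";")
    if media_type.strip() != "application/vnd.sagemaker-triton.binary+json":
        raise ValueError(f"unexpected content type: {content_type!r}")
    for param in params.split(";"):
        key, _, value = param.strip().partition("=")
        if key == "json-header-size":
            return int(value)
    raise ValueError(f"json-header-size not found in {content_type!r}")


# Hypothetical content type; the real value comes from response["ContentType"].
size = parse_json_header_size(
    "application/vnd.sagemaker-triton.binary+json;json-header-size=172"
)
print(size)
```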


## Terminate endpoint and clean up artifacts

Don't forget to clean up artifacts and terminate the endpoint, or the endpoint will continue to incur costs.

In [20]:
sm_client.delete_model(ModelName=model_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'a3731da7-55d9-49be-886d-9f77d550a312',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a3731da7-55d9-49be-886d-9f77d550a312',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Tue, 18 Oct 2022 23:24:30 GMT'},
  'RetryAttempts': 0}}