### Permissions

Running this notebook requires permissions beyond the standard `SageMakerFullAccess` policy because it creates new repositories in Amazon ECR. The easiest way to grant them is to attach the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. You do not need to restart the notebook instance afterwards; the new permissions take effect immediately.
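If you prefer to attach the policy from code rather than the IAM console, the following is a minimal sketch using the boto3 IAM client. The helper names `role_name_from_arn` and `attach_ecr_policy` are hypothetical, and the sketch assumes that the credentials making the call are themselves allowed to run `iam:AttachRolePolicy`, which is often not the case for the notebook role itself.

```python
ECR_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"


def role_name_from_arn(role_arn: str) -> str:
    """Return the role name, i.e. the last path segment of an IAM role ARN."""
    return role_arn.split("/")[-1]


def attach_ecr_policy(role_arn: str, iam_client=None) -> None:
    """Attach the ECR managed policy to the role identified by `role_arn`."""
    if iam_client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3

        iam_client = boto3.client("iam")
    iam_client.attach_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyArn=ECR_POLICY_ARN,
    )
```

Calling `attach_ecr_policy(role)` with the ARN returned by `get_execution_role()` would then attach the policy to the notebook role.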

In [1]:
! cat container/Dockerfile

FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.09

RUN pip3 install sagemaker-training

COPY train.py /opt/ml/code/train.py
COPY serve /opt/ml/code/serve

ENV SAGEMAKER_PROGRAM train.py

EXPOSE 8080


In [2]:
! cat build_and_push_image.sh

#!/bin/bash

set -euo pipefail

# The name of our algorithm
ALGORITHM_NAME=sagemaker-merlin-tensorflow
REGION=us-east-1

cd container

ACCOUNT=$(aws sts get-caller-identity --query Account --output text --region ${REGION})

# Registry host for this account in the region set explicitly above

REPOSITORY="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com"
IMAGE_URI="${REPOSITORY}/${ALGORITHM_NAME}:latest"

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${REPOSITORY}

# If the repository doesn't exist in ECR, create it.

# Test the command inside the `if` so that `set -e` does not abort the script
# when the repository is missing.
if ! aws ecr describe-repositories --repository-names "${ALGORITHM_NAME}" --region ${REGION} > /dev/null 2>&1
then
    aws ecr create-repository --repository-name "${ALGORITHM_NAME}" --region ${REGION} > /dev/null
fi

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${ALGORI

In [3]:
from sagemaker import get_execution_role

role = get_execution_role()

print(role)

Couldn't call 'get_role' to get Role ARN from role name AWSOS-AD-Engineer to get Role path.


arn:aws:iam::843263297212:role/AWSOS-AD-Engineer


We use synthetic train and test datasets, generated to mimic the real Ali-CCP (Alibaba Click and Conversion Prediction) dataset, to build our recommender system ranking models.

If you would like to use the real Ali-CCP dataset instead, you can download the training and test datasets from tianchi.aliyun.com. You can then use the `get_aliccp()` function to curate the raw CSV files and save them as Parquet files.


```python
import os

from merlin.datasets.synthetic import generate_data

DATA_FOLDER = os.environ.get("DATA_FOLDER", "/workspace/data/aliccp-raw-synthetic")
NUM_ROWS = int(os.environ.get("NUM_ROWS", 1000000))
# Parse the flag explicitly instead of calling eval() on an environment variable
SYNTHETIC_DATA = os.environ.get("SYNTHETIC_DATA", "True").lower() in ("true", "1")
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", 512))

if SYNTHETIC_DATA:
    train, valid = generate_data("aliccp-raw", NUM_ROWS, set_sizes=(0.7, 0.3))
    # save the datasets as parquet files
    train.to_ddf().to_parquet(os.path.join(DATA_FOLDER, "train"))
    valid.to_ddf().to_parquet(os.path.join(DATA_FOLDER, "valid"))
```

In [4]:
DATA_DIRECTORY = "/workspace/data/aliccp-raw-synthetic/"

In [5]:
! ls {DATA_DIRECTORY}

train  valid


In [6]:
# S3 prefix
prefix = "DEMO-merlin-tensorflow-aliccp"

In [7]:
import sagemaker as sage

sess = sage.Session()

In [8]:
data_location = sess.upload_data(DATA_DIRECTORY, key_prefix=prefix)

In [9]:
print(data_location)

s3://sagemaker-us-east-1-843263297212/DEMO-merlin-tensorflow-aliccp


In [10]:
! python3 container/train.py \
    --train_dir=/workspace/data/aliccp-raw-synthetic/train/ \
    --valid_dir=/workspace/data/aliccp-raw-synthetic/valid/ \
    --model_dir=/tmp \
    --batch_size=512 \
    --epochs=2

Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3553, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 424, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInit'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3553, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 424, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3556, in cuda._cuda.ccuda._cuInit
RuntimeError: Function "cuInit" not found
Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInit'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3556, in cuda._cuda.ccuda._cuInit
RuntimeError: Function "cuInit" not found
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3556, in cuda._cuda.ccuda._cuInit
RuntimeErro
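The local run above passes its settings to `train.py` as command-line flags. The contents of `train.py` are not shown here, so the following is only a sketch of how such a script might parse them. The function name `parse_args` and the default values are hypothetical; the fallbacks assume the `SM_CHANNEL_TRAIN`, `SM_CHANNEL_VALID`, and `SM_MODEL_DIR` environment variables that the SageMaker training toolkit sets inside a training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the flags used in the local run; all defaults are assumptions."""
    parser = argparse.ArgumentParser()
    # Inside a SageMaker job, each input channel is exposed as SM_CHANNEL_<NAME>.
    parser.add_argument(
        "--train_dir",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--valid_dir",
        default=os.environ.get("SM_CHANNEL_VALID", "/opt/ml/input/data/valid"),
    )
    # Files written here are packaged into model.tar.gz when the job ends.
    parser.add_argument(
        "--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument("--batch_size", type=int, default=512)
    parser.add_argument("--epochs", type=int, default=2)
    return parser.parse_args(argv)


# Mirror the flags from the local invocation.
args = parse_args(["--model_dir=/tmp", "--batch_size=512", "--epochs=2"])
print(args.model_dir, args.batch_size, args.epochs)
```

To our understanding, the training toolkit forwards each entry of an estimator's `hyperparameters` to the script as a `--name value` flag, so with a parser like this one the hyperparameter keys should match the flag names.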

In [11]:
import boto3

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]

my_session = boto3.session.Session()
region = my_session.region_name

algorithm_name = "sagemaker-merlin-tensorflow"

ecr_image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, algorithm_name)

print(ecr_image)

843263297212.dkr.ecr.us-east-1.amazonaws.com/sagemaker-merlin-tensorflow:latest
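The image URI printed above follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a small sketch, the helpers below build such a URI and split it back into its parts; the names `ecr_image_uri` and `split_image_uri` are hypothetical and not part of boto3 or the SageMaker SDK.

```python
def ecr_image_uri(account: str, region: str, repository: str, tag: str = "latest") -> str:
    """Compose an ECR image URI from its components."""
    if not region:
        # boto3's Session.region_name is None when no region is configured.
        raise ValueError("region must be set, e.g. 'us-east-1'")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def split_image_uri(uri: str) -> tuple:
    """Split an image URI into (registry, repository, tag)."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.partition(":")
    return registry, repository, tag
```

The region in the URI has to match the region that the build script pushed the image to, so it can be worth checking `region` before launching a training job.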


In [16]:
import os
from sagemaker.estimator import Estimator

instance_type = "ml.p3.2xlarge"  # GPU instance, V100
# instance_type = "ml.g4dn.xlarge"

estimator = Estimator(
    role=role,
    instance_count=1,
    instance_type=instance_type,
    image_uri=ecr_image,
    hyperparameters={
        "batch_size": 1_024,
        "epochs": 10,  # same name as the --epochs flag used in the local run
    },
)

estimator.fit(
    {
        "train": f"{data_location}/train/",
        "valid": f"{data_location}/valid/",
    }
)

2022-10-15 04:25:17 Starting - Starting the training job...
2022-10-15 04:25:40 Starting - Insufficient capacity error from EC2 while launching instances, retrying!ProfilerReport-1665807916: InProgress
......
2022-10-15 04:26:42 Starting - Preparing the instances for training.........
2022-10-15 04:28:26 Downloading - Downloading input data
2022-10-15 04:28:26 Training - Downloading the training image.......................................
== Triton Inference Server Base ==
NVIDIA Release 22.08 (build 42766143)
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-containe

In [17]:
print(estimator.model_data)

s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-10-15-04-25-15-157/output/model.tar.gz


In [18]:
! aws s3 cp {estimator.model_data} /tmp/.

download: s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-10-15-04-25-15-157/output/model.tar.gz to ../../../../tmp/model.tar.gz


In [19]:
! tar -tf /tmp/model.tar.gz

ensemble/
ensemble/0_transformworkflow/
ensemble/0_transformworkflow/1/
ensemble/0_transformworkflow/1/model.py
ensemble/0_transformworkflow/1/workflow/
ensemble/0_transformworkflow/1/workflow/metadata.json
ensemble/0_transformworkflow/1/workflow/workflow.pkl
ensemble/0_transformworkflow/1/workflow/categories/
ensemble/0_transformworkflow/1/workflow/categories/unique.user_brands.parquet
ensemble/0_transformworkflow/1/workflow/categories/unique.user_shops.parquet
ensemble/0_transformworkflow/1/workflow/categories/unique.user_group.parquet
ensemble/0_transformworkflow/1/workflow/categories/unique.user_intentions.parquet
ensemble/0_transformworkflow/1/workflow/categories/unique.user_profile.parquet
ensemble/0_transformworkflow/1/workflow/categories/unique.user_geography.parquet
ensemble/0_transformworkflow/1/workflow/categories/unique.item_shop.parquet
ensemble/0_transformworkflow/1/workflow/categories/unique.user_is_occupied.parquet
ensemble/0_transformworkflow/1/workflow/categories/uniq
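To inspect the archive from Python rather than the shell, a minimal sketch using the standard-library `tarfile` module could look like the following. The helper name `list_ensemble_models` is hypothetical; it assumes, as in the listing above, that each model lives in its own directory directly under `ensemble/`.

```python
import tarfile


def list_ensemble_models(tar: tarfile.TarFile, root: str = "ensemble") -> list:
    """Return the sorted names of the directories directly under `root`."""
    models = set()
    for member in tar.getmembers():
        parts = member.name.strip("/").split("/")
        if len(parts) < 2 or parts[0] != root:
            continue
        # Count an entry only if it is a directory or contains deeper paths,
        # so that plain files placed directly under `root` are ignored.
        if len(parts) > 2 or member.isdir():
            models.add(parts[1])
    return sorted(models)
```

For the archive downloaded above, this could be used as `with tarfile.open("/tmp/model.tar.gz") as tar: print(list_ensemble_models(tar))`.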

In [20]:
import time

import boto3

# SageMaker API client used for the model and endpoint calls
sm_client = boto3.client("sagemaker")

container = {
    "Image": ecr_image,
    "ModelDataUrl": estimator.model_data,
    #"Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble_dali_inception"},
}

sm_model_name = "triton-merlin-tensorflow-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

model_arn = create_model_response["ModelArn"]

print(f"Model Arn: {model_arn}")

In [None]:
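# Illustrative name; any endpoint config name that is unique in the account works
endpoint_config_name = "triton-merlin-tensorflow-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm_client.create_endpoint_config(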
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

endpoint_config_arn = create_endpoint_config_response["EndpointConfigArn"]

print(f"Endpoint Config Arn: {endpoint_config_arn}")