# Bring Your Own Container (BYOC) SageMaker Inference
In this example we look at how to implement a BYOC approach. Use this approach when you want to bring your own serving stack instead of one of the available managed SageMaker/AWS images. Here we implement a Flask/Gunicorn-based serving stack, but you can bring any serving engine as long as it listens on the right port (8080) and accepts requests on the paths SageMaker AI expects (`/ping` for health checks and `/invocations` for inference).
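
To make the port-and-paths contract concrete, here is a minimal sketch using only the Python standard library. It is not the stack used in this example: `ContractHandler`, `start_server`, and `demo` are hypothetical names, the handler simply echoes its input instead of calling a model, and it binds to a free local port rather than 8080 so it can run anywhere.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ContractHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real serving stack."""

    def _send(self, status, body):
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # Health check path: answer GET /ping with HTTP 200 when healthy.
        self._send(200 if self.path == "/ping" else 404, "{}")

    def do_POST(self):
        # Inference path: requests arrive as POST /invocations.
        if self.path != "/invocations":
            self._send(404, "{}")
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # Echo the input back; a real container would call the model here.
        self._send(200, json.dumps({"output": body.get("input")}))

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the server in a background thread; port 0 picks a free port.
    Inside a container you would bind to 0.0.0.0 on port 8080 instead."""
    server = HTTPServer(("127.0.0.1", port), ContractHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(port, method, path, body=None):
    """Send one request and return (status, raw response body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, path, body=body,
                     headers={"Content-Type": "application/json"})
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def demo():
    server = start_server()
    ping_status, _ = call(server.server_port, "GET", "/ping")
    payload = json.dumps({"input": [[0.5]]}).encode("utf-8")
    _, raw = call(server.server_port, "POST", "/invocations", payload)
    server.shutdown()
    server.server_close()
    return ping_status, json.loads(raw)
```

Calling `demo()` starts the server, sends one health check and one invocation, and returns the ping status together with the decoded reply.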

In this example we train a sample SKLearn model locally and then serve it using the BYOC approach. You can substitute your own model and whatever frameworks or packages you are using.

## Environment
This example runs on an ml.t3.medium SageMaker Classic Notebook Instance. If you run it elsewhere, make sure Docker is installed and running.

## Additional Resources
Official Docs: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html

## Local Model Training
Let's first train a sample SKLearn model and then set up our Dockerfile and serving stack to serve this model.

In [13]:
!pip install sagemaker boto3 numpy scikit-learn==1.7.1 --upgrade --quiet

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate dummy data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)  

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Save the trained model to a file
import joblib
model_filename = "model.joblib"
joblib.dump(model, model_filename)

In [15]:
serialized_model = joblib.load(model_filename)

In [None]:
# sample inference
payload = [[0.5]]
res = serialized_model.predict(payload).tolist()[0]
res
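
Because the dummy data follows y = 2x + 1 plus a little noise, the prediction for 0.5 should land close to 2.0. The sketch below illustrates why with a closed-form least-squares fit in plain Python. It is an illustration only: `fit_line` is a hypothetical helper, and it uses the standard library's `random` module, so the sampled values (and the fitted numbers) will not match the NumPy-generated data above exactly.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Same recipe as the notebook cell: 100 points, y = 2x + 1 + 0.1 * noise.
rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [2 * x + 1 + 0.1 * rng.gauss(0, 1) for x in xs]

slope, intercept = fit_line(xs, ys)
prediction = slope * 0.5 + intercept  # should be close to 2 * 0.5 + 1
print(slope, intercept, prediction)
```

A result far from 2.0 for the sample payload would suggest that the wrong model file was loaded or that the input was shaped incorrectly.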

## Constructs for BYOC
We need the following files:
- <b>Dockerfile</b>: installs the dependencies we need and points to our web server
- <b>predictor.py</b>: implements our Flask web server and holds your model's inference logic; adjust it depending on your model and serving logic
- <b>serve</b>, <b>nginx.conf</b> and <b>wsgi.py</b>: can be kept as is; adjust them only if you want to change your serving logic


We will create these files then build our image.

In [None]:
%%writefile Dockerfile
# Build an image that can serve inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM python:3.10-slim

LABEL maintainer="Amazon AI <sage-learner@amazon.com>"

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Here we install the Python packages. The --no-cache-dir flag keeps pip from
# leaving its install caches in the image, which would otherwise use a significant
# amount of space. A smaller image reduces start up time.
RUN pip --no-cache-dir install numpy pandas flask gunicorn boto3 joblib scikit-learn==1.7.1

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the serve program is found when the container is invoked.

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

# Set up the program in the image
COPY regressor /opt/program
WORKDIR /opt/program

In [None]:
%%writefile nginx.conf
worker_processes 1;
daemon off; # Prevent forking


pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;

events {
  # defaults
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /var/log/nginx/access.log combined;
  
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 5m;

    keepalive_timeout 5;
    proxy_read_timeout 1200s;

    location ~ ^/(ping|invocations) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_pass http://gunicorn;
    }

    location / {
      return 404 "{}";
    }
  }
}
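
The `location ~ ^/(ping|invocations)` block decides which requests reach Gunicorn; everything else gets the 404 from `location /`. As a rough illustration, the same pattern can be tried in Python's `re` module. `is_proxied` is a hypothetical helper, and this sketch only mirrors the regular expression, not the rest of nginx's request handling.

```python
import re

# Same pattern as in nginx.conf; nginx's "~" modifier is a case-sensitive regex match.
PROXIED = re.compile(r"^/(ping|invocations)")


def is_proxied(path):
    """Return True if nginx would hand this path to the gunicorn upstream."""
    return PROXIED.search(path) is not None


for path in ["/ping", "/invocations", "/", "/health", "/Ping", "/pingfoo"]:
    print(path, is_proxied(path))
```

Note that the pattern is anchored only at the start, so a path such as `/pingfoo` is also forwarded and it is then up to the Flask app to reject it. Adding `$` to the end of the pattern would restrict the match to the exact paths.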

In [None]:
%%writefile predictor.py
from flask import Flask
import flask
import os
import json
import logging
import joblib

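# Load the model from the directory this file lives in, independent of the working directory
model = joblib.load(os.path.join(os.path.dirname(os.path.abspath(__file__)), "model.joblib"))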

# The flask app for serving predictions
app = Flask(__name__)
@app.route('/ping', methods=['GET'])
def ping():
    # Check if the model was loaded correctly
    health = model is not None
    status = 200 if health else 404
    return flask.Response(response='\n', status=status, mimetype='application/json')


@app.route('/invocations', methods=['POST'])
def transformation():
    
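    # Process input; silent=True returns None instead of raising on a malformed body
    input_json = flask.request.get_json(silent=True)
    if not isinstance(input_json, dict) or "input" not in input_json:
        return flask.Response(json.dumps({"error": "missing 'input'"}), status=400, mimetype="application/json")
    resp = input_json['input']

    # Model inference; only the prediction for the first input row is returned
    output = model.predict(resp).tolist()[0]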

    # Transform predictions to JSON
    result = {
        'output': output
        }

    resultjson = json.dumps(result)
    return flask.Response(response=resultjson, status=200, mimetype='application/json')
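
The handler above only checks that an `input` key is present. If you want stricter validation before calling `model.predict`, something along these lines could be factored out and unit tested without Flask. This is a sketch under assumptions: `parse_input` and its error messages are hypothetical, and it assumes the payload is a list of numeric rows like `{"input": [[0.5]]}`.

```python
import json


def parse_input(raw_body):
    """Validate a request body and return (rows, error_message).

    Exactly one of the two returned values is None.
    """
    try:
        body = json.loads(raw_body)
    except (ValueError, TypeError):
        return None, "body is not valid JSON"

    if not isinstance(body, dict) or "input" not in body:
        return None, "missing 'input'"

    rows = body["input"]
    if not isinstance(rows, list) or not rows:
        return None, "'input' must be a non-empty list of rows"

    for row in rows:
        is_numeric_row = (
            isinstance(row, list)
            and len(row) > 0
            # bool is a subclass of int, so exclude it explicitly
            and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in row)
        )
        if not is_numeric_row:
            return None, "each row must be a list of numbers"

    return rows, None
```

In the Flask handler, the error message would map to a 400 response and the rows would be passed to `model.predict`.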

In [None]:
%%writefile serve
#!/usr/bin/env python

# This file implements the scoring service shell. You don't necessarily need to modify it for various
# algorithms. It starts nginx and gunicorn with the correct configurations and then simply waits until
# gunicorn exits.
#
# The flask server is specified to be the app object in wsgi.py
#
# We set the following parameters:
#
# Parameter                Environment Variable              Default Value
# ---------                --------------------              -------------
# number of workers        MODEL_SERVER_WORKERS              the number of CPU cores
# timeout                  MODEL_SERVER_TIMEOUT              60 seconds

import multiprocessing
import os
import signal
import subprocess
import sys

cpu_count = multiprocessing.cpu_count()

model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))

def sigterm_handler(nginx_pid, gunicorn_pid):
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass

    sys.exit(0)

def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))


    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(model_server_timeout),
                                 '-k', 'sync',
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '-w', str(model_server_workers),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')

# The main routine just invokes the start function.

if __name__ == '__main__':
    start_server()
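
To see how the two environment variables from the table in `serve` end up in the Gunicorn command line, the argument list can be built in isolation. This is a sketch, not a replacement for `serve`: `gunicorn_command` is a hypothetical helper, and it uses `os.cpu_count()` with a fallback of 1 where `serve` uses `multiprocessing.cpu_count()`.

```python
import os


def gunicorn_command(env=None):
    """Build the gunicorn argument list the way the serve script does."""
    env = os.environ if env is None else env
    # Defaults mirror the table in serve: one worker per CPU core, 60 second timeout.
    workers = int(env.get("MODEL_SERVER_WORKERS", os.cpu_count() or 1))
    timeout = int(env.get("MODEL_SERVER_TIMEOUT", 60))
    return [
        "gunicorn",
        "--timeout", str(timeout),
        "-k", "sync",
        "-b", "unix:/tmp/gunicorn.sock",  # nginx proxies to this socket
        "-w", str(workers),
        "wsgi:app",
    ]


# Example: override both settings, as you might via container environment variables.
print(gunicorn_command({"MODEL_SERVER_WORKERS": "2", "MODEL_SERVER_TIMEOUT": "120"}))
```

On instances with many cores but little memory, lowering `MODEL_SERVER_WORKERS` may be worth considering, since every worker loads its own copy of the model.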

In [None]:
%%writefile wsgi.py
import predictor as myapp

# This is just a simple wrapper for gunicorn to find your app.
# If you want to change the algorithm file, simply change "predictor" above to the
# new file.

app = myapp.app

In [23]:
%%sh
mkdir -p container/regressor
mv Dockerfile container/
mv model.joblib nginx.conf predictor.py serve wsgi.py container/regressor/
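
Before building, it can help to confirm that every file ended up where the Dockerfile expects it. The following sketch checks the layout created above; `REQUIRED_FILES` and `missing_files` are hypothetical names, and the list simply restates the files moved in the previous cell.

```python
from pathlib import Path

# Paths relative to the build context directory ("container" in this example).
REQUIRED_FILES = [
    "Dockerfile",
    "regressor/model.joblib",
    "regressor/nginx.conf",
    "regressor/predictor.py",
    "regressor/serve",
    "regressor/wsgi.py",
]


def missing_files(root="container"):
    """Return the required files that are not present under root."""
    root = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    print("Layout OK" if not missing else "Missing: {}".format(missing))
```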

## Build Docker Image
Here we build our Docker image, which installs the necessary requirements and points to the web server we implemented, and then push it to ECR.

In [None]:
%%sh

# Name of algo -> ECR
algorithm_name=sm-pretrained-sklearn-byoc

cd container

#make serve executable
chmod +x regressor/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Region, defaults to us-east-1 if none is configured
region=$(aws configure get region)
region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get a login password from ECR and pipe it to docker login for the registry
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}
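
It is usually cheaper to catch serving bugs locally than on an endpoint. SageMaker starts an inference container with the `serve` argument, so one way to try the image locally is `docker run --rm -p 8080:8080 sm-pretrained-sklearn-byoc serve` in a separate terminal, followed by a small smoke test such as the sketch below. The helper names are hypothetical, and the host and port assume the container is published on localhost:8080.

```python
import http.client
import json


def _request(host, port, method, path, body=None):
    """Send one HTTP request and return (status, raw response body)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        headers = {"Content-Type": "application/json"} if body is not None else {}
        conn.request(method, path, body=body, headers=headers)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def smoke_test(host="127.0.0.1", port=8080, rows=None):
    """Hit /ping and /invocations the way SageMaker would and summarize the result."""
    rows = [[0.5]] if rows is None else rows
    ping_status, _ = _request(host, port, "GET", "/ping")
    payload = json.dumps({"input": rows}).encode("utf-8")
    status, raw = _request(host, port, "POST", "/invocations", payload)
    output = json.loads(raw)["output"] if status == 200 else None
    return {"ping": ping_status, "invocations": status, "output": output}


if __name__ == "__main__":
    # Requires the container to be running locally on port 8080.
    print(smoke_test())
```

If both status codes are 200 and the output is close to 2.0 for the sample row, the container is likely ready to be deployed.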

## SageMaker Real-Time Endpoint Constructs
- SageMaker Model: model data and container
- SageMaker Endpoint Configuration: hardware details and variants
- SageMaker Endpoint: REST endpoint to invoke

Full Video Explanation: https://www.youtube.com/watch?v=omFOOr4elnc

In [25]:
import boto3
from sagemaker import get_execution_role

sm_client = boto3.client(service_name='sagemaker')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime')

account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name

# Not really used in this use case; needed when model artifacts are stored in S3 (e.g. multi-model endpoints).
# S3 bucket names must be lowercase.
s3_bucket = 'regressor-sagemaker-byoc'

role = get_execution_role()

In [None]:
from time import gmtime, strftime

model_name = 'regressor-byoc-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
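# Model S3 URL; printed for reference only, since the model file is packaged inside the image
model_url = 's3://{}/regressor/'.format(s3_bucket)
# Replace the repository name with your own if you changed algorithm_name in the build script above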
container = '{}.dkr.ecr.{}.amazonaws.com/sm-pretrained-sklearn-byoc:latest'.format(account_id, region)
instance_type = 'ml.c5d.xlarge'

print('Model name: ' + model_name)
print('Model data Url: ' + model_url)
print('Container image: ' + container)

container = {
    'Image': container
}

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    Containers = [container])

print("Model Arn: " + create_model_response['ModelArn'])
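
The image URI passed to `create_model` has a fixed shape: registry host, repository name, and tag. A small helper can keep the registry host (used for `docker login`) and the full image URI (used for `docker push` and `create_model`) consistent. Both function names are hypothetical, the account ID in the example is a placeholder, and the host format assumes a standard commercial AWS region.

```python
def ecr_registry(account_id, region):
    """Registry host for an account and region; this is what docker login targets."""
    return "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region)


def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Full image URI, as used for docker push and for the 'Image' field of the model."""
    return "{}/{}:{}".format(ecr_registry(account_id, region), repository, tag)


# Placeholder account ID for illustration only.
print(ecr_image_uri("123456789012", "us-east-1", "sm-pretrained-sklearn-byoc"))
```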

In [None]:
endpoint_config_name = 'regressor-byoc-config-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])

print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

In [None]:
%%time

import time

endpoint_name = 'regressor-byoc-endpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)

print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)
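
The waiter hides the polling loop. If you would rather surface the failure reason as soon as an endpoint fails, a hand-rolled loop along these lines is one option. It is a sketch: `wait_for_endpoint` is a hypothetical helper, the polling interval and attempt count are arbitrary choices, and it takes the describe call as an argument so it can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=30, max_attempts=60,
                      sleep=time.sleep):
    """Poll describe(EndpointName=...) until the endpoint is InService.

    Raises RuntimeError if the endpoint reports Failed and TimeoutError if it
    does not become InService within max_attempts polls.
    """
    for _ in range(max_attempts):
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = resp.get("FailureReason", "unknown")
            raise RuntimeError("Endpoint {} failed: {}".format(endpoint_name, reason))
        sleep(poll_seconds)
    raise TimeoutError("Endpoint {} not in service after {} polls".format(
        endpoint_name, max_attempts))
```

With the client from this notebook, the call would look like `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. When an endpoint fails, the container logs in CloudWatch are usually the next place to look.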

## Invoke Endpoint
Boto3 `invoke_endpoint` reference: https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint

In [None]:
import json
content_type = "application/json"
request_body = {"input": [[0.5]]}

# Serialize data for endpoint
payload = json.dumps(request_body)

#Endpoint invocation
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload)

#Parse results
result = json.loads(response['Body'].read().decode())['output']
result

## Cleanup

In [None]:
sm_client.delete_endpoint(EndpointName = endpoint_name)
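
Deleting the endpoint stops the instance charges, but the endpoint configuration and the model created earlier remain in the account. A fuller cleanup could look like the sketch below; `delete_endpoint_resources` is a hypothetical helper that takes the SageMaker client as an argument and calls the client's `delete_endpoint`, `delete_endpoint_config`, and `delete_model` operations in that order.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

With the variables from this notebook, the call would be `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)`. The ECR repository and the image pushed to it are not touched by this and would need to be removed separately if you no longer need them.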