## Develop, Train, Optimize and Deploy a Scikit-Learn Random Forest to Predict FIFA Results

* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

In this notebook we show how to use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn based ML model (Random Forest). More info on Scikit-Learn can be found here: https://scikit-learn.org/stable/index.html. We use a small dataset of FIFA match results loaded from `fifa_dataset.csv`. Each row records the number of beers each player had (`beers_tomer`, `beers_zach`), the time of day (`tod`) and the `winner`.



In [1]:
import datetime
import time
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split


sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print("Using bucket " + bucket)

Using bucket sagemaker-us-east-1-600839245357


## Prepare data
We load the dataset from a CSV file, encode its categorical columns as numbers, split it, and send it to S3.

In [2]:
dataframe = pd.read_csv('fifa_dataset.csv')

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
dataframe.sample(5)

Unnamed: 0,beers_tomer,beers_zach,tod,winner
80,2,2,morning,tomer
53,4,1,morning,tomer
64,2,5,morning,tomer
3,1,4,mid,zach
88,5,5,morning,tomer


In [5]:
dataframe.winner.describe()

count       100
unique        2
top       tomer
freq         58
Name: winner, dtype: object

## Categorical to numeric

In [6]:
cleanup_nums = {"tod":     {"morning": 1, "mid": 2, "evening": 3},
                "winner":  {"tomer": 1, "zach": 2, }}
dataframe = dataframe.replace(cleanup_nums)
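One thing to watch with `replace` is that a category missing from the mapping is left untouched, so a stray string can slip through into the numeric arrays. The sketch below shows an alternative using `Series.map`, which turns unmapped values into `NaN` so they can be detected. The mini-frame is made up for illustration; only the column names and mappings come from the cell above.

```python
import pandas as pd

# Hypothetical mini-frame with the same categorical columns as the FIFA data
df = pd.DataFrame(
    {"tod": ["morning", "mid", "evening"], "winner": ["tomer", "zach", "tomer"]}
)

tod_map = {"morning": 1, "mid": 2, "evening": 3}
winner_map = {"tomer": 1, "zach": 2}

# map() returns NaN for any value that is absent from the mapping
encoded = df.assign(tod=df["tod"].map(tod_map), winner=df["winner"].map(winner_map))

# Fail fast if a category was not covered by the mapping
assert not encoded.isna().any().any(), "unmapped category found"
print(encoded)
```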

In [7]:
dataframe.sample(20)

Unnamed: 0,beers_tomer,beers_zach,tod,winner
76,2,4,2,2
20,4,2,3,1
53,4,1,1,1
23,1,2,1,2
92,1,5,2,1
98,6,4,1,2
80,2,2,1,1
63,4,1,1,2
75,6,5,1,2
9,1,5,3,1


In [8]:
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

(100, 3) (100,)


## Split into train and test data

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(67, 3) (33, 3) (67,) (33,)
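The 67/33 split follows from how `train_test_split` sizes the two sets when only a fractional `test_size` is given: the test share is rounded up and the remainder goes to the training set. A minimal sketch of that arithmetic, using only the standard library:

```python
import math

n_samples = 100
test_size = 0.33

# The test share is rounded up; the training set receives the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 67 33
```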


In [10]:
fratures = dataframe.columns.to_list()[:-1]
fratures

['beers_tomer', 'beers_zach', 'tod']

In [11]:
trainX = pd.DataFrame(X_train, columns=fratures)
trainX["target"] = y_train

testX = pd.DataFrame(X_test, columns=fratures)
testX["target"] = y_test

In [12]:
trainX.head()

Unnamed: 0,beers_tomer,beers_zach,tod,target
0,4,4,3,1
1,5,5,1,1
2,6,2,2,2
3,1,5,1,1
4,4,3,2,1


In [13]:
trainX.to_csv("fifa_train.csv")
testX.to_csv("fifa_test.csv")

In [14]:
# send data to S3. SageMaker will take training data from s3
trainpath = sess.upload_data(
    path="fifa_train.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)

testpath = sess.upload_data(
    path="fifa_test.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)

## Writing a *Script Mode* script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-prem, etc.). Detailed guidance: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [15]:
%%writefile script.py


import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


# inference function
# ---------------------
# Before a model can be served, it must be loaded. The SageMaker Scikit-learn model
# server loads your model by invoking a model_fn function that you must provide in
# your script. SageMaker passes in the directory where the model files and
# sub-directories saved during training have been mounted.
# model_fn should return a model object that can be used for model serving.
# More details: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#load-a-model

def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf



# See: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#serve-a-model

if __name__ == "__main__":

    print("extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="fifa_train.csv")
    parser.add_argument("--test-file", type=str, default="fifa_test.csv")
    parser.add_argument("--features", type=str)  # in this script we ask user to explicitly name features
    parser.add_argument("--target", type=str)  # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print("reading data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print("building training and testing datasets")
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    print("training model")
    model = RandomForestRegressor(
        n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1
    )

    model.fit(X_train, y_train)

    # print abs error
    print("validating model")
    abs_err = np.abs(model.predict(X_test) - y_test)

    # print couple perf metrics
    for q in [10, 50, 90]:
        print("AE-at-" + str(q) + "th-percentile: " + str(np.percentile(a=abs_err, q=q)))

    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print("model persisted at " + path)
    print(args.min_samples_leaf)


Writing script.py
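The script above takes hyperparameters as command-line flags and falls back to `SM_*` environment variables for directories. The sketch below isolates that pattern. The `parse_args` helper and the `"./"` fallback are assumptions added for illustration and are not part of `script.py`.

```python
import argparse
import os


def parse_args(argv):
    """Parse a subset of the script's flags (illustrative helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=10)
    # Falls back to "./" when the variable is unset (assumption for local runs)
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./"))
    parser.add_argument("--features", type=str, default="")
    # parse_known_args ignores flags this parser does not declare
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--n-estimators", "100", "--features", "beers_tomer beers_zach tod"])
print(args.n_estimators, args.features.split())
```

Because the feature list travels as a single space-separated string, `split()` turns it back into column names inside the script.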


## Local training
Script arguments allow us to keep SageMaker-specific configuration out of the script, so the same file also runs locally.

In [16]:
! python script.py --n-estimators 100 \
                   --min-samples-leaf 2 \
                   --model-dir ./ \
                   --train ./ \
                   --test ./ \
                   --features 'beers_tomer beers_zach tod' \
                   --target target

extracting arguments
reading data
building training and testing datasets
training model
validating model
AE-at-10th-percentile: 0.1548619047619046
AE-at-50th-percentile: 0.4226825396825402
AE-at-90th-percentile: 0.723268253968254
model persisted at ./model.joblib
2
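A quick way to check the inference path before deploying is to call a `model_fn` like the one in `script.py` on a directory that holds a persisted model. The sketch below trains a tiny forest on made-up rows (illustration only, not the FIFA data), saves it to a temporary directory and reloads it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def model_fn(model_dir):
    # Same loading logic as in script.py
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Made-up rows in the (beers_tomer, beers_zach, tod) layout
X = np.array([[1, 5, 1], [5, 1, 3], [2, 4, 2], [6, 2, 1]])
y = np.array([1, 2, 1, 2])

with tempfile.TemporaryDirectory() as model_dir:
    model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    reloaded = model_fn(model_dir)
    # The reloaded model should reproduce the original predictions
    same = np.allclose(model.predict(X), reloaded.predict(X))

print(same)
```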


## SageMaker Training

### Launching a training job with the Python SDK

In [17]:
LOCAL_MODE = True  # see: https://github.com/aws-samples/amazon-sagemaker-local-mode

DUMMY_IAM_ROLE = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'

In [18]:
instance_type="local" if LOCAL_MODE else 'ml.m4.xlarge' # "ml.c5.xlarge",
role=DUMMY_IAM_ROLE if LOCAL_MODE else get_execution_role()
    

In [19]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    entry_point="script.py",
    role=role,
    instance_count=1,
    instance_type=instance_type,
    framework_version=FRAMEWORK_VERSION,
    base_job_name="fifa-scikit",
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 10,
        "min-samples-leaf": 3,
        "features": 'beers_tomer beers_zach tod',
        "target": "target",
    },
)
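The `metric_definitions` entry relies on a regular expression to pull the metric value out of the training logs. The sketch below applies the same pattern to a log line printed by the local run, using Python's `re` module as a stand-in for whatever SageMaker does internally with the pattern.

```python
import re

# Same pattern as in metric_definitions
METRIC_REGEX = r"AE-at-50th-percentile: ([0-9.]+).*$"

# Log line copied from the local training output
log_line = "AE-at-50th-percentile: 0.4226825396825402"

match = re.search(METRIC_REGEX, log_line)
median_ae = float(match.group(1)) if match else None
print(median_ae)
```

Note that `[0-9.]+` only captures plain decimals. A value printed in scientific notation, such as `1e-05`, would be captured as `1`, so it may be worth formatting the metric explicitly if very small errors are expected.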

In [20]:
# launch training job and block until it completes (wait=True)
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)

Creating g6q26y7lds-algo-1-dfyh9 ... 
Creating g6q26y7lds-algo-1-dfyh9 ... done
Attaching to g6q26y7lds-algo-1-dfyh9
g6q26y7lds-algo-1-dfyh9 | 2021-06-21 08:35:33,153 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
g6q26y7lds-algo-1-dfyh9 | 2021-06-21 08:35:33,156 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
g6q26y7lds-algo-1-dfyh9 | 2021-06-21 08:35:33,167 sagemaker_sklearn_container.training INFO     Invoking user training script.
g6q26y7lds-algo-1-dfyh9 | 2021-06-21 08:35:33,348 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
g6q26y7lds-algo-1-dfyh9 | 2021-06-21 08:35:33,363 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
g6q26y7lds-algo-1-dfyh9 | 2021-06-21 08:35:33,376 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
g6q26y7lds-algo-1-dfyh9 |

# Bring your own custom model 
### Anatomy of an Amazon SageMaker container

<img src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2018/03/07/scikit-sagemaker-1.gif" width=1000 height=1000 />

In [None]:
#### More details at: https://aws.amazon.com/blogs/machine-learning/train-and-host-scikit-learn-models-in-amazon-sagemaker-by-building-a-scikit-docker-container/

### Optional - Launching a tuning job with the Python SDK

In [None]:
# we use the Hyperparameter Tuner
from sagemaker.tuner import IntegerParameter

# Define exploration boundaries
hyperparameter_ranges = {
    "n-estimators": IntegerParameter(20, 100),
    "min-samples-leaf": IntegerParameter(2, 6),
}

# create Optimizer
Optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator=sklearn_estimator,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name="FIFA-RF-tuner",
    objective_type="Minimize",
    objective_metric_name="median-AE",
    metric_definitions=[
        {"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}
    ],  # extract tracked metric from logs with regexp
    max_jobs=3,
    max_parallel_jobs=2,
)

In [None]:
Optimizer.fit({"train": trainpath, "test": testpath})

In [None]:
# get tuner results in a df
results = Optimizer.analytics().dataframe()
while results.empty:
    time.sleep(1)
    results = Optimizer.analytics().dataframe()
results.head()
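Since the objective is `Minimize`, the best trial is the row with the lowest `FinalObjectiveValue`. The sketch below shows that selection on a hand-made frame shaped like the tuner analytics output. The job names and metric values are hypothetical placeholders, not results from this notebook.

```python
import pandas as pd

# Hypothetical rows: job names and values are made up for illustration
results = pd.DataFrame(
    {
        "TrainingJobName": ["job-a", "job-b", "job-c"],
        "n-estimators": [20.0, 60.0, 100.0],
        "min-samples-leaf": [2.0, 4.0, 6.0],
        "FinalObjectiveValue": [0.45, 0.41, 0.43],
    }
)

# Lowest median absolute error first, because objective_type is "Minimize"
best = results.sort_values("FinalObjectiveValue", ascending=True).iloc[0]
print(best["TrainingJobName"], best["FinalObjectiveValue"])
```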

## Deploy to a real-time endpoint

In [21]:
predictor = sklearn_estimator.deploy(initial_instance_count = 1,
                                     instance_type          = instance_type, 
                                     endpoint_name          = 'FIFA-PREDICTOR',
                                     entry_point            = "script.py")

Attaching to axj68dtz4i-algo-1-u8jaj
axj68dtz4i-algo-1-u8jaj | 2021-06-21 08:47:56,897 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
axj68dtz4i-algo-1-u8jaj | 2021-06-21 08:47:56,899 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
axj68dtz4i-algo-1-u8jaj | 2021-06-21 08:47:56,901 INFO - sagemaker-containers - nginx config: 
axj68dtz4i-algo-1-u8jaj | worker_processes auto;
axj68dtz4i-algo-1-u8jaj | daemon off;
axj68dtz4i-algo-1-u8jaj | pid /tmp/nginx.pid;
axj68dtz4i-algo-1-u8jaj | error_log  /dev/stderr;
axj68dtz4i-algo-1-u8jaj | 
axj68dtz4i-algo-1-u8jaj | worker_rlimit_nofile 4096;
axj68dtz4i-algo-1-u8jaj | 
axj68dtz4i-algo-1-u8jaj | events {
axj68dtz4i-algo-1-u8jaj |   worker_connections 2048;
axj68dtz4i-algo-1-u8jaj | }
axj68dtz4i-algo-1-u8jaj | 
axj68dtz4i-algo-1-u8jaj | http {
axj68dtz

### Invoke with the Python SDK

In [22]:
testX.values[:, :-1]

array([[2, 2, 1],
       [6, 4, 1],
       [2, 2, 3],
       [4, 2, 1],
       [4, 1, 2],
       [4, 4, 1],
       [4, 6, 2],
       [5, 3, 3],
       [5, 6, 3],
       [1, 4, 1],
       [1, 5, 2],
       [1, 5, 3],
       [4, 3, 2],
       [2, 5, 3],
       [6, 3, 3],
       [3, 4, 2],
       [1, 6, 3],
       [4, 1, 2],
       [2, 4, 2],
       [6, 6, 3],
       [2, 4, 2],
       [2, 4, 3],
       [1, 6, 2],
       [4, 4, 2],
       [4, 1, 3],
       [5, 2, 3],
       [1, 4, 3],
       [2, 1, 2],
       [4, 5, 1],
       [1, 2, 2],
       [2, 4, 3],
       [5, 1, 2],
       [3, 1, 3]])

In [23]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[fratures]))

axj68dtz4i-algo-1-u8jaj | 2021-06-21 08:48:46,385 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
[1.71876623 1.52059524 1.45626623 1.52277778 1.52172619 1.58571429
 1.57079365 1.40616883 1.34271645 1.64880952 1.28555556 1.05333333
 1.47535714 1.07       1.40616883 1.69845238 1.05333333 1.52172619
 1.70845238 1.41354978 1.70845238 1.38702381 1.27603175 1.68809524
 1.19875    1.17857143 1.38035714 1.61704004 1.21678571 1.60876623
 1.38702381 1.62589286 1.46215909]
axj68dtz4i-algo-1-u8jaj | 172.18.0.1 - - [21/Jun/2021:08:48:47 +0000] "POST /invocations HTTP/1.1" 200 392 "-" "python-urllib3/1.26.4"


### Alternative: invoke with `boto3`

In [None]:
runtime = boto3.client("sagemaker-runtime")

#### Option 1: `csv` serialization

In [None]:
# csv serialization
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    Body=testX[fratures].to_csv(header=False, index=False).encode("utf-8"),
    ContentType="text/csv",
)

print(response["Body"].read())
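To see what the endpoint receives in the CSV case, the sketch below builds the request body for two rows without calling the endpoint. The feature rows are the first two from the test array shown earlier. The `target` values are placeholders added only to show that the target column is dropped before sending.

```python
import pandas as pd

features = ["beers_tomer", "beers_zach", "tod"]

# Two feature rows from the test array; the target values are placeholders
sample = pd.DataFrame([[2, 2, 1, 1], [6, 4, 1, 2]], columns=features + ["target"])

# No header and no index: one comma-separated line of features per row
payload = sample[features].to_csv(header=False, index=False).encode("utf-8")
print(payload.decode("utf-8"))
```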

#### Option 2: `npy` serialization

In [None]:
# npy serialization
from io import BytesIO


# Serialise numpy ndarray as bytes
buffer = BytesIO()
# testX is a DataFrame; send only the feature columns
np.save(buffer, testX[fratures].values)

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name, Body=buffer.getvalue(), ContentType="application/x-npy"
)

print(response["Body"].read())
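The `application/x-npy` body is simply the byte content of a `.npy` file. The round trip below, which does not touch the endpoint, shows that the bytes written by `np.save` can be decoded back into the same array with `np.load`.

```python
from io import BytesIO

import numpy as np

# Two feature rows from the test array shown earlier
rows = np.array([[2, 2, 1], [6, 4, 1]])

# Write the array in .npy format into an in-memory buffer
buffer = BytesIO()
np.save(buffer, rows)
payload = buffer.getvalue()

# Decode the same bytes, as a receiver of the payload could do
decoded = np.load(BytesIO(payload))
print(decoded.shape, decoded.dtype)
```

Unlike CSV, the `.npy` format carries the array's shape and dtype in its header, so no parsing conventions need to be agreed on separately.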

## Scaling our endpoint

In [None]:
import boto3

# Define a client for the Application Auto Scaling API
client = boto3.client('application-autoscaling')

In [None]:
resource_id='endpoint/' + predictor.endpoint_name + '/variant/' + 'AllTraffic' # This is the format in which application autoscaling references the endpoint
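The resource ID has a fixed layout, so it can help to build it in one place. The `variant_resource_id` helper below is a hypothetical convenience function, not part of boto3 or the SageMaker SDK. `AllTraffic` is the variant name used in the cell above.

```python
def variant_resource_id(endpoint_name, variant_name="AllTraffic"):
    """Build the ResourceId string for an endpoint variant (illustrative helper)."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


# Endpoint name taken from the deploy cell earlier in the notebook
print(variant_resource_id("FIFA-PREDICTOR"))  # endpoint/FIFA-PREDICTOR/variant/AllTraffic
```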


In [None]:
response = client.register_scalable_target(
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=3
)

In [None]:

# Target-tracking policy on the CPUUtilization metric
# Choose the metric the inference logic is most sensitive to (such as GPUUtilization, CPUUtilization, MemoryUtilization, or Invocations per instance)
response = client.put_scaling_policy(
    PolicyName='CPUUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 40.0,
        'CustomizedMetricSpecification':
        {
            'MetricName': 'CPUUtilization',  # TODO: explain why our model is sensitive to CPU (in addition to GPU)
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': predictor.endpoint_name},
                {'Name': 'VariantName','Value': 'AllTraffic'}
            ],
            'Statistic': 'Average', # Possible - 'Statistic': 'Average'|'Minimum'|'Maximum'|'SampleCount'|'Sum'
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 30,
        'ScaleOutCooldown': 1
    }
)

print(response)

## Don't forget to delete the endpoint!

In [None]:
sm_boto3.delete_endpoint(EndpointName=predictor.endpoint_name)