## Download data
Download the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris), which is the data used to train the model in this demo.

In [1]:
import boto3
import pandas as pd
import numpy as np

s3 = boto3.client("s3")
s3.download_file("sagemaker-sample-files", "datasets/tabular/iris/iris.data", "iris.data")

df = pd.read_csv(
    "iris.data", header=None, names=["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]
)
df.head()

| | sepal_len | sepal_wid | petal_len | petal_wid | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


## Prepare data
Next, we prepare the data for training by first converting the labels from strings to integers. Then we split the data into a training set (80% of the data) and a test set (the remaining 20%) and save both as CSV files. The model in this demo is trained locally, so only the test data is uploaded to S3 later, where the batch transform job can read it.

In [2]:
# Convert the three classes from strings to integers in {0,1,2}
df["class_cat"] = df["class"].astype("category").cat.codes
categories_map = dict(enumerate(df["class"].astype("category").cat.categories))
print(categories_map)
df.head()

{0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}


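| | sepal_len | sepal_wid | petal_len | petal_wid | class | class_cat |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0 |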


In [3]:
# Split the data into 80-20 train-test split
num_samples = df.shape[0]
split = round(num_samples * 0.8)
train = df.iloc[:split, :]
test = df.iloc[split:, :]
print("{} train, {} test".format(split, num_samples - split))

120 train, 30 test
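
One caveat about the split above: `iris.data` is ordered by class (50 rows per species), so taking the first 80% of rows leaves only the last class in the test set. A minimal sketch of a shuffled split follows. The helper name `shuffled_split_indices` and the seed value are illustrative assumptions, not part of the original notebook, and the outputs shown later in this walkthrough come from the unshuffled split.

```python
import random


def shuffled_split_indices(num_samples, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx): shuffled row positions for each subset."""
    indices = list(range(num_samples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    split = round(num_samples * train_fraction)
    return indices[:split], indices[split:]


train_idx, test_idx = shuffled_split_indices(150)
print("{} train, {} test".format(len(train_idx), len(test_idx)))
```

With the DataFrame from the cells above, `df.iloc[train_idx]` and `df.iloc[test_idx]` would then select the two subsets.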


In [4]:
# Write train and test CSV files
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)

## Train a Random Forest model locally

**Note: Training a model directly in a SageMaker Jupyter notebook is NOT recommended.** It is done here only for demo purposes, to obtain a scikit-learn model that was not trained with SageMaker managed training.

In [5]:
!python ./train.py

Create a folder that holds the model artifact and the deployment code in the following structure:

/deployment

    |__model.joblib
    |__code
      |__inference.py
      |__requirements.txt
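
The contents of `inference.py` are not shown in this walkthrough. As a rough sketch, assuming the model was saved with joblib as `model.joblib` and that requests arrive as headerless CSV rows, the script could define the handler functions that the SageMaker scikit-learn serving container looks for (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). The parsing and output formatting below are assumptions for illustration, not the author's actual script.

```python
import csv
import io
import os


def model_fn(model_dir):
    """Load the model from the directory where model.tar.gz was extracted."""
    # Imported lazily so the CSV helpers below can be used without joblib.
    import joblib

    return joblib.load(os.path.join(model_dir, "model.joblib"))


def input_fn(request_body, request_content_type):
    """Parse a text/csv payload into a list of rows of floats."""
    if request_content_type != "text/csv":
        raise ValueError("Unsupported content type: {}".format(request_content_type))
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    reader = csv.reader(io.StringIO(request_body))
    # Skip empty lines, such as a trailing newline at the end of the payload.
    return [[float(value) for value in row] for row in reader if row]


def predict_fn(input_data, model):
    """Run the loaded model on the parsed rows."""
    return model.predict(input_data)


def output_fn(prediction, accept):
    """Return one prediction per line (an assumed response layout)."""
    return "\n".join(str(p) for p in prediction)
```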

Compress all files in the ./deployment folder into a gzipped tar archive named model.tar.gz, and place it in the main directory.
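
The packaging step can be sketched with Python's standard `tarfile` module. The function name `package_model` and its default paths are assumptions for illustration. The entries are added at the root of the archive because SageMaker extracts model.tar.gz into `/opt/ml/model`, and the deployment code is expected under `/opt/ml/model/code`.

```python
import os
import tarfile


def package_model(deployment_dir="deployment", archive_path="model.tar.gz"):
    """Archive the contents of deployment_dir at the root of a .tar.gz file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(os.listdir(deployment_dir)):
            # arcname drops the leading "deployment/" so that model.joblib and
            # code/ sit at the top level of the archive. Directories are added
            # recursively.
            tar.add(os.path.join(deployment_dir, entry), arcname=entry)
    return archive_path
```

Calling `package_model()` from the main directory would produce `./model.tar.gz`, the file uploaded in the next step.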

## Upload model.tar.gz to an S3 bucket

In [21]:
!aws s3 cp ./model.tar.gz s3://customer-data-demo-sherryd-us-east-1/cardinality-sklearn-byom/  # change to your bucket name

upload: ./model.tar.gz to s3://customer-data-demo-sherryd-us-east-1/cardinality-sklearn-byom/model.tar.gz


## Create a SageMaker SKLearn model artifact using the boto3 SageMaker client’s create_model() API

In [22]:
import sagemaker
image = sagemaker.image_uris.retrieve(framework="sklearn", region=boto3.Session().region_name, version="0.23-1")
print(image)

model_data = "s3://customer-data-demo-sherryd-us-east-1/cardinality-sklearn-byom/model.tar.gz"
role = sagemaker.get_execution_role()

683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3


In [23]:
from time import gmtime, strftime
model_name = 'sklearn-random-forest-byom-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Note: this rebinds the name `sagemaker` from the SDK module (imported above) to the boto3 client
sagemaker = boto3.client("sagemaker")
primary_container = {
    "Image": image,
    "ModelDataUrl": model_data,
    "Environment": {
        "SAGEMAKER_PROGRAM": "inference.py",  # entry-point script; needed when the code folder holds more than one .py file
        "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",  # model.tar.gz is extracted to /opt/ml/model, so this is the code folder inside the archive
    },
}
create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)
print(create_model_response["ModelArn"])

arn:aws:sagemaker:us-east-1:240487350066:model/sklearn-random-forest-byom-2022-04-28-01-10-42


## Upload test data to S3

In [11]:
df_test = pd.read_csv("./test.csv", sep=",")
df_test.drop(["class", "class_cat"], axis=1).to_csv('test_s3.csv', index=False, header=False)
!aws s3 cp ./test_s3.csv s3://customer-data-demo-sherryd-us-east-1/cardinality-sklearn-byom/

upload: ./test_s3.csv to s3://customer-data-demo-sherryd-us-east-1/cardinality-sklearn-byom/test_s3.csv


## Create a batch transform job using the boto3 SageMaker client’s create_transform_job() API

In [24]:
batch_job_name = "sklearn-batch-transform-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = 's3://customer-data-demo-sherryd-us-east-1/cardinality-sklearn-byom/output'  # change to your output location in s3

batch_transform_response = sagemaker.create_transform_job(
    TransformJobName = batch_job_name,
    ModelName = model_name,
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://customer-data-demo-sherryd-us-east-1/cardinality-sklearn-byom/test_s3.csv',  # change to your testing data in s3
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'SplitType': 'Line'
    },
    TransformOutput={
        'S3OutputPath': output_location,  
        'Accept': 'text/csv',
        'AssembleWith': 'Line'
    },
    TransformResources={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1
    }
)

In [25]:
# Poll once a minute until the job is no longer in progress
import time
while True:
    response = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = response["TransformJobStatus"]
    print("Transform job status: " + status)
    if status != "InProgress":
        if status == "Failed":
            message = response["FailureReason"]
            print("Transform failed with the following error: {}".format(message))
            raise Exception("Transform job failed")
        break
    else:
        time.sleep(60)

Transform job status: InProgress
Transform job status: InProgress
Transform job status: InProgress
Transform job status: InProgress
Transform job status: InProgress
Transform job status: InProgress
Transform job status: InProgress
Transform job status: Completed
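
The loop above runs until the job leaves the `InProgress` state, with no upper bound on the wait. A variation with a timeout might look like the sketch below. The function name `wait_for_transform_job` and its parameters are assumptions for illustration; `describe` stands for any callable with the shape of the client's `describe_transform_job`. Unlike the loop above, this sketch also keeps waiting while the job is in the `Stopping` state.

```python
import time


def wait_for_transform_job(describe, job_name, poll_seconds=60, timeout_seconds=3600):
    """Poll describe(TransformJobName=...) until the job reaches a final state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe(TransformJobName=job_name)
        status = response["TransformJobStatus"]
        print("Transform job status: " + status)
        if status == "Failed":
            raise RuntimeError(
                "Transform job failed: {}".format(response.get("FailureReason"))
            )
        if status not in ("InProgress", "Stopping"):
            return status  # "Completed" or "Stopped"
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for {}".format(job_name))
        time.sleep(poll_seconds)
```

In this notebook it would be called as `wait_for_transform_job(sagemaker.describe_transform_job, batch_job_name)`.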


## Inspect the output of the Batch Transform job in S3

In [26]:
import re

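def get_csv_output_from_s3(s3uri, batch_file):
    # Batch transform names each output object after its input file, plus ".out"
    file_name = "{}.out".format(batch_file)
    # Split the full S3 URI into the bucket name and the object key
    match = re.match(r"s3://([^/]+)/(.*)", "{}/{}".format(s3uri, file_name))
    output_bucket, output_prefix = match.group(1), match.group(2)
    # Download the result next to the notebook and load it without a header row
    s3.download_file(output_bucket, output_prefix, file_name)
    return pd.read_csv(file_name, sep=",", header=None)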

output_df = get_csv_output_from_s3(output_location, 'test_s3.csv')
output_df.head(5)

| | 0 |
|---|---|
| 0 | 2.0 |
| 1 | 2.0 |
| 2 | 2.0 |
| 3 | 2.0 |
| 4 | 2.0 |
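
The predictions are the integer class codes created during data preparation, read back as floats. As a small sketch, they can be mapped back to species names with the `categories_map` printed earlier. The helper name `to_class_names` is an assumption for illustration. Every row predicting class 2 is consistent with the unshuffled split, since the last 30 rows of `iris.data` are all Iris-virginica.

```python
# Mapping printed in the data preparation step
categories_map = {0: "Iris-setosa", 1: "Iris-versicolor", 2: "Iris-virginica"}


def to_class_names(predictions, mapping):
    """Convert numeric class codes (possibly floats like 2.0) to label names."""
    return [mapping[int(p)] for p in predictions]


print(to_class_names([2.0, 2.0, 2.0, 2.0, 2.0], categories_map))
```

With the DataFrame above, `to_class_names(output_df[0], categories_map)` would label every prediction.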
