# SageMaker Walkthrough

In this notebook, we will go through the steps of training and deploying a model using AWS SageMaker.

We will be using the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) module throughout this notebook. 

We will build two different models for our predictions: one using scikit-learn, and one using the implementation of XGBoost available from SageMaker, including hyperparameter tuning.

## Step 1: Initial Setup

First, let's bring our data over. It is the familiar King County Housing dataset, which currently sits in an S3 bucket at *s3://nss-ds3/datasets/kc_house_data.csv*.

Let's use the *boto3* library to fetch our data. Rather than saving the file to disk, we'll wrap the downloaded bytes in an in-memory buffer and read them directly into a pandas dataframe.

In [1]:
import pandas as pd
import boto3
import io

In [2]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='nss-ds3', Key='datasets/kc_house_data.csv')
housing = pd.read_csv(io.BytesIO(obj['Body'].read()))

In [3]:
housing.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
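
The in-memory read used above can be tried without any AWS access. The following is a minimal sketch in which `raw_bytes` is a made-up stand-in for what `obj['Body'].read()` returns; the values are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for obj['Body'].read(): the raw bytes of a small CSV file.
raw_bytes = b"id,price,bedrooms\n7129300520,221900.0,3\n6414100192,538000.0,3\n"

# BytesIO wraps the bytes in a file-like object, which pd.read_csv can consume
# exactly as it would an open file on disk.
sample = pd.read_csv(io.BytesIO(raw_bytes))
print(sample)
```

Because nothing is written to disk, this approach suits notebook instances where the dataset is only needed in memory.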


Now that we have our data, we need to create a SageMaker session and get our execution role. We will need to pass these as arguments when we create our models later.

In [4]:
# S3 prefix
prefix = 'sagemaker_example'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

## Preparing Data for Modeling

We'll do some minor preprocessing of our data and then export the results to a csv. Notice that we will put the column we want to predict at the front, because this is what the XGBoost model will expect.

In [5]:
housing.zipcode = housing.zipcode.astype('category')

housing = pd.get_dummies(housing[['price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']])
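
The column ordering matters because SageMaker's built-in XGBoost algorithm reads CSV input with the label in the first column and no header row. Below is a small sketch of a generic helper for this; `target_first` and the toy dataframe are hypothetical names and values introduced only for illustration.

```python
import io

import pandas as pd


def target_first(df, target):
    """Return a new dataframe with the `target` column moved to position 0."""
    other_cols = [c for c in df.columns if c != target]
    return df[[target] + other_cols]


# Toy data (made up) with the label in the middle.
toy = pd.DataFrame({'bedrooms': [3, 2],
                    'price': [221900.0, 180000.0],
                    'sqft_living': [1180, 770]})
ordered = target_first(toy, 'price')

# Write without index or header, the layout the XGBoost container reads.
buffer = io.StringIO()
ordered.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```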

In [6]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(housing, test_size = 0.2)

In [7]:
import os

os.makedirs('./data', exist_ok=True)
train.to_csv('data/train.csv', index = False)

Now, we are going to upload our training data to the S3 bucket for our SageMaker session. The following cell takes the contents of the data folder (currently just train.csv) and copies them to the folder sm-housing/data in the session's default bucket.

In [8]:
prefix = 'sm-housing'

WORK_DIRECTORY = 'data'

train_input = sagemaker_session.upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY) )

Now, we are going to fit our scikit-learn model. For this to work, we need to upload the training script to our notebook instance, as the SKLearn class will be looking for it to use as an entry point.

In [17]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'housing_script_rf.py'

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session)
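
The contents of `housing_script_rf.py` are not shown in this notebook. As a rough sketch only, an entry-point script of this kind might contain the pieces below. It assumes a random forest regressor, a `train.csv` file with the label in the first column, and a model saved as `model.joblib`; the function names `parse_args` and `train_and_save` are hypothetical. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are the environment variables the SageMaker container sets for the model directory and the `train` channel.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def parse_args(argv=None):
    """Read the paths SageMaker exposes to the script via environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir',
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train',
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--n-estimators', type=int, default=100)
    return parser.parse_args(argv)


def train_and_save(train_dir, model_dir, n_estimators=100):
    """Fit a random forest on train.csv (label in column 0) and save it with joblib."""
    data = pd.read_csv(os.path.join(train_dir, 'train.csv'))
    X, y = data.iloc[:, 1:], data.iloc[:, 0]
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, 'model.joblib')
    joblib.dump(model, path)
    return path


def model_fn(model_dir):
    """Hook the scikit-learn container calls to load the model for inference."""
    return joblib.load(os.path.join(model_dir, 'model.joblib'))


# In a real script these would be wired together under an
# `if __name__ == '__main__':` guard, for example:
#     args = parse_args()
#     train_and_save(args.train, args.model_dir, args.n_estimators)
```

Everything written to the model directory is packaged by SageMaker into the `model.tar.gz` artifact in S3 once the job finishes.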

Now, we can fit our model, telling it where to look for the training data.

In [18]:
sklearn.fit({'train': train_input})

2020-03-14 00:36:05 Starting - Starting the training job...
2020-03-14 00:36:06 Starting - Launching requested ML instances......
2020-03-14 00:37:13 Starting - Preparing the instances for training...
2020-03-14 00:37:56 Downloading - Downloading input data...
2020-03-14 00:38:10 Training - Downloading the training image...
2020-03-14 00:38:51 Failed - Training job failed
..

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-03-14-00-36-04-771: Failed. Reason: ClientError: Cannot pull algorithm container. Either the image does not exist or its permissions are incorrect.

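Once the training job succeeds, we have two options. First, we can retrieve the actual model (which can also be downloaded to your local machine). Second, we can deploy the model. For our scikit-learn model, we will go with the first option. We'll look at the second one for the XGBoost model that we build next. Note that the output captured above comes from a run that failed because the training container image could not be pulled; the cells below assume a job that completed successfully.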

Let's see where the model output is stored:

In [19]:
boto3.client('sagemaker').describe_training_job(
    TrainingJobName=sklearn.latest_training_job.job_name)['ModelArtifacts']['S3ModelArtifacts']

KeyError: 'ModelArtifacts'

In [13]:
## Sub in what you see above and then uncomment the following lines
#s3.download_file(Bucket = 'sagemaker-us-east-1-339692866702', 
#                 Key = 'sagemaker-scikit-learn-2020-03-14-00-30-07-534/output/model.tar.gz',
#                 Filename = 'model.tar.gz')
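
Instead of copying the bucket and key by hand, the S3 URI returned by `describe_training_job` can be split programmatically. This is a small sketch using only the standard library; `split_s3_uri` is a hypothetical helper name, and the example URI is assembled from the bucket and key shown in the commented-out cell above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {!r}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


model_uri = ('s3://sagemaker-us-east-1-339692866702/'
             'sagemaker-scikit-learn-2020-03-14-00-30-07-534/output/model.tar.gz')
model_bucket, model_key = split_s3_uri(model_uri)
print(model_bucket)
print(model_key)
```

The two values can then be passed as the `Bucket` and `Key` arguments of `s3.download_file`.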

In [14]:
import tarfile

# The context manager closes the archive even if extraction fails.
with tarfile.open('model.tar.gz') as tar:
    tar.extractall()

In [15]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
import joblib

In [16]:
rf_model = joblib.load('model.joblib')



Let's see how it does on the test data.

In [20]:
from sklearn.metrics import mean_absolute_error

In [21]:
y_pred = rf_model.predict(test.iloc[:,1:])

Recall that the mean absolute error tells us how far off our predictions are, on average, from the true values.

In [22]:
mean_absolute_error(test.iloc[:,0], y_pred)

116571.19533835458
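
To make the metric concrete, here is the same calculation done by hand on three made-up house prices. The function name and the numbers are illustrative only and are not taken from the dataset.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError('inputs must be non-empty and of equal length')
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [200000, 350000, 500000]
predicted = [210000, 320000, 520000]

# Absolute errors are 10000, 30000 and 20000, so the mean is 60000 / 3.
mae = mean_absolute_error_by_hand(actual, predicted)
print(mae)
```

An MAE of about 116,571 on the test set therefore means the predicted prices miss the true sale prices by roughly that many dollars on average.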

## Step 2: Using SageMaker's XGBoost Algorithm

Now, we will see how to use SageMaker's built-in XGBoost algorithm.

We first need the location of the XGBoost container image. This is looked up with the get_image_uri function.

In [23]:
region = boto3.Session().region_name    
smclient = boto3.Session().client('sagemaker')

bucket = sagemaker.Session().default_bucket()   
prefix = 'housing_xgb'

In [24]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(region, 'xgboost', repo_version='0.90-1')

First, create an Estimator instance, pointing to the xgboost container.

In [25]:
xgb = sagemaker.estimator.Estimator(
    container,
    role, 
    train_instance_count=1, 
    train_instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sagemaker_session
)

Set the hyperparameters for our model. A list of available hyperparameters and what they represent is available here: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html

In [26]:
xgb.set_hyperparameters(
    num_round=100,
    rate_drop=0.3,
    alpha = 0.25)

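Move our training data to S3. The built-in XGBoost algorithm expects CSV input without a header row, so we write a headerless copy of the training data for it.

In [27]:
# Headerless copy for XGBoost; data/train.csv keeps its header.
train.to_csv('data/train_xgb.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('data/train_xgb.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')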

In [28]:
xgb.fit({'train': s3_input_train})

2020-03-14 00:43:29 Starting - Starting the training job...
2020-03-14 00:43:32 Starting - Launching requested ML instances......
2020-03-14 00:44:36 Starting - Preparing the instances for training......
2020-03-14 00:45:38 Downloading - Downloading input data...
2020-03-14 00:46:30 Training - Training image download completed. Training in progress...
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[00:46:33] 17291x85 matrix with 1469735 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
INFO:root:Single node training.
INFO:root:Train matrix has 17291 rows
[0] train-rmse:471337

Now let's deploy our model. This will create an endpoint holding our model.

In [29]:
predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

---------------!

Now that we have our model deployed, we can use it to make predictions.

Notice that it does take a little bit of work to ensure that we are sending our data to the model in the correct format.

In [30]:
import numpy as np
from sagemaker.predictor import csv_serializer

In [None]:
predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None

def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

In [31]:
y_pred = predict(test.values[:, 1:])

In [32]:
mean_absolute_error(test.iloc[:,0], y_pred)

71036.91989792968

Warning: if you are making a large number of predictions, you will have to send them in batches, as the predict function above does, because there is a cap on how much data you can pass to an endpoint in a single request.
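
The batching logic can be checked offline before pointing it at a real endpoint. The sketch below uses a hypothetical `FakePredictor` class that stands in for the deployed model and simply returns each row's sum. Like the function above, it assumes the endpoint replies with comma-separated predictions encoded as UTF-8 bytes, and that the input has at least one row.

```python
import numpy as np


class FakePredictor:
    """Hypothetical stand-in for the endpoint: returns row sums as CSV bytes."""

    def __init__(self):
        self.calls = 0

    def predict(self, array):
        self.calls += 1
        return ','.join(str(float(v)) for v in array.sum(axis=1)).encode('utf-8')


def predict_in_batches(predictor, data, rows=500):
    """Send `data` to the predictor in chunks of at most `rows` rows each."""
    n_batches = max(int(np.ceil(data.shape[0] / rows)), 1)
    chunks = np.array_split(data, n_batches)
    parts = [predictor.predict(chunk).decode('utf-8') for chunk in chunks]
    return np.array([float(x) for x in ','.join(parts).split(',')])


fake = FakePredictor()
features = np.arange(2400).reshape(1200, 2)
preds = predict_in_batches(fake, features, rows=500)
print(fake.calls, preds[:3])
```

With 1200 rows and a limit of 500 rows per request, the data is split into three chunks of 400 rows, so the stand-in predictor is called three times.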