First, let's create our SageMaker session and role, and an S3 prefix to use for the notebook example.

In [36]:
# S3 prefix
prefix = 'ems_call_volume'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()


## Upload the data for training <a class="anchor" id="upload_data"></a>

I ran the following Elasticsearch query against NFORS to get the data. For now, the resulting dataframe is simply saved to a local CSV file, as shown below.

In [None]:
#Generating the dataframe from NFORS
import json

import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
s = Search(using=es, index='*-fire-incident-*')
response = s.source(['description.event_opened',
                     'weather.daily.precipIntensity',
                     'weather.daily.precipType',
                     'description.day_of_week',
                     'weather.daily.temperatureHigh',
                     'NFPA.type']).query('match', fire_department__firecares_id='93345')


#Performing the query and converting to pandas dataframe
df = pd.DataFrame((d.to_dict() for d in response.scan()))
json_struct = json.loads(df.to_json(orient="records"))
#Flattening nested fields into dotted column names
#(pd.json_normalize needs pandas >= 1.0; older versions expose it as pd.io.json.json_normalize)
df = pd.json_normalize(json_struct)
df.to_csv('./data/query_results.csv')
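
The normalization step is what turns nested Elasticsearch documents into flat columns with dotted names such as `weather.daily.precipType`, which the later cells rely on. Here is a minimal pure-Python sketch of that flattening. The record is made up for illustration (it is not NFORS data), and `flatten` is a hypothetical helper, not part of pandas.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative record only; the values are invented
example_hit = {
    "description": {"event_opened": "2019-01-15T08:30:00", "day_of_week": "Tuesday"},
    "weather": {"daily": {"precipType": "rain", "temperatureHigh": 41.2}},
    "NFPA": {"type": "EMS"},
}

flat_hit = flatten(example_hit)
print(sorted(flat_hit))
```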

Once we have the data locally, we can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [13]:
WORK_DIRECTORY = 'data'

train_input = sagemaker_session.upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY) )
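
To make it clearer what `train_input` holds, the sketch below reproduces the URI format that the SDK documents for `upload_data`: the prefix URI when a directory is uploaded, and the prefix plus file name when a single file is uploaded. The bucket name is a placeholder (the real default bucket depends on your account and region), and `expected_s3_uri` is a hypothetical helper that only builds strings; it does not talk to S3.

```python
import posixpath

prefix = "ems_call_volume"
WORK_DIRECTORY = "data"


def expected_s3_uri(bucket, key_prefix, filename=None):
    """Build the URI in the format documented for Session.upload_data."""
    key = key_prefix if filename is None else posixpath.join(key_prefix, filename)
    return f"s3://{bucket}/{key}"


key_prefix = "{}/{}".format(prefix, WORK_DIRECTORY)
# Placeholder bucket name, assumed for illustration only
bucket = "sagemaker-us-east-1-123456789012"

dir_uri = expected_s3_uri(bucket, key_prefix)                        # directory upload
file_uri = expected_s3_uri(bucket, key_prefix, "query_results.csv")  # single-file upload
print(dir_uri)
print(file_uri)
```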

## Create SageMaker Scikit Estimator <a class="anchor" id="create_sklearn_estimator"></a>

To run our Scikit-learn training script on SageMaker, we construct a `sagemaker.sklearn.estimator.SKLearn` estimator, which accepts several constructor arguments:

* __entry_point__: The path to the Python script SageMaker runs for training and prediction.
* __role__: The ARN of the IAM role that SageMaker assumes to run the training job.
* __train_instance_type__ *(optional)*: The type of SageMaker instances for training. __Note__: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.
* __hyperparameters__ *(optional)*: A dictionary passed to the train function as hyperparameters.

To see the code for the SKLearn Estimator, see here: https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/sklearn
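
The contents of `ems_call_prediction.py` are not shown in this notebook, so the following is only a hypothetical sketch of how such an entry point usually receives its inputs under the SageMaker Scikit-learn convention: hyperparameters arrive as command-line arguments, and the model and data locations come from the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables. The argument names other than `n_estimators` and the fallback defaults are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a training script would typically receive."""
    parser = argparse.ArgumentParser()
    # Each entry of the estimator's `hyperparameters` dict is passed as --name value
    parser.add_argument("--n_estimators", type=int, default=100)
    # Inside the training container these environment variables are set by SageMaker;
    # the fallbacks below only let the sketch run outside of it
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    return parser.parse_args(argv)


# Simulate what the container would pass for hyperparameters={'n_estimators': 1000}
args = parse_args(["--n_estimators", "1000"])
print(args.n_estimators, args.model_dir, args.train)
```

A real script would go on to read the CSV files under `args.train`, fit the model, and save it under `args.model_dir` so that it can be loaded again for hosting.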

In [14]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'ems_call_prediction.py'

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={'n_estimators': 1000})

## Train SKLearn Estimator on EMS data <a class="anchor" id="train_sklearn"></a>
Training is simple: call `fit` on the Estimator. This starts a SageMaker training job that downloads the data for us, invokes our scikit-learn code (in the provided script file), and saves any model artifacts that the script creates.

In [15]:
sklearn.fit({'train': train_input})

2019-09-05 20:21:12 Starting - Starting the training job...
2019-09-05 20:21:13 Starting - Launching requested ML instances...
2019-09-05 20:22:09 Starting - Preparing the instances for training......
2019-09-05 20:23:13 Downloading - Downloading input data
2019-09-05 20:23:13 Training - Downloading the training image...
2019-09-05 20:23:42 Uploading - Uploading generated training model
2019-09-05 20:23:42 Completed - Training job completed

2019-09-05 20:23:26,635 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-09-05 20:23:26,637 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-09-05 20:23:26,649 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-09-05 20:23:26,923 sagemaker-containers INFO     Module ems_call_prediction does not provide a setup.py.
Generating setup.py
2019-09-05 20:23:26,923 sagemaker-containers INFO     Generat

## Using the trained model to make inference requests <a class="anchor" id="inference"></a>

### Deploy the model <a class="anchor" id="deploy"></a>

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count and instance type.

In [None]:
predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

### Choose some data and use it for a prediction <a class="anchor" id="prediction_request"></a>

In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

In [43]:
import itertools
import pandas as pd
import numpy as np

df = pd.read_csv("data/query_results.csv")

#I know this is redundant, but I wanted to put as many data processing steps into the script

#Converting date
df['date'] = df['description.event_opened'].apply(lambda x: x[:10])
#Aggregation function
def myagg(x):

    #First need to group
    d = {
        'ems_calls': np.sum(x['NFPA.type']=='EMS'),
        'snow': 'snow' in x['weather.daily.precipType'].values,
        'rain': 'rain' in x['weather.daily.precipType'].values,
        'high_temp': np.mean(x['weather.daily.temperatureHigh'])
    }

    return pd.Series(d,index=d.keys())

#Day aggregation
features = df.groupby('date').apply(myagg).reset_index()
#Removing the outlier days
features = features[features['ems_calls']>10]

#Adding day of week
features = features.merge(df[['date','description.day_of_week']].drop_duplicates(), on='date')
#Renaming the day of week column to make it shorter
features = features.rename(columns={'description.day_of_week':'day'})
features['month'] = features.apply(lambda x: x['date'][5:7], axis=1)
#No longer need the date since we have all the information we need (day of week and month)
features = features.drop('date',axis=1)
#Using one hot encoding for categorical variables. Ask me if you want me to explain this further.
features = pd.get_dummies(features)

#Splitting the data into features (predictors) and labels (the quantity we want to predict)
labels = features['ems_calls']
features = features.drop('ems_calls',axis=1)


a = [50*i for i in range(3)]
b = [40+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]


test_X = features.loc[indices,:]
test_y = labels.loc[indices]
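
The index construction at the end of the cell above is compact, so here it is on its own. It picks rows 40 to 49 from each of three consecutive blocks of 50 rows, giving 30 test rows in total. Note that `.loc` selects by label, so this assumes `features` has row labels running at least up to 149.

```python
import itertools

a = [50 * i for i in range(3)]   # block offsets: 0, 50, 100
b = [40 + i for i in range(10)]  # positions inside each block: 40..49
indices = [i + j for i, j in itertools.product(a, b)]

# First rows of each block, then the tail of the last block
print(indices[:3], indices[10:13], indices[-3:])
# -> [40, 41, 42] [90, 91, 92] [147, 148, 149]
```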



Prediction is as easy as calling `predict` with the predictor we got back from `deploy` and the data we want predictions for. The endpoint returns a numerical prediction for each row, here the number of EMS calls for a day, which we can compare against the actual labels we split off earlier.

In [29]:
print(predictor.predict(test_X.values))
print(test_y.values)

NameError: name 'predictor' is not defined
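
Printing two arrays side by side gets hard to read beyond a handful of rows, so a single error figure can help. The sketch below computes the mean absolute error in plain Python. The numbers are invented for illustration and are not output from the endpoint; in the notebook you would pass `test_y.values` and the result of `predictor.predict(test_X.values)` instead, assuming the endpoint returns one number per row.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only
example_labels = [52, 47, 60, 55]
example_predictions = [50.0, 49.0, 57.0, 55.0]

mae = mean_absolute_error(example_labels, example_predictions)
print(mae)  # -> 1.75
```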

### Endpoint cleanup <a class="anchor" id="endpoint_cleanup"></a>

When you're done with the endpoint, you'll want to clean it up.

In [None]:
# Delete the endpoint through the predictor returned by deploy()
predictor.delete_endpoint()

## Batch Transform <a class="anchor" id="batch_transform"></a>
We can also use the trained model for asynchronous batch inference on S3 data using SageMaker Batch Transform.

In [None]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transformer = sklearn.transformer(instance_count=1, instance_type='ml.m4.xlarge')

### Prepare Input Data <a class="anchor" id="prepare_input_data"></a>
We will extract 10 random samples of 100 rows from the training data, split the features (X) from the labels (Y), and then upload the input data to a given location in S3.

In [None]:
%%bash
# Randomly sample the ems dataset 10 times, then split X and Y
mkdir -p batch_data/XY batch_data/X batch_data/Y
for i in {0..9}; do
    cat data/ems.csv | shuf -n 100 > batch_data/XY/ems_sample_${i}.csv
    cat batch_data/XY/ems_sample_${i}.csv | cut -d',' -f2- > batch_data/X/ems_sample_X_${i}.csv
    cat batch_data/XY/ems_sample_${i}.csv | cut -d',' -f1 > batch_data/Y/ems_sample_Y_${i}.csv
done
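
For readers less familiar with `shuf` and `cut`, here is a rough Python equivalent of one loop iteration above. Like the bash version, it assumes the label sits in the first CSV column and the features in the remaining ones. The CSV text is a tiny made-up stand-in for the real file, and `sample_and_split` is a hypothetical helper.

```python
import csv
import io
import random


def sample_and_split(rows, n, seed=None):
    """Draw n rows without replacement, then split label (col 0) from features."""
    rng = random.Random(seed)
    sample = rng.sample(rows, n)
    y = [row[0] for row in sample]   # like: cut -d',' -f1
    x = [row[1:] for row in sample]  # like: cut -d',' -f2-
    return x, y


# Made-up rows for illustration: label first, then three feature columns
csv_text = "12,0,1,41.5\n30,1,0,28.0\n25,0,0,55.1\n18,0,1,47.3\n"
rows = list(csv.reader(io.StringIO(csv_text)))

X, Y = sample_and_split(rows, n=2, seed=0)
print(X, Y)
```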

In [None]:
# Upload input data from local filesystem to S3
batch_input_s3 = sagemaker_session.upload_data('batch_data/X', key_prefix=prefix + '/batch_input')

### Run Transform Job <a class="anchor" id="run_transform_job"></a>
Using the Transformer, run a transform job on the S3 input data.

In [None]:
# Start a transform job and wait for it to finish
transformer.transform(batch_input_s3, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

### Check Output Data  <a class="anchor" id="check_output_data"></a>
After the transform job has completed, download the output data from S3. For each file "f" in the input data, we have a corresponding file "f.out" containing the predicted labels from each input row. We can compare the predicted labels to the true labels saved earlier.

In [None]:
# Download the output data from S3 to local filesystem
batch_output = transformer.output_path
!mkdir -p batch_data/output
!aws s3 cp --recursive $batch_output/ batch_data/output/
# Head to see what the batch output looks like
!head batch_data/output/*

In [None]:
%%bash
# For each sample file, compare the predicted labels from batch output to the true labels
for i in {0..9}; do
    diff -s batch_data/Y/ems_sample_Y_${i}.csv \
        <(cat batch_data/output/ems_sample_X_${i}.csv.out | sed 's/[["]//g' | sed 's/, \|]/\n/g') \
        | sed "s/\/dev\/fd\/63/batch_data\/output\/ems_sample_X_${i}.csv.out/"
done
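
The `sed` pipeline above strips brackets and quotes and puts one value per line, which suggests each `.out` file holds a single JSON array. Under that assumption, the same comparison can be sketched in Python. The file contents below are invented for illustration. Because an exact `diff` will rarely match when the predictions are real-valued, the sketch reports absolute errors rather than equality.

```python
import json


def parse_batch_output(text):
    """Parse one '.out' file, assumed to contain a single JSON array."""
    return [float(v) for v in json.loads(text)]


def parse_labels(text):
    """Parse a label file with one value per line."""
    return [float(line) for line in text.splitlines() if line.strip()]


# Made-up file contents for illustration
out_text = "[12.5, 30.0, 24.0]"
labels_text = "12\n30\n25\n"

predictions = parse_batch_output(out_text)
labels = parse_labels(labels_text)
abs_errors = [abs(p - y) for p, y in zip(predictions, labels)]
print(abs_errors)  # -> [0.5, 0.0, 1.0]
```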