# Machine Learning Engineer Nanodegree
## Capstone Project - Starbucks app data

In [1]:
import os
import numpy as np
import pandas as pd

In [72]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

b_session = boto3.session.Session(region_name='eu-central-1')

session = sagemaker.Session(boto_session=b_session)
role = 'AmazonSageMaker-ExecutionRole-20191105T072928'

bucket = session.default_bucket()

# Part 4 - Data preprocessing

Before training any machine learning model, the data has to go through a few preparation steps:
- impute missing values
- encode categorical variables (*gender*)
- normalize feature distributions

To do so, we use the standard **scikit-learn** API together with the **SKLearnProcessor** class of AWS SageMaker, which runs the preprocessing job in the cloud.
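The three steps listed above can be illustrated in plain Python. This is only a minimal sketch of the ideas: it is not the content of `lib/preprocessing.py`, which relies on scikit-learn, and the `ages` column and the gender categories used here are assumed example values.

```python
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def one_hot(values, categories):
    """Encode each categorical value as a 0/1 indicator vector."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical toy columns, for illustration only
ages = [20, None, 40, 60]
genders = ['F', 'M', 'O', 'M']

ages_imputed = impute_median(ages)
ages_scaled = standardize(ages_imputed)
gender_encoded = one_hot(genders, ['F', 'M', 'O'])
```

In a real pipeline the imputation value, the category list and the scaling statistics should be computed on the training split only and then reused on the validation and test splits.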

In [80]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

output_configs = {}

for tgt in ['bogo', 'discount', 'info']:
    print(tgt)

    sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                         role=role,
                                         instance_type='ml.m5.xlarge',
                                         instance_count=1)
    sklearn_processor.run(
        code='lib/preprocessing.py',  # entry point for processing
        # Build the S3 URI with an f-string: os.path.join is meant for local file paths
        inputs=[ProcessingInput(source=f's3://{bucket}/Capstone_Starbucks/{tgt}.csv',
                                destination='/opt/ml/processing/input')],
        outputs=[
            ProcessingOutput(source='/opt/ml/processing/output/',
                             output_name=f'{tgt}_data')
        ],
        arguments=['--target', tgt]
    )

    # Keep the output configuration of the job that just ran to locate its results on S3
    preprocessing_job_description = sklearn_processor.jobs[-1].describe()
    output_configs[tgt] = preprocessing_job_description['ProcessingOutputConfig']

bogo

Job Name:  sagemaker-scikit-learn-2020-01-12-18-15-35-216
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-eu-central-1-601949536922/Capstone_Starbucks/bogo.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-eu-central-1-601949536922/sagemaker-scikit-learn-2020-01-12-18-15-35-216/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'bogo_data', 'S3Output': {'S3Uri': 's3://sagemaker-eu-central-1-601949536922/sagemaker-scikit-learn-2020-01-12-18-15-35-216/output/bogo_data', 'LocalPath': '/opt/ml/processing/output/', 'S3UploadMode': 'EndOfJob'}}]
..................
Collecting joblib
  Downloading https://fil
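
The processing job passes `--target` to the entry-point script and mounts the input and output folders at fixed container paths. The skeleton below is a hedged sketch of how such a script could read that argument and derive its file paths. The actual `lib/preprocessing.py` is not shown in this notebook, so the helper names `parse_args` and `io_paths` are assumptions. The file names follow the ones copied from S3 later in the notebook.

```python
import argparse

# Container paths used by the processing job in the cell above
INPUT_DIR = '/opt/ml/processing/input'
OUTPUT_DIR = '/opt/ml/processing/output'

def parse_args(argv=None):
    """Parse the --target argument forwarded by SKLearnProcessor.run(arguments=...)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--target', required=True,
                        choices=['bogo', 'discount', 'info'])
    return parser.parse_args(argv)

def io_paths(target):
    """Return the input CSV path and the output file paths for one target."""
    input_path = f'{INPUT_DIR}/{target}.csv'
    names = ['train.csv', 'val.csv', 'test.csv', 'transformer.joblib']
    output_paths = {name: f'{OUTPUT_DIR}/{target}_{name}' for name in names}
    return input_path, output_paths

# Simulate the command line that the processing job would use for 'bogo'
args = parse_args(['--target', 'bogo'])
input_path, output_paths = io_paths(args.target)
```

Everything written under `/opt/ml/processing/output/` is uploaded to S3 when the job ends, which is why the later cells can find the files under the job's output URI.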

In [96]:
preprocessed_data = {}
for output in output_configs:
    preprocessed_data[output] = output_configs[output]['Outputs'][0]['S3Output']['S3Uri']

{'bogo': 's3://sagemaker-eu-central-1-601949536922/sagemaker-scikit-learn-2020-01-12-18-15-35-216/output/bogo_data',
 'discount': 's3://sagemaker-eu-central-1-601949536922/sagemaker-scikit-learn-2020-01-12-18-18-49-729/output/discount_data',
 'info': 's3://sagemaker-eu-central-1-601949536922/sagemaker-scikit-learn-2020-01-12-18-22-04-732/output/info_data'}

In [98]:
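s3_client = boto3.client('s3')
for k in preprocessed_data:
    # Drop the 's3://<bucket>/' part of the output URI to obtain the key prefix of the job output
    key_prefix = preprocessed_data[k][len('s3://'):].split('/', 1)[1]
    for f in ['train.csv', 'val.csv', 'test.csv', 'transformer.joblib']:
        copy_source = {'Bucket': bucket, 'Key': f'{key_prefix}/{k}_{f}'}
        # Copy each file to a stable location under Capstone_Starbucks/<target>/
        s3_client.copy_object(CopySource=copy_source, Bucket=bucket, Key=f'Capstone_Starbucks/{k}/{k}_{f}')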
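
The copy step needs the object key without the `s3://<bucket>/` prefix. As an alternative to slicing the string by hand, the URI can be split with the standard library, as sketched below. The helper `split_s3_uri` and the example URI are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into ('bucket', 'key/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError(f'not an S3 URI: {uri}')
    return parsed.netloc, parsed.path.lstrip('/')

# Hypothetical URI with the same shape as the job outputs listed above
bucket_name, key_prefix = split_s3_uri('s3://example-bucket/some-job/output/bogo_data')
source_key = f'{key_prefix}/bogo_train.csv'
```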

# Part 5 - Machine Learning models

## XGBoost model

In [70]:
prefix = 'bogo'

In [103]:
container = get_image_uri(session.boto_region_name, 'xgboost', '0.90-1')

xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/Capstone_Starbucks/{}/model'.format(bucket, prefix),
                                    sagemaker_session=session)

Before beginning the hyperparameter tuning, we should set default values for any model-specific hyperparameters that are not tuned. Many hyperparameters can be set when using the XGBoost algorithm; the cell below sets only a few of them. If you would like to change these values or set additional hyperparameters, you can find more information on the [XGBoost hyperparameter page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).

In [104]:
xgb.set_hyperparameters(objective='binary:logistic',
                        max_depth=4,
                        eta=0.1,
                        gamma=4,
                        min_child_weight=6,
                        colsample_bytree=0.5,
                        subsample=0.6,
                        early_stopping_rounds=10,
                        num_round=200,
                        seed=1123)
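
With `early_stopping_rounds=10`, training stops before reaching `num_round` once the validation metric has not improved for 10 consecutive rounds. The sketch below illustrates that rule in plain Python. It is not XGBoost's implementation, and the error values and the shorter patience of 3 are made up for the example.

```python
def early_stopping_round(validation_errors, patience):
    """Return (stop_round, best_round) for a sequence of validation errors.

    Training stops at the first round where the best error seen so far
    is `patience` rounds old.
    """
    best_error = float('inf')
    best_round = -1
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            best_round = round_index
        elif round_index - best_round >= patience:
            return round_index, best_round
    # No early stop: all rounds were used
    return len(validation_errors) - 1, best_round

# Hypothetical validation errors per boosting round
errors = [0.30, 0.25, 0.24, 0.26, 0.27, 0.25]
stop_round, best_round = early_stopping_round(errors, patience=3)
```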

Now that we have our estimator object completely set up, it is time to create the hyperparameter tuner. To do this we need to construct a new object which contains each of the parameters we want SageMaker to tune. In this case, we wish to find the best values for the `max_depth`, `eta`, `gamma`, `min_child_weight`, `colsample_bytree`, and `subsample` parameters. Note that for each parameter that we want SageMaker to tune we need to specify both the *type* of the parameter and the *range* of values that parameter may take on.
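
To make the role of the *type* and the *range* concrete, the sketch below draws candidate configurations at random from typed ranges. This is only an illustration: by default SageMaker uses a Bayesian search strategy and not random draws, and the `ranges` dictionary format and the `sample_configuration` helper are assumptions that are not part of the SageMaker API.

```python
import random

def sample_configuration(ranges, rng):
    """Draw one value per hyperparameter, honoring its type and bounds."""
    config = {}
    for name, (kind, low, high) in ranges.items():
        if kind == 'integer':
            config[name] = rng.randint(low, high)   # both bounds inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config

# Same bounds as the ranges passed to the tuner in this notebook
ranges = {
    'max_depth': ('integer', 2, 6),
    'eta': ('continuous', 0.01, 0.5),
    'gamma': ('continuous', 0, 10),
    'min_child_weight': ('integer', 2, 8),
    'colsample_bytree': ('continuous', 0.2, 1.0),
    'subsample': ('continuous', 0.3, 1.0),
}

rng = random.Random(1123)
candidates = [sample_configuration(ranges, rng) for _ in range(3)]
```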

In addition, we specify the *number* of models to construct (`max_jobs`) and the number of those that can be trained in parallel (`max_parallel_jobs`). In the cell below we have chosen to train `20` models, of which we ask that SageMaker train `4` at a time in parallel. Note that this results in a total of `20` training jobs being executed, which can take some time: in this case almost half an hour. With more complicated models this can take even longer, so be aware!

In [105]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

xgb_hyperparameter_tuner = HyperparameterTuner(estimator=xgb,
                                               objective_metric_name='validation:error',
                                               objective_type='Minimize',
                                               max_jobs=20,
                                               max_parallel_jobs=4,
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(2, 6),
                                                    'eta'      : ContinuousParameter(0.01, 0.5),
                                                    'gamma': ContinuousParameter(0, 10),
                                                    'min_child_weight': IntegerParameter(2, 8),
                                                    'colsample_bytree': ContinuousParameter(0.2, 1.0),
                                                    'subsample': ContinuousParameter(0.3, 1.0),
                                               })
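
The tuner minimizes `validation:error`, the binary classification error rate on the validation set: the share of examples whose predicted class is wrong, where a predicted probability above 0.5 counts as the positive class. A minimal sketch of that metric follows. The probabilities and labels are made-up illustration values.

```python
def classification_error(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    wrong = sum(1 for pred, y in zip(predictions, labels) if pred != y)
    return wrong / len(labels)

# Hypothetical model outputs and true labels
probabilities = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
error = classification_error(probabilities, labels)
```

Because the error rate ignores how the errors are distributed between classes, it can look good on imbalanced data even when the minority class is poorly predicted.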

Now that we have our hyperparameter tuner object completely set up, it is time to run it. To do this we make sure that SageMaker knows our input data is in CSV format and then execute the `fit` method.

In [106]:
# This is a wrapper around the location of our train and validation data, to make sure that SageMaker
# knows our data is in csv format.
s3_input_train = sagemaker.s3_input(s3_data=f's3://{bucket}/Capstone_Starbucks/{prefix}/{prefix}_train.csv', content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=f's3://{bucket}/Capstone_Starbucks/{prefix}/{prefix}_val.csv', content_type='csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
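
For CSV input, the built-in XGBoost algorithm expects files without a header row and with the target in the first column. The sketch below shows how rows could be serialized in that layout. The helper `to_xgboost_csv` and the feature names are hypothetical, and it is an assumption that the preprocessing script writes its files this way.

```python
import csv
import io

def to_xgboost_csv(rows, label_key, feature_keys):
    """Serialize rows as header-less CSV text with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for row in rows:
        writer.writerow([row[label_key]] + [row[k] for k in feature_keys])
    return buffer.getvalue()

# Hypothetical preprocessed rows
rows = [
    {'label': 1, 'age': 0.5, 'income': -1.2},
    {'label': 0, 'age': -0.3, 'income': 0.8},
]
csv_text = to_xgboost_csv(rows, 'label', ['age', 'income'])
```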

As in many of the examples we have seen so far, the `fit()` method takes care of setting up and fitting a number of different models, each with different hyperparameters. If we wish to wait for this process to finish, we can call the `wait()` method.

In [107]:
xgb_hyperparameter_tuner.wait()

..................................................................................................................................................................................................!


In [14]:
xgb_hyperparameter_tuner.best_training_job()

'sagemaker-xgboost-191110-1613-016-cbc69477'
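
Conceptually, `best_training_job()` returns the name of the job whose final objective value is the best one for the chosen `objective_type`. The sketch below mimics that selection on a plain list. The job names and metric values are invented for illustration and are not results from this tuning run.

```python
def best_job(results, objective_type='Minimize'):
    """Pick the job with the lowest (or highest) final objective value."""
    pick = min if objective_type == 'Minimize' else max
    return pick(results, key=lambda job: job['objective'])

# Hypothetical tuning results
results = [
    {'name': 'job-a', 'objective': 0.31},
    {'name': 'job-b', 'objective': 0.27},
    {'name': 'job-c', 'objective': 0.29},
]
best = best_job(results)
```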

In addition, since we'd like to set up a batch transform job to test the best model, we can construct a new estimator object from the results of the best training job. The `xgb_attached` object below can now be used as though we constructed an estimator with the best performing hyperparameters and then fit it to our training data.

In [108]:
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())

2020-01-12 18:49:24 Starting - Preparing the instances for training
2020-01-12 18:49:24 Downloading - Downloading input data
2020-01-12 18:49:24 Training - Training image download completed. Training in progress.
2020-01-12 18:49:24 Uploading - Uploading generated training model
2020-01-12 18:49:24 Completed - Training job completed
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter _tuning_objective_metric value validation:error to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter 