## Restore Variables from Previous Notebook

This cell restores the variables that were stored in the previous notebook.

In [4]:
%store -r

---
## Train the Classifier on Amazon SageMaker

We will train a decision-tree-based classifier using SageMaker's built-in XGBoost algorithm.

## Amazon SageMaker
Amazon SageMaker is a fully managed machine learning service that automates the end-to-end ML process. With Amazon SageMaker, data scientists and developers can quickly and easily build and train machine learning models and directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers. It also provides common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment. With native support for bring-your-own-algorithms and frameworks, Amazon SageMaker offers flexible distributed training options that adjust to your specific workflows.



## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` call with the appropriate full IAM role ARN string(s).
- The location of the XGBoost algorithm container.

In [11]:
from sagemaker.amazon.amazon_estimator import get_image_uri
import boto3
import re
import sagemaker
import seaborn as sns

role = sagemaker.get_execution_role()

# The session manages interactions with the Amazon SageMaker APIs and any other AWS services needed,
# including the entities and resources that Amazon SageMaker uses, such as training jobs, endpoints, and input datasets in S3.
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'DEMO-xgboost-fraud-detection'

container = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
container

'683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:0.90-1-cpu-py3'

Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [12]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  Each round trains an additional model on the residuals of the previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  It is the step-size shrinkage (learning rate) applied to each new tree, so smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.
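
To make the roles of `num_round` and `eta` more concrete, here is a toy sketch of boosting on residuals with squared-error loss and one-split trees (stumps). It is not XGBoost's actual algorithm, which relies on gradient statistics and regularization. The names `fit_stump` and `boost` and the sine-wave data are assumptions made only for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residual (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the last value so the right side is never empty
        left_value = residual[x <= threshold].mean()
        right_value = residual[x > threshold].mean()
        fitted = np.where(x <= threshold, left_value, right_value)
        loss = np.sum((residual - fitted) ** 2)
        if best is None or loss < best[0]:
            best = (loss, threshold, left_value, right_value)
    return best[1], best[2], best[3]

def boost(x, y, num_round, eta):
    """Each round fits a stump to the current residuals and adds eta times its output."""
    prediction = np.zeros_like(y, dtype=float)
    losses = []
    for _ in range(num_round):
        residual = y - prediction
        threshold, left_value, right_value = fit_stump(x, residual)
        prediction += eta * np.where(x <= threshold, left_value, right_value)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return prediction, losses

# Toy data: one feature, a smooth target
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
_, losses = boost(x, y, num_round=10, eta=0.3)
```

In this sketch the training loss never increases from one round to the next, and a smaller `eta` takes a smaller step per round, which is why it usually needs a larger `num_round` to reach the same fit.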

The metric tracked in this notebook is `validation:error`, the binary classification error rate on the validation set. It is calculated as #(wrong cases)/#(all cases).
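
As a small sketch of that calculation, the helper below thresholds predicted probabilities at 0.5 and counts mismatches. The function name `error_rate` and the sample values are assumptions for illustration.

```python
import numpy as np

def error_rate(actual, predicted_probability, threshold=0.5):
    """Fraction of cases whose thresholded prediction differs from the true label."""
    actual = np.asarray(actual)
    predicted_label = (np.asarray(predicted_probability) > threshold).astype(int)
    return float(np.mean(predicted_label != actual))

# One of the four cases below is misclassified (0.4 is below the threshold but the label is 1)
rate = error_rate([0, 1, 1, 0], [0.1, 0.9, 0.4, 0.2])
```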


More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).

In [13]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(
                        max_depth=10,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

2019-10-20 09:29:54 Starting - Starting the training job...
2019-10-20 09:29:56 Starting - Launching requested ML instances......
2019-10-20 09:30:59 Starting - Preparing the instances for training......
2019-10-20 09:32:07 Downloading - Downloading input data...
2019-10-20 09:32:40 Training - Downloading the training image..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[09:33:02] 688x30 matrix with 20640 entries loaded from /opt/ml/input/data/tr

## Make a note of the best validation error after the training is done

This lets you compare the result with default hyperparameters against the best job once hyperparameter tuning is finished.

---
## SageMaker Deployment

Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint.
<img src="./images/deployment.png" width="200" height="200">


In [14]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

--------------------------------------------------------------------------------------------------------------!

### Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model simply by making an HTTP POST request.  But first, we'll need to set up serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint.

In [16]:
from sagemaker.predictor import csv_serializer
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None
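
As a rough illustration of what CSV serialization and deserialization amount to here, the sketch below turns a NumPy batch into a `text/csv` payload and parses a comma-separated response back into an array. It is not the SDK's `csv_serializer` implementation. The helper names `to_csv_payload` and `parse_csv_response` are hypothetical.

```python
import numpy as np

def to_csv_payload(batch):
    """Encode a 2-D array as CSV text: one row per line, values separated by commas."""
    return "\n".join(",".join(str(float(value)) for value in row) for row in batch)

def parse_csv_response(body):
    """Decode a bytes response of comma-separated scores into a float array."""
    return np.array(body.decode("utf-8").split(","), dtype=float)

payload = to_csv_payload(np.array([[1.5, 2.0], [3.0, 4.25]]))
scores = parse_csv_response(b"0.1,0.9")
```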

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [19]:
import numpy as np
def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

predictions = predict(test_data.values[:, 1:])
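
The batching arithmetic can be checked locally without calling the endpoint. The sketch below reuses the same `np.array_split` logic, with a hypothetical stand-in `fake_predict` in place of `xgb_predictor.predict`. The array shape and the constant score are assumptions for illustration.

```python
import numpy as np

def predict_in_batches(data, predict_fn, rows=500):
    """Split data into mini-batches, call predict_fn on each, and concatenate the results."""
    n_batches = int(data.shape[0] / float(rows) + 1)
    batches = np.array_split(data, n_batches)
    outputs = [predict_fn(batch) for batch in batches]
    return np.concatenate(outputs), [len(batch) for batch in batches]

def fake_predict(batch):
    # Hypothetical stand-in for the endpoint: returns one score per row
    return np.full(batch.shape[0], 0.5)

# 1200 rows with rows=500 gives int(1200 / 500 + 1) = 3 batches
scores, batch_sizes = predict_in_batches(np.zeros((1200, 30)), fake_predict)
```

Because the batch count is rounded up by adding one, every batch stays at or below `rows` rows, which keeps each request payload bounded.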

There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're predicting whether the credit card transaction is Fraud (`1`) or not (`0`), which produces a simple confusion matrix.

#### Print Confusion Matrix
<img src="./images/Confusion_matrix.png" width="200" height="200">


In [21]:
import pandas as pd
pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(predictions), rownames=['actual'], colnames=['predictions'])

| actual \ predictions | 0.0 | 1.0 |
|---|---|---|
| 0 | 52 | 0 |
| 1 | 3 | 44 |
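
The counts in this confusion matrix can be turned into summary metrics. The sketch below uses the four values from the output above (rows are actual labels, columns are predicted labels). The variable names are chosen for illustration.

```python
# Counts taken from the crosstab output: actual 0 -> (52, 0), actual 1 -> (3, 44)
true_negative, false_positive = 52, 0
false_negative, true_positive = 3, 44

total = true_negative + false_positive + false_negative + true_positive
accuracy = (true_positive + true_negative) / total
error = (false_positive + false_negative) / total
precision = true_positive / (true_positive + false_positive)  # flagged transactions that are fraud
recall = true_positive / (true_positive + false_negative)     # fraud transactions that were flagged
```

For fraud detection, recall is often the number to watch, since the 3 false negatives are fraudulent transactions the model missed.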


## Model Tuning:


<img src="./images/model_tuning.png" width="700" height="700">



### Tuning The Model - Hyperparameter Optimization (HPO)

![HPO](./images/gif.gif "HPO Experiment")

![HPO](./images/Optimized_Controller.gif "HPO Experiment")


**Source:** http://arxiv.org/abs/1509.01066 and https://www.youtube.com/watch?v=GiqNQdzc5TI


Hyperparameter tuning can be treated as a supervised machine learning regression problem. Given a set of input features (the hyperparameters), hyperparameter tuning optimizes a model for the metric that you choose. It makes guesses about which hyperparameter combinations are likely to get the best results, and runs training jobs to test these guesses.
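
The guess-and-test loop can be sketched with plain random search, a simpler strategy that is swapped in here purely for illustration and is not how the SageMaker tuner picks its guesses. The function `toy_validation_error` is a hypothetical stand-in for a training job, and its formula is invented for the example.

```python
import random

def toy_validation_error(eta, max_depth):
    # Hypothetical stand-in for a training job: lowest near eta=0.3, max_depth=6
    return (eta - 0.3) ** 2 + 0.01 * (max_depth - 6) ** 2

def random_search(max_jobs, seed=0):
    """Sample hyperparameters at random, evaluate each guess, and keep the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_jobs):
        params = {"eta": rng.uniform(0, 1), "max_depth": rng.randint(1, 10)}
        trials.append((toy_validation_error(**params), params))
    best = min(trials, key=lambda trial: trial[0])
    return best, trials

(best_error, best_params), trials = random_search(max_jobs=20)
```

A tuner that models the relationship between hyperparameters and the metric can use earlier results to choose later guesses, which is the regression view described above.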

In [None]:
# Disabled by default: remove the surrounding triple quotes to launch the tuning job.
'''
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                        'min_child_weight': ContinuousParameter(1, 10),
                        'alpha': ContinuousParameter(0, 2),
                        'max_depth': IntegerParameter(1, 10)}
objective_metric_name = 'validation:error'

tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=20,
                            objective_type='Minimize',
                            max_parallel_jobs=3)

tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False)
'''

Check the Hyperparameter Tuning Job Status

In [None]:
'''boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="credit-fraud-detection-HPO-Job")['HyperParameterTuningJobStatus']'''

### Making an Inference Request After Loading the Model

Get the model artifact from its S3 location, then unpack it and load it to use it for prediction.
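
A minimal sketch of the download and unpack steps follows, assuming the artifact is a gzip-compressed tar archive saved locally as `model.tar.gz` and that it contains a file named `xgboost-model`. The helper names and the `bucket` and `key` arguments are hypothetical. The S3 location of the trained model is typically available from the estimator's `model_data` attribute.

```python
import tarfile

def download_model_artifact(bucket, key, local_path="model.tar.gz"):
    """Download the model archive from S3 (requires AWS credentials)."""
    import boto3  # imported lazily so the extraction helper works without AWS access
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path

def extract_model_artifact(archive_path, target_dir="."):
    """Unpack the archive into target_dir and return the names of its members."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=target_dir)
    return names
```

After extraction, the unpacked model file can be loaded with the cell that follows.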

In [None]:
# Download the model artifacts before running the next piece of code

In [None]:
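import pickle
import numpy as np
import xgboost as xgb  # note: this rebinds `xgb`, which earlier referred to the SageMaker estimator

# A single transaction as a comma-separated string of feature values
transaction = "-1.009630,0.141192,0.167167,-0.808785,2.112167,-1.294934,0.592454,-0.049872,-0.284882,-1.296757,-1.010293,-0.272631,-0.139809,-0.918097,-0.475136,0.519497,0.158822,-0.120745,-0.519128,0.108956,-0.225473,-0.947079,0.054725,0.368866,-0.158482,0.070904,0.022035,0.177674,-0.279746,0.391123"
# Parse the string into a 1 x n_features float array
data = np.asarray(transaction.split(','), dtype=float).reshape((1, -1))
test_matrix = xgb.DMatrix(data)

# Load the pickled model extracted from the model artifact
filename = "./xgboost-model"
with open(filename, 'rb') as model_file:
    xgb_loaded = pickle.load(model_file)

predictions = xgb_loaded.predict(test_matrix)
predictions[0]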

## Interpreting the Machine Learning Model

In [None]:
import pickle as pkl

import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
from xgboost import plot_tree

filename = './xgboost-model'

# Use a large figure so the tree is readable
rcParams['figure.figsize'] = 50, 50

# Load the pickled model and plot a single tree (index 4)
with open(filename, 'rb') as model_file:
    model = pkl.load(model_file)
plot_tree(model, num_trees=4)
plt.show()


## Now, Let's Deploy The Model in Lambda and API Gateway

We will use AWS Cloud9, a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal. Cloud9 comes prepackaged with essential tools for popular programming languages, including JavaScript, Python, PHP, and more, so you don’t need to install files or configure your development machine to start new projects.


## Useful Resources:

- Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
- XGBoost Algorithm: https://xgboost.readthedocs.io/en/latest/
- Oversampling vs Undersampling: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
- Why Correlation Matters: https://towardsdatascience.com/why-feature-correlation-matters-a-lot-847e8ba439c4
- Correlation Matrix: https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_matrices
- Hyperparameter Optimization: https://aws.amazon.com/blogs/aws/sagemaker-automatic-model-tuning/


In [28]:
%store prefix
%store bucket
%store s3_input_train
%store s3_input_validation

Stored 'prefix' (str)
Stored 'bucket' (str)
Stored 's3_input_train' (s3_input)
Stored 's3_input_validation' (s3_input)
