# Solution Guidelines to the Banking Fraud Problem
Note that this guide is not exhaustive. Students can and should produce much more than what is shown here; the steps below are the bare minimum needed to train a model on this dataset.

Students may run out of memory on an ml.t3.medium instance. If that happens, upgrade the notebook instance to an ml.m5.large, which provides 8 GiB of RAM.

Choosing the fast startup notebook should improve startup times here.

Try to do the notebook upgrade at the beginning rather than the end, so students don't have to recreate all of their variables. Otherwise, have them write their DataFrames to disk before upgrading.
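If the upgrade has to happen mid-session, one way to keep intermediate work is to write each DataFrame to disk and read it back afterwards. The sketch below is only an illustration: the file name `df_checkpoint.csv` and the sample columns are placeholders, not part of the exercise.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a DataFrame built earlier in the notebook.
df = pd.DataFrame({"isFraud": [0, 1, 0], "amount": [10.5, 2500.0, 42.0]})

# Placeholder location; any path on the notebook instance's disk would do.
checkpoint_path = os.path.join(tempfile.gettempdir(), "df_checkpoint.csv")

# Before the upgrade: write the DataFrame to disk.
df.to_csv(checkpoint_path, index=False)

# After the upgrade (fresh kernel): read it back.
restored_df = pd.read_csv(checkpoint_path)
print(restored_df.shape)
```

CSV is used here because the rest of the notebook already works with CSV files; other formats supported by pandas would work as well.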

## Revision History

```
Rev  Date         By       Description
PA5  2020-11-01   akirmak  Working on SageM Studio. Dry Run review.
```

### Step: Upgrade SageMaker SDK

In [None]:
import sys
!{sys.executable} -m pip install sagemaker -U
!{sys.executable} -m pip install sagemaker-experiments

### Step: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import boto3
import re


import sagemaker
from sagemaker import get_execution_role
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig
from sagemaker.model_monitor import DataCaptureConfig, DatasetFormat, DefaultModelMonitor
from sagemaker.s3 import S3Uploader, S3Downloader
from sagemaker.serializers import CSVSerializer

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker


### Step: Get access to SDK via Boto Library, and get IAM Role

In [None]:
sess = boto3.Session()
sm = sess.client('sagemaker')
role = sagemaker.get_execution_role()

### Step: Create a Bucket in S3 that will hold your Training Data and Model Output

In [None]:
account_id = sess.client('sts', region_name=sess.region_name).get_caller_identity()["Account"]
bucket = 'sagemaker-studio-{}-{}'.format(sess.region_name, account_id)
prefix = 'archml-banking-writeup'

try:
    if sess.region_name == "us-east-1":
        sess.client('s3').create_bucket(Bucket=bucket)
    else:
        sess.client('s3').create_bucket(Bucket=bucket, 
                                        CreateBucketConfiguration={'LocationConstraint': sess.region_name})
except Exception as e:
    # Usually this means the bucket already exists; print the error so other failures stay visible.
    print("Bucket was not created (it may already exist): {}".format(e))


---
## Data


### Step: Load data into a DataFrame
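A minimal sketch of this step, assuming the dataset is a CSV file with an `isFraud` target column. The inline sample and every column other than `isFraud` are hypothetical; in the notebook, pass the path of the real dataset file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "step,type,amount,isFraud\n"
    "1,PAYMENT,9839.64,0\n"
    "1,TRANSFER,181.00,1\n"
    "2,CASH_OUT,229133.94,0\n"
)

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(sample_csv)
print(df.head())
```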

## Explore Data

### Step: Check Dataset shape with `df.shape`

### Step: Describe Dataset, and check statistical outlook using `df.describe()`

### Step: Check Target Variable class distribution with `df['isFraud'].value_counts()`
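The three exploration steps above can be tried together on a toy frame. This is a sketch on made-up data; only the column name `isFraud` comes from the exercise. Passing `normalize=True` to `value_counts` is one way to see the class imbalance as a share rather than a count.

```python
import pandas as pd

# Hypothetical toy data; only the column name `isFraud` comes from the exercise.
df = pd.DataFrame({
    "amount": [10.0, 250.0, 13.5, 9000.0, 42.0],
    "isFraud": [0, 0, 0, 1, 0],
})

print(df.shape)        # (rows, columns)
print(df.describe())   # count, mean, std, min, quartiles, max per numeric column

# Absolute and relative class counts of the target variable.
class_counts = df["isFraud"].value_counts()
class_share = df["isFraud"].value_counts(normalize=True)
print(class_counts)
print(class_share)
```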

### Step: Plot Class Distribution using a library such as Matplotlib, Seaborn etc.

```
%matplotlib inline
import matplotlib.pyplot as plt
df.hist(bins=15, figsize=(10,10))
plt.show()
```

## Feature Engineering


### Step: Identify Missing data with `df.isna().any()`
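As a sketch on made-up data, `df.isna().any()` answers whether a column has gaps at all, while `df.isna().sum()` shows how many, which helps when deciding between dropping and imputing. The columns other than `isFraud` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few deliberately missing values.
df = pd.DataFrame({
    "amount": [10.0, np.nan, 13.5, 9000.0],
    "type": ["PAYMENT", "TRANSFER", None, "CASH_OUT"],
    "isFraud": [0, 0, 1, 0],
})

# True for every column that contains at least one missing value.
has_missing = df.isna().any()

# Number of missing values per column.
missing_counts = df.isna().sum()
print(has_missing)
print(missing_counts)
```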

### Step: Drop Columns you think are unnecessary

```
drop_list = ['col1', 'col2', 'coln']

reduced_df = df.drop(drop_list, axis=1, inplace=False)
```

### Step: Identify Categorical data columns, and Perform things like one-hot-encoding

```
encoded_df = pd.get_dummies(data=df, columns=['col1'])
```
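A small worked example of one-hot encoding, on hypothetical data with a made-up categorical column `type`. Recent pandas releases return boolean dummy columns by default, so this sketch passes `dtype=int` to keep the later CSV export purely numeric; whether that is needed depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical toy data; `type` is an assumed categorical column.
df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "PAYMENT"],
    "amount": [10.0, 250.0, 13.5],
    "isFraud": [0, 1, 0],
})

# One column per category value; dtype=int writes 0/1 instead of True/False.
encoded_df = pd.get_dummies(data=df, columns=["type"], dtype=int)
print(encoded_df.columns.tolist())
```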

### IMPORTANT Step: Move target variable to the beginning (the built-in XGBoost algorithm expects the label in the first column of CSV input)

```
df = df[['target_feature', 'col1','col2']]
```
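With many columns, listing each one by hand is error-prone. One possible way to move the target to the front while keeping the remaining column order is sketched below; the feature column names are hypothetical.

```python
import pandas as pd

# Hypothetical encoded frame with the target in the last position.
df = pd.DataFrame({
    "amount": [10.0, 250.0],
    "type_PAYMENT": [1, 0],
    "isFraud": [0, 1],
})

target = "isFraud"
# Put the target first and keep every other column in its existing order.
ordered_columns = [target] + [c for c in df.columns if c != target]
df = df[ordered_columns]
print(df.columns.tolist())
```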

## Split Train, Validation and Test Dataset

In [None]:
train_data, validation_data, test_data = np.split(encoded_df.sample(frac=1, random_state=1729), [int(0.7 * len(encoded_df)), int(0.9 * len(encoded_df))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
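The two indices passed to `np.split` place the first 70% of the shuffled rows in the training set, the next 20% in the validation set, and the remaining 10% in the test set. The sketch below shows the same proportions with plain positional slicing on a hypothetical ten-row frame; it is an illustration, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the encoded dataset.
encoded_df = pd.DataFrame({"isFraud": [0, 1] * 5, "amount": range(10)})

# Shuffle once with a fixed seed so the split is reproducible.
shuffled = encoded_df.sample(frac=1, random_state=1729)

n = len(shuffled)
train_end = int(0.7 * n)       # first 70% of the shuffled rows
validation_end = int(0.9 * n)  # next 20%; the remaining 10% is the test set

train_data = shuffled.iloc[:train_end]
validation_data = shuffled.iloc[train_end:validation_end]
test_data = shuffled.iloc[validation_end:]
print(len(train_data), len(validation_data), len(test_data))
```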

### Step: Copy Test Dataset to another DataFrame that doesn't contain the target variable

In [None]:
test_data_reduced = test_data.drop("isFraud", axis=1, inplace=False)

test_data_reduced.to_csv('test.csv', header=False, index=False)


### Step: Upload Train & Validation Dataset to S3

In [None]:
# Return the URLs of the uploaded file, so they can be reviewed or used elsewhere
s3url = S3Uploader.upload('train.csv', 's3://{}/{}/{}'.format(bucket, prefix,'train'))
print(s3url)
s3url = S3Uploader.upload('validation.csv', 's3://{}/{}/{}'.format(bucket, prefix,'validation'))
print(s3url)

---
## Train ML Model using SageMaker SDK and with XGBoost Built-in Algorithm

We'll use XGBoost to train a class of models known as gradient boosted decision trees on the data that we just uploaded.

Because we're using the built-in XGBoost algorithm, we first need to look up the location of its container image.

In [None]:
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")
display(container)

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3.

In [None]:

s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

### Amazon SageMaker Experiments

Amazon SageMaker Experiments allows us to keep track of model training; organize related models together; and log model configuration, parameters, and metrics to reproduce and iterate on previous models and compare models. We'll create a single experiment to keep track of the different approaches we'll try to train the model.

Each approach or block of training code that we run will be an experiment trial. Later, we'll be able to compare different trials in Amazon SageMaker Studio.

Let's create the experiment.



In [None]:
sess = sagemaker.session.Session()

create_date = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
customer_churn_experiment = Experiment.create(experiment_name="archml-exp-banking-xgb-{}".format(create_date), 
                                              description="Using xgboost to predict banking writeup", 
                                              sagemaker_boto_client=boto3.client('sagemaker'))

#### Hyperparameters
Now we can specify our XGBoost hyperparameters.  Among them are these key hyperparameters:
- `max_depth` Controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  Typically, you need to explore some trade-offs in model performance between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` Controls sampling of the training data.  This hyperparameter can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` Controls the number of boosting rounds.  This value specifies the models that are subsequently trained using the residuals of previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` Controls how aggressive each round of boosting is.  Smaller values lead to more conservative boosting.
- `gamma` Controls how aggressively trees are grown.  Larger values lead to more conservative models.
- `min_child_weight` Also controls how aggressively trees are grown. Large values lead to a more conservative model.

For more information about these hyperparameters, see [XGBoost's hyperparameters GitHub page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).
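For orientation, the effect of `eta` and `gamma` can be read off the usual gradient boosting formulation. The notation below follows the XGBoost documentation and is not defined elsewhere in this notebook: $f_t$ is the tree added in round $t$, $T$ is its number of leaves, $w_j$ are its leaf weights, and $\lambda$ is the L2 regularization weight.

$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

A smaller $\eta$ shrinks the contribution of each new tree, and a larger $\gamma$ raises the penalty for every additional leaf, so a split must reduce the loss by more before it is kept.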

In [None]:
hyperparams = {"max_depth":5,
               "subsample":0.8,
               "num_round":30,
               "eta":0.2,
               "gamma":4,
               "min_child_weight":6,
               "silent":0,
               "objective":'binary:logistic'}

### Step: Model Training with XGBoost (on a Separate SageMaker Instance) 

For our first trial, we'll use the built-in XGBoost algorithm to train a model without supplying any additional code. This way, we can use XGBoost to train and deploy a model as we would with other Amazon SageMaker built-in algorithms.

We'll create a new `Trial` object and associate the trial with the experiment that we created earlier. To train the model, we'll create an estimator and specify a few parameters, like the type of training instances we'd like to use and how many, and where the artifacts of the trained model should be stored. 

We'll also associate the training job with the experiment trial that we just created when we call the `fit` method of the `estimator`.

In [None]:
trial = Trial.create(trial_name="algorithm-mode-trial-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime())), 
                     experiment_name=customer_churn_experiment.experiment_name,
                     sagemaker_boto_client=boto3.client('sagemaker'))

xgb = sagemaker.estimator.Estimator(image_uri=container,
                                    role=role,
                                    hyperparameters=hyperparams,
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    base_job_name="banking-writeup-",
                                    sagemaker_session=sess)

xgb.fit({'train': s3_input_train,
         'validation': s3_input_validation}, 
        experiment_config={
            "ExperimentName": customer_churn_experiment.experiment_name, 
            "TrialName": trial.trial_name,
            "TrialComponentDisplayName": "Training",
        }
       )

#### Review the results

After the training job completes successfully, you can view metrics, logs, and graphs related to the trial on the **Experiments** tab in Amazon SageMaker Studio. 



---
## Deploy the Model (Host the model)

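Now that we've trained the model, let's deploy it to a hosted endpoint. To monitor a hosted model that is serving requests, you can also capture the data sent to the endpoint by passing a `DataCaptureConfig` to `deploy`. The cell below only prepares an S3 prefix for captured data and an endpoint name; the `deploy` call that follows does not enable data capture.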

In [None]:
data_capture_prefix = '{}/datacapture'.format(prefix)

endpoint_name = "archml-banking-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName = {}".format(endpoint_name))

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1, 
                           instance_type="ml.m4.xlarge",
                           endpoint_name=endpoint_name,
                           serializer=CSVSerializer()
)

### Invoke the deployed model

Now that we have a hosted endpoint running, we can make real-time predictions from our model by making an HTTP POST request.  Because the endpoint was deployed with a `CSVSerializer`, the `test_data` NumPy arrays we pass to the predictor are converted to CSV before they reach the model behind the endpoint.

## Evaluate Training Results by Plotting a Confusion Matrix

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [None]:
test_data.head(5)

In [None]:
def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = ",".join([predictions, xgb_predictor.predict(array).decode("utf-8")])

    return np.fromstring(predictions[1:], sep=",")


predictions = predict(test_data.to_numpy()[:, 1:])

In [None]:
predictions.shape

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_data.to_numpy()[:,0], predictions >= 0.50))

There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're predicting whether a case is fraud (`1`) or not (`0`), which produces a simple confusion matrix.

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], 
            columns=np.round(predictions), 
            rownames=['actual'], 
            colnames=['predictions'])

_Note: due to randomized elements of the algorithm, your results may differ slightly._

Of the .. fraud cases, we've correctly predicted xx (true positives). We incorrectly predicted that .. cases would be fraud that turned out not to be (false positives).  There are also ... cases that turned out to be fraud but that we predicted would not be (false negatives).

An important point here is that, because of the `np.round()` function above, we are using a simple threshold (or cutoff) of 0.5.  Our predictions from `xgboost` come out as continuous values between 0 and 1, and we force them into the binary classes that we began with.  However, because a missed fraud is expected to cost the company more than a false alarm, we should consider lowering this cutoff.  That will almost certainly increase the number of false positives, but it can also be expected to increase the number of true positives and reduce the number of false negatives.
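The trade-off described above can be seen on a handful of made-up scores. The sketch below uses plain Python and hypothetical labels and scores, not the model's output, to count the four outcomes at two cutoffs.

```python
# Hypothetical labels and model scores, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.2, 0.6, 0.35, 0.1, 0.05, 0.02]


def confusion_counts(actual, scores, cutoff):
    """Return (TP, FP, FN, TN) when scores above `cutoff` are flagged as fraud."""
    tp = fp = fn = tn = 0
    for label, score in zip(actual, scores):
        predicted = 1 if score > cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


at_half = confusion_counts(actual, scores, 0.5)
at_third = confusion_counts(actual, scores, 0.3)
print("cutoff 0.5 (TP, FP, FN, TN):", at_half)
print("cutoff 0.3 (TP, FP, FN, TN):", at_third)
```

On this toy data, lowering the cutoff from 0.5 to 0.3 gains one true positive and removes one false negative at the price of one extra false positive.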

To get a rough intuition here, let's look at the continuous values of our predictions.

In [None]:
plt.hist(predictions)
plt.show()

The continuous-valued predictions coming from our model tend to skew toward 0 or 1, but there is sufficient mass between 0.1 and 0.9 that adjusting the cutoff should indeed shift a number of predictions.  For example...


## Re-evaluate Training Results by changing Cutoff Threshold

We can see that changing the cutoff from 0.5 to 0.3 results in .. more true positives, ... more false positives, and ... fewer false negatives.  The numbers are small overall here, but that's ..% of cases overall that shift because of a change to the cutoff.  Was this the right decision?  We may end up catching ... extra fraud cases, but we also unnecessarily flag ... more cases that are not fraud.  Determining optimal cutoffs is a key step in properly applying machine learning in a real-world setting.  Let's discuss this more broadly and then apply a specific, hypothetical solution for our current problem.

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], 
            columns=np.where(predictions > 0.3, 1, 0),
            rownames=['actual'], 
            colnames=['predictions'])

In [None]:
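# Cost per outcome, rows = actual class, columns = predicted class:
# [[true negative, false positive], [false negative, true positive]]
cost_matrix = np.array([[0, 100], [500, 100]])

cutoffs = np.arange(0.01, 1, 0.01)
costs = []
for c in cutoffs:
    confusion = pd.crosstab(index=test_data.iloc[:, 0], 
                            columns=np.where(predictions > c, 1, 0))
    # Keep a full 2x2 table even if a cutoff leaves one class with no predictions
    confusion = confusion.reindex(index=[0, 1], columns=[0, 1], fill_value=0)
    costs.append(np.sum(cost_matrix * confusion.to_numpy()))

costs = np.array(costs)
plt.plot(cutoffs, costs)
plt.show()
print('Cost is minimized near a cutoff of:', 
      cutoffs[np.argmin(costs)], 
      'for a cost of:', 
      np.min(costs))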
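The cost weights `[[0, 100], [500, 100]]` used in the cutoff sweep read as follows, with rows for the actual class and columns for the predicted class: a true negative costs 0, a false positive 100, a false negative 500, and a true positive 100. The sketch below recomputes the same kind of sweep in plain Python on hypothetical labels and scores, so the arithmetic can be checked by hand; the data is made up and the result says nothing about the real model.

```python
# Hypothetical labels and scores; the cost values mirror the weights used in this notebook.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.2, 0.6, 0.35, 0.1, 0.05, 0.02]

COST_TN, COST_FP, COST_FN, COST_TP = 0, 100, 500, 100


def total_cost(actual, scores, cutoff):
    """Sum the per-case cost when scores above `cutoff` are flagged as fraud."""
    cost = 0
    for label, score in zip(actual, scores):
        flagged = score > cutoff
        if label == 1:
            cost += COST_TP if flagged else COST_FN
        else:
            cost += COST_FP if flagged else COST_TN
    return cost


# Same grid as np.arange(0.01, 1, 0.01), built without NumPy.
cutoffs = [round(0.01 * i, 2) for i in range(1, 100)]
costs = [total_cost(actual, scores, c) for c in cutoffs]
best_cutoff = cutoffs[costs.index(min(costs))]
print("Lowest cost", min(costs), "at cutoff", best_cutoff)
```

Because a missed fraud is weighted five times as heavily as a false alarm, the cheapest cutoff on this toy data sits well below 0.5.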

## Clean up

If you no longer need this notebook, clean up your environment by running the following cell. It removes the hosted endpoint that you created for this walkthrough and prevents you from incurring charges for running an instance that you no longer need. The cell deletes only the endpoint; experiments and trials created above are not removed by it.

You might also want to delete artifacts stored in the S3 bucket used in this notebook. To do so, open the Amazon S3 console, find the `sagemaker-studio-<region-name>-<account-id>` bucket, and delete the files associated with this notebook.

In [None]:
sess.delete_endpoint(xgb_predictor.endpoint_name)

## Optional - Appendix

#### Trying other hyperparameter values

To improve a model, you typically try other hyperparameter values to see if they affect the final validation error. Let's vary the `min_child_weight` parameter and start other training jobs with those different values to see how they affect the validation error. For each value, we'll create a separate trial so that we can compare the results in Amazon SageMaker Studio later.

In [None]:
# min_child_weights = [1, 10]

# # min_child_weights = [1, 2, 4, 8, 10]

# for weight in min_child_weights:
#     hyperparams["min_child_weight"] = weight
#     trial = Trial.create(trial_name="banking-writeup-algol-mode-{}-weight-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime()), weight), 
#                          experiment_name=customer_churn_experiment.experiment_name,
#                          sagemaker_boto_client=boto3.client('sagemaker'))

#     t_xgb = sagemaker.estimator.Estimator(image_uri=container,
#                                           role=role,
#                                           hyperparameters=hyperparams,
#                                           instance_count=1, 
#                                           instance_type='ml.m4.xlarge',
#                                           output_path='s3://{}/{}/output'.format(bucket, prefix),
#                                           base_job_name="banking-writeup-",
#                                           sagemaker_session=sess)

#     t_xgb.fit({'train': s3_input_train,
#                'validation': s3_input_validation},
#                 wait=False,
#                 experiment_config={
#                     "ExperimentName": customer_churn_experiment.experiment_name, 
#                     "TrialName": trial.trial_name,
#                     "TrialComponentDisplayName": "Training",
#                 }
#                )