# Bank Fraud Prediction with XGBoost

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Compile](#Compile)
1. [Host](#Host)
   1. [Evaluate](#Evaluate)
   1. [Relative cost of errors](#Relative-cost-of-errors)
1. [Extensions](#Extensions)

---

## Background

_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/) on predicting customer churn._

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import boto3
import sagemaker
from sagemaker.predictor import csv_serializer

---
## Data

In [3]:
bucket = 'bank-fraud-detection-md'
prefix = 'sagemaker/DEMO-xgboost-churn'

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key='PS_20174392719_1491204439457_log.csv')

In [4]:
df = pd.read_csv(obj['Body'])
# keep=False drops every copy of a duplicated row, not just the repeats
df.drop_duplicates(keep=False, inplace=True)
#pd.set_option('display.max_columns', 500)
df.head()

|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


The `isFraud` attribute is the target attribute: the attribute that we want the ML model to predict. Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.

Before modeling, we drop the columns we won't use as features: `step`, the account identifiers `nameOrig` and `nameDest`, and `isFlaggedFraud`.

In [5]:
drop_list  = ['step', 'nameOrig','nameDest','isFlaggedFraud']
reduced_df = df.drop(drop_list, axis=1, inplace=False)

In [6]:
reduced_df.head()

|   | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|
| 0 | PAYMENT | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 | 0 |
| 1 | PAYMENT | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 | 0 |
| 2 | TRANSFER | 181.0 | 181.0 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | CASH_OUT | 181.0 | 181.0 | 0.0 | 21182.0 | 0.0 | 1 |
| 4 | PAYMENT | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 | 0 |
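
Since the target is binary, it can be worth checking how the two classes are balanced before choosing a model. The sketch below shows one way to do that with pandas; the `toy` frame and its values are made up for illustration and are not taken from the transaction data.

```python
import pandas as pd

# Made-up labels for illustration only (not from the dataset).
toy = pd.DataFrame({'isFraud': [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]})

# Count rows per class and compute the share of positive (fraud) rows.
counts = toy['isFraud'].value_counts()
fraud_rate = toy['isFraud'].mean()

print(counts)
print(f"fraud rate: {fraud_rate:.1%}")
```

On the real data, the same two lines would be applied to `reduced_df['isFraud']`.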


Now that we've cleaned up our dataset, let's determine which algorithm to use. In data like this, some variables may have a non-linear relationship with the target, for example where both high and low (but not intermediate) values are predictive of fraud. To accommodate this in a linear model such as logistic regression, we'd need to generate polynomial (or bucketed) terms. Instead, let's model this problem using gradient boosted trees. Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting and then host as a real-time prediction endpoint. Gradient boosted trees naturally account for non-linear relationships between features and the target variable, and they accommodate complex interactions between features.
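
To make the point about polynomial or bucketed terms concrete, here is a small sketch on synthetic data (an assumption for illustration, not the transaction dataset). The label is positive at both extremes of the feature, so the raw feature shows almost no linear association with it, while a squared term does.

```python
import numpy as np

# Synthetic feature and label, for illustration only.
x = np.linspace(-1.0, 1.0, 201)
y = (np.abs(x) > 0.5).astype(float)  # positive at both extremes, negative in the middle

# Linear association with the raw feature versus with a polynomial term.
corr_raw = np.corrcoef(x, y)[0, 1]
corr_squared = np.corrcoef(x ** 2, y)[0, 1]

print(f"raw feature:     {corr_raw:+.3f}")
print(f"squared feature: {corr_squared:+.3f}")
```

A decision tree can capture the same pattern with two splits on the raw feature, which is why tree-based models need less manual feature engineering in cases like this.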

Amazon SageMaker XGBoost can train on data in either a CSV or LibSVM format.  For this example, we'll stick with CSV.  It should:
- Have the target variable (the label) in the first column
- Not have a header row
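
The two requirements above can be sketched as a small helper. The function name `to_sagemaker_csv` and the `toy` rows are hypothetical and only serve the illustration; the helper moves the label to the first column and writes the frame without a header or index.

```python
import io
import pandas as pd

def to_sagemaker_csv(df, label):
    """Return CSV text with the label column first and no header or index."""
    ordered = pd.concat([df[label], df.drop(columns=[label])], axis=1)
    buffer = io.StringIO()
    ordered.to_csv(buffer, index=False, header=False)
    return buffer.getvalue()

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'amount': [181.0, 9839.64],
    'isFraud': [1, 0],
    'type_TRANSFER': [1, 0],
})

csv_text = to_sagemaker_csv(toy, 'isFraud')
print(csv_text)
```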

But first, let's convert our categorical features into numeric features.

In [12]:
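# dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to True/False)
encoded_df = pd.get_dummies(data=reduced_df, columns=['type'], dtype=int)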
encoded_df.head(5)

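|   | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 181.0 | 181.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 181.0 | 181.0 | 0.0 | 21182.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |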


In [13]:
label = 'isFraud'
ys = encoded_df[label]
xs = encoded_df.drop(label, axis=1, inplace=False)

In [14]:
xs.head()

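|   | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 181.0 | 181.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 181.0 | 181.0 | 0.0 | 21182.0 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |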


In [15]:
ys.head()

0    0
1    0
2    1
3    1
4    0
Name: isFraud, dtype: int64

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(xs,ys, test_size=0.33, random_state=42)

In [22]:
train_set = pd.concat([y_train,X_train], axis=1)

train_set.to_csv('training_data.csv',index=False, header=False)
!aws s3 cp training_data.csv s3://bank-fraud-detection-md/

upload: ./training_data.csv to s3://bank-fraud-detection-md/training_data.csv


In [23]:
test_set = pd.concat([y_test,X_test], axis=1)

test_set.to_csv('test_data.csv', index=False, header=False)
!aws s3 cp test_data.csv s3://bank-fraud-detection-md/

upload: ./test_data.csv to s3://bank-fraud-detection-md/test_data.csv


---
## Train

Moving on to training, we first need to specify the location of the XGBoost algorithm container and point to the training and validation files in S3.

In [29]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

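# Declare the content type explicitly; otherwise the built-in XGBoost container may assume LibSVM input.
s3_input_train      = sagemaker.s3_input(s3_data='s3://bank-fraud-detection-md/training_data.csv', content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://bank-fraud-detection-md/test_data.csv', content_type='csv')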


In [30]:
# Define IAM role
import boto3
from sagemaker import get_execution_role

# `container` was already resolved in the previous cell, so only the role is needed here.
role = get_execution_role()


With the `s3_input` pointers to the CSV files in S3 defined above, we can now create an estimator, set its hyperparameters, and start the training job.

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://bank-fraud-detection-md/',
                                    sagemaker_session=sess)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

# Block until training finishes (the default) so the model artifact exists before we deploy.
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [39]:
!aws sagemaker list-endpoints

{
    "Endpoints": []
}


In [42]:
#my_existing_endpoint_name = 'xgoost-2020-07-30'
#xgb_predictor = sagemaker.RealTimePredictor(endpoint=my_existing_endpoint_name)

In [44]:
#xgb_predictor.content_type = 'text/csv'
#xgb_predictor.serializer = csv_serializer

In [46]:
#import numpy as np

#def predict(data, rows=500):
#    split_array = np.array_split(data, int(data.shape[0]/float(rows) + 1))
#    predictions = ''
#    for array in split_array:
#        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])
#
#    return np.fromstring(predictions[1:], sep=',')

#predictions = predict(test_set.drop('isFraud', axis=1).to_numpy())
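
The batching logic in the commented-out `predict` function can be tried without a live endpoint. In the sketch below, `fake_predict` is a hypothetical stand-in for `xgb_predictor.predict` that returns comma-separated scores as bytes, and `predict_in_batches` is an illustrative name; the data is made up.

```python
import numpy as np

def fake_predict(batch):
    """Hypothetical stand-in for xgb_predictor.predict: one score per row, as CSV bytes."""
    scores = (batch[:, 0] > 1000).astype(float)
    return ','.join(str(s) for s in scores).encode('utf-8')

def predict_in_batches(data, predict_fn, rows=500):
    """Send `data` to `predict_fn` in chunks of roughly `rows` rows and collect the scores."""
    n_batches = int(data.shape[0] / float(rows) + 1)
    chunks = []
    for batch in np.array_split(data, n_batches):
        if batch.shape[0] == 0:
            continue  # skip empty chunks so no empty payload is sent
        chunks.append(predict_fn(batch).decode('utf-8'))
    if not chunks:
        return np.array([])
    return np.array(','.join(chunks).split(','), dtype=float)

# Made-up feature matrix: 1250 rows, first column holds 0, 2, ..., 2498.
data = np.column_stack([np.arange(0, 2500, 2.0), np.zeros(1250)])
preds = predict_in_batches(data, fake_predict, rows=500)
print(preds.shape, preds.sum())
```

Keeping each request small is the reason for the split: a single payload with the whole test set could exceed what a real-time endpoint accepts.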

In [50]:
#pd.crosstab(index=test_set['isFraud'], columns = np.round(predictions), rownames=['actuals'], colnames=['predictions'])
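
The cross-tabulation above can be sketched on made-up values to show what it reports. The labels and scores below are invented for illustration, and `confusion_at` is a hypothetical helper that cuts the scores at a chosen threshold instead of rounding at 0.5.

```python
import numpy as np
import pandas as pd

# Made-up labels and predicted probabilities, for illustration only.
actuals = np.array([0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.70, 0.20, 0.90, 0.30, 0.05, 0.80])

def confusion_at(threshold):
    """Cross-tabulate actual labels against scores cut at `threshold`."""
    predicted = (scores >= threshold).astype(int)
    table = pd.crosstab(index=actuals, columns=predicted,
                        rownames=['actuals'], colnames=['predictions'])
    # Make sure both classes appear even if one is never predicted.
    return table.reindex(index=[0, 1], columns=[0, 1], fill_value=0)

cm = confusion_at(0.5)
tp, fn, fp = cm.loc[1, 1], cm.loc[1, 0], cm.loc[0, 1]
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Calling `confusion_at` with a lower threshold trades missed fraud cases for more false alarms, which is the trade-off to weigh when the two kinds of error have different costs.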