# Credit card fraud detector

In [None]:
# import libraries
import boto3, re, sys, math, json, os, sagemaker
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer


In [None]:
bucket='bucketfraud'

In [None]:
!aws s3 cp 's3://bucketfraud/creditcard.csv' 'creditcard.csv'

## Investigate and process the data

Let's start by reading in the credit card fraud data set.

In [None]:
data = pd.read_csv('creditcard.csv', delimiter=',')
data.head()

Let's take a peek at our data (we only show a subset of the columns in the table):

We will use the same data in Amazon Fraud Detector to perform a supervised classification task. In this notebook, we build an unsupervised anomaly detector to export to Fraud Detector, so that both models can be used as an ensemble in our detection rule engine. When we export a model from SageMaker into Fraud Detector, we first need to define all of its variables in Fraud Detector. Fraud Detector requires variable names to be lowercase without special characters, so we lowercase the column names shown above.

In [None]:
data.columns = map(str.lower, data.columns)
print(data.columns)
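
Lowercasing is enough for this dataset because its column names contain only letters and digits. For column names that also contain spaces or punctuation, a sanitizing step along the following lines could be used. This is a minimal sketch: `to_variable_name` is a hypothetical helper, and the rule it applies (keep only `a-z`, `0-9` and `_`) is an assumption rather than the documented Fraud Detector naming rule.

```python
import re


def to_variable_name(column: str) -> str:
    """Lowercase a column name and replace special characters with underscores.

    Hypothetical helper; the allowed character set is an assumption.
    """
    name = column.strip().lower()
    # Replace every run of characters outside a-z, 0-9 and underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    # Drop underscores left over at either end.
    return name.strip("_")


# "Card Type (new)" is a made-up column name used only for illustration.
columns = ["Time", "V1", "Amount", "Class", "Card Type (new)"]
print([to_variable_name(c) for c in columns])
```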

The dataset contains only numerical features, because the original features have been transformed for
confidentiality using PCA. As a result, the dataset contains 28 PCA components, V1-V28, and two features
that haven't been transformed, _Amount_ and _Time_. _Amount_ refers to the transaction amount, and _Time_
is the seconds elapsed between any transaction in the data and the first transaction.

The class column corresponds to whether or not a transaction is fraudulent. We see that the majority of data is non-fraudulent with only $492$ ($0.173\%$) of the data corresponding to fraudulent examples, out of the total of 284,807 examples in the data.

In [None]:
nonfrauds, frauds = data.groupby('class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Percentage of fraudulent data:', 100.*frauds/(frauds + nonfrauds))

Because the columns $V_i$ are the result of a PCA, they are already centered at $0$ mean, so we use them as they are without further scaling.

In [None]:
# Every column except the last one is a feature; the last column ('class') is the label
feature_columns = data.iloc[:, :-1]
label_column = data.iloc[:, -1]


Next, we will prepare our data for loading and training.

## Training

We will split our dataset into a train set and a test set to evaluate the performance of our models. It's important to do so _before_ applying any techniques meant to alleviate the class imbalance. This ensures that we don't leak information from the test set into the train set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    feature_columns, label_column, test_size=0.2, random_state=42)

len(X_train)
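
The order described above (split first, rebalance afterwards) can be sketched in plain Python. This notebook does not resample the data, so the random oversampling below is only an illustration of where such a step would go; `split_then_oversample` and the toy labels are hypothetical.

```python
import random


def split_then_oversample(labels, test_fraction=0.2, seed=42):
    """Split row indices first, then oversample the minority class in train only."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Rebalance using training rows only, so no test row can reach the train set.
    minority = [i for i in train_idx if labels[i] == 1]
    majority = [i for i in train_idx if labels[i] == 0]
    extra = []
    if minority:
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra, test_idx


# Toy labels: 10 frauds among 1000 rows (made-up numbers).
labels = [1] * 10 + [0] * 990
train_idx, test_idx = split_then_oversample(labels)
print(len(train_idx), len(test_idx))
print(set(train_idx).isdisjoint(test_idx))
```

Had the oversampling been applied before the split, copies of the same fraudulent row could end up on both sides of it.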

In [None]:
# X_train.to_csv('train.csv', index=False, header=False)

# boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join('train/train.csv')).upload_file('train.csv')

# train_data = sagemaker.inputs.TrainingInput(s3_data='s3://{}/train'.format(bucket),
#        content_type='text/csv;label_size=0',
#        distribution='ShardedByS3Key')

## Unsupervised Learning

In a fraud detection scenario, we commonly have very few labeled examples, and labeling fraud can take a very long time. We would therefore like to extract information from the unlabeled data we have at hand as well. _Anomaly detection_ is a form of unsupervised learning where we try to identify anomalous examples based solely on their feature characteristics. Random Cut Forest (RCF) is SageMaker's built-in anomaly detection algorithm, designed to be both accurate and scalable. We will train such a model on our training data and evaluate its performance on our test set.
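
As a rough intuition for why random cuts can expose anomalies, the sketch below counts how many random axis-aligned cuts it takes to isolate a point from the rest of the data. It is not SageMaker's Random Cut Forest algorithm, only a simplified isolation experiment on made-up 2D points; `cuts_to_isolate` and `mean_cuts` are hypothetical helpers.

```python
import random


def cuts_to_isolate(points, target, rng, max_tries=200):
    """Count the random cuts needed until `target` is alone on its side."""
    remaining = list(points)
    cuts = 0
    for _ in range(max_tries):
        if len(remaining) <= 1:
            break
        dim = rng.randrange(len(target))
        lo = min(p[dim] for p in remaining)
        hi = max(p[dim] for p in remaining)
        if lo == hi:
            continue  # nothing to cut along this dimension
        cut = rng.uniform(lo, hi)
        # Keep only the side of the cut that contains the target.
        if target[dim] < cut:
            remaining = [p for p in remaining if p[dim] < cut]
        else:
            remaining = [p for p in remaining if p[dim] >= cut]
        cuts += 1
    return cuts


def mean_cuts(points, target, trials=200, seed=0):
    """Average the isolation cost of `target` over several random trials."""
    rng = random.Random(seed)
    return sum(cuts_to_isolate(points, target, rng) for _ in range(trials)) / trials


data_rng = random.Random(1)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
points = cluster + [outlier]
# Use the cluster point closest to the origin as a typical inlier.
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)

outlier_cuts = mean_cuts(points, outlier)
inlier_cuts = mean_cuts(points, inlier)
print(outlier_cuts, inlier_cuts)
```

The outlier is separated after far fewer cuts than the inlier. In this toy version fewer cuts means more anomalous, whereas the scores returned by the deployed model later in the notebook are higher for more anomalous points.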

In [None]:
import boto3
import os
import sagemaker
from sagemaker import get_execution_role


session = sagemaker.Session()
bucket='bucketfraud'
prefix = 'fraud-classifier'

output_path='s3://{}/{}/output'.format(bucket, prefix)
print(output_path)

In [None]:
from sagemaker import RandomCutForest

# specify general training job information
rcf = RandomCutForest(role=get_execution_role(),
                      instance_count=1,
                      instance_type='ml.c4.xlarge',
                      data_location=f"s3://{bucket}/train/",
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=1000,
                      num_trees=100)

Now we are ready to fit the model. The cell below should take around 5 minutes to complete.

In [None]:
rcf.fit(rcf.record_set(X_train.to_numpy()))
#rcf.fit({'train': train_data}) 

### Host Random Cut Forest

Once we have a trained model, we can deploy it and get some predictions for our test set. SageMaker will spin up an instance for us and deploy the model. The whole process should take around 10 minutes; progress is shown with a `-` for each step and an exclamation point when the process is finished.

In [None]:
rcf_predictor = rcf.deploy(
    #model_name="{}-rcf".format('model'),
    #endpoint_name="{}-rcf".format(rcf_inference.endpoint),
    initial_instance_count=1,
    instance_type='ml.c4.xlarge',
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer()
)

In [None]:
test_data=X_test.values
print(test_data[1])

In [None]:
result = rcf_predictor.predict(test_data[1], initial_args={"ContentType": "text/csv", "Accept": "application/json"})
print(result)

### Test Random Cut Forest

With the model deployed, let's see how it performs in terms of separating fraudulent from legitimate transactions.

In [None]:
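def predict_rcf(current_predictor, data, rows=500):
    """Score `data` in batches of at most `rows` rows and return a flat array of anomaly scores."""
    # Split the data into chunks to keep each request to the endpoint small.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        response = current_predictor.predict(
            array, initial_args={"ContentType": "text/csv", "Accept": "application/json"})
        # The response has the form {'scores': [{'score': ...}, ...]}.
        predictions.append([s['score'] for s in response['scores']])
    return np.concatenate([np.array(batch) for batch in predictions])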
    

In [None]:
results=predict_rcf(rcf_predictor, test_data)


In [None]:
#print(results)
len(results)
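
The batching pattern used by `predict_rcf` can be tried out without a live endpoint. The sketch below is an assumption-laden stand-in: `StubPredictor` is a hypothetical class (not part of the SageMaker SDK) that returns the same response shape, and `predict_in_batches` uses fixed-size slices instead of `np.array_split`, so the chunk sizes differ slightly from the function above.

```python
class StubPredictor:
    """Hypothetical stand-in for a deployed endpoint; not a SageMaker class."""

    def predict(self, rows):
        # Mimic the response shape used in this notebook:
        # {'scores': [{'score': ...}, ...]}. The "score" here is just the row sum.
        return {"scores": [{"score": float(sum(row))} for row in rows]}


def predict_in_batches(predictor, rows, batch_size=500):
    """Send `rows` to the predictor in slices and flatten the returned scores."""
    scores = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        response = predictor.predict(batch)
        scores.extend(item["score"] for item in response["scores"])
    return scores


# 1200 made-up rows are sent as three requests of 500, 500 and 200 rows.
rows = [[i, 1.0] for i in range(1200)]
scores = predict_in_batches(StubPredictor(), rows)
print(len(scores))
```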

Let's plot the scores and have a look at their distribution.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)

In [None]:
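# distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(results, kde=True, stat='density', label='anomaly score')
plt.legend()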

Below we set a cutoff at 2 standard deviations above the mean score (roughly the 97.7th percentile if the scores were normally distributed). Data points scoring above this cutoff are treated as anomalous.

In [None]:
score_mean = results.mean()
score_std = results.std()
score_cutoff = score_mean + 2 * score_std
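
The cutoff is the mean score plus a multiple `k` of the standard deviation. The sketch below, using only the standard library and made-up scores, shows how the choice of `k` maps to a percentile under the assumption that scores are normally distributed (real anomaly scores need not be). `score_cutoff_for` is a hypothetical helper; `pstdev` is the population standard deviation, which matches the default of NumPy's `.std()`.

```python
from statistics import NormalDist, mean, pstdev


def score_cutoff_for(scores, k):
    """Return mean + k * population standard deviation of `scores`."""
    return mean(scores) + k * pstdev(scores)


# Made-up scores with one obvious anomaly.
scores = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 4.0]
cutoff = score_cutoff_for(scores, k=2)
flagged = [s for s in scores if s > cutoff]
print(cutoff, flagged)

# Percentile that corresponds to k standard deviations for a normal distribution.
for k in (2, 3):
    print(k, round(NormalDist().cdf(k) * 100, 2))
```

Under that normality assumption, `k=2` corresponds to roughly the 97.7th percentile and `k=3` to roughly the 99.87th, so a larger `k` flags fewer transactions.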

In [None]:
positives = X_test[Y_test == 1]
positives=positives.values
#print(positives[1])
positives_scores = predict_rcf(rcf_predictor, positives)

negatives = X_test[Y_test == 0]
negatives=negatives.values
negatives_scores = predict_rcf(rcf_predictor, negatives)

Using our ground truth labels, we can see that the Random Cut Forest model already achieves some separation between the classes, with the majority of the fraud cases (red) having a higher anomaly score.

In [None]:
plt.figure(figsize=(10, 7))
# Plot each class as a density so the much smaller fraud class stays visible
ax = sns.histplot(positives_scores, label='fraud', bins=50, color='red', kde=True, stat='density')
ax = sns.histplot(negatives_scores, label='not-fraud', bins=50, color='blue', kde=True, stat='density', ax=ax)
ax.set(xlabel='Anomaly score', ylabel='Density')
ax.axvline(score_cutoff, color='g', linestyle='--', label='score_cutoff')
plt.legend()
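
To put a number on the separation seen in the plot, precision and recall at the cutoff can be computed from the two score arrays. The sketch below uses made-up scores and a made-up cutoff, not results from this notebook; `precision_recall` is a hypothetical helper that could be called with `positives_scores`, `negatives_scores` and `score_cutoff`.

```python
def precision_recall(fraud_scores, legit_scores, cutoff):
    """Treat scores above `cutoff` as predicted fraud and compare with the labels."""
    tp = sum(s > cutoff for s in fraud_scores)   # frauds correctly flagged
    fp = sum(s > cutoff for s in legit_scores)   # legitimate transactions flagged
    fn = len(fraud_scores) - tp                  # frauds that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up anomaly scores for illustration only.
fraud_scores = [1.9, 2.4, 0.8, 3.1]
legit_scores = [0.7, 0.9, 1.0, 2.1, 0.8, 1.1]
precision, recall = precision_recall(fraud_scores, legit_scores, cutoff=1.5)
print(precision, recall)
```

With so few fraud cases in the data, these two figures are more informative than plain accuracy, which is dominated by the legitimate transactions.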