# FraudML - Autoencoder for Fraud Detection

## Using Algorithm ARN with Amazon SageMaker APIs

This sample notebook demonstrates the following:
1. Using an algorithm ARN to run a training job, then using the resulting model for inference
2. Using an AWS Marketplace product ARN; here we use the [Autoencoder for Fraud Detection](https://aws.amazon.com/marketplace/pp/prodview-vcfitb65ln2ro) product

## Overall flow diagram
<img src="images/AlgorithmE2EFlow.jpg">

## Compatibility
This notebook is compatible only with the [Autoencoder for Fraud Detection](https://aws.amazon.com/marketplace/pp/prodview-vcfitb65ln2ro) algorithm published to AWS Marketplace. 

***Prerequisite:*** Subscribe to this product before proceeding with this notebook.

## Set up the environment

In [1]:
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()

# S3 prefixes
common_prefix = "autoencoder"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"

### Create the session

The session remembers our connection parameters to Amazon SageMaker. We'll use it to perform all of our Amazon SageMaker operations.

In [2]:
sagemaker_session = sage.Session()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For this example, we use part of the [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud) dataset, which is included with this notebook.

We can use the tools provided by the Amazon SageMaker Python SDK to upload the data to a default bucket. 

In [3]:
TRAINING_WORKDIR = "data/training"

training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print("Training Data Location " + training_input)

Training Data Location s3://sagemaker-us-east-2-587740566727/autoencoder/training-input-data


## Creating Training Job using Algorithm ARN

The algorithm ARN retrieved below belongs to the [Autoencoder for Fraud Detection](https://aws.amazon.com/marketplace/pp/prodview-vcfitb65ln2ro) product.

In [4]:
from src.autoencoder_product_arns import AutoencoderArnProvider

algorithm_arn = AutoencoderArnProvider.get_algorithm_arn(sagemaker_session.boto_region_name)
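An ARN follows the general layout `arn:partition:service:region:account-id:resource`, so the region and resource name can be read straight from the string. The sketch below is illustrative only: `Arn` and `parse_arn` are hypothetical helpers, not part of the SageMaker SDK or of `AutoencoderArnProvider`, and the example value is the ARN that appears in this notebook's training log.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    """Components of an ARN (hypothetical helper type for illustration)."""
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split 'arn:partition:service:region:account-id:resource' into parts."""
    # Split at most 5 times so that colons inside the resource part survive.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("not a valid ARN: %r" % arn)
    return Arn(*parts[1:])


example = parse_arn(
    "arn:aws:sagemaker:us-east-2:587740566727:algorithm/autoencoder-1588167541"
)
print(example.region)    # region the algorithm resource lives in
print(example.resource)  # resource type and name
```

A check like this can help confirm that the ARN's region matches `sagemaker_session.boto_region_name` before starting a training job.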

In [5]:
import json
import time
from sagemaker.algorithm import AlgorithmEstimator

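# train_instance_count / train_instance_type are the SageMaker Python SDK v1
# argument names (renamed to instance_count / instance_type in v2).
algo = AlgorithmEstimator(
    algorithm_arn=algorithm_arn,
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    base_job_name='autoencoder-for-fraud-detection',
)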

## Run Training Job

In [6]:
print("Now run the training job using algorithm arn %s in region %s"
      % (algorithm_arn, sagemaker_session.boto_region_name))
algo.fit({'training': training_input})

Now run the training job using algorithm arn arn:aws:sagemaker:us-east-2:587740566727:algorithm/autoencoder-1588167541 in region us-east-2
2020-04-30 10:18:24 Starting - Starting the training job...
2020-04-30 10:18:25 Starting - Launching requested ML instances......
2020-04-30 10:19:48 Starting - Preparing the instances for training...
2020-04-30 10:20:24 Downloading - Downloading input data...
2020-04-30 10:20:35 Training - Downloading the training image......
2020-04-30 10:21:37 Training - Training image download completed. Training in progress.
Using TensorFlow backend.
Starting the training.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
181961 22746 Tensor("input_1:0", shape=(?, 28), dtype=float32) (?, 28) (None, 28) (None, 28) float64
Train on 181961 samples, validate on 22746 samples
2020-04-30 10:21:42.055434: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not lo

Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training complete.

2020-04-30 10:22:59 Completed - Training job completed
Training seconds: 155
Billable seconds: 155


## Batch Transform Job

Now let's use the trained model to run a batch inference job and verify that it works.

In [45]:
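import pandas as pd

TRANSFORM_WORKDIR = "data/transform"

# Read the raw test file without treating any row as a header,
# then drop the first two columns and the first row.
transform_df = pd.read_csv(TRANSFORM_WORKDIR + "/test_data.csv", header=None).drop([0, 1], axis=1)
transform_df = transform_df.iloc[1:]

# Write the trimmed data without index or header, then upload the directory to S3.
transform_df.to_csv(TRANSFORM_WORKDIR + "/test_data1.csv", index=False, header=False)
transform_input = sagemaker_session.upload_data(TRANSFORM_WORKDIR, key_prefix=batch_inference_input_prefix) + "/test_data1.csv"
print("Transform input uploaded to " + transform_input)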

Transform input uploaded to s3://sagemaker-us-east-2-587740566727/autoencoder/batch-inference-input-data/test_data1.csv
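The preprocessing in the cell above drops the first row and the first two columns of `test_data.csv` before upload. The sketch below reproduces roughly the same trimming on a tiny in-memory CSV using only the standard library, to make explicit what is removed. The function name `strip_header_and_leading_columns` and the sample column names are hypothetical; the real file's columns are not shown in this notebook.

```python
import csv
import io


def strip_header_and_leading_columns(csv_text, n_columns=2):
    """Drop the first row and the first n_columns columns of CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # rows[1:] skips the first row; row[n_columns:] skips the leading columns.
    trimmed = [row[n_columns:] for row in rows[1:]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(trimmed)
    return out.getvalue()


# Hypothetical sample: a header row, two leading columns, two feature columns.
sample = "id,Time,V1,V2\n0,10.0,0.5,-1.2\n1,11.0,0.7,0.3\n"
print(strip_header_and_leading_columns(sample))
```

Only the feature values remain in the result, with no header row, which is the shape of the file sent to the transform job.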


In [46]:
transformer = algo.transformer(1, 'ml.m4.xlarge')
transformer.transform(transform_input, content_type='text/csv')
transformer.wait()

print("Batch Transform output saved to " + transformer.output_path)

..........
...................
Starting the inference server with 4 workers.
[2020-04-30 11:46:18 +0000] [11] [INFO] Starting gunicorn 19.10.0
[2020-04-30 11:46:18 +0000] [11] [INFO] Listening at: unix:/tmp/gunicorn.sock (11)
[2020-04-30 11:46:18 +0000] [11] [INFO] Using worker: gevent
[2020-04-30 11:46:18 +0000] [16] [INFO] Booting worker with pid: 16
[2020-04-30 11:46:18 +0000] [17] [INFO] Booting worker with pid: 17
[2020-04-30 11:46:18 +0000] [18] [INFO] Booting worker with pid: 18
[2020-04-30 11:46:18 +0000] [25] [INFO] Booting worker with pid: 25
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

#### Inspect the Batch Transform Output in S3

In [48]:
from urllib.parse import urlparse

# Split the output location (s3://bucket/prefix) into bucket and key prefix.
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
# Batch Transform names each result after its input file, with ".out" appended.
file_key = '{}/{}.out'.format(parsed_url.path[1:], "test_data1.csv")

s3_client = sagemaker_session.boto_session.client('s3')

# Read the object from the bucket named in the output path.
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
response_text = response['Body'].read().decode('utf-8')
print(response_text)

"[0.         0.         0.         0.         0.         0.
 0.04495115 0.         0.         0.         0.         0.
 0.         0.07881181 0.         0.18280244 0.00850695 0.
 0.         0.         0.         0.         0.         0.
 0.17540304 0.72252455 0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.07884263 0.32763249 0.         0.52146017 0.01206959
 0.         0.         0.22562867 0.         0.         0.27981806
 0.10323042 0.         0.         0.01554949 0.         0.11196473
 0.         0.         0.         0.         0.05725251 0.29135864
 0.         0.14727443 0.06335466 0.         0.         0.
 0.39426629 0.13248053 0.         0.         0.         0.15007943
 0.         0.05942433 0.19972181 0.         0.         0.
 0.         0.         0.10343988 0.         0.         0.
 0.         0.06237259 0.         0.0314042  0.         0.
 0.         0. 
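The lookup in this section relies on two things: `urlparse` splitting an `s3://bucket/prefix` URI into a bucket and a key prefix, and Batch Transform writing each result under the input file name with `.out` appended. The sketch below isolates that logic so it can be checked without AWS access. `split_s3_uri` and `batch_output_key` are hypothetical helper names, and the bucket and job names in the example are placeholders.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key_prefix) for an s3://bucket/prefix URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.strip("/")


def batch_output_key(output_path, input_file_name):
    """Return (bucket, key) of the transform result for one input file."""
    bucket, prefix = split_s3_uri(output_path)
    name = input_file_name + ".out"
    # Avoid a leading slash when the output path has no prefix.
    key = "{}/{}".format(prefix, name) if prefix else name
    return bucket, key


# Placeholder values, not taken from a real job.
bucket, key = batch_output_key(
    "s3://example-bucket/example-transform-job", "test_data1.csv"
)
print(bucket)
print(key)
```

Passing the parsed bucket to `get_object`, rather than assuming the session's default bucket, keeps the lookup correct if a custom output path is configured on the transformer.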

## Live Inference Endpoint

Finally, we create an endpoint for live inference using the model produced by this AWS Marketplace algorithm.


In [49]:
from sagemaker.predictor import csv_serializer
predictor = algo.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

Using already existing model package: scikit-from-aws-marketplace-2020-04-30-10-18-23-846


.


Using already existing model: scikit-from-aws-marketplace-2020-04-30-10-18-23-846


-----------!

### Choose some data and use it for a prediction

To try out the endpoint, we'll select a subset of rows from the test data and request predictions for them.


In [55]:
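import itertools
import numpy as np

test_df = pd.read_csv(TRANSFORM_WORKDIR + "/test_data1.csv", header=None)

# Pick rows 40-49 from each block of 50 rows, for the first three blocks.
a = [50 * i for i in range(3)]   # block offsets: 0, 50, 100
b = [40 + i for i in range(10)]  # positions within a block: 40..49
indices = [i + j for i, j in itertools.product(a, b)]

# Drop the last index, leaving 29 rows to send to the endpoint.
test_data = test_df.iloc[indices[:-1]]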
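The index arithmetic in the cell above can be checked in isolation. This short sketch rebuilds the same list with descriptive variable names (the names `offsets`, `within`, and `selected` are chosen here for illustration) and prints how many rows are selected and where the selection starts and ends.

```python
import itertools

offsets = [50 * i for i in range(3)]   # 0, 50, 100
within = [40 + i for i in range(10)]   # 40, 41, ..., 49

# Every offset combined with every within-block position.
indices = [i + j for i, j in itertools.product(offsets, within)]

# The notebook drops the final index before selecting rows.
selected = indices[:-1]

print(len(selected), selected[0], selected[-1])  # 29 40 148
```

The 29 selected rows match the 29 values returned by the endpoint in the prediction output of this notebook.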

To get predictions, call `predict` on the predictor returned by `deploy`, passing the rows to score. The serializer takes care of converting the data to CSV for us.

In [67]:
print(predictor.predict(test_data.values).decode('utf-8'))

"[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.24002708 0.
 0.84355595 0.99980577 0.         0.         0.         0.
 0.0807944  0.         0.         0.86060132 0.35867418 0.63242463
 0.         0.         0.         0.         0.93750078]"
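The endpoint returns its result as text rather than as a numeric array. Assuming the response keeps the layout shown above (a quoted, bracketed, whitespace-separated list of numbers), it could be converted to Python floats along the following lines. `parse_scores` is a hypothetical helper, and the sample string is a shortened stand-in for a real response.

```python
def parse_scores(response_text):
    """Parse a quoted, bracketed, whitespace-separated list into floats.

    Assumes the layout printed in this notebook; other response formats
    would need different handling.
    """
    cleaned = response_text.strip().strip('"').strip().strip("[]")
    # split() with no arguments handles both spaces and line breaks.
    return [float(token) for token in cleaned.split()]


# Shortened sample in the same layout as the notebook output.
sample = '"[0.         0.24002708 0.\n 0.84355595 0.99980577]"'
scores = parse_scores(sample)
print(scores)
```

With the values as floats, each one can be matched back to its input row by position.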



### Clean up the endpoint

In [68]:
algo.delete_endpoint()