# FraudML - Autoencoder for Fraud Detection

## Using Algorithm ARN with Amazon SageMaker APIs

This sample notebook demonstrates the following:
1. Using an algorithm ARN to run a training job and using the resulting model for inference
2. Using an AWS Marketplace product ARN. We will use the [Autoencoder for Fraud Detection](https://aws.amazon.com/marketplace/pp/prodview-vcfitb65ln2ro) product.

## Overall flow diagram
<img src="images/AlgorithmE2EFlow.jpg">

## Compatibility
This notebook is compatible only with the [Autoencoder for Fraud Detection](https://aws.amazon.com/marketplace/pp/prodview-vcfitb65ln2ro) algorithm published to AWS Marketplace. 

***Prerequisite:*** Subscribe to this product before proceeding with this notebook.

## Set up the environment

In [None]:
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()

# S3 prefixes
common_prefix = "autoencoder"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"

### Create the session

The session remembers our connection parameters to Amazon SageMaker. We'll use it to perform all of our Amazon SageMaker operations.

In [None]:
sagemaker_session = sage.Session()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using some of the [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud) data set, which we have included.

We can use the tools provided by the Amazon SageMaker Python SDK to upload the data to a default bucket. 

In [None]:
TRAINING_WORKDIR = "data/training"

training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print("Training Data Location " + training_input)

## Creating Training Job using Algorithm ARN

The algorithm ARN set below belongs to the [Autoencoder for Fraud Detection](https://aws.amazon.com/marketplace/pp/prodview-vcfitb65ln2ro) product. Replace the placeholder with the ARN shown for your subscription.

In [None]:
algorithm_arn = "insert-arn-here"
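
Before launching a job, it can help to sanity-check the ARN string. The sketch below assumes the usual algorithm ARN layout, `arn:<partition>:sagemaker:<region>:<account-id>:algorithm/<name>`. The helper `parse_algorithm_arn` and the example ARN are hypothetical and are not part of the SageMaker SDK.

```python
def parse_algorithm_arn(arn):
    """Split an algorithm ARN into its parts (hypothetical helper, not an SDK call)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError("Not a SageMaker ARN: %r" % arn)
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "algorithm" or not name:
        raise ValueError("Not an algorithm ARN: %r" % arn)
    return {"partition": parts[1], "region": parts[3], "account": parts[4], "name": name}


# Made-up ARN used only for illustration.
example = parse_algorithm_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-autoencoder"
)
print(example["region"], example["name"])
```

Comparing the parsed region with `sagemaker_session.boto_region_name` is one way to catch an ARN copied from a different region than the one the notebook runs in.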

In [None]:
from sagemaker.algorithm import AlgorithmEstimator

# Parameter names follow SageMaker Python SDK v1; v2 renamed them to
# instance_count and instance_type.
algo = AlgorithmEstimator(
    algorithm_arn=algorithm_arn,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    base_job_name="autoencoder-for-fraud-detection",
)

## Run Training Job

In [None]:
print("Now run the training job using algorithm arn %s in region %s" % (algorithm_arn, sagemaker_session.boto_region_name))
algo.fit({'training': training_input})

## Batch Transform Job

Now let's use the model built to run a batch inference job and verify it works.

In [None]:
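import pandas as pd

TRANSFORM_WORKDIR = "data/transform"

# Read without a header, then drop the first two columns and the first row
# (the header row, if the file has one) before writing the trimmed file.
transform_df = pd.read_csv(TRANSFORM_WORKDIR + "/test_data.csv", header=None).drop([0, 1], axis=1)
transform_df = transform_df.iloc[1:]
transform_df.to_csv(TRANSFORM_WORKDIR + "/test_data1.csv", index=False, header=False)

# upload_data returns the S3 prefix for the directory; append the file name.
transform_input = sagemaker_session.upload_data(TRANSFORM_WORKDIR, key_prefix=batch_inference_input_prefix) + "/test_data1.csv"
print("Transform input uploaded to " + transform_input)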
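
To make the trimming step concrete, here is a minimal sketch that applies the same transformation (drop the first row and the first two columns) to a tiny in-memory CSV using only the standard library. The column names and values are made up for illustration, and the sketch assumes the file starts with a header row; the real `test_data.csv` may differ.

```python
import csv
import io

# Tiny made-up stand-in for test_data.csv: a header row plus two records.
raw = "id,Time,V1,V2,Amount\n0,0.0,-1.36,-0.07,149.62\n1,0.0,1.19,0.27,2.69\n"

rows = list(csv.reader(io.StringIO(raw)))

# Drop the first row and the first two columns of every remaining record,
# mirroring drop([0, 1], axis=1) followed by iloc[1:].
trimmed = [row[2:] for row in rows[1:]]

# Write the result with no header and no index column.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(trimmed)
print(out.getvalue())
```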

In [None]:
transformer = algo.transformer(1, 'ml.m4.xlarge')
transformer.transform(transform_input, content_type='text/csv')
transformer.wait()

print("Batch Transform output saved to " + transformer.output_path)

#### Inspect the Batch Transform Output in S3

In [None]:
from urllib.parse import urlparse

parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
file_key = '{}/{}.out'.format(parsed_url.path[1:], "test_data1.csv")

s3_client = sagemaker_session.boto_session.client('s3')

# Read from the bucket named in the transformer's output path.
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
response_text = response['Body'].read().decode('utf-8')
print(response_text)
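
The bucket and key handling above can be sketched in isolation. The example below splits an `s3://` URI into its bucket and key prefix, then builds the object key by appending `.out` to the input file name, as the cell above does. The helper `split_s3_uri` and the example path are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3://bucket/key URI (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up output path used only for illustration.
bucket, prefix = split_s3_uri("s3://example-bucket/example-transform-job")
key = "{}/{}.out".format(prefix, "test_data1.csv")
print(bucket, key)
```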

## Live Inference Endpoint

Finally, we demonstrate how to create an endpoint for live inference using the model trained with this AWS Marketplace algorithm.


In [None]:
# SDK v1 import; in SDK v2 the equivalent is sagemaker.serializers.CSVSerializer.
from sagemaker.predictor import csv_serializer

predictor = algo.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

### Choose some data and use it for a prediction

To try out the endpoint, we'll extract a few rows of the test data and run predictions against them.


In [None]:
import itertools

inference_data = pd.read_csv(TRANSFORM_WORKDIR + "/test_data1.csv", header=None)

# Pick rows 40-49, 90-99, and 140-149 as a small sample.
a = [50 * i for i in range(3)]
b = [40 + i for i in range(10)]
indices = [i + j for i, j in itertools.product(a, b)]

# Use all selected rows except the last one.
test_data = inference_data.iloc[indices[:-1]]
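
The index arithmetic in the cell above is easier to follow with the values printed. This standalone sketch rebuilds the same `indices` list and shows which rows end up in the sample.

```python
import itertools

a = [50 * i for i in range(3)]   # block offsets: [0, 50, 100]
b = [40 + i for i in range(10)]  # positions within a block: [40, ..., 49]

# Every offset is paired with every position, in order.
indices = [i + j for i, j in itertools.product(a, b)]

print(indices[:3], indices[-3:])  # [40, 41, 42] [147, 148, 149]
print(len(indices[:-1]))          # 29
```

The slice `indices[:-1]` drops the final index (149), so 29 rows are sent to the endpoint.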

Prediction is as easy as calling predict with the predictor we got back from deploy and the data we want to do predictions with. The serializers take care of doing the data conversions for us.

In [None]:
print(predictor.predict(test_data.values).decode('utf-8'))

### Clean up the endpoint

In [None]:
# Delete the endpoint through the predictor returned by deploy.
predictor.delete_endpoint()