# An Introduction to the AWS Fraud Detector Prediction API  
#### Supervised fraud detection  
-------
- [Introduction](#Introduction)
- [Setup](#Setup)
- [Plan](#Plan)

### Reviews 
https://github.com/aayush210789/Deception-Detection-on-Amazon-reviews-dataset



## Introduction
-------

Amazon Fraud Detector is a fully managed service that makes it easy to identify potentially fraudulent online activities such as online payment fraud and the creation of fake accounts. Fraud Detector capitalizes on the latest advances in machine learning (ML) and 20 years of fraud detection expertise from AWS and Amazon.com to automatically identify potentially fraudulent activity so you can catch more fraud faster. 

In this notebook, we'll use the Amazon Fraud Detector Predict API to apply a Detector to sample data and identify potentially fraudulent events. After running this notebook you should be able to: 

- Apply the Detector's "predict" function to generate a model score and rule outcomes on data  

If you would like to know more, please check out the [Fraud Detector's Documentation](https://docs.aws.amazon.com/frauddetector/). 

## Setup
------
First, set up your AWS credentials so that Fraud Detector can store and access training data and supporting detector artifacts. 


### Setting up AWS Credentials & Permissions

https://docs.aws.amazon.com/frauddetector/latest/ug/set-up.html

To use Amazon Fraud Detector, you have to set up permissions that allow access to the Amazon Fraud Detector console and API operations. You also have to allow Amazon Fraud Detector to perform tasks on your 
behalf and to access resources that you own.

We recommend creating an AWS Identity and Access Management (IAM) user with access restricted to Amazon Fraud Detector operations and required permissions. You can add other permissions as needed.

The following policies provide the required permission to use Amazon Fraud Detector:

- *AmazonFraudDetectorFullAccessPolicy*  
    Allows you to perform the following actions:  
        - Access all Amazon Fraud Detector resources  
        - List and describe all model endpoints in Amazon SageMaker  
        - List all IAM roles in the account  
        - List all Amazon S3 buckets  
        - Allow IAM Pass Role to pass a role to Amazon Fraud Detector  

- *AmazonS3FullAccess*  
    Allows full access to Amazon S3. This is required to upload training files to S3.  
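
Before calling Fraud Detector, it can help to confirm which identity your credentials resolve to. The following is a minimal sketch, assuming an STS client created with `boto3.client("sts")`; the helper name `describe_caller` is hypothetical and not part of this notebook.

```python
def describe_caller(sts_client):
    """Return the account ID and ARN that the current credentials resolve to."""
    identity = sts_client.get_caller_identity()
    return {"account": identity["Account"], "arn": identity["Arn"]}


# Usage (requires configured AWS credentials and the boto3 package):
#   import boto3
#   print(describe_caller(boto3.client("sts")))
```

If the ARN printed is not the IAM user or role you attached the policies above to, fix the credentials before continuing.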



## Plan
------

A *Detector* contains the model(s) and rule(s) detection logic for a particular type of fraud that you want to detect. We'll use the following four steps to plan the prediction run: 

1. Detector Name  
    - You'll need the name of the detector; you can look this up in the AFD console 
    
2. Model Name   
    - You'll need the name and version of the active model used by the detector  
    
3. Call Prediction API 
    - You'll need to specify the number of records to predict 

4. Score Threshold
    - The score threshold is the cutoff: records scoring above it are labeled fraud, and all others are labeled legitimate 
    - This is used to create the confusion matrix (TP, FP, TN, FN)  
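
As a rough illustration of step 4, the sketch below counts the four confusion-matrix cells for a given threshold. The function name `confusion_counts` and the scores and labels are made up for this example; labels use 1 for fraud and 0 for legitimate.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when 'score > threshold' is called fraud.

    labels use 1 for fraud and 0 for legitimate.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, label in zip(scores, labels):
        predicted_fraud = score > threshold
        if predicted_fraud and label == 1:
            counts["TP"] += 1
        elif predicted_fraud and label == 0:
            counts["FP"] += 1
        elif not predicted_fraud and label == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts


# Made-up scores and labels, for illustration only
scores = [950, 820, 640, 300, 120, 870]
labels = [1, 1, 1, 0, 0, 0]
print(confusion_counts(scores, labels, threshold=800))
```

Raising the threshold moves events from the "predicted fraud" cells (TP, FP) to the "predicted legitimate" cells (FN, TN), which is the trade-off the rest of this notebook explores.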
    


In [None]:
# display and HTML are imported from IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import HTML, clear_output, display
display(HTML("<style>.container { width:90% }</style>"))
# ------------------------------------------------------------------

import numpy as np
np.seterr(divide='ignore', invalid='ignore')
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# -- dask for parallelism -- 
import dask 

# -- standard stuff -- 
import os
import sys
import time
from datetime import datetime
import json

# -- AWS stuff -- 
import boto3
import sagemaker

# -- sklearn --
from sklearn.metrics import roc_curve, roc_auc_score, auc
%matplotlib inline 

## Initialize AWS Fraud Detector Client 
------

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/frauddetector.html 


In [None]:
# -- fraud detector client --
client = boto3.client('frauddetector')


### Detector, Model, and Identifiers 
-----
<div class="alert alert-info"> 💡 <strong> Detector, Model and Versions </strong>

- DETECTOR_NAME & VERSION correspond to the name and version of your deployed Fraud Detector  
- MODEL_NAME & VERSION correspond to the name and version of the model deployed with your Fraud Detector   
- FRAUD_LABEL is optional; it is useful if you are <b> comparing the performance of your detector's </b>predictions to known frauds   
- EMAIL_ADDRESS is used as a key to identify your prediction inferences. You can look up a specific inference in the console by searching for a specific email address. This maps to the <b>email address field</b> in the file you are predicting on. 
- S3_FILE is the URL of the S3 file you wish to apply your detector to.   

</div>

In [None]:
DETECTOR_NAME = "your_fraud_detector_name"
DETECTOR_VER  = '1.0'

MODEL_NAME    = "your_model_name"
MODEL_VER     = '1.0'

# -- if fraud label exists -- 
FRAUD_LABEL   = "your_target_field"

# -- use email as the identifier for predictions
EMAIL_ADDRESS = "email_address"

# -- input file of data to be scored -- 
S3_FILE       = "s3://your-bucket-name/your-file-to-predict.csv"

#### Load Data to be Scored 
-----
<div class="alert alert-info"> 💡 <strong> Check the first 5 Records </strong>

- Does your data look correct? 
</div>

In [None]:
df = pd.read_csv(S3_FILE)
df.head()
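
Reading an `s3://` URL with `pd.read_csv` relies on the optional `s3fs` package being installed. If it is not available, one alternative is to download the object with an S3 client and parse it yourself. The sketch below assumes a client created with `boto3.client("s3")` and a UTF-8 encoded CSV; the helper names `split_s3_uri` and `read_csv_rows` are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_csv_rows(s3_client, uri):
    """Download a CSV object and return its rows as a list of dicts."""
    bucket, key = split_s3_uri(uri)
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return list(csv.DictReader(io.StringIO(body.decode("utf-8"))))


# Usage (requires AWS credentials and the boto3 package):
#   import boto3
#   rows = read_csv_rows(boto3.client("s3"), S3_FILE)
#   df = pd.DataFrame(rows)   # every value is a string at this point
```

Because `csv.DictReader` returns every value as a string, numeric columns such as the fraud label would need converting before the evaluation cells later in this notebook.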

#### Detector Details 
-----

This simply displays details about your detector. The main thing you want to see is that your Detector's status is 'ACTIVE'. 

In [None]:
# -- details on your detector -- 
response = client.describe_detector(
    detectorId = DETECTOR_NAME ,
)
response 
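
Rather than reading the raw response by eye, the status check can be done in code. This sketch assumes the response contains a `detectorVersionSummaries` list whose entries carry `detectorVersionId` and `status` keys; check the boto3 documentation for the exact shape returned by your SDK version. The helper name and the sample response are made up for illustration.

```python
def active_detector_versions(describe_response):
    """Return the IDs of detector versions whose status is ACTIVE."""
    summaries = describe_response.get("detectorVersionSummaries", [])
    return [s["detectorVersionId"] for s in summaries if s.get("status") == "ACTIVE"]


# Hand-written response with made-up values, for illustration only
sample_response = {
    "detectorId": "your_fraud_detector_name",
    "detectorVersionSummaries": [
        {"detectorVersionId": "1.0", "status": "ACTIVE"},
        {"detectorVersionId": "2.0", "status": "DRAFT"},
    ],
}
print(active_detector_versions(sample_response))
```

In the notebook you would pass the `response` from `describe_detector` instead of `sample_response`, and confirm that `DETECTOR_VER` appears in the result.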

### Model Info
------

This section will display the score threshold table that you see in console. 

In [None]:
# -- model performance summary -- 
import ast

# Call the API once and reuse the response for every metric
model_version = client.describe_model_versions(
    modelId=MODEL_NAME,
    modelVersionNumber=MODEL_VER,
    modelType='ONLINE_FRAUD_INSIGHTS',
    maxResults=10
)['modelVersionDetails'][0]
metrics = model_version['trainingMetrics']

# The metrics arrive as strings; literal_eval parses them without executing arbitrary code
# note: `auc` rebinds the name imported from sklearn.metrics above
auc       = ast.literal_eval(metrics['auc'])
thr       = ast.literal_eval(metrics['thresholds'])
fpr       = ast.literal_eval(metrics['fpr'])
tpr       = ast.literal_eval(metrics['tpr'])
precision = ast.literal_eval(metrics['precision'])
precision

df_model = pd.DataFrame(list(zip(thr, fpr, tpr, precision)), columns=['thr','fpr', 'tpr', 'precision'])
model_stat = df_model.round(decimals=2)               
m = model_stat.loc[model_stat.groupby(["fpr"])["thr"].idxmax()] 
def make_rule(x):
    return "\'score > " + str(x) + "\'"
    
m['score threshold'] = m['thr'].apply(lambda x: make_rule(x))

print (" --- score thresholds 1% to 10% --- ")
print(m[["fpr", "tpr", "score threshold"]].loc[(m['fpr'] > 0.0 ) & (m['fpr'] <= 0.1)].reset_index(drop=True))

### Score Thresholds 

----
Identify a score threshold based on the false positive rate. 

- most operations run at a 1% - 4% false positive rate 


<div class="alert alert-info"> 💡 <strong> False Positive & True Positive Rates </strong>

- false positive rate (FPR) - the % of legitimate events incorrectly identified as fraud at a given score threshold 

- true positive rate (TPR) - the % of fraudulent events correctly identified as fraud at a given score threshold 

- identify a score threshold that corresponds to your target false positive rate; if in doubt, start with the 1% FPR


</div>

In [None]:
# -- update threshold -- 
score_threshold = 800

# -- FPR based on score threshold above -- 
# find the table row whose threshold is closest to the chosen score
nearest = (m['thr'] - score_threshold).abs().idxmin()
fpr_threshold = m.loc[nearest, 'fpr']
tpr_threshold = m.loc[nearest, 'tpr']

print(f"At a score of {score_threshold} the model will identify {tpr_threshold:.0%} of fraudulent events with a {fpr_threshold:.0%} false positive rate")
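
The cell above goes from a score to its false positive rate. The reverse lookup, from a target FPR to a score threshold, can be sketched as follows. The function name `threshold_for_fpr` and the small table are made up for illustration; in the notebook you could pass the `thr` and `fpr` lists instead.

```python
def threshold_for_fpr(thresholds, fprs, target_fpr):
    """Return the lowest threshold whose FPR does not exceed target_fpr.

    Returns None if no threshold meets the target.
    """
    eligible = [t for t, f in zip(thresholds, fprs) if f <= target_fpr]
    return min(eligible) if eligible else None


# Made-up threshold table, for illustration only
thr_demo = [500, 700, 800, 900, 950]
fpr_demo = [0.12, 0.05, 0.02, 0.01, 0.004]
print(threshold_for_fpr(thr_demo, fpr_demo, target_fpr=0.02))
```

Choosing the lowest eligible threshold keeps the FPR within budget while flagging as many events as possible, since lowering the threshold can only increase the number of events scored above it.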

### Model ROC Plot 
------
The following charts show the model's ROC curve and area under the curve (AUC), with the false positive rate and true positive rate that correspond to the specified score threshold marked in red.

In [None]:
# plt.subplots creates the figure, so a separate plt.figure call would only add an empty figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,10))

fig.suptitle( MODEL_NAME + ' ROC Chart \n\n score > ' + 
           str(score_threshold) + ' = FPR @' + 
           str(fpr_threshold*100) + '% AND TPR @' + 
           str(tpr_threshold*100) +'%') 
ax1.plot(fpr, tpr, color='darkorange',
         lw=2, label='ROC curve (area = %0.3f)' % auc)
ax1.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
#plt.xlabel('False Positive Rate')
#plt.ylabel('True Positive Rate')

ax1.legend(loc="lower right",fontsize=12)
ax1.set(xlabel='False Positive Rate', ylabel='True Positive Rate')
ax1.axvline(x = fpr_threshold ,linewidth=2, color='r')
ax1.axhline(y = tpr_threshold ,linewidth=2, color='r')


ax2.plot(fpr, tpr, color='darkorange',
         lw=2, label='ROC curve (area = %0.3f)' % auc)
ax2.set_xlim([0, 0.2])

ax2.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')


ax2.legend(loc="lower right",fontsize=12)
ax2.axhline(y = tpr_threshold ,linewidth=2, color='r')
ax2.set(xlabel='False Positive Rate', ylabel='True Positive Rate')


plt.show()
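
As a sanity check on the AUC shown in the legend, the area under the plotted curve can be approximated directly from the FPR and TPR points with the trapezoidal rule. This is a minimal sketch with made-up points; the function name `trapezoid_auc` is hypothetical.

```python
def trapezoid_auc(fpr_points, tpr_points):
    """Approximate the area under an ROC curve with the trapezoidal rule."""
    # Sort by FPR so the curve is traversed left to right
    pairs = sorted(zip(fpr_points, tpr_points))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up ROC points, for illustration only
print(trapezoid_auc([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0]))
```

A result near 0.5 corresponds to the dashed diagonal (no better than chance), and values closer to 1.0 indicate better separation of fraud from legitimate events.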

## Get Setup for Scoring
-----
The following function returns model variables  


<div class="alert alert-info"> 💡 <strong> Model Variables </strong>

- pass just the variables needed for the detector to score 

</div>


In [None]:
def get_model_variables(MODEL_NAME):
    """ return list of variables used by a model 
    
    """
    response = client.get_models(
    modelType='ONLINE_FRAUD_INSIGHTS',
    modelId= MODEL_NAME)
    model_variables = []

    for v in response['models'][0]['modelVariables']:
        model_variables.append(v['name'])
    return model_variables

model_variables = get_model_variables(MODEL_NAME)
print("\n -- model variables -- ")
print(model_variables)
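
To make "pass just the variables needed" concrete, the sketch below trims one record down to the model's variables and converts each value to a string, mirroring the `astype(str)` step used in the next cell. The helper name `select_event_attributes` and the sample record are made up for illustration.

```python
def select_event_attributes(record, variable_names):
    """Keep only the model's variables and convert every value to a string."""
    missing = [name for name in variable_names if name not in record]
    if missing:
        raise KeyError(f"record is missing model variables: {missing}")
    return {name: str(record[name]) for name in variable_names}


# Made-up record, for illustration only
record = {
    "email_address": "a@example.com",
    "ip_address": "192.0.2.1",
    "order_total": 42.5,
    "internal_note": "not a model variable",
}
print(select_event_attributes(record, ["email_address", "ip_address", "order_total"]))
```

Failing early on a missing variable makes it easier to spot a mismatch between the input file and the model than a failed prediction call would.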

## Run Predictions  
-----
The following cell calls `get_prediction` for each record.   

<i> Note: this uses the Dask backend to parallelize the prediction calls. </i>



<div class="alert alert-info"> 💡 <strong>get_prediction </strong>

- Specify the number of records to score. By default the whole dataset is scored; change record_count to a specific number if you want to predict on, say, just 100 records. 
- Once completed, convert the JSON results to a pandas dataframe and append any existing labels
- Analyze based on the score threshold for a particular false positive rate (FPR)

</div>

In [None]:
record_count = df.shape[0]
start = time.time()

@dask.delayed
def _predict(record):
    """Score one record; on any failure, return it flagged with a score of -1."""
    stime = time.time()
    status = None
    try:
        pred = client.get_prediction(
            detectorId=DETECTOR_NAME,
            detectorVersionId=DETECTOR_VER,   # use the configured version instead of a hard-coded '1.0'
            eventId=record[EMAIL_ADDRESS],
            eventAttributes=record,
        )
        status = pred['ResponseMetadata']['HTTPStatusCode']
        record['outcome'] = pred['outcomes'][0]
        record['score']   = pred['modelScores'][0]['scores'][MODEL_NAME + '_insightscore']
    except Exception:
        # do not call the API again here: a second failure would abort the whole run
        record['outcome'] = '-- failed --'
        record['score']   = -1
    record['status']   = status
    record['score_ms'] = (time.time() - stime) * 1000
    return record

predict_data  = df[model_variables].head(record_count).astype(str).to_dict(orient='records')
predict_score = []

i=0
for record in predict_data:
    clear_output(wait=True)
    # _predict is already decorated with @dask.delayed, so calling it only queues a task
    rec = _predict(record)
    predict_score.append(rec)
    i += 1
    print("tasks queued: ", round((i/record_count)*100,2), "%" )
    

# -- the prediction calls actually run here --
predict_recs = dask.compute(*predict_score)

# Calculate time taken and print results
time_taken = time.time() - start
tps = len(predict_recs) / time_taken

print ('Process took %0.2f seconds' %time_taken)
print ('Scored %d records' %len(predict_recs))
print ('Throughput: %0.2f records per second' %tps)
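
When many predictions run in parallel, individual calls can fail transiently, for example when requests are throttled. boto3 has its own configurable retry behavior, so the following is only a generic sketch of retrying a call with exponential backoff. The helper name `call_with_retry` and its default values are assumptions, not part of this notebook or the Fraud Detector API.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry with exponential backoff.

    Re-raises the last exception once all attempts are used.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # wait 0.5s, 1s, 2s, ... between attempts (with the default base_delay)
            sleep(base_delay * (2 ** attempt))


# Usage inside a prediction function (hypothetical):
#   pred = call_with_retry(lambda: client.get_prediction(
#       detectorId=DETECTOR_NAME, detectorVersionId=DETECTOR_VER,
#       eventId=record[EMAIL_ADDRESS], eventAttributes=record))
```

Retrying every exception is a simplification; in practice you would likely retry only errors known to be transient and let validation errors fail immediately.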



### Take a look at your predictions
-----
Each record will have a score, the time (ms) it took to score it, the outcome, and, if a label was provided, the label. 

In [None]:
predictions = pd.DataFrame.from_dict(predict_recs, orient='columns')
if FRAUD_LABEL:
    predictions[FRAUD_LABEL] = df[FRAUD_LABEL].head(record_count)
    all_variables = ['score', 'score_ms', 'outcome', FRAUD_LABEL] + model_variables
else:
    all_variables = ['score', 'score_ms', 'outcome'] + model_variables

predictions[all_variables].head()

```python
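from sklearn.metrics import accuracy_score
# 'score > threshold' is treated as fraud, matching the rule format used elsewhere in this notebook
predictions.loc[predictions['score'].astype(float) > score_threshold, 'y_pred'] = 1
predictions.loc[predictions['score'].astype(float) <= score_threshold, 'y_pred'] = 0
accuracy_score(predictions[FRAUD_LABEL], predictions['y_pred'])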
```

### Optionally Write Predictions to File

<div class="alert alert-info"> <strong> Write Predictions </strong>

- You can write your prediction dataset to a CSV to manually review predictions
- Simply add a cell below and copy the code below

</div>



```python

# -- optionally write predictions to a CSV file -- 
predictions.to_csv(MODEL_NAME + ".csv", index=False)
# -- or to an XLSX file --
predictions.to_excel(MODEL_NAME + ".xlsx", index=False)

```

## EVALUATION
------
The following section requires FRAUD_LABEL to be set to the name of the label column in the prediction dataset. 

#### Score Distribution
-----


<div class="alert alert-info"> 💡 <strong> Separation </strong>

- typically we recommend a 1 - 4% false positive rate (FPR), but this is entirely business dependent 
- the table below will help you identify a score threshold to evaluate your model with. 

</div>

In [None]:
if FRAUD_LABEL:
    # -- assign predictions based on threshold --
    predictions.loc[predictions['score'].astype(float) > score_threshold, "predicted_fraud" ] = 1
    predictions.loc[predictions['score'].astype(float) <= score_threshold, "predicted_fraud" ] = 0


    fraud = predictions.loc[predictions[FRAUD_LABEL] == 1 ]
    legit = predictions.loc[predictions[FRAUD_LABEL] == 0 ]

    bins = np.linspace(0, 1000, 100)
    plt.figure(figsize=(20,8))
    plt.hist(legit['score'].astype(float) , bins, alpha=1, density=True, label='Normal')
    plt.hist(fraud['score'].astype(float) , bins, alpha=0.5, density=True, label='Fraud')
    plt.legend(loc='upper right')
    plt.title("AWS Fraud Detector Score Distribution")
    plt.xlabel("score")
    plt.ylabel("Density")  # density=True normalizes each histogram, so the y-axis is a density, not a percentage
    plt.axvline(x = score_threshold ,linewidth=4, color='r')
    plt.show()

#### AWS Fraud Detector Prediction Classification 

-----

<div class="alert alert-info"> 💡 <strong> TP/TN/FP/FN </strong>

- True Positive (TP) - correctly identified fraud events  
- True Negative (TN) - correctly identified legitimate events  
- False Positive (FP) - legitimate events incorrectly identified as fraud   
- False Negative (FN) - fraudulent events incorrectly identified as legitimate  


</div>

In [None]:
if FRAUD_LABEL:
    tp = predictions.loc[(predictions[FRAUD_LABEL] == 1) & (predictions['predicted_fraud'].astype(float) == 1)]
    tn = predictions.loc[(predictions[FRAUD_LABEL] == 0) & (predictions['predicted_fraud'].astype(float) == 0)]
    fp = predictions.loc[(predictions[FRAUD_LABEL] == 0) & (predictions['predicted_fraud'].astype(float) == 1)]
    fn = predictions.loc[(predictions[FRAUD_LABEL] == 1) & (predictions['predicted_fraud'].astype(float) == 0)]

    bins = np.linspace(0, 1000, 100)
    plt.figure(figsize=(20,4))
    plt.hist(tn['score'].astype(float), bins, alpha=1, density=True, label='True Legit')
    plt.hist(tp['score'].astype(float), bins, alpha=0.5, density=True, label='True Fraud')


    plt.legend(loc='upper right')
    plt.title("AWS Fraud Detector Score Distribution \n Correct Classifications")
    plt.xlabel("score")
    plt.ylabel("Event %");
    plt.axvline(x = score_threshold ,linewidth=4, color='r')
    plt.show()

    plt.figure(figsize=(20,4))
    plt.hist(fp['score'].astype(float), bins, alpha=1, density=True, label='False Positive')
    plt.hist(fn['score'].astype(float), bins, alpha=0.5, density=True, label='Missed Fraud')
    plt.legend(loc='upper right')
    plt.title("AWS Fraud Detector Score Distribution \n Misclassifications")

    plt.xlabel("score")
    plt.ylabel("Event %");
    plt.axvline(x = score_threshold ,linewidth=4, color='r')
    plt.show()

###  Confusion Matrix 
----
The following plots a typical confusion matrix of actual vs. predicted fraud / not fraud, assuming the fraud label exists.

In [None]:
if FRAUD_LABEL:
    confusion_matrix = pd.crosstab(predictions[FRAUD_LABEL], predictions['predicted_fraud'], rownames=['Actual'], colnames=['Predicted'],)
    plt.figure(figsize=(20,8))
    sns.set(font_scale=1.5)
    ax = sns.heatmap(confusion_matrix, annot=True, fmt='g',cmap="YlGnBu")
    plt.ylabel('Actual Fraud')
    plt.xlabel('Predicted Fraud')
    ax.xaxis.set_ticks_position('top')
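
The counts in the confusion matrix can be turned back into rates and compared with the FPR and TPR reported for the model at the same score threshold. This is a minimal sketch with made-up counts; the function name `rates_from_counts` is hypothetical.

```python
def rates_from_counts(tp, fp, tn, fn):
    """Return observed TPR, FPR and precision from confusion-matrix counts."""
    def ratio(num, den):
        # avoid ZeroDivisionError when a class is absent from the data
        return num / den if den else float("nan")

    return {
        "tpr": ratio(tp, tp + fn),        # share of fraud that was caught
        "fpr": ratio(fp, fp + tn),        # share of legitimate events flagged
        "precision": ratio(tp, tp + fp),  # share of flagged events that were fraud
    }


# Made-up counts, for illustration only
print(rates_from_counts(tp=80, fp=30, tn=970, fn=20))
```

In the notebook, the counts could come from the lengths of the `tp`, `fp`, `tn` and `fn` dataframes built in the classification cell. A large gap between these observed rates and the training-time rates may suggest that the scored data differs from the training data.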