# Hands on Fraud Detector 

### Project 1

This project makes use of [Dask](https://dask.org/) to predict in parallel, so you may need to pip install it in your environment. 
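
A quick way to check whether Dask is already importable before installing anything is sketched below. The helper name `is_installed` is hypothetical; it relies only on the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("dask"):
    # In a notebook, the %pip magic installs into the active kernel's environment.
    print("dask is missing: run `%pip install dask` in a notebook cell")
```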

-------

1. Detector Name  
    - You'll need the name of the detector; you can look it up in the Amazon Fraud Detector (AFD) console.
    
2. Model Name   
    - You'll need the name and version of the active model used by the detector.
    - We'll use this to get the variables used by the model.
    
3. Call Prediction API 
    - You'll need to specify the number of records to predict.
    - Finally, write the predictions to a file.


### Setup Python Libraries

In [1]:
# -- import display helpers from IPython.display; IPython.core.display is deprecated for this --
from IPython.display import HTML, clear_output, display
display(HTML("<style>.container { width:90% }</style>"))
# ------------------------------------------------------------------

# -- standard stuff -- 
import time
from datetime import datetime
import numpy as np
np.seterr(divide='ignore', invalid='ignore')
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# -- dask for parallelism -- 
import dask 

# -- AWS stuff -- 
import boto3

%matplotlib inline 

## Initialize Amazon Fraud Detector Client 
------

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/frauddetector.html 


In [2]:
# -- fraud detector client --
client = boto3.client('frauddetector')
# -- date suffix (YYYYMMDD) appended to output file names --
sufx   = datetime.now().strftime("%Y%m%d")

### Detector, Model, and Identifiers 
-----
<div class="alert alert-info"> 💡 <strong> Detector, Model and Versions </strong>

- DETECTOR_NAME and DETECTOR_VER correspond to the name and version of your deployed Fraud Detector.
- MODEL_NAME and MODEL_VER correspond to the name and version of the model deployed with your Fraud Detector.
- FRAUD_LABEL is optional; it is useful if you are <b>comparing your detector's predictions</b> to known frauds.
- EMAIL_ADDRESS is used as the key that identifies your prediction inferences; you can look up a specific inference in the console by searching for an email address. It maps to the <b>email address field</b> in the file you are predicting on.
- S3_FILE is the URL of the S3 file you wish to apply your detector to.

</div>
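
To make the `S3_FILE` format concrete, here is a minimal sketch that splits an `s3://` URL into its bucket and key using only the standard library. The helper `split_s3_url` is hypothetical and is not part of boto3 or pandas. Note that reading an `s3://` path directly with `pd.read_csv` generally relies on the optional `s3fs` package being installed.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/path/to/file.csv' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url("s3://hands-on-frauddetector/project_1_newaccounts_5k.csv")
print(bucket)  # hands-on-frauddetector
print(key)     # project_1_newaccounts_5k.csv
```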

In [3]:
# -- name and version of your detector -- 
DETECTOR_NAME = "project_1_detector"
DETECTOR_VER  = "1.0"

# -- name and version of model, used to get the model column names -- 
MODEL_NAME    = "project_1_model"
MODEL_VER     = "1.0"

# -- if fraud label exists, this is optional -- 
FRAUD_LABEL   = "is_fraud"

# -- use email as the identifier for predictions
EMAIL_ADDRESS = "email_address"

# -- input file of data to be scored -- 
S3_FILE       = "s3://hands-on-frauddetector/project_1_newaccounts_5k.csv"


#### Load Data to be Scored 
-----
<div class="alert alert-info"> 💡 <strong> Check the first 5 Records </strong>

- Does your data look correct? 
- Do you need to rename any columns? In this example, I renamed credit_card_bin to cc_bin; the column names must match the field names used by the model.

</div>
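
The column check described above can be sketched in plain Python. The function names are hypothetical, and the file columns and model fields are placeholders; only the `credit_card_bin` to `cc_bin` mapping comes from the note. With a DataFrame you would pass `list(df.columns)` and apply the mapping with `df.rename(columns=rename_map)`.

```python
def rename_columns(columns, rename_map):
    """Return the column names with any entries in rename_map applied."""
    return [rename_map.get(name, name) for name in columns]


def missing_fields(columns, expected_fields):
    """Return the expected model fields that are absent from columns."""
    present = set(columns)
    return [name for name in expected_fields if name not in present]


# Placeholder file columns and model fields, for illustration only.
file_columns = ["ip_address", "email_address", "credit_card_bin"]
model_fields = ["ip_address", "email_address", "cc_bin"]
rename_map = {"credit_card_bin": "cc_bin"}

print(missing_fields(file_columns, model_fields))  # ['cc_bin']
renamed = rename_columns(file_columns, rename_map)
print(missing_fields(renamed, model_fields))       # []
```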

In [4]:
df = pd.read_csv(S3_FILE)
df.head(5)

|  | ip_address | email_address | user_agent | customer_city | customer_state | customer_postal | event_timestamp | customer_name | customer_address | phone_number | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84.138.6.238 | synth_tmorton@yahoo.com | Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.... | Meganstad | LA | 32733.0 | 2020-04-11 17:27:38 | Brandon Moran | 824 Price Bypass | (555)784 - 5238 | 0 |
| 1 | 194.147.250.63 | synth_oscott@yahoo.com | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_6_4 rv... | Christinaport | MN | 34319.0 | 2020-04-11 17:31:12 | Dominic Murray | 13515 Ashley Haven Apt. 472 | (555)114 - 6133 | 0 |
| 2 | 192.54.60.50 | synth_aoliver@gmail.com | Mozilla/5.0 (iPad; CPU iPad OS 3_1_3 like Mac ... | Donaldfurt | WA | 32436.0 | 2020-04-11 17:46:34 | Anthony Abbott | 039 Amy Glens | (555)780 - 7652 | 0 |
| 3 | 169.120.193.154 | synth_clewis@gmail.com | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_10_9; ... | Williamburgh | AL | 34399.0 | 2020-04-11 17:48:52 | Kimberly Webb | 81397 Tom Forge | (555)588 - 4426 | 0 |
| 4 | 192.175.55.43 | synth_katherinedavis@hotmail.com | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_0 ... | East Markland | IL | 33690.0 | 2020-04-11 17:49:23 | Renee James | 6815 Dawson Estate | (555)785 - 8274 | 0 |


## Get Setup for Scoring
-----
The following function returns the model variables, which are the inputs the model and detector expect.   


<div class="alert alert-info"> 💡 <strong> Model Variables </strong>

- Pass only the variables the detector needs for scoring.

</div>
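
As a sketch of what passing only the needed variables means for a single row, the snippet below keeps just the model variables and converts every value to a string, mirroring the `astype(str).to_dict(orient='records')` conversion the notebook applies before scoring. The helper `to_event_attributes` is hypothetical, and the sample row is abbreviated for illustration.

```python
def to_event_attributes(row, model_variables):
    """Keep only the model variables from a row and stringify the values."""
    return {name: str(row[name]) for name in model_variables}


# Abbreviated sample row with extra columns the model does not use.
row = {
    "ip_address": "84.138.6.238",
    "email_address": "synth_tmorton@yahoo.com",
    "customer_postal": 32733.0,
    "customer_address": "824 Price Bypass",  # not a model variable
    "is_fraud": 0,                           # label, not sent for scoring
}
variables = ["ip_address", "email_address", "customer_postal"]

attributes = to_event_attributes(row, variables)
print(attributes)
```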


In [5]:
def get_model_variables(model_name):
    """Return the names of the variables used by a model."""
    response = client.get_models(
        modelType='ONLINE_FRAUD_INSIGHTS',
        modelId=model_name,
    )
    # -- take the first model returned and collect its variable names --
    return [v['name'] for v in response['models'][0]['modelVariables']]



model_variables = get_model_variables(MODEL_NAME)
print("-- model variables -- ")
print(model_variables)
    





-- model variables -- 
['ip_address', 'email_address', 'user_agent', 'customer_city', 'customer_state', 'customer_postal', 'event_timestamp', 'customer_name', 'phone_number']


## Run Predictions  
-----
The following cell applies `get_prediction` to each record.   

<i> Note: this uses the Dask backend to parallelize the prediction calls. </i>



<div class="alert alert-info"> 💡 <strong>get_prediction </strong>

- Specify the number of records to score. Set record_count to a specific number if you want to score only, say, 100 records; by default the whole dataset is scored.
- Once scoring completes, convert the JSON results to a pandas DataFrame and append any existing labels.
- Analyze the results based on a score threshold for a particular false positive rate (FPR).

</div>
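
The threshold analysis mentioned in the last bullet is not worked through in the cells shown here. One possible sketch, using only the standard library, picks the lowest score cutoff whose false positive rate on labeled legitimate records stays at or below a target. The function name, the rule "flag when score >= threshold", and the sample scores and labels are all assumptions made up for illustration.

```python
def threshold_for_fpr(scores, labels, target_fpr):
    """Return the lowest threshold t such that flagging score >= t keeps
    the false positive rate at or below target_fpr.

    labels: 1 for fraud, 0 for legitimate.
    """
    legit = [s for s, y in zip(scores, labels) if y == 0]
    if not legit:
        raise ValueError("need at least one legitimate record")
    # Try each observed score as a cutoff, from lowest to highest.
    for t in sorted(set(scores)):
        false_positives = sum(1 for s in legit if s >= t)
        if false_positives / len(legit) <= target_fpr:
            return t
    # No observed score qualifies: use a cutoff above every score.
    return max(scores) + 1


# Made-up scores and labels, for illustration only.
scores = [4, 12, 3, 653, 52, 910, 870, 700]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

print(threshold_for_fpr(scores, labels, 0.2))  # 653
print(threshold_for_fpr(scores, labels, 0.0))  # 700
```

This version rescans the legitimate scores for every candidate cutoff, which is fine for a few thousand records; a sorted sweep would scale better.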

This is all you need to run predictions: 

<b>client.get_prediction(detectorId=DETECTOR_NAME, detectorVersionId=DETECTOR_VER, eventId = SOME_IDENTIFIER, eventAttributes = record)</b>

Example of what a record would look like: 

```python
# -- a single record is a dict of string attributes, keyed by variable name --
record = {'order_amt': '8036.0',
  'ip_address': '192.18.59.93',
  'email_address': 'synth_patrickjennings@gmail.com',
  'cc_bin': '42785',
  'billing_postal': '17740-2745',
  'shipping_postal': '20950-6945',
  'event_timestamp': '2019-03-31 11:21:22',
  'customer_name': 'Jeremy Dougherty'}
```
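
Reading the score out of a response can be sketched without calling the service. The dictionary below is a hand-written stand-in shaped after the fields this notebook reads (`outcomes`, `modelScores`, `ResponseMetadata`); it is not captured API output, and a real response may contain additional fields. The helper `extract_result` is hypothetical.

```python
MODEL_NAME = "project_1_model"

# Hand-written stand-in for a get_prediction response (assumed shape).
fake_response = {
    "outcomes": ["approve"],
    "modelScores": [
        {"scores": {MODEL_NAME + "_insightscore": 4.0}},
    ],
    "ResponseMetadata": {"HTTPStatusCode": 200},
}


def extract_result(response, model_name):
    """Pull the outcome list, HTTP status, and model score from a response."""
    scores = response["modelScores"][0]["scores"]
    return {
        "outcome": response["outcomes"],
        "status": response["ResponseMetadata"]["HTTPStatusCode"],
        "score": scores[model_name + "_insightscore"],
    }


result = extract_result(fake_response, MODEL_NAME)
print(result)
```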

In [6]:
# -- if you don't change record_count, this scores the whole file --
record_count = df.shape[0]

start = time.time()

@dask.delayed
def _predict(record):
    """Score one record and attach the outcome, status, score, and latency."""
    stime = time.time()
    try:
        pred = client.get_prediction(detectorId=DETECTOR_NAME,
                                     detectorVersionId=DETECTOR_VER,
                                     eventId=record[EMAIL_ADDRESS],
                                     eventAttributes=record)
        record['outcome'] = pred['outcomes']
        record['status']  = pred['ResponseMetadata']['HTTPStatusCode']
        record['score']   = pred['modelScores'][0]['scores'][MODEL_NAME + '_insightscore']
    except Exception as err:
        # -- mark the record as failed instead of calling the API a second time;
        # -- botocore's ClientError carries the HTTP status on err.response
        meta = getattr(err, 'response', {}).get('ResponseMetadata', {})
        record['outcome'] = '-- failed --'
        record['status']  = meta.get('HTTPStatusCode')
        record['score']   = -1
    record['score_ms'] = (time.time() - stime) * 1000
    return record

predict_data  = df[model_variables].head(record_count).astype(str).to_dict(orient='records')
predict_score = []

# -- this loop only queues the delayed calls, so the progress shown tracks
# -- queuing; the actual scoring happens in dask.compute below
for i, record in enumerate(predict_data, start=1):
    clear_output(wait=True)
    predict_score.append(_predict(record))  # _predict is already wrapped by @dask.delayed
    print("current progress: ", round((i/record_count)*100,2), "%" )

predict_recs = dask.compute(*predict_score)

# -- calculate time taken (and records per second) and print results --
time_taken = time.time() - start
tps = len(predict_recs) / time_taken

print('Process took %0.2f seconds' % time_taken)
print('Scored %d records' % len(predict_recs))



current progress:  100.0 %
Process took 344.89 seconds
Scored 5000 records
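
The timing output above can be turned into a throughput figure with a little arithmetic. This sketch plugs in the numbers printed by the run (5000 records in 344.89 seconds); the variable names are illustrative.

```python
records_scored = 5000
seconds_taken = 344.89

records_per_second = records_scored / seconds_taken
wall_ms_per_record = seconds_taken / records_scored * 1000

print(f"{records_per_second:.1f} records/second")            # 14.5 records/second
print(f"{wall_ms_per_record:.1f} ms wall time per record")   # 69.0 ms wall time per record
```

The wall time per record is lower than the per-call `score_ms` values reported for individual predictions, which is consistent with several calls being in flight at once.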


### Take a look at your predictions
-----
Each record will have a score, the time (in ms) it took to score, the outcome, and the label if one was provided. 

In [7]:
predictions = pd.DataFrame(list(predict_recs))
all_variables = ['score', 'score_ms', 'outcome']
# -- append the known labels only if a label column is configured and present --
if FRAUD_LABEL and FRAUD_LABEL in df.columns:
    # -- .values aligns labels by position rather than by index --
    predictions[FRAUD_LABEL] = df[FRAUD_LABEL].head(record_count).values
    all_variables.append(FRAUD_LABEL)
all_variables += model_variables

predictions[all_variables].head()

|  | score | score_ms | outcome | is_fraud | ip_address | email_address | user_agent | customer_city | customer_state | customer_postal | event_timestamp | customer_name | phone_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 125.944853 | [approve] | 0 | 84.138.6.238 | synth_tmorton@yahoo.com | Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.... | Meganstad | LA | 32733.0 | 2020-04-11 17:27:38 | Brandon Moran | (555)784 - 5238 |
| 1 | 12.0 | 140.397787 | [approve] | 0 | 194.147.250.63 | synth_oscott@yahoo.com | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_6_4 rv... | Christinaport | MN | 34319.0 | 2020-04-11 17:31:12 | Dominic Murray | (555)114 - 6133 |
| 2 | 3.0 | 123.947859 | [approve] | 0 | 192.54.60.50 | synth_aoliver@gmail.com | Mozilla/5.0 (iPad; CPU iPad OS 3_1_3 like Mac ... | Donaldfurt | WA | 32436.0 | 2020-04-11 17:46:34 | Anthony Abbott | (555)780 - 7652 |
| 3 | 653.0 | 144.717693 | [review] | 0 | 169.120.193.154 | synth_clewis@gmail.com | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_10_9; ... | Williamburgh | AL | 34399.0 | 2020-04-11 17:48:52 | Kimberly Webb | (555)588 - 4426 |
| 4 | 52.0 | 148.059845 | [approve] | 0 | 192.175.55.43 | synth_katherinedavis@hotmail.com | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_0 ... | East Markland | IL | 33690.0 | 2020-04-11 17:49:23 | Renee James | (555)785 - 8274 |


### Optionally Write Predictions to File

<div class="alert alert-info"> <strong> Write Predictions </strong>

- You can write your prediction dataset to a CSV to manually review predictions
- Simply add a cell below and copy the code below

</div>



```python

# -- optionally write predictions to a CSV file -- 
predictions.to_csv(MODEL_NAME + ".csv", index=False)
# -- or to an XLSX file (requires an Excel writer such as openpyxl) -- 
predictions.to_excel(MODEL_NAME + ".xlsx", index=False)

```

In [8]:
predictions.to_csv("predicted_data_"+sufx+".csv", index=False)
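
For reference, this is how the date suffix in the output file name is formed. The sketch uses a fixed example date so the result is reproducible; the notebook itself uses `datetime.now()`.

```python
from datetime import datetime

# Fixed example date (an assumption for illustration); the notebook uses datetime.now().
example_day = datetime(2020, 4, 11)
sufx = example_day.strftime("%Y%m%d")
file_name = "predicted_data_" + sufx + ".csv"

print(file_name)  # predicted_data_20200411.csv
```

Because the suffix has day resolution, running the notebook twice on the same day overwrites the earlier file.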