# An Introduction to the Amazon Fraud Detector API  
#### Supervised fraud detection  
-------
- [Introduction](#Introduction)
- [Setup](#Setup)
- [Plan](#Plan)


## Introduction
-------

Amazon Fraud Detector is a fully managed service that makes it easy to identify potentially fraudulent online activities such as online payment fraud and the creation of fake accounts. Fraud Detector capitalizes on the latest advances in machine learning (ML) and 20 years of fraud detection expertise from AWS and Amazon.com to automatically identify potentially fraudulent activity so you can catch more fraud faster.

In this notebook, we'll use the Amazon Fraud Detector API to define an entity and event of interest and use CSV data stored in S3 to train a model. Next, we'll derive some rules and create a "detector" by combining our entity, event, model, and rules into a single endpoint. Finally, we'll apply the detector to a sample of our data to identify potentially fraudulent events.

After running this notebook you should be able to:
- Define an Entity and Event
- Create a Detector
- Train a Machine Learning (ML) Model
- Author Rules to identify potential fraud based on the model's score
- Apply the Detector's "predict" function to generate a model score and rule outcomes on data

If you would like to know more, please check out [Fraud Detector's Documentation](https://docs.aws.amazon.com/frauddetector/). 


## Setup
------
First, set up your AWS credentials so that Fraud Detector can store and access training data and supporting detector artifacts in S3.


### Setting up AWS Credentials & Permissions

https://docs.aws.amazon.com/frauddetector/latest/ug/set-up.html

To use Amazon Fraud Detector, you have to set up permissions that allow access to the Amazon Fraud Detector console and API operations. You also have to allow Amazon Fraud Detector to perform tasks on your behalf and to access resources that you own. We recommend creating an AWS Identity and Access Management (IAM) user with access restricted to Amazon Fraud Detector operations and required permissions. You can add other permissions as needed.

The following policies provide the required permissions to use Amazon Fraud Detector:

*AmazonFraudDetectorFullAccessPolicy*
- Allows you to perform the following actions:
    - Access all Amazon Fraud Detector resources  
    - List and describe all model endpoints in Amazon SageMaker  
    - List all IAM roles in the account  
    - List all Amazon S3 buckets  
    - Allow IAM Pass Role to pass a role to Amazon Fraud Detector  

*AmazonS3FullAccess*
- Allows full access to Amazon S3. This is required to upload training files to S3.

  




## Plan
### Plan a Fraud Detector
------
A Detector contains the event, model(s), and rule(s) that make up the detection logic for a particular type of fraud that you want to detect. We'll use the following 7-step process to plan a Fraud Detector:  

1.	Set up your notebook
    - Name the major components entity, entity type, model, detector
    - Plug in your ARN role
    - Plug in your S3 Bucket and CSV File
2.	Read and Profile your Data
    - This will give you an idea of what your dataset contains
    - This will also identify the variables and labels that will need to be created to define your event
3.	Create event variables and labels
    - This will create the variables and labels in fraud detector
4.	Define your Entity and Event Type
    - What is the activity that you are detecting? That's likely your Event Type (e.g., account_registration)
    - Who is performing this activity? That's likely your Entity (e.g., customer)
5.	Create and Train your Model
    - Model training takes roughly 45 to 60 minutes
    - Once training completes, promote (activate) your model
6.	Create Detector, generate Rules and assemble your Detector
    - Create your detector
    - Create rules based on your model scores
        - Define outcomes (e.g., fraud, investigate and approve)
    - Assemble your detector by adding your model and rules to it
7.	Test your Detector
    - Interactively call predict on a handful of records
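
Step 7 boils down to sending one record at a time to the detector. The sketch below shows one way to turn a CSV row into a prediction request. The helper names `to_event_variables` and `build_prediction_request` are hypothetical (they are not part of this notebook or of boto3), and the sketch assumes the request is later passed to the client's `get_event_prediction` call, which expects event variables as strings.

```python
import math
from datetime import datetime, timezone


def to_event_variables(record, label_column="EVENT_LABEL"):
    """Convert one CSV row (a dict) into string-valued event variables.

    Missing values (None or NaN) and the label column are dropped.
    """
    variables = {}
    for name, value in record.items():
        if name == label_column or value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        variables[name] = str(value)
    return variables


def build_prediction_request(detector_id, event_type, entity_type,
                             event_id, entity_id, record):
    """Assemble keyword arguments for a single prediction call (hypothetical helper)."""
    return {
        "detectorId": detector_id,
        "eventId": event_id,
        "eventTypeName": event_type,
        # ISO 8601 timestamp in UTC
        "eventTimestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "entities": [{"entityType": entity_type, "entityId": entity_id}],
        "eventVariables": to_event_variables(record),
    }
```

With a boto3 client in hand, the request could then be sent as `client.get_event_prediction(**request)`.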



In [94]:
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML, clear_output
display(HTML("<style>.container { width:90% }</style>"))
# ------------------------------------------------------------------

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import os
import sys
import time
import json
import uuid 
from datetime import datetime

# -- AWS stuff -- 
import boto3
import sagemaker

# -- sklearn --
from sklearn.metrics import roc_curve, roc_auc_score, auc
%matplotlib inline 

In [95]:
# -- initialize the AFD client 
client = boto3.client('frauddetector')

# -- suffix is appended to detector and model name for uniqueness  
sufx   = datetime.now().strftime("%Y%m%d")


### 1. Setup 
-----

***To get started***  

1. Name the major components of Fraud Detector.
2. Plug in your ARN role 
3. Plug in your S3 Bucket and CSV File 

Then you can interactively execute the code cells in the notebook. There is no need to change anything unless you want to. 


<div class="alert alert-info"> <strong> Fraud Detector Components </strong>
EVENT_TYPE is a business activity that you want evaluated for fraud risk. ENTITY_TYPE represents the "what or who" that is performing the event you want to evaluate. MODEL_NAME is the name of your supervised machine learning model that Fraud Detector trains on your behalf. DETECTOR_NAME is the name of the detector that contains the detection logic (model and rules) that you apply to events that you want to evaluate for fraud.

</div>


-----

### Bucket, File, and ARN Role




<div class="alert alert-info"><strong> Bucket, ARN and Model Name </strong>

Identify the following assets. S3_BUCKET is the name of the bucket where your file lives. S3_FILE is the key (file name) of your CSV file within that bucket; the notebook combines the two into the S3 URL S3_FILE_LOC. ARN_ROLE is the ARN of the data access role for the training data source.

</div>
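
Before filling in these values, it can help to sanity-check their shape. The following is a minimal sketch, not part of the original notebook: `is_role_arn` and `s3_uri` are hypothetical helpers, and the pattern assumes an IAM role in the standard `aws` partition with a 12-digit account ID.

```python
import re

# Assumed shape: arn:aws:iam::<12-digit account id>:role/<optional path/><role name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def is_role_arn(value):
    """Return True if value looks like an IAM role ARN (format check only)."""
    return bool(ROLE_ARN_PATTERN.match(value))


def s3_uri(bucket, key):
    """Combine a bucket name and an object key into an s3:// URI."""
    return "s3://{0}/{1}".format(bucket, key.lstrip("/"))
```

A format check like this only catches typos; it does not confirm that the role exists or that it grants access to the bucket.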

```
Note: To use Amazon Fraud Detector, you have to set up permissions that allow access to the Amazon Fraud Detector console and API operations. You also have to allow Amazon Fraud Detector to perform tasks on your behalf and to access resources that you own. We recommend creating an AWS Identity and Access Management (IAM) user with access restricted to Amazon Fraud Detector operations and required permissions. You can add other permissions as needed. See "Create an IAM User and Assign Required Permissions" in the user guide:
```
https://docs.aws.amazon.com/frauddetector/latest/ug/frauddetector.pdf


In [96]:
# -- This is all you need to fill out. Once complete simply interactively run each code cell. --  

ENTITY_TYPE    = "customer{0}".format(sufx) 
ENTITY_DESC    = "entity description: {0}".format(sufx) 

EVENT_TYPE     = "cardpayment{0}".format(sufx) 
EVENT_DESC     = "example event description: {0}".format(sufx) 

MODEL_NAME     = "fraudmodel{0}".format(sufx) 
MODEL_DESC     = "model trained on: {0}".format(sufx) 

DETECTOR_NAME  = "detector{0}".format(sufx)                        
DETECTOR_DESC  = "detects synthetic fraud events created: {0}".format(sufx) 

ARN_ROLE       = "arn:aws:iam::261625305249:role/frauddetctoraccess"
S3_BUCKET      = "bucketfraud2"
S3_FILE        = "test.csv"
S3_FILE_LOC    = "s3://{0}/{1}".format(S3_BUCKET,S3_FILE)


In [116]:
# -- reading an s3:// path directly with pandas requires the s3fs package --
df = pd.read_csv(S3_FILE_LOC)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 82 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   r_y_d_e_r    163 non-null    float64
 1   j_l_q_m_p    163 non-null    float64
 2   v_o_g_p_t    163 non-null    float64
 3   a_x_w_d_r    163 non-null    float64
 4   j_k_p_g_w    163 non-null    float64
 5   f_u_y_f_q    163 non-null    float64
 6   p_x_x_d_o    163 non-null    float64
 7   e_z_h_o_k    163 non-null    float64
 8   m_q_m_p_h    163 non-null    float64
 9   u_x_p_x_p    163 non-null    float64
 10  n_g_i_d_g    163 non-null    float64
 11  o_r_e_q_p    163 non-null    float64
 12  z_l_u_c_r    163 non-null    float64
 13  w_p_l_z_k    163 non-null    float64
 14  o_p_o_f_n    163 non-null    float64
 15  w_v_t_p_w    163 non-null    float64
 16  a_m_z_w_x    163 non-null    float64
 17  t_x_d_q_g    163 non-null    float64
 18  o_j_e_v_b    163 non-null    float64
 19  b_n_v_z_

Unnamed: 0,r_y_d_e_r,j_l_q_m_p,v_o_g_p_t,a_x_w_d_r,j_k_p_g_w,f_u_y_f_q,p_x_x_d_o,e_z_h_o_k,m_q_m_p_h,u_x_p_x_p,n_g_i_d_g,o_r_e_q_p,z_l_u_c_r,w_p_l_z_k,o_p_o_f_n,w_v_t_p_w,a_m_z_w_x,t_x_d_q_g,o_j_e_v_b,b_n_v_z_o,h_l_d_m_a,r_v_b_b_i,l_k_q_x_w,p_x_a_f_u,y_w_o_n_z,r_a_j_g_w,h_t_v_r_a,m_t_h_d_u,l_r_n_t_p,d_m_s_y_h,d_w_n_n_p,c_o_g_d_g,g_y_r_w_f,u_k_z_h_u,q_v_z_i_u,l_c_r_a_b,o_o_u_j_l,k_s_e_p_a,x_c_e_u_m,i_c_f_u_f,q_t_o_x_g,i_y_z_o_r,n_f_a_q_j,y_c_h_t_y,j_l_x_u_c,g_c_i_t_a,t_b_s_p_i,s_p_p_r_q,x_g_g_x_x,c_b_j_v_u,k_j_s_h_f,v_e_g_v_g,c_v_n_m_e,d_i_n_g_v,m_z_w_y_t,m_f_e_y_z,g_f_l_j_m,x_n_d_i_o,v_z_h_g_g,c_l_w_a_l,v_r_o_s_j,j_h_i_t_d,l_g_x_o_s,p_l_m_j_i,w_y_j_j_o,i_m_f_w_w,u_t_l_a_h,m_i_y_b_u,d_z_m_a_a,y_g_q_j_i,z_x_w_p_n,s_s_e_d_l,p_l_i_p_p,r_n_v_z_y,v_y_d_l_g,r_e_x_l_p,z_j_a_i_y,l_b_b_l_o,w_w_o_c_t,d_m_q_o_y,h_c_f_e_n,EVENT_LABEL
0,-1.059211,-2.157614,0.114405,0.392764,0.392435,-1.103485,0.751895,-0.415308,-1.565478,0.581398,-1.305657,-0.070039,0.0548,1.245651,-0.76839,-0.149087,1.011744,1.138376,-0.344796,-0.080713,-0.102756,0.741651,-0.148168,-0.268551,-0.833465,-0.108422,0.007094,-0.218585,-0.223651,1.161608,0.036443,0.57351,0.114113,-0.471596,0.626772,-0.475642,-0.162448,0.682517,-0.645584,0.698215,-0.175617,0.481963,-0.105692,-0.327408,-0.98228,-0.183494,-0.278907,0.513623,-0.053452,0.521954,W,discover,credit,,,T,T,T,M2,F,T,,,,,,,,,,,,,,,,,,,,,0
1,-0.970795,-0.821196,-0.325639,0.034164,-0.231074,-0.327025,0.261496,-0.350461,-1.242754,0.153343,-0.42645,0.035833,-0.527354,1.287943,-0.311763,0.814544,1.344377,-1.276574,-1.343481,-0.733825,1.659719,-1.888895,0.995353,-0.109262,-0.934014,0.759056,-0.731781,-1.680754,-0.547039,2.060132,-0.478194,0.650587,0.125973,-0.121168,0.714551,-0.352207,-0.102813,0.615228,-0.835115,-0.112807,0.156044,-0.007228,-0.2492,0.380962,-0.171077,0.228151,0.137645,0.358438,0.405234,0.428078,W,mastercard,credit,gmail.com,,,,,M0,T,T,,,,,,,,,,,,,,,,,,,,,0
2,-1.110525,-2.150386,0.134457,0.279532,0.368285,-1.068637,0.710021,-0.361964,-1.380856,0.564718,-1.188265,-0.026153,0.378869,0.811057,-0.254897,0.428913,1.388495,-0.113352,-0.359383,-0.643665,0.0288,0.557043,-0.227794,-0.240079,-0.635833,-0.117641,0.06691,0.102978,0.302996,0.279651,0.089465,0.115887,0.002134,-0.274917,0.354645,-0.31527,-0.057769,0.505936,-0.227214,0.749483,-0.44781,0.702026,-0.063118,-0.662812,-1.194922,-0.536442,-0.51496,0.556471,-0.517959,0.3731,W,visa,debit,outlook.com,,T,T,T,M0,F,F,F,F,F,,,,,,,,,,,,,,,,,,0
3,1.058747,-2.690885,-0.130715,0.085192,0.089674,-0.515826,1.080806,1.073882,-0.433224,0.905033,-1.980668,-0.230952,1.149696,1.145696,-1.399157,1.343961,-1.456335,-2.216079,-0.480634,-0.584631,-1.893102,0.981852,-1.175256,-0.136942,0.835391,-0.257972,0.864171,-0.545879,2.171078,1.387958,-0.862109,-0.1352,1.306486,0.155621,-0.251062,-1.517597,-0.785187,0.992017,-1.519945,1.182517,0.105246,0.128822,-0.291442,-0.942911,-0.454216,-0.875495,0.679595,0.731587,-0.471533,0.391395,W,mastercard,debit,yahoo.com,,,,,M0,T,F,,,,,,,,,,,,,,,,,,,,,0
4,0.878549,-1.573742,-2.570442,14.103111,6.257457,-11.232366,-1.723128,-0.097971,1.92481,1.985431,4.826668,0.665191,-2.209259,0.232568,3.634143,0.308498,-1.782981,-0.17898,-2.2664,0.576044,-0.087601,0.815422,0.136869,0.165116,0.419451,0.032594,0.520476,0.179938,0.6964,0.628718,-0.71411,0.980896,-1.716457,0.279648,0.081052,0.15179,-0.318723,-0.020132,-0.132947,0.434698,-0.222235,-0.155382,0.939225,0.595024,0.740305,0.073924,-0.223706,-2.32007,-0.632305,-1.787679,H,mastercard,credit,gmail.com,,,,,,,,,,,NotFound,New,NotFound,,,New,NotFound,Android 7.0,samsung browser 6.2,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M,0


In [98]:
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

arn:aws:iam::261625305249:role/service-role/AmazonSageMaker-ExecutionRole-20201209T143273


### 2. Profile Your Dataset 
-----

    
<div class="alert alert-info"> 💡 <strong> Profiling </strong>

The function below will: 1. profile your data, creating descriptive statistics, 2. perform basic data quality checks (nulls, unique variables, etc.), and 3. return summary statistics and the EVENT and MODEL schemas used to define your EVENT_TYPE and TRAIN your MODEL.


</div>
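
To make the data quality checks concrete, here is a small sketch that applies the same missing-value thresholds (more than 20% and more than 50% missing) to a toy dataset in plain Python. The `profile_column` helper and the sample rows are illustrative assumptions, not the notebook's profiling function, and the sketch assumes each column has at least one row.

```python
def profile_column(values):
    """Return null and unique shares plus a warning for one column of raw values."""
    total = len(values)
    present = [v for v in values if v is not None and v != ""]
    null_pct = (total - len(present)) / total
    nunique_pct = len(set(present)) / total
    if null_pct > 0.5:
        warning = "EXCLUDE, GT 50% MISSING"
    elif null_pct > 0.2:
        warning = "NULL WARNING, GT 20% MISSING"
    else:
        warning = "NO WARNING"
    return {
        "null_pct": round(null_pct, 4),
        "nunique_pct": round(nunique_pct, 4),
        "warning": warning,
    }


# Toy rows standing in for the CSV; the real dataset has 163 rows and 82 columns.
rows = [
    {"card": "visa", "device": None},
    {"card": "visa", "device": None},
    {"card": "mastercard", "device": "mobile"},
    {"card": None, "device": None},
]
columns = {name: [row[name] for row in rows] for name in rows[0]}
report = {name: profile_column(values) for name, values in columns.items()}
```

Here `card` is 25% missing and gets the null warning, while `device` is 75% missing and is flagged for exclusion.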

In [99]:
# --- no changes; just run this code block ---
def summary_stats(df):
    """ Generate summary statistics for a pandas DataFrame 
        Args:
            df (DataFrame): pandas DataFrame to create summary statistics for.
        Returns:
            DataFrame of summary statistics, training data schema, event variables and event labels 
    """
    df = df.copy()
    rowcnt = len(df)
    df['EVENT_LABEL'] = df['EVENT_LABEL'].astype('str', errors='ignore')
    df_s1  = df.agg(['count', 'nunique']).transpose().reset_index().rename(columns={"index":"feature_name"})
    df_s1["null"] = (rowcnt - df_s1["count"]).astype('int64')
    df_s1["not_null"] = rowcnt - df_s1["null"]
    df_s1["null_pct"] = df_s1["null"] / rowcnt
    df_s1["nunique_pct"] = df_s1['nunique']/ rowcnt
    dt = pd.DataFrame(df.dtypes).reset_index().rename(columns={"index":"feature_name", 0:"dtype"})
    df_stats = pd.merge(dt, df_s1, on='feature_name', how='inner').round(4)
    df_stats['nunique'] = df_stats['nunique'].astype('int64')
    df_stats['count'] = df_stats['count'].astype('int64')
    
    # -- variable type mapper --  
    df_stats['feature_type'] = "UNKNOWN"
    df_stats.loc[df_stats["dtype"] == object, 'feature_type'] = "CATEGORY"
    df_stats.loc[(df_stats["dtype"] == "int64") | (df_stats["dtype"] == "float64"), 'feature_type'] = "NUMERIC"
    df_stats.loc[df_stats["feature_name"].str.contains("ipaddress|ip_address|ipaddr"), 'feature_type'] = "IP_ADDRESS"
    df_stats.loc[df_stats["feature_name"].str.contains("email|email_address|emailaddr"), 'feature_type'] = "EMAIL_ADDRESS"
    df_stats.loc[df_stats["feature_name"] == "EVENT_LABEL", 'feature_type'] = "TARGET"
    df_stats.loc[df_stats["feature_name"] == "EVENT_TIMESTAMP", 'feature_type'] = "EVENT_TIMESTAMP"
    
    # -- variable warnings -- 
    df_stats['feature_warning'] = "NO WARNING"
    df_stats.loc[(df_stats["nunique"] != 2) & (df_stats["feature_name"] == "EVENT_LABEL"),'feature_warning' ] = "LABEL WARNING, NON-BINARY EVENT LABEL"
    df_stats.loc[(df_stats["nunique_pct"] > 0.9) & (df_stats['feature_type'] == "CATEGORY") ,'feature_warning' ] = "EXCLUDE, GT 90% UNIQUE"
    df_stats.loc[(df_stats["null_pct"] > 0.2) & (df_stats["null_pct"] <= 0.5), 'feature_warning' ] = "NULL WARNING, GT 20% MISSING"
    df_stats.loc[df_stats["null_pct"] > 0.5,'feature_warning' ] = "EXCLUDE, GT 50% MISSING"
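    # -- compare the share of unique values (nunique_pct), not the raw count, against the 20% cutoff --
    df_stats.loc[((df_stats['dtype'] == "int64" ) | (df_stats['dtype'] == "float64" ) ) & (df_stats['nunique_pct'] < 0.2), 'feature_warning' ] = "LIKELY CATEGORICAL, NUMERIC w. LOW CARDINALITY"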
   
    # -- target check -- 
    exclude_fields  = df_stats.loc[(df_stats['feature_warning'] != 'NO WARNING')]['feature_name'].to_list()
    event_variables = df_stats.loc[(~df_stats['feature_name'].isin(['EVENT_LABEL', 'EVENT_TIMESTAMP']))]['feature_name'].to_list()
    event_labels    = df["EVENT_LABEL"].unique().tolist()
    
    # -- model variables: every feature with a supported type --
    model_variables = df_stats.loc[(df_stats['feature_type'].isin(['IP_ADDRESS', 'EMAIL_ADDRESS', 'CATEGORY', 'NUMERIC' ]))]['feature_name'].to_list()

    # -- label schema: the rarer label is treated as fraud, the more common one as legitimate --
    label_counts = df["EVENT_LABEL"].value_counts()
    label_map = {
        'FRAUD' : [label_counts.idxmin()],
        'LEGIT' : [label_counts.idxmax()]
    }

    trainingDataSchema = {
        'modelVariables' : model_variables,
        'labelSchema'    : {
            'labelMapper' : label_map
        }
    }
    
    
    print("--- summary stats ---")
    print(df_stats)
    print("\n")
    print("--- event variables ---")
    print(event_variables)
    print("\n")
    print("--- event labels ---")
    print(event_labels)
    print("\n")
    print("--- training data schema ---")
    print(trainingDataSchema)
    print("\n")
    
    return df_stats, trainingDataSchema, event_variables, event_labels

# -- connect to S3, download the file, and load it into a pandas DataFrame --
s3   = boto3.resource('s3')
obj  = s3.Object(S3_BUCKET, S3_FILE)
body = obj.get()['Body']
df   = pd.read_csv(body)

# -- call profiling function -- 
df_stats, trainingDataSchema, eventVariables, eventLabels = summary_stats(df)


--- summary stats ---
54    m_z_w_y_t   object     31        7   132        31    0.8098       0.0429     CATEGORY       EXCLUDE, GT 50% MISSING
55    m_f_e_y_z   object     63        1   100        63    0.6135       0.0061     CATEGORY       EXCLUDE, GT 50% MISSING
56    g_f_l_j_m   object     63        2   100        63    0.6135       0.0123     CATEGORY       EXCLUDE, GT 50% MISSING
57    x_n_d_i_o   object     63        2   100        63    0.6135       0.0123     CATEGORY       EXCLUDE, GT 50% MISSING
58    v_z_h_g_g   object     79        3    84        79    0.5153       0.0184     CATEGORY       EXCLUDE, GT 50% MISSING
59    c_l_w_a_l   object     58        2   105        58    0.6442       0.0123     CATEGORY       EXCLUDE, GT 50% MISSING
61    j_h_i_t_d   object     34        2   129        34    0.7914       0.0123     CATEGORY       EXCLUDE, GT 50% MISSING
62    l_g_x_o_s   object     34        2   129        34    0.7914       0.0123     CATEGORY       EXCLUDE, GT 50% MI

### 3. Create Variables
-----

<div class="alert alert-info"> 💡 <strong> Create Variables. </strong>

The following section will automatically create your modeling input variables and your model scoring variable for you. 

</div>
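
Each feature type from the profiling step maps to a fixed combination of data type, default value, and variable type when the variable is created. The sketch below collects that mapping in one table, using the same values as the `create_variable` calls in this notebook. `VARIABLE_SETTINGS` and `variable_request` are hypothetical names introduced only for illustration.

```python
# Settings per feature type, mirroring the create_variable calls in this notebook.
VARIABLE_SETTINGS = {
    "NUMERIC": {"dataType": "FLOAT", "defaultValue": "0.0", "variableType": "NUMERIC"},
    "CATEGORY": {"dataType": "STRING", "defaultValue": "<unknown>", "variableType": "CATEGORICAL"},
    "IP_ADDRESS": {"dataType": "STRING", "defaultValue": "<unknown>", "variableType": "IP_ADDRESS"},
    "EMAIL_ADDRESS": {"dataType": "STRING", "defaultValue": "<unknown>", "variableType": "EMAIL_ADDRESS"},
}


def variable_request(feature_name, feature_type):
    """Build keyword arguments for creating one event variable (hypothetical helper)."""
    settings = VARIABLE_SETTINGS[feature_type]
    return {
        "name": feature_name,
        "dataSource": "EVENT",
        "description": feature_name,
        **settings,
    }
```

A request built this way could be passed on as `client.create_variable(**variable_request(name, feature_type))`.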

In [100]:
df_stats.loc[(df_stats['feature_type'].isin(['IP_ADDRESS', 'EMAIL_ADDRESS']))]

Unnamed: 0,feature_name,dtype,count,nunique,null,not_null,null_pct,nunique_pct,feature_type,feature_warning


In [109]:
# --- no changes just run this code block ---
def create_label(df, FRAUD_LABEL):
    """
    Returns a dictionary for the model labelSchema, by identifying the rare event as fraud / and common as not-fraud 
    
    Arguments:
    df          -- input dataframe 
    FRAUD_LABEL -- the name of the field that contains fraud label  
    
    Returns:
    labelSchema -- a dictionary containing labelKey & labelMapper 
    """
    label_summary = df[FRAUD_LABEL].value_counts()
    labelSchema = {'labelKey': FRAUD_LABEL,
                   "labelMapper" : { "FRAUD": [str(label_summary.idxmin())], 
                                     "LEGIT": [str(label_summary.idxmax())]}
                  }
    client.put_label(
                name = str(label_summary.idxmin()),
                description = 'FRAUD')
    
    client.put_label(
                name = str(label_summary.idxmax()),
                description = 'LEGIT')
    return labelSchema
    
# -- function to create all your variables --- 
def create_variables(df_stats, MODEL_NAME):
    """
    Returns a list of model input variables. For each variable it checks whether
    the variable already exists and, if not, adds it to Fraud Detector 
    
    Arguments: 
    df_stats   -- summary statistics DataFrame returned by summary_stats; its feature_type
                  column decides whether a variable is created as enrichment, numeric or categorical 
    MODEL_NAME -- model name, used to name the model score variable 
    
    Returns:
    variable_list -- a list of variable dictionaries 
    
    """
    enrichment_features = df_stats.loc[(df_stats['feature_type'].isin(['IP_ADDRESS', 'EMAIL_ADDRESS']))].to_dict(orient="records")
    numeric_features = df_stats.loc[(df_stats['feature_type'].isin(['NUMERIC']))]['feature_name'].to_dict()
    categorical_features = df_stats.loc[(df_stats['feature_type'].isin(['CATEGORY']))]['feature_name'].to_dict()
    
    variable_list = []
    # -- first do the enrichment features
    for feature in enrichment_features: 
        variable_list.append( {'name' : feature['feature_name']})
        try:
            resp = client.get_variables(name=feature['feature_name'])
        except Exception:  # lookup failed, so assume the variable does not exist yet
            print("Creating variable: {0}".format(feature['feature_name']))
            resp = client.create_variable(
                    name = feature['feature_name'],
                    dataType = 'STRING',
                    dataSource ='EVENT',
                    defaultValue = '<unknown>', 
                    description = feature['feature_name'],
                    variableType = feature['feature_type'] )
                
               
    # -- check and update the numeric features 
    for feature in numeric_features: 
        variable_list.append( {'name' : numeric_features[feature]})
        try:
            resp = client.get_variables(name=numeric_features[feature])
        except Exception:  # lookup failed, so assume the variable does not exist yet
            print("Creating variable: {0}".format(numeric_features[feature]))
            resp = client.create_variable(
                    name = numeric_features[feature],
                    dataType = 'FLOAT',
                    dataSource ='EVENT',
                    defaultValue = '0.0', 
                    description = numeric_features[feature],
                    variableType = 'NUMERIC' )
             
    # -- check and update the categorical features 
    for feature in categorical_features: 
        variable_list.append( {'name' : categorical_features[feature]})
        try:
            resp = client.get_variables(name=categorical_features[feature])
        except Exception:  # lookup failed, so assume the variable does not exist yet
            print("Creating variable: {0}".format(categorical_features[feature]))
            resp = client.create_variable(
                    name = categorical_features[feature],
                    dataType = 'STRING',
                    dataSource ='EVENT',
                    defaultValue = '<unknown>', 
                    description = categorical_features[feature],
                    variableType = 'CATEGORICAL' )
    
    # -- create a model score feature  
    model_feature = "{0}insightscore".format(MODEL_NAME)  
    # variable_list.append( {'name' : model_feature})
    try:
        resp = client.get_variables(name=model_feature)
    except Exception:  # lookup failed, so assume the variable does not exist yet
        print("Creating variable: {0}".format(model_feature))
        resp = client.create_variable(
                name = model_feature,
                dataType = 'FLOAT',
                dataSource ='EXTERNAL_MODEL_SCORE',
                defaultValue = '0.0', 
                description = model_feature,
                variableType = 'NUMERIC' )
    
    return variable_list


model_variables = create_variables(df_stats, MODEL_NAME)
print("\n --- model variable dict --")
print(model_variables)


model_label = create_label(df, "EVENT_LABEL")
print("\n --- model label schema dict --")
print(model_label)

Creating variable: fraudmodel20201209insightscore

 --- model variable dict --
[{'name': 'r_y_d_e_r'}, {'name': 'j_l_q_m_p'}, {'name': 'v_o_g_p_t'}, {'name': 'a_x_w_d_r'}, {'name': 'j_k_p_g_w'}, {'name': 'f_u_y_f_q'}, {'name': 'p_x_x_d_o'}, {'name': 'e_z_h_o_k'}, {'name': 'm_q_m_p_h'}, {'name': 'u_x_p_x_p'}, {'name': 'n_g_i_d_g'}, {'name': 'o_r_e_q_p'}, {'name': 'z_l_u_c_r'}, {'name': 'w_p_l_z_k'}, {'name': 'o_p_o_f_n'}, {'name': 'w_v_t_p_w'}, {'name': 'a_m_z_w_x'}, {'name': 't_x_d_q_g'}, {'name': 'o_j_e_v_b'}, {'name': 'b_n_v_z_o'}, {'name': 'h_l_d_m_a'}, {'name': 'r_v_b_b_i'}, {'name': 'l_k_q_x_w'}, {'name': 'p_x_a_f_u'}, {'name': 'y_w_o_n_z'}, {'name': 'r_a_j_g_w'}, {'name': 'h_t_v_r_a'}, {'name': 'm_t_h_d_u'}, {'name': 'l_r_n_t_p'}, {'name': 'd_m_s_y_h'}, {'name': 'd_w_n_n_p'}, {'name': 'c_o_g_d_g'}, {'name': 'g_y_r_w_f'}, {'name': 'u_k_z_h_u'}, {'name': 'q_v_z_i_u'}, {'name': 'l_c_r_a_b'}, {'name': 'o_o_u_j_l'}, {'name': 'k_s_e_p_a'}, {'name': 'x_c_e_u_m'}, {'name': 'i_c_f_u_f'}, 

### 4. Create Entity and Event Types
-----
    
<div class="alert alert-info"> 💡 <strong> Entity and Event. </strong>
    
The following code block will automatically create your entity and event types for you.

</div>

In [113]:
# --- no changes just run this code block ---
response = client.put_entity_type(
    name        = ENTITY_TYPE,
    description = ENTITY_DESC
)
print("-- create entity --")
print(response)


response = client.put_event_type (
    name           = EVENT_TYPE,
    eventVariables = eventVariables,
    labels         = eventLabels,
    entityTypes    = [ENTITY_TYPE])
print("-- create event type --")
print(response)

-- create entity --
{'ResponseMetadata': {'RequestId': '9321b84c-487d-4f4b-9939-c879a30be5cb', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Wed, 09 Dec 2020 05:31:38 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': '9321b84c-487d-4f4b-9939-c879a30be5cb'}, 'RetryAttempts': 0}}
-- create event type --
{'ResponseMetadata': {'RequestId': '23d2eef9-1be3-4525-beb9-e592bb182b0d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Wed, 09 Dec 2020 05:31:38 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': '23d2eef9-1be3-4525-beb9-e592bb182b0d'}, 'RetryAttempts': 0}}


In [114]:
len (eventVariables)

81

### 5. Create & Train your Model
-----
    
<div class="alert alert-info"> 💡 <strong> Train Model. </strong>

The following section will automatically train and activate your model for you. 

</div>
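
Training and activation both take a while, so the notebook polls the model status in a loop. A reusable variant with a timeout might look like the sketch below. `wait_for_status` is a hypothetical helper, and the default three-hour timeout is an arbitrary assumption rather than a service limit.

```python
import time


def wait_for_status(get_status, done, poll_seconds=60,
                    timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll get_status() until done(status) is true.

    Raises TimeoutError if the status is still not done after timeout_seconds.
    """
    waited = 0
    while True:
        status = get_status()
        if done(status):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("still {0} after {1} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

For training, `get_status` could be `lambda: client.get_model_version(modelId=MODEL_NAME, modelType="ONLINE_FRAUD_INSIGHTS", modelVersionNumber="1.0")["status"]` and `done` could be `lambda s: s != "TRAINING_IN_PROGRESS"`. The timeout keeps the cell from looping forever if the status never changes.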

In [115]:
# --- no changes; just run this code block. ---

# -- create our model --
response = client.create_model(
   description   =  MODEL_DESC,
   eventTypeName = EVENT_TYPE,
   modelId       = MODEL_NAME,
   modelType   = 'ONLINE_FRAUD_INSIGHTS')

print("-- initalize model --")
print(response)
# -- the model is initialized; creating a model version starts training -- 
response = client.create_model_version(
    modelId     = MODEL_NAME,
    modelType   = 'ONLINE_FRAUD_INSIGHTS',
    trainingDataSource = 'EXTERNAL_EVENTS',
    trainingDataSchema = trainingDataSchema,
    externalEventsDetail = {
        'dataLocation'     : S3_FILE_LOC,
        'dataAccessRoleArn': ARN_ROLE
    }
)
print("-- model training --")
print(response)


# -- model training takes time, we'll loop until it's complete  -- 
print("-- wait for model training to complete --")
stime = time.time()
while True:
    clear_output(wait=True)
    response = client.get_model_version(modelId=MODEL_NAME, modelType = "ONLINE_FRAUD_INSIGHTS", modelVersionNumber = '1.0')
    if response['status'] != 'TRAINING_IN_PROGRESS':
        print("Model status : " +  response['status'])
        break
    print(f"current progress: {(time.time() - stime)/60:3.3} minutes")
    time.sleep(60)  # -- sleep for 60 seconds 
        
etime = time.time()

# -- summarize -- 
print("\n --- model training complete  --")
print("Elapsed time : %s" % (etime - stime) + " seconds \n"  )
print(response)


-- initalize model --
{'ResponseMetadata': {'RequestId': 'f33d0e0b-d4f7-4bef-8e26-322f48ac2675', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Wed, 09 Dec 2020 05:31:46 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': 'f33d0e0b-d4f7-4bef-8e26-322f48ac2675'}, 'RetryAttempts': 0}}


ValidationException: An error occurred (ValidationException) when calling the CreateModelVersion operation: The CSV header does not contain the necessary variables
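
The `ValidationException` above points at the CSV header. One plausible cause, assuming Fraud Detector expects both an `EVENT_TIMESTAMP` and an `EVENT_LABEL` column in the training file, is that the header printed earlier contains `EVENT_LABEL` but no `EVENT_TIMESTAMP`. The sketch below checks a header before training; `missing_columns` is a hypothetical helper and the sample text is a shortened stand-in for the real file.

```python
import csv
import io

# Assumed required columns for a training CSV.
REQUIRED_COLUMNS = ("EVENT_TIMESTAMP", "EVENT_LABEL")


def missing_columns(csv_text, model_variables):
    """Return required columns and model variables that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = list(REQUIRED_COLUMNS) + list(model_variables)
    return [name for name in expected if name not in header]


# Shortened stand-in for the training file: two features plus the label.
sample = "r_y_d_e_r,j_l_q_m_p,EVENT_LABEL\n-1.05,-2.15,0\n"
print(missing_columns(sample, ["r_y_d_e_r", "j_l_q_m_p"]))  # ['EVENT_TIMESTAMP']
```

If the check reports missing columns, the file needs to be corrected and uploaded again before `create_model_version` is retried.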

In [None]:
response = client.update_model_version_status (
    modelId = MODEL_NAME,
    modelType = 'ONLINE_FRAUD_INSIGHTS',
    modelVersionNumber = '1.0',
    status = 'ACTIVE'
)
print("-- activating model --")
print(response)

#-- wait until model is active 
print("--- waiting until model status is active ")
stime = time.time()
while True:
    clear_output(wait=True)
    response = client.get_model_version(modelId=MODEL_NAME, modelType = "ONLINE_FRAUD_INSIGHTS", modelVersionNumber = '1.0')
    if response['status'] == 'ACTIVE':
        print("Model status : " +  response['status'])
        break
    print(f"current progress: {(time.time() - stime)/60:3.3} minutes")
    time.sleep(60)  # sleep for 1 minute 
        
etime = time.time()
print("Elapsed time : %s" % (etime - stime) + " seconds \n"  )
print(response)

In [None]:
# -- model performance summary -- 
# -- fetch the model version details once and reuse them --
model_version = client.describe_model_versions(
    modelId= MODEL_NAME,
    modelVersionNumber='1.0',
    modelType='ONLINE_FRAUD_INSIGHTS',
    maxResults=10
)['modelVersionDetails'][0]

training_metrics = model_version['trainingResult']['trainingMetrics']

# -- note: this rebinds the name `auc` imported from sklearn.metrics --
auc = training_metrics['auc']

df_model = pd.DataFrame(training_metrics['metricDataPoints'])


plt.figure(figsize=(10,10))
plt.plot(df_model["fpr"], df_model["tpr"], color='darkorange',
         lw=2, label='ROC curve (area = %0.3f)' % auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title( MODEL_NAME + ' ROC Chart')
plt.legend(loc="lower right",fontsize=12)
plt.axvline(x = 0.02 ,linewidth=2, color='r')
plt.axhline(y = 0.73 ,linewidth=2, color='r')
plt.show()
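
The red lines in the chart mark one operating point on the ROC curve (a false positive rate of 0.02 and a true positive rate of 0.73). The sketch below shows one way to look up such a point from a list of metric data points: it picks the highest true positive rate whose false positive rate stays within a budget. The function name `operating_point` and the sample values are made up for illustration; real values come from `df_model`.

```python
def operating_point(points, max_fpr):
    """Return the point with the highest TPR whose FPR does not exceed max_fpr.

    Returns None when no point satisfies the FPR budget.
    """
    eligible = [p for p in points if p["fpr"] <= max_fpr]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p["tpr"])


# Made-up data points, shaped like the fpr/tpr/threshold columns of df_model.
sample_points = [
    {"fpr": 0.01, "tpr": 0.60, "threshold": 950.0},
    {"fpr": 0.02, "tpr": 0.73, "threshold": 900.0},
    {"fpr": 0.05, "tpr": 0.85, "threshold": 800.0},
    {"fpr": 0.10, "tpr": 0.92, "threshold": 650.0},
]

point = operating_point(sample_points, max_fpr=0.02)
print(point)
```

With the notebook's data, the same function could be called as `operating_point(df_model.to_dict("records"), 0.02)`.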

### 6. Create Detector, generate Rules and assemble your Detector

-----
    
<div class="alert alert-info"> 💡 <strong> Generate Rules, Create and Publish a Detector. </strong>
    
The following section automatically generates a number of fraud, investigate, and approve rules based on the false positive rate and score thresholds of your model. These are only example rules; we recommend fine-tuning your rules for your specific business use case.
</div>

In [None]:
# -- initialize your detector -- 
response = client.put_detector(detectorId  = DETECTOR_NAME, 
                               description = DETECTOR_DESC, 
                               eventTypeName = EVENT_TYPE )

print(response)

In [104]:
# -- make rules -- 
model_stat = df_model.round(decimals=2)  

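# keep one row per rounded fpr (the one with the highest threshold); copy so new columns can be added safely
m = model_stat.loc[model_stat.groupby(["fpr"])["threshold"].idxmax()].copy()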

def make_rule(x):
    rule = ""
    if x['fpr'] <= 0.05: 
        rule = "${0}_insightscore > {1}".format(MODEL_NAME,x['threshold'])
    if x['fpr'] == 0.06:
        rule = "${0}_insightscore <= {1}".format(MODEL_NAME,x['threshold_prev'])
    return rule
    
m["threshold_prev"] = m['threshold'].shift(1)
m['rule'] = m.apply(lambda x: make_rule(x), axis=1)

m['outcome'] = "approve"
m.loc[m['fpr'] <= 0.03, "outcome"] = "fraud"
m.loc[(m['fpr'] > 0.03) & (m['fpr'] <= 0.05), "outcome"] = "investigate"

print (" --- score thresholds 1% to 6% --- ")
print(m[["fpr", "tpr", "threshold", "rule", "outcome"]].loc[(m['fpr'] > 0.0 ) & (m['fpr'] <= 0.06)].reset_index(drop=True))


NameError: name 'df_model' is not defined

In [None]:
# -- create outcomes -- 
def create_outcomes(outcomes):
    """ create Fraud Detector Outcomes 
    
    """   
    for outcome in outcomes:
        print("creating outcome variable: {0} ".format(outcome))
        response = client.put_outcome(
                          name=outcome,
                          description=outcome)

# -- get distinct outcomes 
outcomes = m["outcome"].unique().tolist()

create_outcomes(outcomes)

In [None]:
rule_set = m[(m["fpr"] > 0.0) & (m["fpr"] <= 0.06)][["outcome", "rule"]].to_dict('records')
rule_list = []
for i, rule in enumerate(rule_set):
    ruleId = "rule{0}_{1}".format(i, MODEL_NAME)
    rule_list.append({"ruleId": ruleId, 
                      "ruleVersion" : '1',
                      "detectorId"  : DETECTOR_NAME
        
    })
    print("creating rule: {0}: IF {1} THEN {2}".format(ruleId, rule["rule"], rule['outcome']))
    try:
        response = client.create_rule(
            ruleId = ruleId,
            detectorId = DETECTOR_NAME,
            expression = rule['rule'],
            language = 'DETECTORPL',
            outcomes = [rule['outcome']]
            )
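    except Exception as err:
        # creation can fail for reasons other than a duplicate rule, so show the actual error
        print("could not create rule {0} (it may already exist in this detector): {1}".format(ruleId, err))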
rule_list    

In [None]:
response = client.create_detector_version(
    detectorId = DETECTOR_NAME,
    rules = rule_list,
    modelVersions = [{"modelId":MODEL_NAME, 
                      "modelType" : "ONLINE_FRAUD_INSIGHTS",
                      "modelVersionNumber" : "1.0"}],
    ruleExecutionMode = 'FIRST_MATCHED'
    )

print("\n -- detector created -- ")
print(response) 
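
With `ruleExecutionMode = 'FIRST_MATCHED'`, rules are evaluated in the order they are listed and evaluation stops at the first rule that matches, so the order of `rule_list` matters. The toy simulation below illustrates the effect. It uses plain Python predicates rather than the detector's rule language, and the thresholds are made up.

```python
def first_matched(rules, score):
    """Return the outcome of the first rule whose condition holds, else None."""
    for condition, outcome in rules:
        if condition(score):
            return outcome
    return None


# Toy rules as (predicate, outcome) pairs; thresholds are illustrative only.
ordered_rules = [
    (lambda s: s > 900, "fraud"),
    (lambda s: s > 800, "investigate"),
    (lambda s: s <= 800, "approve"),
]

results = [first_matched(ordered_rules, score) for score in (950, 850, 400)]
print(results)

# Listing the broader rule first hides the stricter one behind it.
misordered = [ordered_rules[1], ordered_rules[0], ordered_rules[2]]
shadowed = first_matched(misordered, 950)
print(shadowed)
```

In the misordered list, a score of 950 is labeled "investigate" because the broader rule matches before the stricter "fraud" rule is reached.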


In [None]:
response = client.update_detector_version_status(
    detectorId= DETECTOR_NAME,
    detectorVersionId='1',
    status='ACTIVE'
)
print("\n -- detector activated -- ")
print(response)

### 7. Make Predictions 
-----
    
<div class="alert alert-info"> 💡 <strong> Make Predictions. </strong>
    
The following section applies your detector to the first 10 records in your training dataset. To score more records, change record_count; to score the full training dataset, set it as follows:

</div>

```python

record_count = df.shape[0]

```

In [None]:
# -- this will apply your detector to the first 10 records of your training dataset. -- 
record_count = 10 
predicted_dat = []
pred_data = df[eventVariables].head(record_count).astype(str).to_dict(orient='records')
for rec in pred_data:
    eventId = uuid.uuid1()
    pred = client.get_event_prediction(detectorId=DETECTOR_NAME, 
                                       detectorVersionId='1',
                                       eventId = str(eventId),
                                       eventTypeName = EVENT_TYPE,
                                       eventTimestamp = timestampStr, 
                                       entities = [{'entityType': ENTITY_TYPE, 'entityId':str(eventId.int)}],
                                       eventVariables=rec) 
    
    rec["score"]   = pred['modelScores'][0]['scores']["{0}_insightscore".format(MODEL_NAME)]
    rec["outcome"] = pred['ruleResults'][0]['outcomes']
    predicted_dat.append(rec)
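
The loop above reads the score from `pred['modelScores'][0]` and the outcomes from `pred['ruleResults'][0]`, which assumes both lists are non-empty. A slightly more defensive sketch is shown below. The helper name `summarize_prediction`, the model name `sample_fraud_model`, and the sample response are hypothetical; the sample contains only the fields the loop above reads, and the case of an empty `ruleResults` list (no rule matched) is an assumption.

```python
def summarize_prediction(pred, model_name):
    """Extract the model score and a flat list of outcomes from a response dict."""
    score_key = f"{model_name}_insightscore"
    score = None
    for model_score in pred.get("modelScores", []):
        scores = model_score.get("scores", {})
        if score_key in scores:
            score = scores[score_key]
            break
    # Flatten outcomes across all rule results; stays empty if none are present.
    outcomes = [outcome
                for result in pred.get("ruleResults", [])
                for outcome in result.get("outcomes", [])]
    return {"score": score, "outcomes": outcomes}


# Hypothetical response with only the fields used in this notebook.
sample_pred = {
    "modelScores": [{"scores": {"sample_fraud_model_insightscore": 912.0}}],
    "ruleResults": [{"ruleId": "rule0_sample_fraud_model", "outcomes": ["fraud"]}],
}

summary = summarize_prediction(sample_pred, "sample_fraud_model")
print(summary)
```

Returning `None` and an empty list instead of raising an `IndexError` keeps a long scoring loop running when a single response is missing a score or a matched rule.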
    

In [None]:
# -- review your predictions -- 
predictions = pd.DataFrame(predicted_dat)
predictions.head()

### Optionally Write Predictions to File

<div class="alert alert-info"> 💡 <strong> Write Predictions. </strong>

- You can write your prediction dataset to a CSV or Excel file to review the predictions manually
- Simply add a cell below and copy in the following code

</div>



```python

# -- optionally write predictions to a CSV file -- 
predictions.to_csv(MODEL_NAME + ".csv", index=False)
# -- or to an XLSX file -- 
predictions.to_excel(MODEL_NAME + ".xlsx", index=False)

```