# Anomaly Detection: Flagging Credit Card Fraud

Anomaly detection is the practice of examining data points and flagging rare occurrences that look suspicious because they deviate from an established pattern of behavior. It is a complex yet interesting problem within data science. Predicting fraudulent credit card transactions is an excellent way to learn about anomaly detection and to pick up useful techniques for tackling it.

This project will cover:

__1. Data Transformations__ 

__2. Feature Engineering__ 

__3. Handling Class Imbalance__

__4. Choosing Baseline Models__ 

__5. Performance Metrics__

## Background

First and foremost, we need to figure out what kind of problem we are trying to solve statistically. In fraud detection, each observation belongs to one of two classes: Fraudulent or Not Fraudulent. That makes this a binary classification problem, for which [logistic regression](https://www.sciencedirect.com/topics/computer-science/logistic-regression#:~:text=Logistic%20regression%20is%20a%20process,%2Fno%2C%20and%20so%20on) is a common baseline model. This is something to keep in mind as we build our model.

Helpful guides for this project:

- https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/BaselineFeatureTransformation.html
- https://machinelearningmastery.com/what-is-imbalanced-classification/

## Loading Data and Preprocessing the Data

The first order of business is making sure we have the necessary libraries and packages for our project. I did not have the imbalanced-learn library installed, so I had to install it for this project.

In [None]:
#!pip install graphviz
# math, sys, time, pickle, json, datetime and random ship with Python, so only third-party packages need installing
!pip install imbalanced-learn
!pip install pandas
!pip install scikit-learn
!pip install matplotlib
!pip install seaborn

In [1]:
import os
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import math
import sys
import time
import pickle
import json
import datetime
import random

#import sklearn
import sklearn
from sklearn import *

%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid', {'axes.facecolor': '0.9'})

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier,export_graphviz
from sklearn.datasets import make_regression

# For imbalanced learning
import imblearn

import warnings
warnings.filterwarnings('ignore')

Next, we read in our data set. At first glance the file looks like a plain text file. After opening it, you'll see that it is formatted as __JSON Lines__ (one JSON record per line), so we'll use the read_json function available in pandas with lines=True to read in our file and then take a look at our data.

In [2]:
df = pd.read_json('transactions.txt', lines=True)
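
To make the effect of lines=True concrete, here is a minimal sketch that parses two made-up records from an in-memory string instead of the real transactions.txt. The field values and the names sample and sample_df are assumptions for illustration only.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"accountNumber": 1, "transactionAmount": 10.5, "isFraud": false}\n'
    '{"accountNumber": 2, "transactionAmount": 99.0, "isFraud": true}\n'
)

# lines=True tells pandas to parse each line as a separate record.
sample_df = pd.read_json(sample, lines=True)

print(sample_df.shape)
print(sample_df.dtypes)
```

Each line becomes one row, and pandas infers a dtype per column in the same way it does for the full file.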

To get a better look at the variables in our data set and check that their data types make sense, we'll use the dtypes attribute in pandas to list all the columns in our dataframe.

In [3]:
df.dtypes

accountNumber                 int64
customerId                    int64
creditLimit                   int64
availableMoney              float64
transactionDateTime          object
transactionAmount           float64
merchantName                 object
acqCountry                   object
merchantCountryCode          object
posEntryMode                 object
posConditionCode             object
merchantCategoryCode         object
currentExpDate               object
accountOpenDate              object
dateOfLastAddressChange      object
cardCVV                       int64
enteredCVV                    int64
cardLast4Digits               int64
transactionType              object
echoBuffer                   object
currentBalance              float64
merchantCity                 object
merchantState                object
merchantZip                  object
cardPresent                    bool
posOnPremises                object
recurringAuthInd             object
expirationDateKeyInMatch    

Let's take a look at some of our columns of interest to see their values and distributions. We'll start with our dependent variable and true column of interest, __isFraud__. We have a total of 786,363 transactions, of which 12,417 (about 1.6%) are fraudulent. This makes the dataset imbalanced, since fraudulent and genuine transactions are far from evenly distributed. We will deal with this issue later in the project. Next, let's look at our customer, country, credit limit, and transaction amount values.

In [4]:
print('Number of F/T:',df.groupby(['isFraud']).ngroups)
print(df['isFraud'].value_counts())

Number of F/T: 2
isFraud
False    773946
True      12417
Name: count, dtype: int64


In [5]:
12417/773946

0.016043754990658264
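
The figure above is the ratio of fraudulent to genuine transactions. The share of fraud among all transactions is slightly different, so the short sketch below computes both from the counts printed earlier. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
n_genuine = 773946
n_fraud = 12417
n_total = n_genuine + n_fraud

# Ratio of fraudulent to genuine transactions (what the cell above computes).
fraud_to_genuine = n_fraud / n_genuine

# Share of fraudulent transactions among all transactions.
fraud_share = n_fraud / n_total

print(f"Total transactions: {n_total}")
print(f"Fraud-to-genuine ratio: {fraud_to_genuine:.4f}")
print(f"Fraud share of all transactions: {fraud_share:.4f}")
```

Either way, fewer than 2 in 100 transactions are fraudulent, which is why class imbalance needs special handling later.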

In [6]:
print('Number of Customers:',df.groupby(['accountNumber','customerId']).ngroups)
print('Number of Unique Countries:',df.groupby(['merchantCountryCode']).ngroups)
print(df['merchantCountryCode'].value_counts())

Number of Customers: 5000
Number of Unique Countries: 5
merchantCountryCode
US     778511
MEX      3143
CAN      2426
PR       1559
          724
Name: count, dtype: int64


In [7]:
print('Min Credit Limit:',df['creditLimit'].min())
print('Min Transaction Amount:',df['transactionAmount'].min())
print('Max Credit Limit:',df['creditLimit'].max())
print('Max Transaction Amount:',df['transactionAmount'].max())

Min Credit Limit: 250
Min Transaction Amount: 0.0
Max Credit Limit: 50000
Max Transaction Amount: 2011.54


## Question 2: Plot
Plot a histogram of the processed amounts of each transaction, the transactionAmount column.

Report any structure you find and any hypotheses you have about that structure.

## Question 3: Data Wrangling - Duplicate Transactions
You will notice a number of what look like duplicated transactions in the data set. One type of duplicated transaction is a reversed transaction, where a purchase is followed by a reversal. Another example is a multi-swipe, where a vendor accidentally charges a customer's card multiple times within a short time span.

Can you programmatically identify reversed and multi-swipe transactions?

What total number of transactions and total dollar amount do you estimate for the reversed transactions? For the multi-swipe transactions? (please consider the first transaction to be "normal" and exclude it from the number of transaction and dollar amount counts)

Did you find anything interesting about either kind of transaction?
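
One possible way to approach the multi-swipe part of this question is sketched below on a made-up dataframe. It assumes that a multi-swipe is a repeat of the same account, merchant, and amount within a short gap. The 5-minute threshold, the toy rows, and the is_multi_swipe column are assumptions, not something the data set defines.

```python
import pandas as pd

# Made-up transactions: account 1 is charged twice within 90 seconds.
toy = pd.DataFrame({
    "accountNumber": [1, 1, 1, 2],
    "merchantName": ["Uber", "Uber", "Uber", "Uber"],
    "transactionAmount": [20.0, 20.0, 20.0, 20.0],
    "transactionDateTime": pd.to_datetime([
        "2016-01-01 10:00:00", "2016-01-01 10:01:30",
        "2016-01-01 18:00:00", "2016-01-01 10:00:10",
    ]),
})

# Assumed threshold for what counts as "a short time span".
MAX_GAP = pd.Timedelta(minutes=5)

keys = ["accountNumber", "merchantName", "transactionAmount"]
toy = toy.sort_values(keys + ["transactionDateTime"])

# Time since the previous identical charge (NaT for the first one in a group).
gap = toy.groupby(keys)["transactionDateTime"].diff()

# The first charge in a group is treated as normal and never flagged.
toy["is_multi_swipe"] = gap.notna() & (gap <= MAX_GAP)

n_multi_swipe = int(toy["is_multi_swipe"].sum())
amount_multi_swipe = toy.loc[toy["is_multi_swipe"], "transactionAmount"].sum()
print(n_multi_swipe, amount_multi_swipe)
```

Because the first charge in each group has no previous charge to compare against, it is excluded from both the count and the dollar total, as the question asks.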

## Question 4: Model
Fraud is a problem for any bank. Fraud can take many forms, whether it is someone stealing a single credit card, to large batches of stolen credit card numbers being used on the web, or even a mass compromise of credit card numbers stolen from a merchant via tools like credit card skimming devices.

Each of the transactions in the dataset has a field called isFraud. Please build a predictive model to determine whether a given transaction will be fraudulent or not. Use as much of the data as you like (or all of it).

Provide an estimate of performance using an appropriate sample, and show your work.

Please explain your methodology (modeling algorithm/method used and why, what features/data you found useful, what questions you have, and what you would do next with more time)



## Data Preprocessing Steps
Let's now preprocess our data, which means inspecting it and changing some of its values to make the dataset more digestible for the model we are going to create down the line. Some examples of this include:
1. Encoding some of our categorical values (turning categorical values into integers)
2. Creating new values from existing variables, such as the datetime columns
3. Ensuring we have an appropriate index for our dataset

We'll first create an "ID" column so that every transaction has a unique identifier that can act as an index. None of our other variables guarantees uniqueness, so we'll create one.

In [8]:
df["ID"] = df.index + 1

Next we need to convert all our boolean columns (the ones holding *True*/*False* values) to integers.
We also want to make sure our datetime columns are parsed correctly. After applying these conversions, let's check our dataframe.

In [9]:
# Convert all our boolean columns to integers
df[['cardPresent','expirationDateKeyInMatch','isFraud']]=df[['cardPresent','expirationDateKeyInMatch','isFraud']].astype(int)
# Convert all our date columns to the pandas datetime format
df[['transactionDateTime','currentExpDate','accountOpenDate','dateOfLastAddressChange']] = df[['transactionDateTime','currentExpDate','accountOpenDate','dateOfLastAddressChange']].apply(pd.to_datetime)
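
As a quick check of what these two conversions do, here is a small sketch on a made-up dataframe with a subset of the column names. The values are invented for illustration.

```python
import pandas as pd

# Made-up rows using a few of the real column names.
toy = pd.DataFrame({
    "cardPresent": [True, False],
    "isFraud": [False, True],
    "transactionDateTime": ["2016-08-13T14:27:32", "2016-10-11T05:05:54"],
    "accountOpenDate": ["2015-03-14", "2015-03-14"],
})

bool_cols = ["cardPresent", "isFraud"]
date_cols = ["transactionDateTime", "accountOpenDate"]

# True/False become 1/0.
toy[bool_cols] = toy[bool_cols].astype(int)

# Strings become pandas datetimes, one column at a time.
toy[date_cols] = toy[date_cols].apply(pd.to_datetime)

print(toy.dtypes)
```

After the conversion, the date columns support the .dt accessor, which the later time-based features rely on.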

In [10]:
df.head(25)

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,echoBuffer,currentBalance,merchantCity,merchantState,merchantZip,cardPresent,posOnPremises,recurringAuthInd,expirationDateKeyInMatch,isFraud,ID
0,737265056,737265056,5000,5000.0,2016-08-13 14:27:32,98.55,Uber,US,US,2,1,rideshare,2023-06-01,2015-03-14,2015-03-14,414,414,1803,PURCHASE,,0.0,,,,0,,,0,0,1
1,737265056,737265056,5000,5000.0,2016-10-11 05:05:54,74.51,AMC #191138,US,US,9,1,entertainment,2024-02-01,2015-03-14,2015-03-14,486,486,767,PURCHASE,,0.0,,,,1,,,0,0,2
2,737265056,737265056,5000,5000.0,2016-11-08 09:18:39,7.47,Play Store,US,US,9,1,mobileapps,2025-08-01,2015-03-14,2015-03-14,486,486,767,PURCHASE,,0.0,,,,0,,,0,0,3
3,737265056,737265056,5000,5000.0,2016-12-10 02:14:50,7.47,Play Store,US,US,9,1,mobileapps,2025-08-01,2015-03-14,2015-03-14,486,486,767,PURCHASE,,0.0,,,,0,,,0,0,4
4,830329091,830329091,5000,5000.0,2016-03-24 21:04:46,71.18,Tim Hortons #947751,US,US,2,1,fastfood,2029-10-01,2015-08-06,2015-08-06,885,885,3143,PURCHASE,,0.0,,,,1,,,0,0,5
5,830329091,830329091,5000,5000.0,2016-04-19 16:24:27,30.76,In-N-Out #422833,US,US,2,1,fastfood,2020-01-01,2015-08-06,2015-08-06,885,885,3143,PURCHASE,,0.0,,,,1,,,0,0,6
6,830329091,830329091,5000,5000.0,2016-05-21 14:50:35,57.28,Krispy Kreme #685312,US,US,2,1,fastfood,2020-05-01,2015-08-06,2015-08-06,885,885,3143,PURCHASE,,0.0,,,,1,,,0,0,7
7,830329091,830329091,5000,5000.0,2016-06-03 00:31:21,9.37,Shake Shack #968081,US,US,5,1,fastfood,2021-01-01,2015-08-06,2015-08-06,885,885,3143,PURCHASE,,0.0,,,,1,,,0,0,8
8,830329091,830329091,5000,4990.63,2016-06-10 01:21:46,523.67,Burger King #486122,,US,2,1,fastfood,2032-08-01,2015-08-06,2015-08-06,885,885,3143,PURCHASE,,9.37,,,,1,,,0,0,9
9,830329091,830329091,5000,5000.0,2016-07-11 10:47:16,164.37,Five Guys #510989,US,US,5,8,fastfood,2020-04-01,2015-08-06,2015-08-06,885,885,3143,PURCHASE,,0.0,,,,1,,,0,0,10


Now we want to create a column that tracks the elapsed time since the first transaction, which in effect groups our transactions by the day on which they occurred.
First we'll sort our transactions by date in ascending order to find the date of the very first transaction.

In [11]:
df.sort_values(by='transactionDateTime',ascending=True, inplace=True)

In [12]:
df.head(25)

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,posConditionCode,merchantCategoryCode,currentExpDate,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,echoBuffer,currentBalance,merchantCity,merchantState,merchantZip,cardPresent,posOnPremises,recurringAuthInd,expirationDateKeyInMatch,isFraud,ID
640789,419104777,419104777,50000,50000.0,2016-01-01 00:01:02,44.09,Washington Post,US,US,9,1,subscriptions,2028-03-01,2015-05-30,2015-05-30,837,837,5010,PURCHASE,,0.0,,,,0,,,0,0,640790
28946,674577133,674577133,5000,5000.0,2016-01-01 00:01:44,329.57,staples.com,US,US,9,8,online_retail,2024-10-01,2015-08-19,2015-08-19,430,430,1693,PURCHASE,,0.0,,,,0,,,0,0,28947
222211,958438658,958438658,20000,20000.0,2016-01-01 00:01:47,164.57,cheapfast.com,US,US,5,1,online_retail,2023-04-01,2013-07-20,2013-07-20,445,445,2062,PURCHASE,,0.0,,,,0,,,0,0,222212
470320,851126461,851126461,10000,10000.0,2016-01-01 00:02:04,122.83,discount.com,US,US,2,8,online_retail,2025-07-01,2014-10-18,2014-10-18,667,667,7359,PURCHASE,,0.0,,,,0,,,0,0,470321
704106,148963316,148963316,2500,2500.0,2016-01-01 00:02:19,0.0,Fast Repair,US,US,5,1,auto,2026-12-01,2013-12-12,2013-12-12,542,542,1785,ADDRESS_VERIFICATION,,0.0,,,,0,,,0,0,704107
727644,974901832,974901832,250,250.0,2016-01-01 00:03:47,24.56,staples.com,US,US,5,1,online_retail,2032-05-01,2012-05-29,2012-05-29,290,290,9744,PURCHASE,,0.0,,,,0,,,0,0,727645
310263,811942128,811942128,5000,5000.0,2016-01-01 00:04:10,20.45,sears.com,US,US,2,1,online_retail,2029-08-01,2015-05-23,2015-05-23,948,948,4888,PURCHASE,,0.0,,,,0,,,0,1,310264
240190,380680241,380680241,5000,5000.0,2016-01-01 00:06:17,96.68,Fresh Flowers,US,US,5,1,online_gifts,2023-08-01,2014-06-21,2014-06-21,869,869,593,PURCHASE,,0.0,,,,0,,,0,0,240191
305450,676919786,676919786,250,250.0,2016-01-01 00:06:46,146.57,Dairy Queen #766986,US,US,5,1,fastfood,2020-12-01,2015-08-11,2015-08-11,111,111,3690,PURCHASE,,0.0,,,,1,,,0,0,305451
622596,588383631,588383631,5000,5000.0,2016-01-01 00:07:03,227.62,discount.com,US,US,2,1,online_retail,2021-09-01,2012-11-15,2012-11-15,792,792,557,PURCHASE,,0.0,,,,0,,,0,0,622597


Now that we can see the date of our very first transaction, we can use it as the base date from which we calculate the elapsed time of every transaction.
We'll then create a new column named TX_Days, computed by subtracting the base date from the date of each transaction. After doing this we'll take a look at our dataframe to see the new column and its values.

In [None]:
# Use midnight of the first transaction's day so that TX_Days lines up with calendar dates
basedate = df['transactionDateTime'].min().normalize()
df['TX_Days'] = (df['transactionDateTime'] - basedate).dt.days
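
The choice of base timestamp affects which day bucket a transaction lands in. The sketch below uses made-up timestamps to compare subtracting the exact time of the first transaction with subtracting midnight of the first day. The names base_exact and base_midnight are illustrative.

```python
import pandas as pd

# Made-up transaction times around two midnight boundaries.
times = pd.Series(pd.to_datetime([
    "2016-01-01 00:01:02",
    "2016-01-01 23:59:59",
    "2016-01-02 00:00:30",
    "2016-01-03 12:00:00",
]))

base_exact = times.min()                  # exact time of the first transaction
base_midnight = times.min().normalize()   # midnight of the first day

# .dt.days keeps only the whole-day part of each elapsed time.
days_exact = (times - base_exact).dt.days
days_midnight = (times - base_midnight).dt.days

print(days_exact.tolist())     # [0, 0, 0, 2]
print(days_midnight.tolist())  # [0, 0, 1, 2]
```

With the exact timestamp as the base, the transaction at 00:00:30 on January 2 is still counted as day 0. With midnight as the base, the day numbers match calendar dates, which matters when the day number is later turned back into a date for file names.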

In [None]:
df

Now let's look at our earliest and latest transaction dates and day numbers to get an idea of how many days of transaction data we have. We can see that we have about a year's worth of transaction data in our data set.

In [None]:
print('Min Transaction Time:',df['transactionDateTime'].min(),'|','First Day:',df['TX_Days'].min())
print('Max Transaction Time:',df['transactionDateTime'].max(),'|','Last Day:',df['TX_Days'].max())

Next I'd like to add the day of the week on which each transaction occurred (Monday, Tuesday, ...) to the dataframe. This will be helpful later on when we engineer features for our models.

In [None]:
df['day_of_week'] = df['transactionDateTime'].dt.day_name()

In [None]:
df

Now let's look at our fraudulent transactions and some of their distributions. We'll put these rows in a new dataframe called Fraud.

In [None]:
Fraud = df[df['isFraud'] != 0]

In [None]:
Fraud.head(5)

In [None]:
print('Number of F/T:',Fraud.groupby(['isFraud']).ngroups)
print(Fraud['isFraud'].value_counts())

In [None]:
print('Number of cardPresent values:',Fraud.groupby(['cardPresent']).ngroups)
print(Fraud['cardPresent'].value_counts())

Now let's look at the transactions in our data set through a line graph. We will plot the number of transactions per day over the year, along with the number of fraudulent transactions and the number of cards with fraud per day, using our customerId and TX_Days values.

In [None]:
#function to get the stats we want
def get_stats(df):
    #Number of transactions per day
    nb_tx_per_day=df.groupby(['TX_Days'])['customerId'].count()
    #Number of fraudulent transactions per day
    nb_fraud_per_day=df.groupby(['TX_Days'])['isFraud'].sum()
    #Number of fraudulent cards per day
    nb_fraudcard_per_day=df[df['isFraud']>0].groupby(['TX_Days']).customerId.nunique()
    
    return (nb_tx_per_day,nb_fraud_per_day,nb_fraudcard_per_day)

(nb_tx_per_day,nb_fraud_per_day,nb_fraudcard_per_day)=get_stats(df)

n_days=len(nb_tx_per_day)
tx_stats=pd.DataFrame({"value":pd.concat([nb_tx_per_day/50,nb_fraud_per_day,nb_fraudcard_per_day])})
tx_stats['stat_type']=["nb_tx_per_day"]*n_days+["nb_fraud_per_day"]*n_days+["nb_fraudcard_per_day"]*n_days
tx_stats=tx_stats.reset_index()

In [None]:
sns.set(style='darkgrid')
sns.set(font_scale=1.4)

fraud_and_transactions_stats_fig = plt.gcf()

fraud_and_transactions_stats_fig.set_size_inches(15, 8)

sns_plot = sns.lineplot(x="TX_Days", y="value", data=tx_stats, hue="stat_type", hue_order=["nb_tx_per_day","nb_fraud_per_day","nb_fraudcard_per_day"], legend=False)

sns_plot.set_title('Total transactions, and number of fraudulent transactions \n and number of compromised cards per day', fontsize=20)
sns_plot.set(xlabel = "Number of days since beginning of data generation", ylabel="Number")

sns_plot.set_ylim([0,60])

labels_legend = ["# transactions per day (/50)", "# fraudulent txs per day", "# fraudulent cards per day"]

sns_plot.legend(loc='upper left', labels=labels_legend,bbox_to_anchor=(1.05, 1), fontsize=15)

Now that we have preprocessed our data and are ready for some rigorous feature engineering, we will pickle our data. As we saw before, this is a large data set with roughly 786,000 observations. Breaking it up into daily batches lets us analyze our transactions in smaller chunks, and pickling lets us easily reload the reformatted and preprocessed dataset without repeating all the steps we painstakingly took to get to this point. It will also help us load our data faster.
We will create a new directory called "pickled-data-raw" and write all of our batched transaction data into it. After running the code below, you should see a new folder next to this Jupyter notebook, containing one pickle file per day of transactions.

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"

if not os.path.exists(DIR_OUTPUT):
    os.makedirs(DIR_OUTPUT)

start_date = datetime.datetime.strptime("2016-01-01", "%Y-%m-%d")

for day in range(df.TX_Days.max()+1):
    
    transactions_day = df[df.TX_Days==day]
    
    date = start_date + datetime.timedelta(days=day)
    filename_output = date.strftime("%Y-%m-%d")+'.pkl'
    
    # Protocol=4 required for Google Colab
    transactions_day.to_pickle(DIR_OUTPUT+filename_output, protocol=4)

We should have about 365 files in this folder, one for each day of data.

In [None]:
lst = os.listdir("./pickled-data-raw/")
number_files = len(lst)
print (number_files)

## Feature Transformations

Now it's time to do some feature transformations. Before we do that, we need a function that retrieves our data. Then we will load some data to test our read_from_files function and to begin our feature transformations. We'll load in data from **2016-06-25** to **2016-07-15** and use that as our dataset to create our new features.

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"

In [None]:
# Parameters: the directory plus a begin date and an end date; every daily file dated between them (inclusive) is loaded and combined into one dataframe
def read_from_files(DIR_OUTPUT, begin_date, end_date):
    # Collect the daily files in the directory whose names fall inside the date range
    files = [os.path.join(DIR_OUTPUT, f) for f in os.listdir(DIR_OUTPUT) if f>=begin_date+'.pkl' and f<=end_date+'.pkl']
    # Holds one dataframe per file
    frames = []
    # Read each pickled file and append it to the list
    for f in files:
        df = pd.read_pickle(f)
        frames.append(df)
        del df
    df_final = pd.concat(frames)
    
    df_final=df_final.sort_values('ID')
    df_final.reset_index(drop=True,inplace=True)
    # Note: -1 marks missing values in real-world data, so replace it with 0
    df_final=df_final.replace([-1],0)
    
    return df_final
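
The file filter in read_from_files compares file names as plain strings. That works because names in YYYY-MM-DD form sort alphabetically in the same order as the dates they represent. A small sketch with made-up file names shows the idea.

```python
# Made-up directory listing; only the .pkl names follow the daily naming scheme.
filenames = [
    "2016-06-24.pkl", "2016-06-25.pkl", "2016-07-01.pkl",
    "2016-07-15.pkl", "2016-07-16.pkl", "notes.txt",
]

begin_date, end_date = "2016-06-25", "2016-07-15"

# Plain string comparison is enough because ISO dates sort chronologically.
selected = sorted(
    f for f in filenames
    if f.endswith(".pkl") and begin_date + ".pkl" <= f <= end_date + ".pkl"
)

print(selected)  # ['2016-06-25.pkl', '2016-07-01.pkl', '2016-07-15.pkl']
```

The extra endswith check is an optional safeguard added in this sketch. The function above relies on the date comparison alone.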

In [None]:
def save_object(obj, filename):
    with open(filename, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"

begin_date = '2016-06-25'
end_date = '2016-07-15'

df = read_from_files(DIR_OUTPUT, begin_date, end_date)
print('{0} transactions loaded, containing {1} fraudulent transactions'.format(len(df), df.isFraud.sum()))

In [None]:
df.head()

### Features Engineered around Time

For our first transformation, we want a feature that indicates whether a transaction occurred during the weekend. We will build a function that takes a transaction datetime and returns 0 if it falls on a weekday and 1 if it falls on the weekend.

In [None]:
def is_weekend(transactionDateTime):
    # Transform date into weekday (0 is Monday, 6 is Sunday)
    weekday = transactionDateTime.weekday()
    # Binary value: 0 if weekday, 1 if weekend
    is_weekend = weekday>=5
    
    return int(is_weekend)

In [None]:
%time df['TX_DURING_WEEKEND']=df.transactionDateTime.apply(is_weekend)

Next we also want to capture whether a transaction happened at night. Fraudulent transactions are often reported to skew toward the night, so we'll create a function that flags night-time transactions and include it as a possible feature in our model.

In [None]:
def is_night(transactionDateTime):
    
    # Get the hour of the transaction
    tx_hour = transactionDateTime.hour
    # Binary value: 1 if the hour is 6 or earlier (00:00-06:59), and 0 otherwise
    is_night = tx_hour<=6
    
    return int(is_night)

In [None]:
%time df['TX_DURING_NIGHT']=df.transactionDateTime.apply(is_night)
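
Applying a Python function row by row can be slow on large dataframes. As a possible alternative, the same two flags can be computed with the vectorized .dt accessor. The sketch below uses made-up timestamps and is meant to match the logic of is_weekend and is_night, not to replace them.

```python
import pandas as pd

toy = pd.DataFrame({"transactionDateTime": pd.to_datetime([
    "2016-06-24 23:30:00",  # Friday, late evening
    "2016-06-25 03:15:00",  # Saturday, small hours
    "2016-06-26 14:00:00",  # Sunday, afternoon
    "2016-06-27 06:59:00",  # Monday, just before 7 am
])})

ts = toy["transactionDateTime"]

# dayofweek: 0 is Monday, 6 is Sunday, so 5 and 6 are the weekend.
toy["TX_DURING_WEEKEND"] = (ts.dt.dayofweek >= 5).astype(int)

# Same rule as is_night: hours 0 through 6 count as night.
toy["TX_DURING_NIGHT"] = (ts.dt.hour <= 6).astype(int)

print(toy)
```

Note that the Friday 23:30 transaction is not flagged as night under this rule, since only the early-morning hours are counted.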

In [None]:
df

### Features Engineered around Customer Behavior

Next we'll create some features around the behavior of our customers. We want to see how customers spend their money within certain time windows (1, 7, and 30 days). For each window we want the number of transactions as well as the average amount spent. We'll do this with our *get_customer_spending_behaviour_features* function, and then look at the first customer in our dataframe to see what values it returns. Our implementation relies on the pandas rolling function, which makes it easy to compute aggregates over a time window.

In [None]:
df.head(10)

In [None]:
def get_customer_spending_behaviour_features(customer_transactions, windows_size_in_days=[1,7,30]):
    
    # Let us first order transactions chronologically
    customer_transactions=customer_transactions.sort_values('transactionDateTime')
    
    # The transaction date and time is set as the index, which will allow the use of the rolling function 
    customer_transactions.index=customer_transactions.transactionDateTime
    
    # For each window size
    for window_size in windows_size_in_days:
        
        # Compute the sum of the transaction amounts and the number of transactions for the given window size
        SUM_AMOUNT_TX_WINDOW=customer_transactions['transactionAmount'].rolling(str(window_size)+'d').sum()
        NB_TX_WINDOW=customer_transactions['transactionAmount'].rolling(str(window_size)+'d').count()
    
        # Compute the average transaction amount for the given window size
        # NB_TX_WINDOW is always >0 since current transaction is always included
        AVG_AMOUNT_TX_WINDOW=SUM_AMOUNT_TX_WINDOW/NB_TX_WINDOW
    
        # Save feature values
        customer_transactions['CUSTOMER_ID_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        customer_transactions['CUSTOMER_ID_AVG_AMOUNT_'+str(window_size)+'DAY_WINDOW']=list(AVG_AMOUNT_TX_WINDOW)
    
    # Reindex according to transaction IDs
    customer_transactions.index=customer_transactions.ID
        
    # And return the dataframe with the new features
    return customer_transactions
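
To see what a time-based rolling window returns, here is a minimal sketch on four made-up transaction amounts for a single customer. The series tx and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up amounts for one customer, indexed by transaction time.
tx = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.to_datetime([
        "2016-06-25 09:00", "2016-06-25 18:00",
        "2016-06-26 12:00", "2016-07-10 08:00",
    ]),
)

# Each window covers the given time span up to and including the current row.
nb_1d = tx.rolling("1D").count()
avg_1d = tx.rolling("1D").mean()
nb_7d = tx.rolling("7D").count()

print(nb_1d.tolist())   # [1.0, 2.0, 2.0, 1.0]
print(avg_1d.tolist())  # [10.0, 15.0, 25.0, 40.0]
print(nb_7d.tolist())   # [1.0, 2.0, 3.0, 1.0]
```

The third transaction's 1-day window reaches back to 12:00 on the previous day, so it includes the 18:00 transaction but not the 09:00 one. The last transaction is more than 7 days after the others, so both of its counts fall back to 1.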

In [None]:
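# Select every transaction of the first customer in the dataframe (df.ID==5 would pick a single transaction, not a customer)
spending_behaviour_customer_1 = get_customer_spending_behaviour_features(df[df.customerId==df.customerId.iloc[0]])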
spending_behaviour_customer_1

Now that we see our function is working, we'll apply it to our whole dataframe and take a look at our dataset

In [None]:
df=df.groupby('customerId').apply(lambda x: get_customer_spending_behaviour_features(x, windows_size_in_days=[1,7,30]))
df=df.sort_values('transactionDateTime').reset_index(drop=True)

In [None]:
df

### Features Engineered around Merchant Behavior

Now we want to assess merchant behavior to see if there are any features we can engineer around it. The main goal is to compute a risk score, which gives us a sense of a specific merchant's exposure to fraudulent transactions. The risk score is the proportion of a merchant's transactions that were fraudulent over a given time window.

For the customer features we used three window sizes of 1, 7, and 30 days. Merchant transactions have to be treated differently: the time windows must be shifted back by a delay period, because in reality transactions are only discovered to be fraudulent after a fraud investigation or a customer complaint. As a result, the fraud labels needed to compute the risk score only become available after the delay period, so we'll account for that in our merchant transformations. The delay period will be set to one week (7 days).

To get our risk score we'll define a *get_count_risk_rolling_window* function. The function takes as inputs the DataFrame of transactions for a given merchant, the delay period, and a list of window sizes (time intervals of 1, 7, and 30 days). In the first stage, the number of transactions and fraudulent transactions are computed for the delay period (NB_TX_DELAY and NB_FRAUD_DELAY). In the second stage, the number of transactions and fraudulent transactions are computed for each window size plus the delay period (NB_TX_DELAY_WINDOW and NB_FRAUD_DELAY_WINDOW). The number of transactions and fraudulent transactions that occurred for a given window size, shifted back by the delay period, is then obtained by taking the difference between the quantities for the window size plus delay period and the quantities for the delay period alone.

The risk score is then retrieved by computing the proportion of fraudulent transactions for each window size (or 0 if no transaction occurred for the given window)

Our function will also return the number of transactions for each window size. This results in the addition of six new features: The risk and number of transactions, for three window sizes.
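
The subtraction step is easier to follow on a tiny example. The sketch below uses five made-up transactions for one merchant, a 7-day delay, and a single 7-day window. The series is_fraud and the lowercase variable names are illustrative.

```python
import pandas as pd

# Made-up fraud labels for one merchant, indexed by transaction time.
is_fraud = pd.Series(
    [1, 0, 1, 0, 0],
    index=pd.to_datetime([
        "2016-06-01", "2016-06-03", "2016-06-05",
        "2016-06-10", "2016-06-13",
    ]),
)

delay, window = 7, 7

# Stage 1: counts over the delay period alone.
nb_fraud_delay = is_fraud.rolling(f"{delay}D").sum()
nb_tx_delay = is_fraud.rolling(f"{delay}D").count()

# Stage 2: counts over the delay period plus the window.
nb_fraud_delay_window = is_fraud.rolling(f"{delay + window}D").sum()
nb_tx_delay_window = is_fraud.rolling(f"{delay + window}D").count()

# The difference is what happened in the window shifted back by the delay.
nb_tx_window = nb_tx_delay_window - nb_tx_delay
nb_fraud_window = nb_fraud_delay_window - nb_fraud_delay

# 0/0 gives NaN when the shifted window is empty; treat that as zero risk.
risk = (nb_fraud_window / nb_tx_window).fillna(0)

print(nb_tx_window.tolist())   # [0.0, 0.0, 0.0, 2.0, 3.0]
print(risk.round(4).tolist())  # [0.0, 0.0, 0.0, 0.5, 0.6667]
```

For the transaction on 2016-06-10, the last 7 days are ignored as not yet labeled. The 7 days before that contain two transactions, one of them fraudulent, so the risk score is 0.5.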

In [None]:
def get_count_risk_rolling_window(merchant_tx, delay_period=7, windows_size_in_days=[1,7,30], feature="merchantName"):
    
    # Order the merchant's transactions chronologically
    merchant_tx = merchant_tx.sort_values('transactionDateTime')
    
    # Index by transaction time so that time-based rolling windows can be used
    merchant_tx.index=merchant_tx.transactionDateTime
    
    # Stage 1: fraud and transaction counts over the delay period alone
    NB_FRAUD_DELAY=merchant_tx['isFraud'].rolling(str(delay_period)+'d').sum()
    NB_TX_DELAY=merchant_tx['isFraud'].rolling(str(delay_period)+'d').count()
    
    for window_size in windows_size_in_days:
    
        # Stage 2: counts over the delay period plus the window
        NB_FRAUD_DELAY_WINDOW=merchant_tx['isFraud'].rolling(str(delay_period+window_size)+'d').sum()
        NB_TX_DELAY_WINDOW=merchant_tx['isFraud'].rolling(str(delay_period+window_size)+'d').count()
    
        # Counts for the window shifted back by the delay period
        NB_FRAUD_WINDOW=NB_FRAUD_DELAY_WINDOW-NB_FRAUD_DELAY
        NB_TX_WINDOW=NB_TX_DELAY_WINDOW-NB_TX_DELAY
    
        # Risk score: share of fraudulent transactions in the shifted window (NaN when the window is empty)
        RISK_WINDOW=NB_FRAUD_WINDOW/NB_TX_WINDOW
        
        # Save feature values
        merchant_tx[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        merchant_tx[feature+'_RISK_'+str(window_size)+'DAY_WINDOW']=list(RISK_WINDOW)
        
    # Reindex according to transaction IDs
    merchant_tx.index=merchant_tx.ID
    
    # Replace NA values with 0 (all undefined risk scores where NB_TX_WINDOW is 0) 
    merchant_tx.fillna(0,inplace=True)
    
    return merchant_tx

In [None]:
get_count_risk_rolling_window(df[df.merchantName=='staples.com'], delay_period=7, windows_size_in_days=[1,7,30])

In [None]:
%time df=df.groupby('merchantName').apply(lambda x: get_count_risk_rolling_window(x, delay_period=7, windows_size_in_days=[1,7,30], feature="merchantName"))
df=df.sort_values('transactionDateTime').reset_index(drop=True)

In [None]:
df

This wraps up our feature engineering. We will now save our data once again into our pickle file directory so that the dataframe is stored with all the new features we've added.

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"

if not os.path.exists(DIR_OUTPUT):
    os.makedirs(DIR_OUTPUT)

start_date = datetime.datetime.strptime("2016-01-01", "%Y-%m-%d")

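# Only rewrite the days present in df; starting from day 0 would overwrite earlier daily files with empty dataframes
for day in range(int(df.TX_Days.min()), int(df.TX_Days.max())+1):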
    
    transactions_day = df[df.TX_Days==day]
    
    date = start_date + datetime.timedelta(days=day)
    filename_output = date.strftime("%Y-%m-%d")+'.pkl'
    
    # Protocol=4 required for Google Colab
    transactions_day.to_pickle(DIR_OUTPUT+filename_output, protocol=4)

Just to double-check, we'll reload the same date range we used when engineering our features to see if everything looks correct. We will name the dataframe df_check and confirm that all the new features are in there.

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"

begin_date = '2016-06-25'
end_date = '2016-07-15'

df_check = read_from_files(DIR_OUTPUT, begin_date, end_date)
print('{0} transactions loaded, containing {1} fraudulent transactions'.format(len(df_check), df_check.isFraud.sum()))

In [None]:
df_check.head()

## Preliminary Modelling

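Let's read in the files between **2016-06-25** and **2016-07-25**. Keep in mind that the new features were only engineered for transactions up to 2016-07-15, so the files after that date still hold the raw columns.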

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"

begin_date = '2016-06-25'
end_date = '2016-07-25'

df = read_from_files(DIR_OUTPUT, begin_date, end_date)
print('{0} transactions loaded, containing {1} fraudulent transactions'.format(len(df), df.isFraud.sum()))

Let's take a look at the number of transactions we have per day in each period. The first week of our dataset will be our training set, the next week will be the delay period, and the week after that will be the test set.
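
The three one-week periods translate into concrete dates as follows. This is a small sketch using only the standard library. The name end_date_test is an assumption added for illustration.

```python
import datetime

# First day of the training period and the length of each period in days.
start_date_training = datetime.datetime.strptime("2016-06-25", "%Y-%m-%d")
delta_train = delta_delay = delta_test = 7

# Training covers the first 7 days.
end_date_training = start_date_training + datetime.timedelta(days=delta_train - 1)

# The test period starts once the training and delay periods are over.
start_date_test = start_date_training + datetime.timedelta(days=delta_train + delta_delay)
end_date_test = start_date_test + datetime.timedelta(days=delta_test - 1)

print("Training:", start_date_training.date(), "to", end_date_training.date())
print("Test:    ", start_date_test.date(), "to", end_date_test.date())
```

Training runs from 2016-06-25 to 2016-07-01, the delay week covers 2016-07-02 to 2016-07-08, and the test period runs from 2016-07-09 to 2016-07-15.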

In [None]:
#function to build a dataframe with the statistics we need for our line graph to assess our training, delay, and test sets
def get_tx_stats(df, start_date_df = "2016-01-01"):
    
    #Number of transactions per day
    nb_tx_per_day=df.groupby(['TX_Days'])['customerId'].count()
    #Number of fraudulent transactions per day
    nb_fraudulent_transactions_per_day=df.groupby(['TX_Days'])['isFraud'].sum()
    #Number of compromised cards per day
    nb_compromised_cards_per_day=df[df['isFraud']==1].groupby(['TX_Days']).customerId.nunique()
    #Creating a new dataframe for our specific statistics
    tx_stats=pd.DataFrame({"nb_tx_per_day":nb_tx_per_day,
                           "nb_fraudulent_transactions_per_day":nb_fraudulent_transactions_per_day,
                           "nb_compromised_cards_per_day":nb_compromised_cards_per_day})

    tx_stats=tx_stats.reset_index()
    
    start_date = datetime.datetime.strptime(start_date_df, "%Y-%m-%d")
    tx_date=start_date+tx_stats['TX_Days'].apply(datetime.timedelta)
    
    tx_stats['tx_date']=tx_date
    
    return tx_stats

tx_stats=get_tx_stats(df, start_date_df="2016-01-01")

In [None]:
%%capture

# Plot the number of transactions per day, fraudulent transactions per day and fraudulent cards per day

def get_template_tx_stats(ax ,fs,
                          start_date_training,
                          title='',
                          delta_train=7,
                          delta_delay=7,
                          delta_test=7,
                          ylim=100):
    
    ax.set_title(title, fontsize=fs*1.5)
    ax.set_ylim([0, ylim])
    
    ax.set_xlabel('Date', fontsize=fs)
    ax.set_ylabel('Number', fontsize=fs)
    
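    ax.tick_params(axis='both', labelsize=fs*0.7)

    # Vertical lines mark the end of the training period and the start of the test period.
    # The test start date is computed here rather than read from a global variable.
    start_date_test = start_date_training+datetime.timedelta(days=delta_train+delta_delay)
    ax.axvline(start_date_training+datetime.timedelta(days=delta_train), color="black")
    ax.axvline(start_date_test, color="black")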
    
    ax.text(start_date_training+datetime.timedelta(days=2), ylim-20,'Training period', fontsize=fs)
    ax.text(start_date_training+datetime.timedelta(days=delta_train+2), ylim-20,'Delay period', fontsize=fs)
    ax.text(start_date_training+datetime.timedelta(days=delta_train+delta_delay+2), ylim-20,'Test period', fontsize=fs)


cmap = plt.get_cmap('jet')
colors={'nb_tx_per_day':cmap(0), 
        'nb_fraudulent_transactions_per_day':cmap(200), 
        'nb_compromised_cards_per_day':cmap(250)}

fraud_and_transactions_stats_fig, ax = plt.subplots(1, 1, figsize=(15,8))

# Training period
start_date_training = datetime.datetime.strptime("2016-06-25", "%Y-%m-%d")
delta_train = delta_delay = delta_test = 7

end_date_training = start_date_training+datetime.timedelta(days=delta_train-1)

# Test period
start_date_test = start_date_training+datetime.timedelta(days=delta_train+delta_delay)
end_date_test = start_date_training+datetime.timedelta(days=delta_train+delta_delay+delta_test-1)

get_template_tx_stats(ax, fs=20,
                      start_date_training=start_date_training,
                      title='Total transactions, and number of fraudulent transactions \n and number of compromised cards per day',
                      delta_train=delta_train,
                      delta_delay=delta_delay,
                      delta_test=delta_test
                     )

# The color is passed by keyword only; combining it with a 'b' format string is redundant and triggers a warning
ax.plot(tx_stats['tx_date'], tx_stats['nb_tx_per_day']/50, color=colors['nb_tx_per_day'], label = '# transactions per day (/50)')
ax.plot(tx_stats['tx_date'], tx_stats['nb_fraudulent_transactions_per_day'], color=colors['nb_fraudulent_transactions_per_day'], label = '# fraudulent txs per day')
ax.plot(tx_stats['tx_date'], tx_stats['nb_compromised_cards_per_day'], color=colors['nb_compromised_cards_per_day'], label = '# compromised cards per day')

ax.legend(loc = 'upper left',bbox_to_anchor=(1.05, 1),fontsize=20)

In [None]:
fraud_and_transactions_stats_fig

Let's define our training and test sets and take a look at both dataframes.

In [None]:
def get_train_test_set(df,
                       start_date_training,
                       delta_train=7,delta_delay=7,delta_test=7):
    
    # Get the training set data
    train_df = df[(df.transactionDateTime>=start_date_training) &
                               (df.transactionDateTime<start_date_training+datetime.timedelta(days=delta_train))]
    
    # Get the test set data
    test_df = []
    
    # Note: Cards known to be compromised after the delay period are removed from the test set
    # That is, for each test day, all frauds known at (test_day-delay_period) are removed
    
    # First, get known defrauded customers from the training set
    known_defrauded_customers = set(train_df[train_df.isFraud==1].customerId)
    
    # Get the relative starting day of training set (easier than transactionDateTime to collect test data)
    start_tx_time_days_training = train_df.TX_Days.min()
    
    # Then, for each day of the test set
    for day in range(delta_test):
    
        # Get test data for that day
        test_df_day = df[df.TX_Days==start_tx_time_days_training+ delta_train+delta_delay+day]
        
        # Compromised cards from that test day, minus the delay period, are added to the pool of known defrauded customers
        test_df_day_delay_period = df[df.TX_Days==start_tx_time_days_training+
                                                                                delta_train+
                                                                                day-1]
        
        new_defrauded_customers = set(test_df_day_delay_period[test_df_day_delay_period.isFraud==1].customerId)
        known_defrauded_customers = known_defrauded_customers.union(new_defrauded_customers)
        
        test_df_day = test_df_day[~test_df_day.customerId.isin(known_defrauded_customers)]
        
        test_df.append(test_df_day)
        
    test_df = pd.concat(test_df)
    
    # Sort data sets by ascending order of transaction ID
    train_df=train_df.sort_values('ID')
    test_df=test_df.sort_values('ID')
    
    return (train_df, test_df)
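
The delay logic in `get_train_test_set` can be easier to follow on toy data. The sketch below is a pure-Python illustration, not the notebook's implementation: the dictionaries, the two-day periods, and the name `build_test_set` are assumptions. It mirrors the `day-1` offset used above, so frauds from one day become known one delay period plus one day later.

```python
def build_test_set(transactions, train_start, delta_train, delta_delay, delta_test):
    """Toy version of the delay logic; each transaction is a dict with
    'day', 'card' and 'fraud' keys (stand-ins for TX_Days, customerId, isFraud)."""
    train_days = set(range(train_start, train_start + delta_train))
    # Cards with a fraud during training are known to be compromised
    known_compromised = {t["card"] for t in transactions
                         if t["day"] in train_days and t["fraud"] == 1}
    test_set = []
    for offset in range(delta_test):
        test_day = train_start + delta_train + delta_delay + offset
        # Frauds from this earlier day have been confirmed by the time test_day arrives
        revealed_day = test_day - delta_delay - 1
        known_compromised |= {t["card"] for t in transactions
                              if t["day"] == revealed_day and t["fraud"] == 1}
        # Keep only transactions from cards not yet known to be compromised
        test_set += [t for t in transactions
                     if t["day"] == test_day and t["card"] not in known_compromised]
    return test_set

toy_transactions = [
    {"day": 0, "card": "A", "fraud": 1},  # training period: A becomes known
    {"day": 2, "card": "B", "fraud": 1},  # delay period: known from test day 5
    {"day": 3, "card": "C", "fraud": 1},  # delay period: revealed too late for this test window
    {"day": 4, "card": "A", "fraud": 1},
    {"day": 4, "card": "B", "fraud": 1},
    {"day": 4, "card": "C", "fraud": 0},
    {"day": 5, "card": "B", "fraud": 1},
    {"day": 5, "card": "C", "fraud": 1},
    {"day": 5, "card": "D", "fraud": 0},
]

test_set = build_test_set(toy_transactions, train_start=0,
                          delta_train=2, delta_delay=2, delta_test=2)
print([(t["day"], t["card"]) for t in test_set])
```

Card A never appears in the test set, while card B is still present on day 4 and only disappears on day 5, once its fraud from day 2 has been confirmed.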

In [None]:
(train_df, test_df)=get_train_test_set(df,start_date_training,
                                       delta_train=7,delta_delay=7,delta_test=7)

In [None]:
train_df

In [None]:
train_df[train_df.isFraud==1].shape

In [None]:
test_df

In [None]:
test_df.shape

In [None]:
test_df[test_df.isFraud==1].shape

In [None]:
726/(45524+726)

In [None]:
train_df.dtypes

### Decision Trees

We will use a Decision Tree as our first model.

In [None]:
output_feature="isFraud"

input_features=['transactionAmount','TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'CUSTOMER_ID_NB_TX_1DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW', 'CUSTOMER_ID_NB_TX_7DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW', 'CUSTOMER_ID_NB_TX_30DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW', 'merchantName_NB_TX_1DAY_WINDOW',
       'merchantName_RISK_1DAY_WINDOW', 'merchantName_NB_TX_7DAY_WINDOW',
       'merchantName_RISK_7DAY_WINDOW', 'merchantName_NB_TX_30DAY_WINDOW',
       'merchantName_RISK_30DAY_WINDOW','cardPresent']

In [None]:
def fit_model_and_get_predictions(classifier, train_df, test_df, 
                                  input_features, output_feature="isFraud",scale=True):
    
    # By default, scales input data
    if scale:
        (train_df, test_df)=scaleData(train_df,test_df,input_features)
    
    # We first train the classifier using the `fit` method, and pass as arguments the input and output features
    start_time=time.time()
    classifier.fit(train_df[input_features], train_df[output_feature])
    training_execution_time=time.time()-start_time

    # We then get the predictions on the training and test data using the `predict_proba` method
    # The predictions are returned as a numpy array, that provides the probability of fraud for each transaction 
    start_time=time.time()
    predictions_test=classifier.predict_proba(test_df[input_features])[:,1]
    prediction_execution_time=time.time()-start_time
    
    predictions_train=classifier.predict_proba(train_df[input_features])[:,1]

    # The result is returned as a dictionary containing the fitted models, 
    # and the predictions on the training and test sets
    model_and_predictions_dictionary = {'classifier': classifier,
                                        'predictions_test': predictions_test,
                                        'predictions_train': predictions_train,
                                        'training_execution_time': training_execution_time,
                                        'prediction_execution_time': prediction_execution_time
                                       }
    
    return model_and_predictions_dictionary

In [None]:
classifier = sklearn.tree.DecisionTreeClassifier(max_depth = 2, random_state=0)

model_and_predictions_dictionary = fit_model_and_get_predictions(classifier, train_df, test_df, 
                                                                 input_features, output_feature,
                                                                 scale=False)

In [None]:
test_df['TX_FRAUD_PREDICTED']=model_and_predictions_dictionary['predictions_test']

The predicted probability of fraud for all of these transactions is 0.015920. We can display the decision tree to understand how these probabilities were set.
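
To make the link between a tree leaf and its predicted probability concrete: for a scikit-learn decision tree, `predict_proba` returns the fraction of training samples of each class in the leaf that a transaction falls into. The sketch below mimics this for a single split on toy data. The samples, the 100.0 threshold, and the helper name `leaf_fraud_probabilities` are illustrative assumptions, not values taken from the fitted tree.

```python
def leaf_fraud_probabilities(samples, feature, threshold):
    """Split samples on one feature and return the fraud fraction in each leaf."""
    left = [s["isFraud"] for s in samples if s[feature] <= threshold]
    right = [s["isFraud"] for s in samples if s[feature] > threshold]
    # Each leaf predicts the share of fraudulent training samples it contains
    return sum(left) / len(left), sum(right) / len(right)

# Toy training data (hypothetical): 8 small transactions and 4 large ones
toy_samples = (
    [{"transactionAmount": 20.0, "isFraud": 0}] * 7
    + [{"transactionAmount": 50.0, "isFraud": 1}]
    + [{"transactionAmount": 300.0, "isFraud": 0}] * 2
    + [{"transactionAmount": 500.0, "isFraud": 1}] * 2
)

p_left, p_right = leaf_fraud_probabilities(toy_samples, "transactionAmount", 100.0)
print(p_left, p_right)
```

Every transaction routed to the same leaf receives the same probability, which is why a shallow tree produces only a handful of distinct scores.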

In [None]:
# Visualizing the decision tree with matplotlib. Graphviz is another option

fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=300)

tree.plot_tree(classifier,
           feature_names = input_features, 
           class_names=True,
           filled = True);

In [None]:
def card_precision_top_k_day(df_day,top_k):
    
    # This takes the max of the predictions AND the max of the isFraud label for each customerId,
    # and sorts by decreasing order of fraudulent prediction
    df_day = df_day.groupby('customerId').max().sort_values(by="predictions", ascending=False).reset_index(drop=False)
            
    # Get the top k most suspicious cards
    df_day_top_k=df_day.head(top_k)
    list_detected_compromised_cards=list(df_day_top_k[df_day_top_k.isFraud==1].customerId)
    
    # Compute precision top k
    card_precision_top_k = len(list_detected_compromised_cards) / top_k
    
    return list_detected_compromised_cards, card_precision_top_k

def card_precision_top_k(predictions_df, top_k, remove_detected_compromised_cards=True):

    # Sort days by increasing order
    list_days=list(predictions_df['TX_Days'].unique())
    list_days.sort()
    
    # At first, the list of detected compromised cards is empty
    list_detected_compromised_cards = []
    
    card_precision_top_k_per_day_list = []
    nb_compromised_cards_per_day = []
    
    # For each day, compute precision top k
    for day in list_days:
        
        df_day = predictions_df[predictions_df['TX_Days']==day]
        df_day = df_day[['predictions', 'customerId', 'isFraud']]
        
        # Let us remove detected compromised cards from the set of daily transactions
        df_day = df_day[~df_day.customerId.isin(list_detected_compromised_cards)]
        
        nb_compromised_cards_per_day.append(len(df_day[df_day.isFraud==1].customerId.unique()))
        
        detected_compromised_cards, card_precision_top_k = card_precision_top_k_day(df_day,top_k)
        
        card_precision_top_k_per_day_list.append(card_precision_top_k)
        
        # Let us update the list of detected compromised cards
        if remove_detected_compromised_cards:
            list_detected_compromised_cards.extend(detected_compromised_cards)
        
    # Compute the mean
    mean_card_precision_top_k = np.array(card_precision_top_k_per_day_list).mean()
    
    # Returns precision top k per day as a list, and resulting mean
    return nb_compromised_cards_per_day,card_precision_top_k_per_day_list,mean_card_precision_top_k

def performance_assessment(predictions_df, output_feature='isFraud', 
                           prediction_feature='predictions', top_k_list=[100],
                           rounded=True):
    
    AUC_ROC = metrics.roc_auc_score(predictions_df[output_feature], predictions_df[prediction_feature])
    AP = metrics.average_precision_score(predictions_df[output_feature], predictions_df[prediction_feature])
    
    performances = pd.DataFrame([[AUC_ROC, AP]], 
                           columns=['AUC ROC','Average precision'])
    
    for top_k in top_k_list:
    
        _, _, mean_card_precision_top_k = card_precision_top_k(predictions_df, top_k)
        performances['Card Precision@'+str(top_k)]=mean_card_precision_top_k
        
    if rounded:
        performances = performances.round(3)
    
    return performances

The Card Precision top-k is the most pragmatic and interpretable measure. It takes into account the fact that investigators can only check a maximum of k potentially fraudulent cards per day. It is computed by ranking, for every day in the test set, the most fraudulent transactions, and selecting the k cards whose transactions have the highest fraud probabilities. The precision (proportion of actual compromised cards out of predicted compromised cards) is then computed for each day. The Card Precision top-k is the average of these daily precisions. Here k is set to 100 (that is, it is assumed that only 100 cards can be checked every day). The metric is described in detail in Chapter 4, Precision top-k metrics, of the fraud detection handbook listed in the Background section.
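
A small worked example may help make the computation concrete. The sketch below is a pure-Python toy version, not the notebook's implementation: the tuples, the value k=2, and the function name `card_precision_top_k_toy` are assumptions chosen for illustration. Like the `remove_detected_compromised_cards` option above, it drops cards detected on one day from the following days.

```python
def card_precision_top_k_toy(transactions, top_k):
    """transactions: list of (day, card, score, is_fraud) tuples."""
    days = sorted({t[0] for t in transactions})
    detected = set()
    daily_precisions = []
    for day in days:
        # Aggregate to card level: max score and max label per card,
        # skipping cards already detected on a previous day
        cards = {}
        for d, card, score, is_fraud in transactions:
            if d != day or card in detected:
                continue
            best_score, any_fraud = cards.get(card, (0.0, 0))
            cards[card] = (max(best_score, score), max(any_fraud, is_fraud))
        # Rank cards by decreasing score and keep the k most suspicious ones
        ranked = sorted(cards.items(), key=lambda item: item[1][0], reverse=True)[:top_k]
        hits = [card for card, (_, any_fraud) in ranked if any_fraud == 1]
        daily_precisions.append(len(hits) / top_k)
        detected.update(hits)
    return daily_precisions, sum(daily_precisions) / len(daily_precisions)

toy = [
    (1, "A", 0.9, 1), (1, "A", 0.2, 0), (1, "B", 0.8, 0), (1, "C", 0.1, 1),
    (2, "A", 0.95, 1), (2, "C", 0.7, 1), (2, "D", 0.6, 1), (2, "B", 0.3, 0),
]

daily, mean_precision = card_precision_top_k_toy(toy, top_k=2)
print(daily, mean_precision)
```

On day 1 the two top-ranked cards are A and B, of which only A is compromised. On day 2 card A is no longer considered, and both top-ranked cards (C and D) are compromised.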

The Average Precision is a proxy for the Card Precision top-k that integrates precisions for all possible values of k.

The AUC ROC is an alternative measure to the Average Precision, which gives more importance to scores obtained with higher values of k. It is less relevant in practice since the performances that matter most are those for low values of k. However, we also report it since it is the most widely used performance metric for fraud detection in the literature.

In [None]:
predictions_df=test_df
predictions_df['predictions']=model_and_predictions_dictionary['predictions_test']
    
performance_assessment(predictions_df, top_k_list=[100])

In [None]:
predictions_df['predictions']=0.5
    
performance_assessment(predictions_df, top_k_list=[100])

### Utilizing Standard Predictive Models

Let's now train standard predictive models such as Logistic Regression, Random Forest, and XGBoost, as well as other variants of Decision Trees. For this purpose, we will create a dictionary that instantiates each of these classifiers (the sklearn estimators plus the sklearn-compatible XGBoost classifier). We then train and compute the predictions for each of these classifiers using the fit_model_and_get_predictions function.

In [None]:
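def scaleData(train,test,features):
    # Work on copies so the caller's dataframes are left untouched
    train = train.copy()
    test = test.copy()
    # Fit the scaler on the training set only, then apply it to both sets
    scaler = sklearn.preprocessing.StandardScaler()
    scaler.fit(train[features])
    train[features]=scaler.transform(train[features])
    test[features]=scaler.transform(test[features])
    
    return (train,test)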

In [None]:
classifiers_dictionary={'Logistic regression':sklearn.linear_model.LogisticRegression(random_state=0), 
                        'Decision tree with depth of two':sklearn.tree.DecisionTreeClassifier(max_depth=2,random_state=0), 
                        'Decision tree - unlimited depth':sklearn.tree.DecisionTreeClassifier(random_state=0), 
                        'Random forest':sklearn.ensemble.RandomForestClassifier(random_state=0,n_jobs=-1),
                        'XGBoost':xgboost.XGBClassifier(random_state=0,n_jobs=-1),
                       }

fitted_models_and_predictions_dictionary={}

for classifier_name in classifiers_dictionary:
    
    model_and_predictions = fit_model_and_get_predictions(classifiers_dictionary[classifier_name], train_df, test_df, 
                                                                                  input_features=input_features,
                                                                                output_feature=output_feature)
    fitted_models_and_predictions_dictionary[classifier_name]=model_and_predictions

In [None]:
def performance_assessment_model_collection(fitted_models_and_predictions_dictionary, 
                                            df, 
                                            type_set='test',
                                            top_k_list=[100]):

    performances=pd.DataFrame() 
    
    for classifier_name, model_and_predictions in fitted_models_and_predictions_dictionary.items():
    
        predictions_df= df
            
        predictions_df['predictions']=model_and_predictions['predictions_'+type_set]
        
        performances_model=performance_assessment(predictions_df, output_feature='isFraud', 
                                                   prediction_feature='predictions', top_k_list=top_k_list)
        performances_model.index=[classifier_name]
        
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
        performances=pd.concat([performances, performances_model])
        
    return performances

Let's take a look at the performance of these models on the test set.

In [None]:
# performances on test set
df_performances=performance_assessment_model_collection(fitted_models_and_predictions_dictionary, test_df, 
                                                        type_set='test', 
                                                        top_k_list=[100])
df_performances

Let's take a look at the performance of these models on the training set.

In [None]:
# performances on training set
df_performances=performance_assessment_model_collection(fitted_models_and_predictions_dictionary, train_df, 
                                                        type_set='train', 
                                                        top_k_list=[100])
df_performances

In [None]:
def execution_times_model_collection(fitted_models_and_predictions_dictionary):

    execution_times=pd.DataFrame() 
    
    for classifier_name, model_and_predictions in fitted_models_and_predictions_dictionary.items():
    
        execution_times_model=pd.DataFrame() 
        execution_times_model['Training execution time']=[model_and_predictions['training_execution_time']]
        execution_times_model['Prediction execution time']=[model_and_predictions['prediction_execution_time']]
        execution_times_model.index=[classifier_name]
        
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
        execution_times=pd.concat([execution_times, execution_times_model])
        
    return execution_times

In [None]:
# Execution times
df_execution_times=execution_times_model_collection(fitted_models_and_predictions_dictionary)
df_execution_times

### Plot AUC ROC Curve

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"
begin_date = '2016-06-25'
end_date = '2016-07-25'

%time df=read_from_files(DIR_OUTPUT, begin_date, end_date)
print("{0} transactions loaded, containing {1} fraudulent transactions".format(len(df),df.isFraud.sum()))

start_date_training = datetime.datetime.strptime("2016-06-25", "%Y-%m-%d")
delta_train=delta_delay=delta_test=7

(train_df,test_df)=get_train_test_set(df,start_date_training,
                                      delta_train=delta_train,delta_delay=delta_delay,delta_test=delta_test)

output_feature="isFraud"

input_features=['transactionAmount','TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'CUSTOMER_ID_NB_TX_1DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW', 'CUSTOMER_ID_NB_TX_7DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW', 'CUSTOMER_ID_NB_TX_30DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW', 'merchantName_NB_TX_1DAY_WINDOW',
       'merchantName_RISK_1DAY_WINDOW', 'merchantName_NB_TX_7DAY_WINDOW',
       'merchantName_RISK_7DAY_WINDOW', 'merchantName_NB_TX_30DAY_WINDOW',
       'merchantName_RISK_30DAY_WINDOW','cardPresent']

classifiers_dictionary={'Logistic regression':sklearn.linear_model.LogisticRegression(random_state=0), 
                        'Decision tree with depth of two':sklearn.tree.DecisionTreeClassifier(max_depth=2,random_state=0), 
                        'Decision tree - unlimited depth':sklearn.tree.DecisionTreeClassifier(random_state=0), 
                        'Random forest':sklearn.ensemble.RandomForestClassifier(random_state=0,n_jobs=-1),
                        'XGBoost':xgboost.XGBClassifier(random_state=0,n_jobs=-1),
                       }

fitted_models_and_predictions_dictionary={}

for classifier_name in classifiers_dictionary:
    
    start_time=time.time()
    model_and_predictions = fit_model_and_get_predictions(classifiers_dictionary[classifier_name], train_df, test_df, 
                                                          input_features=input_features,
                                                          output_feature=output_feature)
    
    print("Time to fit the "+classifier_name+" model: "+str(round(time.time()-start_time,2)))
    
    fitted_models_and_predictions_dictionary[classifier_name]=model_and_predictions

In [None]:
train_df[train_df.isFraud==1].shape

In [None]:
def get_template_roc_curve(ax, title,fs,random=True):
    
    ax.set_title(title, fontsize=fs)
    ax.set_xlim([-0.01, 1.01])
    ax.set_ylim([-0.01, 1.01])
    
    ax.set_xlabel('False Positive Rate', fontsize=fs)
    ax.set_ylabel('True Positive Rate', fontsize=fs)
    
    if random:
        ax.plot([0, 1], [0, 1],'r--',label="AUC ROC Random = 0.5")

In [None]:
%%capture
roc_curve, ax = plt.subplots(1, 1, figsize=(5,5))

cmap = plt.get_cmap('jet')
colors={'Logistic regression':cmap(0), 'Decision tree with depth of two':cmap(200), 
        'Decision tree - unlimited depth':cmap(250),
        'Random forest':cmap(70), 'XGBoost':cmap(40)}

get_template_roc_curve(ax,title='Receiver Operating Characteristic Curve\nTest data',fs=15)
    
for classifier_name in classifiers_dictionary:
    
    model_and_predictions=fitted_models_and_predictions_dictionary[classifier_name]

    FPR_list, TPR_list, threshold = metrics.roc_curve(test_df[output_feature], model_and_predictions['predictions_test'])
    ROC_AUC = metrics.auc(FPR_list, TPR_list)

    ax.plot(FPR_list, TPR_list, color=colors[classifier_name], label = 'AUC ROC {0}= {1:0.3f}'.format(classifier_name,ROC_AUC))
    ax.legend(loc = 'upper left',bbox_to_anchor=(1.05, 1))

In [None]:
roc_curve

We can see that, in terms of AUC ROC on the test set, our logistic regression and our decision tree with a depth of two perform far better than our other models.
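
For readers who want to see where the points on a ROC curve come from, the following pure-Python sketch thresholds a handful of scores and integrates the resulting curve with the trapezoidal rule. The labels, scores, and function names are toy assumptions; the notebook itself relies on `metrics.roc_curve` and `metrics.auc`.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A transaction is flagged when its score reaches the threshold
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]

toy_points = roc_points(toy_labels, toy_scores)
toy_auc = auc_trapezoid(toy_points)
print(toy_points)
print(round(toy_auc, 3))
```

The area equals 5/6 here, which is also the share of (fraud, legitimate) pairs in which the fraudulent transaction receives the higher score.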

### Precision-Recall Curve

In [None]:
def compute_AP(precision, recall):
    
    AP = 0
    
    n_thresholds = len(precision)
    
    for i in range(1, n_thresholds):
        
        if recall[i]-recall[i-1]>=0:
            
            AP = AP+(recall[i]-recall[i-1])*precision[i]
        
    return AP
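
In formula form, the loop above accumulates the following sum. The symbols $P_i$, $R_i$ and $n$ are introduced here for illustration: they stand for `precision[i]`, `recall[i]` and `n_thresholds`, with the arrays assumed to be ordered so that recall is non-decreasing.

```latex
\mathrm{AP} = \sum_{i=1}^{n-1} \left( R_i - R_{i-1} \right) P_i
```

Each term weights the precision at a threshold by the increase in recall since the previous threshold; terms with a negative recall difference are skipped by the `if` guard.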

In [None]:
def get_template_pr_curve(ax, title,fs, baseline=0.5):
    ax.set_title(title, fontsize=fs)
    ax.set_xlim([-0.01, 1.01])
    ax.set_ylim([-0.01, 1.01])
    
    ax.set_xlabel('Recall (True Positive Rate)', fontsize=fs)
    ax.set_ylabel('Precision', fontsize=fs)
    
    ax.plot([0, 1], [baseline, baseline],'r--',label='AP Random = {0:0.3f}'.format(baseline))

In [None]:
%%capture
pr_curve, ax = plt.subplots(1, 1, figsize=(6,6))
cmap = plt.get_cmap('jet')
colors={'Logistic regression':cmap(0), 'Decision tree with depth of two':cmap(200), 
        'Decision tree - unlimited depth':cmap(250),
        'Random forest':cmap(70), 'XGBoost':cmap(40)}

get_template_pr_curve(ax, "Precision Recall (PR) Curve\nTest data",fs=15,baseline=sum(test_df[output_feature])/len(test_df[output_feature]))
    
for classifier_name in classifiers_dictionary:
    
    model_and_predictions=fitted_models_and_predictions_dictionary[classifier_name]

    precision, recall, threshold = metrics.precision_recall_curve(test_df[output_feature], model_and_predictions['predictions_test'])
    precision=precision[::-1]
    recall=recall[::-1]
    
    AP = metrics.average_precision_score(test_df[output_feature], model_and_predictions['predictions_test'])
    
    ax.step(recall, precision, color=colors[classifier_name], label = 'AP {0}= {1:0.3f}'.format(classifier_name,AP))
    ax.legend(loc = 'upper left',bbox_to_anchor=(1.05, 1))
    
    
plt.subplots_adjust(wspace=0.5, hspace=0.8)

In [None]:
pr_curve

Our PR curve gives us a different view of our classifiers. Compared to the ROC curve, the classifiers appear to perform worse, because precision-recall metrics take the class imbalance into account, whereas the ROC curve does not reflect it. PR curves are useful for highlighting the performance of fraud detection systems at low FPR values, but they remain difficult to interpret from an operational point of view. Pairing them with the Precision top-k metrics makes them a little easier to interpret.
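
The effect of class imbalance on precision can be checked with a little arithmetic. The sketch below computes the precision implied by a given TPR and FPR at two different fraud rates. The operating point (TPR 0.5, FPR 0.01) and the 1.5% fraud rate are illustrative assumptions, as is the helper name `precision_from_rates`.

```python
def precision_from_rates(tpr, fpr, fraud_rate):
    """Precision implied by a (TPR, FPR) operating point at a given fraud prevalence."""
    true_positives = tpr * fraud_rate          # share of all transactions that are caught frauds
    false_positives = fpr * (1 - fraud_rate)   # share of all transactions that are false alarms
    return true_positives / (true_positives + false_positives)

balanced = precision_from_rates(tpr=0.5, fpr=0.01, fraud_rate=0.5)
imbalanced = precision_from_rates(tpr=0.5, fpr=0.01, fraud_rate=0.015)
print(round(balanced, 3), round(imbalanced, 3))
```

The same point on the ROC curve corresponds to a precision of about 0.98 on balanced data but only about 0.43 when 1.5% of transactions are fraudulent, which is why the PR curve looks less flattering.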

#### Precision top-k 

In [None]:
DIR_OUTPUT = "./pickled-data-raw/"
begin_date = '2016-06-25'
end_date = '2016-07-25'

%time df=read_from_files(DIR_OUTPUT, begin_date, end_date)
print("{0} transactions loaded, containing {1} fraudulent transactions".format(len(df),df.isFraud.sum()))

start_date_training = datetime.datetime.strptime("2016-06-25", "%Y-%m-%d")
delta_train=delta_delay=delta_test=7

(train_df,test_df)=get_train_test_set(df,start_date_training,
                                      delta_train=delta_train,delta_delay=delta_delay,delta_test=delta_test)

output_feature="isFraud"

input_features=['transactionAmount','TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'CUSTOMER_ID_NB_TX_1DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW', 'CUSTOMER_ID_NB_TX_7DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW', 'CUSTOMER_ID_NB_TX_30DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW', 'merchantName_NB_TX_1DAY_WINDOW',
       'merchantName_RISK_1DAY_WINDOW', 'merchantName_NB_TX_7DAY_WINDOW',
       'merchantName_RISK_7DAY_WINDOW', 'merchantName_NB_TX_30DAY_WINDOW',
       'merchantName_RISK_30DAY_WINDOW','cardPresent']

classifier = sklearn.linear_model.LogisticRegression(random_state=0)

model_and_predictions_dictionary = fit_model_and_get_predictions(classifier, train_df, test_df, 
                                                                 input_features, output_feature)

In [None]:
assert len(model_and_predictions_dictionary['predictions_test'])==len(test_df)

In [None]:
test_df

In [None]:
predictions_df=test_df
predictions_df['predictions']=model_and_predictions_dictionary['predictions_test']
predictions_df[['ID','transactionDateTime','customerId','merchantName','transactionAmount','TX_Days','isFraud','predictions']].head()

In [None]:
predictions_df[predictions_df.isFraud==1]

In [None]:
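def precision_top_k_day(df_day, top_k):
    # precision_top_k below calls this helper. In case it is not defined earlier in the notebook,
    # this version is an assumed transaction-level counterpart of card_precision_top_k_day
    
    # Sort transactions by decreasing order of fraud probability
    df_day = df_day.sort_values(by="predictions", ascending=False).reset_index(drop=False)
    
    # Get the top k most suspicious transactions
    df_day_top_k=df_day.head(top_k)
    list_detected_fraudulent_transactions=list(df_day_top_k[df_day_top_k.isFraud==1].ID)
    
    # Compute precision top k
    precision_top_k_value = len(list_detected_fraudulent_transactions) / top_k
    
    return list_detected_fraudulent_transactions, precision_top_k_value

def precision_top_k(predictions_df, top_k=100):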

    # Sort days by increasing order
    list_days=list(predictions_df['TX_Days'].unique())
    list_days.sort()
    
    precision_top_k_per_day_list = []
    nb_fraudulent_transactions_per_day = []
    
    # For each day, compute precision top k
    for day in list_days:
        
        df_day = predictions_df[predictions_df['TX_Days']==day]
        df_day = df_day[['ID', 'customerId', 'isFraud', 'predictions']]
        
        nb_fraudulent_transactions_per_day.append(len(df_day[df_day.isFraud==1]))
        
        _, _precision_top_k = precision_top_k_day(df_day, top_k=top_k)
        
        precision_top_k_per_day_list.append(_precision_top_k)
        
    # Compute the mean
    mean_precision_top_k = np.round(np.array(precision_top_k_per_day_list).mean(),3)
    
    # Returns number of fraudulent transactions per day,
    # precision top k per day, and resulting mean
    return nb_fraudulent_transactions_per_day,precision_top_k_per_day_list,mean_precision_top_k

In [None]:
def card_precision_top_k_day(df_day,top_k):
    
    # This takes the max of the predictions AND the max of the isFraud label for each customerId,
    # and sorts by decreasing order of fraudulent prediction
    df_day = df_day.groupby('customerId').max().sort_values(by="predictions", ascending=False).reset_index(drop=False)
            
    # Get the top k most suspicious cards
    df_day_top_k=df_day.head(top_k)
    list_detected_compromised_cards=list(df_day_top_k[df_day_top_k.isFraud==1].customerId)
    
    # Compute precision top k
    card_precision_top_k = len(list_detected_compromised_cards) / top_k
    
    return list_detected_compromised_cards, card_precision_top_k

In [None]:
nb_fraudulent_transactions_per_day_remaining,\
precision_top_k_per_day_list,\
mean_precision_top_k = card_precision_top_k(predictions_df=predictions_df, top_k=100)

print("Number of remaining fraudulent transactions: "+str(nb_fraudulent_transactions_per_day_remaining))
print("Precision top-k: "+str(precision_top_k_per_day_list))
print("Average Precision top-k: "+str(mean_precision_top_k))

We get a mean Precision top-k of about 4.3%, meaning that, on average, about 4.3% of the 100 most suspicious transactions each day were confirmed to be fraudulent. To better illustrate what is happening with Precision top-k, let's graph the number of fraudulent transactions, the number of remaining fraudulent transactions for the test period, and the number of detected fraudulent transactions for the test period.

In [None]:
tx_stats=get_tx_stats(df, start_date_df="2016-01-01")

# Add the remaining number of fraudulent transactions for the last 7 days (test period)
tx_stats.loc[14:20,'nb_fraudulent_transactions_per_day_remaining']=list(nb_fraudulent_transactions_per_day_remaining)
# Add precision top k for the last 7 days (test period) 
tx_stats.loc[14:20,'precision_top_k_per_day']=precision_top_k_per_day_list

In [None]:
%%capture

# Plot the number of fraudulent transactions per day, the number remaining in the test period, and the number detected

cmap = plt.get_cmap('jet')
colors={'precision_top_k_per_day':cmap(0), 
        'nb_fraudulent_transactions_per_day':cmap(200),
        'nb_fraudulent_transactions_per_day_remaining':cmap(250),
       }

fraud_and_transactions_stats_fig, ax = plt.subplots(1, 1, figsize=(15,8))

# Training period
start_date_training = datetime.datetime.strptime("2016-06-25", "%Y-%m-%d")
delta_train = delta_delay = delta_test = 7

end_date_training = start_date_training+datetime.timedelta(days=delta_train-1)

# Test period
start_date_test = start_date_training+datetime.timedelta(days=delta_train+delta_delay)
end_date_test = start_date_training+datetime.timedelta(days=delta_train+delta_delay+delta_test-1)

get_template_tx_stats(ax, fs=20,
                      start_date_training=start_date_training,
                      title='Number of fraudulent transactions per day \n and number of detected fraudulent transactions',
                      delta_train=delta_train,
                      delta_delay=delta_delay,
                      delta_test=delta_test,
                      ylim=150
                     )

ax.plot(tx_stats['tx_date'], tx_stats['nb_fraudulent_transactions_per_day'], color=colors['nb_fraudulent_transactions_per_day'], label = '# fraudulent txs per day - Original')
ax.plot(tx_stats['tx_date'], tx_stats['nb_fraudulent_transactions_per_day_remaining'], color=colors['nb_fraudulent_transactions_per_day_remaining'], label = '# fraudulent txs per day - Remaining')
ax.plot(tx_stats['tx_date'], tx_stats['precision_top_k_per_day']*100, color=colors['precision_top_k_per_day'], label = '# detected fraudulent txs per day')
ax.legend(loc = 'upper left',bbox_to_anchor=(1.05, 1),fontsize=20)
    

In [None]:
fraud_and_transactions_stats_fig

Let's pay special attention to the test period, which is what we used to assess our Precision top-k. We can see that in the test period the number of fraudulent transactions per day varied between 30 and 50. After removing the cards known to be compromised from the training period, we're left with about 10 to 25 fraudulent transactions per day, and the fraud detector we built correctly detected roughly 1 to 10 of them. Overall, it detected less than about 20% of the actual fraudulent transactions.

### Card Precision Top-K

This is similar to Precision top-k, but instead of looking at individual transactions, we look at cards: out of the k cards with the highest risk of fraud each day, how many are actually compromised? We'll also plot it to get a visual idea of what is going on with our fraud detection system.

In [None]:
def card_precision_top_k(predictions_df, top_k):

    # Sort days by increasing order
    list_days=list(predictions_df['TX_Days'].unique())
    list_days.sort()
    
    card_precision_top_k_per_day_list = []
    nb_compromised_cards_per_day = []
    
    # For each day, compute precision top k
    for day in list_days:
        
        df_day = predictions_df[predictions_df['TX_Days']==day]
        df_day = df_day[['predictions', 'customerId', 'isFraud']]
        
        nb_compromised_cards_per_day.append(len(df_day[df_day.isFraud==1].customerId.unique()))
        
        _, card_precision_top_k = card_precision_top_k_day(df_day,top_k)
        
        card_precision_top_k_per_day_list.append(card_precision_top_k)
        
    # Compute the mean
    mean_card_precision_top_k = np.array(card_precision_top_k_per_day_list).mean()
    
    # Returns precision top k per day as a list, and resulting mean
    return nb_compromised_cards_per_day,card_precision_top_k_per_day_list,mean_card_precision_top_k

In [None]:
nb_compromised_cards_per_day_remaining\
,card_precision_top_k_per_day_list\
,mean_card_precision_top_k=card_precision_top_k(predictions_df=predictions_df, top_k=100)

print("Number of remaining compromised cards: "+str(nb_compromised_cards_per_day_remaining))
print("Card precision top-k per day: "+str(card_precision_top_k_per_day_list))
print("Average card precision top-k: "+str(mean_card_precision_top_k))

As we can see, our average card precision top-k is about 0.055, meaning that, on average, about 5.5% of the 100 cards our system flags as highest-risk each day are actually compromised.

In [None]:
# Compute the number of transactions per day, 
# fraudulent transactions per day and fraudulent cards per day
tx_stats=get_tx_stats(df, start_date_df="2016-01-01")

# Add the remaining number of compromised cards for the last 7 days (test period)
tx_stats.loc[14:20,'nb_compromised_cards_per_day_remaining']=list(nb_compromised_cards_per_day_remaining)

# Add the card precision top k for the last 7 days (test period) 
tx_stats.loc[14:20,'card_precision_top_k_per_day']=card_precision_top_k_per_day_list

In [None]:
%%capture

# Plot the number of transactions per day, fraudulent transactions per day and fraudulent cards per day

cmap = plt.get_cmap('jet')
colors={'card_precision_top_k_per_day':cmap(0), 
        'nb_compromised_cards_per_day':cmap(200),
        'nb_compromised_cards_per_day_remaining':cmap(250),
       }

fraud_and_transactions_stats_fig, ax = plt.subplots(1, 1, figsize=(15,8))

# Training period
start_date_training = datetime.datetime.strptime("2016-06-25", "%Y-%m-%d")
delta_train = delta_delay = delta_test = 7

end_date_training = start_date_training+datetime.timedelta(days=delta_train-1)

# Test period
start_date_test = start_date_training+datetime.timedelta(days=delta_train+delta_delay)
end_date_test = start_date_training+datetime.timedelta(days=delta_train+delta_delay+delta_test-1)

get_template_tx_stats(ax, fs=20,
                      start_date_training=start_date_training,
                      title='Number of compromised cards per day \n and number of detected compromised cards',
                      delta_train=delta_train,
                      delta_delay=delta_delay,
                      delta_test=delta_test,
                      ylim=150
                     )

# The color is set through the color keyword only; combining it with a 'b' format string triggers a matplotlib warning
ax.plot(tx_stats['tx_date'], tx_stats['nb_compromised_cards_per_day'], color=colors['nb_compromised_cards_per_day'], label = '# compromised cards per day - Original')
ax.plot(tx_stats['tx_date'], tx_stats['nb_compromised_cards_per_day_remaining'], color=colors['nb_compromised_cards_per_day_remaining'], label = '# compromised cards per day - Remaining')
ax.plot(tx_stats['tx_date'], tx_stats['card_precision_top_k_per_day']*100, color=colors['card_precision_top_k_per_day'], label = '# detected compromised cards per day')
ax.legend(loc = 'upper left', bbox_to_anchor=(1.05, 1), fontsize=20)
    

In [None]:
fraud_and_transactions_stats_fig

Much like in our previous graph, once we take out the cards known to be compromised during the training period, we are left with about 10 to 25 compromised cards per day in the test period, and our system detects roughly 2 to 10 of them.

## Let's Build Our Model

Now that we have decided on our performance metrics and have a decent overview of which models we should use, we will build on that knowledge to build a model.

In [None]:
# Note: We load more than three weeks of data, as the experiments in the next sections
# will require up to three months of data

# Load data from 2016-06-25 to 2016-11-25

DIR_OUTPUT = "./pickled-data-raw/"

BEGIN_DATE = "2016-06-25"
END_DATE = "2016-11-25"

print("Load  files")
%time df=read_from_files(DIR_OUTPUT, BEGIN_DATE, END_DATE)
print("{0} transactions loaded, containing {1} fraudulent transactions".format(len(df),df.isFraud.sum()))

output_feature="isFraud"

input_features=['transactionAmount','TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'CUSTOMER_ID_NB_TX_1DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW', 'CUSTOMER_ID_NB_TX_7DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW', 'CUSTOMER_ID_NB_TX_30DAY_WINDOW',
       'CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW', 'merchantName_NB_TX_1DAY_WINDOW',
       'merchantName_RISK_1DAY_WINDOW', 'merchantName_NB_TX_7DAY_WINDOW',
       'merchantName_RISK_7DAY_WINDOW', 'merchantName_NB_TX_30DAY_WINDOW',
       'merchantName_RISK_30DAY_WINDOW','cardPresent']

In [None]:
# Set the starting day for the training period, and the deltas
# The loaded data covers 2016-06-25 to 2016-11-25, so the training period must start in 2016
start_date_training = datetime.datetime.strptime("2016-06-25", "%Y-%m-%d")
delta_train=7
delta_delay=7
delta_test=7

In [None]:
def get_performances_train_test_sets(df, classifier,
                                     input_features, output_feature,
                                     start_date_training, 
                                     delta_train=7, delta_delay=7, delta_test=7,
                                     top_k_list=[100],
                                     type_test="Test", parameter_summary=""):

    # Get the training and test sets
    (train_df, test_df)=get_train_test_set(df,start_date_training,
                                           delta_train=delta_train,
                                           delta_delay=delta_delay,
                                           delta_test=delta_test)
    
    # Fit model
    start_time=time.time() 
    model_and_predictions_dictionary = fit_model_and_get_predictions(classifier, train_df, test_df, 
                                                                     input_features, output_feature)
    execution_time=time.time()-start_time
    
    # Compute fraud detection performances
    test_df['predictions']=model_and_predictions_dictionary['predictions_test']
    performances_df_test=performance_assessment(test_df, top_k_list=top_k_list)
    performances_df_test.columns=performances_df_test.columns.values+' '+type_test
    
    train_df['predictions']=model_and_predictions_dictionary['predictions_train']
    performances_df_train=performance_assessment(train_df, top_k_list=top_k_list)
    performances_df_train.columns=performances_df_train.columns.values+' Train'
    
    performances_df=pd.concat([performances_df_test,performances_df_train],axis=1)
    
    performances_df['Execution time']=execution_time
    performances_df['Parameters summary']=parameter_summary
    
    return performances_df

In [None]:
classifier = sklearn.tree.DecisionTreeClassifier(max_depth=2, random_state=0)

performances_df=get_performances_train_test_sets(df, classifier, 
                                                 input_features, output_feature,
                                                 start_date_training=start_date_training, 
                                                 delta_train=delta_train, 
                                                 delta_delay=delta_delay, 
                                                 delta_test=delta_test,
                                                 parameter_summary=2
                                                )

### Prequential Validation

As mentioned before, the point of building this fraud detection system is to maximize the detection of fraudulent transactions that will occur in the future. We introduced a delay period of 7 days to simulate how fraudulent transactions are usually discovered only after a period of investigation. The model is trained on a set of past transactions, but the performance of a model on its training data is often a bad indicator of its performance on future data.
This is due to **overfitting**. Increasing the degrees of freedom of a model (such as the depth of a decision tree) generally increases training performance, but past a certain point it tends to lower test performance.
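
The gap between training and test performance can be illustrated with a small sketch. It uses synthetic data from `make_classification` with deliberately noisy labels (an assumption made for illustration, not our transaction data) and a random split rather than the time-based split used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random, so that a perfect fit means memorizing noise
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

scores = {}
for depth in [2, 5, 10, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print("max_depth={0}: train accuracy={1:.3f}, test accuracy={2:.3f}".format(depth, *scores[depth]))
```

The fully grown tree fits its training set perfectly, yet its accuracy on the held-out half is lower, because part of what it learned was label noise.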

This can be addressed with a step called **Validation**. Validation procedures estimate, on past data, the test performance of a prediction model by setting aside a part of that data. The past data is split into two or more sets: the model is trained on one of them, and the set that was held out (the validation set) plays the role of the test set. The performance of a model on the validation set is then used as an estimate of the performance expected on the test set.

For this model we will use *Prequential Validation*, which consists of using training sets of similar sizes taken from older historical data.
Each fold shifts the training and validation sets by one block into the past, so the further back a fold goes, the longer the gap between its training data and our test set.
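
As a rough sketch of how the folds slide backwards in time, the hypothetical helper below only computes the date windows of each fold. The name `prequential_windows`, its parameters, and the example start date are assumptions for illustration; the notebook's own splitting function works on the transaction DataFrame and returns row indices instead.

```python
import datetime

def prequential_windows(start_date_training, n_folds=4,
                        delta_train=7, delta_delay=7, delta_assessment=7):
    """Return one (train_start, train_end, valid_start, valid_end) tuple per fold.

    Fold 0 is the most recent block; each later fold is shifted
    delta_assessment days further into the past.
    """
    windows = []
    for fold in range(n_folds):
        train_start = start_date_training - datetime.timedelta(days=fold * delta_assessment)
        train_end = train_start + datetime.timedelta(days=delta_train - 1)
        # The validation block starts after the training block and the delay period
        valid_start = train_start + datetime.timedelta(days=delta_train + delta_delay)
        valid_end = valid_start + datetime.timedelta(days=delta_assessment - 1)
        windows.append((train_start, train_end, valid_start, valid_end))
    return windows

# Hypothetical start date, chosen only so that older folds still fall inside the loaded data
start = datetime.datetime.strptime("2016-07-25", "%Y-%m-%d")
windows = prequential_windows(start, n_folds=3)
for w in windows:
    print([d.strftime("%Y-%m-%d") for d in w])
```

With these settings, the most recent fold trains on 2016-07-25 to 2016-07-31 and validates on 2016-08-08 to 2016-08-14, and every older fold is the same pattern moved back by one week.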

To use Card Precision top-k as a metric in sklearn, we need to create a custom scoring function. This is done by passing the transactions_df DataFrame as an argument into the function.

### Grid search
We will also use Grid Search, a tool in sklearn that allows us to fit and assess models with different [hyperparameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/). This is done through the GridSearchCV class, whose main parameters are:

1. estimator: The estimator to use, which will be a decision tree in the following example.
2. param_grid: The set of hyperparameters for the estimator. We will vary the decision tree depth (max_depth) parameter, and set the random state (random_state) for reproducibility.
3. scoring: The scoring functions to use. We will use the AUC ROC, Average Precision, and CP@100.
4. n_jobs: The number of cores to use. We will set it to -1, that is, to using all the available cores.
5. refit: Whether the model should be refitted on all the data after the cross-validation. This will be set to False, since we only require the results of the cross-validation.
6. cv: The cross-validation splitting strategy. The prequential validation will be used, by passing the indices returned by the prequentialSplit function.
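
To show how these arguments fit together, here is a self-contained toy sketch. Everything in it is an assumption made for illustration: the data is synthetic, the folds are hand-made index ranges standing in for the output of prequentialSplit, CP@100 is left out because it needs the custom scorer, and `n_jobs` is set to 1 only to keep the toy run light.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the transactions table
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0)

# Hand-made (train indices, validation indices) pairs, each with a gap between the two blocks
cv_folds = [
    (np.arange(0, 200), np.arange(300, 400)),
    (np.arange(100, 300), np.arange(400, 500)),
    (np.arange(200, 400), np.arange(500, 600)),
]

param_grid = {'max_depth': [2, 4, 8], 'random_state': [0]}
scoring = {'roc_auc': 'roc_auc', 'average_precision': 'average_precision'}

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring=scoring,
    cv=cv_folds,
    refit=False,  # we only need the cross-validation results
    n_jobs=1,     # the text uses -1 (all cores); 1 keeps this toy example light
)
grid_search.fit(X, y)

# With several metrics, results are stored under 'mean_test_<metric name>'
results = grid_search.cv_results_
for params, auc, ap in zip(results['params'],
                           results['mean_test_roc_auc'],
                           results['mean_test_average_precision']):
    print(params['max_depth'], round(auc, 3), round(ap, 3))
```

Passing an explicit list of index pairs to `cv` is what lets the time-ordered prequential folds replace sklearn's default shuffled splits.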

We'll set our parameters and instantiate a GridSearchCV object.