# Credit Card Fraud Detection Using Machine Learning in the Cloud

Credit card fraud has been a problem for businesses and financial institutions for decades, resulting, in recent years, in billions of dollars in losses on a yearly basis. To take on the large amount of data generated around financial transactions, large computing resources will be required. Additionally, to review large numbers of transactions in an efficient and timely manner, human review would not be suitable. Therefore, to address these challenges machine learning in the cloud seems to be the solution. With this project, we will be developing a machine learning model which accurately identifies fraudulent credit card transactions and deploy it to the cloud.

### Objectives
* Find the best model of Credit Card Fraud Prediction
* Compare Features importance
* Data Cleaning and Exploration
* Resampling Data
* Manage Imbalance Data
* Use a List of data that was never used in the train model(Slice).


### Import libraries

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime, date
import os
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Load the File

In [2]:
train = pd.read_csv('fraudTrain.csv')
test = pd.read_csv('fraudTest1.csv')
#This file slice is used for evaluation of the model
slice = pd.read_csv('slice2000.csv')


### Creating the Dataframe and start the dataset from column 1 to remove the column Unnamed


In [3]:
df_train = train.iloc[0:,1:]
df_test = test.iloc[20000:,1:]
df_slice = slice.iloc[0:,1:]
#df_train.head(2)


### Functions that will be used to transform and analys the data


In [4]:
def is_weekend(tx_datetime):
    
    # Convert date into weekday (0 is Monday, 6 is Sunday)
    weekday = tx_datetime.weekday()
    # Binary value: 0 if weekday, 1 if weekend
    is_weekend = weekday>=5
    
    return int(is_weekend)

def is_night(tx_datetime):
    # Get the hour of the transaction
    tx_hour = tx_datetime.hour
    # Binary value: 1 if hour less than 6am and more than 10pm, and 0 otherwise
    is_night = (tx_hour<=6 and tx_hour>=22)
    return int(is_night)

def show_status(df):
    
    #Get the status about null values, cell type
    total_null = df.isnull().sum()
    percent_null = 100* (total_null/len(df))
    cell_type = df.dtypes
    unique_values = df.nunique()

    new_table = pd.concat([total_null,percent_null,cell_type,unique_values], axis=1)
    tb_columns = new_table.rename(columns = {0: 'Null Values', 1: '% of Null Values', 2: 'Type', 3:'Unique Values'})
    
    return tb_columns

#function to return highly correlated column above a threshold
def correlation(dataset, threshold):
    col_corr = set() # This set stores the highly correlated columns
    corr_matrix = dataset.corr() #correlation matrix
    #traversing the correlation matrix
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i,j] > threshold:
                colname = corr_matrix.columns[i] #selecting columns above threshold
                col_corr.add(colname) #adding columns to set
    return col_corr

In [5]:
show_status(df_train)

Unnamed: 0,Null Values,% of Null Values,Type,Unique Values
trans_date_trans_time,0,0.0,object,1274791
cc_num,0,0.0,int64,983
merchant,0,0.0,object,693
category,0,0.0,object,14
amt,0,0.0,float64,52928
first,0,0.0,object,352
last,0,0.0,object,481
gender,0,0.0,object,2
street,0,0.0,object,983
city,0,0.0,object,894



### Rename Columns


In [6]:
#train dataset
df_train.rename(columns={'trans_date_trans_time':'TX_DATETIME','cc_num':'ACCOUNT','amt':'AMOUNT', 'category':'CATEGORY'}, inplace=True)
#test dataset
df_test.rename(columns={'trans_date_trans_time':'TX_DATETIME','cc_num':'ACCOUNT','amt':'AMOUNT', 'category':'CATEGORY'}, inplace=True)


### Dataset Analysis


In [7]:
#train dataset
num_true_cases = len(df_train[df_train.is_fraud == 1])
num_false_cases = len(df_train[df_train.is_fraud == 0])
print('TRAINING - Number of Fraud cases:',num_true_cases,'-', np.round(num_true_cases / df_train.shape[0] * 100,2),'%')
print('TRAINING - Number of Normal cases:', num_false_cases,'-',np.round(num_false_cases / df_train.shape[0] * 100,2),'%')

#test dataset
num_true_cases = len(df_test[df_test.is_fraud == 1])
num_false_cases = len(df_test[df_test.is_fraud == 0])
print('TESTING - Number of Fraud cases:',num_true_cases,'-', np.round(num_true_cases / df_test.shape[0] * 100,2),'%')
print('TESTING - Number of Normal cases:', num_false_cases,'-',np.round(num_false_cases / df_test.shape[0] * 100,2),'%')

TRAINING - Number of Fraud cases: 7506 - 0.58 %
TRAINING - Number of Normal cases: 1289169 - 99.42 %
TESTING - Number of Fraud cases: 1977 - 0.38 %
TESTING - Number of Normal cases: 513742 - 99.62 %



It is clear that the amount of fraudulent transactions is not balanced


### Data Tranformation
* We use the functions to transform the data type of the column date and add new columns TX_DATETIME, Transaction_Date and Age


In [8]:
#train dataset
df_train['TX_DATETIME'] = pd.to_datetime(df_train['TX_DATETIME'], errors='coerce')
df_train['dob'] = pd.to_datetime(df_train['dob'], errors='coerce')
df_train['Transaction_Date'] = (df_train['TX_DATETIME']).dt.date.astype('datetime64[ns]')

#Manually calculate the age - seems to be faster
df_train['Age'] = pd.Timestamp('now').year - df_train['dob'].dt.year

#test dataset
df_test['TX_DATETIME'] = pd.to_datetime(df_test['TX_DATETIME'], errors='coerce')
df_test['dob'] = pd.to_datetime(df_test['dob'], errors='coerce')
df_test['Transaction_Date'] = (df_test['TX_DATETIME']).dt.date.astype('datetime64[ns]')
#Manually calculate the age - seems to be faster
df_test['Age'] = pd.Timestamp('now').year - df_test['dob'].dt.year

In [9]:
#Data Tranformation

#It takes 4 sec because it does comparation with the functions
%time df_train['TX_DURING_WEEKEND']=df_train.TX_DATETIME.apply(is_weekend)
%time df_train['TX_DURING_NIGHT']=df_train.TX_DATETIME.apply(is_night)
#it takes 90ms because it only converts the values using dt properties
%time df_train['DAY'] = df_train['TX_DATETIME'].dt.day
%time df_train['MONTH'] = df_train['TX_DATETIME'].dt.month
%time df_train['YEAR'] = df_train['TX_DATETIME'].dt.year

df_test['TX_DURING_WEEKEND']=df_test.TX_DATETIME.apply(is_weekend)
df_test['TX_DURING_NIGHT']=df_test.TX_DATETIME.apply(is_night)
df_test['DAY'] = df_test['TX_DATETIME'].dt.day
df_test['MONTH'] = df_test['TX_DATETIME'].dt.month
df_test['YEAR'] = df_test['TX_DATETIME'].dt.year

Wall time: 4.59 s
Wall time: 4.5 s
Wall time: 92 ms
Wall time: 91 ms
Wall time: 96 ms


In [10]:
df_train.head(2)

Unnamed: 0,TX_DATETIME,ACCOUNT,merchant,CATEGORY,AMOUNT,first,last,gender,street,city,...,merch_lat,merch_long,is_fraud,Transaction_Date,Age,TX_DURING_WEEKEND,TX_DURING_NIGHT,DAY,MONTH,YEAR
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.011293,-82.048315,0,2019-01-01,34,0,0,1,1,2019
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,49.159047,-118.186462,0,2019-01-01,44,0,0,1,1,2019


In [11]:
#Data Tranformation
#drop column that we don't need
df_train.drop(['dob','Transaction_Date','TX_DATETIME', 'ACCOUNT', 'first', 'last', 'street', 'city', 'zip', 'lat', 'long', 'city_pop', 'job', 'trans_num', 'unix_time', 'merch_lat', 'merch_long'] , axis=1, inplace=True)
df_test.drop(['dob','Transaction_Date','TX_DATETIME', 'ACCOUNT', 'first', 'last', 'street', 'city', 'zip', 'lat', 'long', 'city_pop', 'job', 'trans_num', 'unix_time', 'merch_lat', 'merch_long'] , axis=1, inplace=True)

show_status(df_train)

Unnamed: 0,Null Values,% of Null Values,Type,Unique Values
merchant,0,0.0,object,693
CATEGORY,0,0.0,object,14
AMOUNT,0,0.0,float64,52928
gender,0,0.0,object,2
state,0,0.0,object,51
is_fraud,0,0.0,int64,2
Age,0,0.0,int64,81
TX_DURING_WEEKEND,0,0.0,int64,2
TX_DURING_NIGHT,0,0.0,int64,1
DAY,0,0.0,int64,31


In [12]:
print("Number of is_fraud data",df_train['is_fraud'].value_counts())

Number of is_fraud data 0    1289169
1       7506
Name: is_fraud, dtype: int64



### Encoding Nominal Columns


In [13]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
ohe.fit(df_train[['CATEGORY']])

"""
#This function allow us to pass a dataframe that has the column name "Category" and transform the values into numerical values
It encodes the column into integer values using the OneHotEncoder from Scikitlearn
"""

def get_ohe(df):
    temp_df = pd.DataFrame(data=ohe.transform(df[['CATEGORY']]), columns=ohe.get_feature_names_out())
    df.drop(columns=['CATEGORY'], axis=1, inplace=True)
    df = pd.concat([df.reset_index(drop=True), temp_df], axis=1)
    return df


df_train = get_ohe(df_train)
df_test = get_ohe(df_test)

In [14]:
df_downsampled = df_train
df_downsampled

Unnamed: 0,merchant,AMOUNT,gender,state,is_fraud,Age,TX_DURING_WEEKEND,TX_DURING_NIGHT,DAY,MONTH,...,CATEGORY_grocery_pos,CATEGORY_health_fitness,CATEGORY_home,CATEGORY_kids_pets,CATEGORY_misc_net,CATEGORY_misc_pos,CATEGORY_personal_care,CATEGORY_shopping_net,CATEGORY_shopping_pos,CATEGORY_travel
0,"fraud_Rippin, Kub and Mann",4.97,F,NC,0,34,0,0,1,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,"fraud_Heller, Gutmann and Zieme",107.23,F,WA,0,44,0,0,1,1,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,fraud_Lind-Buckridge,220.11,M,ID,0,60,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"fraud_Kutch, Hermiston and Farrell",45.00,M,MT,0,55,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,fraud_Keeling-Crist,41.96,M,VA,0,36,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1296670,fraud_Reichel Inc,15.56,M,UT,0,61,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1296671,fraud_Abernathy and Sons,51.70,M,MD,0,43,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1296672,fraud_Stiedemann Ltd,105.93,M,NM,0,55,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1296673,"fraud_Reinger, Weissnat and Strosin",74.90,M,SD,0,42,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
#Gender
#Change gender from nominal to numerical
# Train data
df_downsampled['gender'] = df_downsampled['gender'].replace(['F','M'],[0,1])
# Test data
df_test['gender'] = df_test['gender'].replace(['F','M'],[0,1])

Feature Selection
We can identify new columns that were created by the OneHotEncoder

In [16]:
#Select Data
select_data = df_downsampled
select_data.columns

Index(['merchant', 'AMOUNT', 'gender', 'state', 'is_fraud', 'Age',
       'TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'DAY', 'MONTH', 'YEAR',
       'CATEGORY_entertainment', 'CATEGORY_food_dining',
       'CATEGORY_gas_transport', 'CATEGORY_grocery_net',
       'CATEGORY_grocery_pos', 'CATEGORY_health_fitness', 'CATEGORY_home',
       'CATEGORY_kids_pets', 'CATEGORY_misc_net', 'CATEGORY_misc_pos',
       'CATEGORY_personal_care', 'CATEGORY_shopping_net',
       'CATEGORY_shopping_pos', 'CATEGORY_travel'],
      dtype='object')

In [17]:
select_data = select_data[['merchant', 'AMOUNT', 'gender', 'state', 'is_fraud', 'Age',
       'TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'DAY', 'MONTH', 'YEAR',
       'CATEGORY_food_dining', 'CATEGORY_gas_transport',
       'CATEGORY_grocery_net', 'CATEGORY_grocery_pos',
       'CATEGORY_health_fitness', 'CATEGORY_home', 'CATEGORY_kids_pets',
       'CATEGORY_misc_net', 'CATEGORY_misc_pos', 'CATEGORY_personal_care',
       'CATEGORY_shopping_net', 'CATEGORY_shopping_pos', 'CATEGORY_travel']]
select_data

Unnamed: 0,merchant,AMOUNT,gender,state,is_fraud,Age,TX_DURING_WEEKEND,TX_DURING_NIGHT,DAY,MONTH,...,CATEGORY_grocery_pos,CATEGORY_health_fitness,CATEGORY_home,CATEGORY_kids_pets,CATEGORY_misc_net,CATEGORY_misc_pos,CATEGORY_personal_care,CATEGORY_shopping_net,CATEGORY_shopping_pos,CATEGORY_travel
0,"fraud_Rippin, Kub and Mann",4.97,0,NC,0,34,0,0,1,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,"fraud_Heller, Gutmann and Zieme",107.23,0,WA,0,44,0,0,1,1,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,fraud_Lind-Buckridge,220.11,1,ID,0,60,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"fraud_Kutch, Hermiston and Farrell",45.00,1,MT,0,55,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,fraud_Keeling-Crist,41.96,1,VA,0,36,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1296670,fraud_Reichel Inc,15.56,1,UT,0,61,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1296671,fraud_Abernathy and Sons,51.70,1,MD,0,43,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1296672,fraud_Stiedemann Ltd,105.93,1,NM,0,55,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1296673,"fraud_Reinger, Weissnat and Strosin",74.90,1,SD,0,42,1,0,21,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
select_data.columns

Index(['merchant', 'AMOUNT', 'gender', 'state', 'is_fraud', 'Age',
       'TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'DAY', 'MONTH', 'YEAR',
       'CATEGORY_food_dining', 'CATEGORY_gas_transport',
       'CATEGORY_grocery_net', 'CATEGORY_grocery_pos',
       'CATEGORY_health_fitness', 'CATEGORY_home', 'CATEGORY_kids_pets',
       'CATEGORY_misc_net', 'CATEGORY_misc_pos', 'CATEGORY_personal_care',
       'CATEGORY_shopping_net', 'CATEGORY_shopping_pos', 'CATEGORY_travel'],
      dtype='object')

In [19]:
from sklearn.model_selection import train_test_split
feature_cols = ['AMOUNT', 'gender', 'Age',
       'TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'DAY', 'MONTH', 'YEAR',
       'CATEGORY_food_dining', 'CATEGORY_gas_transport',
       'CATEGORY_grocery_net', 'CATEGORY_grocery_pos',
       'CATEGORY_health_fitness', 'CATEGORY_home', 'CATEGORY_kids_pets',
       'CATEGORY_misc_net', 'CATEGORY_misc_pos', 'CATEGORY_personal_care',
       'CATEGORY_shopping_net', 'CATEGORY_shopping_pos', 'CATEGORY_travel']
X = select_data[feature_cols]
y = select_data['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=42)

In [20]:
X_train

Unnamed: 0,AMOUNT,gender,Age,TX_DURING_WEEKEND,TX_DURING_NIGHT,DAY,MONTH,YEAR,CATEGORY_food_dining,CATEGORY_gas_transport,...,CATEGORY_grocery_pos,CATEGORY_health_fitness,CATEGORY_home,CATEGORY_kids_pets,CATEGORY_misc_net,CATEGORY_misc_pos,CATEGORY_personal_care,CATEGORY_shopping_net,CATEGORY_shopping_pos,CATEGORY_travel
992821,2.38,1,26,1,0,9,2,2020,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
833489,6.53,0,38,0,0,11,12,2019,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
517107,6.62,1,40,1,0,11,8,2019,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
166051,24.86,0,39,0,0,29,3,2019,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
473161,25.65,0,38,1,0,27,7,2019,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110268,72.38,0,57,0,0,4,3,2019,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
259178,2.33,0,58,0,0,9,5,2019,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
131932,118.27,0,44,0,0,13,3,2019,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
671155,5.60,1,80,1,0,13,10,2019,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0



Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. We use the sklearn MinMaxScaler


In [21]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [22]:
total = np.count_nonzero(y_test == 1)
total

2285

In [23]:
total = np.count_nonzero(y_test == 0)
total

386718


In order to deal with the use the Sampling technique SMOTE (Synthetic Minority Oversampling TEchnique) that consists of synthesizing elements
for the minority class, based on those that already exist. It works randomly picingk a point from the minority class and computing the k-nearest neighbors for this
point. The synthetic points are added between the chosen point and its neighbors.
This technique generates synthetic data for the minority class. https://imbalancedlearn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html


In [24]:
from imblearn.over_sampling import SMOTE
#balancing using SMOTE method
smote = SMOTE(sampling_strategy='minority', random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

In [25]:
y_train.value_counts()

0    902451
1    902451
Name: is_fraud, dtype: int64

### Model

DecisionTreeClassifier
We compared the model using entropy and gini for the Attribute Selection. 
The entropy is a measure of the randomness in the information being processed that vary from 0 to 1.
The gini Selection would also vary with values from 0 to 1. Gini select the class with the least value of gini index that get prefered when they are being ranked.

1 - criterion="entropy", max_depth=10,random_state=42, class_weight='balanced'
* This was the best configuration for our model, if we reduce the number of max_depth to 6, the recall reduces to 80%.

2 - criterion="gini", max_depth=20,random_state=42
* This configuration didn't provide good results and the recall was below 70 %

In [26]:
#Learning with Imbalanced Data
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# class_weight='balanced' to balance the classes which we have 0 and 1 in the outcome column
model = DecisionTreeClassifier(criterion="entropy", max_depth=10,random_state=42, class_weight='balanced')
model = model.fit(X_train, y_train)

Confusion Matrix
- Accuracy
- Precision
- Recall

In [27]:
# Accuracy of train data
y_pred_train = model.predict(X_train)
print("Accuracy Train Data:",metrics.accuracy_score(y_train, y_pred_train))

# Accuracy of test data
y_pred_test = model.predict(X_test)
print("Accuracy Test Data:",metrics.accuracy_score(y_test, y_pred_test))

Accuracy Train Data: 0.9656502125877194
Accuracy Test Data: 0.9622033763235759


In [28]:
from sklearn.metrics import confusion_matrix
# Train data
print("Confusion Matrix")
pd.DataFrame(confusion_matrix(y_train,y_pred_train), columns=['Predicted Nagative', 'Predicted Positive'], index=['Actual Negative', 'Actual Positive'])


Confusion Matrix


Unnamed: 0,Predicted Nagative,Predicted Positive
Actual Negative,868724,33727
Actual Positive,28271,874180


In [29]:
# Train data
print('Classifiaction report: \n', metrics.classification_report(y_train,y_pred_train))

Classifiaction report: 
               precision    recall  f1-score   support

           0       0.97      0.96      0.97    902451
           1       0.96      0.97      0.97    902451

    accuracy                           0.97   1804902
   macro avg       0.97      0.97      0.97   1804902
weighted avg       0.97      0.97      0.97   1804902



In [30]:
# Test data
print("Confusion Matrix")
pd.DataFrame(confusion_matrix(y_test,y_pred_test), columns=['Predicted Nagative', 'Predicted Positive'], index=['Actual Negative', 'Actual Positive'])

Confusion Matrix


Unnamed: 0,Predicted Nagative,Predicted Positive
Actual Negative,372128,14590
Actual Positive,113,2172



The test dataset has around 2,259 fraudlent records and the ML model was able to identify 2139, It predicted incorrectly around 120. 
Also, the ML model falsily predicted 14537 normal transcation as fraudlents that would be around 0.039% of the total number of transactions.


In [31]:
# Test data
print('Classifiaction report: \n', metrics.classification_report(y_test,y_pred_test))

Classifiaction report: 
               precision    recall  f1-score   support

           0       1.00      0.96      0.98    386718
           1       0.13      0.95      0.23      2285

    accuracy                           0.96    389003
   macro avg       0.56      0.96      0.60    389003
weighted avg       0.99      0.96      0.98    389003




The DecisionTreeClassifier Algorithms have been implemented upon the processed train dataset and we used the test using the test dataset. Based on the metrics, we can see that the model provide a high recall values that this means that it can identify when the fradulent transaction is actually yes.



### Here we are going to evaluate the model with 20.000 transactions that weren't used in the train model

In [32]:

import pandas as pd

df_sample = df_slice
df_sample.rename(columns={'trans_date_trans_time':'TX_DATETIME','cc_num':'ACCOUNT','amt':'AMOUNT', 'category':'CATEGORY'}, inplace=True)
df_sample

Unnamed: 0,TX_DATETIME,ACCOUNT,merchant,CATEGORY,AMOUNT,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,Columbia,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,Altonah,...,40.3207,-110.4360,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,Bellmore,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.495810,-74.196111,0
3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,Titusville,...,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,Falmouth,...,44.2529,-85.0170,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,2020-06-28 12:19:29,3514865930894695,fraud_Bednar PLC,kids_pets,14.18,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,...,32.9396,-105.8189,899,Naval architect,1967-08-30,d4d106d5c9ffc1701c5f09afc2508685,1372421969,32.562022,-106.265694,0
19996,2020-06-28 12:19:41,3514897282719543,fraud_Lubowitz-Walter,kids_pets,98.79,Steven,Faulkner,M,841 Cheryl Centers Suite 115,Farmington,...,42.9580,-77.3083,10717,Cytogeneticist,1952-10-13,229e743464bd721b5654c4b196387291,1372421981,42.193183,-78.054448,0
19997,2020-06-28 12:21:01,4584931703207308232,fraud_Stamm-Rodriguez,misc_pos,46.10,Amanda,Gomez,F,8152 Brittany Centers,Dallas,...,32.8463,-96.6972,1263321,"Designer, ceramics/pottery",1975-04-16,877fbd4ffa4750c24a6a18244b079bba,1372422061,32.729529,-97.358421,0
19998,2020-06-28 12:21:17,36078114201167,"fraud_Schumm, McLaughlin and Carter",food_dining,22.89,Christopher,Horn,M,956 Sanchez Highway,Mallie,...,37.2692,-82.9161,798,Facilities manager,1926-06-26,182ce724e8c30907706e40b7690a3999,1372422077,37.758363,-82.949838,0


### Data Tranformation
* We use the functions to transform the data type of the column date and add new columns TX_DATETIME, Transaction_Date and Age

In [33]:
df_sample['TX_DATETIME'] = pd.to_datetime(df_sample['TX_DATETIME'], errors='coerce')
df_sample['dob'] = pd.to_datetime(df_sample['dob'], errors='coerce')
df_sample['Transaction_Date'] = (df_sample['TX_DATETIME']).dt.date.astype('datetime64[ns]')
#Manually calculate the age - seems to be faster
df_sample['Age'] = pd.Timestamp('now').year - df_sample['dob'].dt.year

In [34]:
#It takes 4 sec because it does comparation with the functions
%time df_sample['TX_DURING_WEEKEND']=df_sample.TX_DATETIME.apply(is_weekend)
%time df_sample['TX_DURING_NIGHT']=df_sample.TX_DATETIME.apply(is_night)
#it takes 90ms because it only converts the values using dt properties
%time df_sample['DAY'] = df_sample['TX_DATETIME'].dt.day
%time df_sample['MONTH'] = df_sample['TX_DATETIME'].dt.month
%time df_sample['YEAR'] = df_sample['TX_DATETIME'].dt.year

Wall time: 68 ms
Wall time: 64 ms
Wall time: 2.01 ms
Wall time: 1.97 ms
Wall time: 2 ms


### Encoding Nominal Columns

In [35]:
# This line enconde the colum Category
df_sample = get_ohe(df_sample)

In [36]:
df_sample['is_fraud']

0        0
1        0
2        0
3        0
4        0
        ..
19995    0
19996    0
19997    0
19998    0
19999    0
Name: is_fraud, Length: 20000, dtype: int64

In [37]:
#Change gender from nominal to numerical

# Train data
df_sample['gender'] = df_sample['gender'].replace(['F','M'],[0,1])
# Test data
df_sample['gender'] = df_sample['gender'].replace(['F','M'],[0,1])

In [38]:
df_sample.drop(['dob','Transaction_Date','TX_DATETIME', 'ACCOUNT', 'first', 'last', 'street', 'city', 'zip', 'lat', 'long', 'city_pop', 'job', 'trans_num', 'unix_time', 'merch_lat', 'merch_long'] , axis=1, inplace=True)

select_data2 =df_sample
select_data2.columns

Index(['merchant', 'AMOUNT', 'gender', 'state', 'is_fraud', 'Age',
       'TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'DAY', 'MONTH', 'YEAR',
       'CATEGORY_entertainment', 'CATEGORY_food_dining',
       'CATEGORY_gas_transport', 'CATEGORY_grocery_net',
       'CATEGORY_grocery_pos', 'CATEGORY_health_fitness', 'CATEGORY_home',
       'CATEGORY_kids_pets', 'CATEGORY_misc_net', 'CATEGORY_misc_pos',
       'CATEGORY_personal_care', 'CATEGORY_shopping_net',
       'CATEGORY_shopping_pos', 'CATEGORY_travel'],
      dtype='object')


We divide the dataset in x, y values. The Label "is_fraud" is the value that we want to predict and the other columns are the values that we are going to determine the best features for the machine learning model.


In [39]:
X = select_data2[feature_cols]
Y = select_data2['is_fraud']

In [40]:
from sklearn import preprocessing
feature_cols = ['AMOUNT', 'gender', 'Age',
       'TX_DURING_WEEKEND', 'TX_DURING_NIGHT', 'DAY', 'MONTH', 'YEAR',
       'CATEGORY_food_dining', 'CATEGORY_gas_transport',
       'CATEGORY_grocery_net', 'CATEGORY_grocery_pos',
       'CATEGORY_health_fitness', 'CATEGORY_home', 'CATEGORY_kids_pets',
       'CATEGORY_misc_net', 'CATEGORY_misc_pos', 'CATEGORY_personal_care',
       'CATEGORY_shopping_net', 'CATEGORY_shopping_pos', 'CATEGORY_travel']

#Reindex is used in case that we don't find the column. This line will match all the columns from the model
X = X.reindex(columns=feature_cols).fillna(0)

# Normalization
X = scaler.transform(X)

In [41]:
X

array([[6.79101775e-05, 1.00000000e+00, 4.56790123e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.05297286e-03, 0.00000000e+00, 1.85185185e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.47065696e-03, 0.00000000e+00, 4.32098765e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.64663925e-03, 0.00000000e+00, 3.70370370e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [7.99222465e-04, 1.00000000e+00, 9.75308642e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.28677081e-04, 0.00000000e+00, 4.07407407e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [42]:
y_ = model.predict(X)
y_

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [43]:
print("Accuracy Test Data:",metrics.accuracy_score(Y, y_))

Accuracy Test Data: 0.96155


In [44]:
len(y_)

20000

In [45]:
total = np.count_nonzero(y_ == 1)
total

829

In [46]:
total = np.count_nonzero(y_ == 0)
total

19171

In [47]:
print("Number of is_fraud data",df_sample['is_fraud'].value_counts())

Number of is_fraud data 0    19932
1       68
Name: is_fraud, dtype: int64


In [48]:
print('Classifiaction report: \n', metrics.classification_report(Y, y_))

Classifiaction report: 
               precision    recall  f1-score   support

           0       1.00      0.96      0.98     19932
           1       0.08      0.94      0.14        68

    accuracy                           0.96     20000
   macro avg       0.54      0.95      0.56     20000
weighted avg       1.00      0.96      0.98     20000



In [49]:
print("Confusion Matrix")

pd.DataFrame(confusion_matrix(Y, y_), columns=['Predicted Nagative', 'Predicted Positive'], index=['Actual Negative', 'Actual Positive'])

Confusion Matrix


Unnamed: 0,Predicted Nagative,Predicted Positive
Actual Negative,19167,765
Actual Positive,4,64



The slice dataset has around 68 fraudlent records that we know and the ML model was able to identify 63, It predicted incorrectly around 5 records. 
Also, the ML model falsily predicted 726 normal transcation as fraudlents that would be around 0.029% of the total number of transactions.


TP = TRUE POSITIVE
TN = TRUE NEGATIVE
FP = FALSE POSITIVE
FN = FALSE NEGATIVE
#accuracy = (TP+TN)/Total

In [50]:
#Copy the results values from the table above

tp = 64
tn = 19167
fp = 765
fn = 4

total = tp+tn+fp+fn

actualPositive=tp+fp
actualNegative=fn+tn
predPositive = tp + fn
predNegative = fp + tn

acc = (tp+tn)/total
print(acc)

0.96155


In [51]:
#how often is wrong
rate = (fp+fn)/total
print(rate)

0.03845


In [52]:
#true positive rate
print((tp/actualPositive))

0.07720144752714113


In [53]:
#false positive rate
print((fp/actualNegative))

0.03990402169944186


In [54]:
#True negative rate
print((tn/actualNegative))

0.9997913515205258


In [55]:
#recall
print(tp/(fn+tp))

0.9411764705882353


In the model above, it is clear that we have good precision and high value of recall that results in a good prediction for fraudlent transactions