# Credit Card Fraud Detection With Machine Learning in Python

This example uses classification models to help a credit card company detect potentially fraudulent transactions.
The original example can be found [here](https://medium.com/codex/credit-card-fraud-detection-with-machine-learning-in-python-ac7281991d87).

### Notes on running this example:

By default, the example runs with Bodo, so the data is distributed in chunks across processes.
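To make "distributed in chunks" concrete, here is a minimal sketch of one common block-partitioning rule, applied to the 284,807-row dataset and the 4 processes used later in this example. The helper name `chunk_sizes` is hypothetical, and whether Bodo uses exactly this rule internally is an assumption; the per-rank shapes reported by `df.shape` further down are consistent with it.

```python
def chunk_sizes(n_rows, n_ranks):
    """Split n_rows into n_ranks near-equal contiguous chunks (illustrative only)."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks each take one additional row.
    return [base + 1 if rank < extra else base for rank in range(n_ranks)]


sizes = chunk_sizes(284_807, 4)
print(sizes)       # [71202, 71202, 71202, 71201]
print(sum(sizes))  # 284807
```

Each process then works only on its own chunk, which is why per-rank outputs in this example show different rows.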

The results shown here come from a run on a local MacBook Pro. You can also run the example on our platform using, for example, one **m5.12xlarge** instance (24 cores, 192 GiB memory).

The dataset can be downloaded from Kaggle [here](https://www.kaggle.com/mlg-ulb/creditcardfraud) or read directly from an S3 bucket (`s3://bodo-examples-data/creditcard/creditcard.csv`).

To run the code:
1. If you read the data from the S3 bucket link, add your AWS account credentials so the data can be accessed.
2. If you want to run the example using pandas only (without Bodo):
    1. Comment out the `%%px` magic and the `@bodo.jit` decorator in all code cells.
    2. Then re-run the cells from the beginning.
3. For the xgboost package, build it from source with MPI enabled (this step is already done on the Bodo Platform).
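
As an alternative to commenting out each decorator by hand, the switch between Bodo and plain pandas could be driven by a single flag. The sketch below is only an illustration: `maybe_jit` and `USE_BODO` are hypothetical names that are not part of Bodo or of this example, and the `%%px` magic would still have to be removed from the cells manually.

```python
def maybe_jit(use_bodo, **jit_kwargs):
    """Return bodo.jit(**jit_kwargs) when use_bodo is True, else a no-op decorator."""
    if use_bodo:
        import bodo  # imported lazily so pandas-only runs do not need Bodo installed
        return bodo.jit(**jit_kwargs)

    def passthrough(func):
        # Leave the function untouched so it runs as regular Python.
        return func

    return passthrough


USE_BODO = False  # hypothetical flag: set to True to compile with Bodo


@maybe_jit(USE_BODO, cache=True)
def add_one(x):
    return x + 1


print(add_one(41))
```

With `USE_BODO = False` the decorated functions behave exactly like their undecorated versions.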

### Start an IPyParallel cluster (skip if running on Bodo Platform)
Run the following code in a cell to start an IPyParallel cluster. This example uses up to 4 cores. Skip this step and the next one (Verifying your setup) if you are running on the Bodo Platform.

In [35]:
import ipyparallel as ipp
import psutil

# Use at most 4 engines, capped by the number of physical cores.
n = min(psutil.cpu_count(logical=False), 4)
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 4 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>



### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [36]:
%%px
import bodo
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")



[stdout:1] Hello World from rank 1. Total ranks=4


[stdout:2] Hello World from rank 2. Total ranks=4


[stdout:3] Hello World from rank 3. Total ranks=4


[stdout:0] Hello World from rank 0. Total ranks=4




## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - pandas to work with data
 - NumPy to work with arrays
 - scikit-learn to build and evaluate classification models
 - XGBoost for its gradient-boosted tree classifier (`XGBClassifier`)
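
Before running the import cell, it can help to check which of these packages are actually installed. The following is a small sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are assumptions made for illustration (note that scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names assumed from the cells in this example.
REQUIRED = ["bodo", "pandas", "numpy", "sklearn", "xgboost", "ipyparallel", "psutil"]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

This only checks that a package can be located; it does not verify that xgboost was built with MPI enabled.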

In [22]:
%%px
import warnings
warnings.filterwarnings("ignore")

import bodo
import numpy as np
import pandas as pd
import time
from sklearn.preprocessing import StandardScaler # data normalization
from sklearn.model_selection import train_test_split # data split
from sklearn.linear_model import LogisticRegression # Logistic regression algorithm
from sklearn.ensemble import RandomForestClassifier # Random forest tree algorithm
from xgboost import XGBClassifier # XGBoost algorithm
from sklearn.svm import LinearSVC # SVM classification algorithm
from sklearn.metrics import accuracy_score # evaluation metric



In [23]:
%%px
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret_access_key"
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"

## Data Processing and EDA
1. Load the dataset.
2. Compute the percentage of fraud cases among all recorded transactions.
3. Get summary statistics of the transaction amounts for both fraud and non-fraud cases.

In [24]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def load_data():
    start = time.time()
    df = pd.read_csv('s3://bodo-examples-data/creditcard/creditcard.csv')
    df.drop('Time', axis = 1, inplace = True)
    end = time.time()
    print("Read Time: ", (end-start))
    return df

df = load_data()
df.head()


[stdout:0] Read Time:  23.584205590999773


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
142404,-0.439952,0.683758,1.225814,0.639113,0.716765,0.089295,0.657718,0.034213,-0.472679,0.352698,...,0.127443,0.58859,-0.145728,-0.3229,-0.299744,-0.310295,0.248317,-0.011493,19.0,0
142405,-4.868108,1.26442,-5.167885,3.193648,-3.045621,-2.096166,-6.44561,2.422536,-3.214055,-8.745973,...,1.269205,0.057657,0.629307,-0.168432,0.443744,0.276539,1.441274,-0.127944,12.31,1
142406,1.013114,-0.334412,1.305208,0.837406,-1.126833,-0.064321,-0.594753,0.147737,0.53636,-0.120472,...,0.017079,0.11221,-0.016084,0.595033,0.201073,0.278215,0.007457,0.030762,66.6,0
142407,0.969231,-0.233554,0.238473,0.145793,-0.545741,-0.97068,0.347393,-0.209522,-0.342571,-0.100331,...,-0.36282,-1.417272,0.162136,0.541628,-0.079465,0.268702,-0.101237,0.028234,141.0,0
142408,-0.856523,1.080875,1.866956,1.729941,-0.161741,0.028789,0.401787,0.043774,-0.213916,0.155907,...,0.007365,0.077392,-0.221906,0.394141,0.237225,-0.080102,-0.291408,0.09214,2.6,0


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
71202,1.053976,-0.604455,1.557519,1.498009,-1.072328,1.230171,-1.159346,0.394912,1.903301,-0.495095,...,0.056082,0.629473,-0.337026,-0.525248,0.728233,-0.060297,0.117995,0.041644,59.35,0
71203,-0.631327,0.343072,2.160554,0.756337,0.038585,1.26564,-0.246002,0.596733,0.134493,-0.369403,...,0.051855,0.208031,-0.196802,-0.895435,-0.074086,-0.258828,0.161268,0.105416,28.75,0
71204,1.300457,-0.065674,0.255506,-0.208697,-0.676098,-1.207634,-0.047985,-0.239209,0.289825,-0.157913,...,0.000959,0.004184,-0.02981,0.488943,0.284097,1.45145,-0.117399,-0.004609,9.99,0
71205,-4.578021,-4.940216,2.601071,1.361639,5.145845,-3.301529,-3.542126,0.718833,0.650869,-0.42388,...,0.431009,-0.348956,0.632237,-0.491981,0.739572,-0.432633,-0.114659,0.179794,1.0,0
71206,-0.327506,0.362003,0.852281,2.129654,1.688551,5.177962,-0.404106,1.245253,-0.484987,0.378006,...,0.145614,0.511053,-0.038532,1.051707,-0.124673,0.394902,0.18946,0.16406,98.89,0


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
213606,2.14894,-1.703039,-0.041126,-1.602874,-1.740815,0.148274,-1.774732,0.078852,-0.988161,1.587067,...,-0.08599,0.266928,0.227984,-0.484336,-0.483575,-0.243972,0.064046,-0.035189,52.0,0
213607,2.178347,-1.607482,-1.440763,-1.992181,-0.222548,1.460699,-1.445977,0.44408,-1.331721,1.63566,...,-0.094917,0.236083,0.258351,-0.987394,-0.334356,-0.076864,0.042121,-0.073258,20.0,0
213608,-0.021745,0.768869,-0.52733,-1.279209,0.687124,0.231025,0.716421,0.053917,0.078592,-1.528934,...,-0.353079,-0.98049,0.043599,-1.605573,-0.868668,0.835737,-0.082145,0.04725,89.9,0
213609,2.141501,-0.874732,-1.903605,-0.937012,-0.153103,-0.737772,-0.194992,-0.172563,-0.624846,0.996264,...,0.270332,0.69395,0.015639,0.723209,0.269247,-0.045152,-0.079194,-0.077988,41.96,0
213610,1.188243,-2.428035,-1.23325,-0.327497,-1.048043,0.519697,-0.362632,0.013181,-0.350346,0.738172,...,-0.323173,-1.523661,0.037543,0.066202,-0.854472,0.167269,-0.113237,0.036931,494.95,0




In [25]:
%%px
df.shape

Out[0:5]: (71202, 30)

Out[1:5]: (71202, 30)

Out[2:5]: (71202, 30)

Out[3:5]: (71201, 30)

In [26]:
%%px
@bodo.jit(distributed=['df'])
def data_processing(df):
    cases = len(df)
    nonfraud_count = len(df[df.Class == 0])
    fraud_count = len(df[df.Class == 1])
    # Share of fraud cases among all transactions, in percent
    fraud_percentage = round(fraud_count/cases*100, 2)
    print('--------------------------------------------')
    print('Total number of cases are ', cases)
    print('Number of Non-fraud cases are ', nonfraud_count)
    print('Number of fraud cases are', fraud_count)
    print('Percentage of fraud cases is ', fraud_percentage)
    print('--------------------------------------------')    
    nonfraud_cases = df[df.Class == 0]
    fraud_cases = df[df.Class == 1]
    print('--------------------------------------------')
    print('NON-FRAUD CASE AMOUNT STATS')
    print(nonfraud_cases.Amount.describe())
    print('FRAUD CASE AMOUNT STATS')    
    print(fraud_cases.Amount.describe())
    print('--------------------------------------------')    

data_processing(df)


[stdout:0] --------------------------------------------
Total number of cases are  284807
Number of Non-fraud cases are  284315
Number of fraud cases are 492
Percentage of fraud cases is  0.17
--------------------------------------------
--------------------------------------------
NON-FRAUD CASE AMOUNT STATS
count    142180.000000
mean         85.946181
std         253.677243
min           0.000000
25%           5.367500
50%          20.160000
75%          74.000000
max       25691.160000
Name: Amount, dtype: float64
FRAUD CASE AMOUNT STATS
count     223.000000
mean      128.431839
std       269.879672
min         0.000000
25%         1.000000
50%         9.290000
75%       120.140000
max      2125.870000
Name: Amount, dtype: float64
--------------------------------------------
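
The fraud percentage can be checked by hand from the counts printed above. The sketch below is plain Python arithmetic on those reported counts (it does not load the dataset) and shows that the share of fraud among all transactions and the ratio of fraud to non-fraud cases both round to the same two-decimal value here.

```python
# Counts taken from the output above.
total_cases = 284_807
fraud_count = 492
nonfraud_count = total_cases - fraud_count

# Fraud as a percentage of all transactions.
share_of_all = fraud_count / total_cases * 100
# Fraud as a percentage of the non-fraud count (a slightly larger number).
ratio_to_nonfraud = fraud_count / nonfraud_count * 100

print("Non-fraud cases:", nonfraud_count)
print(f"Share of all transactions: {share_of_all:.4f}%")
print(f"Ratio to non-fraud cases:  {ratio_to_nonfraud:.4f}%")
print("Rounded:", round(share_of_all, 2), round(ratio_to_nonfraud, 2))
```

Either way, fewer than 2 in every 1,000 transactions are labeled as fraud, so the dataset is highly imbalanced.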




## Feature Selection & Data Split

### 1. Normalize `Amount` variable
The `Amount` variable spans a much wider range of values than the rest of the variables. To bring it onto a comparable scale, we standardize it using the `StandardScaler`.

In [27]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def sc(df):
    start = time.time()    
    sc = StandardScaler()
    amount = df['Amount'].values
    amount = amount.reshape(-1,1)
    sc.fit(amount)
    df['Amount'] = (sc.transform(amount)).ravel()
    print("StandardScaler time: ", time.time() - start)
    print(df['Amount'].head(10))
    
sc(df)

[stdout:0] StandardScaler time:  0.1321291479980573
0    0.244964
1   -0.342475
2    1.160686
3    0.140534
4   -0.073403
5   -0.338556
6   -0.333279
7   -0.190107
8    0.019392
9   -0.338516
Name: Amount, dtype: float64
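
For reference, standardization replaces each value x with z = (x - mean) / std, where the standard deviation is the population one (no degrees-of-freedom correction). The sketch below applies that formula in plain Python to the five `Amount` values shown in the first `head()` output; the helper name `standardize` is hypothetical. Because the mean and standard deviation here come from only five values rather than the whole column, the results will not match the scaled values printed above.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance using the population std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mean) / std for v in values]


# Amounts of the first five rows shown earlier in this example.
amounts = [149.62, 2.69, 378.66, 123.5, 69.99]
scaled = standardize(amounts)

print([round(z, 3) for z in scaled])
print("mean:", round(fmean(scaled), 6), "std:", round(pstdev(scaled), 6))
```

After scaling, the values have mean 0 and standard deviation 1 up to floating-point error, which is the property the classifiers benefit from.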


### 2. Split the data into a training set and testing set 

In [28]:
%%px
@bodo.jit(distributed=['df', 'X_train', 'X_test', 'y_train', 'y_test'], cache=True)
def data_split(df):
    X = df.drop('Class', axis = 1).values
    y = df['Class'].values.astype(np.int64)
    start = time.time()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, train_size=0.8, random_state = 0)
    print("train_test_split time: ", time.time() - start)    
    print('X_train samples :', X_train[:1])
    print('X_test samples :', X_test[0:1])
    print('y_train samples :', y_train[0:20])
    print('y_test samples :', y_test[0:20])    
    return X_train, X_test, y_train, y_test
    
X_train, X_test, y_train, y_test = data_split(df)


[stdout:1] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:2] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:3] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:0] train_test_split time:  0.16786602800129913
X_train samples : [[-0.17709329  0.46713663  1.06874256 -0.71203205  0.43178271 -0.38513781
   1.06494145 -0.35528126 -0.2876163  -0.41322759 -1.08167281 -0.29098953
   0.56153543 -0.1808912   0.85587617  0.35369438 -0.75285806 -0.23026866
   0.58922556  0.25942083 -0.22669549 -0.62933132 -0.04370642 -0.53871266
  -0.1193268   0.80287598 -0.24561037 -0.21531057 -0.09047517]]
X_test samples : [[ 1.36499590e+00 -1.46193922e+00  1.07395319e+00 -1.33329524e+00
  -1.78276225e+00  6.08754300e-01 -1.76444440e+00  3.49516092e-01
  -1.21171109e+00  1.38409011e+00  6.68424917e-01 -2.06308091e-01
  -2.22823788e-02 -6.71134197e-01 -6.11257803e-01 -4.63479843e-01
   6.61550418e-01  2.55860946e-01 -2.81391319e-02 -3.49638738e-01
  -1.81838541e-01  9.71567719e-04  3.69133500e-02 -3.30926142e-01
   1.69776735e-01 -1.62328838e-01  8.63403848e-02  1.68786012e-02
  -2.51278195e-01]]
y_train samples : [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
y_test s
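
To show what an 80/20 split does to the row counts, here is a plain-Python sketch of a seeded shuffle split. It is not scikit-learn's or Bodo's implementation: the helper `split_indices` is hypothetical, rounding the test count up is an assumption of this sketch, and a distributed run may split each rank's chunk rather than the whole array.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # assumption: round the test count up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(284_807)
print("train rows:", len(train_idx))
print("test rows: ", len(test_idx))
```

Fixing the seed makes the split reproducible, which is the role `random_state = 0` plays in the cell above.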



## Modeling
Here we build four different classification models and evaluate them using the accuracy score metric provided by the scikit-learn package.
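
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it in plain Python on made-up labels (the helper `accuracy` is hypothetical and stands in for `accuracy_score`), and then works out a point of reference from the whole-dataset counts in the EDA output: the accuracy of always predicting "non-fraud". That baseline uses the full dataset rather than the test split, so it is only an approximate yardstick for the scores in the following sections.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration: one of the two fraud cases is missed.
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0]
print(accuracy(y_true, y_pred))  # 0.875

# Baseline: predict "non-fraud" for every transaction in the full dataset.
total_cases, fraud_count = 284_807, 492
baseline = (total_cases - fraud_count) / total_cases
print(f"{baseline:.4f}")  # 0.9983
```

Because the classes are so imbalanced, accuracies should be read against this baseline rather than against 0.5.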

#### 1. Logistic Regression

In [29]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    lr_yhat = lr.predict(X_test)
    print("LogisticRegression fit and predict time: ", time.time()-start)
    print('Accuracy score of the Logistic Regression model is {}'.format(accuracy_score(y_test, lr_yhat)))
    
    
lr_model(X_train, X_test, y_train, y_test)


[stdout:0] LogisticRegression fit and predict time:  0.22463374700237182
Accuracy score of the Logistic Regression model is 0.998718443874864




#### 2. Random Forest Tree

In [30]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier(max_depth = 4)
    rf.fit(X_train, y_train)
    rf_yhat = rf.predict(X_test)
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score of the Random Forest Tree model is {}'.format(accuracy_score(y_test, rf_yhat)))

rf_model(X_train, X_test, y_train, y_test)


[stdout:0] RandomForestClassifier fit and predict time:  12.63523052399978
Accuracy score of the Random Forest Tree model is 0.9988939995084443




#### 3. XGBoost Model

In [31]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def xgb_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    xgb = XGBClassifier(max_depth = 4)
    xgb.fit(X_train, y_train)
    xgb_yhat = xgb.predict(X_test)
    print("XGBClassifier fit and predict time: ", time.time()-start) 
    print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))

xgb_model(X_train, X_test, y_train, y_test)







XGBClassifier fit and predict time:  13.991874313000153
Accuracy score of the XGBoost model is 0.9989291106351603




#### 4. SVM

In [32]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lsvc_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    lsvc = LinearSVC(random_state=42)
    lsvc.fit(X_train, y_train)
    lsvc_yhat = lsvc.predict(X_test)
    print("LinearSVC fit and predict time: ", time.time()-start) 
    print('Accuracy score of the Linear Support Vector Classification model is {}'.format(accuracy_score(y_test, lsvc_yhat)))

lsvc_model(X_train, X_test, y_train, y_test)

[stdout:0] LinearSVC fit and predict time:  0.17593514299733215
Accuracy score of the Linear Support Vector Classification model is 0.998718443874864


In [38]:
# To stop the cluster run the following command. 
rc.cluster.stop_cluster_sync()