# Credit Card Fraud Detection With Machine Learning in Python

This example uses classification to help a credit card company detect potential fraud cases.
The original example can be found [here](https://medium.com/codex/credit-card-fraud-detection-with-machine-learning-in-python-ac7281991d87) (the dataset is downloaded from Kaggle [here](https://www.kaggle.com/mlg-ulb/creditcardfraud)).

### Start an IPyParallel cluster (skip if running on Bodo Platform)
Run the following code in a cell to start an IPyParallel cluster. This example uses up to 8 cores. Skip this step and the next one (Verifying your setup) if you are using the Bodo Platform.

In [1]:
import ipyparallel as ipp
import psutil

n = min(psutil.cpu_count(logical=False), 8)  # at most 8 physical cores
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|██████████| 8/8 [00:10<00:00,  1.27s/engine]


### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [2]:
%%px
import bodo
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

[stdout:1] Hello World from rank 1. Total ranks=8


[stdout:7] Hello World from rank 7. Total ranks=8


[stdout:2] Hello World from rank 2. Total ranks=8


[stdout:3] Hello World from rank 3. Total ranks=8


[stdout:4] Hello World from rank 4. Total ranks=8


[stdout:0] Hello World from rank 0. Total ranks=8


[stdout:5] Hello World from rank 5. Total ranks=8


[stdout:6] Hello World from rank 6. Total ranks=8


## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - NumPy to work with arrays
 - scikit-learn to build and evaluate classification models
 - XGBoost for the gradient-boosted tree classifier

In [3]:
%%px
import warnings
warnings.filterwarnings("ignore")

import bodo
import numpy as np
import pandas as pd
import time
from sklearn.preprocessing import StandardScaler # data normalization
from sklearn.model_selection import train_test_split # data split
from sklearn.linear_model import LogisticRegression # Logistic regression algorithm
from sklearn.ensemble import RandomForestClassifier # Random forest tree algorithm
from sklearn.svm import LinearSVC # SVM classification algorithm
from sklearn.metrics import accuracy_score # evaluation metric

## Data Processing and EDA
1. Load the dataset.
2. Compute the percentage of fraud cases among all recorded transactions.
3. Get a statistical summary of the transaction amounts for both fraud and non-fraud cases.

In [4]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def load_data():
    start = time.time()
    df = pd.read_csv('s3://bodo-example-data/creditcard/creditcard.csv')
    df.drop('Time', axis = 1, inplace = True)
    end = time.time()
    print("Read Time: ", (end-start))
    return df

df = load_data()

%px:   0%|          | 0/8 [00:21<?, ?tasks/s]

[stdout:0] Read Time:  12.212493971000185


%px: 100%|██████████| 8/8 [00:21<00:00,  2.65s/tasks]


In [5]:
%%px
df.shape

Out[0:4]: (35601, 30)

Out[2:4]: (35601, 30)

Out[1:4]: (35601, 30)

Out[4:4]: (35601, 30)

Out[3:4]: (35601, 30)

Out[6:4]: (35601, 30)

Out[7:4]: (35600, 30)

Out[5:4]: (35601, 30)
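
The per-rank shapes above add up to the full dataset: seven ranks hold 35,601 rows and one holds 35,600, for 284,807 rows in total. The sketch below reproduces those chunk sizes under the assumption of an even one-dimensional block split in which the first `n % p` ranks receive one extra row. `block_sizes` is a hypothetical helper written for illustration, not a Bodo API, and Bodo's actual partitioning logic may differ.

```python
def block_sizes(n_rows, n_ranks):
    """Split n_rows into n_ranks contiguous chunks that differ by at most one row.

    Assumption: the first (n_rows % n_ranks) ranks get the extra row.
    This is an illustrative helper, not part of Bodo.
    """
    base, extra = divmod(n_rows, n_ranks)
    return [base + 1 if rank < extra else base for rank in range(n_ranks)]


sizes = block_sizes(284807, 8)
print(sizes)       # per-rank row counts
print(sum(sizes))  # total rows across all ranks
```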

In [6]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def data_processing(df):
    cases = len(df)
    nonfraud_cases = df[df.Class == 0]
    fraud_cases = df[df.Class == 1]
    nonfraud_count = len(nonfraud_cases)
    fraud_count = len(fraud_cases)
    fraud_percentage = round(fraud_count / cases * 100, 2)  # share of all transactions
    print("--------------------------------------------")
    print("Total number of cases are ", cases)
    print("Number of Non-fraud cases are ", nonfraud_count)
    print("Number of fraud cases are", fraud_count)
    print("Percentage of fraud cases is ", fraud_percentage)
    print("--------------------------------------------")
    print("--------------------------------------------")
    print("NON-FRAUD CASE AMOUNT STATS")
    print(nonfraud_cases.Amount.describe())
    print("FRAUD CASE AMOUNT STATS")
    print(fraud_cases.Amount.describe())
    print("--------------------------------------------")  

data_processing(df)

%px:   0%|          | 0/8 [00:10<?, ?tasks/s]

[stdout:0] --------------------------------------------
Total number of cases are  284807
Number of Non-fraud cases are  284315
Number of fraud cases are 492
Percentage of fraud cases is  0.17
--------------------------------------------
--------------------------------------------
NON-FRAUD CASE AMOUNT STATS
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
FRAUD CASE AMOUNT STATS
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
--------------------------------------------


%px: 100%|██████████| 8/8 [00:10<00:00,  1.29s/tasks]
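
As a quick check of the arithmetic, the fraud share can be recomputed by hand from the counts printed above. The snippet below is a plain-Python sketch that uses those two counts as literals; the variable names are illustrative. It also derives the imbalance ratio, which is worth keeping in mind when reading the accuracy scores later on.

```python
# Counts copied from the output above.
nonfraud_count = 284315
fraud_count = 492
total = nonfraud_count + fraud_count

# Share of all transactions that are fraudulent, in percent.
fraud_percentage = round(fraud_count / total * 100, 2)

# Roughly how many legitimate transactions there are per fraud case.
imbalance_ratio = nonfraud_count / fraud_count

print(total, fraud_percentage, round(imbalance_ratio))
```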


## Feature Selection & Data Split

### 1. Normalize `Amount` variable
The `Amount` variable spans a much wider range of values than the other variables. To bring it onto a comparable scale, we normalize it using `StandardScaler`.

In [7]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def scale_amount(df):
    start = time.time()
    sc = StandardScaler()
    # StandardScaler expects a 2D array, so reshape the column to (n_rows, 1)
    amount = df['Amount'].values
    amount = amount.reshape(-1,1)
    sc.fit(amount)
    df['Amount'] = (sc.transform(amount)).ravel()
    print("StandardScaler time: ", time.time() - start)
    print(df['Amount'].head(10))

scale_amount(df)

%px:   0%|          | 0/8 [00:00<?, ?tasks/s]

[stdout:0] StandardScaler time:  0.007625819999702799
0    0.244964
1   -0.342475
2    1.160686
3    0.140534
4   -0.073403
5   -0.338556
6   -0.333279
7   -0.190107
8    0.019392
9   -0.338516
Name: Amount, dtype: float64


%px: 100%|██████████| 8/8 [00:01<00:00,  7.59tasks/s]
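
For intuition, `StandardScaler` subtracts the column mean and divides by the column's standard deviation (the population form, with no degrees-of-freedom correction), so the result has mean 0 and standard deviation 1. The sketch below applies the same formula in plain Python to a handful of values. The `standardize` helper is hypothetical, and the sample values are the `Amount` quartiles of the non-fraud cases reported earlier, used only as illustrative inputs.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, matching StandardScaler's default
    return [(x - mean) / std for x in values]


amounts = [0.0, 5.65, 22.0, 77.05]  # illustrative values only
scaled = standardize(amounts)

print([round(z, 3) for z in scaled])
# Mean and standard deviation of the scaled values (close to 0 and 1).
print(round(statistics.fmean(scaled), 6), round(statistics.pstdev(scaled), 6))
```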


### 2. Split the data into a training set and testing set 

In [8]:
%%px
@bodo.jit(distributed=['df', 'X_train', 'X_test', 'y_train', 'y_test'], cache=True)
def data_split(df):
    X = df.drop('Class', axis = 1).values
    y = df['Class'].values.astype(np.int64)
    start = time.time()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, train_size=0.8, random_state = 0)
    print("train_test_split time: ", time.time() - start)    
    print('X_train samples :', X_train[:1])
    print('X_test samples :', X_test[0:1])
    print('y_train samples :', y_train[0:20])
    print('y_test samples :', y_test[0:20])    
    return X_train, X_test, y_train, y_test
    
X_train, X_test, y_train, y_test = data_split(df)

%px:   0%|          | 0/8 [00:23<?, ?tasks/s]

[stdout:1] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:4] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:0] train_test_split time:  0.3300760199999786
X_train samples : [[ 1.26585198 -0.1075269   0.47981009 -0.25184768 -0.60788305 -0.52676387
  -0.34531619  0.02365571  0.21786475 -0.03118412  1.10872232  0.59520842
  -0.28786759  0.38413387  0.61749279  0.71709663 -0.71275056  0.23189477
   0.44609545 -0.07168675 -0.10932857 -0.35212209  0.04888506  0.03685774
   0.11879204  0.90961715 -0.07721121 -0.00528953 -0.34963112]]
X_test samples : [[-0.2500975   0.86416881  1.71780059  0.42077176  0.45859239 -0.57885514
   0.94800926 -0.51391496 -0.31043625 -0.02031068  0.55475393  0.09970827
   0.43210406 -0.79585124  1.45354931 -0.47625902  0.39926659 -0.69608176
   0.13674418  0.14104106 -0.19687718 -0.20118382  0.02023838  0.37742844
  -0.90739478  0.02715091 -0.29036732 -0.27941574 -0.33619755]]
y_train samples : [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
y_test samples : [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


[stdout:6] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:5] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:3] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:2] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


[stdout:7] X_train samples : []
X_test samples : []
y_train samples : []
y_test samples : []


%px: 100%|██████████| 8/8 [00:23<00:00,  2.89s/tasks]
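
With `test_size=0.2` and `train_size=0.8`, roughly one row in five is held out for testing. The sketch below estimates the resulting set sizes for the 284,807 rows in this dataset, assuming scikit-learn's rounding convention (test count rounded up, train count rounded down). `split_sizes` is a hypothetical helper, and Bodo's distributed `train_test_split` may round differently.

```python
import math


def split_sizes(n_samples, test_size=0.2, train_size=0.8):
    """Estimate train/test row counts for fractional split sizes.

    Assumption: the test count is rounded up and the train count is
    rounded down, as in scikit-learn's sequential train_test_split.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor(train_size * n_samples)
    return n_train, n_test


n_train, n_test = split_sizes(284807)
print(n_train, n_test, n_train + n_test)
```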


## Modeling
Here we build four types of classification models and evaluate each one using the accuracy score metric provided by the scikit-learn package.
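
Because only about 0.17% of the transactions are fraudulent, accuracy on its own says little about how well fraud is caught. As a rough yardstick, the sketch below computes the accuracy of a trivial classifier that labels every transaction as non-fraud, using the class counts of the full dataset from the EDA step (the test split's exact class counts are not shown in this example, so this is an approximation). The `accuracy` and `recall` helpers are illustrative stand-ins written in plain Python, not scikit-learn functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in preds_for_positives)
    return hits / len(preds_for_positives)


# Class counts from the EDA output: 284315 non-fraud (0) and 492 fraud (1).
y_true = [0] * 284315 + [1] * 492
y_pred = [0] * len(y_true)  # always predict "non-fraud"

print(round(accuracy(y_true, y_pred), 4))  # baseline accuracy
print(recall(y_true, y_pred))              # share of fraud cases caught
```

Under these assumptions the baseline reaches about 99.83% accuracy while catching no fraud at all, so the accuracy scores reported for each model are best read against that figure. Metrics such as recall or precision on the fraud class would complement them.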

#### 1. Logistic Regression

In [9]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    lr_yhat = lr.predict(X_test)
    print("LogisticRegression fit and predict time: ", time.time()-start)
    print('Accuracy score of the Logistic Regression model is {}'.format(accuracy_score(y_test, lr_yhat)))
    
    
lr_model(X_train, X_test, y_train, y_test)

%px:   0%|          | 0/8 [00:02<?, ?tasks/s]

[stdout:0] LogisticRegression fit and predict time:  0.5100183559998186
Accuracy score of the Logistic Regression model is 0.998718443874864


%px: 100%|██████████| 8/8 [00:02<00:00,  2.92tasks/s]


#### 2. Random Forest

In [10]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier(max_depth = 4)
    rf.fit(X_train, y_train)
    rf_yhat = rf.predict(X_test)
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score of the Random Forest Tree model is {}'.format(accuracy_score(y_test, rf_yhat)))

rf_model(X_train, X_test, y_train, y_test)

%px:   0%|          | 0/8 [00:17<?, ?tasks/s]

[stdout:0] RandomForestClassifier fit and predict time:  14.590287068999714
Accuracy score of the Random Forest Tree model is 0.9988588883817282


%px: 100%|██████████| 8/8 [00:17<00:00,  2.14s/tasks]


#### 3. Linear SVM (LinearSVC)

In [11]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lsvc_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    lsvc = LinearSVC(random_state=42)
    lsvc.fit(X_train, y_train)
    lsvc_yhat = lsvc.predict(X_test)
    print("LinearSVC fit and predict time: ", time.time()-start) 
    print('Accuracy score of the Linear Support Vector Classification model is {}'.format(accuracy_score(y_test, lsvc_yhat)))

lsvc_model(X_train, X_test, y_train, y_test)

%px:   0%|          | 0/8 [00:00<?, ?tasks/s]

[stdout:0] LinearSVC fit and predict time:  0.5518690680000873
Accuracy score of the Linear Support Vector Classification model is 0.9987008883115059


%px: 100%|██████████| 8/8 [00:00<00:00, 22.03tasks/s]


#### 4. XGBoost Model

In [12]:
%%px
from xgboost import XGBClassifier # XGBoost algorithm

@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def xgb_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    xgb = XGBClassifier(max_depth = 4)
    xgb.fit(X_train, y_train)
    xgb_yhat = xgb.predict(X_test)
    print("XGBClassifier fit and predict time: ", time.time()-start) 
    print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))

xgb_model(X_train, X_test, y_train, y_test)

%px:   0%|          | 0/8 [00:00<?, ?tasks/s]

XGBClassifier fit and predict time:  8.144538540999747
Accuracy score of the XGBoost model is 0.9989466661985184


%px: 100%|██████████| 8/8 [00:08<00:00,  1.03s/tasks]


In [13]:
# To stop the cluster run the following command. 
rc.cluster.stop_cluster_sync()

Stopping controller
Controller stopped: {'exit_code': 0, 'pid': 47768, 'identifier': 'ipcontroller-1652912185-7wpt-46887'}
Stopping engine(s): 1652912186
engine set stopped 1652912186: {'exit_code': 0, 'pid': 48282, 'identifier': 'ipengine-1652912185-7wpt-1652912186-46887'}
