# Credit Card Fraud Detection With Machine Learning in Python

This example uses classification models to help a credit card company detect potentially fraudulent transactions.
The original example can be found [here](https://medium.com/codex/credit-card-fraud-detection-with-machine-learning-in-python-ac7281991d87).

### Notes on running this example:

By default, the example runs with Bodo, so the data is distributed in chunks across processes.

The results shown here come from a run on a local MacBook Pro. You can also run the example on our platform using, for example, one **m5.12xlarge** instance (24 cores, 192 GiB memory).

The dataset can be downloaded from Kaggle [here](https://www.kaggle.com/mlg-ulb/creditcardfraud) or read from an S3 bucket (`s3://bodo-examples-data/creditcard/creditcard.csv`).

To run the code:
1. If you read the data from the S3 bucket, make sure your AWS account credentials are set up so the data can be accessed.
2. If you want to run the example using pandas only (without Bodo):
    1. Comment out the magic expression (`%%px`) and the Bodo decorator (`@bodo.jit`) in all code cells.
    2. Then re-run the cells from the beginning.
3. Build the xgboost package from source with MPI enabled (this step is already done on the Bodo Platform).
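
For step 2, the pandas-only version of a cell is the same function without `%%px` and `@bodo.jit`. The sketch below shows this for the loading step. The name `load_data_pandas` is hypothetical, and a tiny in-memory CSV with made-up values stands in for the real file so that the snippet runs on its own.

```python
import io
import time

import pandas as pd


def load_data_pandas(source):
    """Pandas-only loading step: no %%px magic and no @bodo.jit decorator."""
    start = time.time()
    df = pd.read_csv(source)
    df = df.drop("Time", axis=1)
    print("Read Time: ", time.time() - start)
    return df


# Tiny stand-in for creditcard.csv (made-up values, reduced set of columns).
sample_csv = io.StringIO(
    "Time,V1,Amount,Class\n"
    "0,-1.36,149.62,0\n"
    "1,1.19,2.69,0\n"
    "2,-2.31,0.00,1\n"
)
df_small = load_data_pandas(sample_csv)
print(df_small.columns.tolist())
```

With the real data, the same function could be called as `load_data_pandas("creditcard.csv")`, assuming the Kaggle file was saved under that name in the working directory.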

### Start an IPyParallel cluster (skip if running on Bodo Platform)
Run the following code in a cell to start an IPyParallel cluster. It starts one engine per physical core, capped at 8 (8 engines in the run shown here). Skip this step and the next one (Verifying your setup) if you are using the Bodo Platform.

In [1]:
import ipyparallel as ipp
import psutil
n = min(psutil.cpu_count(logical=False), 8)  # one engine per physical core, at most 8
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>


### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [2]:
%%px
import bodo
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

[stdout:6] Hello World from rank 6. Total ranks=8
[stdout:1] Hello World from rank 1. Total ranks=8
[stdout:2] Hello World from rank 2. Total ranks=8
[stdout:7] Hello World from rank 7. Total ranks=8
[stdout:0] Hello World from rank 0. Total ranks=8
[stdout:5] Hello World from rank 5. Total ranks=8
[stdout:3] Hello World from rank 3. Total ranks=8
[stdout:4] Hello World from rank 4. Total ranks=8

## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - Numpy to work with arrays
 - scikit-learn to build and evaluate classification models
 - XGBoost to build the XGBoost classifier model

In [3]:
%%px
import warnings
warnings.filterwarnings("ignore")

import bodo
import numpy as np
import pandas as pd
import time
from sklearn.preprocessing import StandardScaler # data normalization
from sklearn.model_selection import train_test_split # data split
from sklearn.linear_model import LogisticRegression # Logistic regression algorithm
from sklearn.ensemble import RandomForestClassifier # Random forest tree algorithm
from xgboost import XGBClassifier # XGBoost algorithm
from sklearn.svm import LinearSVC # SVM classification algorithm
from sklearn.metrics import accuracy_score # evaluation metric

In [4]:
%%px
import json
import os

path_to_conn_creds = "credentials.json"
with open(path_to_conn_creds) as f:
    creds = json.load(f)

os.environ["AWS_ACCESS_KEY_ID"] = creds["aws"]["aws_access_key_id"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["aws"]["aws_secret_access_key"]
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"
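
The cell above expects a `credentials.json` file with an `aws` object that holds the two keys it reads. A minimal sketch of that layout follows. The placeholder values and the temporary directory are assumptions for illustration only, and the environment settings are collected in a plain dictionary (`aws_env`, a hypothetical name) instead of being written to `os.environ`.

```python
import json
import os
import tempfile

# Placeholder values only; real keys come from your AWS account.
example_creds = {
    "aws": {
        "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
        "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path_to_conn_creds = os.path.join(tmp_dir, "credentials.json")
    with open(path_to_conn_creds, "w") as f:
        json.dump(example_creds, f, indent=2)

    # Read the file back the same way the notebook cell does.
    with open(path_to_conn_creds) as f:
        creds = json.load(f)

# The three settings the notebook cell exports as environment variables.
aws_env = {
    "AWS_ACCESS_KEY_ID": creds["aws"]["aws_access_key_id"],
    "AWS_SECRET_ACCESS_KEY": creds["aws"]["aws_secret_access_key"],
    "AWS_DEFAULT_REGION": "us-east-2",
}
print(sorted(aws_env))
```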

## Data Processing and EDA
1. Load dataset
2. Compute the percentage of fraud cases among all recorded transactions.
3. Get a statistical summary of the transaction amounts for fraud and non-fraud cases.

In [5]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def load_data():
    start = time.time()
    df = pd.read_csv('s3://bodo-examples-data/creditcard/creditcard.csv')
    df.drop('Time', axis = 1, inplace = True)
    end = time.time()
    print("Read Time: ", (end-start))
    return df

df = load_data()


[stdout:0] Read Time:  56.47175404700101
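
Because `df` is marked as distributed, each engine holds only a slice of the rows. The sketch below is a plain-Python illustration of an even block split, not Bodo's actual implementation. `block_sizes` is a hypothetical helper, and 284,807 is the total row count of the dataset (the sum of the per-rank shapes that `df.shape` reports in this notebook).

```python
def block_sizes(n_rows, n_ranks):
    """Split n_rows into n_ranks contiguous chunks that differ by at most one row."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks take one additional row each.
    return [base + 1 if rank < extra else base for rank in range(n_ranks)]


sizes = block_sizes(284807, 8)
print(sizes)
```

Under this assumption, seven chunks hold 35,601 rows and one holds 35,600, which is consistent with the shapes printed per rank.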




In [6]:
%%px
df.shape

Out[1:5]: (35601, 30)
Out[0:5]: (35601, 30)
Out[2:5]: (35601, 30)
Out[3:5]: (35601, 30)
Out[7:5]: (35600, 30)
Out[4:5]: (35601, 30)
Out[5:5]: (35601, 30)
Out[6:5]: (35601, 30)

In [7]:
%%px
@bodo.jit(distributed=['df'])
def data_processing(df):
    cases = len(df)
    nonfraud_count = len(df[df.Class == 0])
    fraud_count = len(df[df.Class == 1])
    # Fraud share of all transactions (not relative to non-fraud cases only)
    fraud_percentage = round(fraud_count/cases*100, 2)
    print('--------------------------------------------')
    print('Total number of cases is ', cases)
    print('Number of non-fraud cases is ', nonfraud_count)
    print('Number of fraud cases is ', fraud_count)
    print('Percentage of fraud cases is ', fraud_percentage)
    print('--------------------------------------------')    
    nonfraud_cases = df[df.Class == 0]
    fraud_cases = df[df.Class == 1]
    print('--------------------------------------------')
    print('NON-FRAUD CASE AMOUNT STATS')
    print(nonfraud_cases.Amount.describe())
    print('FRAUD CASE AMOUNT STATS')    
    print(fraud_cases.Amount.describe())
    print('--------------------------------------------')    

data_processing(df)
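
Two related quantities are easy to confuse here: the share of fraud among all transactions and the ratio of fraud to non-fraud cases. The sketch below computes both with plain pandas on made-up toy data (not values from the real dataset), then prints per-class `Amount` statistics in the same spirit as the cell above.

```python
import pandas as pd

# Made-up toy data: 3 fraud cases out of 10 transactions.
toy = pd.DataFrame({
    "Amount": [10.0, 25.5, 3.2, 120.0, 8.9, 44.1, 300.0, 1.0, 75.0, 19.9],
    "Class": [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

cases = len(toy)
fraud_count = int((toy.Class == 1).sum())
nonfraud_count = cases - fraud_count

# Fraud as a share of all transactions.
share_of_all = round(fraud_count / cases * 100, 2)
# Fraud relative to non-fraud cases only (a different, larger number).
ratio_to_nonfraud = round(fraud_count / nonfraud_count * 100, 2)

print(share_of_all, ratio_to_nonfraud)
print(toy.groupby("Class").Amount.describe())
```

On a heavily imbalanced dataset the two numbers are close, but only the first one is a percentage of all cases.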




## Feature Selection & Data Split

### 1. Normalize `Amount` variable
The `Amount` variable has a much wider range of values than the rest of the variables. To bring it onto a comparable scale, we standardize it using the `StandardScaler`.

In [None]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def sc(df):
    start = time.time()    
    scaler = StandardScaler()  # named `scaler` to avoid shadowing the function name `sc`
    amount = df['Amount'].values
    amount = amount.reshape(-1,1)
    scaler.fit(amount)
    df['Amount'] = (scaler.transform(amount)).ravel()
    print("StandardScaler time: ", time.time() - start)
    print(df['Amount'].head(10))
    
sc(df)
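
To make the scaling step concrete, here is a minimal sketch in plain scikit-learn and NumPy (outside Bodo) on made-up amounts. It checks that `StandardScaler` matches the manual formula: subtract the mean, then divide by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up amounts with one large outlier.
amount = np.array([1.0, 5.0, 20.0, 50.0, 2500.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(amount)

# Same result by hand: subtract the mean, divide by the population std (ddof=0).
manual = (amount - amount.mean()) / amount.std()

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

After scaling, the column has zero mean and unit variance, but the outlier is still far from the other values. Standardization changes the scale, not the shape of the distribution.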

### 2. Split the data into a training set and testing set 

In [None]:
%%px
@bodo.jit(distributed=['df', 'X_train', 'X_test', 'y_train', 'y_test'], cache=True)
def data_split(df):
    X = df.drop('Class', axis = 1).values
    y = df['Class'].values.astype(np.int64)
    start = time.time()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, train_size=0.8, random_state = 0)
    print("train_test_split time: ", time.time() - start)    
    print('X_train samples :', X_train[:1])
    print('X_test samples :', X_test[0:1])
    print('y_train samples :', y_train[0:20])
    print('y_test samples :', y_test[0:20])    
    return X_train, X_test, y_train, y_test
    
X_train, X_test, y_train, y_test = data_split(df)
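
As a small illustration of what `test_size=0.2` and `random_state=0` do, the sketch below runs `train_test_split` in plain scikit-learn (outside Bodo) on made-up arrays. It shows the 80/20 split sizes and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 3 features each.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# The same random_state gives the same split on a second call.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_test, X_test2))
```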

## Modeling
Here we build four different types of classification models and evaluate them using the accuracy score metric provided by the scikit-learn package.
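
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it with `accuracy_score` and by hand on made-up labels (these are not results from this dataset). It also suggests why accuracy should be read with care when one class is rare: a trivial predictor that never flags fraud still scores high.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 2 fraud cases among 1000 transactions.
y_true = np.zeros(1000, dtype=np.int64)
y_true[:2] = 1

# A trivial "model" that always predicts non-fraud.
y_pred = np.zeros(1000, dtype=np.int64)

acc = accuracy_score(y_true, y_pred)
# Same quantity by hand: share of positions where prediction equals label.
manual = float((y_true == y_pred).mean())
print(acc, manual)
```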

#### 1. Logistic Regression

In [None]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    lr_yhat = lr.predict(X_test)
    print("LogisticRegression fit and predict time: ", time.time()-start)
    print('Accuracy score of the Logistic Regression model is {}'.format(accuracy_score(y_test, lr_yhat)))
    
    
lr_model(X_train, X_test, y_train, y_test)

#### 2. Random Forest Tree

In [None]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier(max_depth = 4)
    rf.fit(X_train, y_train)
    rf_yhat = rf.predict(X_test)
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score of the Random Forest Tree model is {}'.format(accuracy_score(y_test, rf_yhat)))

rf_model(X_train, X_test, y_train, y_test)

#### 3. XGBoost Model

In [None]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def xgb_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    xgb = XGBClassifier(max_depth = 4)
    xgb.fit(X_train, y_train)
    xgb_yhat = xgb.predict(X_test)
    print("XGBClassifier fit and predict time: ", time.time()-start) 
    print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))

xgb_model(X_train, X_test, y_train, y_test)

#### 4. SVM

In [None]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lsvc_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    lsvc = LinearSVC(random_state=42)
    lsvc.fit(X_train, y_train)
    lsvc_yhat = lsvc.predict(X_test)
    print("LinearSVC fit and predict time: ", time.time()-start) 
    print('Accuracy score of the Linear Support Vector Classification model is {}'.format(accuracy_score(y_test, lsvc_yhat)))

lsvc_model(X_train, X_test, y_train, y_test)

In [None]:
# To stop the cluster, run the following command.
rc.cluster.stop_cluster_sync()
