# Credit Card Fraud Detection With Machine Learning in Python

This example uses classification to help a credit card company detect potentially fraudulent transactions.
The original example can be found [here](https://medium.com/codex/credit-card-fraud-detection-with-machine-learning-in-python-ac7281991d87).

### Notes on running this example:

By default, the example runs with Bodo, so the data is distributed in chunks across processes.

The results shown here were obtained on one **m5.12xlarge** instance (24 cores, 192 GiB memory).

The dataset can be downloaded from Kaggle [here](https://www.kaggle.com/mlg-ulb/creditcardfraud) or read directly from the S3 bucket (`s3://bodo-examples-data/creditcard/creditcard.csv`).

To run the code:
1. If you read the data from the S3 bucket, add your AWS account credentials so the data can be accessed.
2. If you want to run the example using pandas only (without Bodo):
    1. Comment out the magic expression (`%%px`) and the Bodo decorator (`@bodo.jit`) in all code cells.
    2. Then re-run the cells from the beginning.
3. Build the xgboost package from source with MPI enabled (this step is already done on the Bodo Platform).
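
As a rough sketch of step 2, the loading step can be written without `%%px` and `@bodo.jit` as an ordinary pandas function. The function name `load_data_pandas`, its `source` parameter, and the tiny in-memory CSV are assumptions made for illustration; they are not part of the original notebook, and the sample values are not real data.

```python
import io

import pandas as pd

# Tiny illustrative CSV with the same kind of columns as the dataset.
# The values are placeholders, not rows from the real file.
SAMPLE_CSV = io.StringIO(
    "Time,V1,Amount,Class\n"
    "0,-1.36,149.62,0\n"
    "1,1.19,2.69,0\n"
    "2,-2.31,1.00,1\n"
)


def load_data_pandas(source):
    """Plain-pandas version of the loading step: no %%px, no @bodo.jit."""
    df = pd.read_csv(source)
    # Drop the Time column, as the Bodo version of load_data does.
    df = df.drop("Time", axis=1)
    return df


df_small = load_data_pandas(SAMPLE_CSV)
print(df_small.columns.tolist())
```

Passing a local path or the S3 URL instead of `SAMPLE_CSV` should work the same way, provided the file is reachable from the machine running the notebook.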

## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - NumPy to work with arrays
 - scikit-learn to build and evaluate classification models
 - xgboost for the XGBoost classifier

In [1]:
%%px
import warnings
warnings.filterwarnings("ignore")

import bodo
import numpy as np
import pandas as pd
import time
from sklearn.preprocessing import StandardScaler # data normalization
from sklearn.model_selection import train_test_split # data split
from sklearn.linear_model import LogisticRegression # Logistic regression algorithm
from sklearn.ensemble import RandomForestClassifier # Random forest algorithm
from xgboost import XGBClassifier # XGBoost algorithm
from sklearn.svm import LinearSVC # SVM classification algorithm
from sklearn.metrics import accuracy_score # evaluation metric

In [2]:
%%px
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_access_key_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret_access_key"
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"

## Data Processing and EDA
1. Load the dataset.
2. Compute the percentage of fraud cases among all recorded transactions.
3. Get a statistical view of the transaction amounts for both fraud and non-fraud cases.

In [3]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def load_data():
    start = time.time()
    df = pd.read_csv('s3://bodo-examples-data/creditcard/creditcard.csv')
    df.drop('Time', axis = 1, inplace = True)
    end = time.time()
    print("Read Time: ", (end-start))
    return df

df = load_data()


[stdout:0] Read Time:  0.9684278964996338


In [4]:
%%px
@bodo.jit(distributed=['df'])
def data_processing(df):
    cases = len(df)
    nonfraud_count = len(df[df.Class == 0])
    fraud_count = len(df[df.Class == 1])
    # Percentage of fraud among all transactions (divide by the total, not by the non-fraud count)
    fraud_percentage = round(fraud_count/cases*100, 2)
    print('--------------------------------------------')
    print('Total number of cases are ', cases)
    print('Number of Non-fraud cases are ', nonfraud_count)
    print('Number of fraud cases are', fraud_count)
    print('Percentage of fraud cases is ', fraud_percentage)
    print('--------------------------------------------')    
    nonfraud_cases = df[df.Class == 0]
    fraud_cases = df[df.Class == 1]
    print('--------------------------------------------')
    print('NON-FRAUD CASE AMOUNT STATS')
    print(nonfraud_cases.Amount.describe())
    print('FRAUD CASE AMOUNT STATS')    
    print(fraud_cases.Amount.describe())
    print('--------------------------------------------')    

data_processing(df)

[stdout:0] 
--------------------------------------------
Total number of cases are  284807
Number of Non-fraud cases are  284315
Number of fraud cases are 492
Percentage of fraud cases is  0.17
--------------------------------------------
--------------------------------------------
NON-FRAUD CASE AMOUNT STATS
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
FRAUD CASE AMOUNT STATS
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
--------------------------------------------
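
The same two summaries (class share and per-class `Amount` statistics) can also be obtained with `value_counts` and `groupby` in plain pandas. The sketch below uses a made-up five-row frame named `toy`, an assumption for illustration only, and it has not been checked under `@bodo.jit`.

```python
import pandas as pd

# Made-up miniature of the dataset: only the two columns used here.
toy = pd.DataFrame(
    {
        "Amount": [10.0, 20.0, 30.0, 40.0, 100.0],
        "Class": [0, 0, 0, 0, 1],
    }
)

# Share of each class among all transactions, in percent.
class_pct = toy["Class"].value_counts(normalize=True) * 100
fraud_pct = round(class_pct.loc[1], 2)

# Amount statistics for fraud and non-fraud rows in a single call.
amount_stats = toy.groupby("Class")["Amount"].describe()

print(fraud_pct)
print(amount_stats[["count", "mean", "max"]])
```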


## Feature Selection & Data Split

### 1. Normalize `Amount` variable
The `Amount` variable has a much wider range of values than the other variables. To put it on a comparable scale, we standardize it with `StandardScaler`.

In [5]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def sc(df):
    start = time.time()    
    sc = StandardScaler()
    amount = df['Amount'].values
    amount = amount.reshape(-1,1)
    sc.fit(amount)
    df['Amount'] = (sc.transform(amount)).ravel()
    print("StandardScaler time: ", time.time() - start)
    print(df['Amount'].head(10))
    
sc(df)

[stdout:0] 
StandardScaler time:  0.06013607978820801
0    0.244964
1   -0.342475
2    1.160686
3    0.140534
4   -0.073403
5   -0.338556
6   -0.333279
7   -0.190107
8    0.019392
9   -0.338516
Name: Amount, dtype: float64
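
To make the transformation concrete: `StandardScaler` subtracts the column mean and divides by the population standard deviation, so the scaled column has mean 0 and standard deviation 1. The sketch below checks this against a manual computation in plain scikit-learn and NumPy; the five amounts are illustrative values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative amounts only; these are not rows from the dataset.
amount = np.array([1.0, 5.0, 20.0, 75.0, 2000.0]).reshape(-1, 1)

# Manual standardization: subtract the mean, divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (amount - amount.mean()) / amount.std()

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(amount)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```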


### 2. Split the data into a training set and a test set

In [6]:
%%px
@bodo.jit(distributed=['df'], cache=True)
def data_split(df):
    X = df.drop('Class', axis = 1).values
    y = df['Class'].values.astype(np.int64)
    start = time.time()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, train_size=0.8, random_state = 0)
    print("train_test_split time: ", time.time() - start)    
    print('X_train samples :', X_train[:1])
    print('X_test samples :', X_test[0:1])
    print('y_train samples :', y_train[0:20])
    print('y_test samples :', y_test[0:20])    
    return X_train, X_test, y_train, y_test
    
X_train, X_test, y_train, y_test = data_split(df)

[stdout:0] 
train_test_split time:  0.001360177993774414
X_train samples : [[-1.27923083 -0.15330333  3.29631037  3.32044136  1.13901754  0.54234305
  -0.72992832 -0.05177411  0.92271182  0.84594969  1.38923569 -2.44018135
   1.09921626  0.76496092 -1.32315853 -0.28971316  0.65615985  0.77523608
   1.52883351  0.0286392  -0.40974641 -0.34257494 -0.49329682 -0.01704646
  -0.10740384  0.10116408 -0.19794013 -0.43565366 -0.35322939]]
X_test samples : [[-1.46317756  1.53882499  0.78746456 -0.10219179 -0.62638956 -0.35972277
  -0.21373356  1.08709589 -0.76867278 -0.32041171  1.37707623  0.95539556
  -0.29877328  1.00682748  0.36061527  0.28524413  0.05467376 -0.27212213
  -0.17142973 -0.02906554 -0.08161953 -0.40711758  0.13531346  0.19218949
  -0.23768612  0.08036209  0.13556789  0.03238003 -0.31328851]]
y_train samples : [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
y_test samples : [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
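
The effect of `test_size` and `random_state` can be seen on a small synthetic array. This is a sketch in plain scikit-learn on a single process; `X_demo` and `y_demo` are made-up stand-ins, and under Bodo each process holds only its own chunk of the data, so the printed samples there differ from a single-process run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels (100 rows, 3 features).
X_demo = np.arange(300, dtype=np.float64).reshape(100, 3)
y_demo = np.arange(100) % 2

# Two calls with the same random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# test_size=0.2 holds out 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# Same random_state, same shuffle: the two splits are identical.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```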


## Modeling
Here we build four different classification models and evaluate them using the accuracy score metric provided by the scikit-learn package.
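
When reading the accuracy scores, it helps to keep the class imbalance in mind. The short calculation below uses the counts printed by the data-processing step and shows the accuracy that a hypothetical baseline labeling every transaction as non-fraud would reach on the full dataset. The baseline is an illustration only and is not part of the original example; the exact figure on the 20% test split may differ slightly.

```python
# Counts taken from the data-processing output above.
nonfraud_count = 284315
fraud_count = 492
total = nonfraud_count + fraud_count

# Share of fraud among all transactions, in percent.
fraud_pct = fraud_count / total * 100

# A baseline that always predicts "non-fraud" is right on every
# non-fraud row and wrong on every fraud row.
baseline_accuracy = nonfraud_count / total

print(round(fraud_pct, 2))
print(round(baseline_accuracy, 4))
```

Accuracy values for the models in this section are best compared against this baseline rather than against 0 or 0.5.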

#### 1. Logistic Regression

In [7]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    lr_yhat = lr.predict(X_test)
    print("LogisticRegression fit and predict time: ", time.time()-start)
    print('Accuracy score of the Logistic Regression model is {}'.format(accuracy_score(y_test, lr_yhat)))
    
    
lr_model(X_train, X_test, y_train, y_test)

[stdout:0] 
LogisticRegression fit and predict time:  0.07842183113098145
Accuracy score of the Logistic Regression model is 0.9993330525133389


#### 2. Random Forest

In [8]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier(max_depth = 4)
    rf.fit(X_train, y_train)
    rf_yhat = rf.predict(X_test)
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score of the Random Forest Tree model is {}'.format(accuracy_score(y_test, rf_yhat)))

rf_model(X_train, X_test, y_train, y_test)

[stdout:0] 
RandomForestClassifier fit and predict time:  4.058526992797852
Accuracy score of the Random Forest Tree model is 0.9994208087615838


#### 3. XGBoost Model

In [9]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def xgb_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    xgb = XGBClassifier(max_depth = 4)
    xgb.fit(X_train, y_train)
    xgb_yhat = xgb.predict(X_test)
    print("XGBClassifier fit and predict time: ", time.time()-start) 
    print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))

xgb_model(X_train, X_test, y_train, y_test)

[stdout:0] 
XGBClassifier fit and predict time:  1.8004610538482666
Accuracy score of the XGBoost model is 0.9980342600393148


#### 4. Linear SVM (`LinearSVC`)

In [10]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lsvc_model(X_train, X_test, y_train, y_test):  
    start = time.time()
    lsvc = LinearSVC(random_state=42)
    lsvc.fit(X_train, y_train)
    lsvc_yhat = lsvc.predict(X_test)
    print("LinearSVC fit and predict time: ", time.time()-start) 
    print('Accuracy score of the Linear Support Vector Classification model is {}'.format(accuracy_score(y_test, lsvc_yhat)))

lsvc_model(X_train, X_test, y_train, y_test)

[stdout:0] 
LinearSVC fit and predict time:  0.13797593116760254
Accuracy score of the Linear Support Vector Classification model is 0.9993330525133389
