### Detecting Fraudulent Credit Card Transactions Using Ensemble Methods (Baselines)

In [87]:
#Packages Used
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

In [2]:
#load fraud data
data = pd.read_csv('creditcard.csv')

In [74]:
print('Number of Records:',data.shape[0])
print('Number of Variables:',data.shape[1])
data.head()

Number of Records: 284807
Number of Variables: 31


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [88]:
y = data.Class
y = np.asarray(y)
X = data.drop(columns = 'Class')
X = np.asarray(X)
print('Transactions:',len(y),', Fraudulent Transactions:',sum(y),', Ratio:',len(y)/sum(y),'to 1')

Transactions: 284807 , Fraudulent Transactions: 492 , Ratio: 578.8760162601626 to 1


### Baseline Models:

#### 1. Only Predicting Not Fraud

In [78]:
#Define Function to quickly calculate relevant model performance metrics
def matrix_scores(model,X_train,X_test,y_train,y_test):
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    scores = metrics.precision_recall_fscore_support(pred_test,y_test,average='binary')
    labels = ['Precision:','Recall:','F-Score:','Support:']
    print('Accuracy:',metrics.accuracy_score(pred_test,y_test))
    for i in range(3):
        print(labels[i],scores[i])
    return scores

In [91]:
#First need to train test split data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

#Standard Scale Data
scalar = StandardScaler()
scalar.fit(X_train)
X_train = scalar.transform(X_train)
X_test = scalar.transform(X_test)

In [92]:
# Never Predicting Fraud
pred_test = np.zeros(len(y_test))
pred_train = np.zeros(len(y_train))
print('Accuracy: %',100*metrics.accuracy_score(pred_test,y_test))

Accuracy: % 99.80337769039008


Simply always predicting that a transaction is not fraud will be accurate %99.8 of the time however it will not be able to actually stop any fraudulent activity, this is an instructive note when considering the validity of relying solely on accuracy when using heavily skewed data.

#### 2. Logistic Regression

A good starting point for many binary classification problems is Logistic Regression as it is simply a regression transformed by a sigmoid function which can give us a good baseline for our later more advanced models.

In [93]:
#Simple Logistic Regression Classifier
logreg = linear_model.LogisticRegression()
logistic_model = logreg.fit(X_train,y_train)

In [94]:
logistic_scores = matrix_scores(logistic_model,X_train,X_test,y_train,y_test)

Accuracy: 0.9991924440855307
Precision: 0.6696428571428571
Recall: 0.8928571428571429
F-Score: 0.7653061224489796


#### 3. Random Forest

Ensemble methods are another popular choice for simple binary classification problems such as this. Using Random Forest Ensembles shows a great increase in performance over the simple logistic model across each metric.

In [95]:
#RF Model
RF = RandomForestClassifier()
RF_model = RF.fit(X_train,y_train)

In [96]:
RF_scores = matrix_scores(RF_model,X_train,X_test,y_train,y_test)

Accuracy: 0.9995435553526912
Precision: 0.7857142857142857
Recall: 0.9777777777777777
F-Score: 0.8712871287128713


#### 4. Boosting Models

Another ensemble method that has become popular is gradient boosting. This section shows the performance of and adaptive gradient boost (Ada Boost) and the perfromance of Extreme Gradient Boosting Methods (XGBoost)

In [97]:
#Ada Boost Model
AdaBoost = AdaBoostClassifier()
Ada_model = AdaBoost.fit(X_train,y_train)

In [98]:
Ada_scores = matrix_scores(Ada_model,X_train,X_test,y_train,y_test)

Accuracy: 0.9992275552122467
Precision: 0.6875
Recall: 0.8953488372093024
F-Score: 0.7777777777777778


In [99]:
#XGBoost
XGBoost = XGBClassifier()
XG_model = XGBoost.fit(X_train,y_train)

In [100]:
XG_scores = matrix_scores(XG_model,X_train,X_test,y_train,y_test)

Accuracy: 0.9994908886626171
Precision: 0.7678571428571429
Recall: 0.9662921348314607
F-Score: 0.8557213930348259


In terms of the overall performance metrics (Recall, Precision and F-Score) Random Forest and XGBoost seem to perform the best so far the next step would be to tune hyperparameters to see how each of these models can be improved further. 