## Detecting Credit Card Fraud

Since we are working with preprocessed data, most of the features are obscured. The good news is that we won't have to do much data cleaning, if any.

In [10]:
# Let's load the necessary packages and the data
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn
from sklearn import ensemble
from sklearn import linear_model
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv('creditcard.csv')
df.head(5)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Since principal component analysis has already been applied, we can trust that the features have been reduced to components that describe the data. So let's move on and see if we can build a model. The dataset's documentation mentions that there is a very small number of fraudulent transactions, so let's take a look for ourselves.

In [2]:
# In order to see how we can split the data, I want to see how many records we are working with
df.shape

(284807, 31)
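The shape alone does not show how rare fraud is. As a minimal sketch of one way to check, assuming `Class` is coded 0 for legitimate and 1 for fraudulent transactions, the class counts and the fraud share can be read straight off that column. The `toy` frame below is a hypothetical stand-in so the snippet runs without `creditcard.csv`; on the real data, `df['Class']` would take its place.

```python
import pandas as pd

# Hypothetical toy frame standing in for creditcard.csv; only the 'Class'
# column (assumed 0 = legitimate, 1 = fraud) matters for this check.
toy = pd.DataFrame({'Class': [0] * 997 + [1] * 3})

counts = toy['Class'].value_counts()   # number of rows per class
fraud_share = toy['Class'].mean()      # mean of a 0/1 column = fraction of frauds

print(counts)
print(f'Fraud share: {fraud_share:.2%}')
```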

In [8]:
# 280k+ rows; since the percentage of fraudulent transactions is very low, let's hold out a sizeable test set.
X = df.drop(['Class'], axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=26)
print(X_train.shape[0], X_test.shape[0], y_train.shape[0], y_test.shape[0])

213605 71202 213605 71202
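With so few positive cases, a plain random split can leave the train and test sets with slightly different fraud rates. One option, not used in the split above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same on both sides. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows with 2% positives, standing in for X and y.
rng = np.random.default_rng(26)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 20 + [0] * 980)

# stratify=y_toy preserves the 2% positive rate in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=26, stratify=y_toy
)

print('Positives in train:', y_tr.sum(), 'of', len(y_tr))
print('Positives in test: ', y_te.sum(), 'of', len(y_te))
```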


Let's fit a vanilla logistic regression model as a baseline and see how it does compared to ridge regression.

In [15]:
lr = linear_model.LogisticRegression()
lr_model = lr.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
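# Note: confusion_matrix expects (y_true, y_pred); with the arguments in this
# order, rows are predicted classes and columns are actual classes.
c_mat = confusion_matrix(y_pred, y_test)

print('Coefficients:\n', lr_model.coef_)
# For a classifier, score() returns mean accuracy, not R2.
print('R2 Score: ', lr_model.score(X_test, y_test))
print('Confusion Matrix:\n', c_mat)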


Coefficients:
 [[-5.43948445e-05  2.19314009e-01 -5.28668488e-01 -9.15529531e-01
   2.99220268e-01  5.27528829e-02 -1.11981292e-01  3.61869665e-01
  -3.87720797e-01 -4.48078223e-01 -3.14584475e-01 -4.08661295e-01
   6.90020716e-02 -3.29886771e-01 -5.44685809e-01 -6.61713972e-01
  -2.96509390e-01 -3.49597782e-01  1.42322435e-01  1.51382199e-03
   2.56742282e-01  4.51000776e-01  6.06613482e-01  2.08903073e-01
   1.50033018e-02 -6.39675656e-01  1.58511669e-01 -9.37577361e-02
   5.71954889e-02 -5.79680261e-03]]
R2 Score:  0.9989607033510295
Confusion Matrix:
 [[71040    44]
 [   30    88]]


Out of 71202 test records, the model flagged only 118 as fraudulent. Because the classes are so heavily skewed, it is hard to celebrate the model's accuracy of 99.9% (the value printed as "R2 Score" is the mean accuracy that `score` returns for classifiers). Note that `confusion_matrix` was called with `y_pred` as the first argument, so the rows of the matrix are predicted classes and the columns are actual classes. Read that way, the model flagged 30 legitimate transactions as fraudulent (false positives, Type I errors) and missed 44 fraudulent transactions (false negatives, Type II errors). Put in context, the false positive rate was 30/71070 or 0.04% (specificity of 99.96%), and the false negative rate was 44/132 or 33.33% (sensitivity of 66.67%). In this case we are more interested in the false negatives, since they mean that we missed a third of all fraudulent transactions in the test set.
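The rates can be checked by hand from the printed matrix. The sketch below hard-codes the four counts from the output above and assumes the layout implied by the call `confusion_matrix(y_pred, y_test)`: rows are predicted classes and columns are actual classes. The variable names are illustrative only.

```python
# Confusion matrix as printed above. Assumed layout, given the call
# confusion_matrix(y_pred, y_test): rows = predicted, columns = actual.
c_mat = [[71040, 44],
         [30, 88]]

tn, fn = c_mat[0]   # predicted legitimate: actually legitimate, actually fraud
fp, tp = c_mat[1]   # predicted fraud: actually legitimate, actually fraud

sensitivity = tp / (tp + fn)   # share of actual fraud that was caught (recall)
specificity = tn / (tn + fp)   # share of legitimate transactions left alone
precision = tp / (tp + fp)     # share of flagged transactions that were fraud

print(f'Sensitivity: {sensitivity:.2%}')
print(f'Specificity: {specificity:.2%}')
print(f'Precision:   {precision:.2%}')
```

Under this reading, about a quarter of the 118 flagged transactions were actually legitimate, which is where the 30/118 figure comes from.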