#Credit Card Fraud Detection
###by Daniel Pham

##README

When running this .ipynb file on Google Colab, you **MAY** have to restart the runtime in order to use the imblearn library.

If this is being run in any other IDE, make sure the imblearn library is installed so that the code runs properly.

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

!pip install -U imbalanced-learn
from imblearn.over_sampling import SMOTE

Requirement already up-to-date: imbalanced-learn in /usr/local/lib/python3.7/dist-packages (0.8.0)


The credit card fraud dataset comes with known labels, so I will use a supervised technique.

Since the output is binary (fraud or not fraud), I decided to use **logistic regression** as my method for this project.

We can then look at the learned weights to see which features contribute most to a fraud prediction.
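As a rough illustration of how logistic regression turns feature weights into a fraud/not-fraud decision, here is a minimal sketch in plain Python. The weights, bias, and sample values are hypothetical and chosen only for illustration; they are not taken from the model trained later in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of class 1 for one sample x under logistic regression."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical weights and sample, chosen only for illustration.
weights = [0.8, -1.4]
bias = -0.5
sample = [1.0, 2.0]

p_fraud = predict_proba(weights, bias, sample)
label = int(p_fraud >= 0.5)  # a 0.5 cut-off turns the probability into a class
print(p_fraud, label)
```

A positive weight raises the score (and so the fraud probability) as its feature grows, and a negative weight lowers it, which is why the weights can later be read as a hint about feature influence.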

#Exploring the problem

In [None]:
#if uploading the CSV to Google Colab, the upload takes a while, so please wait for it to finish
df = pd.read_csv("creditcard.csv")

In [None]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [None]:
df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,1.768627e-15,9.170318e-16,-1.810658e-15,1.693438e-15,1.479045e-15,3.482336e-15,1.392007e-15,-7.528491e-16,4.328772e-16,9.049732e-16,5.085503e-16,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,1.08885,1.020713,0.9992014,0.9952742,0.9585956,0.915316,0.8762529,0.8493371,0.8381762,0.8140405,0.770925,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,-24.58826,-4.797473,-18.68371,-5.791881,-19.21433,-4.498945,-14.12985,-25.1628,-9.498746,-7.213527,-54.49772,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,-0.5354257,-0.7624942,-0.4055715,-0.6485393,-0.425574,-0.5828843,-0.4680368,-0.4837483,-0.4988498,-0.4562989,-0.2117214,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,-0.09291738,-0.03275735,0.1400326,-0.01356806,0.05060132,0.04807155,0.06641332,-0.06567575,-0.003636312,0.003734823,-0.06248109,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,0.4539234,0.7395934,0.618238,0.662505,0.4931498,0.6488208,0.5232963,0.399675,0.5008067,0.4589494,0.1330408,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,23.74514,12.01891,7.848392,7.126883,10.52677,8.877742,17.31511,9.253526,5.041069,5.591971,39.4209,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


Credit card fraud data usually has a large class imbalance, so let's take a look at how severe it is here.

In [None]:
class_count = df['Class'].value_counts()
print(class_count)

0    284315
1       492
Name: Class, dtype: int64


As shown, there is a huge class imbalance: only 492 cases of fraud out of 284,807 samples (about 0.17%)!
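To put the imbalance in numbers, the short sketch below uses the two class counts printed above to compute the fraud rate and the accuracy a trivial model would reach by labeling every transaction as non-fraud. The variable names are my own.

```python
# Class counts copied from the value_counts() output above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total
# Accuracy of a trivial "classifier" that labels every transaction as non-fraud.
baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4%}")
```

This baseline is worth keeping in mind whenever an accuracy score is quoted for this dataset.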

The algorithm may not be trained properly with too few samples of the minority class.

To address this class imbalance, I will oversample the minority class (fraud) with **SMOTE** instead of undersampling the majority class (not fraud).

Furthermore, this data has already been cleaned up for us by the providers. All the features except for a couple (Time and Amount) have been converted into PCA components. The features that were not converted may need to be normalized, but we'll see.
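If normalizing the non-PCA features turns out to be useful, one common option is standardization (subtract the mean, divide by the standard deviation). The sketch below shows the idea with NumPy on a few hypothetical amounts; it is not part of the pipeline used in this notebook.

```python
import numpy as np

# Hypothetical transaction amounts, standing in for the 'Amount' column.
train_amount = np.array([149.62, 2.69, 378.66, 123.50, 69.99])
test_amount = np.array([27.00, 11.85, 7.68])

# Learn the mean and standard deviation from the training data only...
mean = train_amount.mean()
std = train_amount.std()

# ...then apply the same transformation to both sets.
train_scaled = (train_amount - mean) / std
test_scaled = (test_amount - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))
```

Scikit-learn's `StandardScaler` performs the same computation; the important design choice is that the statistics come from the training set only, so no information from the test set leaks into the model.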

#Creating Training and Test Sets

To properly train a logistic regression model, we first need to make 2 separate sets from the data: a training set to train the model and a test set to measure how good the trained model is.

The training set will contain 80% of the data while the test set will contain the rest (20% of the data).
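With so few fraud cases, a purely random 80/20 split can give the two sets slightly different fraud ratios, and the split changes on every run. The sketch below shows one possible variation on synthetic stand-in data, using the `stratify` and `random_state` parameters of `train_test_split`. The data and the seed value are assumptions for illustration; the outputs shown in this notebook come from a plain random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 10,000 rows, 20 of them fraud (0.2%).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = np.zeros(10_000, dtype=int)
y[:20] = 1

# stratify=y keeps the fraud ratio the same in both subsets;
# random_state makes the split repeatable between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Fraud in training set:", y_train.sum())
print("Fraud in test set:", y_test.sum())
```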

In [None]:
training_set, test_set = train_test_split(df, test_size = 0.2)

#checking if it's done correctly

training_set.info()
test_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 227845 entries, 228044 to 158499
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    227845 non-null  float64
 1   V1      227845 non-null  float64
 2   V2      227845 non-null  float64
 3   V3      227845 non-null  float64
 4   V4      227845 non-null  float64
 5   V5      227845 non-null  float64
 6   V6      227845 non-null  float64
 7   V7      227845 non-null  float64
 8   V8      227845 non-null  float64
 9   V9      227845 non-null  float64
 10  V10     227845 non-null  float64
 11  V11     227845 non-null  float64
 12  V12     227845 non-null  float64
 13  V13     227845 non-null  float64
 14  V14     227845 non-null  float64
 15  V15     227845 non-null  float64
 16  V16     227845 non-null  float64
 17  V17     227845 non-null  float64
 18  V18     227845 non-null  float64
 19  V19     227845 non-null  float64
 20  V20     227845 non-null  float64
 21  V21  

Creating variables that contain just the *labels* of the samples in the training and test sets, for use with scikit-learn.

In [None]:
training_labels = training_set['Class']
test_labels = test_set['Class']

#checking if it's done correctly
training_labels.head()

228044    0
13108     0
258199    0
83554     0
46084     0
Name: Class, dtype: int64

Dropping the 'Class' column to create variables of just the *features* for both the training and test sets.

In [None]:
training_features = training_set.drop(['Class'], axis = 1)
test_features = test_set.drop(['Class'], axis = 1)

#checking if it's done correctly
training_features.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
228044,145335.0,2.065967,-0.327213,-2.221817,-0.713805,0.43534,-1.028121,0.45189,-0.285918,0.407632,0.016286,0.858679,0.489616,-1.233011,0.975564,-0.61204,-0.618106,-0.110769,-0.281928,0.741601,-0.197287,0.142837,0.469987,0.000287,0.792088,0.240474,1.027339,-0.15649,-0.093711,27.0
13108,23003.0,-0.250748,0.228252,1.997959,-0.668625,-0.653781,-0.404783,-0.247943,0.087898,2.572726,-1.660099,0.791595,-2.08971,1.127353,1.387759,0.860167,-1.21294,1.22673,0.228861,0.64506,-0.07112,0.050256,0.62555,0.084838,0.355037,-0.857608,-0.769203,0.269768,0.205272,11.85
258199,158539.0,-1.121313,0.771984,1.068908,-1.380343,0.517473,0.200858,0.500902,0.033177,1.29802,0.070078,-1.622049,-0.870944,-1.109992,-0.394997,0.651399,0.28861,-0.863619,0.131083,0.090543,0.198547,-0.380745,-0.619611,0.001221,-0.042411,-0.106341,-0.264659,0.31757,0.066396,7.68
83554,59914.0,-1.029989,1.269654,2.026286,0.984254,0.853755,0.189979,1.372185,-0.79893,-0.061175,2.021282,1.551693,-0.458205,-1.458916,-0.545649,-0.371239,0.431947,-1.26789,-0.298401,-1.89103,-0.012271,-0.04201,0.161717,-0.16981,0.141905,-0.391466,-0.469039,-1.348529,-0.787572,8.13
46084,42633.0,-0.335568,0.982971,1.595383,1.03439,0.292466,0.067714,1.041377,-0.185117,-1.02162,0.269348,1.988393,0.981614,0.405207,0.301178,0.742268,-0.538149,-0.102792,-0.535909,0.784026,0.109466,-0.333262,-0.86285,0.147012,0.156859,-0.410384,-0.765186,-0.207621,-0.19397,52.0


We have 4 new variables. Now we can refer to the features and labels of the training and test sets.

#Applying Scikit-learn's Logistic Regression **without** SMOTE sampling

First, we want to see how the logistic regression algorithm will fare without applying oversampling to the dataset.

In [None]:
#create the trained algorithm by using the training set
logreg = LogisticRegression(max_iter=1000)
logreg.fit(training_features, training_labels)

#test the algorithm with the test set
logreg_score = logreg.score(test_features, test_labels)
print("Logistic regression accuracy without oversampling:", logreg_score)

Logistic regression accuracy without oversampling: 0.9991748885221726


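Scikit-learn fitted the model, and it scores about 99.9% accuracy on the test set. Keep in mind, though, that a model which labels every transaction as non-fraud would already score about 99.8% on data this imbalanced, so accuracy alone says little here.

Now I will cross-validate the model with 10 folds to see if it returns similar accuracies.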

In [None]:
cv_scores = cross_val_score(logreg, training_features, training_labels, cv=10)

print("Cross Validation Scores without oversampling: \n", cv_scores)

cv_scores = pd.Series(cv_scores)
print("Cross Validation Minimum:", cv_scores.min())
print("Cross Validation Mean:", cv_scores.mean()) 
print("Cross Validation Max:", cv_scores.max())

Cross Validation Scores without oversampling: 
 [0.99938556 0.99921001 0.99868334 0.99916612 0.9992539  0.99916608
 0.99916608 0.9990783  0.99920997 0.99920997]
Cross Validation Minimum: 0.9986833443054641
Cross Validation Mean: 0.9991529332034554
Cross Validation Max: 0.9993855606758832


Looking at the cross-validation scores, the model has similar accuracy across all folds, in line with its test-set accuracy. But is it actually predicting the class that we are interested in (the samples that are fraud) correctly?



It is time to look at the confusion matrix and F-scores.

In [None]:
#having the trained model predict over the test set
predictions = logreg.predict(test_features)

#confusion matrix from comparing the predictions to the data's actual labels
cm =  confusion_matrix(test_labels, predictions)

print("Confusion matrix without oversampling: \n", cm)

Confusion matrix without oversampling: 
 [[56839    18]
 [   29    76]]


To interpret this...

**Upper Left Corner**: The number of non-fraudulent transactions that the model correctly classified as non-fraudulent (true negatives).

**Upper Right Corner**: The number of transactions incorrectly classified as fraudulent when the actual label is non-fraudulent (false positives).

**Lower Left Corner**: The number of transactions incorrectly classified as non-fraudulent when the actual label is fraudulent (false negatives).

**Lower Right Corner**: The number of fraudulent transactions that the model correctly classified as fraudulent (true positives).

At a glance, the confusion matrix shows that the model predicts the non-fraudulent transactions very well. But that is not what we're interested in.

Looking at the fraudulent cases, we get a rather unsatisfactory result, as expected. The model correctly identifies only 76 of the 105 fraudulent transactions in the test set, or about 72%. From an industry point of view, this model is most likely not good enough.
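The four corners described above can be unpacked directly from the matrix. The sketch below copies the printed confusion matrix and derives the share of fraud that was caught; the variable names are my own.

```python
# Confusion matrix copied from the output above.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = [[56839, 18],
      [29, 76]]

(tn, fp), (fn, tp) = cm

fraud_recall = tp / (tp + fn)       # share of actual fraud that was caught
false_alarm_rate = fp / (fp + tn)   # share of legitimate transactions flagged

print(f"Fraud caught: {tp} of {tp + fn} ({fraud_recall:.1%})")
print(f"False alarm rate: {false_alarm_rate:.3%}")
```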

To further reinforce the results, let's take a look at the classification report for these predictions.

In [None]:
report = classification_report(test_labels, predictions)

print("Classification report without oversampling: \n", report)

Classification report without oversampling: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56857
           1       0.81      0.72      0.76       105

    accuracy                           1.00     56962
   macro avg       0.90      0.86      0.88     56962
weighted avg       1.00      1.00      1.00     56962



**Precision** is the percentage of samples predicted as a class that actually belong to that class, while **recall** is the percentage of samples of a class that the model identifies correctly. The F1-score combines precision and recall into a single number (their harmonic mean).
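These three definitions can be checked by hand against the fraud row of the report. The helper below is a small sketch of my own (it is not a scikit-learn function) that takes the raw counts from the confusion matrix without oversampling.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Fraud-class counts from the confusion matrix without oversampling.
precision, recall, f1 = precision_recall_f1(tp=76, fp=18, fn=29)
print(round(precision, 2), round(recall, 2), round(f1, 2))
# prints: 0.81 0.72 0.76
```

The rounded values agree with the row for class 1 in the classification report.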


According to this, the precision and recall are excellent for the non-fraudulent cases and only okay for the fraudulent cases. The F1-score of 0.76 for the fraud class suggests that the model did a decent, but not great, job overall.

Let's do one last check to see how this model without oversampling fares on the ROC AUC score.

The ROC curve is a performance measurement that shows how capable the model is at distinguishing between the classes. The area under it (AUC) ranges from 0 to 1, where 0.5 corresponds to random guessing and 1.0 to perfect separation.

In [None]:
auc_score = roc_auc_score(test_labels, predictions)

print("AUC score without oversampling:", auc_score)

AUC score without oversampling: 0.8617464700497574


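This area-under-the-curve (AUC) score of about 0.86 is rather good. Note that the cell above passes hard 0/1 predictions rather than predicted probabilities, so the number is effectively the average of the recall for the two classes. It suggests that the model could be used for credit card fraud detection, but that does not mean we cannot do better.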
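`roc_auc_score` is usually given scores or probabilities rather than hard class labels. The sketch below, on synthetic stand-in data (not the credit card dataset), compares both ways of calling it and shows that the hard-label version matches scikit-learn's `balanced_accuracy_score`. The dataset parameters and seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 5% positives).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

hard_labels = model.predict(X_test)              # 0 or 1
fraud_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_proba = roc_auc_score(y_test, fraud_proba)
balanced_acc = balanced_accuracy_score(y_test, hard_labels)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", balanced_acc)
print("AUC from probabilities:", auc_from_proba)
```

In this notebook, the equivalent call would be `roc_auc_score(test_labels, logreg.predict_proba(test_features)[:, 1])`, which evaluates the ranking across all thresholds instead of a single one.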

#Applying Scikit-learn's Logistic Regression **with** SMOTE sampling

As mentioned earlier, this problem has a huge class imbalance and the model may not be trained properly because of it. Therefore, it is now time to apply SMOTE oversampling, and hopefully, the algorithm will improve.

We now need to create new variables for the **training** set using SMOTE. The test set is left untouched so that it still reflects the real class balance.



In [None]:
smote = SMOTE(random_state = 10, sampling_strategy = 1.0)
training_features_smote, training_labels_smote = smote.fit_resample(training_features, training_labels)
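To give an intuition for what SMOTE does, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. This is a simplified illustration written with NumPy, not imblearn's actual implementation; the function name, the sample points, and the parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))             # pick a minority sample
        neighbour = neighbours[base, rng.integers(k)]  # and one of its neighbours
        gap = rng.random()                             # position along the segment
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Hypothetical 2-D minority-class points, for illustration only.
fraud_points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_like_samples(fraud_points, n_new=10)
print(new_points.shape)
```

Because every synthetic point lies on a line segment between two real fraud samples, the new samples stay inside the region already occupied by the minority class.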

Now we just need to repeat what we did earlier, except with new SMOTE training set variables.

Creating a logistic regression model with SMOTE variables.

In [None]:
logreg_smote = LogisticRegression(max_iter=1000)
logreg_smote.fit(training_features_smote, training_labels_smote)

logreg_score_smote = logreg_smote.score(test_features, test_labels)
print("Logistic regression accuracy with oversampling:", logreg_score_smote)

Logistic regression accuracy with oversampling: 0.9825146588954039


We see here that the logistic regression model with SMOTE has a lower overall accuracy (about 98.3%) than the model without SMOTE (about 99.9%).

This is likely because the model gives up some accuracy on the majority class in order to better predict the class that we are interested in, which is now as common as the majority class in the training data.

In [None]:
cv_scores_smote = cross_val_score(logreg_smote, training_features_smote, training_labels_smote, cv=10)

print("Cross Validation Scores with oversampling: \n", cv_scores_smote)

cv_scores_smote = pd.Series(cv_scores_smote)
print("Cross Validation Minimum:", cv_scores_smote.min())
print("Cross Validation Mean:", cv_scores_smote.mean()) 
print("Cross Validation Max:", cv_scores_smote.max())

Cross Validation Scores with oversampling: 
 [0.97247868 0.95414578 0.97410534 0.9695331  0.96975292 0.97340192
 0.95284782 0.97164274 0.9734453  0.95495812]
Cross Validation Minimum: 0.9528478160515267
Cross Validation Mean: 0.9666311722807948
Cross Validation Max: 0.9741053372021454


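The cross-validation accuracies (about 95% to 97%) are in the same range as the SMOTE model's test accuracy, though slightly lower. The two are not directly comparable: these folds are drawn from the balanced, oversampled training set, so synthetic fraud samples also end up in the validation folds, while the test set keeps the original imbalance.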
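One way to make cross-validation scores reflect the real class balance is to oversample inside each fold, using the training part only. The sketch below illustrates that loop on synthetic stand-in data. To keep it dependent on scikit-learn alone, it swaps SMOTE for plain random duplication via `sklearn.utils.resample`; in this notebook, `smote.fit_resample` would take that step's place. All data and seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data (about 5% positives).
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)

fold_recalls = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Oversample the minority class of the training fold only.
    # (Plain duplication here; SMOTE would be applied at this step instead.)
    X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                            np.ones(len(X_min_up), dtype=int)])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # The validation fold is left untouched, so it keeps the real imbalance.
    fold_recalls.append(recall_score(y[val_idx], model.predict(X[val_idx])))

print("Fraud recall per fold:", np.round(fold_recalls, 3))
```

The imblearn library also offers `imblearn.pipeline.Pipeline`, which applies a sampler only while fitting and can be passed to `cross_val_score` for the same effect.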

Now let's take a look at whether the model is actually predicting the fraudulent cases correctly.

In [None]:
predictions_smote = logreg_smote.predict(test_features)

cm_smote =  confusion_matrix(test_labels, predictions_smote)

print("Confusion matrix with oversampling: \n", cm_smote)

Confusion matrix with oversampling: 
 [[55869   988]
 [    8    97]]


As a reminder...

**Upper Left Corner**: The number of non-fraudulent transactions that the model correctly classified as non-fraudulent (true negatives).

**Upper Right Corner**: The number of transactions incorrectly classified as fraudulent when the actual label is non-fraudulent (false positives).

**Lower Left Corner**: The number of transactions incorrectly classified as non-fraudulent when the actual label is fraudulent (false negatives).

**Lower Right Corner**: The number of fraudulent transactions that the model correctly classified as fraudulent (true positives).

As a result of the model being trained with SMOTE, it now catches 97 of the 105 fraudulent transactions in the test set, a recall of about 92% for the class that we are interested in! That is a big improvement over the previous 72%.

However, we do see that it flags many more non-fraudulent cases as fraudulent (988 instead of 18). That is an acceptable trade-off here, since it is better to be safe than sorry.

Credit card companies can better protect their consumers by asking people to verify a transaction whenever the algorithm thinks it is a fraudulent case.
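The trade-off can be made concrete by putting the two confusion matrices side by side. The sketch below copies both matrices from the outputs above and counts how many alerts each model would raise and how many frauds it would miss; the dictionary keys are my own naming.

```python
# Confusion matrices copied from the two outputs above:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
matrices = {
    "without SMOTE": [[56839, 18], [29, 76]],
    "with SMOTE": [[55869, 988], [8, 97]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {
        "alerts_raised": fp + tp,  # transactions a customer would be asked to verify
        "fraud_caught": tp,
        "fraud_missed": fn,
        "false_alarms_per_fraud_caught": fp / tp,
    }

for name, stats in summary.items():
    print(name, stats)
```

Whether roughly ten false alarms per caught fraud is acceptable depends on how cheap a verification request is compared to a missed fraud, which is a business decision rather than a modeling one.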

Now let's see what the classification report says.

In [None]:
report_smote = classification_report(test_labels, predictions_smote)

print("Classification report with oversampling: \n", report_smote)

Classification report with oversampling: 
               precision    recall  f1-score   support

           0       1.00      0.98      0.99     56857
           1       0.09      0.92      0.16       105

    accuracy                           0.98     56962
   macro avg       0.54      0.95      0.58     56962
weighted avg       1.00      0.98      0.99     56962



As another reminder...

**Precision** is the percentage of samples predicted as a class that actually belong to that class, while **recall** is the percentage of samples of a class that the model identifies correctly. The F1-score combines precision and recall into a single number (their harmonic mean).

We see here that the results reinforce what I was saying earlier. The low fraud precision (0.09) results from the model flagging many more non-fraudulent cases as fraudulent. But in exchange, the model catches far more of the fraudulent cases (recall of 0.92 instead of 0.72). The fraud F1-score looks very poor (0.16), but if a missed fraud is costlier than a false alarm, this is still a more useful model than the one without oversampling.

Lastly, let's see what the ROC AUC score suggests about the model.

In [None]:
auc_score_smote = roc_auc_score(test_labels, predictions_smote)

print("AUC score with oversampling:", auc_score_smote)

AUC score with oversampling: 0.953216297863395


This AUC score of about 0.95 is excellent, and a model like this would be a strong candidate for use in the industry.

When compared to the previous score of about 0.86, it suggests that this model, which was trained through oversampling the minority class, is the better one.

#Conclusions

In conclusion, logistic regression is a great model to use for credit card fraud detection.

Since there was a huge class imbalance, training the model with SMOTE oversampling yielded a much better algorithm for the class that we are interested in than training the model without it.

Now let's see which features in particular heavily influenced if a transaction was fraudulent or not.

In [None]:
#this contains all the weights in the algorithm
weights = logreg_smote.coef_

#this contains the names of the features the model was trained on
#(df.columns would also include the 'Class' label, which is not a feature)
features = training_features.columns

#creates a dictionary of the features and their associated weights
weights_dict = {}
for weight, feat in zip(weights[0,:], features):
    weights_dict[feat] = weight

#prints all key value pairs in dictionary
for element in weights_dict.items():
    print(element)

('Time', -3.757874976432943e-05)
('V1', 0.15019753375998096)
('V2', -0.2489840494760921)
('V3', -0.8587742651232082)
('V4', 0.8637960973558975)
('V5', 0.26399804892451945)
('V6', -0.3029587023427311)
('V7', -0.23103062793153795)
('V8', -0.4395015697357681)
('V9', -0.5360104612444051)
('V10', -0.6424805605493271)
('V11', 0.2349376690388481)
('V12', -0.8696767603767634)
('V13', -0.44404124544872386)
('V14', -1.4416911904021594)
('V15', -0.2613064095415009)
('V16', -0.6275713551943845)
('V17', -1.0287285616573794)
('V18', -0.12192251237622763)
('V19', 0.2052061882717862)
('V20', 0.03372111805303771)
('V21', 0.11090756685767993)
('V22', 0.36430690113643127)
('V23', 0.06889949398808184)
('V24', 0.08887837331977193)
('V25', -0.2228668997492212)
('V26', -0.06125643451590009)
('V27', 0.07206251068071699)
('V28', 0.0333622415501257)
('Amount', -0.0004798187593282112)


Since this is logistic regression, a negative weight pushes the model toward outputting a 0 (aka the class we are NOT interested in) as the feature's value increases, while a positive weight pushes it toward outputting a 1 (aka the class we are interested in).
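A common way to read a logistic regression weight is through its exponential: `exp(weight)` is the factor by which the odds of class 1 are multiplied when that feature increases by one unit and the others stay fixed. The sketch below applies this to two of the weights printed above; the variable names are my own.

```python
import math

# Two weights copied from the output above.
weights = {"V14": -1.4416911904021594, "V4": 0.8637960973558975}

# exp(weight) is the factor by which the odds of fraud are multiplied
# when the feature increases by one unit (other features held fixed).
odds_ratios = {feat: math.exp(w) for feat, w in weights.items()}

for feat, ratio in odds_ratios.items():
    print(f"{feat}: odds of fraud multiplied by {ratio:.2f} per unit increase")
```

A ratio below 1 means the odds of fraud shrink as the feature grows, and a ratio above 1 means they grow.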

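The features "V14", "V17", and "V12" have the largest negative weights, so they seem to be the most significant features pushing the model toward non-fraudulent predictions. The weight for "Time" is negative but tiny; because "Time" and "Amount" were not scaled, their weights cannot be compared directly with those of the PCA features.

The features "V4", "V22", and "V5" have the largest positive weights, so they seem to be the most significant features pushing the model toward fraudulent predictions.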
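Rather than reading the list by eye, the weights can be sorted to find the extremes. The sketch below copies the printed weights (rounded to four decimals, with "Time" and "Amount" kept in shortened form) and ranks them; it assumes the same values as the output above.

```python
# Weights copied from the output above, rounded to four decimals.
weights = {
    "Time": -3.76e-05, "V1": 0.1502, "V2": -0.2490, "V3": -0.8588,
    "V4": 0.8638, "V5": 0.2640, "V6": -0.3030, "V7": -0.2310,
    "V8": -0.4395, "V9": -0.5360, "V10": -0.6425, "V11": 0.2349,
    "V12": -0.8697, "V13": -0.4440, "V14": -1.4417, "V15": -0.2613,
    "V16": -0.6276, "V17": -1.0287, "V18": -0.1219, "V19": 0.2052,
    "V20": 0.0337, "V21": 0.1109, "V22": 0.3643, "V23": 0.0689,
    "V24": 0.0889, "V25": -0.2229, "V26": -0.0613, "V27": 0.0721,
    "V28": 0.0334, "Amount": -0.00048,
}

# Sort from most negative to most positive weight.
ranked = sorted(weights.items(), key=lambda item: item[1])

most_negative = [feat for feat, _ in ranked[:3]]
most_positive = [feat for feat, _ in ranked[-3:][::-1]]

print("Strongest push toward non-fraud:", most_negative)
print("Strongest push toward fraud:", most_positive)
```

This ranking is only meaningful between features on a similar scale, which is why the unscaled "Time" and "Amount" columns should be read separately from the PCA features.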