---

# Credit Card Fraud Detection

This project aims to correctly classify fraudulent transactions based on 28 transaction features (V1 to V28) and the
amount of money involved in each transaction.

The [dataset provided by Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud) is imbalanced, so a variety of methods
and models will be used to explore some of the ways of dealing with imbalanced classification problems. These problems
are commonplace in reality, for example, in disease diagnosis or customer churn prediction.

***Note:*** *The data visualisations used within this Jupyter notebook file use plotly and are interactive. To view
these figures, please enter the GitHub URL for this notebook into the [nbviewer site](http://nbviewer.jupyter.org/).*

---

In [2]:
# imports

import os

from collections import Counter

import numpy as np # calculations and functions

import pandas as pd  # data processing / wrangling
pd.set_option('display.max_columns', None)

import matplotlib
import matplotlib.pyplot as plt # data visualisation
%matplotlib inline

from IPython.display import display  # displaying DataFrames

import plotly  # data visualisation
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook_connected'
colors = px.colors.qualitative.Plotly

import sklearn  # data preprocessing and machine learning algorithms
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import imblearn  # data preprocessing and resampling
from imblearn.over_sampling import SMOTE

import tensorflow as tf  # deep learning models
from tensorflow import keras

print('numpy version      :', np.__version__)
print('pandas version     :', pd.__version__)
print('matplotlib version :', matplotlib.__version__)
print('plotly version     :', plotly.__version__)
print('sklearn version    :', sklearn.__version__)
print('imblearn version   :', imblearn.__version__)
print('tensorflow version :', tf.__version__)

if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

numpy version      : 1.20.3
pandas version     : 1.3.0
matplotlib version : 3.4.2
plotly version     : 5.1.0
sklearn version    : 0.24.2
imblearn version   : 0.8.0
tensorflow version : 2.5.0
Default GPU Device: /device:GPU:0


---

## Part 1 - Exploratory Data Analysis

To begin, the dataset will be loaded from a CSV file to a Pandas DataFrame object.

In [3]:
# load data either from local dir or Google Drive
data_path = os.path.join(os.getcwd(), 'input', 'creditcard.csv')
try:
    df = pd.read_csv(data_path)
except FileNotFoundError as e:
    print(e)
    print('Pulling data from google drive...', end=' ')
    url = 'https://drive.google.com/file/d/1CvdihQ3YUzllrrT60JKiszIONRFNIbxL/view?usp=sharing'
    url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
    df = pd.read_csv(url)

Next, let's inspect the DataFrame to get to grips with the various features of the dataset and look for any anomalies or
problem areas.

In [4]:
# display section of data
print('Section of DataFrame:')
display(df.head())

# display metrics for each column in DataFrame
print('Metrics for each DataFrame column:')
display(df.describe())

# display data types for each column in dataframe
print('Data types for each DataFrame column:')
display(df.dtypes)

Section of DataFrame:


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Metrics for each DataFrame column:


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,2.239053e-15,1.673327e-15,-1.247012e-15,8.190001e-16,1.207294e-15,4.887456e-15,1.437716e-15,-3.772171e-16,9.564149e-16,1.039917e-15,6.406204e-16,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,1.08885,1.020713,0.9992014,0.9952742,0.9585956,0.915316,0.8762529,0.8493371,0.8381762,0.8140405,0.770925,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,-24.58826,-4.797473,-18.68371,-5.791881,-19.21433,-4.498945,-14.12985,-25.1628,-9.498746,-7.213527,-54.49772,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,-0.5354257,-0.7624942,-0.4055715,-0.6485393,-0.425574,-0.5828843,-0.4680368,-0.4837483,-0.4988498,-0.4562989,-0.2117214,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,-0.09291738,-0.03275735,0.1400326,-0.01356806,0.05060132,0.04807155,0.06641332,-0.06567575,-0.003636312,0.003734823,-0.06248109,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,0.4539234,0.7395934,0.618238,0.662505,0.4931498,0.6488208,0.5232963,0.399675,0.5008067,0.4589494,0.1330408,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,23.74514,12.01891,7.848392,7.126883,10.52677,8.877742,17.31511,9.253526,5.041069,5.591971,39.4209,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


Data types for each DataFrame column:


Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

Two aspects of this dataset stand out. Firstly, the number of features is large, so some feature selection or
engineering may be required to determine which features to feed to an algorithm. Secondly, the mean value of the *Class*
column is close to 0 (0.001727), indicating that the vast majority of transactions are non-fraudulent (*Class* = 0).

Let's take a closer look at the proportions of the fraudulent transactions.

In [5]:
nonfraud_count = len(df[df.Class == 0])
fraud_count = len(df[df.Class == 1])
fraud_percentage = round((fraud_count / (nonfraud_count + fraud_count)) * 100, 2)

print('TRANSACTION CLASS PROPORTIONS')
print('-----------------------------')
print('Total            :', nonfraud_count + fraud_count)
print('Non-Fraudulent   :', nonfraud_count)
print('Fraudulent       :', fraud_count)
print(f'Fraud Percentage : {fraud_percentage}%')

TRANSACTION CLASS PROPORTIONS
-----------------------------
Total            : 284807
Non-Fraudulent   : 284315
Fraudulent       : 492
Fraud Percentage : 0.17%


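At just 0.17% of all transactions, the fraudulent class is heavily outnumbered. Undersampling and oversampling methods
for dealing with this imbalance are discussed in Part 4.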

Finally, we will also examine the *Amount* statistics of the fraudulent and non-fraudulent transactions separately.

In [6]:
df_nonfraud = df[df.Class == 0]
df_fraud = df[df.Class == 1]

print('TRANSACTION AMOUNT STATISTICS')
print('-----------------------------')
print('NON-FRAUDULENT TRANSACTIONS')
print('-----------------------------')
print(df_nonfraud.Amount.describe())
print('-----------------------------')
print('FRAUDULENT TRANSACTIONS')
print('-----------------------------')
print(df_fraud.Amount.describe())

TRANSACTION AMOUNT STATISTICS
-----------------------------
NON-FRAUDULENT TRANSACTIONS
-----------------------------
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
-----------------------------
FRAUDULENT TRANSACTIONS
-----------------------------
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64


We can see from this analysis that the mean transaction amount is higher for fraudulent transactions (122.21) than for
non-fraudulent transactions (88.29), although the median is lower (9.25 versus 22.00).

This suggests that the *Amount* feature may be a useful one for our model to consider when predicting the
classification of a transaction.

---

## Part 2 - Data Cleaning

As seen in Part 1, the data in all columns are either floats or integers, so no encoding is required.

The *Time* column is assumed not to be useful for predicting the fraud *Class*, and so this column will be removed from
the dataset.

In [7]:
df.drop(columns='Time', inplace=True)
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Finally, the feature columns are scaled to a common range of 0 to 1, so that features with large ranges, such as
*Amount*, do not dominate those with small ranges.

In [8]:
scaler = MinMaxScaler()
cols_to_scale = [col for col in df.columns if col != 'Class']
print('Columns scaled:', cols_to_scale)
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
df.head()

Columns scaled: ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.935192,0.76649,0.881365,0.313023,0.763439,0.267669,0.266815,0.786444,0.475312,0.5106,0.252484,0.680908,0.371591,0.635591,0.446084,0.434392,0.737173,0.655066,0.594863,0.582942,0.561184,0.522992,0.663793,0.391253,0.585122,0.394557,0.418976,0.312697,0.005824,0
1,0.978542,0.770067,0.840298,0.271796,0.76612,0.262192,0.264875,0.786298,0.453981,0.505267,0.381188,0.744342,0.48619,0.641219,0.38384,0.464105,0.727794,0.640681,0.55193,0.57953,0.55784,0.480237,0.666938,0.33644,0.58729,0.446013,0.416345,0.313423,0.000105,0
2,0.935217,0.753118,0.868141,0.268766,0.762329,0.281122,0.270177,0.788042,0.410603,0.513018,0.322422,0.706683,0.503854,0.640473,0.511697,0.357443,0.763381,0.644945,0.386683,0.585855,0.565477,0.54603,0.678939,0.289354,0.559515,0.402727,0.415489,0.311911,0.014739,0
3,0.941878,0.765304,0.868484,0.213661,0.765647,0.275559,0.266803,0.789434,0.414999,0.507585,0.271817,0.71091,0.487635,0.636372,0.289124,0.415653,0.711253,0.788492,0.467058,0.57805,0.559734,0.510277,0.662607,0.223826,0.614245,0.389197,0.417669,0.314371,0.004807,0
4,0.938617,0.77652,0.864251,0.269796,0.762975,0.263984,0.268968,0.782484,0.49095,0.524303,0.236355,0.724477,0.552509,0.608406,0.349419,0.434995,0.724243,0.650665,0.62606,0.584615,0.561327,0.547271,0.663392,0.40127,0.566343,0.507497,0.420561,0.31749,0.002724,0
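
As a sanity check, min-max scaling maps each value to (x - min) / (max - min). The sketch below re-derives the scaled
*Amount* of the first three rows from the minimum (0.0) and maximum (25691.16) reported by `df.describe()` in Part 1.
The helper name `min_max_scale` is ours for illustration and is not part of sklearn.

```python
def min_max_scale(x, x_min, x_max):
    """Map x from the range [x_min, x_max] onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# Amount range taken from the df.describe() output in Part 1
amount_min, amount_max = 0.0, 25691.16

# Amount values of the first three rows of the unscaled DataFrame
amounts = [149.62, 2.69, 378.66]
scaled = [round(min_max_scale(a, amount_min, amount_max), 6) for a in amounts]
print(scaled)
```

The printed values should agree with the *Amount* column of the scaled DataFrame shown above.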


---

## Part 3 - Splitting the Data

It is important for a predictive model to be tested on data that it has not already seen. So, before moving on, we must
first split our dataset into two: a training set and a testing set.

We'll use 80% of our data for training and the remaining 20% for testing, although other split ratios can be selected.

In [9]:
# static data splitting

X = df.drop(columns='Class').values  # independent variables / features / input
y = df['Class'].values  # dependent variable / output

# randomly splits the X and y data into 80% training data and 20% test data,
# keeping the same class ratio in both sets (stratify=y)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=101,
    stratify=y
)

---

## Part 4 - Balancing the Dataset

A significant issue with this dataset, as identified in Part 1, is its imbalance, with fraudulent transactions making
up just 0.17% of all samples.

### What is an imbalanced dataset?

*A classification data set with skewed class proportions is called imbalanced.*<br>
*Classes that make up a large proportion of the data set are called majority classes.*<br>
*Those that make up a smaller proportion are called minority classes.*

###  Why is an imbalanced dataset a problem?

Let's create a simple function designed to predict whether a transaction is fraudulent or not.

In [10]:
def always_nonfraud():
    return 0

The above function will always predict a *Class* of 0 (non-fraudulent transaction) for any sample you give it.

This will, of course, do a terrible job of detecting fraudulent transactions. But how well will its predictions score
on our dataset? Let's test it out.

In [11]:
Class_pred = [always_nonfraud() for _ in range(len(df))]  # just a list of 0s
print(f"Accuracy Score : {round(accuracy_score(df['Class'], Class_pred)*100, 3)}%")

Accuracy Score : 99.827%


Wow! Our prediction function correctly classified over 99% of transactions! Great, time to deploy the model, right?

Well, no. Whilst this model has a high accuracy score, it did not detect a single fraudulent transaction. In the case
of detecting fraud, what we're most concerned about is how well the model identifies fraudulent transactions as fraud.

For assessing our model's performance, we can use metrics other than accuracy:

| Metric | Explanation | Formula |
| ------ | ----------- | ------- |
| Precision / Positive<br> Predictive Value (PPV) | A measure of a classifier's exactness. Low precision may indicate a large number of false positives. | $$\frac{TP}{FP+TP}$$ |
| Recall / Sensitivity /<br> True Positive Rate | A measure of a classifier's completeness. Low recall indicates many false negatives. | $$\frac{TP}{FN+TP}$$ |
| F1 Score | The balance (harmonic mean) between precision and recall. | $$2\times\frac{precision \times recall}{precision + recall}$$ |

Each of these metrics can be calculated for each class individually. What we're most concerned with is maximising the F1
score for the fraudulent class.
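
To make these formulas concrete, here is a minimal sketch that applies them to the `always_nonfraud()` predictions,
using the class counts from Part 1. The helper name `precision_recall_f1` is ours, and reporting a zero-denominator
ratio as 0.0 is a choice made for this sketch (it mirrors sklearn's default behaviour, which also emits a warning).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class.

    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# always_nonfraud() on the full dataset: it never predicts fraud,
# so all 492 fraudulent transactions are false negatives
tp, fp, fn, tn = 0, 0, 492, 284315

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f'accuracy              : {accuracy:.5f}')
print('precision, recall, f1 :', precision_recall_f1(tp, fp, fn))
```

Despite the very high accuracy, every metric for the fraudulent class comes out as zero, which is why accuracy alone
is a poor guide here.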

### How can we balance an imbalanced dataset?

Data sets can be balanced in many ways. A few of the most popular methods are:
- Under sampling
- Ensemble under sampling
- Over sampling
- Synthetic minority oversampling technique (SMOTE)

Let's discuss the practicalities of each of these methods.

#### Under Sampling

The first method we'll discuss is random under sampling. This method randomly selects a subset of samples from the
majority class (non-fraudulent transactions, *Class*=0), typically as many as there are minority class samples, and
discards the remaining majority class samples.

This can be achieved using:

```df_nonfraud_under = df_nonfraud.sample(len(df_fraud))```

However, this method results in the loss of a lot of data. If the overall dataset is very large, and the number of
minority class samples is sufficient for training of the model, this naive method is fast, efficient, and works well.
With this dataset, however, the number of minority class samples is very small, and thus under sampling is not
recommended.
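
For illustration only, the same idea can be sketched on index arrays with NumPy. The toy labels and the placeholder
feature column below are made up for this example and are not part of the project's data.

```python
import numpy as np

rng = np.random.default_rng(101)

# toy labels: 95 majority (0) and 5 minority (1) samples
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# keep a random subset of the majority class, the same size as the minority class
kept_majority_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept_majority_idx, minority_idx])

X_under, y_under = X[under_idx], y[under_idx]
print('class counts after under sampling:', np.bincount(y_under))
```

Note that 90 of the 95 majority samples are thrown away, which is the data loss described above.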

#### Ensemble Under Sampling

This modified version of the above under sampling method avoids the loss of majority class data by creating ensembles of
minority and majority datasets. Each dataset will have identical, duplicated minority class samples, and a subsection of
the original majority class samples.

Whilst this avoids the loss of data, the method still relies on having a sufficiently large number of minority class
samples to begin with, which we don't have.
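
A minimal sketch of how such balanced subsets might be built is shown below, again on made-up toy labels. In practice
one model would be trained per subset and their predictions combined; that part is omitted here.

```python
import numpy as np

rng = np.random.default_rng(101)

y = np.array([0] * 20 + [1] * 5)  # toy labels: 20 majority, 5 minority
minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.permutation(np.flatnonzero(y == 0))

# split the shuffled majority class into chunks the size of the minority class
n_subsets = len(majority_idx) // len(minority_idx)
majority_chunks = np.array_split(majority_idx, n_subsets)

# each balanced subset = one majority chunk + all minority samples
subsets = [np.concatenate([chunk, minority_idx]) for chunk in majority_chunks]

for i, idx in enumerate(subsets):
    print(f'subset {i}: class counts {np.bincount(y[idx])}')
```

Every majority sample appears in exactly one subset, while the same five minority samples are reused in all of them.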

#### Over Sampling

Where under sampling randomly deletes majority class samples, random over sampling randomly duplicates minority class
samples until a desired class distribution is achieved.

This, again, avoids the loss of data associated with naive, random under sampling. However, the duplicated data may
result in over-weighting certain feature values for the minority class. To avoid this, a more sophisticated method can
be employed to generate the additional minority class samples, instead of random duplication.
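
Random over sampling can be sketched in the same way, by drawing minority indices with replacement. The toy labels
are, once more, invented for this example.

```python
import numpy as np

rng = np.random.default_rng(101)

y = np.array([0] * 95 + [1] * 5)  # toy labels
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# draw minority samples with replacement until both classes are the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra_idx])

y_over = y[over_idx]
print('class counts after over sampling :', np.bincount(y_over))
print('distinct minority samples        :', len(np.unique(over_idx[y_over == 1])))
```

The classes are now balanced, but the minority class still contains only five distinct samples, each repeated many
times.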

#### Synthetic Minority Oversampling Technique (SMOTE)

SMOTE is a form of oversampling in which new minority class samples are created by interpolating between the feature
vectors of existing minority class samples.

The steps involved in SMOTE are as follows:

1. Identify a minority class feature vector and one of its nearest minority class neighbors.
2. Take the difference between the two feature vectors.
3. Multiply the difference by a random number between 0 and 1, and add the result to the original feature vector to
   give a new synthetic feature vector.
4. Repeat for each feature vector until the desired class distribution is reached.

This is the method we will be using for this dataset.
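
The interpolation at the heart of these steps can be sketched for a single sample as follows. This is a simplified
illustration, not imblearn's implementation: it always uses the single nearest neighbour, whereas imblearn's `SMOTE`
picks at random among several nearest neighbours. The toy data and the helper name `smote_sample` are assumptions made
for this example.

```python
import numpy as np

def smote_sample(x, neighbour, rng):
    """Create one synthetic point on the line segment between x and neighbour."""
    gap = rng.random()  # random number in [0, 1)
    return x + gap * (neighbour - x)

rng = np.random.default_rng(101)

# toy minority class samples with two features each
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [6.0, 6.0]])

x = minority[0]
# nearest other minority sample by Euclidean distance
distances = np.linalg.norm(minority - x, axis=1)
distances[0] = np.inf  # exclude x itself
neighbour = minority[np.argmin(distances)]

synthetic = smote_sample(x, neighbour, rng)
print('sample    :', x)
print('neighbour :', neighbour)
print('synthetic :', synthetic)
```

The synthetic point always lies somewhere between the sample and its neighbour, so it is a new feature vector rather
than an exact copy of an existing one.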

***Note:*** It is important to split the dataset into training and testing sets before oversampling your data.
Oversampled data contains duplicated samples (random over sampling) or synthetic samples derived from existing ones
(SMOTE). If you oversample and then split, copies or close relatives of training samples can end up in your testing
data, meaning your model will be evaluated on samples it has effectively already seen... a bad idea.

In [12]:
print('Original training dataset shape  :', Counter(y_train))

# resample training data
sm = SMOTE(random_state=101)  # initialize SMOTE class
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print('Resampled training dataset shape :', Counter(y_train_res))

Original training dataset shape  : Counter({0: 227451, 1: 394})
Resampled training dataset shape : Counter({0: 227451, 1: 227451})


---

## Part 5 - Logistic Regression Model

The first algorithm we'll employ for this project is a logistic regression classifier.

In this model, the probabilities describing the possible outcomes of a single trial are modeled using a
logistic function:

$$\hat{y}=\frac{1}{1+e^{-z}}$$

where:
- $$\hat{y}$$ is the output of logistic regression model (prediction)
- $$z=b+w_{1}x_{1}+w_{2}x_{2}+...+w_{N}x_{N}$$ (also called log odds)
    - $$w$$ are the weights
    - $$b$$ is the bias
    - $$x$$ are the feature values


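The function above is a special case of the general logistic function, with $$L=1$$, $$k=1$$ and $$x_{0}=0$$:

$$f(x)=\frac{L}{1+e^{-k\left(x-x_{0}\right)}}$$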
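
To make the prediction step concrete, the sketch below computes $$z$$ and $$\hat{y}$$ for a single sample. The
weights, bias and feature values are hypothetical numbers chosen for illustration, not values learned from this
dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(x, w, b):
    """Compute z = b + w_1*x_1 + ... + w_N*x_N."""
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

# hypothetical weights, bias and feature values, for illustration only
w = [0.8, -1.2]
b = 0.1
x = [0.5, 0.25]

z = log_odds(x, w, b)
y_hat = sigmoid(z)
print(f'z = {z:.2f}, y_hat = {y_hat:.3f}')
```

A positive $$z$$ gives a prediction above 0.5 and a negative $$z$$ a prediction below 0.5, with $$z=0$$ mapping to
exactly 0.5.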

The loss function for logistic regression is not simply the squared error, as it is for linear regression, because
this would result in a non-convex loss function with many local minima, making the global minimum hard to find.
Instead, we use **Log Loss**:

$$L(y, \hat{y})=\sum_{(x,y)\in{D}}-y\log{\hat{y}}-(1-y)\log{(1-\hat{y})}$$

where:
- $$(x,y)\in{D}$$ are the pairs of $$x$$ and $$y$$ values in the dataset $$D$$
- $$y$$ is the label (0 or 1 as this is a classification model)
- $$\hat{y}$$ is the predicted output of the logistic function above (some real number between 0 and 1)
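
As a small worked example, the sketch below evaluates the log loss formula above for two sets of made-up predictions
on the same three labels. It sums over samples, as the formula does (some libraries report the mean instead).

```python
import math

def log_loss(y_true, y_pred):
    """Sum of -y*log(y_hat) - (1-y)*log(1-y_hat) over all samples."""
    return sum(
        -y * math.log(p) - (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_pred)
    )

y_true = [1, 0, 1]

good_predictions = [0.9, 0.1, 0.8]  # confident and right
bad_predictions = [0.1, 0.9, 0.2]   # confident and wrong

print('loss (good predictions):', round(log_loss(y_true, good_predictions), 3))
print('loss (bad predictions) :', round(log_loss(y_true, bad_predictions), 3))
```

Confident but wrong predictions are penalised far more heavily than confident and correct ones.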

It's also important to understand the use of regularisation for logistic regression. Because the logistic function
only approaches 0 and 1 asymptotically, an unregularised model can keep reducing the loss towards 0 by growing its
weights without bound, particularly on datasets with high dimensionality. Regularisation penalises large weights and so
prevents this.

We will be using the default L<sub>2</sub> regularisation, however other options, such as Early Stopping or
L<sub>1</sub> regularisation, are available.

You can read more about L<sub>1</sub> and L<sub>2</sub> regularisation
[here](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c).
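
The effect of an L<sub>2</sub> penalty can be illustrated with a one-feature model on perfectly separable toy data.
This is a sketch under stated assumptions: the data, the penalty form `lam * w**2` and the value of `lam` are chosen
for illustration, and sklearn expresses the regularisation strength differently, through its `C` parameter (the
inverse of the strength).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, xs, ys):
    """Unregularised log loss of a one-feature model y_hat = sigmoid(w * x)."""
    return sum(
        -y * math.log(sigmoid(w * x)) - (1 - y) * math.log(1 - sigmoid(w * x))
        for x, y in zip(xs, ys)
    )

def l2_penalised_loss(w, xs, ys, lam):
    """Log loss plus an L2 penalty lam * w**2 on the weight."""
    return log_loss(w, xs, ys) + lam * w ** 2

# toy, perfectly separable data: negative x -> class 0, positive x -> class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
lam = 0.1  # hypothetical regularisation strength

for w in [1.0, 5.0, 10.0]:
    print(f'w={w:>4}: log loss={log_loss(w, xs, ys):.4f}, '
          f'penalised={l2_penalised_loss(w, xs, ys, lam):.4f}')
```

Without the penalty, a larger weight always gives a lower loss, so the optimiser has no reason to stop. With the
penalty, the largest weights become the most expensive.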

### Using Imbalanced Training Data

We will implement two logistic regression models, for the purpose of demonstrating the effect of balancing the dataset.
To begin, let's build and fit a model using the imbalanced training dataset and see how it performs.

In [13]:
# imbalanced dataset

# build model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# get predictions
y_hat_im = clf.predict(X_test)

# show classification report
print(classification_report(y_test, y_hat_im))
lr_im_report = classification_report(y_test, y_hat_im, output_dict=True)
lr_im_precision = lr_im_report['1']['precision']
lr_im_recall = lr_im_report['1']['recall']
lr_im_f1 = lr_im_report['1']['f1-score']

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.89      0.55      0.68        98

    accuracy                           1.00     56962
   macro avg       0.94      0.78      0.84     56962
weighted avg       1.00      1.00      1.00     56962



So we can see that the overall accuracy is high (~100%), which is expected due to the majority class bias. When we look
at the recall for the fraudulent class (0.55), however, we can see that the model is doing a poor job of detecting
fraudulent transactions: it misses almost half of them.

### Using Balanced Training Data

Next, let's try implementing the same logistic regression algorithm, but using the resampled training data.

In [14]:
# resampled dataset

# build model
clf = LogisticRegression(max_iter=500)  # increase iterations to achieve convergence
clf.fit(X_train_res, y_train_res)

# get predictions
y_hat_lr_res = clf.predict(X_test)

# show classification report
print(classification_report(y_test, y_hat_lr_res))
lr_res_report = classification_report(y_test, y_hat_lr_res, output_dict=True)
lr_res_precision = lr_res_report['1']['precision']
lr_res_recall = lr_res_report['1']['recall']
lr_res_f1 = lr_res_report['1']['f1-score']

              precision    recall  f1-score   support

           0       1.00      0.97      0.99     56864
           1       0.06      0.93      0.11        98

    accuracy                           0.97     56962
   macro avg       0.53      0.95      0.55     56962
weighted avg       1.00      0.97      0.99     56962



The recall has increased to 93%! That's good news for detecting the fraudulent transactions. However, the precision
has decreased dramatically, from 0.89 to 0.06.

This means that whilst the majority of the fraudulent transactions were correctly classified as fraudulent, there were
also a significant number of non-fraudulent transactions that were incorrectly classified as fraudulent. This, in turn,
has reduced the f1-score, because, as we remember, the f1-score is a balance between precision and recall.

Let's visualise this precision-recall tradeoff using the confusion matrix.

In [15]:
# get confusion matrix
conf_matrix = confusion_matrix(y_test, y_hat_lr_res, normalize='true')

# plot confusion matrix as heatmap
ticks = ['non-fraud', 'fraud']
fig = px.imshow(
    conf_matrix,
    labels = dict(x='Predicted', y='Actual', color='Clf %'),
    x=ticks,
    y=ticks,
    color_continuous_scale='Emrld'
)
fig.show()

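So, although the precision of this model, trained on the resampled dataset, is low, only a small proportion of all
non-fraud transactions (around 3%, given the non-fraud recall of 0.97) is misclassified as fraudulent. Because
non-fraud transactions vastly outnumber fraudulent ones, even this small proportion produces far more false positives
than true positives.

Whether this tradeoff is acceptable, or a better balance of precision and recall is required, depends on the context in
which the model is used.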
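
The arithmetic behind this can be sketched using the support values from the classification report above. The false
positive rate of 2.5% is an assumed round figure, chosen to be roughly consistent with the reported metrics; it is not
a value taken from the notebook's output.

```python
# support values from the classification report above
n_nonfraud, n_fraud = 56864, 98

recall = 0.93                # from the classification report
false_positive_rate = 0.025  # assumed round figure, NOT taken from the notebook

tp = round(recall * n_fraud)                  # fraud correctly flagged
fp = round(false_positive_rate * n_nonfraud)  # non-fraud incorrectly flagged

precision = tp / (tp + fp)
print(f'true positives  : {tp}')
print(f'false positives : {fp}')
print(f'precision       : {precision:.3f}')
```

Even a false positive rate of a few percent yields over a thousand false alarms against fewer than a hundred true
detections, which is enough to push the precision down to around 0.06.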

### Using Weighted Logistic Regression

Finally, let's try implementing one more logistic regression model for this prediction problem: weighted logistic
regression.

Instead of balancing the dataset, we can weight the classes in inverse proportion to their occurrence, so that
incorrect classifications of fraudulent transactions are penalised more than those of non-fraudulent transactions.

It turns out that ```class_weight``` is a hyperparameter which can be passed to the ```LogisticRegression()``` class.
The question now becomes: what weight should we ascribe to the majority and minority class?
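
One common reference point, which this notebook does not use, is the heuristic behind sklearn's
`class_weight='balanced'` option: each class is weighted by `n_samples / (n_classes * class_count)`. The sketch below
applies it to the class counts from Part 1. In practice the weights would be computed from the training labels only;
the full-dataset counts are used here purely for illustration.

```python
# class counts from Part 1
counts = {0: 284315, 1: 492}

n_samples = sum(counts.values())
n_classes = len(counts)

# the heuristic behind sklearn's class_weight='balanced':
# weight = n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}
print(balanced_weights)
```

Under this heuristic the fraudulent class receives a weight several hundred times larger than the non-fraudulent
class.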

Let's remind ourselves of the proportions of each class within our imbalanced dataset.

In [16]:
# show class proportions (index 0 = non-fraudulent, index 1 = fraudulent)
nonfraud_prop = (df['Class'].value_counts() / df.shape[0])[0]
fraud_prop = (df['Class'].value_counts() / df.shape[0])[1]
print('Fraudulent transactions     :', fraud_prop)
print('Non-fraudulent transactions :', nonfraud_prop)

Fraudulent transactions     : 0.001727485630620034
Non-fraudulent transactions : 0.9982725143693799


So let's start by weighting each class by the proportion of the other class, so that the rare fraudulent class receives
the larger weight.

In [17]:
w = {0: fraud_prop, 1: nonfraud_prop}

clf = LogisticRegression(random_state=101, class_weight=w)
clf.fit(X_train, y_train)
y_hat_lr_simple_weighted = clf.predict(X_test)
print(classification_report(y_test, y_hat_lr_simple_weighted))
lr_simple_weighted_report = classification_report(y_test, y_hat_lr_simple_weighted, output_dict=True)
lr_simple_weighted_precision = lr_simple_weighted_report['1']['precision']
lr_simple_weighted_recall = lr_simple_weighted_report['1']['recall']
lr_simple_weighted_f1 = lr_simple_weighted_report['1']['f1-score']

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.34      0.90      0.50        98

    accuracy                           1.00     56962
   macro avg       0.67      0.95      0.75     56962
weighted avg       1.00      1.00      1.00     56962



That's already worked pretty well! Our recall has increased to 0.90, although not by quite as much as with the
resampled dataset (0.93). However, the precision of this model (0.34) is much higher than that of the model trained on
the resampled dataset (0.06), resulting in a higher f1-score (0.50 versus 0.11).

Whilst this is a good start, how can we improve upon the weights used in order to maximise the f1-score?

We can use grid search to iterate over a range of hyperparameter values and return those which optimise a specific
metric. In this case, let's optimise for the f1 score.
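
Conceptually, grid search is an exhaustive loop: score every candidate and keep the best. The sketch below shows only
that loop. It is not `GridSearchCV`: there is no cross-validation and no model fitting, and it tunes a stand-in value
(a decision threshold on made-up predicted probabilities) rather than class weights, so that the example stays
self-contained. All names and data are hypothetical.

```python
def f1(y_true, y_pred):
    """F1 score for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def grid_search(candidates, score_fn):
    """Evaluate every candidate and return the best one with its score."""
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score

# toy labels and hypothetical predicted fraud probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

def score_threshold(threshold):
    """F1 score obtained when predicting fraud for proba >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    return f1(y_true, y_pred)

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
best_threshold, best_f1 = grid_search(thresholds, score_threshold)
print(f'best threshold: {best_threshold}, f1: {best_f1:.3f}')
```

`GridSearchCV` follows the same pattern, but scores each candidate by cross-validation on the training data.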

In [18]:
# set range for class weights
weights = np.linspace(0.0,0.99,200)

# create dictionary grid for grid search
hyperparam_grid = {'class_weight': [{0:x, 1:1.0-x} for x in weights]}

# define model
clf = LogisticRegression(random_state=101)

# define evaluation procedure
grid = GridSearchCV(
    clf,
    hyperparam_grid,  # hyperparameters to iterate over
    scoring = 'f1',   # metric to optimise for
    n_jobs = -1       # number of jobs to run in parallel (-1 = use all processors)
)

# get optimal weights
grid.fit(X_train, y_train)
print(f'Best score: {grid.best_score_} with param: {grid.best_params_}')

# get predictions and evaluate model
y_hat_lr_weighted = grid.predict(X_test)
print(classification_report(y_test, y_hat_lr_weighted))
lr_weighted_report = classification_report(y_test, y_hat_lr_weighted, output_dict=True)
lr_weighted_precision = lr_weighted_report['1']['precision']
lr_weighted_recall = lr_weighted_report['1']['recall']
lr_weighted_f1 = lr_weighted_report['1']['f1-score']

Best score: 0.770913118840252 with param: {'class_weight': {0: 0.029849246231155778, 1: 0.9701507537688442}}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.82      0.81      0.81        98

    accuracy                           1.00     56962
   macro avg       0.91      0.90      0.91     56962
weighted avg       1.00      1.00      1.00     56962



---

## Part 6 - Random Forest Models

Random forests are essentially an extension of the simple decision tree model. Instead of building a single decision
tree, random forest models build many decision trees as an ensemble, and then combine the outputs of the individual
trees to make a final prediction.
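
The simplest way of combining the trees' outputs is a majority vote, sketched below with made-up tree predictions.
This is an illustration of the idea only: sklearn's `RandomForestClassifier` averages the trees' predicted class
probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# hypothetical outputs of five decision trees for three transactions
per_tree_predictions = [
    [0, 0, 1, 0, 0],  # transaction A
    [1, 1, 0, 1, 1],  # transaction B
    [0, 1, 1, 0, 1],  # transaction C
]

forest_predictions = [majority_vote(votes) for votes in per_tree_predictions]
print(forest_predictions)
```

An odd number of trees is used here so that a two-class vote can never be tied.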

For imbalanced datasets, random forest (RF) models can present some advantages over other classification models.
Firstly, class weights can be incorporated directly into RF models, making the model "cost sensitive" to the minority
class. Secondly, RF models can be combined with sampling techniques, such as SMOTE, and with ensemble learning,
selectively "over-growing" decision trees for the minority class.

Let's first implement an RF model, using the SMOTE resampled dataset.

In [19]:
clf = RandomForestClassifier()

clf.fit(X_train_res, y_train_res)
y_hat_rf = clf.predict(X_test)

print(classification_report(y_test, y_hat_rf))
rf_report = classification_report(y_test, y_hat_rf, output_dict=True)
rf_precision = rf_report['1']['precision']
rf_recall = rf_report['1']['recall']
rf_f1 = rf_report['1']['f1-score']

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.91      0.87      0.89        98

    accuracy                           1.00     56962
   macro avg       0.96      0.93      0.94     56962
weighted avg       1.00      1.00      1.00     56962



Whilst the recall for this RF model (0.87) is not quite as high as the recall achieved by the simple weighted logistic
regression model (roughly 0.90), the precision, and thus the f1 score, are both considerably higher. Again, this is a
question of the context around the prediction problem. Whilst maximising the f1 score may be the best approach for this
project, if the project were instead focused on predicting cancer diagnoses, maximising the recall may be preferable.
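
One way to make the precision/recall preference explicit is the F-beta score, of which the f1 score is the special case
with beta equal to 1. The sketch below is only an illustration: the helper name `f_beta` is hypothetical, and the inputs
are the rounded Class 1 values from the random forest report above.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily, beta < 1 weights precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Class 1 precision and recall, rounded, taken from the random forest report above.
rf_p, rf_r = 0.91, 0.87

print('F1:', round(f_beta(rf_p, rf_r, beta=1), 3))  # balances precision and recall
print('F2:', round(f_beta(rf_p, rf_r, beta=2), 3))  # favours recall, e.g. for diagnosis tasks
```

Choosing beta greater than 1 is one way to encode a "recall matters more" context, such as the diagnosis example,
without abandoning precision entirely.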

However, for our task, this model does a great job of predicting fraudulent transactions, using only the SMOTE
oversampling method and default hyperparameters. This makes the approach more lightweight than the grid-searched
logistic regression approach.
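
The class-weighting route mentioned at the start of this part is not run in this notebook. As a minimal sketch, assuming
a synthetic imbalanced dataset rather than the credit card data, it could look like the following. The variable names
(`X_demo`, `clf_weighted`, and so on) are hypothetical, and the resulting scores say nothing about the fraud dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (a few percent positives); a stand-in for illustration only.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 'balanced_subsample' weights classes inversely to their frequency,
# computed on the bootstrap sample used to grow each tree.
clf_weighted = RandomForestClassifier(
    n_estimators=100, class_weight='balanced_subsample', random_state=0
)
clf_weighted.fit(X_tr, y_tr)
y_hat_demo = clf_weighted.predict(X_te)

print(classification_report(y_te, y_hat_demo, zero_division=0))
```

No resampling step is needed with this variant, since the imbalance is handled inside the model.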

For now, the random forest model looks like the best predictor, but can we achieve an even higher f1-score by using an
artificial neural network?

---

## Part 7 - Artificial Neural Networks

An advantage that neural networks offer, over more restrictive machine learning algorithms, is the size of their
hypothesis space. This means that, using an artificial neural network (ANN), we may be able to find a prediction
function which better approximates the Bayes prediction function (the best possible prediction function).

However, a downside of many ANNs is their interpretability, which is often worse than that of other, "simpler" models.
If our goal is simply catching as many cases of fraud as possible, while minimising the workload associated with
reviewing non-fraudulent transactions which have been mis-classified as fraudulent, then an ANN may be just what we're
looking for.

### Simple ANN

For this project, we'll be using the keras API, from the tensorflow package, to build our ANN. Let's start by creating
a sequential model, which is just a linear stack of layers, where each layer has exactly one input tensor and one
output tensor.

In [20]:
num_input_cols = X.shape[1]

# create ANN model layout
model = keras.Sequential([
    keras.layers.Dense(  # hidden layer 1
        25,  # number of neurons in hidden layer
        input_shape = (num_input_cols,),  # shape of input layer
        activation = 'relu'  # relu is cheap to compute, so we use it for the hidden layers
    ),
    keras.layers.Dense(  # hidden layer 2
        20,
        activation = 'relu'
    ),
    keras.layers.Dense(  # output layer
        1,
        activation = 'sigmoid'  # want an output between 0 and 1
    )
])

# specify model hyperparameters
model.compile(
    optimizer = 'adam',  # common optimizer
    loss = 'binary_crossentropy',  # because output is binary
    metrics = ['accuracy']
)

# train model
model.fit(X_train_res, y_train_res, epochs=3, batch_size=64)  # test first using 3 epochs (full passes over the data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1fecb5f0940>

The accuracy seemed relatively high, even after just 3 epochs. Of course, we know accuracy is not always the best
indicator of model performance. However, given that our training data is balanced, this is encouraging enough to fit
the model for a higher number of epochs.

This will take some time, even while using a GPU.

In [21]:
model.fit(X_train_res, y_train_res, epochs=20, batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x1feda114670>

Now that we have a model which is looking good, we can evaluate the model on our test data.

In [22]:
model.evaluate(X_test, y_test)



[0.05374006927013397, 0.9804431200027466]

Finally, let's use the model to get our predictions for the test data.

In [23]:
y_hat = model.predict(X_test)
print(y_hat[:10])

[[1.7648445e-05]
 [7.1615591e-10]
 [1.5705483e-06]
 [2.4021651e-07]
 [4.4155535e-05]
 [1.7251563e-06]
 [2.9857394e-11]
 [1.0038499e-09]
 [4.6098005e-08]
 [1.1782865e-03]]


As the final layer of our ANN model uses a sigmoid function, the output (predictions) will be some real number between
0 and 1. Of course, our actual outputs should be discrete, either 0 or 1. What the model is outputting can be thought of
as a probability of a sample having *Class* 1.

So, let's round each output in the predictions to either 0 or 1 to get the binary classifications.

In [24]:
# threshold the sigmoid outputs at 0.5 to get a flat list of binary class labels
y_hat_ann = (y_hat.ravel() >= 0.5).astype(int).tolist()

print(y_hat_ann[:10])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
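
The 0.5 cut-off is a choice rather than a requirement, and moving it trades precision against recall. The sketch below
illustrates this on a handful of made-up scores and labels (not the notebook's predictions); the helper name
`precision_recall_at` is hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, for illustration only.
scores = np.array([0.02, 0.10, 0.35, 0.48, 0.55, 0.62, 0.80, 0.91, 0.97, 0.99])
labels = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 1])

def precision_recall_at(threshold, scores, labels):
    """Binarise scores at `threshold` and return (precision, recall) for class 1."""
    preds = (scores >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f'threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}')
```

On real model outputs, the same sweep could be run over the predicted probabilities to pick a threshold that suits the
cost of false positives versus false negatives.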


Now we have our predictions in the correct format, let's take a look at how well our ANN model did at detecting
fraudulent transactions.

In [25]:
print(classification_report(y_test, y_hat_ann))
ann_report = classification_report(y_test, y_hat_ann, output_dict=True)
ann_precision = ann_report['1']['precision']
ann_recall = ann_report['1']['recall']
ann_f1 = ann_report['1']['f1-score']

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56864
           1       0.08      0.92      0.14        98

    accuracy                           0.98     56962
   macro avg       0.54      0.95      0.56     56962
weighted avg       1.00      0.98      0.99     56962



We managed to get the recall quite high but, like the logistic regression model trained on the resampled data, this ANN
model has low precision, and thus a low f1 score.
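
To see why high recall and high accuracy can coexist with such low precision, it helps to work through the arithmetic.
The counts below are an assumption: they are chosen to be consistent with the rounded report above and the class
supports (98 and 56864), and are not output from the notebook itself.

```python
# Assumed confusion-matrix counts, chosen to be consistent with the report above;
# they are illustrative and not taken from the notebook's output.
tp, fn = 90, 8        # fraudulent transactions caught / missed
fp, tn = 1106, 55758  # legitimate transactions flagged / passed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}')
```

Under these assumed counts, the false positives are a small fraction of the legitimate transactions, so accuracy stays
high, yet they outnumber the true positives by more than ten to one, which is what drags the precision down.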

### Weighted ANN

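Let's try using the same model, but this time we'll fit the model to the imbalanced training data, and instead pass in
class weights. Note that calling `fit` again on the same `model` object continues training from the weights already
learned on the resampled data, rather than starting from a freshly initialised network.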

In [26]:
weights = {0:1, 1:99}

model.fit(X_train, y_train, epochs=20, batch_size=32, class_weight=weights)

y_hat = model.predict(X_test)

# threshold the sigmoid outputs at 0.5 to get a flat list of binary class labels
y_hat_ann_weighted = (y_hat.ravel() >= 0.5).astype(int).tolist()

print(classification_report(y_test, y_hat_ann_weighted))
ann_weighted_report = classification_report(y_test, y_hat_ann_weighted, output_dict=True)
ann_weighted_precision = ann_weighted_report['1']['precision']
ann_weighted_recall = ann_weighted_report['1']['recall']
ann_weighted_f1 = ann_weighted_report['1']['f1-score']

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56864
           1       0.08      0.92      0.15        98

    accuracy                           0.98     56962
   macro avg       0.54      0.95      0.57     56962
weighted avg       1.00      0.98      0.99     56962



Using class weights for our ANN model slightly improved the precision (from roughly 0.075 to 0.081) and the f1 score
(from 0.14 to 0.15), while the recall stayed at 0.92 to two decimal places. However, the precision is still incredibly
low. We should note, however, that we've used really simple class weights, and a low number of epochs (an epoch being
one full pass of the training data through feedforward and backpropagation).
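
A less ad hoc starting point for class weights is to derive them from the label counts. The sketch below uses the same
"balanced" heuristic that scikit-learn documents for `class_weight='balanced'`, namely
`n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the label vector are hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {class: n_samples / (n_classes * class_count)} for integer labels."""
    classes, counts = np.unique(y, return_counts=True)
    n_samples, n_classes = len(y), len(classes)
    return {int(c): n_samples / (n_classes * n) for c, n in zip(classes, counts)}

# Hypothetical label vector with 1% positives, for illustration only.
y_demo = np.array([0] * 990 + [1] * 10)
weights_demo = balanced_class_weights(y_demo)
print(weights_demo)
```

The resulting dictionary has the same shape as the `weights` dictionary used above, so it could be passed to the
`class_weight` argument of `model.fit` in the same way.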

The ANN model we built is also pretty simple, with only a few layers. If we wanted to move forward with the ANN
model, we may choose to optimise the hyperparameters of the model (e.g. number of layers, activation functions, nodes in
layers), using a similar method to the grid search method we used earlier in this notebook.

However, for the purposes of this project, we'll leave the ANN development here, as the random forest method resulted in
an acceptable f1 score.

---

## Part 8 - Model Performance Comparison

Let's now view the performance metrics of the models we've tested in this notebook visually, to get an idea of the
tradeoffs of each model.

First, we'll add all the Class 1 metrics from each model to a dictionary.

In [27]:
metrics = {
    'Imbalanced LR': {
        'precision': lr_im_precision,
        'recall': lr_im_recall,
        'f1': lr_im_f1
    },
    'Resampled LR': {
        'precision': lr_res_precision,
        'recall': lr_res_recall,
        'f1': lr_res_f1
    },
    'Simple Weighted LR': {
        'precision': lr_simple_weighted_precision,
        'recall': lr_simple_weighted_recall,
        'f1': lr_simple_weighted_f1
    },
    'Weighted LR': {
        'precision': lr_weighted_precision,
        'recall': lr_weighted_recall,
        'f1': lr_weighted_f1
    },
    'Resampled RF': {
        'precision': rf_precision,
        'recall': rf_recall,
        'f1': rf_f1
    },
    'ANN': {
        'precision': ann_precision,
        'recall': ann_recall,
        'f1': ann_f1
    },
    'Weighted ANN': {
        'precision': ann_weighted_precision,
        'recall': ann_weighted_recall,
        'f1': ann_weighted_f1
    }
}

Next, let's convert this dictionary to a Pandas DataFrame object, which we'll use to build the figure.

In [28]:
print('Model Class 1 Metrics:')
df_metrics = pd.DataFrame.from_dict(metrics).T.reset_index(drop=False).rename(columns={'index': 'model'})
df_metrics = pd.melt(
    df_metrics,
    id_vars=['model'],
    value_vars=['precision', 'recall', 'f1'],
    var_name='metric',
    value_name='score'
)
display(df_metrics)

Model Class 1 Metrics:


|   | model | metric | score |
|---|---|---|---|
| 0 | Imbalanced LR | precision | 0.885246 |
| 1 | Resampled LR | precision | 0.058371 |
| 2 | Simple Weighted LR | precision | 0.342412 |
| 3 | Weighted LR | precision | 0.822917 |
| 4 | Resampled RF | precision | 0.913978 |
| 5 | ANN | precision | 0.075251 |
| 6 | Weighted ANN | precision | 0.080501 |
| 7 | Imbalanced LR | recall | 0.55102 |
| 8 | Resampled LR | recall | 0.928571 |
| 9 | Simple Weighted LR | recall | 0.897959 |


Now, let's create a grouped bar chart to view the metrics for each model.

In [29]:
fig = px.bar(
    df_metrics,
    x='model',
    y='score',
    color='metric',
    barmode='group',
    labels=dict(model='Classification Model', score='Metric Score', metric='Metric')
)

fig.update_layout(
    title='Classification Model Performance for Class 1 (Fraudulent Transactions)'
)

fig.show()

Now we have an easy-to-digest view of the various methods we tried out within this notebook. We can easily see that we
achieved the highest f1 score using the Random Forest classification model.
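
The same comparison can be made without the figure. As a rough sketch, the snippet below picks the best model per
metric from Class 1 precision and recall values rounded from the outputs shown in this notebook; the dictionary
`demo_scores` and the helper `f1` are hypothetical names, and the rounding means the f1 values are approximate.

```python
# Class 1 (precision, recall) per model, rounded from the outputs shown in this notebook.
demo_scores = {
    'Imbalanced LR': (0.885, 0.551),
    'Resampled LR': (0.058, 0.929),
    'Simple Weighted LR': (0.342, 0.898),
    'Weighted LR': (0.823, 0.81),
    'Resampled RF': (0.914, 0.87),
    'ANN': (0.075, 0.92),
    'Weighted ANN': (0.081, 0.92),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

best_precision = max(demo_scores, key=lambda m: demo_scores[m][0])
best_recall = max(demo_scores, key=lambda m: demo_scores[m][1])
best_f1 = max(demo_scores, key=lambda m: f1(*demo_scores[m]))

print('best precision:', best_precision)
print('best recall   :', best_recall)
print('best f1       :', best_f1)
```

Looking at the winner per metric, rather than f1 alone, keeps the tradeoff visible: the model with the best recall is
not the one with the best precision.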

Again, it should be noted that the implementation of an ANN here was very simple. By expanding upon the ANN method, we
may be able to achieve a higher f1-score than the Random Forest model. However, the ANN also took significantly longer
to build and train than the Random Forest model, and the ANN model may be less interpretable than the Random Forest
(I may expand on the interpretability of the models discussed here at a later time). These are both considerations to
be mindful of when selecting a model for a task.

---

## Part 9 - Conclusion

Overall, this notebook should have provided a good introduction to classification metrics, and why accuracy can't always
be trusted... especially for imbalanced datasets! We've looked at a variety of models, and methods within those models,
and explored some of the ways we can evaluate and compare these models' performances.

The models used here are not an exhaustive list, however! Other models, such as Support Vector Machines or XGBoost
methods, may result in even better performance (I may return to this project in the future and expand the model
comparison to include such models).

Additionally, only one form of oversampling was tested in this notebook: the generic SMOTE method. Experimenting with
other oversampling methods, even other SMOTE methods, may prove beneficial.
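
For readers wondering what varies between SMOTE methods, the core idea is interpolation between a minority sample and
one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that idea in NumPy, not
the imblearn implementation; the function name `smote_like_samples` and the toy data are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (a simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # random minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical 2-D minority class, for illustration only.
X_min_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_samples(X_min_demo, n_new=10, k=3)
print(X_new.shape)
```

Variants of SMOTE mostly differ in which minority samples are chosen as the starting points, for example by favouring
samples near the class boundary.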

---