# Credit Fraud Detection by Auto-Encoder ,Neural Networks & ML Frameworks
**Contents**

- <a href='#2'>2. Python Library Upload & Visualization Analysis</a>  
- <a href='#3'>3. Scaling and Distributing</a>  
    - <a href='#3.1'>3.1. Splitting of the Original DataFrame</a> 
    - <a href='#3.2'>3.2. Random UnderSampling</a> 
    - <a href='#3.3'>3.3. Equally Distributing and Correlating:</a> 
    - <a href='#3.4'>3.4. Correlation Matrices</a>
    - <a href='#3.5'>3.5. Anomaly Detection</a>
- <a href='#4'>4. Machine learning Modelling of Random Sampling</a>  
    - <a href='#4.1'>4.1. Model Training</a> 
    - <a href='#4.2'>4.2. GridSearchCV to find the best parameters</a> 
    - <a href='#4.3'>4.3. Plotting of Learning Curve</a> 
    - <a href='#4.4'>4.4. ROC Curve</a>
    - <a href='#4.5'>4.5. Investigation of Logistic Regression</a>    
    - <a href='#4.6'>4.6. Precision -Recall Analysis</a> 
- <a href='#5'>5.  Over-Sampling</a>  
    - <a href='#5.1'>5.1. SMOTE Technique</a> 
    - <a href='#5.2'>5.2. Precision-recall Curve</a> 
    - <a href='#5.3'>5.3. Evaluation of Test Data with Logistic Regression</a>        
- <a href='#6'>6.  Artificial Neural Networks</a>  
    - <a href='#6.1'>6.1. Keras ~ Random UnderSampling</a> 
    - <a href='#6.2'>6.2. Keras ~ OverSampling [SMOTE]</a> 
    - <a href='#6.3'>6.3. Conclusion</a> 
- <a href='#7'>7.  AutoEncoder Model Prediction Architecture</a>  
    - <a href='#7.1'>7.1. Dataset Preparation</a> 
    - <a href='#7.2'>7.2. Visualize Fraud and NonFraud Transactions using T-SNE</a> 
    - <a href='#7.3'>7.3. AutoEncoders</a> 
    - <a href='#7.4'>7.4. Obtain the Latent Representations</a> 
    - <a href='#7.5'>7.5. Visualize T-SNE the latent representations : Fraud Vs Non Fraud</a> 

**Introduction**
In this NoteBook we will use various predictive models to see how accurate they  are in detecting whether a transaction is a normal payment or a fraud. As described in the dataset, the features are scaled and the names of the features are not shown due to privacy reasons. Nevertheless, we can still analyze some important aspects of the dataset. 


<h2> Our Goals: </h2>
<ul>
<li> Understand the little distribution of the "little" data that was provided to us. </li>
<li> Create a 50/50 sub-dataframe ratio of "Fraud" and "Non-Fraud" transactions. (NearMiss Algorithm) </li>
<li> Determine the Classifiers we are going to use and decide which one has a higher accuracy. </li>
<li>Create a Neural Network and compare the accuracy to our best classifier. </li>
<li>Understand common mistaked made with imbalanced datasets. </li>
</ul>

   
  
<h2>Correcting Previous Mistakes from Imbalanced Datasets: </h2>
<ul>
<li> Never test on the oversampled or undersampled dataset.</li>
<li>If we want to implement cross validation, remember to oversample or undersample your training data <b>during</b> cross-validation, not before! </li>
<li> Don't use <b>accuracy score </b> as a metric with imbalanced datasets (will be usually high and misleading), instead use <b>f1-score, precision/recall score or confusion matrix </b></li>
</ul>


It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

**Content**
The datasets contains transactions made by credit cards in September 2013 by european cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

**DATASET LINK:::::::** https://www.kaggle.com/aniruddhachoudhury/creditcard-fraud-detection

# Python Library Upload & Visualization Analysis

In [4]:
# Imported Libraries

import numpy as np 
import pandas as pd 
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
import time

# Classifier Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import collections


# Other Libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings("ignore");
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff




  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.


In [5]:
Data_Credit = pd.read_csv('creditcard.csv')
Data_Credit.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [6]:
Data_Credit.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.16598e-15,3.416908e-16,-1.37315e-15,2.086869e-15,9.604066e-16,1.490107e-15,-5.556467e-16,1.177556e-16,-2.406455e-15,...,1.656562e-16,-3.44485e-16,2.578648e-16,4.471968e-15,5.340915e-16,1.687098e-15,-3.666453e-16,-1.220404e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [7]:
# Null Values!
Data_Credit.isnull().sum().max()

0

In [8]:
Data_Credit.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [6]:
# The classes are heavily skewed we need to solve this issue lat
print('No Frauds', round(Data_Credit['Class'].value_counts()[0]/len(Data_Credit) * 100,2), '% of the dataset')
print('Frauds', round(Data_Credit['Class'].value_counts()[1]/len(Data_Credit) * 100,2), '% of the dataset')

No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset


In [7]:
# 2 datasets
No_Frauds= Data_Credit[(Data_Credit['Class'] == 0)]
print(len(No_Frauds))
Frauds = Data_Credit[(Data_Credit['Class'] == 1)]
print(len(Frauds))

284315
492


In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=2, cols=1,shared_xaxes=True, vertical_spacing=0.02,
    subplot_titles=("Fraud","Non-Fraud"))


fig.add_trace(go.Scatter(x=Data_Credit.Time[Data_Credit.Class == 1], y=Data_Credit.Amount[Data_Credit.Class == 1],mode='markers'),
                 row=1, col=1)


fig.add_trace(go.Scatter(x=Data_Credit.Time[Data_Credit.Class == 0],y= Data_Credit.Amount[Data_Credit.Class == 0],mode='markers'),
                 row=2, col=1)
fig.update_xaxes(title_text="Time" )

fig.update_yaxes(title_text="Amount")

fig.update_layout(showlegend=False, title_text="Specs with Subplot Title")
fig.show()

In [8]:
#------------COUNT-----------------------
trace = go.Bar(x = (len(No_Frauds), len(Frauds)), y = ['No_Frauds', 'Frauds'], orientation = 'h', opacity = 0.8, marker=dict(
        color=[ 'green', 'darkviolet'],
        line=dict(color='#000000',width=1.5)))

layout = dict(title =  'Count of diagnosis variable')
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

#------------PERCENTAGE-------------------
trace = go.Pie(labels = ['No_Frauds', 'Frauds'], values = Data_Credit['Class'].value_counts(), 
               textfont=dict(size=15), opacity = 0.8,
               marker=dict(colors=['darkviolet', 'gold'], 
                           line=dict(color='#000000', width=1.5)))


layout = dict(title =  'Distribution of diagnosis variable')
           
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

**Note:**  Notice how imbalanced is our original dataset! Most of the transactions are non-fraud. If we use this dataframe as the base for our predictive models and analysis we might get a lot of errors and our algorithms will probably overfit since it will "assume" that most transactions are not fraud. But we don't want our model to assume, we want our model to detect patterns that give signs of fraud!

**Distributions:** By seeing the distributions we can have an idea how skewed are these features, we can also see further distributions of the other features. There are techniques that can help the distributions be less skewed which will be implemented in this notebook in the future.

In [9]:
#Select only the anonymized features.
v_features = Data_Credit.ix[:,1:29].columns

In [None]:
import plotly.figure_factory as ff
import numpy as np


for i, cn in enumerate(Data_Credit[v_features[:1]]):
        group_labels = ['Fraud', 'Non-Fraud']

        colors = ['slategray', 'magenta']
        x1 = Data_Credit[cn][Data_Credit.Class == 1]
        x2 = Data_Credit[cn][Data_Credit.Class == 0]
# Create distplot with curve_type set to 'normal'
        fig = ff.create_distplot([x1, x2], group_labels, bin_size=.5,
                         curve_type='normal', # override default 'kde'
                         colors=colors)

# Add title
        fig.update_layout(title_text='Distplot with Normal Distributio n'+str(cn))
        fig.show()

# Scaling and Distributing 



Time and amount should be scaled as the other columns. On the other hand, we need to also create a sub sample of the dataframe in order to have an equal amount of Fraud and Non-Fraud cases, helping our algorithms better understand patterns that determines whether a transaction is a fraud or not.

**What is a sub-Sample?**

In this scenario, our subsample will be a dataframe with a 50/50 ratio of fraud and non-fraud transactions. Meaning our sub-sample will have the same amount of fraud and non fraud transactions.

**Why do we create a sub-Sample?**

In the beginning of this notebook we saw that the original dataframe was heavily imbalanced! Using the original dataframe  will cause the following issues:


- **Overfitting:** Our classification models will assume that in most cases there are no frauds! What we want for our model is to be certain when a fraud occurs.
- **Wrong Correlations:**  Although we don't know what the "V" features stand for, it will be useful to understand how each of this features influence the result (Fraud or No Fraud) by having an imbalance dataframe we are not able to see the true correlations between the class and features. 

**Summary:** 
- Scaled amount  and scaled time  are the columns with scaled values.
- There are 492 cases of fraud in our dataset so we can randomly get 492 cases of non-fraud to create our new sub dataframe. 
- We concat the 492 cases of fraud and non fraud, creating a new sub-sample.


In [9]:
Data_Operation=Data_Credit.copy()
Data_Operation.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [10]:
# Since most of our data has already been scaled we should scale the columns that are left to scale (Amount and Time)
from sklearn.preprocessing import StandardScaler, RobustScaler

# RobustScaler is less prone to outliers.
std_scaler = StandardScaler()
rob_scaler = RobustScaler()

Data_Operation['scaled_amount'] = rob_scaler.fit_transform(Data_Operation['Amount'].values.reshape(-1,1))
Data_Operation['scaled_time'] = rob_scaler.fit_transform(Data_Operation['Time'].values.reshape(-1,1))

Data_Operation.drop(['Time','Amount'], axis=1, inplace=True)

In [11]:
scaled_amount = Data_Operation['scaled_amount']
scaled_time = Data_Operation['scaled_time']

Data_Operation.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)
Data_Operation.insert(0, 'scaled_amount', scaled_amount)
Data_Operation.insert(1, 'scaled_time', scaled_time)

# Amount and Time are Scaled

Data_Operation.head()

Unnamed: 0,scaled_amount,scaled_time,V1,V2,V3,V4,V5,V6,V7,V8,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
0,1.783274,-0.994983,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0
1,-0.269825,-0.994983,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0
2,4.983721,-0.994972,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0
3,1.418291,-0.994972,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0
4,0.670579,-0.99496,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0


## Splitting of the Original DataFrame
<a id="splitting"></a>
Before proceeding with the <b> Random UnderSampling technique</b> we have to separate the orginal dataframe. 

**Why?**

- for testing purposes, remember although we are splitting the data when implementing Random UnderSampling or OverSampling techniques, we want to test our models on the original testing set not on the testing set created by either of these techniques.
- The main goal is to fit the model either with the dataframes that were undersample and oversample (in order for our models to detect the patterns), and test it on the original testing set.  

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
# We already have X_train and y_train for undersample data thats why I am using original to distinguish and to not overwrite these variables.
# original_Xtrain, original_Xtest, original_ytrain, original_ytest = train_test_split(X, y, test_size=0.2, random_state=42)

print('No Frauds', round(Data_Operation['Class'].value_counts()[0]/len(Data_Operation) * 100,2), '% of the dataset')
print('Frauds', round(Data_Operation['Class'].value_counts()[1]/len(Data_Operation) * 100,2), '% of the dataset')

X = Data_Operation.drop('Class', axis=1)
y = Data_Operation['Class']

No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset


In [13]:
Strats = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

for train_index, test_index in Strats.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
    original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]

Train: [ 30473  30496  31002 ... 284804 284805 284806] Test: [    0     1     2 ... 57017 57018 57019]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 30473  30496  31002 ... 113964 113965 113966]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 81609  82400  83053 ... 170946 170947 170948]
Train: [     0      1      2 ... 284804 284805 284806] Test: [150654 150660 150661 ... 227866 227867 227868]
Train: [     0      1      2 ... 227866 227867 227868] Test: [212516 212644 213092 ... 284804 284805 284806]


In [14]:
# Turn into an array
original_Xtrain = original_Xtrain.values
original_Xtest = original_Xtest.values
original_ytrain = original_ytrain.values
original_ytest = original_ytest.values

In [15]:
# See if both the train and test label distribution are similarly distributed
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)
print('-' * 100)

print('Label Distributions: \n')
print(train_counts_label/ len(original_ytrain))
print(test_counts_label/ len(original_ytest))

----------------------------------------------------------------------------------------------------
Label Distributions: 

[0.99827076 0.00172924]
[0.99827952 0.00172048]


## Random Under-Sampling:
![](https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png)

**Steps:**
<ul>
<li>The first thing we have to do is determine how <b>imbalanced</b> is our class (use "value_counts()" on the class column to determine the amount for each label)  </li>
<li>Once we determine how many instances are considered <b>fraud transactions </b> (Fraud = "1") , we should bring the <b>non-fraud transactions</b> to the same amount as fraud transactions (assuming we want a 50/50 ratio), this will be equivalent to 492 cases of fraud and 492 cases of non-fraud transactions.  </li>
<li> After implementing this technique, we have a sub-sample of our dataframe with a 50/50 ratio with regards to our classes. Then the next step we will implement is to <b>shuffle the data</b> to see if our models can maintain a certain accuracy everytime we run this script.</li>
</ul>

**Note:** The main issue with "Random Under-Sampling" is that we run the risk that our classification models will not perform as accurate as we would like to since there is a great deal of <b>information loss</b> (bringing 492 non-fraud transaction  from 284,315 non-fraud transaction)

In [16]:
#our classes are highly skewed we should make them equivalent in order to have a normal distribution of the classes.

# Lets shuffle the data before creating the subsamples
Data_Operation = Data_Operation.sample(frac=1)

# amount of fraud classes 492 rows.
fraud_Data_Operation = Data_Operation.loc[Data_Operation['Class'] == 1]
non_fraud_Data_Operation = Data_Operation.loc[Data_Operation['Class'] == 0][:492]


In [17]:
normal_dist_op = pd.concat([fraud_Data_Operation, non_fraud_Data_Operation])

# Shuffle dataframe rows
new_operation = normal_dist_op.sample(frac=1, random_state=42)

new_operation.head()

Unnamed: 0,scaled_amount,scaled_time,V1,V2,V3,V4,V5,V6,V7,V8,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
56543,4.373646,-0.4372,0.630768,-1.658161,0.452849,-0.977678,-1.769746,-0.82686,-0.30122,-0.081957,...,0.522553,0.301589,0.332136,-0.383315,0.460005,0.327107,0.082217,-0.024285,0.078791,0
43624,0.243834,-0.506467,-1.048005,1.300219,-0.180401,2.589843,-1.164794,0.031823,-2.175778,0.699072,...,0.644993,0.549014,0.624321,-0.136663,0.131738,0.030921,-0.176701,0.504898,0.069882,1
62247,-0.194508,-0.405256,-1.176287,2.032277,1.335549,2.706513,-0.411466,-0.272458,0.180319,0.493822,...,0.385641,-0.220284,-0.566249,0.085387,0.323284,-0.297018,-0.027016,0.466141,0.264072,0
156990,2.29344,0.289078,-1.000611,3.34685,-5.534491,6.835802,-0.299803,0.095951,-2.440419,1.286301,...,1.189814,0.439757,-0.694099,0.29966,-0.657601,0.101648,0.430457,0.824685,0.326952,1
6446,-0.29344,-0.904851,0.70271,2.426433,-5.234513,4.416661,-2.170806,-2.667554,-3.878088,0.911337,...,0.422743,0.55118,-0.009802,0.721698,0.473246,-1.959304,0.319476,0.600485,0.129305,1


##  Equally Distributing and Correlating: 
<a id="correlating"></a>
Now that we have our dataframe correctly balanced, we can go further with our <b>analysis</b> and <b>data preprocessing</b>.

In [18]:
print('Distribution of the Classes in the subsample dataset')
print(new_operation['Class'].value_counts()/len(new_operation))
#------------COUNT-----------------------
No_Frauds_data= new_operation[(new_operation['Class'] == 0)]
Frauds_data = new_operation[(new_operation['Class'] == 1)]

trace = go.Bar(x = (len(No_Frauds_data), len(Frauds_data)), y = ['No_Frauds_data', 'Frauds_data'], orientation = 'h', opacity = 0.8, marker=dict(
        color=[ 'green', 'darkviolet'],
        line=dict(color='#000000',width=1.5)))

layout = dict(title =  'Count of diagnosis variable')
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)


Distribution of the Classes in the subsample dataset
1    0.5
0    0.5
Name: Class, dtype: float64


## Correlation Matrices 
Correlation matrices are the essence of understanding our data. We want to know if there are features that influence heavily in whether a specific transaction is a fraud. However, it is important that we use the correct dataframe (subsample)  in order for us to see which features have a high positive or negative correlation with regards to fraud transactions.

**Summary and Explanation:**

- Negative Correlations:  V14, V12, V17 and V10 are negatively correlated, the lower these values are, the more likely the end result will be a fraud transaction.

- Positive Correlations: V2, V4, V11, and V19 are positively correlated, the higher these values are, the more likely the end result will be a fraud transaction.

- BoxPlots:We will use box-plots to have a better understanding of the distribution of these features in fraud case and non fraud transactions.


**Note: ** We have to make sure we use the subsample in our correlation matrix or else our correlation matrix will be affected by the high imbalance between our classes. This occurs due to the high class imbalance in the original dataframe.

In [25]:
def correlation_plotting(data,s):    
    correlation = data.corr()
    matrix_cols = correlation.columns.tolist()
    corr_array  = np.array(correlation)
    #Plotting

    trace = go.Heatmap(z = corr_array,
                       x = matrix_cols,
                       y = matrix_cols,
                       xgap = 2,
                       ygap = 2,
                       colorscale='Viridis',
                       colorbar   = dict() ,
                      )
    layout = go.Layout(dict(title = 'Correlation Matrix for variables  ' +s,
                            autosize = False,
                            height  = 720,
                            width   = 800,
                            margin  = dict(r = 0 ,l = 210,
                                           t = 25,b = 210,
                                         ),
                            yaxis   = dict(tickfont = dict(size = 9)),
                            xaxis   = dict(tickfont = dict(size = 9)),
                           )
                      )
    fig = go.Figure(data = [trace],layout = layout)
    fig.update_layout( 
                    title={
                        'y':1,
                        'x':0.6,
                        'xanchor': 'center',
                        'yanchor': 'top'})
    
    py.iplot(fig)

In [26]:
correlation_plotting(Data_Credit,"Imbalanced Correlation Matrix")
correlation_plotting(new_operation,"SubSample Correlation Matrix")

In [28]:
import plotly.express as px

def box_plot(s,a):
    fig = px.box(new_operation, x="Class", y=s, points="all",hover_data=['Class'],notched=True,)
    fig.update_layout(title_text=a+s)
    fig.show()

In [30]:
lis=['V17','V10','V14','V10']
for i in lis:
    box_plot(i,'Box Plot Styling for Class Negative Correlation & ' )

In [24]:
lis=['V4','V11','V2','V19']
for i in lis:
    box_plot(i,'Box Plot Styling for Class Positive Correlation & ' )

## Anomaly Detection:
<a id="anomaly"></a>
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSlDPHWnWEMWvX_q_8K8UHv-F507cSi-Y2DMmmE1i2tlZ2rgZtM">


Our main goal in this section is to remove "extreme outliers" from features that have a high correlation with our classes. This will have a positive impact on the accuracy of our models.  <br><br>


**Interquartile Range Method:**

- Interquartile Range (IQR): We calculate this by the difference between the 75th percentile and 25th percentile. Our aim is to create a threshold beyond the 75th and 25th percentile that in case some instance pass this threshold the instance will be deleted. 
- Boxplots: Besides easily seeing the 25th and 75th percentiles (both end of the squares) it is also easy to see extreme outliers (points beyond the lower and higher extreme).

**Outlier Removal Tradeoff:**
We have to be careful as to how far do we want the threshold for removing outliers. We determine the threshold by multiplying a number (ex: 1.5) by the (Interquartile Range). The higher this threshold is, the less outliers will detect (multiplying by a higher number ex: 3), and the lower this threshold is the more outliers it will detect.  <br><br>

**The Tradeoff:**
The lower the threshold the more outliers it will remove however, we want to focus more on "extreme outliers" rather than just outliers. Why? because we might run the risk of information loss which will cause our models to have a lower accuracy. You can play with this threshold and see how it affects the accuracy of our classification models.


**Summary:**
<ul>
<li> <b> Visualize Distributions: </b> We first start by visualizing the distribution of the feature we are going to use to eliminate some of the outliers. V14 is the only feature that has a Gaussian distribution compared to features V12 and V10. </li>
<li><b>Determining the threshold: </b> After we decide which number we will use to multiply with the iqr (the lower more outliers removed), we will proceed in determining the upper and lower thresholds by substrating q25 - threshold (lower extreme threshold) and adding q75 + threshold (upper extreme threshold). </li>
<li> <b>Conditional Dropping: </b> Lastly, we create a conditional dropping stating that if the "threshold" is exceeded in both extremes, the instances will be removed. </li>
<li> <b> Boxplot Representation: </b> Visualize through the boxplot that the number of "extreme outliers" have been reduced to a considerable amount. </li>
</ul>

**Note:** After implementing outlier reduction our accuracy has been improved by over 3%! Some outliers can distort the accuracy of our models but remember, we have to avoid an extreme amount of information loss or else our model runs the risk of underfitting.


**Reference**: More information on Interquartile Range Method: <a src="https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/"> How to Use Statistics to Identify Outliers in Data </a> by Jason Brownless (Machine Learning Mastery blog)

In [31]:
import plotly.figure_factory as ff
import numpy as np

v14_fraud_dist = new_operation['V14'].loc[new_operation['Class'] == 1].values
v12_fraud_dist = new_operation['V12'].loc[new_operation['Class'] == 1].values
v10_fraud_dist = new_operation['V10'].loc[new_operation['Class'] == 1].values

# Group data together
hist_data = [v14_fraud_dist, v12_fraud_dist,v10_fraud_dist]

group_labels = ['V14_Distribution','V12_Distribution','V10_Distribution']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2)
fig.show()

### **Removing Outliers (Highest Negative Correlated with Labels)**

**V14**

In [32]:
#V14 removing outliers from fraud transactions
v14_fraud = new_operation['V14'].loc[new_operation['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)

print('Quartile 25: {} | Quartile 75: {}'.format(q25, q75))
v14_iqr = q75 - q25
print('iqr: {}'.format(v14_iqr))


Quartile 25: -9.692722964972385 | Quartile 75: -4.282820849486866
iqr: 5.409902115485519


In [33]:
v14_cut_off_value = v14_iqr * 1.5
v14_lower, v14_upper = q25 - v14_cut_off_value, q75 + v14_cut_off_value
print('Cut Off: {}'.format(v14_cut_off_value))
print('V14 Lower: {}'.format(v14_lower))
print('V14 Upper: {}'.format(v14_upper))

Cut Off: 8.114853173228278
V14 Lower: -17.807576138200663
V14 Upper: 3.8320323237414122


In [34]:
outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper]
print('Feature V14 Outliers for Fraud Cases: {}'.format(len(outliers)))
print('V14 outliers:{}'.format(outliers))

Feature V14 Outliers for Fraud Cases: 4
V14 outliers:[-19.2143254902614, -18.049997689859396, -18.8220867423816, -18.4937733551053]


In [35]:
new_operation = new_operation.drop(new_operation[(new_operation['V14'] > v14_upper) | (new_operation['V14'] < v14_lower)].index)

**V12**

In [36]:
# V12 removing outliers from fraud transactions
v12_fraud = new_operation['V12'].loc[new_operation['Class'] == 1].values
q25, q75 = np.percentile(v12_fraud, 25), np.percentile(v12_fraud, 75)
v12_iqr = q75 - q25

v12_cut_off_value = v12_iqr * 1.5
v12_lower, v12_upper = q25 - v12_cut_off_value, q75 + v12_cut_off_value
print('V12 Lower: {}'.format(v12_lower))
print('V12 Upper: {}'.format(v12_upper))
outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print('V12 outliers: {}'.format(outliers))
print('Feature V12 Outliers for Fraud Cases: {}'.format(len(outliers)))
new_operation = new_operation.drop(new_operation[(new_operation['V12'] > v12_upper) | (new_operation['V12'] < v12_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_operation)))


V12 Lower: -17.3430371579634
V12 Upper: 5.776973384895937
V12 outliers: [-18.047596570821604, -18.4311310279993, -18.683714633344298, -18.553697009645802]
Feature V12 Outliers for Fraud Cases: 4
Number of Instances after outliers removal: 975


In [37]:
# Removing outliers V10 Feature
v10_fraud = new_operation['V10'].loc[new_operation['Class'] == 1].values
q25, q75 = np.percentile(v10_fraud, 25), np.percentile(v10_fraud, 75)
v10_iqr = q75 - q25

v10_cut_off_value = v10_iqr * 1.5
v10_lower, v10_upper = q25 - v10_cut_off_value , q75 + v10_cut_off_value
print('V10 Lower: {}'.format(v10_lower))
print('V10 Upper: {}'.format(v10_upper))
outliers = [x for x in v10_fraud if x < v10_lower or x > v10_upper]
print('V10 outliers: {}'.format(outliers))
print('Feature V10 Outliers for Fraud Cases: {}'.format(len(outliers)))
new_operation = new_operation.drop(new_operation[(new_operation['V10'] > v10_upper) | (new_operation['V10'] < v10_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_operation)))

V10 Lower: -14.89885463232024
V10 Upper: 4.920334958342141
V10 outliers: [-22.1870885620007, -22.1870885620007, -15.2399619587112, -15.124162814494698, -16.6011969664137, -15.563791338730098, -18.2711681738888, -17.141513641289198, -20.949191554361104, -16.2556117491401, -16.7460441053944, -14.9246547735487, -22.1870885620007, -14.9246547735487, -15.346098846877501, -19.836148851696, -22.1870885620007, -23.2282548357516, -15.2318333653018, -15.1237521803455, -24.5882624372475, -24.403184969972802, -15.563791338730098, -18.9132433348732, -16.6496281595399, -16.3035376590131, -15.2399619587112]
Feature V10 Outliers for Fraud Cases: 27
Number of Instances after outliers removal: 946


In [39]:
lis=['V14']#,'V12','V10']
for i in lis:
    box_plot(i,'Box Plot with reduced outliers Styling for Class  & ' )

# Machine learning Modelling of Random Sampling
Before we have to split our data into training and testing sets and separate the features from the labels.


**Learning Curves:**
<ul>
<li>The <b>wider the  gap</b>  between the training score and the cross validation score, the more likely your model is <b>overfitting (high variance)</b>.</li>
<li> If the score is low in both training and cross-validation sets</b> this is an indication that our model is <b>underfitting (high bias)</b></li>
<li><b> Logistic Regression Classifier</b>  shows the best score in both training and cross-validating sets.</li>
</ul>

## Model Training

In [40]:
new_operation.head()

Unnamed: 0,scaled_amount,scaled_time,V1,V2,V3,V4,V5,V6,V7,V8,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
56543,4.373646,-0.4372,0.630768,-1.658161,0.452849,-0.977678,-1.769746,-0.82686,-0.30122,-0.081957,...,0.522553,0.301589,0.332136,-0.383315,0.460005,0.327107,0.082217,-0.024285,0.078791,0
43624,0.243834,-0.506467,-1.048005,1.300219,-0.180401,2.589843,-1.164794,0.031823,-2.175778,0.699072,...,0.644993,0.549014,0.624321,-0.136663,0.131738,0.030921,-0.176701,0.504898,0.069882,1
62247,-0.194508,-0.405256,-1.176287,2.032277,1.335549,2.706513,-0.411466,-0.272458,0.180319,0.493822,...,0.385641,-0.220284,-0.566249,0.085387,0.323284,-0.297018,-0.027016,0.466141,0.264072,0
156990,2.29344,0.289078,-1.000611,3.34685,-5.534491,6.835802,-0.299803,0.095951,-2.440419,1.286301,...,1.189814,0.439757,-0.694099,0.29966,-0.657601,0.101648,0.430457,0.824685,0.326952,1
6446,-0.29344,-0.904851,0.70271,2.426433,-5.234513,4.416661,-2.170806,-2.667554,-3.878088,0.911337,...,0.422743,0.55118,-0.009802,0.721698,0.473246,-1.959304,0.319476,0.600485,0.129305,1


In [41]:
# Undersampling before cross validating (prone to overfit)
X = new_operation.drop('Class', axis=1) #new_operation
y = new_operation['Class']

In [42]:
# Our data is already scaled we should split our training and test sets
from sklearn.model_selection import train_test_split

# This is explicitly used for undersampling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [43]:
# Turn the values into an array for feeding the classification algorithms.
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

In [44]:
# Let's implement simple classifiers

classifiers = {
    "LogisiticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier()
}

In [45]:
# Wow our scores are getting even high scores even when applying cross validation.
from sklearn.model_selection import cross_val_score


for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    training_score = cross_val_score(classifier, X_train, y_train, cv=5)
    print("Classifiers: ", classifier.__class__.__name__, "Has a training score of", round(training_score.mean(), 2) * 100, "% accuracy score")

Classifiers:  LogisticRegression Has a training score of 93.0 % accuracy score
Classifiers:  KNeighborsClassifier Has a training score of 92.0 % accuracy score
Classifiers:  SVC Has a training score of 92.0 % accuracy score
Classifiers:  DecisionTreeClassifier Has a training score of 90.0 % accuracy score


## GridSearchCV to find the best parameters

In [46]:
# Use GridSearchCV to find the best parameters.
from sklearn.model_selection import GridSearchCV


# Logistic Regression 
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}



grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params)
grid_log_reg.fit(X_train, y_train)
# We automatically get the logistic regression with the best parameters.
log_reg = grid_log_reg.best_estimator_

knears_params = {"n_neighbors": list(range(2,5,1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}

grid_knears = GridSearchCV(KNeighborsClassifier(), knears_params)
grid_knears.fit(X_train, y_train)
# KNears best estimator
knears_neighbors = grid_knears.best_estimator_

# Support Vector Classifier
svc_params = {'C': [0.5, 0.7, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)

# SVC best estimator
svc = grid_svc.best_estimator_

# DecisionTree Classifier
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)), 
              "min_samples_leaf": list(range(5,7,1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train, y_train)

# tree best estimator
tree_clf = grid_tree.best_estimator_

In [47]:
# Overfitting Case

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print('Logistic Regression Cross Validation Score: ', round(log_reg_score.mean() * 100, 2).astype(str) + '%')


knears_score = cross_val_score(knears_neighbors, X_train, y_train, cv=5)
print('Knears Neighbors Cross Validation Score', round(knears_score.mean() * 100, 2).astype(str) + '%')

svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print('Support Vector Classifier Cross Validation Score', round(svc_score.mean() * 100, 2).astype(str) + '%')

tree_score = cross_val_score(tree_clf, X_train, y_train, cv=5)
print('DecisionTree Classifier Cross Validation Score', round(tree_score.mean() * 100, 2).astype(str) + '%')

Logistic Regression Cross Validation Score:  92.86%
Knears Neighbors Cross Validation Score 92.33%
Support Vector Classifier Cross Validation Score 92.72%
DecisionTree Classifier Cross Validation Score 91.53%


In [48]:
# We will undersample during cross validating
undersample_X = Data_Credit.drop('Class', axis=1) # Data_Credit
undersample_y = Data_Credit['Class']

In [49]:
for train_index, test_index in Strats.split(undersample_X, undersample_y):
    print("Train:", train_index, "Test:", test_index)
    undersample_Xtrain, undersample_Xtest = undersample_X.iloc[train_index], undersample_X.iloc[test_index]
    undersample_ytrain, undersample_ytest = undersample_y.iloc[train_index], undersample_y.iloc[test_index]

Train: [ 30473  30496  31002 ... 284804 284805 284806] Test: [    0     1     2 ... 57017 57018 57019]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 30473  30496  31002 ... 113964 113965 113966]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 81609  82400  83053 ... 170946 170947 170948]
Train: [     0      1      2 ... 284804 284805 284806] Test: [150654 150660 150661 ... 227866 227867 227868]
Train: [     0      1      2 ... 227866 227867 227868] Test: [212516 212644 213092 ... 284804 284805 284806]


In [50]:
undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values 

In [51]:
undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []

In [52]:
# Implementing NearMiss Technique 
# Distribution of NearMiss (Just to see how it distributes the labels we won't use these variables)
X_nearmiss, y_nearmiss = NearMiss().fit_sample(undersample_X.values, undersample_y.values)
print('NearMiss Label Distribution: {}'.format(Counter(y_nearmiss)))

NearMiss Label Distribution: Counter({0: 492, 1: 492})


In [53]:
# Cross Validating the right way
for train, test in Strats.split(undersample_Xtrain, undersample_ytrain):
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy='majority'), log_reg) # SMOTE happens during Cross Validation not before..
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])
    
    undersample_accuracy.append(undersample_pipeline.score(original_Xtrain[test], original_ytrain[test]))
    undersample_precision.append(precision_score(original_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(original_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(original_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(original_ytrain[test], undersample_prediction))

## Plotting of Learning Curve

In [54]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve
def plots_trace(Model,X_train, y_train,cv,n_jobs,string):
    fig = go.Figure()
    train_sizes, train_scores, test_scores = learning_curve(Model, X_train, y_train, cv=cv, n_jobs=n_jobs, train_sizes=np.linspace(.1, 1.0, 5))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fig.add_trace(go.Scatter(x=train_sizes, y=train_scores_mean,
                             fill=None,
                             mode='lines',
                             line_color='indigo',
                             name="Training Score"))

    fig.add_trace(go.Scatter(x=train_sizes,y= test_scores_mean,
                             fill=None,
                             mode='lines',
                             line_color='yellow',
                             name='Cross-validation score'))


    fig.update_layout(title=string,
                      xaxis_title="Score",
                      yaxis_title="Training Size")
    return fig.show()

In [58]:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
Models={log_reg:'Logistic Regression Learning Curve', knears_neighbors:'Knears Neighbors Learning Curve', svc:'Support Vector Classifier Learning Curve',
        tree_clf:'Decision Tree Classifier Learning Curve'}
for k,v in Models.items():
    plots_trace(k,X_train, y_train,cv,1,v)

## ROC Curve

In [59]:
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv=5,
                             method="decision_function")

knears_pred = cross_val_predict(knears_neighbors, X_train, y_train, cv=5)

svc_pred = cross_val_predict(svc, X_train, y_train, cv=5,
                             method="decision_function")

tree_pred = cross_val_predict(tree_clf, X_train, y_train, cv=5)

In [50]:
from sklearn.metrics import roc_auc_score

print('Logistic Regression: ', roc_auc_score(y_train, log_reg_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knears_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, tree_pred))

Logistic Regression:  0.9789602574867059
KNears Neighbors:  0.9254233137419535
Support Vector Classifier:  0.9761964735516373
Decision Tree Classifier:  0.916218863699972


In [51]:
log_fpr, log_tpr, log_thresold = roc_curve(y_train, log_reg_pred)
knear_fpr, knear_tpr, knear_threshold = roc_curve(y_train, knears_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, tree_pred)

In [52]:
def roc_curve_plots():
    fig = go.Figure()

    fig.add_trace(go.Scatter(x=log_fpr, y=log_tpr,
                             fill=None,
                             mode='lines',
                             line_color='red',
                             name='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred))
                             ))

    fig.add_trace(go.Scatter(x=knear_fpr,y= knear_tpr,
                             fill=None,
                             mode='lines',
                             line_color='yellow',
                             name='KNears Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knears_pred))
                            ))
    fig.add_trace(go.Scatter(x=svc_fpr, y=svc_tpr,
                             fill=None,
                             mode='lines',
                             line_color='green',
                             name='Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred))
                            ))
    fig.add_trace(go.Scatter(x=tree_fpr,y= tree_tpr,
                             fill=None,
                             mode='lines',
                             line_color='darkblue',
                             name='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, tree_pred))
                            ))
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1],
                             fill=None,
                             mode='lines', line={'dash': 'dash', 'color': 'black'},
                             name='Minimum ROC Score'
                            ))

    
    
    fig.update_layout(
                annotations=[
                            go.layout.Annotation(
                                x=.5, y=.5,
                                xref="x",
                                yref="y",
                                text="'Minimum ROC Score of 50% \n (This is the minimum score to get)",
                                showarrow=True,
                                arrowhead=5,
                                ax=40,
                                ay=40)
                            ])
   
    
    fig.update_layout(title='ROC Curve for Top 4 Classifiers',
                      xaxis_title='False Positive Rate',
                      yaxis_title='True Positive Rate',
                     height=600, width=1000,
                     legend=dict(x=-.1, y=1.5))
    
    fig.update_layout(
                    title={
                        'y':0.75,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'})
    
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)
    fig.show()

In [53]:
roc_curve_plots()

## Investigation of Logistic Regression:

**Terms**:
<ul>
<li><b>True Positives:</b> Correctly Classified Fraud Transactions </li>
<li><b>False Positives:</b> Incorrectly Classified Fraud Transactions</li>
<li> <b>True Negative:</b> Correctly Classified Non-Fraud Transactions</li>
<li> <b>False Negative:</b> Incorrectly Classified Non-Fraud Transactions</li>
<li><b>Precision: </b>  True Positives/(True Positives + False Positives)  </li>
<li><b> Recall: </b> True Positives/(True Positives + False Negatives)   </li>
<li> Precision as the name says, says how precise (how sure) is our model in detecting fraud transactions while recall is the amount of fraud cases our model is able to detect.</li>
<li><b>Precision/Recall Tradeoff: </b> The more precise (selective) our model is, the less cases it will detect. Example: Assuming that our model has a precision of 95%, Let's say there are only 5 fraud cases in which the model is 95% precise or more that these are fraud cases. Then let's say there are 5 more cases that our model considers 90% to be a fraud case, if we lower the precision there are more cases that our model will be able to detect. </li>
</ul>

In [54]:
def logistic_roc_curve(log_fpr, log_tpr):
    fig = go.Figure()

    fig.add_trace(go.Scatter(x=log_fpr, y=log_tpr,
                             fill=None,
                             mode='lines',
                             line_color='red',
                             name='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred))
                             ))
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1],
                             fill=None,
                             mode='lines', line={'dash': 'dash', 'color': 'black'},
                             name='Minimum ROC Score'
                            ))
    fig.update_layout( xaxis_title='False Positive Rate',
                      yaxis_title='True Positive Rate',
                    title={'text':'Logistic Regression ROC Curve',
                        'y':0.75,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
             height=600, width=1000,
                     legend=dict(x=-.1, y=1.5))
    fig.show()
    
    
logistic_roc_curve(log_fpr, log_tpr)

In [71]:
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_train, log_reg_pred)

In [72]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
y_pred = log_reg.predict(X_train)

# Overfitting Case
print('Overfitting: \n')
print('Recall Score: {:.2f}'.format(recall_score(y_train, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(y_train, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(y_train, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_train, y_pred)))
print('---' * 30)

# How it should look like
print('How it should be:\n')
print("Accuracy Score: {:.2f}".format(np.mean(undersample_accuracy)))
print("Precision Score: {:.2f}".format(np.mean(undersample_precision)))
print("Recall Score: {:.2f}".format(np.mean(undersample_recall)))
print("F1 Score: {:.2f}".format(np.mean(undersample_f1)))


Overfitting: 

Recall Score: 0.85
Precision Score: 0.68
F1 Score: 0.76
Accuracy Score: 0.73
------------------------------------------------------------------------------------------
How it should be:

Accuracy Score: 0.59
Precision Score: 0.01
Recall Score: 0.92
F1 Score: 0.02


In [73]:
undersample_y_score = log_reg.decision_function(original_Xtest)

## Precision -Recall Analysis

In [74]:
from sklearn.metrics import average_precision_score

undersample_average_precision = average_precision_score(original_ytest, undersample_y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      undersample_average_precision))

Average precision-recall score: 0.03


In [75]:
precision, recall, _ = precision_recall_curve(original_ytest, undersample_y_score)

In [84]:
def precision_recall_plot(STRING,RECALL,PRECISION,SHAPE,AVG):
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=RECALL,y=PRECISION,fill='tonexty',
                            line_shape=SHAPE))
        
        fig.update_layout(showlegend=False,
                          xaxis_title='Recall',
                          yaxis_title='Precision',
                          title={'text':STRING.format(
                         AVG),
                                'y':.9,
                                'x':0.5,
                                'xanchor': 'center',
                                'yanchor': 'top'},
                         height=600, width=1000,
                         legend=dict(x=-.1, y=1.5))
        fig.show()
    
    

In [85]:
precision_recall_plot('UnderSampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}',recall,precision,'hv',undersample_average_precision)

# Over-Sampling


## SMOTE Technique
<a id="smote"></a>
<img src="http://glemaitre.github.io/imbalanced-learn/_images/sphx_glr_plot_smote_enn_001.png">
<b>SMOTE</b> stands for Synthetic Minority Over-sampling Technique.  Unlike Random UnderSampling, SMOTE creates new synthetic points in order to have an equal balance of the classes. This is another alternative for solving the "class imbalance problems". <br><br>


**Understanding SMOTE:**
- Solving the Class Imbalance: SMOTE creates synthetic points from the minority class in order to reach an equal balance between the minority and majority class. 
- Location of the synthetic points:SMOTE picks the distance between the closest neighbors of the minority class, in between these distances it creates synthetic points. 
- Final Effect:More information is retained since we didn't have to delete any rows unlike in random undersampling.
- Accuracy || Time Tradeoff: Although it is likely that SMOTE will be more accurate than random under-sampling, it will take more time to train since no rows are eliminated as previously stated.


**Cross Validation Overfitting Mistake:
Overfitting during Cross Validation:**  
In our undersample analysis I want to show you a common mistake I made that I want to share with all of you. It is simple, if you want to undersample or oversample your data you should not do it before cross validating. Why because you will be directly influencing the validation set before implementing cross-validation causing a "data leakage" problem. <b>In the following section you will see amazing precision and recall scores but in reality our data is overfitting!</b>


**The Wrong Way:**
<img src="https://www.marcoaltini.com/uploads/1/3/2/3/13234002/2639934.jpg?401"><br>

As mentioned previously, if we get the minority class ("Fraud) in our case, and create the synthetic points before cross validating we have a certain influence on the "validation set" of the cross validation process. Remember how cross validation works, let's assume we are splitting the data into 5 batches, 4/5 of the dataset will be the training set while 1/5 will be the validation set. The test set should not be touched! For that reason, we have to do the creation of synthetic datapoints "during" cross-validation and not before, just like below: <br>

**The Right Way**:
<img src="https://www.marcoaltini.com/uploads/1/3/2/3/13234002/9101820.jpg?372"> <br>
As you see above, SMOTE occurs "during" cross validation and not "prior" to the cross validation process. Synthetic data are created only for the training set without affecting the validation set.




**References**: 
<ul>
<li><a src="https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation"> 
DEALING WITH IMBALANCED DATA: UNDERSAMPLING, OVERSAMPLING AND PROPER CROSS-VALIDATION </a></li> 

<li> <a src="http://rikunert.com/SMOTE_explained "> SMOTE explained for noobs  </a></li>
<li> <a src="https://www.youtube.com/watch?v=DQC_YE3I5ig&t=794s"> Machine Learning - Over-& Undersampling - Python/ Scikit/ Scikit-Imblearn </a></li>
</ul>

In [60]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, RandomizedSearchCV

print('Length of X (train): {} | Length of y (train): {}'.format(len(original_Xtrain), len(original_ytrain)))
print('Length of X (test): {} | Length of y (test): {}'.format(len(original_Xtest), len(original_ytest)))

# List to append the score and then find the average
accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []

Length of X (train): 227846 | Length of y (train): 227846
Length of X (test): 56961 | Length of y (test): 56961


In [61]:
# Classifier with optimal parameters
log_reg_sm = LogisticRegression()
rand_log_reg = RandomizedSearchCV(LogisticRegression(), log_reg_params, n_iter=4)


# Implementing SMOTE Technique 
# Cross Validating the right way

log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
for train, test in Strats.split(original_Xtrain, original_ytrain):
    pipeline = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_log_reg) # SMOTE happens during Cross Validation not before..
    model = pipeline.fit(original_Xtrain[train], original_ytrain[train])
    best_est = rand_log_reg.best_estimator_
    prediction = best_est.predict(original_Xtrain[test])
    
    accuracy_lst.append(pipeline.score(original_Xtrain[test], original_ytrain[test]))
    precision_lst.append(precision_score(original_ytrain[test], prediction))
    recall_lst.append(recall_score(original_ytrain[test], prediction))
    f1_lst.append(f1_score(original_ytrain[test], prediction))
    auc_lst.append(roc_auc_score(original_ytrain[test], prediction))
    


In [62]:
print("accuracy: {}".format(np.mean(accuracy_lst)))
print("precision: {}".format(np.mean(precision_lst)))
print("recall: {}".format(np.mean(recall_lst)))
print("f1: {}".format(np.mean(f1_lst)))

accuracy: 0.9415664710483274
precision: 0.06169758307120463
recall: 0.9137293086660175
f1: 0.11372794096562253


In [63]:
labels = ['No Fraud', 'Fraud']
smote_prediction = best_est.predict(original_Xtest)
print(classification_report(original_ytest, smote_prediction, target_names=labels))

              precision    recall  f1-score   support

    No Fraud       1.00      0.99      0.99     56863
       Fraud       0.11      0.85      0.20        98

    accuracy                           0.99     56961
   macro avg       0.55      0.92      0.59     56961
weighted avg       1.00      0.99      0.99     56961



## Precision-recall Curve

In [81]:
y_score = best_est.decision_function(original_Xtest)

In [82]:
average_precision = average_precision_score(original_ytest, y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))

Average precision-recall score: 0.75


In [86]:
precision, recall, _ = precision_recall_curve(original_ytest, y_score)
precision_recall_plot('OverSampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}',recall,precision,'vhv',average_precision)

In [88]:
# SMOTE Technique (OverSampling) After splitting and Cross Validating
sm = SMOTE(ratio='minority', random_state=42)

# This will be the data were we are going to 
Xsm_train, ysm_train = sm.fit_sample(original_Xtrain, original_ytrain)

In [89]:

# Implement GridSearchCV and the other models.
# Logistic Regression
t0 = time.time()
log_reg_sm = grid_log_reg.best_estimator_
log_reg_sm.fit(Xsm_train, ysm_train)
t1 = time.time()
print("Fitting oversample data took :{} sec".format(t1 - t0))

Fitting oversample data took :7.309938430786133 sec


## Evaluation of Test Data with Logistic Regression:
**testing_logistic:**
**Confusion Matrix:
> **Positive/Negative:** Type of Class (label) ["No", "Yes"]
> **True/False:** Correctly or Incorrectly classified by the model.<br><br>

- **True Negatives (Top-Left Square):** This is the number of **correctly** classifications of the "No" (No Fraud Detected) class. 

- **False Negatives (Top-Right Square):** This is the number of **incorrectly** classifications of the "No"(No Fraud Detected) class.

- **False Positives (Bottom-Left Square):** This is the number of **incorrectly** classifications of the "Yes" (Fraud Detected) class 

- **True Positives (Bottom-Right Square):** This is the number of **correctly** classifications of the "Yes" (Fraud Detected) class.

**Summary:** 
- Random UnderSampling:We will evaluate the final performance of the classification models in the random undersampling subset.Keep in mind that this is not the data from the original dataframe.
- Classification Models: The models that performed the best were logistic regression  and support vector classifier (SVM)

In [90]:
# Logistic Regression fitted using SMOTE technique
y_pred_log_reg = log_reg_sm.predict(X_test)

# Other models fitted with UnderSampling
y_pred_knear = knears_neighbors.predict(X_test)
y_pred_svc = svc.predict(X_test)
y_pred_tree = tree_clf.predict(X_test)

In [91]:
from sklearn.metrics import confusion_matrix
log_reg_cf = confusion_matrix(y_test, y_pred_log_reg)
kneighbors_cf = confusion_matrix(y_test, y_pred_knear)
svc_cf = confusion_matrix(y_test, y_pred_svc)
tree_cf = confusion_matrix(y_test, y_pred_tree)

In [92]:
from plotly.graph_objs import *
def confusion_matrix_plot(Model,String):
        trace1 = {
          "type": "heatmap", 
          "x": ['Non-Fraud', 'Fraud'], 
          "y": ['Fraud', 'Non-Fraud'], 
          "z": Model, 
          "showscale": True, 
          "colorscale": "Viridis"
        }
        data = [trace1]
        layout = {
         
          "xaxis": {
            "side": "bottom", 
            "dtick": 1, 
            "ticks": "", 
            "title": "Predicted label", 
            "gridcolor": "rgb(0, 0, 0)"
          }, 
          "yaxis": {
            "dtick": 1, 
            "ticks": "", 
            "title": "True label", 
            "ticksuffix": "  "
          }, 

          "annotations": [
            {
              "x": 'Non-Fraud', 
              "y": 'Fraud', 
              "font": {"color": "white"}, 
              "text": str(Model[1][0]), 
              "xref": "x1", 
              "yref": "y1", 
              "showarrow": False
            }, 
            {
              "x": 'Fraud', 
              "y": 'Fraud', 
              "font": {"color": "white"}, 
              "text": str(Model[1][1]), 
              "xref": "x1", 
              "yref": "y1", 
              "showarrow": False
            }, 
            {
              "x": 'Non-Fraud', 
              "y": 'Non-Fraud',
              "font": {"color": "white"}, 
              "text": str(Model[0][0]), 
              "xref": "x1", 
              "yref": "y1", 
              "showarrow": False
            }, 
            {
              "x": 'Fraud', 
              "y": 'Non-Fraud', 
              "font": {"color": "white"}, 
              "text": str(Model[0][1]), 
              "xref": "x1", 
              "yref": "y1", 
              "showarrow": False
            }
          ]
        }
        fig = Figure(data=data, layout=layout)
        
        fig.update_layout(
                    title={'text':String,
                        'y':.9,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
             height=500, width=500,
                     )
        fig.show()
    

In [96]:
CF_MODEL={'Logistic Regression Confusion Matrix':log_reg_cf, 'Knears Neighbors Confusion Matrix':kneighbors_cf, 'Support Vector Confusion Matrix':svc_cf,'Decision Tree Confusion Matrix':tree_cf}
for k,v in CF_MODEL.items():
    confusion_matrix_plot(v,k)

In [97]:
from sklearn.metrics import classification_report


print('Logistic Regression:')
print(classification_report(y_test, y_pred_log_reg))

print('KNears Neighbors:')
print(classification_report(y_test, y_pred_knear))

print('Support Vector Classifier:')
print(classification_report(y_test, y_pred_svc))

print('Support Vector Classifier:')
print(classification_report(y_test, y_pred_tree))

Logistic Regression:
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       102
           1       0.94      0.89      0.91        88

    accuracy                           0.92       190
   macro avg       0.92      0.92      0.92       190
weighted avg       0.92      0.92      0.92       190

KNears Neighbors:
              precision    recall  f1-score   support

           0       0.89      0.99      0.94       102
           1       0.99      0.86      0.92        88

    accuracy                           0.93       190
   macro avg       0.94      0.93      0.93       190
weighted avg       0.94      0.93      0.93       190

Support Vector Classifier:
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       102
           1       0.95      0.88      0.91        88

    accuracy                           0.92       190
   macro avg       0.92      0.92      0.92       190
weighted

In [98]:
# Final Score in the test set of logistic regression
from sklearn.metrics import accuracy_score

# Logistic Regression with Under-Sampling
y_pred = log_reg.predict(X_test)
undersample_score = accuracy_score(y_test, y_pred)



# Logistic Regression with SMOTE Technique (Better accuracy with SMOTE t)
y_pred_sm = best_est.predict(original_Xtest)
oversample_score = accuracy_score(original_ytest, y_pred_sm)


d = {'Technique': ['Random UnderSampling', 'Oversampling (SMOTE)'], 'Score': [undersample_score, oversample_score]}
final_df = pd.DataFrame(data=d)

# Move column
score = final_df['Score']
final_df.drop('Score', axis=1, inplace=True)
final_df.insert(1, 'Score', score)

# Note how high is accuracy score it can be misleading! 
final_df

Unnamed: 0,Technique,Score
0,Random UnderSampling,0.921053
1,Oversampling (SMOTE),0.987974


# Artificial Neural Networks  

**Random UnderSampling Data vs OverSampling (SMOTE):  ** 

Simple Neural Network (with one hidden layer) in order to see  which of the two logistic regressions models we implemented in the (undersample or oversample(SMOTE)) has a better accuracy for detecting fraud and non-fraud transactions. 

**Target:  **

Our main goal is to explore how our simple neural network behaves in both the random undersample and oversample dataframes and see whether they can predict accuractely both non-fraud and fraud cases.

## **Keras ~ Random UnderSampling**:
- Dataset:  In this final phase of testing we will fit this model in both the random undersampled subset and oversampled dataset (SMOTE) in order to predict the final result using the original dataframe testing data.
- Neural Network Structure:  As stated previously, this will be a simple model composed of one input layer (where the number of nodes equals the number of features) plus bias node, one hidden layer with 32 nodes and one output node composed of two possible results 0 or 1 (No fraud or fraud). 
- Other characteristics: The learning rate will be 0.001, the optimizer we will use is the AdamOptimizer, the activation function that is used in this scenario is "Relu" and for the final outputs we will use sparse categorical cross entropy, which gives the probability whether an instance case is no fraud or fraud (The prediction will pick the highest probability between the two.) 


In [77]:
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy


In [78]:
n_inputs = X_train.shape[1]

undersample_model = Sequential([
    Dense(n_inputs, input_shape=(n_inputs, ), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax')
])

W0104 19:37:52.008883 14308 deprecation_wrapper.py:119] From C:\Users\aniruddha.choudhury\AppData\Local\Continuum\anaconda3\envs\kdlsd\lib\site-packages\keras\backend\tensorflow_backend.py:66: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0104 19:37:52.032883 14308 deprecation_wrapper.py:119] From C:\Users\aniruddha.choudhury\AppData\Local\Continuum\anaconda3\envs\kdlsd\lib\site-packages\keras\backend\tensorflow_backend.py:541: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0104 19:37:52.035882 14308 deprecation_wrapper.py:119] From C:\Users\aniruddha.choudhury\AppData\Local\Continuum\anaconda3\envs\kdlsd\lib\site-packages\keras\backend\tensorflow_backend.py:4432: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.



In [79]:
undersample_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 30)                930       
_________________________________________________________________
dense_2 (Dense)              (None, 32)                992       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 66        
Total params: 1,988
Trainable params: 1,988
Non-trainable params: 0
_________________________________________________________________


In [80]:
undersample_model.compile(Adam(lr=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

W0104 19:37:52.305881 14308 deprecation_wrapper.py:119] From C:\Users\aniruddha.choudhury\AppData\Local\Continuum\anaconda3\envs\kdlsd\lib\site-packages\keras\optimizers.py:793: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0104 19:37:52.317882 14308 deprecation_wrapper.py:119] From C:\Users\aniruddha.choudhury\AppData\Local\Continuum\anaconda3\envs\kdlsd\lib\site-packages\keras\backend\tensorflow_backend.py:3622: The name tf.log is deprecated. Please use tf.math.log instead.



In [81]:
undersample_model.fit(X_train, y_train, validation_split=0.2, batch_size=25, epochs=20, shuffle=True, verbose=1)

W0104 19:37:52.591881 14308 deprecation.py:323] From C:\Users\aniruddha.choudhury\AppData\Local\Continuum\anaconda3\envs\kdlsd\lib\site-packages\tensorflow\python\ops\math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0104 19:37:52.637885 14308 deprecation_wrapper.py:119] From C:\Users\aniruddha.choudhury\AppData\Local\Continuum\anaconda3\envs\kdlsd\lib\site-packages\keras\backend\tensorflow_backend.py:1033: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.



Train on 605 samples, validate on 152 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1ea51554390>

In [83]:
undersample_fraud_predictions = undersample_model.predict_classes(original_Xtest, batch_size=200, verbose=0)

In [84]:
undersample_cm = confusion_matrix(original_ytest, undersample_fraud_predictions)
actual_cm = confusion_matrix(original_ytest, original_ytest)

In [85]:
CF_MODEL={'Random UnderSample \n Confusion Matrix':undersample_cm, 'Confusion Matrix \n (with 100% accuracy)':actual_cm}
for k,v in CF_MODEL.items():
    confusion_matrix_plot(v,k);

## Keras ~ OverSampling [SMOTE]:


In [86]:
n_inputs = Xsm_train.shape[1]

oversample_model = Sequential([
    Dense(n_inputs, input_shape=(n_inputs, ), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax')
])

In [87]:
oversample_model.compile(Adam(lr=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [88]:
oversample_model.fit(Xsm_train, ysm_train, validation_split=0.2, batch_size=300, epochs=20, shuffle=True, verbose=1)

Train on 363923 samples, validate on 90981 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1ea00a72eb8>

In [89]:
oversample_predictions = oversample_model.predict(original_Xtest, batch_size=200, verbose=0)

In [90]:
oversample_fraud_predictions = oversample_model.predict_classes(original_Xtest, batch_size=200, verbose=0)

In [91]:
oversample_smote = confusion_matrix(original_ytest, oversample_fraud_predictions)
actual_cm = confusion_matrix(original_ytest, original_ytest)

In [92]:
CF_MODEL={'Random OverSample \n Confusion Matrix':oversample_smote, 'Confusion Matrix \n (with 100% accuracy)':actual_cm}
for k,v in CF_MODEL.items():
    confusion_matrix_plot(v,k);

## Conclusion: 
So by doing SMOTE implementaion on our imbalanced dataset helped us with the imbalance of our labels (more no fraud than fraud transactions). Nevertheless, I still have to state that sometimes the neural network on the oversampled dataset predicts less correct fraud transactions than our model using the under-sample dataset. However, remember that the removal of outliers was implemented only on the random under-sample dataset and not on the oversampled one. Also, in our under-sample data our model is unable to detect for a large number of cases non fraud transactions correctly and instead, misclassifies those non fraud transactions as fraud cases. Imagine that people that were making regular purchases got their card blocked due to the reason that our model classified that transaction as a fraud transaction, this will be a huge disadvantage for the financial institution. The number of customer complaints and customer disatisfaction will increase. 



# AutoEncoder Model Prediction Architecture 

![](https://miro.medium.com/max/1382/1*DFPyZ-XMPFNukCP3mF8gJg.png)

In [93]:
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
from sklearn.model_selection import train_test_split 
from sklearn.manifold import TSNE
from sklearn import preprocessing 
sns.set(style="whitegrid")
np.random.seed(203)

## Dataset Preparation

In [94]:
data = pd.read_csv("creditcard.csv")
data["Time"] = data["Time"].apply(lambda x : x / 3600 % 24)
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,0.000278,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,0.000278,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,0.000556,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [95]:
dist = data['Class'].value_counts().to_frame().reset_index()
dist['percent'] = dist["Class"].apply(lambda x : round(100*float(x) / len(data), 2))
dist = dist.rename(columns = {"index" : "Target", "Class" : "Count"})
dist

Unnamed: 0,Target,Count,percent
0,0,284315,99.83
1,1,492,0.17


One of the biggest challenge of this problem is that the target is highly imbalanced as only 0.17 % cases are fraud transactions. But the advantage of the representation learning approach is that it is still able to handle such imbalance nature of the problems. We will look how. For our use-case let's take only about 1000 rows of non-fraud transactions. 

**Consider only 1000 rows of non fraud cases**

In [96]:
non_fraud = data[data['Class'] == 0].sample(1000)
fraud = data[data['Class'] == 1]

df = non_fraud.append(fraud).sample(frac=1).reset_index(drop=True)
X = df.drop(['Class'], axis = 1).values
Y = df["Class"].values

## Visualize Fraud and NonFraud Transactions using T-SNE
Let's visualize the nature of fraud and non-fraud transactions using T-SNE. T-SNE (t-Distributed Stochastic Neighbor Embedding) is a dataset decomposition technique which reduced the dimentions of data and produces only top n components with maximum information.


Every dot in the following represents a transaction. Non Fraud transactions are represented as Green while Fraud transactions are represented as Red. The two axis are the components extracted by tsne. 

From the above graph we can observe that there are many non_fraud transactions which are very close to fraud transactions, thus are difficult to accurately classify from a model. 

In [97]:
def tsne_plot(x1, y1):
    tsne = TSNE(n_components=2, random_state=0)
    X_tt = tsne.fit_transform(x1)
    Data=pd.DataFrame(y1,columns={'T'})
    Data['T'] = Data['T'].apply(lambda x: 'Fraud' if x==1 else 'Non-Fraud')
    Target=Data.values
    d=pd.DataFrame()
    d['x']=X_tt[:,0]
    d['y']=X_tt[:,1]
    d['target']=Target
    fig = px.scatter(d, x="x", y="y", color="target")

    fig.update_traces(marker=dict(size=12,
                              line=dict(width=1,
                              color='DarkSlateGrey')),
                              selector=dict(mode='markers'))
    fig.layout.title.text = "T-SNE for Fraud vs Non-Fraud"
    fig.show()
    
   
    


In [98]:
tsne_plot(X,Y)

## AutoEncoders Modelling
![](https://miro.medium.com/max/5028/1*tY4F3BPq4ctTMelMEnLZvw.png)

We will create an autoencoder model in which we only show the model non-fraud cases. The model will try to learn the best representation of non-fraud cases. The same model will be used to generate the representations of fraud cases and we expect them to be different from non-fraud ones. 

Create a network with one input layer and one output layer having identical dimentions ie. the shape of non-fraud cases. We will use keras package. 

In [99]:
## input layer 
input_layer = Input(shape=(X.shape[1],))

## encoding part
encoded = Dense(100, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='relu')(encoded)

## decoding part
decoded = Dense(50, activation='tanh')(encoded)
decoded = Dense(100, activation='tanh')(decoded)

## output layer
output_layer = Dense(X.shape[1], activation='relu')(decoded)

In [100]:
autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer="adadelta", loss="mse")

In [101]:
x = data.drop(["Class"], axis=1)
y = data["Class"].values

x_scale = preprocessing.MinMaxScaler().fit_transform(x.values)
x_norm, x_fraud = x_scale[y == 0], x_scale[y == 1]

In [102]:
autoencoder.fit(x_norm[0:2000], x_norm[0:2000], 
                batch_size = 256, epochs = 10, 
                shuffle = True, validation_split = 0.20);

Train on 1600 samples, validate on 400 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [103]:
autoencoder.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 30)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               3100      
_________________________________________________________________
dense_8 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_9 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_10 (Dense)             (None, 100)               5100      
_________________________________________________________________
dense_11 (Dense)             (None, 30)                3030      
Total params: 18,830
Trainable params: 18,830
Non-trainable params: 0
_______________________________________________________

## Obtain the Latent Representations
Now, the model is trained. We are intereseted in obtaining latent representation of the input learned by the model. This can be accessed by the weights of the trained model. We will create another network containing sequential layers, and we will only add the trained weights till the third layer where latent representation exists. 

In [104]:
autoencoder.layers

[<keras.engine.input_layer.InputLayer at 0x1ea002f1e10>,
 <keras.layers.core.Dense at 0x1ea002f1da0>,
 <keras.layers.core.Dense at 0x1ea002f1128>,
 <keras.layers.core.Dense at 0x1ea00654780>,
 <keras.layers.core.Dense at 0x1ea006529e8>,
 <keras.layers.core.Dense at 0x1ea050666d8>]

In [105]:
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])

In [106]:
norm_hid_rep = hidden_representation.predict(x_norm[:3000])
fraud_hid_rep = hidden_representation.predict(x_fraud)

## Visualize  T-SNE the latent representations : Fraud Vs Non Fraud
We will create a training dataset using the latent representations obtained and let's visualize the nature of fraud vs non-fraud cases. 

In [107]:
rep_x = np.append(norm_hid_rep, fraud_hid_rep, axis = 0)
y_n = np.zeros(norm_hid_rep.shape[0])
y_f = np.ones(fraud_hid_rep.shape[0])
rep_y = np.append(y_n, y_f)
tsne_plot(rep_x, rep_y)

Perfect graph, we can observe that now fraud and non-fraud transactions are pretty visibile and are linearly separable.