## Expected Questions:
1. NN Architecture/Layers: 
    - Really good conversation withs several approaches: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw#:~:text=Like%20the%20Input%20layer%2C%20every,by%20the%20chosen%20model%20configuration.
    - Article on approach to number of layers: https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
    - Overall this is an area of much debate in data science and there is no one approach fits all.
    
    100 - Input Layer
    
    Hidden Layers: 128/64/32/16
    

2. Activation Functions: 
    - https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0
    - Relu: Because of the horizontal line in ReLu( for negative X ), the gradient can go towards 0. For activations in that region of ReLu, gradient will be 0 because of which the weights will not get adjusted during descent. That means, those neurons which go into that state will stop responding to variations in error/ input ( simply because gradient is 0, nothing changes ). This is called dying ReLu problem. This problem can cause several neurons to just die and not respond making a substantial part of the network passive. There are variations in ReLu to mitigate this issue by simply making the horizontal line into non-horizontal component . for example y = 0.01x for x<0 will make it a slightly inclined line rather than horizontal line. This is leaky ReLu. There are other variations too. The main idea is to let the gradient be non zero and recover during training eventually.
    
    
3. Loss Functions: 
    - Article on each loss function: https://analyticsindiamag.com/loss-functions-in-deep-learning-an-overview/
    - MSE (Regression): Mean Squared Error is the mean of squared differences between the actual and predicted value. If the difference is large the model will penalize it as we are computing the squared difference.
    - Cross Entropy (Binary Classification): It gives the probability value between 0 and 1 for a classification task. Cross-Entropy calculates the average difference between the predicted and actual probabilities.
    
    
4. Optimizers: 
    - https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6
    - ADAM optimizer is the current "best of breed" - is more dynamic than traditional optimizers such as Stochastic Gradient Descent.
    

Last peice of advise: KISS principle - https://en.wikipedia.org/wiki/KISS_principle

## Introduction to Neural Networks - Fraud Detection

Nilson reports that U.S. card fraud (credit, debt, etc) was reportedly 9 billion dollars in 2016 and expected to increase to 12 billion dollars by 2020. For perspective, in 2017 both PayPal's and Mastercard's revenue was only $10.8 billion each.


**Objective:** In this session, given the credit card transactions, we will build a simple neural network (i.e., Multilayer perceptrons) for Fraud Detection using Keras.

This notebooks covers,

1. Creating a Model

2. Adding Layers

3. Activations

4. Optimizers and Loss functions

5. Evaluation

### Dataset Description

The datasets contains transactions made by credit cards in September 2013 by european cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data is not provided. Features V1, V2, ... V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Source: https://www.kaggle.com/mlg-ulb/creditcardfraud

In [None]:
#!pip install tensorflow==2.0

In [1]:
import tensorflow as tf
print(tf.__version__)

2.0.0


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn import preprocessing

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, precision_recall_curve, auc
import matplotlib.pyplot as plt
from tensorflow.keras import optimizers


In [None]:
#from google.colab import drive
#drive.mount('/content/drive/')
#project_path = '/content/drive/My Drive/Colab Notebooks/'
#dataset_file = project_path + 'creditcard.csv'

In [101]:
data = pd.read_csv('creditcard.csv')

In [102]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [37]:
data = data.drop("Time", axis = 1)

In [38]:
X_data = data.iloc[:, :-1]

In [7]:
X_data.shape

(284807, 29)

In [8]:
X_data.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99


In [39]:
y_data = data.iloc[:, -1]

In [10]:
y_data.shape

(284807,)

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size = 0.2, random_state = 7)

In [41]:
X_train = preprocessing.normalize(X_train)
X_test = preprocessing.normalize(X_test)

In [42]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(227845, 29)
(56962, 29)
(227845,)
(56962,)


### Creating a model

Keras model object can be created with Sequential class

At the outset, the model is empty per se. It is completed by adding additional layers and compilation


In [43]:
model = Sequential()

### Adding layers [layers and activations]

Keras layers can be added to the model

Adding layers are like stacking lego blocks one by one

It should be noted that as this is a classification problem, sigmoid layer (softmax for multi-class problems) should be added


29/64/32/1

In [44]:
model.add(Dense(64, input_shape = (29,), activation = 'relu')) # 1st hidden layer - 64
model.add(Dense(32, activation = 'tanh')) # 2nd hidden layer

model.add(Dense(1, activation = 'sigmoid')) # Output

### Model compile [optimizers and loss functions]

Keras model should be "compiled" prior to training

Types of loss (function) and optimizer should be designated


In [45]:
sgd = optimizers.Adam(lr = 0.001)

In [46]:
model.compile(optimizer = sgd, 
              loss = 'binary_crossentropy', 
              metrics=['accuracy'])

Binary Cross entropy: https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/categorical-crossentropy

### Summary of the model

In [47]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 64)                1920      
_________________________________________________________________
dense_13 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 33        
Total params: 4,033
Trainable params: 4,033
Non-trainable params: 0
_________________________________________________________________


### Training [Forward pass and Backpropagation]

Training the model

In [48]:
model.fit(X_train, 
          y_train.values, 
          batch_size = 700, 
          epochs = 10, 
          verbose = 1)

Train on 227845 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fba026eead0>

### Evaluation
Keras model can be evaluated with evaluate() function

Evaluation results are contained in a list



In [None]:
results = model.evaluate(X_test, y_test.values, verbose = 0)


In [51]:
print(model.metrics_names)
print(results)    

['loss', 'accuracy']
[0.003805335954387563, 0.9992276]


### Confusion Matrix

In [52]:
Y_pred_cls = model.predict_classes(X_test, batch_size=200, verbose=0)

In [103]:
print('Accuracy Model (Dropout): '+ str(results[1]))
print('Recall_score: ' + str(recall_score(y_test,Y_pred_cls)))
print('Precision_score: ' + str(precision_score(y_test, Y_pred_cls)))
print('F-score: ' + str(f1_score(y_test,Y_pred_cls)))
confusion_matrix(y_test, Y_pred_cls)

Accuracy Model (Dropout): 0.99824446
Recall_score: 0.54
Precision_score: 0.8307692307692308
F-score: 0.6545454545454545


array([[56851,    11],
       [   46,    54]])

#### Feel free to experiment with the model and get to better evaluation metric scores. 
Happy Learning!

## Bonus - Lets Try an additional layer

ADAM Optimizer: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/#:~:text=Adam%20is%20a%20replacement%20optimization,sparse%20gradients%20on%20noisy%20problems.

29/64/32/16/1

In [79]:
sgd = optimizers.Adam(lr = 0.001)

model = Sequential()

model.add(Dense(64, input_shape = (29,), activation = 'relu'))
model.add(Dense(32, activation = 'tanh'))
model.add(Dense(16, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(optimizer = sgd, loss = 'binary_crossentropy', metrics=['accuracy'])

model.summary()

model.fit(X_train, y_train.values, batch_size = 700, epochs = 10, verbose = 1)

results = model.evaluate(X_test, y_test.values, verbose = 0)

print(model.metrics_names)
print(results)    


Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_31 (Dense)             (None, 64)                1920      
_________________________________________________________________
dense_32 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_33 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_34 (Dense)             (None, 1)                 17        
Total params: 4,545
Trainable params: 4,545
Non-trainable params: 0
_________________________________________________________________
Train on 227845 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [80]:
Y_pred_cls = model.predict_classes(X_test, batch_size=200, verbose=0)
print('Accuracy Model (Dropout): '+ str(results[1]))
print('Recall_score: ' + str(recall_score(y_test.values,Y_pred_cls)))
print('Precision_score: ' + str(precision_score(y_test.values, Y_pred_cls)))
print('F-score: ' + str(f1_score(y_test.values,Y_pred_cls)))
confusion_matrix(y_test.values, Y_pred_cls)

Accuracy Model (Dropout): 0.99942064
Recall_score: 0.8
Precision_score: 0.8602150537634409
F-score: 0.8290155440414508


array([[56849,    13],
       [   20,    80]])

## Bonus - Lets change the layer structure to 29/100/50/1

In [100]:
sgd = optimizers.Adam(lr = 0.001)
model = Sequential()

model.add(Dense(100, input_shape = (29,), activation = 'tanh'))
model.add(Dense(50, activation = 'tanh'))
model.add(Dense(1, activation = 'tanh'))

model.compile(optimizer = sgd, loss = 'binary_crossentropy', metrics=['accuracy'])

model.summary()

model.fit(X_train, y_train.values, batch_size = 700, epochs = 10, verbose = 1)

results = model.evaluate(X_test, y_test.values, verbose = 0)


Model: "sequential_27"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_86 (Dense)             (None, 100)               3000      
_________________________________________________________________
dense_87 (Dense)             (None, 50)                5050      
_________________________________________________________________
dense_88 (Dense)             (None, 1)                 51        
Total params: 8,101
Trainable params: 8,101
Non-trainable params: 0
_________________________________________________________________
Train on 227845 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [90]:
Y_pred_cls = model.predict_classes(X_test, batch_size=200, verbose=0)
print('Accuracy Model (Dropout): '+ str(results[1]))
print('Recall_score: ' + str(recall_score(y_test.values,Y_pred_cls)))
print('Precision_score: ' + str(precision_score(y_test.values, Y_pred_cls)))
print('F-score: ' + str(f1_score(y_test.values,Y_pred_cls)))
confusion_matrix(y_test.values, Y_pred_cls)

Accuracy Model (Dropout): 0.99899936
Recall_score: 0.54
Precision_score: 0.8307692307692308
F-score: 0.6545454545454545


array([[56851,    11],
       [   46,    54]])