In [1]:
pip install matplotlib numpy pandas torch scikit-learn torchmetrics

Note: you may need to restart the kernel to use updated packages.


I will be using the credit card fraud dataset https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud, which has 30 input features and about 285k samples. The dataset is used to predict whether a credit card transaction is fraudulent, and most of the source features have been converted to PCA components. To build the deep learning model that predicts whether transactions are fraudulent, I will use the PyTorch ML library.

## Task 1:
The primary tools I will be using are PyTorch for the neural network and scikit-learn for the metrics.
Resources:

Data Processing
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
* https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler
* https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
These links cover splitting the data and how PyTorch data loaders operate. They provide convenient ways to preprocess and batch the data fed into the neural network the native PyTorch way, by defining a class that represents the dataset.

Neural Network
* https://pytorch.org/tutorials/beginner/nn_tutorial.html
* https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html#torch.nn.Sequential
* https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html

These links show how neural networks are defined in PyTorch and how the architecture is represented by a custom class that inherits from `nn.Module`. This will help in defining the structure of the NN with two hidden layers. They also describe how data flows through the forward pass, which defines the model's execution.

Backpropagation and training
* https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html
* https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

PyTorch (like TensorFlow) automatically tracks the operations applied to tensors so that gradients can be computed for backpropagation, but you need to know which functions trigger that computation. `backward()` is called on the loss to compute the gradient of the loss with respect to each parameter, and the optimizer then uses those gradients to update the weights. The optimizer is linked to the model by passing it the model's parameters. Because gradients accumulate across calls to `backward()`, they must be zeroed in each training iteration so that every batch contributes only its own gradient. Repeating this process over the data gradually updates the weights of the NN and trains the model.
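To make the point about zeroing gradients concrete, here is a minimal sketch on a toy tensor. The variable names are illustrative and are not part of the model built later. Calling `backward()` a second time without clearing adds the new gradient to the old one, so the accumulated gradient is twice the first.

```python
import torch

# Toy parameter; in the real model these would be the layer weights.
w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w.
(w ** 2).sum().backward()
first_grad = w.grad.clone()

# Second backward pass without zeroing: the gradients accumulate.
(w ** 2).sum().backward()
accumulated_grad = w.grad.clone()

# Clearing the gradient gives the next batch a clean slate.
w.grad.zero_()
(w ** 2).sum().backward()
fresh_grad = w.grad.clone()

print(first_grad, accumulated_grad, fresh_grad)
```

In a training loop, `optimizer.zero_grad()` plays the role of `w.grad.zero_()` for all parameters at once.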

* https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html

This is the loss function applied after the forward pass; its gradient determines the size and direction of the weight updates computed in the backward pass.

* https://pytorch.org/docs/stable/optim.html
* https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam

These describe how optimizers work in PyTorch, and specifically the Adam optimizer I will be using, which applies the computed gradients to update the weights according to the learning rate.


## Task 2:
Design and implementation

In [2]:
import pandas as pd
import numpy as np
import torch

### (1) Exploratory Data Analysis

In [3]:
# EDA
filename = "creditcard.csv"
cc_data = pd.read_csv(filename)
cc_vector = cc_data.to_numpy()
print("Size of data: ",cc_vector.shape)
print(f"{np.sum(cc_vector[:,-1])} Positive cases")
print(f"{cc_vector.shape[0] - np.sum(cc_vector[:,-1])} Negative cases")
cc_data.head()

Size of data:  (284807, 31)
492.0 Positive cases
284315.0 Negative cases


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


From this we can see that the dataset is extremely unbalanced in favor of non-fraudulent transactions, which makes sense considering this is real-world financial data from a limited time frame. Looking at the head of the table, we can also see each transaction's values along the PCA components. Unfortunately, we cannot attach semantic meaning to each PC since the original features were not shared, but we can see that there are 28 components (later we will see that they decrease in importance) along with the amount, time, and classification (fraud or not). This means we will have 28 input features (29 if the amount is taken into account) and one binary output class.
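To put a number on the imbalance, the short calculation below uses the class counts printed above. It also shows the accuracy of a trivial classifier that labels every transaction as legitimate, which is a useful reference point when reading accuracy figures later on.

```python
# Class counts taken from the output above.
n_fraud = 492
n_legit = 284_315
n_total = n_fraud + n_legit

fraud_fraction = n_fraud / n_total
# Accuracy of a classifier that predicts "legitimate" for every row.
baseline_accuracy = n_legit / n_total

print(f"Fraud share: {100 * fraud_fraction:.3f}%")
print(f"All-legitimate baseline accuracy: {100 * baseline_accuracy:.3f}%")
```

Because fewer than 0.2% of rows are fraud, accuracy alone says little about how well fraud is detected.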

In [4]:
cc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Here we confirm the memory use and variable types. All features are floats, so no conversion is needed, and the entire dataset is only 67MB, so it fits in memory without an issue.

In [5]:
cc_data.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


From this we can see that the largest single transaction was about 25.7k and that, among the columns shown, the PCA components range from roughly -114 to 121. Since the amount of each transaction covers a much larger range, we may need to normalize it before passing it to the deep learning model.
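As a rough sketch of what normalizing the amount could look like, the snippet below applies two common options to the five `Amount` values from the head of the table. These five values are used purely for illustration; a real pipeline would compute the statistics on the training split.

```python
import math
import statistics

# Amount values from the first five rows shown earlier (illustrative only).
amounts = [149.62, 2.69, 378.66, 123.5, 69.99]

# Option 1: z-score standardization (subtract the mean, divide by the std).
mean = statistics.fmean(amounts)
std = statistics.pstdev(amounts)
z_scores = [(a - mean) / std for a in amounts]

# Option 2: log1p compresses the long right tail of transaction amounts.
log_amounts = [math.log1p(a) for a in amounts]

print([round(z, 3) for z in z_scores])
print([round(v, 3) for v in log_amounts])
```

Standardization matches the scale of the PCA components, while the log transform is an alternative worth considering for heavily skewed values such as a 25k outlier.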

In [6]:
variable = cc_vector[:,2:]
correlations = np.corrcoef(variable,rowvar=False)
print(correlations.min())
print(np.sum(np.abs(correlations),axis=0))
corr = cc_data.iloc[:,2:].corr()
corr.style.background_gradient(cmap='coolwarm')

-0.5314089393280333
[1.62269759 1.4038413  1.23217915 1.48133056 1.25962434 1.58456787
 1.12295422 1.14197829 1.31838508 1.15497962 1.27013473 1.00986319
 1.33629487 1.00720925 1.20044847 1.33379011 1.14713559 1.0909338
 1.35949373 1.14641231 1.06560596 1.11531771 1.01236712 1.05114457
 1.00766343 1.04640519 1.01979426 3.96793462 3.5858812 ]


Unnamed: 0,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
V2,1.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.531409,0.091289
V3,0.0,1.0,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.21088,-0.192961
V4,-0.0,0.0,1.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.098732,0.133447
V5,0.0,-0.0,-0.0,1.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,-0.386356,-0.094974
V6,0.0,0.0,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.215981,-0.043643
V7,0.0,0.0,-0.0,0.0,0.0,1.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.397311,-0.187257
V8,-0.0,-0.0,0.0,0.0,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.103079,0.019875
V9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.044246,-0.097733
V10,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,1.0,-0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.101502,-0.216883
V11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.000104,0.154876


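The correlation matrix does not tell us much about the components themselves, since PCA components are uncorrelated by construction, and the identity block confirms this. (The slice `[:,2:]` starts at V2, so V1 is not included.) However, from the two rightmost columns we can see that the earlier components tend to correlate more strongly with the dollar amount, and to a lesser degree with the class. This is consistent with the components being sorted from most to least significant, and the decreasing standard deviations in the `describe()` output point the same way.
With the information below we can calculate the dollar amount this model would have saved had it been run on this data.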

In [7]:
print(f"Total amount of USD represented in the dataset: ${int(cc_vector[:,-2].sum()):,}")
# Sum the Amount column separately for fraudulent (Class == 1) and legitimate (Class == 0) rows
fraud_usd = cc_vector[cc_vector[:,-1]==1,-2].sum()
legit_usd = cc_vector[cc_vector[:,-1]==0,-2].sum()
print(f"Total USD in legitimate transactions: ${int(legit_usd):,}")
print(f"Total USD in fraudulent transactions: ${int(fraud_usd):,}")

Total amount of USD represented in the dataset: $25,162,590
Total USD in legitimate transactions: $25,102,462
Total USD in fraudulent transactions: $60,127


### (2) Performing a train-dev-test split 

In [57]:
from sklearn.preprocessing import StandardScaler

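# V1..V28 only: the slice 1:-2 drops Time (first column) and Amount and Class (last two columns)
x = cc_vector[:,1:-2]
y = cc_vector[:,-1]
# normalize by subtracting the mean and dividing by the standard deviation
# (note: the scaler is fit on the full dataset here, before the split)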
x = StandardScaler().fit_transform(x)

print("Data:",x.shape)
print("Labels",y.shape)
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.10, stratify=y)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.10, stratify=y_train)

print("Train size: ",y_train.shape,"\t Validation size",y_valid.shape,"\t Test size",y_test.shape)


Data: (284807, 28)
Labels (284807,)
Train size:  (230693,) 	 Validation size (25633,) 	 Test size (28481,)


In [56]:
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
class ccDataset(Dataset):
    def __init__(self,x,y):
        self.x = torch.tensor(x,dtype = torch.float32)
        self.y = torch.tensor(y,dtype = torch.float32)

    def __getitem__(self,idx):
        return self.x[idx],self.y[idx]
    def __len__(self):
        return self.x.shape[0]

# Creating torch dataset
weights = y_train.copy()
print(weights.sum())
print(len(weights))
weights = weights*600+0.5
sampler = WeightedRandomSampler(weights,num_samples=len(weights), replacement = True)
batch_size = 32
train_loader = DataLoader(ccDataset(x_train,y_train),sampler = sampler, batch_size = batch_size)
valid_loader = DataLoader(ccDataset(x_valid,y_valid), batch_size = batch_size, shuffle = True)
test_loader = DataLoader(ccDataset(x_test,y_test), batch_size = batch_size, shuffle = True)

399.0
230693


The data has been split roughly 81/9/10 into train, validation, and test sets (10% is held out for testing, then 10% of the remainder for validation). In addition, a large sampling weight was applied to the fraud cases in the training set to help with the class imbalance.
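The arithmetic behind the split sizes and the sampler can be checked with a few lines. This is a sketch that assumes the held-out count is rounded up at each split, which reproduces the sizes printed above. The weights 600.5 and 0.5 come from `weights*600+0.5` in the cell above.

```python
import math

n_total = 284_807

# First split: 10% held out for testing (assumes the count is rounded up).
n_test = math.ceil(0.10 * n_total)
n_rest = n_total - n_test
# Second split: 10% of the remainder held out for validation.
n_valid = math.ceil(0.10 * n_rest)
n_train = n_rest - n_valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} ({100 * n / n_total:.1f}%)")

# Sampler weights: 600.5 per fraud row, 0.5 per legitimate row.
n_fraud_train = 399
n_legit_train = n_train - n_fraud_train
fraud_mass = n_fraud_train * (1 * 600 + 0.5)
legit_mass = n_legit_train * (0 * 600 + 0.5)
p_fraud = fraud_mass / (fraud_mass + legit_mass)
print(f"Expected fraud share of sampled rows: {100 * p_fraud:.1f}%")
```

With these weights the sampler is expected to draw fraud rows about two thirds of the time, so the training batches are tilted toward fraud rather than balanced 1:1.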

### (3) Forward propagation 
(clearly describe the activation functions and other hyper-parameters you are using).

In [55]:
from torch import nn

# Defining the model
class DeepModel(nn.Module):
    def __init__(self):
        super(DeepModel, self).__init__()
        self.flat = nn.Flatten()
        self.relu_layer = nn.Sequential(
            nn.Linear(28,128),
            nn.ReLU(),
            nn.Linear(128,128),
            nn.ReLU(),
            nn.Linear(128,1),
            nn.Sigmoid())
        
    def forward(self,x):
        flatter = self.flat(x)
        out = self.relu_layer(flatter)
        return out
print(DeepModel().to("cpu"))

DeepModel(
  (flat): Flatten(start_dim=1, end_dim=-1)
  (relu_layer): Sequential(
    (0): Linear(in_features=28, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=1, bias=True)
    (5): Sigmoid()
  )
)


Here we can see that this NN with two hidden layers uses the ReLU activation function: it maps the input to 128 hidden neurons, then to another 128, before reducing to a single output with a sigmoid activation that squashes the output into the range 0-1 for binary classification. The model has 28 input features (all floats) and a single output. All layers are fully connected and feed-forward, as defined by the nn.Sequential module. This concludes the description of the forward pass.
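As a quick sanity check on the size of this architecture, the trainable parameters can be counted by hand from the layer sizes in the printed model. Each `Linear` layer has one weight per input/output pair plus one bias per output.

```python
# Layer sizes from the printed model: 28 -> 128 -> 128 -> 1.
layer_sizes = [28, 128, 128, 1]

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per input/output pair
    biases = n_out           # one bias per output unit
    total_params += weights + biases
    print(f"Linear({n_in}, {n_out}): {weights + biases} parameters")

print("Total trainable parameters:", total_params)
```

ReLU and Sigmoid add no parameters, so the whole model is small compared with the roughly 230k training rows.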

### (4) Compute the final cost function. 
(described below based on binary cross entropy)
### (5) Implement gradient descent
(Any variant of gradient descent can be used, depending on your data and project. In this step it is up to you, as someone in charge of their project, to improvise using optimization algorithms (Adam, RMSProp, etc.) and/or regularization.)

In [54]:
# Tested learning rates: 1e-4, 1e-5, 1e-6, 1e-7
learning_rate = 1e-6
# Tested weight regularization: 1e-4, 1e-5, 1e-6
weight_reg =1e-5
# Tested epochs: 5 -> 20
epochs = 5
model = DeepModel()

# Adam (adaptive moment estimation), discussed in class, scales each parameter's step using running averages of the gradient and its square
optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate, weight_decay=weight_reg)
# binary cross entropy: cross entropy specialized to two classes, applied to the sigmoid output
loss = nn.BCELoss()

Here we can see that the selected cost function for the NN is binary cross entropy
https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
This loss is cross entropy specialized to the two-class case: for a predicted fraud probability p and a ground-truth label y it is -(y log p + (1 - y) log(1 - p)), so confident wrong predictions are penalized heavily. Here the loss compares the predicted probability of the fraud class (fraudulent vs. non-fraudulent transaction) with the binary ground truth.
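As a small worked example of binary cross entropy for a single prediction, the sketch below evaluates the loss by hand. The function name is illustrative; `nn.BCELoss` computes the same per-sample quantity and, by default, averages it over the batch.

```python
import math

def binary_cross_entropy(p: float, y: float) -> float:
    """Loss for one prediction p in (0, 1) against a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(binary_cross_entropy(0.9, 1))
print(binary_cross_entropy(0.1, 1))
# An undecided prediction of 0.5 costs log(2) whatever the label is.
print(binary_cross_entropy(0.5, 1))
```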

In [51]:
from torchmetrics import ConfusionMatrix
def evaluation(model,dataloader,lossfunct):
    correct_count=0
    correct_pos=0
    total_loss = 0
    total=0
    cmap = np.zeros((2,2))
    with torch.no_grad():
        for (x,y) in dataloader:
            output = model(x)
            total+=output.shape[0]
            correct_count += (np.rint(output.detach().numpy()) == y.reshape(-1,1).detach().numpy()).sum()
            correct_pos +=  (np.rint(output.detach().numpy()) == (y.reshape(-1,1).detach().numpy() == np.ones((y.shape[0],1)))).sum()
            total_loss += lossfunct(output,y.reshape(-1,1)) 
            
            confmat = ConfusionMatrix(num_classes=2)
            cmap +=np.asarray(confmat(output, y.int()))

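    # torchmetrics puts true classes on rows and predicted classes on columns,
    # so cmap = [[TN, FP], [FN, TP]] with fraud (1) as the positive class.
    # NOTE: under that layout "Precision" below is TN/(TN+FP) (specificity), "Recall" is TN/(TN+FN)
    # and "Specificity" is TP/(TP+FP) (fraud precision). Labels are kept to match the recorded output.
    print("Confusion matrix: \n",cmap)
    print(f"Precision {100*(cmap[0][0]/(cmap[0][0]+cmap[0][1])):0.2f}%")
    print(f"Accuracy {100*((cmap[0][0]+cmap[1][1])/(np.sum(cmap))):0.2f}%")
    print(f"Recall  {100*(cmap[0][0]/(cmap[0][0]+cmap[1][0])):0.2f}%")
    print(f"Specificity  {100*(cmap[1][1]/(cmap[1][1]+cmap[0][1])):0.2f}%")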

    return total_loss, correct_count
    
    

# training loop
losses = []
accuracies = []
for i in range(epochs):
    correct = 0
    total_training = 0
    for j, (xb, yb) in enumerate(train_loader):
        output = model(xb)

        correct += (np.rint(output.detach().numpy()) == yb.reshape(-1,1).detach().numpy()).sum()
        lossval = loss(output,yb.reshape(-1,1)) 
        total_training+=lossval
        # Remove gradient from previous exec
        optimizer.zero_grad()
        
        # BackProp to compute gradient
        lossval.backward()
        
        # Apply updates to model
        optimizer.step()
     
    print(f"Epoch #{i}")
    print(f"Training Accuracy: {100*float(correct)/(len(train_loader)*batch_size):0.2f}%")
    print(f"Training Loss: {total_training}")
    print("Validation:")
    val_loss, val_corr = evaluation(model,valid_loader,loss)
    print(f"Valid Loss: {val_loss}")
    print()

Epoch #0
Training Accuracy: 70.91%
Training Loss: 3053.04443359375
Validation:
Confusion matrix: 
 [[5.5500e+03 2.0039e+04]
 [1.0000e+00 4.3000e+01]]
Precision 21.69%
Accuracy 21.82%
Recall  99.98%
Specificity  0.21%
Valid Loss: 596.673095703125

Epoch #1
Training Accuracy: 84.38%
Training Loss: 2249.88232421875
Validation:
Confusion matrix: 
 [[2.1215e+04 4.3740e+03]
 [1.0000e+00 4.3000e+01]]
Precision 82.91%
Accuracy 82.93%
Recall  100.00%
Specificity  0.97%
Valid Loss: 501.9655456542969

Epoch #2
Training Accuracy: 92.98%
Training Loss: 1877.2794189453125
Validation:
Confusion matrix: 
 [[2.4101e+04 1.4880e+03]
 [3.0000e+00 4.1000e+01]]
Precision 94.19%
Accuracy 94.18%
Recall  99.99%
Specificity  2.68%
Valid Loss: 399.75

Epoch #3
Training Accuracy: 93.71%
Training Loss: 1610.913330078125
Validation:
Confusion matrix: 
 [[2.4558e+04 1.0310e+03]
 [3.0000e+00 4.1000e+01]]
Precision 95.97%
Accuracy 95.97%
Recall  99.99%
Specificity  3.82%
Valid Loss: 321.7656555175781

Epoch #4
Trainin

Here we can observe mini-batch gradient descent (batch size 32) being employed. The Adam optimizer was used to improve on plain mini-batch gradient descent, together with weight regularization (the choice of hyperparameters is discussed below). The key steps in PyTorch are: run the forward pass to predict the class while autograd records the operations, compute the loss from the loss function and the ground truth, zero out any gradients left over from the previous batch, call `backward()` on the loss to compute fresh gradients, and finally call the optimizer step, which updates the weights. Through this iterative process the model trains in 5 epochs.


### (6) Present the results using the test set

In [53]:
print("~Final Test Results~")
test_loss, test_corr = evaluation(model,test_loader,loss)
print(f"Test Loss: {test_loss}")

~Final Test Results~
Confusion matrix: 
 [[27470.   966.]
 [    0.    45.]]
Precision 96.60%
Accuracy 96.61%
Recall  100.00%
Specificity  4.45%
Test Loss: 292.41851806640625


## Analysis
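From these test results we can see that the NN catches every fraud case in the test set (0 of 45 missed), but it is biased toward predicting fraud: 966 of 28,436 legitimate transactions are flagged, so only about 4.45% of its fraud predictions are correct. Note that the confusion matrix has true classes on its rows, so with fraud as the positive class the printed "Precision" (96.60%) is actually the specificity and the printed "Specificity" (4.45%) is actually the fraud precision. In the future it may help to try different sampling techniques to mitigate this bias.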
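The standard fraud-class metrics can be recomputed directly from the test confusion matrix printed above. This sketch assumes rows are true classes and columns are predicted classes, which is consistent with the row totals (28,436 legitimate and 45 fraud rows in the test set).

```python
# Test-set confusion matrix from the output above.
# Rows are true classes, columns are predicted classes; class 1 is fraud.
cm = [[27470, 966],
      [0, 45]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)      # share of fraud alerts that are real fraud
recall = tp / (tp + fn)         # share of real fraud that is caught
specificity = tn / (tn + fp)    # share of legitimate transactions left alone
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision   {100 * precision:.2f}%")
print(f"Recall      {100 * recall:.2f}%")
print(f"Specificity {100 * specificity:.2f}%")
print(f"Accuracy    {100 * accuracy:.2f}%")
```

Read this way, the model trades a low fraud precision for perfect fraud recall on this split.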

## Task 3: Hyperparameter selection

For this stage I decided to go with a pseudo grid search (an actual grid search would have taken too long). I essentially performed a guided grid search by changing a single variable at a time. The procedure was to try each hyperparameter at different orders of magnitude and rerun the model (the different trials are listed in the code comments). This gave me a general idea of which hyperparameters were most important and eventually resulted in the model's success, as seen from the test results in step 6.
Each hyperparameter and rationale:
 
learning_rate = 1e-6 | Various other orders of magnitude were tested and this provided optimal results

weight_reg =1e-5  | Common weight regularization coefficient to reduce overfitting

epochs = 5  | Epochs decided after watching when the validation loss exceeds training loss (sign of overfitting)

batch_size=32 | Small batch size chosen to reduce memory consumption and increase convergence speed

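sampling weight = 600.5 (fraud) vs. 0.5 (legitimate), about 1200:1 | It is very important to weight the fraud cases highly to help with the extremely unbalanced dataset. The weights are applied through the `WeightedRandomSampler` rather than in the loss function. With 399 fraud and 230,294 legitimate rows in the training split, fraud rows are expected to make up roughly two thirds of the sampled rows, which is about 2:1 fraud to legitimate.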
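The one-variable-at-a-time procedure described above can be sketched as a short loop. The function `train_and_score` is a hypothetical stand-in for a full training run that returns a validation score; here it is a toy function so the search logic itself can be executed.

```python
# Hypothetical stand-in for "train the model and return a validation score".
# In the notebook this would be a full training run; the toy formula below
# only exists so the loop can run.
def train_and_score(config: dict) -> float:
    lr_penalty = abs(config["learning_rate"] - 1e-6) * 1e5
    reg_penalty = abs(config["weight_reg"] - 1e-5) * 1e4
    return -(lr_penalty + reg_penalty)

search_space = {
    "learning_rate": [1e-4, 1e-5, 1e-6, 1e-7],
    "weight_reg": [1e-4, 1e-5, 1e-6],
}

# Start from a baseline and vary one hyperparameter at a time,
# keeping the best value found before moving to the next one.
best = {"learning_rate": 1e-4, "weight_reg": 1e-4}
for name, candidates in search_space.items():
    scores = {}
    for value in candidates:
        trial = {**best, name: value}
        scores[value] = train_and_score(trial)
    best[name] = max(scores, key=scores.get)

print(best)
```

This needs 4 + 3 = 7 runs instead of the 4 x 3 = 12 of a full grid, at the cost of possibly missing interactions between hyperparameters.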

Use of weight regularization
I decided to add weight regularization (coefficient above) since it can help reduce overfitting, which is important when the dataset has only a few hundred examples of fraud against hundreds of thousands of non-fraud examples. The deep learning model could simply reduce all outputs to 0 and still achieve high accuracy; weight regularization limits its ability to use extreme weights to memorize the data, which helps with overfitting and generalization.

Optimization Algo:
For this project I decided to use the Adam (adaptive moment estimation) optimizer because it is a standard choice for binary classification problems and adapts the step size of each parameter as training progresses. Adam does not lower the learning rate in response to the loss; instead it scales each update using running averages of the gradient and of its square, which makes training less sensitive to the raw gradient scale. That can be helpful in datasets like this one, which are extremely unbalanced.
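To show what one Adam update does, here is a minimal sketch for a single scalar parameter. The function name and defaults are illustrative: the learning rate and weight decay match the values chosen above, and the remaining defaults match the usual Adam defaults. It is a simplified illustration, not the library implementation.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-6, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-5):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    # L2 regularization enters as an extra term added to the gradient.
    grad = grad + weight_decay * theta
    # Running averages of the gradient and of its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# On the first step the update size is about lr, whatever the gradient scale.
small, _, _ = adam_step(theta=0.0, grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(theta=0.0, grad=100.0, m=0.0, v=0.0, t=1)
print(small, large)
```

Both calls move the parameter by roughly the learning rate even though the gradients differ by four orders of magnitude, which is the per-parameter scaling the paragraph above refers to.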

## Task 4: Comparison with another model

In [28]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# Bagging classifier built from K-nearest-neighbors base estimators
# (scikit-learn >= 1.2 renames the base_estimator argument to estimator)
bag_model = BaggingClassifier(base_estimator=KNeighborsClassifier(),n_estimators =5 )

bag_model.fit(x_train,y_train)
y_pred = bag_model.predict(x_train)
print("~Training~")
print(f"Model Training Accuracy: {100*metrics.accuracy_score(y_train, y_pred):0.4f}%")
two_cof = confusion_matrix(y_train, y_pred)
print("Confusion Matrix: \n",two_cof)

y_pred = bag_model.predict(x_valid)
print("~Validation~")
print(f"Model Validation Accuracy: {100*metrics.accuracy_score(y_valid, y_pred):0.4f}%")
two_cof = confusion_matrix(y_valid, y_pred)
print("Confusion Matrix: \n",two_cof)

y_pred = bag_model.predict(x_test)
print("~Testing~")
print(f"Model Testing Accuracy: {100*metrics.accuracy_score(y_test, y_pred):0.4f}%")

two_cof = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n",two_cof)
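# NOTE: sklearn's matrix is [[TN, FP], [FN, TP]] (rows = true class), so with fraud as positive "Precision" below
# is specificity, "Recall" is TN/(TN+FN) and "Specificity" is fraud precision. Labels kept to match the output.
print(f"Precision {100*(two_cof[0][0]/(two_cof[0][0]+two_cof[0][1])):0.2f}%")
print(f"Accuracy {100*((two_cof[0][0]+two_cof[1][1])/(np.sum(two_cof))):0.2f}%")
print(f"Recall  {100*(two_cof[0][0]/(two_cof[0][0]+two_cof[1][0])):0.2f}%")
print(f"Specificity  {100*(two_cof[1][1]/(two_cof[1][1]+two_cof[0][1])):0.2f}%")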
print()
# K-fold
print("Performing K-fold validation")
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(bag_model, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print(n_scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
print("Done")

~Training~
Model Training Accuracy: 99.9562%
Confusion Matrix: 
 [[230274     20]
 [    81    318]]
~Validation~
Model Validation Accuracy: 99.9571%
Confusion Matrix: 
 [[25588     1]
 [   10    34]]
~Testing~
Model Testing Accuracy: 99.9965%
Confusion Matrix: 
 [[28436     0]
 [    1    44]]
Precision 100.00%
Accuracy 100.00%
Recall  100.00%
Specificity  100.00%

Performing K-fold validation
[0.99943822 0.99943822 0.99957867 0.99954356 0.99961378 0.99943822
 0.99957867 0.99954354 0.99947331 0.99950843 0.99943822 0.99954356
 0.99954356 0.99947333 0.99957867 0.99957867 0.99933289 0.99968399
 0.99950843 0.99954354 0.99950844 0.99964889 0.99922756 0.99971911
 0.99957867 0.999368   0.99964889 0.99964888 0.99947331 0.99936798]
Accuracy: 1.000 (0.000)
Done


### Analysis
For this model comparison I chose to rerun the steps with an ensemble method: a bagging ensemble of K-nearest-neighbors classifiers (`KNeighborsClassifier`). This approach took extremely long to run, because several models must be trained and K-nearest neighbors must search nearly 300k examples at prediction time, but it outperformed the NN-based approach on accuracy. From the above statistics we can see that it achieves nearly perfect accuracy on the training, validation, and test sets with minimal hyperparameter tuning. The confusion matrix is also more balanced than the NN's, which trails in accuracy at about 96.6%: on the test set the ensemble raises no false alarms and misses 1 of 45 fraud cases, whereas the NN misses none but flags 966 legitimate transactions. Accuracy alone is dominated by the majority class, though, and on the validation set the ensemble misses 10 of 44 fraud cases.

This model likely outperformed the deep learning model because of the extremely unbalanced dataset. To mitigate the imbalance in the NN, extreme sampling weights were given to the few fraud cases, which added training instability and a tendency for the model to favor a single class. K-nearest neighbors, on the other hand (with the drawback of poor scaling), classifies each point by a vote among its closest training examples, so its predictions depend on the local neighborhood rather than on the global class ratio. In addition, the features are principal components from PCA, so from a geometric perspective they already define coordinates in an orthogonal basis, and distance-based classification fits the form of the input data well, while the NN makes no such assumption.

Overall, both models perform well, so if computation is no concern, an ensemble of K-nearest-neighbors classifiers solves this fraud detection problem better than a traditional deep neural network.
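For intuition about why a distance-based vote copes with imbalance, here is a minimal K-nearest-neighbors sketch on toy 2-D data. The function and data are illustrative only and are not the scikit-learn implementation used above; the default of 5 neighbors mirrors the `KNeighborsClassifier` default.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: a larger legitimate cluster and a small, distant fraud cluster.
points = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (0.3, -0.1), (0.0, 0.3),
          (-0.1, -0.2), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (5.0, 5.1), k=3))
print(knn_predict(points, labels, (0.1, 0.0), k=3))
```

A query near the small fraud cluster is still labelled fraud even though fraud is the minority, because only the nearest points vote. Bagging repeats this on resampled training sets and combines the predictions.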