<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/CreditCardUnsupervised_Sensitivity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Fraud Detection - Sensitivity Study


**Introduction**


This Jupyter notebook shows how sensitive several unsupervised outlier detection models are to changes in their main parameters, measured by the ROC AUC against the known fraud labels.


The data was taken from https://www.kaggle.com/mlg-ulb/creditcardfraud, and downsampled for the purpose of this masterclass. 



In [None]:
# Download the data and helper module from GitHub (skipped if already present)

import os
import pandas as pd

force_download = False
if force_download or not os.path.exists('X_unsupervised.csv.zip'):
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_unsupervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_unsupervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/ml_utils.py
X = pd.read_csv('X_unsupervised.csv.zip')
X = X.drop(columns='Time')
y = pd.read_csv('y_unsupervised.csv.zip')['Class']

We use the "pandas" package for data handling and manipulation, and later "scikit-learn" (imported as "sklearn") and "pyod" for the outlier detection algorithms.

# Outlier algorithms

Go to the section of the outlier algorithm that was assigned to you, or that you chose, to generate your scores.
First run the cell below for the required imports.


In [None]:
# from sklearn.neighbors import LocalOutlierFactor
# !pip install seaborn==0.11.1 # Needed for plotting
# !pip install tensorflow
# !pip install pyod
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
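try:
    from pyod.models.auto_encoder import AutoEncoder
except ModuleNotFoundError:
    # pyod is not installed yet: install it, then retry the import
    !pip install pyod
    from pyod.models.auto_encoder import AutoEncoder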
    

## Mahalanobis Distance

In [None]:
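# Fit a single Gaussian (mean and covariance) to all transactions
cov = EmpiricalCovariance()
cov.fit(X)
# mahalanobis() returns squared distances; larger means further from the bulk of the data
mah_outlier_scores = cov.mahalanobis(X)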

In [None]:
print(f'Mahalanobis AUC score: {roc_auc_score(y, mah_outlier_scores):.2f}')
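
To make explicit what the score above measures, here is a minimal NumPy sketch of the squared Mahalanobis distance. It runs on synthetic data (`X_demo` is a made-up stand-in, not the credit card data) and assumes a maximum-likelihood covariance estimate, which is my understanding of what `EmpiricalCovariance` fits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for X: 500 correlated 2-D points (hypothetical data)
X_demo = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[2.0, 1.2], [1.2, 1.0]], size=500
)

mu = X_demo.mean(axis=0)
# bias=True divides by n (maximum-likelihood estimate of the covariance)
sigma = np.cov(X_demo, rowvar=False, bias=True)
precision = np.linalg.inv(sigma)

centered = X_demo - mu
# Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu), one value per row
sq_mahalanobis = np.einsum('ij,jk,ik->i', centered, precision, centered)
```

Because the distance is squared, the ranking of the points (and therefore the AUC) is the same as for the plain Mahalanobis distance.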

## Gaussian Mixture


In [None]:
n_components_list = np.arange(2, 11)
gmm_scores_list = []
bic_list = []
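for n_components in n_components_list:
    gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=1, n_init=3)
    gmm.fit(X)
    # Outlier score = negative log-likelihood: unlikely points get high scores
    gmm_scores_list.append(roc_auc_score(y, -gmm.score_samples(X)))
    # BIC needs no labels; lower values indicate a better fit/complexity trade-off
    bic_list.append(gmm.bic(X))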
  

In [None]:
fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
ax1.plot(n_components_list, gmm_scores_list, 'g-', label='AUC')
ax2.plot(n_components_list, bic_list, 'r-', label='BIC')
plt.xticks(n_components_list)
ax1.set_xlabel('# Components')
ax1.set_ylabel('AUC score', color='g')
ax2.set_ylabel('BIC', color='r')
plt.title("AUC scores and BIC, GMM", fontsize=20)

plt.show()
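
The plot above shows the BIC next to the AUC because, unlike the AUC, the BIC can be computed without labels. As a rough sketch of how the BIC could be used to pick the number of components, the example below fits a few mixtures to synthetic data (`X_demo` and `candidates` are made up for illustration) and keeps the one with the lowest BIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated 2-D blobs (hypothetical data)
X_demo = np.vstack([
    rng.normal(loc=-4.0, scale=1.0, size=(300, 2)),
    rng.normal(loc=4.0, scale=1.0, size=(300, 2)),
])

candidates = [1, 2, 3, 4]
bics = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=1)
    gmm.fit(X_demo)
    bics.append(gmm.bic(X_demo))  # lower BIC is better

# Label-free choice of the number of components
best_k = candidates[int(np.argmin(bics))]
print(f'Lowest BIC at n_components={best_k}')
```

A low BIC does not guarantee a high AUC: the BIC rewards describing the bulk of the data well, which is not necessarily the same as separating fraud from non-fraud.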

## Nearest neighbours


In [None]:
kN_list = [5, 11, 31, 71, 201]
knn_scores_list = []
for kN in kN_list:
    nn = NearestNeighbors(n_neighbors=kN)
    nn.fit(X)
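    # kneighbors() without arguments queries the training points and excludes each point itself
    distances_to_neighbors = nn.kneighbors()[0]
    # Outlier score = mean distance to the k nearest neighbours
    knn_outlier_scores = np.mean(distances_to_neighbors, axis=1)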
    knn_scores_list.append(roc_auc_score(y, knn_outlier_scores))
    

In [None]:
plt.semilogx(kN_list, knn_scores_list, 'k-', label='AUC')
plt.xticks(kN_list)
plt.xlabel('k')
plt.ylabel('AUC score', color='k')
plt.title("AUC scores, kNN", fontsize=20)
plt.show()
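
One detail of the kNN score is easy to miss: calling `kneighbors()` without arguments leaves each point out of its own neighbour list, whereas `kneighbors(X)` returns the point itself as the nearest neighbour at distance 0. The sketch below illustrates the difference on synthetic data with one planted outlier (`X_demo` is hypothetical, not the credit card data).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[0] = [8.0, 8.0, 8.0]  # planted outlier, far from the Gaussian cloud

nn = NearestNeighbors(n_neighbors=5).fit(X_demo)
dist_excl_self, _ = nn.kneighbors()        # each point excluded from its own neighbours
dist_incl_self, _ = nn.kneighbors(X_demo)  # first neighbour is the point itself

# Outlier score: mean distance to the 5 nearest other points
scores = dist_excl_self.mean(axis=1)
print('Index with the highest score:', int(np.argmax(scores)))
```

Including the zero self-distance would lower every score, with the strongest effect at small k.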

## Isolation Forest algorithm

In [None]:
N_sample_list = [2**N for N in np.arange(8, 13)]
iforest_scores_list = [] 
for N_samples in N_sample_list:
    iforest = IsolationForest(n_estimators=100, max_samples=N_samples, random_state=1)  # fixed seed for reproducible AUCs
    iforest.fit(X)
    iforest_scores_list.append(roc_auc_score(y, -iforest.score_samples(X)))
    


In [None]:
plt.plot(N_sample_list, iforest_scores_list, 'k-', label='AUC')
plt.xticks(N_sample_list)
plt.xlabel('max_samples')
plt.ylabel('AUC score', color='k')
plt.title("AUC scores, iForest", fontsize=20)
plt.show()
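
An Isolation Forest builds its trees from random subsamples and random splits, so the AUC at a given `max_samples` depends on the seed. One way to judge whether differences between parameter values are meaningful is to repeat the fit over a few seeds. The sketch below does this on synthetic data (`X_demo`, `y_demo` and the seed range are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 980 inliers around 0 and 20 outliers around 5 (hypothetical data)
X_demo = np.vstack([
    rng.normal(size=(980, 4)),
    rng.normal(loc=5.0, size=(20, 4)),
])
y_demo = np.r_[np.zeros(980), np.ones(20)]

aucs = []
for seed in range(5):
    forest = IsolationForest(n_estimators=100, max_samples=256, random_state=seed)
    forest.fit(X_demo)
    # score_samples is higher for inliers, so negate it to get an outlier score
    aucs.append(roc_auc_score(y_demo, -forest.score_samples(X_demo)))

print(f'AUC mean {np.mean(aucs):.3f}, std {np.std(aucs):.3f}')
```

If the spread across seeds is comparable to the differences between `max_samples` values, the sensitivity curve should be read with caution.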

## Autoencoder

Autoencoders are a special type of neural network, trained to compress and then decompress a signal. The idea behind using them for outlier detection is that the network is expected to reconstruct "typical" datapoints well, whereas it will struggle with outliers, so a large reconstruction error indicates a likely outlier.
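
The compress/decompress idea can be sketched without a neural network. The example below swaps in a linear projection onto the top principal direction (PCA via SVD) as a stand-in for the encoder and decoder; it is not the pyod AutoEncoder, and the data are synthetic. The reconstruction error plays the role of the outlier score.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data lying close to a 1-D line in 3-D space (hypothetical data)
t = rng.normal(size=(300, 1))
X_demo = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(300, 3))
X_demo[0] = [3.0, -3.0, 3.0]  # planted point far from the line

mu = X_demo.mean(axis=0)
centered = X_demo - mu
# Principal directions from the SVD of the centered data
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:1]                       # "bottleneck" of width 1

codes = centered @ components.T           # compress to 1 number per point
reconstructed = codes @ components + mu   # decompress back to 3-D

# Typical points are reconstructed well; the planted point is not
recon_error = np.linalg.norm(X_demo - reconstructed, axis=1)
print('Index with the largest error:', int(np.argmax(recon_error)))
```

A neural autoencoder replaces the linear projection with nonlinear layers, but the scoring principle is the same.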

We use the pyod AutoEncoder class to construct the network. This way we can focus on the main parameters. 

Run the cells below to do the following for each bottleneck width:
- Create an AutoEncoder object
- Train this object on the data
- Get the outlier scores from the fitted object's `decision_scores_` attribute


In [None]:
autoenc_scores = []
X_scaled = MinMaxScaler().fit_transform(X)
add_width_list = np.array([0, 2, 4, 6])
mid_width = 3
end_width = 8
for add_width in add_width_list:
    clf = AutoEncoder(
        hidden_neurons=[end_width+add_width, mid_width+add_width, end_width+add_width], # Choose bottleneck here!
        hidden_activation='elu',
        output_activation='sigmoid', 
        optimizer='adam',
        epochs=5,
        batch_size=16,
        dropout_rate=0.0, #may not be needed here
        l2_regularizer=0.0,
        validation_size=0.1,
        preprocessing=False, # disabled: X is already scaled with MinMaxScaler (True would apply sklearn's StandardScaler)
        verbose=1,
        random_state=1,
    )
    clf.fit(X_scaled)
    autoenc_scores.append(roc_auc_score(y, clf.decision_scores_))

In [None]:
plt.plot(add_width_list + mid_width, autoenc_scores, 'k-', label='AUC')
plt.xticks(add_width_list + mid_width)
plt.xlabel('Bottleneck width')
plt.ylabel('AUC score', color='k')
plt.title("AUC scores, Autoencoder", fontsize=20)
plt.show()