<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/CreditCardUnsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Fraud Detection


**Introduction**


The purpose of this Jupyter notebook is to take you through several unsupervised outlier detection algorithms and to show their value for fraud detection. The key difference from the supervised case is that we assume we have too few labels to learn fraud patterns, as is often the case in fraud detection (fraud events are typically very rare). In applying outlier detection to fraud detection, we hypothesize that payment patterns that are atypical (outliers) are more likely to be fraudulent.


Generally, whenever an unsupervised approach is chosen, there are no labels available, neither for algorithm optimization nor for comparison or validation. For this masterclass, we do use a labeled dataset, but the labels will only be used at the very end, to get a feeling for how the various algorithms perform and compare.


The data was taken from https://www.kaggle.com/mlg-ulb/creditcardfraud, and downsampled for the purpose of this masterclass. 

Note that for each algorithm, we want outliers to get higher scores.


In [None]:
## Data import from Github
import os
if not os.path.exists('X_unsupervised.csv.zip'):
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_unsupervised.csv.zip

We will be using the "pandas" package for data handling and manipulation, and later "scikit-learn" (imported with "sklearn") for various outlier detection algorithms. 

In [None]:
## Package import: pandas for data handling and manipulation
import warnings
warnings.filterwarnings("ignore")
import pandas as pd

# A small hack: "monkey-patching" the DataFrame class to add column-wise normalization as a method
def normalize_columns(self):
    return (self - self.mean()) / self.std()

pd.DataFrame.normalize_columns = normalize_columns

Next, we will load the data into a so-called DataFrame (a pandas object) and inspect it by displaying the top N rows.

In [None]:
X = pd.read_csv('X_unsupervised.csv.zip')
# .head() returns a DataFrame, that consists of the first N (default: N=5) rows 
# of the DataFrame it is applied on
X.head() 

The data describes credit card transactions, one transaction per row. 

As you may notice, all features are numeric. All Vx features are the result of a mathematical operation called PCA. In reality, we often have to deal with non-numerical data (for instance, categorical data), which requires some effort to make it numerical and suitable for the mathematical models we work with.
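To give an idea of the effort that categorical data involves, here is a minimal sketch of one common approach, one-hot encoding with `pd.get_dummies`. The DataFrame `raw` and its `channel` column are made up for illustration and are not part of the masterclass data.

```python
import pandas as pd

# Made-up transactions with one categorical column ("channel" is hypothetical).
raw = pd.DataFrame({
    "amount": [12.5, 80.0, 3.2, 41.0],
    "channel": ["online", "pos", "atm", "online"],
})

# One-hot encoding: one indicator column per category value.
encoded = pd.get_dummies(raw, columns=["channel"])
print(encoded.columns.tolist())
```

Each row ends up with exactly one indicator set, so the categorical column becomes usable by models that expect numbers.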

The pre-fabricated data thus saves us considerable time. 

Let us first determine the dimensions of the DataFrame (note that the first dimension goes along the rows, the second along columns):
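A minimal sketch of how this works, using a small made-up DataFrame (`toy` is hypothetical, not the masterclass data). The `.shape` attribute returns a `(rows, columns)` tuple; `X.shape` on the loaded data works the same way.

```python
import pandas as pd

# Small made-up DataFrame: 3 rows (transactions), 2 columns (features).
toy = pd.DataFrame({"V1": [0.1, -1.2, 0.7], "Amount": [10.0, 250.0, 3.5]})

n_rows, n_cols = toy.shape  # .shape is an attribute, not a method
print(n_rows, n_cols)
```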

## Assignment: Defining a homemade outlier score

Generate an array with outlier scores based on your own hand-made logic. Store the outlier predictions in a pandas Series with the name "homemade_outlier_scores", using the examples below. 


**Question:**

Below, there are several options to create a hand-crafted outlier score. 

Which one would you choose, and why? Uncomment your preference to assign the outcome to the variable `homemade_outlier_scores`. If you see a better solution, you are free to implement that.

The various methods that are used on the DataFrame are:
- .abs() This method converts all values to their absolute value (and does not change the size of the DataFrame)
- .drop(columns=...) This method returns the DataFrame with the indicated columns (may be a string or a list of strings) removed
- .max(axis=1) This method, when executed with axis=1, takes the maximum over all columns (one value per row)
- .mean(axis=1) This method, when executed with axis=1, takes the mean over all columns (one value per row)
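The methods listed above can be tried out on a tiny made-up DataFrame first (the values in `toy` are illustrative only), which makes it easier to see what each step does to the shape of the data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Time": [0, 1, 2],
    "V1": [-3.0, 0.5, 1.0],
    "V2": [1.0, -0.5, 4.0],
})

features = toy.drop(columns="Time")      # remove the "Time" column
abs_features = features.abs()            # element-wise absolute value, same shape
row_max = abs_features.max(axis=1)       # largest absolute value in each row
row_mean = abs_features.mean(axis=1)     # average absolute value in each row
print(row_max.tolist(), row_mean.tolist())
```

Both `row_max` and `row_mean` are pandas Series with one entry per row, which is the shape an outlier score should have.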

In [None]:
# Some examples to make an outlier score below. Uncomment (remove the "#") to execute it.

# homemade_outlier_scores = X.drop(columns='Time').abs().max(axis=1)
# homemade_outlier_scores = (X.normalize_columns()**2).mean(axis=1)
# homemade_outlier_scores = X['Amount']
# homemade_outlier_scores = X.drop(columns='Time').abs().max(axis=1) # .drop() returns the cropped dataframe



In [None]:
# To verify the shape, add .shape to the dataframe and look at the output. What shape should it be?
# homemade_outlier_scores.shape

# Outlier algorithms

Go to the section of the outlier algorithm assigned to you or chosen by you to generate your scores. 
First run the cell below for important imports.


In [None]:
# from sklearn.neighbors import LocalOutlierFactor
# !pip install seaborn==0.11.1 # Needed for plotting
# !pip install tensorflow
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import MinMaxScaler

try:
    import pyod
except ModuleNotFoundError:
    !pip install pyod
from pyod.models.auto_encoder import AutoEncoder




Also get rid of the "Time" column by executing the next cell.

In [None]:
X = X.drop(columns='Time')

## Mahalanobis Distance

The Mahalanobis distance is a generalization of distance (measured in standard deviations) to a multivariate normal distribution. The assumption being made is thus that the data is normally distributed, and that outliers are located further away from the center than the inliers. 
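To make the definition concrete, here is a minimal NumPy sketch that computes the squared Mahalanobis distance $(x - \mu)^T \Sigma^{-1} (x - \mu)$ by hand on made-up 2-D data. The data, the added far-away point, and all variable names are assumptions for illustration. Note that scikit-learn's `mahalanobis` method is documented as returning squared distances as well.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up correlated 2-D data, plus one point far from the center.
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
data = np.vstack([data, [4.0, -4.0]])

mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False, bias=True)  # maximum-likelihood covariance estimate
precision = np.linalg.inv(cov)

centered = data - mean
# Squared Mahalanobis distance per row: (x - mu)^T  Sigma^{-1}  (x - mu)
sq_mahalanobis = np.einsum("ij,jk,ik->i", centered, precision, centered)
print(sq_mahalanobis[-1], np.median(sq_mahalanobis))
```

The added point lies against the direction of the correlation, so its distance is large even though each of its coordinates taken alone is only moderately unusual.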

Run the cells below to: 
- Create an EmpiricalCovariance object
- Fit the data to this model
- Assign the scores to "mah_outlier_scores", using the method "mahalanobis"

If necessary, change the sign of the assignment from `+` to `-`.

In [None]:
?EmpiricalCovariance

In [None]:
cov = EmpiricalCovariance()
cov.fit(X)
mah_outlier_scores = + cov.mahalanobis(X)

In [None]:
?cov.mahalanobis

**Question**: in which situation is the Mahalanobis distance equivalent to a simple mean of squared values taken across the columns?

## Gaussian Mixture

The Gaussian Mixture model assumes the data consists of one or multiple "blobs" (clusters), each following a normal distribution (NB: with a covariance matrix constrained to be spherical, diagonal, or unconstrained, i.e. full). It is a "soft clustering" method, as each point may belong to each cluster with some probability.
After fitting, the method `.score_samples()` returns the log-likelihood of each point (the logarithm of its probability density under the fitted Gaussian mixture distribution).
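As an illustration of "soft clustering", the sketch below computes cluster membership probabilities and the per-point log-density for a one-dimensional mixture. The mixture parameters are hand-picked rather than fitted, and the helper `normal_pdf` is defined here for the example, so this shows the idea rather than what scikit-learn does internally.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hand-picked (not fitted) mixture: two components with weights, means and stds.
weights = np.array([0.7, 0.3])
means = np.array([0.0, 5.0])
stds = np.array([1.0, 1.0])

x = np.array([0.0, 2.5, 5.0])
# Weighted density of every point under every component: shape (3 points, 2 components).
weighted = weights * normal_pdf(x[:, None], means, stds)
mixture_density = weighted.sum(axis=1)                  # p(x) under the mixture
responsibilities = weighted / mixture_density[:, None]  # soft cluster memberships
log_density = np.log(mixture_density)                   # per-point log-likelihood
print(responsibilities.round(3))
```

Each row of `responsibilities` sums to one. The point halfway between the two means is shared between the clusters in proportion to their weights.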

Run the cells below to: 

- Create a GaussianMixture object (you may adapt the parameters if desired)
- Fit the object to the data
- Get scores for the individual data points using `.score_samples()`

If necessary, change the sign of the assignment from `+` to `-` (read the documentation to decide this). 

In [None]:
?GaussianMixture

In [None]:
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=1, n_init=3) 
gmm.fit(X)

In [None]:
gmm_scores = +gmm.score_samples(X)

In [None]:
?gmm.score_samples

**Question**: It is not really possible to know in advance a good value for the number of components. Can you think of a procedure to estimate it?


## Nearest neighbours

Neighbourhood-based algorithms look at the points in the vicinity of a point to determine its "outlierness".
In the most basic NearestNeighbor algorithm as used here, the distance of a point to its neighbours is used to measure its outlierness. (The more involved LOF algorithm uses the deviation in local density of a data point with respect to its neighbors). 
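The basic idea can be sketched without any library support, by computing all pairwise distances with NumPy and keeping the `k` smallest per point. The data and the value of `k` below are made up. This brute-force version needs memory quadratic in the number of points, so it is only meant for small examples.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3))      # made-up data: 200 points, 3 features
k = 5                                   # number of neighbours (illustrative choice)

# Pairwise Euclidean distances, shape (200, 200).
diff = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diff ** 2).sum(axis=-1))

# Exclude each point's distance to itself, then keep the k smallest per row.
np.fill_diagonal(pairwise, np.inf)
distances_to_neighbors = np.sort(pairwise, axis=1)[:, :k]
print(distances_to_neighbors.shape)
```

The result has one row per point and one column per neighbour, sorted from nearest to farthest.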

Run the cells below to: 
- Create a NearestNeighbors object (adapt the parameters if you wish)
- Fit the data to this model
- Assign the scores to `knn_outlier_scores`, by aggregating the data in `distances_to_neighbors`

Let's create a NearestNeighbors object, and use that. First, we may want to read some documentation regarding the NearestNeighbors class:

In [None]:
?NearestNeighbors

In [None]:
nn = NearestNeighbors(n_neighbors=50)
nn.fit(X)
distances_to_neighbors = nn.kneighbors()[0]

The "heavy lifting" was done by the `.kneighbors()` method. 
For each point, it returns the distances to the nearest N points, along with the indices of those points.

As a final step, we collapse this distance matrix (m points x N neighbours) to m scores. This may be done in several ways, for instance by taking the mean, or the median. Choose one of the options given below (by default, the mean is taken). 

In [None]:
knn_outlier_scores = np.mean(distances_to_neighbors, axis=1)
# knn_outlier_scores = np.median(distances_to_neighbors, axis=1)
# knn_outlier_scores = np.max(distances_to_neighbors, axis=1)
# knn_outlier_scores = np.min(distances_to_neighbors, axis=1)


**Question**: what is an interpretation (say when `n_neighbors` is 11) for the

- median
- min
- max

distance to the `n_neighbors` nearest neighbours?

## Isolation Forest algorithm

The isolation forest algorithm measures how difficult it is to isolate a point from the rest of the data, using a random splitting algorithm. Similar to the Random Forest algorithm, many (`n_estimators`) different trees are built, each time based on a randomly drawn sample (of size `max_samples`).
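To get a feeling for "isolating a point with random splits", here is a simplified one-dimensional sketch. It omits feature selection and subsampling, so it is not the scikit-learn implementation. The function names `isolation_depth` and `mean_depth` and the data are made up for this example.

```python
import numpy as np

def isolation_depth(values, target, rng):
    """Count random splits needed until `target` is alone in its partition (1-D)."""
    current = np.asarray(values, dtype=float)
    depth = 0
    while len(current) > 1 and current.min() < current.max():
        cut = rng.uniform(current.min(), current.max())   # random split position
        # Keep only the side of the split that contains the target.
        current = current[current <= cut] if target <= cut else current[current > cut]
        depth += 1
    return depth

rng = np.random.default_rng(0)
values = np.append(rng.normal(size=255), 8.0)   # made-up data plus one distant point

def mean_depth(target, n_trees=200):
    """Average the depth over many random trees, as a forest would."""
    return np.mean([isolation_depth(values, target, rng) for _ in range(n_trees)])

print(mean_depth(values[0]), mean_depth(8.0))
```

Compare the two printed averages: one belongs to a point in the bulk of the data, the other to the distant point.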

Run the cells below to: 
- Create an IsolationForest object with the correct parameters
- Fit the IsolationForest object with the data
- Get the scores using `.score_samples()`

If necessary, change the sign of the assignment from `+` to `-` (read the documentation, and note that the score as returned by `.score_samples()` is a measure for the number of needed splits to isolate a point). 

In [None]:
?IsolationForest

In [None]:
iforest = IsolationForest(n_estimators=100, max_samples=1024)
iforest.fit(X)


In [None]:
iforest_outlier_scores = +iforest.score_samples(X)

In [None]:
?iforest.score_samples

**Question:** What advantage and disadvantage of this method do you see?


## Autoencoder

Autoencoders are a special type of neural network, trained to compress and decompress a signal effectively. The idea behind using these networks for outlier detection is that the network is expected to reconstruct "typical" data points well, whereas it will struggle with outliers.

We use the pyod `AutoEncoder` class to construct the network. This way we don't have to bother with the details of building the network, and can focus on the main parameters. 
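The compress-then-decompress idea can be illustrated with a purely linear stand-in for the neural network: projecting onto the top principal directions and back. This is an assumption-laden sketch on made-up data, not what pyod builds, but the scoring principle (reconstruction error as outlier score) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data lying close to a 2-D plane inside a 5-D space.
latent = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.5, 0.0, 0.0, 0.5],
                   [0.0, 0.5, 1.0, 0.5, 0.0]])
data = latent @ mixing + 0.05 * rng.normal(size=(300, 5))
data[0] = [-2.0, 0.0, -2.0, 4.0, 4.0]   # one point placed perpendicular to that plane

center = data.mean(axis=0)
centered = data - center
# The top-2 right singular vectors act as a linear "encoder" and "decoder".
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:2]

encoded = centered @ components.T           # compress: 5 -> 2 numbers per point
decoded = encoded @ components + center     # decompress: 2 -> 5 numbers per point
reconstruction_error = np.linalg.norm(data - decoded, axis=1)
print(reconstruction_error.argmax())
```

Points that follow the dominant structure survive the round trip almost unchanged, while the point that does not follow it is reconstructed poorly.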

Run the cells below to: 
- Create an `AutoEncoder` object
- Train this object on the data
- Get the scores from the `decision_scores_` attribute


In [None]:
?AutoEncoder

In [None]:
?MinMaxScaler

In [None]:
X_scaled = MinMaxScaler().fit_transform(X)
clf = AutoEncoder(
    hidden_neurons=[10, 5, 10], # Choose bottleneck here!
    hidden_activation='elu',
    output_activation='sigmoid', 
    optimizer='adam',
    epochs=10,
    batch_size=16,
    dropout_rate=0.0, #may not be needed here
    l2_regularizer=0.0,
    validation_size=0.1,
    preprocessing=False,  # NB: if True, this applies sklearn's StandardScaler; off because X is already scaled
    verbose=1,
    random_state=1,
)

In [None]:
clf.fit(X_scaled)

In [None]:
?clf

In [None]:
autoenc_outlier_scores = + clf.decision_scores_

**Question**: Why do you think we had to scale the data with a MinMaxScaler? (Hint: the signal needs to be reconstructed by the network.)


# Plot and compare results

In the next section, you will compare how your algorithm did against your "home-made" algorithm, using the labels (ground-truth: is a point an outlier or not? In this case: is a transaction fraudulent or not?). 
Note that this information is usually not available for those problems where we decide to use outlier detection.


Look carefully at the plots and assess their meaning. 



In [None]:
# Get the labels, and a helper module
force_download = False
if force_download or not os.path.exists('y_unsupervised.csv.zip'): # then probably nothing was downloaded yet
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_unsupervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/ml_utils.py
y = pd.read_csv('y_unsupervised.csv.zip')['Class']


In [None]:
from ml_utils import plot_top_N, plot_conditional_distribution

### Conditional distributions of the scores, and the AUC metrics

NB: 
- plot your "homemade" score and your own algorithm
- use np.log1p(your_score) to logtransform your score if needed (in case it has a very long "tail")
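A log transform only changes how the score is spread out, not how the points are ranked, so a rank-based metric such as the AUC is unaffected. The sketch below checks this on made-up labels and non-negative scores, with a small hand-written `auc` helper (an illustrative function, not the one used in `ml_utils`).

```python
import numpy as np

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up labels and long-tailed, non-negative scores.
labels = np.array([0, 0, 0, 1, 0, 1])
scores = np.array([0.2, 1.5, 0.7, 900.0, 30.0, 12.0])

print(auc(labels, scores), auc(labels, np.log1p(scores)))
```

Note that `np.log1p` is only defined for values above -1, so scores that can be strongly negative need to be shifted first.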


In [None]:
# _ = plot_conditional_distribution(y.values, np.log1p(homemade_outlier_scores), title='Homemade: ')

In [None]:
_ = plot_conditional_distribution(y.values, knn_outlier_scores, title='KNN: ')

In [None]:
_ = plot_conditional_distribution(y.values, iforest_outlier_scores, title='Isolation Forest: ')

In [None]:
_ = plot_conditional_distribution(y.values, mah_outlier_scores, title='Mahalanobis: ')

In [None]:
_ = plot_conditional_distribution(y.values, autoenc_outlier_scores, title='Autoencoder: ')

In [None]:
_ = plot_conditional_distribution(y.values, (gmm_scores - np.min(gmm_scores)), title='GMM: ')

The following plots show how many of the top-N points in terms of score were actual outliers (the more yellow the plot, the better the algorithm performed).


### Precision@top-N
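As a numeric counterpart to the plots, the precision among the N highest-scoring points can be computed directly. The function `precision_at_top_n` below is a hypothetical helper written for this example (it is not the implementation behind `plot_top_N`), and the labels and scores are made up.

```python
import numpy as np

def precision_at_top_n(y_true, scores, n):
    """Fraction of true outliers among the n highest-scoring points."""
    y_true = np.asarray(y_true)
    top_idx = np.argsort(scores)[::-1][:n]   # indices of the n largest scores
    return y_true[top_idx].mean()

# Made-up labels (1 = fraud) and scores, for illustration only.
labels = np.array([0, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.05, 0.4])

print(precision_at_top_n(labels, scores, n=3))
```

With real data, `labels` would be `y.values` and `scores` one of the outlier score arrays computed earlier.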

In [None]:
# _ = plot_top_N(y_true=y, scores=homemade_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=knn_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=iforest_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=mah_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=autoenc_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=gmm_scores, N=100)

**Question:** Based on the number of positives, what precision do you expect when randomly guessing?

# Discussion



- How did you construct your home-made outlier model?
- How did it perform?
- What choices did you make for the outlier algorithm, if any, and why?
- Answers to the questions
