<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/CreditCardUnsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Introduction (together, 5 minutes)

In this "notebook", we can run Python commands, make plots, and make notes. 

The purpose of this notebook is to guide you through some approaches to outlier detection using Python, and give you an impression of what the various algorithms do. 

Note that there are two types of cells in this notebook: Markdown cells (that contain text, like this one), and Code cells (that execute some code, like the next cell). 

By clicking the "Play-button" on a code cell, we execute that cell. Lines within code cells that start with a "#" are Python comments, and are not executed. 

Your input is required whenever there is a Question (in that case: write in the Markdown cell) or whenever you find some 'xxxxx' in the code cell (in this case, some code needs to be fixed or completed).

We start by importing our outlier data, by executing the next cell. 

In [None]:
## Data import from Github
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_unsupervised.csv.zip

We will be using the "pandas" package for data handling and manipulation, and later "scikit-learn" (imported with "sklearn") for various outlier detection algorithms. 

In [None]:
## Package import: pandas for data handling and manipulation
import pandas as pd

# A small hack: "monkey-patching" the DataFrame class to add column-wise normalization as a method
def normalize_columns(self,):
    return (self - self.mean()) / self.std()

pd.DataFrame.normalize_columns = normalize_columns

Next, we will load the data into a so-called DataFrame (a pandas object), and inspect it by displaying the first N rows.

In [None]:
X = pd.read_csv('X_unsupervised.csv.zip')
# .head() returns a DataFrame, that consists of the first N (default: N=5) rows 
# of the DataFrame it is applied on
X.head() 

The data describes credit card transactions, one transaction per row. 

As you may notice, all features are numeric. All Vx features were generated by compressing the original data using a mathematical operation called PCA (principal component analysis). In practice, we always have to convert our data to a purely numerical form; however, we generally want to avoid losing touch with the meaning of the attributes, for instance for reasons of explainability.

In this case, it is advantageous because little pre-processing or interpretation is needed, and we can feed the data directly into any algorithm, which will save us time. 

Let us first determine the dimensions of the DataFrame (note that the first dimension goes along the rows, the second along columns):

In [None]:
X.shape

In any realistic situation, we would not have access to labels (otherwise, we would be using a supervised approach) and typically know nothing about the fraction of positives. We will already give one fact away: the fraction of positive labels is about 0.3%. 

Before proceeding, let's demonstrate some dataframe operations, with a smaller demonstration dataframe. 

### Some useful pandas DataFrame methods:

In [None]:
# Let's demonstrate the hints with a smaller dataframe (the first 5 rows):
small_df = X.head(5).copy()

**.drop()** 

The .drop() method can be applied on a DataFrame to drop one or more rows or columns. It returns a new DataFrame; the original DataFrame is left unchanged. 

Example usage to delete ("drop") one or more columns: 

In [None]:
small_df.drop(columns=['V1', 'V5']) # This drops the V1 and V5 columns

**.abs()** 

The .abs() method can be applied on a DataFrame (or Series) to convert numerical values to their absolute values. It returns an object of the same type (i.e., a DataFrame or a Series). 

Example usage for .abs():

In [None]:
small_df.abs()

**.max(axis=1), .sum(axis=1), .mean(axis=1)**

These methods can be applied on a DataFrame to do row-wise operations. They all return a Series (with as many rows as the DataFrame they were applied on).

In [None]:
small_df.max(axis=1)
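As a minimal sketch of these row-wise reductions, the cell below applies all three methods to a tiny toy DataFrame. The values and the name `toy` are made up for illustration and have nothing to do with the credit card data.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({"a": [1.0, -4.0, 2.0], "b": [3.0, 0.5, -6.0]})

# axis=1 means: reduce along the columns, giving one value per row
row_max = toy.max(axis=1)
row_sum = toy.sum(axis=1)
row_mean = toy.mean(axis=1)

# Each result is a Series with one entry per row of the DataFrame
print(type(row_max).__name__, len(row_max))
print(row_max.tolist())
print(row_sum.tolist())
print(row_mean.tolist())
```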

**.normalize_columns()**

We added this method to our DataFrame in the beginning. It performs column-wise normalization (i.e.: after this operation, the column-wise mean is zero, and the column-wise standard deviation, and hence the variance, is one). 

In [None]:
small_df.normalize_columns()
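To see what the normalization does, the following sketch applies the same formula to a toy DataFrame and checks the column-wise mean and standard deviation afterwards. The toy values are an assumption for illustration; the function is written as a plain function here so that the cell is self-contained.

```python
import numpy as np
import pandas as pd

def normalize_columns(df):
    # Same formula as the monkey-patched method: subtract the column mean,
    # divide by the column standard deviation
    return (df - df.mean()) / df.std()

# Hypothetical toy data with two very differently scaled columns
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 90.0]})

normalized = normalize_columns(toy)
col_means = normalized.mean()
col_stds = normalized.std()

# After normalization both columns are on the same scale
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```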

#### Other useful operations: selecting single and multiple columns

Generally, this returns a DataFrame when selecting multiple columns, and a Series when selecting a single column.

- Selecting a single column by its name:

In [None]:
# A single column:
small_df['Amount']

- Selecting multiple columns with their numerical index using .iloc: 

In [None]:
# The first 5 columns:
small_df.iloc[:, :5]

In [None]:
# All columns except the last one:
small_df.iloc[:, :-1]

Note that many pandas DataFrame methods return a DataFrame, on which we can apply another method. 
Applying a method on the result of another method is called "chaining". We can for instance first drop a column, then use normalize_columns() (our home-made addition) to normalize the columns, then .abs() to convert to absolute values, then .sum(axis=1) to sum horizontally, to yield a Series:

In [None]:
small_df.drop(columns=['Amount']).normalize_columns().abs().sum(axis=1)
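The chain described above (drop, normalize, absolute value, row-wise sum) can also be written one step at a time. The sketch below does both on a toy DataFrame with made-up values and confirms that the results agree; the monkey-patch is repeated only to keep the cell self-contained.

```python
import pandas as pd

def normalize_columns(self):
    return (self - self.mean()) / self.std()

# Same patch as at the top of the notebook
pd.DataFrame.normalize_columns = normalize_columns

# Hypothetical toy data
toy_df = pd.DataFrame({
    "V1": [0.0, 1.0, 2.0],
    "V2": [1.0, 1.0, 4.0],
    "Amount": [10.0, 20.0, 30.0],
})

# One step at a time: every intermediate result is a DataFrame ...
step1 = toy_df.drop(columns=["Amount"])
step2 = step1.normalize_columns()
step3 = step2.abs()
# ... until the row-wise reduction, which yields a Series
step4 = step3.sum(axis=1)

# The same operations as a single chain
chained = toy_df.drop(columns=["Amount"]).normalize_columns().abs().sum(axis=1)

print(chained.equals(step4))
```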

## 1. Generating a homemade outlier score (group assignment, 5 minutes)

Generate an array with outlier scores based on your own hand-made logic. Store the outlier predictions in a pandas Series with the name "homemade_outlier_scores", using the examples below. 


**Question 1:** what shape should this array have? (# rows, # columns)


Answer: xxxxx


Below, create an outlier score using the previously shown concepts. 

It is recommended to drop a column (which one?) before doing so. 

In [None]:
# Some example outlier scores below. Uncomment a line (remove the "#") to execute it.
# Note that only the last executed assignment will be kept!

# homemade_outlier_scores = X['Amount']
# homemade_outlier_scores = X['V1'].abs()
# homemade_outlier_scores = X.iloc[:, :10].abs().max(axis=1)
# homemade_outlier_scores = xxxxx (your own score, if desired)



In [None]:
# To verify the shape, add .shape to the Series and look at the output
homemade_outlier_scores.shape

## 2. Use an outlier algorithm to generate outlier scores (10 minutes)

Go to the section of the outlier algorithm assigned to you to generate your scores. 


In [None]:
# from sklearn.neighbors import LocalOutlierFactor
# !pip install seaborn==0.11.1 # Needed for plotting
# pyod (used for the AutoEncoder import below) may need to be installed as well
!pip install tensorflow pyod
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture
from pyod.models.auto_encoder import AutoEncoder
from sklearn.preprocessing import MinMaxScaler


### Mahalanobis Distance

As the name suggests, the Mahalanobis distance is a distance-based outlier detection method. It generalizes the distance to the mean, measured in standard deviations, to any multidimensional normal distribution.
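To make the idea concrete, here is a minimal NumPy sketch of the Mahalanobis distance on synthetic, strongly correlated toy data. All names and values (`toy_data`, `on_trend`, `off_trend`) are assumptions for illustration, and the sketch deliberately does not use the scikit-learn class from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two strongly correlated features
x = rng.normal(size=500)
toy_data = np.column_stack([x, x + 0.1 * rng.normal(size=500)])

mean = toy_data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(toy_data, rowvar=False))

def mahalanobis(points):
    # sqrt((p - mean)^T  Sigma^-1  (p - mean)) for every row p
    diff = points - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# Two points at about the same Euclidean distance from the mean:
on_trend = np.array([[2.0, 2.0]])    # follows the correlation
off_trend = np.array([[2.0, -2.0]])  # violates the correlation

print(np.linalg.norm(on_trend - mean), np.linalg.norm(off_trend - mean))
print(mahalanobis(on_trend)[0], mahalanobis(off_trend)[0])
```

Because the covariance matrix encodes the correlation between the two features, the point that breaks the correlation gets a much larger Mahalanobis distance, even though both points are similarly far from the mean in the Euclidean sense.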

In the cells below: 
- Create an EmpiricalCovariance object with the correct parameters (find out which ones)
- Fit the data to this model
- Assign the scores to "mah_outlier_scores", using the method "mahalanobis"

Can the outcome of this method be directly used as an outlier score?

In [None]:
?EmpiricalCovariance

# cov = EmpiricalCovariance(xxxx)
# cov.fit(xxxx)
# mah_outlier_scores = cov.???

**Question**: in which situation is using the Mahalanobis distance equivalent to taking a simple, column-wise mean? 

### Nearest neighbours

The family of nearest neighbours-based algorithms looks at the immediate "neighbourhood" of a point, in terms of its nearest points. 
If a point is far removed from its neighbours, it may be considered atypical. The distance may however be determined in several different ways. 

In the cells below: 
- Create a NearestNeighbors object with the correct parameters (find out which ones)
- Fit the data to this model
- Get the distances to the nearest neighbours, using the method "kneighbors"

Let's create a NearestNeighbors object, and use that. First, we may want to read some documentation regarding the NearestNeighbors class:

In [None]:
?NearestNeighbors
# nn = NearestNeighbors(xxxxx)
# nn.fit(xxxxx)
# distances_to_neighbors = nn.kneighbors()[0]

The "heavy lifting" was done by the kneighbors() method. 
It returns two arrays: the distances to the n_neighbors nearest points, and the indices of those points. 


As a final step, reduce the distance matrix (size: number of points x number of neighbours) to scores, one per point (row). 
Use an appropriate numpy function (ideas are: mean, median, min, max).
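The shape of such a distance matrix, and what a row-wise reduction does to it, can be sketched with plain NumPy on four one-dimensional toy points. This is a brute-force stand-in for illustration only (not the scikit-learn implementation), and the names `toy_points` and `toy_distances` are made up.

```python
import numpy as np

# Hypothetical toy data: three points close together, one far away
toy_points = np.array([[0.0], [1.0], [2.0], [10.0]])
k = 2  # number of neighbours to consider

# All pairwise Euclidean distances (n_points x n_points)
diffs = toy_points[:, None, :] - toy_points[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

# A point should not count as its own neighbour
np.fill_diagonal(pairwise, np.inf)

# Keep the k smallest distances per row: shape (n_points, k)
toy_distances = np.sort(pairwise, axis=1)[:, :k]

# Reduce each row to a single number per point
mean_scores = toy_distances.mean(axis=1)
max_scores = toy_distances.max(axis=1)

print(toy_distances.shape)
print(mean_scores)
print(max_scores)
```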

In [None]:
# knn_mean_outlier_scores = np.xxxx(distances_to_neighbors, axis=1)

**Question**: what is an interpretation (say when n_neighbours is 11) for the

- median
- min
- max

distance to the n_neighbours?

### Isolation Forest algorithm

The isolation forest algorithm measures how difficult it is to isolate a point from the rest of the data, by applying random splits many times. We want to have 1000 estimators (trees), and each estimator to be based on a sample of 1024 points.
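The idea of isolating a point by random splits can be sketched in one dimension with NumPy only. This is a strongly simplified illustration under stated assumptions: a real isolation forest also picks a random feature for each split and works on subsamples, and the helper names `isolation_depth` and `mean_depth` are hypothetical.

```python
import numpy as np

def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    current = values
    depth = 0
    while len(current) > 1:
        low, high = current.min(), current.max()
        if low == high:  # only duplicates left: cannot split further
            break
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the target
        current = current[current < split] if target < split else current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 typical values plus one value far away
toy_values = np.append(rng.normal(size=200), 8.0)

def mean_depth(target, n_trees=200):
    # Average over many random "trees" to smooth out the randomness
    return np.mean([isolation_depth(toy_values, target, rng) for _ in range(n_trees)])

depth_far = mean_depth(8.0)
depth_typical = mean_depth(toy_values[0])
print(depth_far, depth_typical)
```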

In the cells below: 
- Create an IsolationForest object with the correct parameters
- Fit the IsolationForest object with the data
- Get the scores using .score_samples()


In [None]:
?IsolationForest
# iforest = IsolationForest(xxxx=xxxxx, xxx=xxxxxx)
# iforest.fit(xxxx)


The score as returned by score_samples() is a measure of the number of splits needed to isolate a point. 

**Question:** Is a high score or a low score an indication that a point is an outlier? 
Reflect this in your score calculation. 


In [None]:
# iforest_outlier_scores = iforest.score_samples(xxx)

### Gaussian Mixture

The Gaussian Mixture model assumes the data consists of one or more clusters ("blobs"), each following a normal distribution (NB: with a covariance matrix constrained to be spherical, diagonal, or unconstrained, i.e. full). 
After fitting, the method .score_samples() returns the log-likelihood of each point (the logarithm of its probability density under the fitted Gaussian mixture distribution). 
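What the density of a Gaussian mixture looks like can be sketched with NumPy for a one-dimensional mixture of two components. The weights, means and standard deviations below are made-up assumptions (in the exercise they are estimated from the data by fitting). The sketch also takes the logarithm, since log-densities are the usual way to report such scores.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    # Density of a one-dimensional normal distribution
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical mixture: a broad blob around 0 and a narrow blob around 5
toy_weights = [0.7, 0.3]
toy_means = [0.0, 5.0]
toy_stds = [1.0, 0.5]

def mixture_density(x):
    # Weighted sum of the component densities
    x = np.asarray(x, dtype=float)
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(toy_weights, toy_means, toy_stds))

# Two blob centres, a point between the blobs, and a point far from both
toy_points = np.array([0.0, 5.0, 2.5, 12.0])
density = mixture_density(toy_points)
log_density = np.log(density)

print(density)
print(log_density)
```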

In the cells below, 

- Create a GaussianMixture object with sensible parameters 
- Fit the object to the data
- Get scores for the individual data points using .score_samples() 

In [None]:
?GaussianMixture
# gmm = GaussianMixture(xxx=xxxx, xxx=xxx, random_state=1) 
# gmm.fit(xxxx)

**Question**: The .score_samples method returns a log-likelihood (the logarithm of a probability density). Can the scores be used as they are or do they need to be modified? Reflect this in your calculation. 

In [None]:
# gmm_scores = gmm.score_samples(xxx)


### Autoencoder

Autoencoders are a special type of neural network, trained to compress and decompress a signal effectively. The idea behind using these networks for outlier detection is that the neural network is expected to handle "typical" datapoints well, whereas it will struggle with outliers. 

We use the pyod AutoEncoder class, because this way we don't have to define all details of the neural network architecture, but can specify the main parameters. 
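The compress-and-decompress idea can be illustrated without a neural network. The sketch below swaps in PCA (computed with an SVD) as a purely linear stand-in for an autoencoder with a one-dimensional bottleneck; it is not the pyod AutoEncoder, and the toy data and all names are assumptions for illustration. The reconstruction error then serves as a score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: typical points lie close to a line in 3-D
t = rng.normal(size=(300, 1))
toy_typical = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(300, 3))
toy_outlier = np.array([[1.0, -2.0, 1.0]])  # does not follow that pattern
toy_data = np.vstack([toy_typical, toy_outlier])

# "Training": find the direction that best explains the data (linear stand-in)
mean = toy_data.mean(axis=0)
_, _, components = np.linalg.svd(toy_data - mean, full_matrices=False)
bottleneck = components[:1]  # keep a single dimension

def reconstruction_error(points):
    encoded = (points - mean) @ bottleneck.T  # compress to 1 number per point
    decoded = encoded @ bottleneck + mean     # decompress back to 3 numbers
    return np.linalg.norm(points - decoded, axis=1)

errors = reconstruction_error(toy_data)
print(np.median(errors[:-1]), errors[-1])
```

Points that follow the dominant pattern survive the round trip through the bottleneck almost unchanged, while the point that breaks the pattern is reconstructed poorly.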

- Use the scaled data X_scaled (why do you think this is needed?)

In [None]:
?AutoEncoder

In [None]:
X_scaled = MinMaxScaler().fit_transform(X)
clf = AutoEncoder(
    hidden_neurons=[10, xxxx, 10], # Choose bottleneck here!
    hidden_activation='elu',
    output_activation='sigmoid', # Choose an activation ('linear', 'sigmoid', 'relu', 'elu' are some possibilities)
    input_activation='sigmoid',
    optimizer='adam',
    epochs=10,
    batch_size=16,
    dropout_rate=0.0, #may not be needed here
    l2_regularizer=0.0,
    validation_size=0.1,
    preprocessing=False, # NB: if True, pyod standardizes the data itself with sklearn's StandardScaler
    verbose=1,
    random_state=1,
)

In [None]:
clf.fit(X_scaled)

In [None]:
?clf.predict_proba

**Question:** Inspect the documentation on .predict_proba, and calculate the outlier scores. 

In [None]:
# autoenc_outlier_scores = clf.predict_proba(xxxx)

## 3. Plot and compare results (5 minutes)

In the next section, you will compare how your algorithm did against your "home-made" algorithm, using the labels (ground-truth: is a point an outlier or not? In this case: is a transaction fraudulent or not?) 

Note that this information is usually not available for those problems where we decide to use outlier detection. 

In [None]:
# Get the labels, and a helper module
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_unsupervised.csv.zip
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/outlierutils.py
y = pd.read_csv('y_unsupervised.csv.zip')['Class']


In [None]:
from outlierutils import plot_top_N, plot_outlier_scores

### Show the conditional distributions of the scores, and the AUC metrics

(NB: only plot your "homemade" score and your own algorithm.) 

In [None]:
_ = plot_outlier_scores(y.values, np.log1p(homemade_outlier_scores), title='Homemade: ')

In [None]:
_ = plot_outlier_scores(y.values, np.log1p(knn_mean_outlier_scores), title='KNN: ')

In [None]:
_ = plot_outlier_scores(y.values, np.log1p(iforest_outlier_scores), title='Isolation Forest: ')

In [None]:
_ = plot_outlier_scores(y.values, np.log10(mah_outlier_scores), title='Mahalanobis: ')

In [None]:
_ = plot_outlier_scores(y.values, np.log10(autoenc_outlier_scores), title='Autoencoder: ')

In [None]:
_ = plot_outlier_scores(y.values, np.log10(gmm_scores - np.min(gmm_scores) + 1), title='GMM: ')

### Showing the precision@top-N

In [None]:
_ = plot_top_N(y_true=y, scores=homemade_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=knn_mean_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=iforest_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=mah_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=autoenc_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=gmm_scores, N=100)
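The precision@top-N metric itself can be sketched in a few lines of NumPy. The helper `precision_at_top_n` below is hypothetical and is not the implementation used in outlierutils; it assumes that a higher score means "more outlying", and the toy labels and scores are made up.

```python
import numpy as np

def precision_at_top_n(y_true, scores, n):
    """Fraction of true positives among the n highest-scoring points."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Indices of the n largest scores (assumes: higher score = more outlying)
    top_idx = np.argsort(scores)[::-1][:n]
    return y_true[top_idx].mean()

# Hypothetical toy data: 10 points, 3 of them positive
toy_labels = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1])
toy_scores = np.array([0.1, 0.4, 0.9, 0.2, 0.8, 0.7, 0.3, 0.0, 0.5, 0.6])

p_at_1 = precision_at_top_n(toy_labels, toy_scores, 1)
p_at_3 = precision_at_top_n(toy_labels, toy_scores, 3)
print(p_at_1, p_at_3)
```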

**Question:** Based on the fraction of positives, what precision do you expect when randomly guessing?

## Discussion



- How did you construct your home-made outlier model?
- How did it perform?
- What choices did you make for the outlier algorithm, if any, and why?
- How do both compare?

