<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/BreakoutSession_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

In this "notebook", we can run Python commands, make plots, and take notes. 

The purpose of this notebook is to guide you through some approaches to outlier detection using Python, and give you an impression of what the various algorithms do. 

Note that there are two types of cells in this notebook: Markdown cells (that contain text, like this one), and Code cells (that execute some code, like the next cell). 

By clicking the Play button on a cell, we execute a code cell. Lines that start with a "#" are comments, and not executed. 

Your input is required whenever there is a Question (in that case: write in the Markdown cell) or whenever you find some 'xxxxx' in the code cell (in this case, some code needs to be fixed or completed).

We start by importing our outlier data, by executing the next cell. 

In [None]:
## Data import from Github
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X.csv.zip

We will be using pandas for data handling, and scikit-learn (sklearn) for various outlier detection algorithms. 

Later on, we will also import a self-made module (outlierutils.py) that will be used for inspecting our results. 

In [None]:
## Package import: pandas for data handling and manipulation
import pandas as pd

Next, we will load the data into a so-called DataFrame (a pandas object), and inspect it by displaying its first N rows.

In [None]:
X = pd.read_csv('X.csv.zip')
# .head() returns a DataFrame, that consists of the first N (default: N=5) rows 
# of the DataFrame it is applied on
X.head() 

The data describes credit card transactions, one transaction per row. 

As you may notice, all features are numeric. All Vx features were generated by compressing the original data using a mathematical operation called PCA (principal component analysis). In practice, we always have to convert our data to a purely numerical form; however, we generally want to avoid losing touch with the meaning of the attributes, for instance for reasons of explainability.

In this case, having purely numeric features is advantageous: little pre-processing or interpretation is needed, and we can feed the data directly into any algorithm, which will save us time. 

Before proceeding, let us determine the dimensions of the DataFrame:

In [None]:
X.shape

In any realistic situation, we would not have access to labels (otherwise, we would be using a supervised approach) and typically know nothing about the fraction of positives. We will already give one fact away: the fraction of positive labels is about 0.3%. 

### Generating a homemade outlier score (5 minutes)
We will generate an array with outlier scores, based on your own hand-made logic. 


**Question 1:** what shape should this array have? (# rows, # columns)


Answer: xxxxx


Before proceeding, let's demonstrate some dataframe operations, with a smaller demonstration dataframe. 

In [None]:
# Let's demonstrate the hints with a smaller dataframe (the first 3 rows):
small_df = X.head(3).copy()

- we can delete ("drop") one, or more columns as follows: 

In [None]:
small_df = small_df.drop(columns=['V1', 'V5']) # This drops the V1 and V5 columns
small_df

- We can select a single column by its name:

In [None]:
# A single column:
small_df['Amount']

- We can select multiple columns (amongst others) using .iloc: 

In [None]:
# All rows, and the first 5 columns:
small_df.iloc[:, :5]

In [None]:
# All rows, all columns except the last 10 ones:
small_df.iloc[:, :-10]

- We can use .max(axis=1) and .sum(axis=1) to get the maximum and the sum over all columns, per row (this reduces the size of the dataframe from m rows x n columns to m rows).

- Also, we can use .abs() to convert the values to absolute values (this doesn't change the size).

In [None]:
small_df.iloc[:, :10].abs()
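
As a minimal sketch of how these operations change the shape, consider a tiny made-up DataFrame (the name `toy_df` and its values are purely illustrative and not part of the credit card data):

```python
import pandas as pd

# Hypothetical toy data: 3 rows (m) x 3 columns (n)
toy_df = pd.DataFrame({
    "a": [1.0, -4.0, 2.0],
    "b": [-3.0, 0.5, 2.0],
    "c": [2.0, 1.0, -7.0],
})

abs_df = toy_df.abs()          # same shape as toy_df: (3, 3)
row_max = abs_df.max(axis=1)   # one value per row: shape (3,)
row_sum = abs_df.sum(axis=1)   # one value per row: shape (3,)

print(abs_df.shape, row_max.shape, row_sum.shape)
print(row_max.tolist())  # [3.0, 4.0, 7.0]
print(row_sum.tolist())  # [6.0, 5.5, 11.0]
```

Both reductions return a pandas Series with one entry per row.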

Below, create an outlier score using the previously shown concepts. 

It is recommended to drop a column (which one?) before doing so. 

In [None]:
# Some examples to make an outlier score below. Uncomment (remove the "#") to execute it.

# If you want to drop a column, do this as follows:

X = X.drop(columns= ['xxxxx']) # This will create a DataFrame without the 'xxxxx' column, and assign it to X again




# Some options below. Note that only the last executed line will be kept!

# homemade_outlier_scores = X['Amount']
# homemade_outlier_scores = X['V1'].abs()
# homemade_outlier_scores = X.iloc[:, :10].abs().max(axis=1)
# homemade_outlier_scores = xxxxx (your own score, if desired)



In [None]:
# Note the plural name: later cells refer to homemade_outlier_scores
homemade_outlier_scores = X.iloc[:, :-1].abs().sum(axis=1)

In [None]:
# To verify the shape, add .shape to the object and look at the output
homemade_outlier_scores.shape

## Using outlier algorithms to generate outlier scores (10 minutes)

We will demonstrate some of the many readily available outlier algorithms to generate scores. 

Note that Python is an object-oriented programming language. We typically first make an instance of a class (an object, that we pre-configure), then we call its methods to perform various tasks. 

First we need to import some more objects. 

In [None]:
# from sklearn.neighbors import LocalOutlierFactor
!pip install seaborn==0.11.1 # Needed for plotting

import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture

### Nearest neighbours algorithm


Let's create a NearestNeighbors object, and use that. First, we may want to read some documentation regarding the NearestNeighbors class:

In [None]:
?NearestNeighbors

Most default settings seem ok for a start. An interesting parameter to change may however be n_neighbors.

Set n_neighbors to a value that seems okay (giving no arguments will get you all default values, as far as defaults are given)

In [None]:
nn = NearestNeighbors(n_neighbors= xxxxx )

Now we have the object ready to accept data. We can pass it the data using the .fit() method: 

In [None]:
nn.fit(xxxxx)

In this case, the "heavy lifting" is done by the kneighbors() method. 
It returns the distances to the N nearest neighbors of each point, together with the indices of those neighbors. 

NB: this takes about 20-30 seconds for this dataset!

In [None]:
distances_to_neighbors = nn.kneighbors()[0]


The output is m rows x N neighbors: for each point m, the distances to its N nearest neighbors. 
Taking the mean, the max, or the median are all reasonable approaches to get a single outlier score per point. 
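
To see what kneighbors() returns without waiting for the full dataset, here is a small sketch on made-up data. The names `toy_X`, `toy_nn`, `toy_distances` and `toy_indices` are illustrative, and n_neighbors=3 is an arbitrary choice for the toy example, not a recommendation for the exercise:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))  # hypothetical toy data: m=50 points, 2 features

toy_nn = NearestNeighbors(n_neighbors=3).fit(toy_X)

# Without arguments, kneighbors() queries the fitted points themselves
# and does not count a point as its own neighbor.
toy_distances, toy_indices = toy_nn.kneighbors()

print(toy_distances.shape)  # (50, 3): m rows x N neighbors
print(toy_indices.shape)    # (50, 3)
print(toy_distances[0])     # distances from point 0 to its 3 nearest neighbors
```

Within each row the distances are sorted from nearest to farthest, so the last column holds the distance to the N-th nearest neighbor.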

Let's look at the values for the first point, point 0:

In [None]:
distances_to_neighbors[0]

Next, we need to compress this m x N data to an outlier score. We do this by determining the mean (or median, or min, or max, as you prefer) distance as an outlier score. Let us explore a few options:

In [None]:
knn_mean_outlier_scores = np.mean(distances_to_neighbors, axis=1)
knn_median_outlier_scores = np.xxxxx
knn_min_outlier_scores = np.xxxxx
knn_max_outlier_scores = np.xxxxx

knn_mean_outlier_scores.shape

**Question:** is a high or a low score an indicator of an outlier-like point?

Answer: xxxxx

**Question:** how may we interpret the median, min and max, in case n_neighbors is, say, 10?

Answer: xxxxxx

### Isolation Forest algorithm


In [None]:
?IsolationForest

Set the number of estimators to 1000, which gives a less noisy result. 

In [None]:
iforest = IsolationForest(xxxx=xxxxx)

Next, we fit the forest (create 1,000 splitting trees) with .fit(), and let all our datapoints pass through these trees and count the needed splits with .score_samples().

In [None]:
iforest.fit(X)
iforest_outlier_scores = iforest.score_samples(X)

The score is a measure of the number of splits needed to isolate a point. 
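
A quick way to build intuition is to score a made-up dataset in which one point is deliberately placed far from the rest. This is only a sketch: `toy_X`, `toy_forest` and `toy_scores` are illustrative names, and the forest uses scikit-learn's default settings apart from a fixed random_state for reproducibility.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 2))          # hypothetical toy data around the origin
toy_X = np.vstack([toy_X, [[8.0, 8.0]]])   # one point placed far away (index 200)

toy_forest = IsolationForest(random_state=0).fit(toy_X)
toy_scores = toy_forest.score_samples(toy_X)

print(toy_scores.shape)              # one score per row: (201,)
print(np.median(toy_scores[:200]))   # typical score in the bulk of the data
print(toy_scores[200])               # score of the far-away point
```

Comparing the last two printed values should help with the question below.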

**Question:** Is a high score or a low score an indication for a point being an outlier?

### Mahalanobis distance

In [None]:
cov_outlier_scores = EmpiricalCovariance().fit(X).mahalanobis(X)
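
As a check on what .mahalanobis() computes, the sketch below compares it, on made-up data, with the quantity (x - mu)^T Sigma^-1 (x - mu) evaluated by hand, where mu is the mean and Sigma the maximum-likelihood covariance of the data. All names (`toy_X`, `sk_scores`, `manual_scores`) are illustrative.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))  # hypothetical toy data

# Scores as computed by scikit-learn
sk_scores = EmpiricalCovariance().fit(toy_X).mahalanobis(toy_X)

# The same quantity by hand: (x - mu)^T Sigma^-1 (x - mu), one value per row
mu = toy_X.mean(axis=0)
sigma = np.cov(toy_X, rowvar=False, bias=True)  # maximum-likelihood covariance (divide by m)
centered = toy_X - mu
manual_scores = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(sigma), centered)

print(np.allclose(sk_scores, manual_scores))
```

Note that this is the squared distance, with no square root taken.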

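The scores are the squared Mahalanobis distances of each point to the mean of the data. (For a Gaussian mixture model (GMM), the scores would instead be the probability of a point belonging to its most likely cluster, for each point.)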

**Question:** Is a high score or a low score an indication for a point being an outlier?

In [None]:
# In case the scores need to be reversed, un-comment and execute the next line
# cov_outlier_scores = -cov_outlier_scores

## 3: Plot and compare results

In the next section, we will see how well our algorithms did. Note that this information is often not available for problems where we apply outlier detection. 

In [None]:
# Get the labels, and a helper module
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y.csv.zip
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/outlierutils.py
y = pd.read_csv('y.csv.zip')['Class']


In [None]:
from outlierutils import plot_top_N, plot_outlier_scores

### Showing the conditional distributions of the scores, and the AUC metrics

In [None]:
_ = plot_outlier_scores(y.values, np.log1p(homemade_outlier_scores))

In [None]:
_ = plot_outlier_scores(y.values, np.log1p(knn_mean_outlier_scores))

In [None]:
_ = plot_outlier_scores(y.values, np.log1p(-iforest_outlier_scores))

In [None]:
_ = plot_outlier_scores(y.values, np.log10(cov_outlier_scores))
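
The internals of plot_outlier_scores are not shown in this notebook. Assuming the AUC it reports is the usual ROC AUC, the metric can be sketched on a handful of made-up labels and scores with scikit-learn's roc_auc_score (`toy_y` and `toy_scores` are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels (1 = positive) and outlier scores (higher = more outlier-like)
toy_y = np.array([0, 0, 0, 0, 1, 0, 1, 0])
toy_scores = np.array([0.1, 0.3, 0.2, 0.8, 0.9, 0.4, 0.7, 0.5])

# ROC AUC: the fraction of (positive, negative) pairs in which
# the positive point receives the higher score
toy_auc = roc_auc_score(toy_y, toy_scores)
print(toy_auc)
```

Here there are 2 x 6 = 12 such pairs, and the positive scores higher in 11 of them, so the AUC is 11/12. Because the ROC AUC depends only on the ranking of the scores, monotonic transformations such as the logarithms used above do not change it.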

### Showing the precision@top-N

In [None]:
_ = plot_top_N(y_true=y, scores=homemade_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=knn_mean_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=-iforest_outlier_scores, N=100)

In [None]:
_ = plot_top_N(y_true=y, scores=cov_outlier_scores, N=100)
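
For reference, precision@top-N can be sketched in a few lines of NumPy. This is not the implementation used in outlierutils.py (which is not shown here); `precision_at_top_n` is a hypothetical helper written for illustration only.

```python
import numpy as np

def precision_at_top_n(y_true, scores, n):
    """Fraction of true positives among the n highest-scoring points."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    top_idx = np.argsort(scores)[::-1][:n]  # indices of the n highest scores
    return y_true[top_idx].mean()

# Hypothetical toy labels (1 = positive) and outlier scores
toy_y = np.array([0, 0, 0, 0, 1, 0, 1, 0])
toy_scores = np.array([0.1, 0.3, 0.2, 0.8, 0.9, 0.4, 0.7, 0.5])

# The top 2 scores are 0.9 (a positive) and 0.8 (a negative)
print(precision_at_top_n(toy_y, toy_scores, n=2))  # 0.5
```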

**Question:** Based on the fraction of positives, what precision do you expect when randomly guessing?