<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/BreakoutSession_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
## Package installing and data import

!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X.csv.zip
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y.csv.zip
!curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/outlierutils.py




### Package imports 

We use Python as our programming language. Its main advantage for data science is the large ecosystem of open-source packages for data manipulation and machine learning. We will use pandas for data handling, and scikit-learn (sklearn) for various outlier detection algorithms. 

We also downloaded a self-made module (outlierutils.py), which we will use later to inspect our results. 

In [25]:
import numpy as np  # used further below for log-scaling the scores
import pandas as pd

Next, we will load the data in a so-called DataFrame (a pandas object), and inspect it.

In [26]:
X = pd.read_csv('X.csv.zip')
X.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,1.378,-0.01831,0.02579,-0.4705,-1.359,0.1285,-0.3384,0.0987,1.468,0.208,...,0.3638,0.0908,0.2778,0.4624,-0.6177,-0.1891,2.537,-0.07275,-0.02106,149.6
1,0.4482,-0.2258,-0.1833,0.4639,1.191,0.1671,0.06003,0.0851,0.6357,-0.1148,...,-0.2554,-0.167,-0.6387,-0.08234,1.065,0.1259,0.1665,0.266,0.014725,2.69
2,0.38,0.248,-0.12134,-2.89,-1.358,-0.3276,-0.5034,0.2477,2.346,1.11,...,-1.515,0.2076,0.7715,1.801,0.0661,-0.139,1.773,-1.34,-0.05975,378.8
3,-0.8633,-0.1083,1.966,-1.06,-0.9663,0.6475,-0.01031,0.3774,-0.6313,-0.684,...,-1.387,-0.05496,0.005272,1.247,0.1782,-0.2219,1.793,-0.1852,0.06146,123.5
4,0.403,-0.00943,-0.0382,-0.4514,-1.158,-0.206,-0.4072,-0.2705,0.1752,-0.237,...,0.818,0.753,0.7983,0.09595,0.538,0.5024,1.549,0.878,0.2152,70.0


The data describes credit card transactions, one transaction per row. Only the final column (Amount, in USD) has a directly interpretable meaning. 

As you may notice, all features are numeric. They were compressed and anonymized using a mathematical operation called PCA (principal component analysis). In practice, we always have to convert our data to a purely numerical form before feeding it to an algorithm; however, we generally want to avoid losing touch with the meaning of the attributes, for instance for reasons of explainability.

In this case, the anonymization is convenient: no pre-processing or interpretation is needed, and we can feed the data directly into any algorithm, which will save us time. 
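
To give a feel for what such a PCA transformation does, here is a minimal sketch on synthetic data. The column names, the toy values, and the choice of two components are all assumptions made for illustration; the actual transformation behind V1 to V28 is not known to us.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical, interpretable features (made up for this illustration)
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "hour_of_day": rng.integers(0, 24, size=500),
    "merchant_risk": rng.normal(size=500),
    "distance_km": rng.exponential(scale=10.0, size=500),
})

# Standardize each column, then project onto the principal components
standardized = (raw - raw.mean()) / raw.std()
pca = PCA(n_components=2)
components = pca.fit_transform(standardized)

# The result is purely numeric, and the columns no longer carry a meaning
anonymized = pd.DataFrame(components, columns=["V1", "V2"])
print(anonymized.head())
```

Each output column is a linear mix of all input columns, which is why the individual V columns in our dataset cannot be interpreted on their own.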

Before proceeding, let us determine the dimensions of the DataFrame:

In [13]:
X.shape

(284807, 29)

In any realistic situation, we would not have access to labels (otherwise, we would be using a supervised approach) and typically know nothing about the fraction of positives. We will already give one fact away: the fraction of positive labels is 0.17%. 

## Assignment 1: Generate your own outlier score
We will generate an array with outlier scores, based on your own hand-made logic. 

#### Step 1: what shape should this array have? (# rows, # columns)

#### Step 2: using the .sum(), .max() and .abs() methods, create an outlier score, either by selecting one of the examples below, or by modifying them

#### Step 3: verify that the shape is correct


## Hints: 

1. We can select a single column by its name, and multiple columns by position using .iloc. 
Let's demonstrate with a smaller DataFrame (the first 5 rows):


In [30]:
small_df = X.head()
# A single column:
small_df['Amount']

0    149.60
1      2.69
2    378.80
3    123.50
4     70.00
Name: Amount, dtype: float64

In [50]:
# All rows, and the first 5 columns:
small_df.iloc[:, :5]

Unnamed: 0,V1,V2,V3,V4,V5
0,1.378,-0.01831,0.02579,-0.4705,-1.359
1,0.4482,-0.2258,-0.1833,0.4639,1.191
2,0.38,0.248,-0.12134,-2.89,-1.358
3,-0.8633,-0.1083,1.966,-1.06,-0.9663
4,0.403,-0.00943,-0.0382,-0.4514,-1.158


In [51]:
# All rows, all columns except the last 10 ones:
small_df.iloc[:, :-10]

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19
0,1.378,-0.01831,0.02579,-0.4705,-1.359,0.1285,-0.3384,0.0987,1.468,0.208,0.2515,0.06696,-0.991,0.404,-0.552,-0.3113,-0.1105,0.1335,0.2396
1,0.4482,-0.2258,-0.1833,0.4639,1.191,0.1671,0.06003,0.0851,0.6357,-0.1148,-0.0691,-0.3398,0.489,-0.1458,1.612,-0.1438,0.1013,-0.00898,-0.0788
2,0.38,0.248,-0.12134,-2.89,-1.358,-0.3276,-0.5034,0.2477,2.346,1.11,0.525,-0.6895,0.7173,-2.262,0.6245,-0.1659,0.909,-0.05536,0.7915
3,-0.8633,-0.1083,1.966,-1.06,-0.9663,0.6475,-0.01031,0.3774,-0.6313,-0.684,-0.208,-1.176,0.508,-1.232,-0.2264,-0.2878,-0.1903,0.06274,0.2375
4,0.403,-0.00943,-0.0382,-0.4514,-1.158,-0.206,-0.4072,-0.2705,0.1752,-0.237,0.4084,0.1412,1.346,0.8037,-0.8228,-1.12,-0.1375,0.2195,0.593


We can use .max(axis=1) and .sum(axis=1) to take the maximum and the sum over all columns of each row (this reduces the DataFrame from m rows x n columns to a Series of m values). 

Also, we can use .abs() to convert the values to their absolute values (this doesn't change the shape).

In [52]:
small_df.iloc[:, :10].abs()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10
0,1.378,0.01831,0.02579,0.4705,1.359,0.1285,0.3384,0.0987,1.468,0.208
1,0.4482,0.2258,0.1833,0.4639,1.191,0.1671,0.06003,0.0851,0.6357,0.1148
2,0.38,0.248,0.12134,2.89,1.358,0.3276,0.5034,0.2477,2.346,1.11
3,0.8633,0.1083,1.966,1.06,0.9663,0.6475,0.01031,0.3774,0.6313,0.684
4,0.403,0.00943,0.0382,0.4514,1.158,0.206,0.4072,0.2705,0.1752,0.237
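
The effect of these methods on the shape can be checked on a toy DataFrame. This is a small sketch; the names and values are made up for illustration.

```python
import pandas as pd

# A toy DataFrame with 3 rows and 2 columns (values made up for illustration)
toy = pd.DataFrame({"a": [1.0, -4.0, 2.0], "b": [-3.0, 0.5, 2.5]})

abs_values = toy.abs()             # still 3 rows x 2 columns
row_max = toy.abs().max(axis=1)    # one value per row: 3.0, 4.0, 2.5
row_sum = toy.abs().sum(axis=1)    # one value per row: 4.0, 4.5, 4.5

print(abs_values.shape, row_max.shape, row_sum.shape)
```

An outlier score needs exactly one value per transaction, so the row-wise reductions (axis=1) are the ones that produce a usable shape.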


## Your experiments and work below:

In [44]:
# Some examples to make an outlier score below. Uncomment (remove the "#") to execute it.
# Only the last executed one will be kept

homemade_outlier_score = X['Amount']
# homemade_outlier_score = X['V1'].abs()
# homemade_outlier_score = X.iloc[:, :10].abs().max(axis=1)




In [45]:
# To verify the shape, append .shape to the object and look at the output
homemade_outlier_score.shape

(284807,)

## Assignment 2 (10 minutes): Use an outlier algorithm to generate the outlier scores

We will use one of the various readily available outlier algorithms to generate scores. 

In Python, we typically first make an instance of a class (an object), then we perform various tasks with it by calling its methods. 
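
This instantiate-then-call pattern can be sketched with IsolationForest on synthetic data. The toy data, the parameter values, and the sign flip are choices made for this illustration, not a prescribed solution to the assignment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 ordinary points plus one obvious outlier (made up)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(200, 3)), [[8.0, 8.0, 8.0]]])

# 1. Make an instance of the class (an object)
iso = IsolationForest(n_estimators=100, random_state=0)

# 2. Call methods on the object: first fit, then score
iso.fit(X_demo)
# score_samples() is lower for more abnormal points, so we flip the sign
iso_scores = -iso.score_samples(X_demo)

print(iso_scores.shape)
```

After the sign flip, a higher score means a more outlying point, which matches the convention of the hand-made scores in Assignment 1.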

In [71]:
# First, we import some algorithms 
# from sklearn.neighbors import LocalOutlierFactor
!pip install pyod==0.8.8
!pip install seaborn==0.11.1

from sklearn.covariance import EmpiricalCovariance #, MinCovDet # (MinCovDet may be very slow)
from pyod.models.knn import KNN
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture



#### Option: Nearest neighbours 
(This takes a lot of time, so we fit on a subset of the data.)

Let's create a NearestNeighbors object, and use that. First, we may want to read some documentation regarding the NearestNeighbors class:

In [57]:
?NearestNeighbors

Most default settings seem ok for a start. An interesting parameter to change may however be n_neighbors.

Set n_neighbors to a value that seems okay (giving no arguments will get you all default values, as far as defaults are given)

In [None]:
nn = NearestNeighbors(n_neighbors=10)

Now we have the object ready to accept data. We can directly fit it on the data using the .fit() method: 

In [85]:
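import time

t0 = time.process_time()  # CPU time, not wall-clock time
nn.fit(X.head(30000))  # fit on the first 30,000 rows only, to limit run time
# Called without arguments, kneighbors() returns (distances, indices) for the
# training points themselves; [0] keeps the distances
kneighbors = nn.kneighbors()[0]
duration = time.process_time() - t0
print(duration)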

28.532253000000026
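
To see what kneighbors() returns and how it can be turned into an outlier score, here is a minimal sketch on synthetic data. The toy data and the choice of 3 neighbours are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data: 100 ordinary points plus one far-away point (made up)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])

nn_demo = NearestNeighbors(n_neighbors=3).fit(X_demo)

# Without arguments, each training point is queried against the others,
# and a point is not counted as its own neighbour
distances, indices = nn_demo.kneighbors()

# Outlier score: mean distance to the 3 nearest neighbours
knn_demo_scores = distances.mean(axis=1)

print(distances.shape, knn_demo_scores.shape)
```

Points in sparse regions have distant neighbours and therefore a high score; the far-away point added at the end of X_demo gets the highest score here.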


In [80]:
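# Outlier score: mean distance to the 10 nearest neighbours, one value per row
knn_scores = kneighbors.mean(axis=1)
knn_scores.shape

(30000,)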

In [17]:
y = pd.read_csv('y.csv.zip')

y.mean()*100

Class    0.172749
dtype: float64

In [86]:
y.head(30000).sum()

Class    94
dtype: int64
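
The two outputs above can be put side by side with a little arithmetic. This sketch only reuses numbers already shown in this notebook (the shape of X, the percentage of positives, and the 94 positives in the first 30,000 rows).

```python
n_rows = 284807            # number of rows in X (from X.shape)
pct_positive = 0.172749    # percentage of positives (from y.mean()*100)

# Approximate number of positives in the full dataset
n_positive = round(n_rows * pct_positive / 100)

# Percentage of positives within the first 30,000 rows (94 positives)
pct_positive_subset = 94 / 30000 * 100

print(n_positive, round(pct_positive_subset, 3))
```

The full dataset contains roughly 492 positives, while the subset has a positive rate of about 0.31%, nearly twice the overall rate. Results obtained on the first 30,000 rows are therefore not directly comparable to results on the full data.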

In [None]:
!pip install seaborn
from outlierutils import plot_top_N, plot_outlier_scores

In [None]:
cov_ = EmpiricalCovariance().fit(X)
# cov_ = MinCovDet().fit(X) # Robust estimation
# Note: .mahalanobis() returns squared Mahalanobis distances
mahalanobis_scores = cov_.mahalanobis(X)


In [None]:
# Log-scale the scores to compress the long right tail before plotting
mahalanobis_scores = np.log10(mahalanobis_scores)
res = plot_outlier_scores(y.values, mahalanobis_scores, bw=0.1, title='Mahalanobis')

In [None]:
res = plot_top_N(y.values, mahalanobis_scores, N=100)
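
For reference, the score computed by EmpiricalCovariance can be reproduced by hand as the squared distance (x - mu)^T Sigma^{-1} (x - mu). The sketch below checks this on synthetic data; the toy data is an assumption made for illustration.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

# Synthetic data (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))

cov_demo = EmpiricalCovariance().fit(X_demo)
sklearn_d2 = cov_demo.mahalanobis(X_demo)  # squared Mahalanobis distances

# Manual computation: (x - mu)^T Sigma^{-1} (x - mu) for every row
diff = X_demo - cov_demo.location_
manual_d2 = np.einsum("ij,jk,ik->i", diff, cov_demo.precision_, diff)

print(np.allclose(sklearn_d2, manual_d2))
```

Because the distance is measured relative to the estimated mean and covariance, a point counts as outlying when it is far from the center in units of the data's own spread, not in raw units.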

In [None]:
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=1) # try also 'spherical'
gmm.fit(X)
# score_samples() returns log-likelihoods; negate so that higher means more outlying
gmm_scores = -gmm.score_samples(X)

In [None]:
# gmm_scores = np.clip(gmm_scores, -15, 50)
# The +100 shift is meant to keep the argument of the logarithm positive
res = plot_outlier_scores(y.values, np.log10(gmm_scores + 100), bw=0.1, title='GMM negative log-likelihood')

In [None]:
res = plot_top_N(y.values, gmm_scores, N=100)
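
How the negative log-likelihood of a Gaussian mixture behaves as an outlier score can be sketched on synthetic data. Unlike the cells above, this illustration fits on clean toy data and scores separate new points; the data, the number of components, and the query points are all assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic training data: two well-separated clusters (made up)
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, size=(200, 2))
cluster_b = rng.normal(loc=5.0, size=(200, 2))
X_train = np.vstack([cluster_a, cluster_b])

gmm_demo = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm_demo.fit(X_train)

# Two points near the cluster centers and one point far from both
X_new = np.array([[0.0, 0.0], [5.0, 5.0], [20.0, -20.0]])

# Negative log-likelihood: higher means less likely under the fitted mixture
new_scores = -gmm_demo.score_samples(X_new)
print(new_scores)
```

The point far from both clusters receives by far the highest score, since the fitted mixture assigns it almost no probability density.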

## Assignment 3: Plot and compare results
