<a href="https://colab.research.google.com/github/DonErnesto/amld2020-unsupervised/blob/master/notebooks/exercises_1_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise: Demonstration of several algorithms on the Pen Digits dataset

Exercises using the Pen Digits Dataset: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/literature/PenDigits/PenDigits_v01.html

The data is already present in .pkl format (load the data below).
This data set was created by recording the writing pattern of digits on a digital writing pad. The digit "4" is downsampled to only 20 instances (instead of ~1000 for each of the other digits), making it the outlier class.

Note that, unlike MNIST, the features do not correspond to pixels: they are 8 subsampled coordinate pairs (16 features in total).

This dataset is small and simple: it has only numeric features and no NaNs.

# Package installation and data import

In [None]:
# Now only load the required files...
!curl -O https://raw.githubusercontent.com/amld/workshop-unsupervised-fraud/master/outlierutils.py
!curl -O https://raw.githubusercontent.com/amld/workshop-unsupervised-fraud/master/data/x_pendigits.pkl
!curl -O https://raw.githubusercontent.com/amld/workshop-unsupervised-fraud/master/data/y_pendigits.pkl




In [None]:
# Install pyod package
!pip install pyod

## Imports 

In [None]:
# %tensorflow_version 1.x
# standard library imports
import os
import sys
from collections import Counter

# pandas, seaborn etc.
import seaborn as sns
import sklearn 
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

# sklearn outlier models
from sklearn.neighbors import NearestNeighbors
# from sklearn.neighbors import LocalOutlierFactor
# from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# other sklearn functions
from sklearn.decomposition import PCA
from sklearn.covariance import MinCovDet, EmpiricalCovariance
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import scale as preproc_scale
from sklearn.manifold import TSNE

# pyod
import pyod
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.pca import PCA as pyod_PCA
from pyod.models.iforest import IForest



In [None]:
from outlierutils import plot_top_N, plot_outlier_scores # For easy plotting and evaluation

## Data loading

In [None]:
data_path = './'
x_pen = pd.read_pickle(os.path.join(data_path, 'x_pendigits.pkl'))
y_pen = pd.read_pickle(os.path.join(data_path, 'y_pendigits.pkl'))

# Scale and put again into a DataFrame
sc = StandardScaler()
x_pen = pd.DataFrame(data=sc.fit_transform(x_pen))

In [None]:
print('Number of points: {}'.format(len(y_pen)))
print('Number of positives: {} ({:.3%})'.format(y_pen.sum(), y_pen.mean()))

## Demo: Usage of plotting functions

The cells below show how to plot the conditional score distributions and the top-N ranking.

In [None]:
# example with random data
y_true_, scores_ = np.random.choice([0, 1], 100), np.random.uniform(size=100)
results = plot_outlier_scores(y_true=y_true_, 
                            scores=scores_, 
                            bw=0.05, 
                            title='Example.')

In [None]:
# The next plot shows the true labels of the N points with the highest outlier scores.
# More yellow is better!

results = plot_top_N(y_true=y_true_, scores=scores_, N=50)

Note that both `plot_outlier_scores` and `plot_top_N` expect numpy arrays. These can be obtained from a pandas Series through its `.values` attribute.
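As a minimal sketch of that conversion (the two Series here are made-up stand-ins, not the workshop data), `.values` and the equivalent `.to_numpy()` both return a plain numpy array:

```python
import numpy as np
import pandas as pd

# Hypothetical labels and scores, standing in for y_pen and a model's outlier scores
y_series = pd.Series([0, 0, 1, 0, 1], name='label')
score_series = pd.Series([0.1, 0.3, 0.9, 0.2, 0.7], name='score')

y_array = y_series.values              # numpy array holding the Series data
score_array = score_series.to_numpy()  # equivalent result for these Series

print(type(y_array).__name__, y_array.shape, score_array.shape)
```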


## Data visualization

t-SNE of large datasets may take a long time to compute. The next piece of code will downsample the negatives, while retaining all positives. 

In [None]:
N_downsample = 3000
assert x_pen.index.equals(y_pen.index), 'Error, indexes differ. Reset them to continue'
x_downsampled = pd.concat((x_pen[y_pen==0].sample(N_downsample - int(y_pen.sum()), random_state=1),
                           x_pen[y_pen==1]), 
                          axis=0).sample(frac=1, random_state=1)
y_downsampled = y_pen[x_downsampled.index]

#### Q 0. 
Reduce the dimensionality with t-SNE, and visualize the positive and negative classes in a scatter plot.
What do you observe?

**Hint**: To get help for a function or class, run `?<object>`  

In [None]:
?TSNE

In [None]:
MAX_N_TSNE = 4000  # Avoid overly long t-SNE computation times; values < 4000 are recommended
neg = y_downsampled == 0
pos = y_downsampled == 1

assert len(x_downsampled) <= MAX_N_TSNE, 'Using a dataset with more than {} points is not recommended'.format(
                                            MAX_N_TSNE)
#X_2D = TSNE(xxxx).fit_transform(x_downsampled) # transform to 2-D space for plotting
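A hedged sketch of the t-SNE call on synthetic data, in case the API is unfamiliar. The arrays `x_demo` and `y_demo` are invented stand-ins for `x_downsampled` and `y_downsampled`, and the parameter values are only one plausible choice, not a recommended setting for the Pen Digits data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(1)
# Synthetic stand-in: 200 "negatives" and 10 shifted "positives" with 16 features
x_demo = np.vstack([rng.normal(size=(200, 16)),
                    rng.normal(loc=4.0, size=(10, 16))])
y_demo = np.r_[np.zeros(200, dtype=int), np.ones(10, dtype=int)]

# perplexity must be smaller than the number of samples
X_2D_demo = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(x_demo)

# Boolean masks like these can be used to colour the two classes in a scatter plot
pos_demo = y_demo == 1
neg_demo = y_demo == 0
print(X_2D_demo.shape, pos_demo.sum(), neg_demo.sum())
```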


In [None]:
#fig, ax = plt.subplots(1, 1, figsize=(8, 8))
#ax.scatter(X_2D[pos, 0], X_2D[pos, 1], c=[[0.8, 0.4, 0.4],], marker='x', s=120, label='Positive')
#ax.scatter(X_2D[neg, 0], X_2D[neg, 1], c=[[0.2, 0.3, 0.9],], marker='o', s=10, label='Negative')

#plt.axis('off')
#plt.legend()
#plt.show() 

In [None]:
del x_downsampled, y_downsampled # To avoid using the wrong data later


## Mahalanobis Distance

Using `EmpiricalCovariance`, or `MinCovDet` (a robust estimator), do a `.fit()` to fit the covariance matrix. 
Determine the distance with `.mahalanobis()` and use this as outlier score. Get the Area Under the ROC-Curve, PR-curve and Precision@100 using `plot_outlier_scores` and `plot_top_N`

In [None]:
# cov_ = EmpiricalCovariance().fit(x_pen)


## GMM

Using `GaussianMixture`, with a reasonable value for `n_components` and `covariance_type='full'`, do a `.fit()` and `.score_samples()` to get the log-probability of each sample. The negative log-probability will be the outlier score.


In [None]:
# gmm = GaussianMixture(n_components=xxxx, covariance_type='full', random_state=1) # try also spherical
# gmm.fit(x_pen)
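A minimal sketch of the scoring step, assuming synthetic data with two inlier clusters and a handful of outliers placed between them. The data and the choice `n_components=2` are assumptions that fit this toy example only; they say nothing about a good value for the Pen Digits data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(1)
# Two inlier clusters and 10 outliers in the gap between them (synthetic)
x_demo = np.vstack([rng.normal(loc=-4.0, size=(500, 2)),
                    rng.normal(loc=4.0, size=(500, 2)),
                    rng.normal(loc=0.0, scale=0.3, size=(10, 2))])
y_demo = np.r_[np.zeros(1000, dtype=int), np.ones(10, dtype=int)]

gmm_demo = GaussianMixture(n_components=2, covariance_type='full', random_state=1)
gmm_demo.fit(x_demo)

# score_samples returns the log-likelihood per sample; negate it so that
# a higher value means "more outlying"
gmm_scores = -gmm_demo.score_samples(x_demo)
gmm_auc = roc_auc_score(y_demo, gmm_scores)
print('AUC-ROC: {:.3f}'.format(gmm_auc))
```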


#### Q 1. 
Which algorithm performed better, GMM or Mahalanobis? Why do you think that is?

## KNN algorithm


#### Q 2.
With the scikit-learn `NearestNeighbors` class, determine the probability that the nearest neighbour of a point is an outlier, conditional on the class membership of that point. Note how this corresponds to the observations from the t-SNE visualization.
 
**HINT**: 
- *The output of clf.kneighbors() (after doing clf.fit()) is a tuple: the first element is an array of distances, the second an array of neighbour indices. The first column corresponds to the points themselves. The code provided in the next cell may be used to get the index of the nearest neighbour of every point.*


In [None]:
# clf_nn = NearestNeighbors(n_neighbors=10)
# clf_nn.fit(x_pen)
# nearest_1st_n = clf_nn.kneighbors(x_pen)[1][:, 1] # to get the indices of the first nearest neighbour for each point

## Next, get the conditional mean of the neighbours' labels, y_pen.values[nearest_1st_n], by subsetting with y_pen==1 or y_pen==0
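The conditional mean described above can be sketched on synthetic data as follows. The arrays `x_demo` and `y_demo` are invented for illustration, with the outliers deliberately placed in their own small cluster, so the resulting probabilities only describe this toy setup.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
# 500 inliers around the origin, 20 outliers in a tight, distant cluster (synthetic)
x_demo = np.vstack([rng.normal(size=(500, 2)),
                    rng.normal(loc=8.0, scale=0.5, size=(20, 2))])
y_demo = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]

nn_demo = NearestNeighbors(n_neighbors=2).fit(x_demo)
# Column 0 is the query point itself, column 1 its nearest other point
nearest_idx = nn_demo.kneighbors(x_demo)[1][:, 1]
neighbour_label = y_demo[nearest_idx]

# P(nearest neighbour is an outlier | class of the point)
p_given_pos = neighbour_label[y_demo == 1].mean()
p_given_neg = neighbour_label[y_demo == 0].mean()
print('P(nn outlier | outlier) = {:.2f}'.format(p_given_pos))
print('P(nn outlier | inlier)  = {:.2f}'.format(p_given_neg))
```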

Use pyod's KNN class to detect the outliers. 
Use `method='median'`, and guess a reasonable value for `n_neighbors` based on the insights from t-SNE and the previous question.

Plot the conditional score curves and the top-100 results, using `plot_outlier_scores` and `plot_top_N`. 



#### Q 3.



Vary `n_neighbors`. How does it affect AUC-ROC, AUC-PR and precision@100?

**HINT**: 
- *Use clf.decision_scores_ (an attribute of all pyod detectors) to get the outlier scores, not the binary labels, which are stored in clf.labels_*

In [None]:
# clf = KNN(method='median', n_neighbors=xxx)
# clf.fit(x_pen)
# get the prediction label and outlier scores of the training data
# y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
# y_train_scores = clf.decision_scores_  # raw outlier scores (use these for scoring!)
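To illustrate what a median-distance kNN score looks like and how it reacts to `n_neighbors`, here is a sketch that uses scikit-learn's `NearestNeighbors` instead of pyod, so it is an approximation of the idea rather than pyod's exact implementation. The synthetic data, with 20 clustered outliers, and the tried values of k are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
# 500 inliers and a tight cluster of 20 outliers (synthetic)
x_demo = np.vstack([rng.normal(size=(500, 2)),
                    rng.normal(loc=6.0, scale=0.3, size=(20, 2))])
y_demo = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]

auc_by_k = {}
for k in (5, 20, 50):
    nn_demo = NearestNeighbors(n_neighbors=k).fit(x_demo)
    # Called without arguments, kneighbors() returns the neighbours of the
    # training points and does not count a point as its own neighbour
    dist, _ = nn_demo.kneighbors()
    knn_scores = np.median(dist, axis=1)  # median distance to the k neighbours
    auc_by_k[k] = roc_auc_score(y_demo, knn_scores)

for k, auc in auc_by_k.items():
    print('k={:3d}  AUC-ROC={:.3f}'.format(k, auc))
```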

## LOF 

Note that the LOF algorithm computes the "reachability density" of a point from its k nearest neighbours, and compares it with the reachability densities of those same neighbours.

Here too we will use a pyod class, so the same method and attribute names apply.
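A small sketch of LOF scoring, using scikit-learn's `LocalOutlierFactor` rather than the pyod wrapper so that the example needs only scikit-learn. The synthetic data has a few isolated outliers, which is an assumption made for illustration and differs from the clustered outliers seen earlier.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)
# 500 inliers plus 5 isolated points far from the bulk (synthetic)
x_in = rng.normal(size=(500, 2))
x_out = np.array([[6.0, 6.0], [-6.0, 6.0], [6.0, -6.0], [-6.0, -6.0], [0.0, 9.0]])
x_demo = np.vstack([x_in, x_out])
y_demo = np.r_[np.zeros(500, dtype=int), np.ones(5, dtype=int)]

lof_demo = LocalOutlierFactor(n_neighbors=20)
lof_demo.fit(x_demo)

# negative_outlier_factor_ is the negated LOF; flip the sign so that
# higher means more outlying (values near 1 indicate inliers)
lof_scores = -lof_demo.negative_outlier_factor_
lof_auc = roc_auc_score(y_demo, lof_scores)
print('AUC-ROC: {:.3f}'.format(lof_auc))
```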


#### Q 4.
Considering the concept that underlies LOF, and the results from the t-SNE/nearest-neighbour analysis, do you expect LOF to do better than KNN with `n_neighbors=10`? Why?

#### Q 5.
Plot the scoring curves for a few values of `n_neighbors`. What is a good value?


In [None]:
# lof = LOF(n_neighbors=n_neighbours, contamination=0.01)




## Isolation Forest


#### Q 6. 
What disadvantage of Isolation Forest do you see compared to the previous algorithms?

Run an `IsolationForest` analysis with a reasonable set of parameters.


In [None]:
#ifo = IForest(behaviour='new',n_estimators=10, max_samples=512)
#ifo.fit(xxx)

#### (optional question)
Compare the scatter in AUC scores by running the Isolation Forest ten times with different `random_state` values, for `n_estimators=100`.
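One way to set up such a comparison is sketched below, using scikit-learn's `IsolationForest` directly instead of the pyod wrapper, and synthetic data instead of `x_pen`. The `max_samples=256` setting and the data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(1)
# 500 inliers and 20 shifted outliers with 4 features (synthetic)
x_demo = np.vstack([rng.normal(size=(500, 4)),
                    rng.normal(loc=5.0, size=(20, 4))])
y_demo = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]

aucs = []
for seed in range(10):
    ifo_demo = IsolationForest(n_estimators=100, max_samples=256, random_state=seed)
    ifo_demo.fit(x_demo)
    # score_samples is lower for more abnormal points, so negate it
    ifo_scores = -ifo_demo.score_samples(x_demo)
    aucs.append(roc_auc_score(y_demo, ifo_scores))

print('AUC-ROC mean: {:.4f}, std: {:.4f}'.format(np.mean(aucs), np.std(aucs)))
```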

## PCA reconstruction error

#### Q 7.
What number of components would you estimate to be sufficient? How may it be determined?
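One common heuristic, shown here only as a sketch, is to look at the cumulative explained variance ratio and pick the smallest number of components that exceeds a chosen threshold. The 95% threshold and the synthetic data (16 features generated from 3 latent factors) are assumptions; other criteria are equally defensible.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
# Synthetic data: 16 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 16))
x_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 16))

pca_full = PCA().fit(x_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%
n_components_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(np.round(cum_var[:5], 3), n_components_95)
```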

#### Q 8.
Determine the Euclidean reconstruction error, by first transforming the data, and then applying the inverse transform. What scores do you get?

**Use the scikit-learn implementation rather than pyod's PCA class, which does not seem to be implemented well.**


In [None]:
# pca = PCA(n_components=...)
# pca_tf = ... 
# x_pen_recon = ...

In [None]:
# pca_recon = ((x_pen - x_pen_recon)**2).mean(axis=1)
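The transform-then-inverse-transform pattern can be sketched on synthetic data as follows. The data (inliers close to a 3-dimensional subspace, outliers scattered off it) and `n_components=3` are assumptions chosen so that the toy example works; they are not a suggestion for the Pen Digits data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(1)
# Inliers lie near a 3-D subspace of a 16-D space; outliers do not (synthetic)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 16))
x_in = latent @ mixing + 0.1 * rng.normal(size=(500, 16))
x_out = rng.normal(scale=2.0, size=(20, 16))
x_demo = np.vstack([x_in, x_out])
y_demo = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]

pca_demo = PCA(n_components=3).fit(x_demo)
x_tf = pca_demo.transform(x_demo)            # project onto the components
x_recon = pca_demo.inverse_transform(x_tf)   # map back to the original space

# Mean squared reconstruction error per row, used as the outlier score
recon_error = ((x_demo - x_recon) ** 2).mean(axis=1)
recon_auc = roc_auc_score(y_demo, recon_error)
print('AUC-ROC: {:.3f}'.format(recon_auc))
```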


## Autoencoder reconstruction error

Run the autoencoder with a bottleneck size that worked well for PCA. Run for ~10-15 epochs and look at AUC score. 
Many different configurations (number of hidden layers, number of neurons) may be used, but pick one. 



#### Q 9. 
What output activation do you think will work best? 



In [None]:
clf = AutoEncoder(
    hidden_neurons=[xx, xx, xx], # Choose size here!
    hidden_activation='elu',
    output_activation='xx', # Choose an activation ('linear', 'sigmoid', 'relu', 'elu' are some possibilities)
    optimizer='adam',
    epochs=15,
    batch_size=16,
    dropout_rate=0.0, #may not be needed here
    l2_regularizer=0.0,
    validation_size=0.1,
    preprocessing=False, #NB: this uses sklearn's StandardScaler
    verbose=1,
    random_state=1,
)


In [None]:
# clf.fit(x_pen)


#### Q 10.

- Which algorithm performed best?
- Can it be reasonably "tuned" without having the labels available?


#### (Optional, if time permits)

Run the next cell, which replaces the original outliers with synthesized ones (combining the first 8 and last 8 features of randomly chosen observations), and run the analysis again. What differences do you observe?


In [None]:
new_values = pd.concat((x_pen.sample(20, random_state=1).iloc[:, :8].reset_index(drop=True),
                        x_pen.sample(20, random_state=2).iloc[:, 8:].reset_index(drop=True)),
                 axis=1).set_index(y_pen[y_pen==1].index)
x_pen.loc[y_pen==1, :] = new_values

In [None]:
import tensorflow as tf

In [None]:
tf.__version__