# Outlier Detection (Gaussian Toy Example)

In this notebook, we demonstrate the basics of outlier detection as an approach to anomaly detection.
We use simple Gaussian toy data to illustrate the core concepts.

In the outlier detection version of anomaly detection, we train a model to learn what our background looks like
and then classify events as anomalous based on how 'dissimilar' they look compared to the background.

Essentially this means we are defining events that have low probability density under the background to be anomalous. 
In this case, we are generating our own Gaussian toy data, so we know the true probability distribution of the background.
However, in realistic physics examples this is usually not the case. 
One must therefore train a machine learning model to learn the background probability distribution, or an equivalent proxy, from a sample of background events.
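
As a minimal sketch of the idea that low background density means high anomaly score, the snippet below uses a separate 2D toy setup (the names `toy_bkg_pdf`, `toy_bkg` and `toy_sig` and all parameter values are illustrative assumptions, not the dataset used later in this notebook). It takes the negative log-density under the known background as the score.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical 2D background: standard normal, so the true density is known
toy_bkg_pdf = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))
toy_bkg = toy_bkg_pdf.rvs(size=1000, random_state=rng)

# Hypothetical "signal" placed far from the bulk of the background
toy_sig_pdf = multivariate_normal(mean=[3.0, 3.0], cov=0.1 * np.eye(2))
toy_sig = toy_sig_pdf.rvs(size=100, random_state=rng)

# Low density under the background <=> high anomaly score
score_bkg = -toy_bkg_pdf.logpdf(toy_bkg)
score_sig = -toy_bkg_pdf.logpdf(toy_sig)

print(f"median score, background: {np.median(score_bkg):.2f}")
print(f"median score, signal:     {np.median(score_sig):.2f}")
```

Any monotonically decreasing function of the density would rank events the same way; the negative log is chosen here only because it is numerically convenient.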

One common proxy used to learn the background distribution is a type of neural network called an autoencoder.
Autoencoders do not directly learn the probability distribution. Instead, they are trained to take the input data, compress it down into some smaller representation
and decompress it back out to recover the original inputs. The idea is that forcing the model to compress the data also forces it to learn the data's underlying structure.
If the model is trained only on background events, it should hopefully learn how to do this compression task for background events but not for signal events.
Therefore, there should be a larger difference between the model input and output on signal events. 
This difference, called the reconstruction loss, can therefore be used as an anomaly score.
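
To make the compress-then-reconstruct idea concrete without a neural network, here is a rough sketch that swaps in PCA as a linear stand-in for an autoencoder. This is not the `Autoencoder` class used below; the data, the 2D latent size and the helper `reconstruction_error` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical background: 5 features driven by 2 hidden factors plus small noise
mixing = rng.normal(size=(2, 5))
background = rng.normal(size=(5000, 2)) @ mixing + 0.05 * rng.normal(size=(5000, 5))

# Hypothetical anomalies: unstructured noise that ignores the hidden factors
anomalies = rng.normal(size=(500, 5))

# "Train" on background only: keep the top 2 principal directions
mean = background.mean(axis=0)
_, _, vt = np.linalg.svd(background - mean, full_matrices=False)
components = vt[:2]  # shape (2, 5): the linear encoder/decoder weights


def reconstruction_error(x):
    """Mean squared error between x and its compressed-then-decompressed copy."""
    latent = (x - mean) @ components.T  # encode to 2 dimensions
    recon = latent @ components + mean  # decode back to 5 dimensions
    return np.mean((x - recon) ** 2, axis=1)


bkg_score = reconstruction_error(background)
anom_score = reconstruction_error(anomalies)
print(f"mean reconstruction error, background: {bkg_score.mean():.4f}")
print(f"mean reconstruction error, anomalies:  {anom_score.mean():.4f}")
```

The compression only helps here because the toy background has lower-dimensional structure to exploit; a neural autoencoder can additionally capture nonlinear structure.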

Note that, unlike weak supervision, we expect this type of model to always be worse than a supervised classifier because it never sees signal events during training.
However, it can usually be trained more easily (one only needs a sample of background events), and its performance is stable rather than varying with the amount of signal present.

In [None]:
!pip install vector scikit-learn==1.4.0

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy
import sys

from os.path import exists, join, dirname, realpath
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# adding parent directory to path
parent_dir = dirname(realpath(globals()["_dh"][0]))
sys.path.append(parent_dir)

from sk_cathode.generative_models.autoencoder import Autoencoder
from sk_cathode.classifier_models.neural_network_classifier import NeuralNetworkClassifier
from sk_cathode.utils.evaluation_functions import plot_roc_and_sic

In [None]:
# :sunglasses:
plt.style.use('dark_background')

In [None]:
# Pick the dimensionality of our dataset
n_dim = 10  # How many total dimensions of our data
n_signal_dim = 2  # How many dimensions of our signal are different from background

# Background is multi-dim Gaussian with zero mean, diagonal covariance of one
bkg_means = np.array([0.]*n_dim)
bkg_vars = np.ones(n_dim)
bkg_cov = np.diag(bkg_vars)
bkg_pdf = scipy.stats.multivariate_normal(bkg_means, bkg_cov)

# Signal is multi-dim Gaussian centered at 2.5 (variance 0.1) in the 'signal-like' dimensions and at 0 (variance 1) in the bkg-like dimensions
sig_means = np.array([2.5] * n_signal_dim + [0.] * (n_dim - n_signal_dim))
sig_vars = np.array(n_signal_dim * [0.1] + [1.0]* (n_dim - n_signal_dim))
sig_cov = np.diag(sig_vars)
sig_pdf = scipy.stats.multivariate_normal(sig_means, sig_cov)

In [None]:
# Data for training of autoencoder classifier
n_bkg = 100000
bkg_events_train = bkg_pdf.rvs(size=n_bkg)

# Data for training of supervised classifier
n_sup = 10000
sig_events_sup = sig_pdf.rvs(size=n_sup)
bkg_events_sup = bkg_pdf.rvs(size=n_sup)

x_sup = np.append(sig_events_sup, bkg_events_sup, axis=0)
y_sup = np.append(np.ones(n_sup, dtype=np.int8), np.zeros(n_sup, dtype=np.int8))

x_sup, y_sup = shuffle(x_sup, y_sup, random_state=42)
x_sup_train, x_sup_val, y_sup_train, y_sup_val = train_test_split(x_sup, y_sup, test_size=0.2, random_state=42)

# Data for testing
n_test = 50000
sig_events_test = sig_pdf.rvs(size=n_test//10)
bkg_events_test = bkg_pdf.rvs(size=n_test)

x_test = np.append(sig_events_test, bkg_events_test, axis=0)
y_test = np.append(np.ones(n_test//10, dtype=np.int8), np.zeros(n_test, dtype=np.int8))

In [None]:
# Simple scatter plot of the first two dimensions of our data, background is in blue, signal is in orange
plt.figure(figsize = (5, 5))
plt.scatter(bkg_events_test[:, 0], bkg_events_test[:, 1], s=0.3, color='C0')
plt.scatter(sig_events_test[:, 0], sig_events_test[:, 1], s=0.3, color='C1')
plt.gca().set_aspect(1.)
plt.xlabel(r'$x_1$', fontsize=16)
plt.ylabel(r'$x_2$', fontsize=16)
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.show()

In [None]:
# AE model

#TODO: Pick size of compressed representation (latent)

latent_size =
#TODO: What should the sizes of each layer of our network be? 
layers_sizes = 

# You can choose fewer epochs if you have limited resources; the performance difference is not huge
epochs = 100

ae_model = Autoencoder(n_inputs=n_dim, 
                       layers=layers_sizes, 
                       val_split=0.1,
                       early_stopping=True, 
                       epochs=epochs, verbose=True)
ae_model.fit(bkg_events_train)

To see how well our autoencoder does as a classifier, we need to evaluate it on example signal and background events.
An autoencoder is not naturally a classifier, so in this case we use the mean squared error (input - output)^2 as an 'anomaly score' for each event.
This is implemented in the Autoencoder class with the `predict_proba` method.

Though it is not directly a classification probability as one would get from a classifier, it has similar properties (higher values mean more signal-like),
so we can directly use it for metrics like ROC curves and AUCs.
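
The reason an unnormalized score works for these metrics is that ROC curves and AUCs depend only on how events are ranked. A small sketch with made-up scores (the exponential distributions and sample sizes are arbitrary assumptions, not output of the autoencoder):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical anomaly scores: signal events tend to score higher
labels = np.concatenate([np.zeros(1000, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.exponential(1.0, size=1000),
                         rng.exponential(3.0, size=100)])

# A strictly increasing transformation keeps the ranking, hence the AUC
auc_raw = roc_auc_score(labels, scores)
auc_log = roc_auc_score(labels, np.log(scores))

print(f"AUC with raw scores: {auc_raw:.3f}")
print(f"AUC with log scores: {auc_log:.3f}")
```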

In [None]:
# Do a quick check of the performance of the autoencoder as a classifier 

# predict_proba method of the Autoencoder computes MSE loss which we use as 'anomaly score' for each event 

# TODO: What should our predicted label be for the autoencoder?
y_test_ae = 
auc_ae = roc_auc_score(y_test, y_test_ae)

print(f"AE AUC {auc_ae:.3f}")

To gauge how well our autoencoder is doing, we can compare it to several benchmarks.

First, we can see how well evaluating the true background PDF would do. We expect this to be similar to a well-performing autoencoder.
We also train a supervised classifier and compute the likelihood ratio. We expect these to be significantly better than the outlier detection methods.
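
One way to see why signal-aware methods can do better: an outlier score only helps when the signal sits where the background is sparse. The sketch below uses a deliberately different 1D toy (a narrow hypothetical signal inside the bulk of the background; all parameters are assumptions chosen for illustration) and compares a background-density score with the likelihood ratio.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Hypothetical 1D toy: narrow signal sitting inside the bulk of the background
toy_bkg = norm(loc=0.0, scale=1.0)
toy_sig = norm(loc=0.5, scale=0.2)

x = np.concatenate([toy_bkg.rvs(size=10000, random_state=rng),
                    toy_sig.rvs(size=1000, random_state=rng)])
y = np.concatenate([np.zeros(10000, dtype=int), np.ones(1000, dtype=int)])

# Outlier score: low background density means more anomalous
density_score = -toy_bkg.logpdf(x)
# Log likelihood ratio: uses knowledge of the signal as well
log_ratio = toy_sig.logpdf(x) - toy_bkg.logpdf(x)

auc_density = roc_auc_score(y, density_score)
auc_ratio = roc_auc_score(y, log_ratio)
print(f"background-density score AUC: {auc_density:.3f}")
print(f"likelihood ratio AUC:         {auc_ratio:.3f}")
```

In this toy the density score does worse than random guessing, because the signal lies in a high-density region of the background. The dataset in this notebook is friendlier to outlier detection, since its signal sits in the background tails.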

In [None]:
# Do a quick check of the performance of the true bkg pdf as a classifier
# First evaluate P_bkg(x) using our PDF
x_test_bkg_pdf = bkg_pdf.pdf(x_test)

# TODO: What should we use as the classifier score for the bkg pdf?
y_test_pdf = 

auc_pdf = roc_auc_score(y_test, y_test_pdf)
print(f"bkg PDF AUC {auc_pdf:.3f}")

In [None]:
# Use the exact likelihood ratio to get the optimal performance
# TODO: What should we use to compute the likelihood ratio?
x_test_sig_pdf = 
likelihood_ratio = 

auc_ratio = roc_auc_score(y_test, likelihood_ratio)
print(f"likelihood ratio AUC {auc_ratio:.3f}")

In [None]:
# Train a supervised model for comparison

sup_model = NeuralNetworkClassifier(n_inputs=n_dim,
                                    early_stopping=True, epochs=50,
                                    verbose=True)
sup_model.fit(x_sup_train, y_sup_train, x_sup_val, y_sup_val)

In [None]:
# Do a quick check of the performance of the supervised classifier 
y_test_sup = sup_model.predict(x_test)
auc_sup = roc_auc_score(y_test, y_test_sup)

print(f"Supervised AUC {auc_sup:.3f}")

In [None]:
plot_roc_and_sic(y_test,
                 [likelihood_ratio, y_test_sup, y_test_ae, y_test_pdf],
                 labels = ['Likelihood Ratio', 'Supervised', 'Autoencoder',  'True Background PDF'], sic_max = 20)
plt.show()

We can see that the outlier detection methods, whether based on the true background PDF or the autoencoder, fall below the sensitivity of a supervised classifier.
However, both successfully enhance the sensitivity to the signal, by factors greater than ~2.
The autoencoder performance is reasonably close to that of the true background PDF, which is encouraging.
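
The sensitivity enhancement quoted here is commonly quantified by the significance improvement characteristic (SIC). Assuming the usual definition, SIC = TPR / sqrt(FPR), it can be sketched from `roc_curve` as below. The scores are made up for illustration, and this is not necessarily how `plot_roc_and_sic` computes it internally.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)

# Hypothetical classifier scores: background around 0, signal around 2
labels = np.concatenate([np.zeros(50000, dtype=int), np.ones(5000, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, size=50000),
                         rng.normal(2.0, 1.0, size=5000)])

fpr, tpr, _ = roc_curve(labels, scores)

# SIC = signal efficiency / sqrt(background efficiency); skip FPR = 0
mask = fpr > 0
sic = tpr[mask] / np.sqrt(fpr[mask])

print(f"max SIC: {sic.max():.2f}")
print(f"SIC with no cut applied (TPR = FPR = 1): {sic[-1]:.2f}")
```

A SIC above 1 means that cutting on the score improves the expected significance relative to applying no cut at all. Values at very small FPR are dominated by statistical fluctuations, since only a handful of background events remain.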

Feel free to explore how the results change if you vary the latent size of the autoencoder, or change the signal or background PDFs.
You could also compare the autoencoder performance to that of weak supervision on the same dataset.