This notebook shows how to import a semisupervised anomaly detection algorithm into the `oab` framework and evaluate it.
After making `oab` importable, we look at what such an algorithm can look like and how its performance is evaluated.

In [1]:
import sys
sys.path.append('../..')

%load_ext autoreload
%autoreload 2

In [2]:
# download example algorithm and inspect content
import wget
wget.download('https://raw.githubusercontent.com/jandeller/test/main/RandomGuesserSemisupervised.py', "RandomGuesserSemisupervised.py")
!cat RandomGuesserSemisupervised.py

import numpy as np

class RandomGuesserSemisupervised():

    def fit(self, X_train):
        pass
      
    def decision_function(self, X_test):
        "Assign a random number to each sample from the test set"
        n_samples = X_test.shape[0]
        return np.random.randn(n_samples)


The sample `RandomGuesserSemisupervised` algorithm shown here is, as the name suggests, a random guesser: it ignores the training data and assigns a random anomaly score to each sample in the test set.

An algorithm used for semisupervised anomaly detection needs to specify a `fit(X_train)` method for training and a `decision_function(X_test)` method for inference that returns an anomaly score per data point in the test set.
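To make the interface concrete, here is a minimal sketch of a detector that actually learns from the training data. `CentroidDistanceDetector` is a hypothetical name used only for illustration and is not part of `oab`. It assumes that `X_train` contains mostly normal samples and that higher scores mean "more anomalous".

```python
import numpy as np


class CentroidDistanceDetector:
    """Hypothetical example: score = standardized distance to the training mean."""

    def fit(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        self.mean_ = X_train.mean(axis=0)
        std = X_train.std(axis=0)
        # avoid division by zero for constant features
        self.std_ = np.where(std > 0, std, 1.0)
        return self

    def decision_function(self, X_test):
        # one anomaly score per test sample; larger = further from the training data
        z = (np.asarray(X_test, dtype=float) - self.mean_) / self.std_
        return np.linalg.norm(z, axis=1)


# synthetic data: five normal test points followed by one obvious outlier
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_test = np.vstack([rng.normal(size=(5, 3)), np.full((1, 3), 8.0)])
scores = CentroidDistanceDetector().fit(X_train).decision_function(X_test)
```

Because it offers the same two methods, such a class could be used in place of the random guesser in the evaluation loop.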

It is of course possible to rename these methods or to expose the anomaly scores in a different way. If you do, the evaluation code below has to be changed accordingly. Adhering to the conventions described above (`fit(X_train)` and `decision_function(X_test)`) gives you the same interface as algorithms from [`PyOD`](https://pyod.readthedocs.io/en/latest/), as shown when [comparing algorithms using `oab`](https://colab.research.google.com/drive/1aV_itaYCJgzdZ1lQ7SUyHQ7z01xSPxDN?usp=sharing#scrollTo=QnAfCGTGL7xv).
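If an existing algorithm uses different method names, a thin wrapper may be simpler than changing the evaluation code. The sketch below is only an illustration: `LegacyDetector`, its `train` and `normality` methods, and `DetectorAdapter` are all hypothetical names. It also assumes, as in PyOD, that higher scores mean "more anomalous".

```python
import numpy as np


class LegacyDetector:
    """Hypothetical detector with non-standard method names."""

    def train(self, X):
        self.center_ = np.asarray(X, dtype=float).mean(axis=0)

    def normality(self, X):
        # higher value = more normal, i.e. the opposite of an anomaly score
        diff = np.asarray(X, dtype=float) - self.center_
        return -np.linalg.norm(diff, axis=1)


class DetectorAdapter:
    """Hypothetical wrapper exposing fit(X_train) / decision_function(X_test)."""

    def __init__(self, model):
        self.model = model

    def fit(self, X_train):
        self.model.train(X_train)
        return self

    def decision_function(self, X_test):
        # flip the sign so that higher scores mean "more anomalous"
        return -self.model.normality(X_test)


X_train = np.zeros((10, 2))
X_test = np.array([[0.0, 0.0], [3.0, 4.0]])
adapted_scores = DetectorAdapter(LegacyDetector()).fit(X_train).decision_function(X_test)
```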

In [3]:
# import objects/functions from oab
from oab.data.load_dataset import load_dataset
from oab.evaluation import EvaluationObject

# and import the RandomGuesser
from RandomGuesserSemisupervised import RandomGuesserSemisupervised

In [4]:
# load the wilt dataset in semisupervised mode
wilt = load_dataset('wilt', semisupervised=True)

# sampling parameters
training_split = 0.7
max_contamination_rate = 0.5
n_steps = 10

Credits: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


In [5]:
# evaluate the random guesser
eval_obj = EvaluationObject("Random")

for (X_train, X_test, y_test), settings in wilt.sample_multiple_with_training_split(training_split=training_split, 
                                                                 max_contamination_rate=max_contamination_rate, 
                                                                 n_steps=n_steps):
    rg = RandomGuesserSemisupervised()
    rg.fit(X_train)  # fit the random guesser on the training data
    pred = rg.decision_function(X_test)  # compute anomaly scores for the test set
    eval_obj.add(y_test, pred, settings)
_ = eval_obj.evaluate()

Evaluation on dataset wilt with normal labels ['n'] and anomaly labels ['w'].
Total of 10 datasets. Per dataset:
3193 training instances, 1626 test instances, training contamination rate 0.0, test contamination rate 0.15805658056580565.
Mean 	 Std_dev 	 Metric
0.501 	 0.026 		 roc_auc
0.161 	 0.013 		 average_precision
0.004 	 0.015 		 adjusted_average_precision


As one would expect, the results are not better than random: the ROC AUC of 0.501 and the adjusted average precision of 0.004 both lie within one standard deviation of the values expected for a random guess (0.5 and 0, respectively).
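The expected values for a random guess can be checked with a small simulation. The sketch below uses only NumPy and does not call `oab`. The test set size (1626) comes from the output above, and the 257 anomalies are inferred from the reported test contamination rate. The formula for adjusted average precision, `(AP - p) / (1 - p)` with contamination rate `p`, is an assumption made for illustration and may differ from the definition `oab` uses.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) statistic; assumes no tied scores."""
    y_true = np.asarray(y_true, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which an anomaly is found."""
    y_sorted = np.asarray(y_true)[np.argsort(-scores)]
    precision_at_k = np.cumsum(y_sorted) / np.arange(1, len(y_sorted) + 1)
    return precision_at_k[y_sorted == 1].mean()


n_test, n_anomalies = 1626, 257  # 257 / 1626 matches the reported contamination rate
p = n_anomalies / n_test
y_test = np.zeros(n_test, dtype=int)
y_test[:n_anomalies] = 1

rng = np.random.default_rng(0)
aucs, aps = [], []
for _ in range(200):
    random_scores = rng.standard_normal(n_test)  # same idea as the random guesser
    aucs.append(roc_auc(y_test, random_scores))
    aps.append(average_precision(y_test, random_scores))

mean_auc = float(np.mean(aucs))
mean_ap = float(np.mean(aps))
# ASSUMPTION: chance-corrected AP, not necessarily oab's exact definition
mean_adjusted_ap = (mean_ap - p) / (1 - p)
print(mean_auc, mean_ap, mean_adjusted_ap)
```

Under these assumptions, the mean ROC AUC should land close to 0.5, the mean average precision close to the contamination rate, and the adjusted value close to 0.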