This notebook shows how `oab` can be used to make results easily reproducible, from preprocessing through converting the dataset into an anomaly dataset to sampling from that dataset.
If custom preprocessing is applied, it can be made reproducible just as easily. The notebook `Reproducibility2.ipynb` focuses on that.

This notebook is divided into two parts:
1. Making an experiment that is to be reproduced.
2. Reproducing the experiment.

`yaml` files play an integral role in making reproducibility work, as they store the operations performed on the data together with their parameters.
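To illustrate the idea of storing operations and replaying them, here is a minimal, library-independent sketch. It is not how `oab` is implemented: the class `RecordedDataset` and its methods are hypothetical, and `json` from the standard library stands in for `yaml` so that the example has no dependencies. Only the key names `standard_functions`, `name` and `parameters` are taken from the `config.yaml` shown later in this notebook.

```python
import json
import statistics


class RecordedDataset:
    """Hypothetical stand-in for a dataset that logs its preprocessing steps."""

    def __init__(self, columns):
        # columns: one inner list of floats per column
        self.columns = [list(col) for col in columns]
        self.operations = []

    def standardize_columns(self):
        # Scale each column to zero mean and unit (population) standard deviation.
        standardized = []
        for col in self.columns:
            mean = statistics.fmean(col)
            std = statistics.pstdev(col) or 1.0  # constant column: avoid dividing by zero
            standardized.append([(v - mean) / std for v in col])
        self.columns = standardized
        # Record what was done so it can be replayed later.
        self.operations.append({"name": "standardize_columns", "parameters": {}})

    def dump_operations(self):
        # Serialize the recorded operations (json here, yaml in oab).
        return json.dumps({"standard_functions": self.operations})

    def replay(self, serialized):
        # Look up each recorded operation by name and call it with its parameters.
        for op in json.loads(serialized)["standard_functions"]:
            getattr(self, op["name"])(**op["parameters"])


raw = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]

original = RecordedDataset(raw)
original.standardize_columns()
config = original.dump_operations()

# A fresh dataset built from the same raw data ends up in the same state.
reproduced = RecordedDataset(raw)
reproduced.replay(config)
```

Because the recorded file, not the notebook code, drives the second run, anyone with the raw data and the file can repeat the preprocessing exactly.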

In [1]:
import sys
sys.path.append('../..')

%load_ext autoreload
%autoreload 2

# 1

In [2]:
# other imports
import pandas as pd

from oab.data.classification_dataset import ClassificationDataset
from oab.data.unsupervised import UnsupervisedAnomalyDataset
from oab.data.load_dataset import load_dataset
from oab.evaluation import EvaluationObject, ComparisonObject

from pyod.models.knn import KNN

# The wilt dataset is loaded here, but the return value is not used:
# the cells below show how to work with any dataset, not only preinstalled ones.
_ = load_dataset("wilt")

Credits: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


In [3]:
# Load wilt dataset and split into x and y values
w = pd.read_csv("datasets/wilt/wilt.csv")
y_wilt = w['class']
x_wilt = w.iloc[:, 1:].values

# create a ClassificationDataset object
cd = ClassificationDataset(x_wilt, y_wilt, name="WILT")
# perform a preprocessing operation
cd.standardize_columns()
# this operation is now written to `config.yaml`; feel free to check the content of this file
cd.write_operations_to_yaml("config.yaml")

In [4]:
# now, we make an anomaly dataset by specifying which labels are normal labels
# again, check the yaml for what has changed
# note that it's also possible to write this information to a new yaml file using the parameter `yamlpath_new`
ad = UnsupervisedAnomalyDataset(cd, normal_labels=['n'], yamlpath_append='config.yaml')

In [5]:
# run experiment
# again, check the yaml for what has changed. Just as above, alternatively use `yamlpath_new`
eval_obj = EvaluationObject(algorithm_name="kNN")
for (x, y), sample_config in ad.sample_multiple(n=50, n_steps=10, contamination_rate=0.1, yamlpath_append='config.yaml'):
# for (x, y), sample_config in ad.sample_multiple(n=50, n_steps=10, contamination_rate=0.1, yamlpath_new='sampling.yaml'):
    algo = KNN()
    algo.fit(x)
    pred = algo.decision_scores_
    eval_obj.add(ground_truth=y, prediction=pred, description=sample_config)

_ = eval_obj.evaluate()

Evaluation on dataset WILT with normal labels ['n'] and anomaly labels ['w'].
Total of 10 datasets. Per dataset:
50 instances, contamination_rate 0.1.
Mean 	 Std_dev 	 Metric
0.332 	 0.036 		 roc_auc
0.086 	 0.004 		 average_precision
-0.015 	 0.005 		 adjusted_average_precision


In `config.yaml`, we now see the sampling parameters for `"unsupervised_multiple"`. If sampling is done in a different scenario, e.g., semisupervised multiple, this would also be stored in `config.yaml` using a different key in the `sampling` dict.

In [6]:
!cat config.yaml

standard_functions:
- name: standardize_columns
  parameters:
    cols_to_standardize:
anomaly_dataset:
  arguments:
    normal_labels:
    - n
    anomaly_labels:
sampling:
  unsupervised_multiple:
    n: 50
    n_steps: 10
    contamination_rate: 0.1
    shuffle: true
    random_seed: 42
    apply_random_seed: true
    keep_frequency_ratio_normals: false
    equal_frequency_normals: false
    keep_frequency_ratio_anomalies: false
    equal_frequency_anomalies: false
    include_description: true
    flatten_images: true
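The `sampling` block above records `random_seed: 42` next to `n`, `n_steps` and `contamination_rate`. The following sketch shows why storing the seed together with these parameters is enough to get identical samples on a second run. It is an illustration under stated assumptions, not `oab`'s sampling code: the function `sample_multiple` below is a hypothetical stand-alone helper, the data is synthetic, and how `oab` actually uses the seed across steps is not shown in this notebook.

```python
import random


def sample_multiple(normals, anomalies, n, n_steps, contamination_rate, random_seed):
    """Hypothetical helper: draw n_steps samples of size n with a fixed anomaly share."""
    rng = random.Random(random_seed)  # private generator, seeded from the stored value
    n_anomalies = round(n * contamination_rate)
    n_normals = n - n_anomalies
    samples = []
    for _ in range(n_steps):
        points = rng.sample(normals, n_normals) + rng.sample(anomalies, n_anomalies)
        labels = [0] * n_normals + [1] * n_anomalies
        # Shuffle points and labels with the same permutation.
        order = list(range(n))
        rng.shuffle(order)
        samples.append(([points[i] for i in order], [labels[i] for i in order]))
    return samples


# Parameters copied from the `unsupervised_multiple` entry in config.yaml.
params = {"n": 50, "n_steps": 10, "contamination_rate": 0.1, "random_seed": 42}

# Synthetic stand-ins for normal and anomalous data points.
normals = list(range(1000))
anomalies = list(range(1000, 1100))

first_run = sample_multiple(normals, anomalies, **params)
second_run = sample_multiple(normals, anomalies, **params)
```

Both calls start from a generator seeded with the same value and draw in the same order, so `first_run` and `second_run` contain the same samples.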


# 2

The experiment has now been performed, and all information necessary to reproduce it is stored in `config.yaml`. In the second part, the results from above are reproduced.

In [7]:
# Load the dataset, preprocess it and turn it into an anomaly dataset
# Note that it has to be specified that the dataset is unsupervised
cd2 = ClassificationDataset(x_wilt, y_wilt, name="WILT2")
cd2.perform_operations_from_yaml(filepath='config.yaml')
ad2 = cd2.tranform_from_yaml(filepath='config.yaml', unsupervised=True)

In [8]:
# Rerun the experiment and see that the results are the same
eval_obj2 = EvaluationObject(algorithm_name="kNN")
# because we sample multiple times from an unsupervised dataset, "unsupervised_multiple" is used as type
for (x, y), sample_config in ad2.sample_from_yaml("config.yaml", type="unsupervised_multiple"):
    algo = KNN()
    algo.fit(x)
    pred = algo.decision_scores_
    eval_obj2.add(ground_truth=y, prediction=pred, description=sample_config)
    
_ = eval_obj2.evaluate()

Evaluation on dataset WILT2 with normal labels ['n'] and anomaly labels ['w'].
Total of 10 datasets. Per dataset:
50 instances, contamination_rate 0.1.
Mean 	 Std_dev 	 Metric
0.332 	 0.036 		 roc_auc
0.086 	 0.004 		 average_precision
-0.015 	 0.005 		 adjusted_average_precision
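The two printed tables can be compared by eye, but a programmatic check scales better to many metrics. The sketch below assumes the mean values have been copied by hand from the two outputs above into plain dictionaries; what `evaluate()` returns is not shown in this notebook, so no `oab` return value is used. The helper `same_metrics` is hypothetical.

```python
import math

# Mean values copied from the printed output of the first run (WILT).
run_1 = {
    "roc_auc": 0.332,
    "average_precision": 0.086,
    "adjusted_average_precision": -0.015,
}

# Mean values copied from the printed output of the reproduced run (WILT2).
run_2 = {
    "roc_auc": 0.332,
    "average_precision": 0.086,
    "adjusted_average_precision": -0.015,
}


def same_metrics(a, b, abs_tol=1e-9):
    """Return True if both runs report the same metrics with matching values."""
    return a.keys() == b.keys() and all(
        math.isclose(a[key], b[key], abs_tol=abs_tol) for key in a
    )


reproduced_ok = same_metrics(run_1, run_2)
```

A small absolute tolerance is used instead of `==` so that the check also works for values that went through floating-point arithmetic rather than being typed in.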
