
## In this notebook, you will walk through, step by step, everything needed to use the full functionality of the OAB framework. The steps are as follows:
0. SETUP
1. DATA
2. DATA SELECTION
3. PREPROCESSING
4. SAMPLING
5. ALGORITHM TRAINING AND TESTING
6. EVALUATION
7. SHOW BENCHMARK RESULTS
8. REPRODUCIBILITY
9. EXTENDING THE BENCHMARK (with own algorithm)

This notebook focuses on <b>Semisupervised Tabular Data</b>. Let's begin!

# **0. SETUP**

The `oab` framework can be installed into your Python environment as a PyPI package using the following command:

In [None]:
#ID 1(0)

#%%capture
# pip install oab
!pip install example-pkg-jd-kiel --extra-index-url=https://test.pypi.org/simple/

`Cloning` the repository:

`oab` is an open-source framework available at https://github.com/ISDM-CAU-Kiel/oab. To run this .ipynb notebook successfully, clone that repository with the command below and run the notebook from within the cloned repository (if that is not already the case), at the path:

<b>/oab/Tutorials/Semisupervised_Anomaly_Detection_on_Benchmark_Tabular_Data.ipynb</b>

In [None]:
#ID 2(0)
!git clone https://github.com/ISDM-CAU-Kiel/oab.git

Now we import the necessary functions and internal variables:

In [None]:
#ID 3(0)


import sys
import os
from datetime import datetime 
from pathlib import Path
#sys.path.append('../..')           
sys.path.insert(0,f"{Path(os.getcwd()).parent}") # add the parent of the current working directory to the import path so that `oab` can be imported

%load_ext autoreload
%autoreload 2


# necessary imports for loading datasets as well as information from recipe files

from oab.data.semisupervised import SemisupervisedAnomalyDataset
from oab.data.load_dataset import load_dataset

from oab.data.load_recipe_functions import *
# necessary imports for algorithm comparisons and defining seeds
from oab.evaluation import EvaluationObject, ComparisonObject,all_metrics
import tensorflow as tf
import numpy as np
import random
import torch



## **0.1 NOTEBOOK AND CELL STRUCTURE** 

In this notebook, the sections where the user is required to enter their own information are marked with comments of the form:

<b>### ADD YOUR CODE ###</b>, so you can search for <b>###</b> to find those sections.

All cells are assigned an ID as a comment at the top of the cell, for example <b>#ID 10(5)</b>, where 10 denotes the cell ID and 5 denotes the section.
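
The ID convention described above can be read mechanically. The following is a minimal sketch, not part of `oab`; the helper name `parse_cell_id` and the regular expression are assumptions made for illustration.

```python
import re

# Matches ID comments such as "#ID 10(5)" or "# ID 5(1)".
CELL_ID_PATTERN = re.compile(r"#\s*ID\s+(\d+)\((\d+)\)")


def parse_cell_id(line):
    """Return (cell_id, section) found in an ID comment, or None if there is none."""
    match = CELL_ID_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_cell_id("#ID 10(5)"))  # cell 10 in section 5
```

Such a helper could be used, for example, to list all cells that belong to one section of the notebook.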

## 0.2 DETAILS OF THIS BENCHMARK RUN

In [2]:
#ID 4(0)
### ADD YOUR BENCHMARK NAME HERE ###
benchmark_name="Paper_A" 


datasets_info_path=Path(os.getcwd()).parent/"oab"/"data"/"datasets.yaml" # getting path to "datasets.yaml" which contains information about all tabular datasets

recipes_parent_path=Path(os.getcwd()).parent/"notebooks"/"benchmark_tabular"
dataset_folder=recipes_parent_path/"datasets" # all dataset-folders are contained in this folder
#print(dataset_folder)

benchmark_type="sst"     # benchmark run for semisupervised tabular datasets(sst)
if not os.path.exists(recipes_parent_path/benchmark_name): # create a directory for this benchmark to store its recipes
    os.makedirs(recipes_parent_path/benchmark_name)

    
time=datetime.now().strftime("%Y%m%d%H%M%S") # timestamp set for this run  
new_recipe_path=f"{recipes_parent_path}/{benchmark_name}/{time}-{benchmark_name}-{benchmark_type}-recipe.yaml" # recipe path for new recipe created in this run   
print(f"{time}-{benchmark_name}-{benchmark_type}-recipe.yaml")

20211217000524-Paper_A-sst-recipe.yaml
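
The recipe filename printed above follows the pattern `timestamp-benchmark_name-benchmark_type-recipe.yaml`. As a rough sketch of that naming convention, the snippet below builds and parses such names with the standard library only. The helpers `build_recipe_filename` and `parse_recipe_filename` are hypothetical and not part of `oab`.

```python
import re
from datetime import datetime

# timestamp (14 digits) - benchmark name - benchmark type - "recipe.yaml"
RECIPE_NAME = re.compile(r"^(\d{14})-(.+)-([A-Za-z]+)-recipe\.yaml$")


def build_recipe_filename(benchmark_name, benchmark_type, now=None):
    """Compose a recipe filename the same way the cell above does."""
    stamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    return f"{stamp}-{benchmark_name}-{benchmark_type}-recipe.yaml"


def parse_recipe_filename(filename):
    """Split a recipe filename into (timestamp, benchmark_name, benchmark_type)."""
    match = RECIPE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recipe filename: {filename!r}")
    stamp, name, benchmark_type = match.groups()
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), name, benchmark_type


print(parse_recipe_filename("20211217000524-Paper_A-sst-recipe.yaml"))
```

Parsing the type back out is one way to check that a recipe is of type `sst` before using it in this notebook.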


### To reproduce a previously created recipe without adding new datasets and algorithms from this benchmark, skip to:

### `#ID 20(5)`

# **1. DATA**

First of all, we will have a look at the datasets that are pre-installed in OAB and can be used for benchmarking.

In [3]:
# ID 5(1)
for i in get_tabular_dataset_names():
    print(f"{i[0]}.{i[1]}")



0.page-blocks
1.spambase
2.wilt
3.pulsar_star
4.forest_cover
5.NASA_ground_data
6.wine
7.boston
8.http
9.smtp
10.cardio
11.thyroid
12.musk
13.pima
14.shuttle
15.breastw
16.arrhytmia
17.ionosphere
18.optdigits
19.mammography
20.annthyroid
21.pendigits
22.vertebral



`oab` provides a variety of tabular datasets that can easily be loaded.

`1.` If a user is interested in using her own tabular dataset, the following steps have to be followed:

 (a) Ensure that the information about the `own` dataset(s) is stored in the file `datasets.yaml`, which is located at
<b>"/oab/data"</b> (this is done by executing #ID 7(1)).
 
 
 (b) The `folder` containing the dataset must be stored in the `datasets` folder of OAB, and its name must be the same as the dataset_name.
 
 (c) The own dataset(s) can then be loaded just like the pre-installed OAB datasets.
 
 `2.` If the user's dataset is provided **via a URL**, it is downloaded and stored in OAB's "datasets" folder.
 
 The files in the `folder`, whether downloaded or stored manually in the "datasets" folder, can be in a variety of formats:
 
 1. 'csv'
 2. 'zip'
 3. 'gz_single_file'
 4. 'mat'
 5. 'mat_old'
 
 If the user has her dataset file in one of these formats, or has multiple files, `oab` automatically combines them into one file, which is then used as input to `oab`. Here we look at the case where the user loads her dataset from a local `folder directory`.

Here is how the information about each dataset is stored in `datasets.yaml`.

In [4]:
# ID 6(1)
!cat {datasets_info_path}

page-blocks:
  name: "page-blocks"
  short_description:
  foldername: "page-blocks"
  urls_dataset: ["https://pkgstore.datahub.io/machine-learning/page-blocks/page-blocks_csv/data/7c1adeffd3ce22181986879d92f9508c/page-blocks_csv.csv"]
  destination_filenames: ["page_blocks.csv"]
  dataset_format: 'csv'
  filenames_to_concatenate: null
  filename_in_folder: "page_blocks.csv" # Note: If we have multiple files, this should still be just one!
  load_csv_arguments:
    header: 0
  class_labels: "last" # or "first"
  url_yaml: "https://raw.githubusercontent.com/jandeller/test/main/test1.yaml" # for preprocessing and making an anomaly dataset
  destination_yaml: "page-blocks_preprocessing.yaml"
  credits: "Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. For more information, check https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification."
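
Before adding an own entry to `datasets.yaml`, it can help to check it against the structure shown above. The following is a minimal sketch under assumptions: the set of keys treated as required and the helper `check_dataset_entry` are chosen here for illustration and are not taken from `oab`; the format names are the ones listed earlier in this section.

```python
# Formats listed in this notebook as supported dataset file formats.
SUPPORTED_FORMATS = {"csv", "zip", "gz_single_file", "mat", "mat_old"}

# Assumption: these keys are treated as mandatory for this sketch only.
REQUIRED_KEYS = ("name", "foldername", "dataset_format",
                 "filename_in_folder", "class_labels")


def check_dataset_entry(entry):
    """Return a list of human-readable problems found in one dataset entry."""
    problems = []
    for key in REQUIRED_KEYS:
        if not entry.get(key):
            problems.append(f"missing key: {key}")
    fmt = entry.get("dataset_format")
    if fmt is not None and fmt not in SUPPORTED_FORMATS:
        problems.append(f"unsupported dataset_format: {fmt}")
    if entry.get("class_labels") not in (None, "first", "last"):
        problems.append("class_labels must be 'first' or 'last'")
    return problems


page_blocks = {
    "name": "page-blocks",
    "foldername": "page-blocks",
    "dataset_format": "csv",
    "filename_in_folder": "page_blocks.csv",
    "class_labels": "last",
}
print(check_dataset_entry(page_blocks))  # an empty list means no problems found
```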
  

In [5]:
#ID 7(1)


### ADD OWN DATASET(S) DETAILS ###   see #ID 6(1) for exact parameters which are to be entered

own_datasets_info= [{
                        'dataset_name':'myTabularDataset',
                    }]


 # 'myTabularDataset' is the name of the dataset that the user loads for benchmarking
### Add more dictionaries to the list `own_datasets_info` with dataset information, like the example below
                   #           { 
                   #               'dataset_name':'XYZDataset',
                   #               'filenames_to_concatenate':['train_example.csv','test_example.csv'],
                   #               ....
                   #            }
                
  



### ADD DATASETNAME(S) FROM OAB'S DATASETS HERE ###

benchmark_datasets_list=['spambase']   # More of OAB's datasets can be added to this list




## contains the dataset objects of own_datasets_info as well as benchmark_datasets_list
datasets={}


# Adding and Loading own datasets
for dataset_details in own_datasets_info:
     datasets[dataset_details['dataset_name']]=load_own_tabular_dataset(**dataset_details)


print(datasets)


{'myTabularDataset': <oab.data.classification_dataset.ClassificationDataset object at 0x7f4483c790a0>}


# **2. DATA SELECTION**


Datasets can either be loaded directly as anomaly datasets or as classification datasets. In the former case, the dataset is automatically fully prepared and ready for sampling. In the latter case, further preprocessing is still possible and necessary.

**After adding and loading own dataset(s) in #ID 7(1), the user can now select further datasets for benchmarking:**

Now we look again at all the datasets available in OAB, which now include the own dataset, so that they can be chosen for the benchmark run.

In [6]:
#ID 8(2)


for i in get_tabular_dataset_names():
    print(f"{i[0]}.{i[1]}")


0.NASA_ground_data
1.annthyroid
2.arrhytmia
3.boston
4.breastw
5.cardio
6.forest_cover
7.http
8.ionosphere
9.mammography
10.musk
11.myTabularDataset
12.optdigits
13.page-blocks
14.pendigits
15.pima
16.pulsar_star
17.shuttle
18.smtp
19.spambase
20.thyroid
21.vertebral
22.wilt
23.wine


### **2.1 Load anomaly detection datasets (with or without further preprocessing)**

In this section, we load some pre-installed datasets. This can be achieved using the `load_dataset` function. By default, it creates an anomaly dataset from which sampling is directly possible. Alternatively, we can first create a classification dataset and then an anomaly dataset, either with preprocessing applied (`preprocess_classification_dataset=True`), i.e. standard or custom operations like treat_missing_values, delete_columns, etc. are performed, or without (`preprocess_classification_dataset=False`, default).

`In our case`, we have already imported the own datasets with `anomaly_dataset=False` and `preprocess_classification_dataset=False` in <b>#ID 7(1)</b>, and we will load the OAB datasets in the same way in <b>#ID 9(2)</b>.

Note that, as discussed in the paper, multiclass classification datasets like `spambase` and `annthyroid` are loaded with the class label `0` as normal label and all other labels as anomaly labels by default. (Alternatively, `oab` can automatically iterate through all classes as normal classes. This is not covered here.)
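
To illustrate the default label convention described above, here is a small sketch in plain Python. The function `to_anomaly_labels` and the output encoding (0 for normal, 1 for anomaly) are assumptions for illustration; `oab` performs this conversion internally when an anomaly dataset is created.

```python
def to_anomaly_labels(class_labels, normal_labels=(0,)):
    """Map class labels to 0 (normal) or 1 (anomaly).

    Every label listed in normal_labels counts as normal,
    every other label counts as an anomaly.
    """
    normal = set(normal_labels)
    return [0 if label in normal else 1 for label in class_labels]


# A three-class example: class 0 is normal, classes 1 and 2 become anomalies.
print(to_anomaly_labels([0, 1, 2, 0, 2]))  # [0, 1, 1, 0, 1]
```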

In [7]:
#ID 9(2)

#### ADD YOUR OWN NUMBER OF DATASETS AND FROM OAB FOR BENCHMARKING  ###


for dataset_name in benchmark_datasets_list:  # loading benchmark's datasets
    datasets[dataset_name]=load_dataset(dataset_name,anomaly_dataset=False,preprocess_classification_dataset=False,dataset_folder=dataset_folder)


Credits: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


In [8]:
#ID 10(2)
print(datasets)

{'myTabularDataset': <oab.data.classification_dataset.ClassificationDataset object at 0x7f4483c790a0>, 'spambase': <oab.data.classification_dataset.ClassificationDataset object at 0x7f4483b72f40>}


# **3. PREPROCESSING**

Standard preprocessing steps (or custom preprocessing steps defined by the user), like deleting columns, encoding categorical values differently, or removing missing values, can be applied to tabular data. These methods (as well as own preprocessing steps and how they are captured) are covered in this section.

Here, we show only two preprocessing steps that are applied to the datasets in `datasets` (loaded in Section 2), which can also be performed individually as required:
- Perform `Standard/Custom Preprocessing functions`
- `Transform the dataset into an anomaly dataset`

In [9]:
#ID 11(3)                            SCALING APPLIED

#used imports from #ID 2(1)                                          
for dataset_name in datasets:
    
    
    datasets[dataset_name].treat_missing_values()
    datasets[dataset_name].normalize_columns()
    datasets[dataset_name].delete_duplicates()
    operations=datasets[dataset_name].operations_performed
    dataset_info_store(dataset_name,new_recipe_path,info_type='standard_functions',content=operations) 
   


#print("preprocesing performed on datasets! ")    
#print(datasets)

The file <b>f"{time}-{benchmark_name}-{benchmark_type}-recipe.yaml"</b> now contains information about how to preprocess (i.e. scale) the datasets.

In [10]:
#ID 12(3)

!cat {new_recipe_path}

myTabularDataset:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
spambase:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
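
The `standard_functions` block of the recipe is an ordered list of method names with their parameters, so it can be replayed by looking up each method by name. The sketch below shows that idea with a stand-in class; `RecordingDataset` and `replay_standard_functions` are hypothetical names, and the real `oab` loader may work differently.

```python
class RecordingDataset:
    """Stand-in for a dataset object: it only records which steps were called."""

    def __init__(self):
        self.operations_performed = []

    def _record(self, name, parameters):
        self.operations_performed.append({"name": name, "parameters": parameters})

    def treat_missing_values(self, **parameters):
        self._record("treat_missing_values", parameters)

    def normalize_columns(self, **parameters):
        self._record("normalize_columns", parameters)

    def delete_duplicates(self, **parameters):
        self._record("delete_duplicates", parameters)


def replay_standard_functions(dataset, standard_functions):
    """Call each recorded step on the dataset, in recipe order."""
    for step in standard_functions:
        method = getattr(dataset, step["name"])
        method(**(step["parameters"] or {}))
    return dataset


# The recipe content printed above, written as a Python literal.
steps = [
    {"name": "treat_missing_values",
     "parameters": {"missing_value": "np.nan", "delete_attributes": True}},
    {"name": "normalize_columns", "parameters": {"cols_to_normalize": None}},
    {"name": "delete_duplicates", "parameters": {}},
]
replayed = replay_standard_functions(RecordingDataset(), steps)
print([op["name"] for op in replayed.operations_performed])
```

Because the steps are replayed in list order, the order in which preprocessing is recorded in the recipe determines the order in which it is applied again.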


In [11]:
#ID 13(3)                            ANOMALY-DATASET CONVERSION PERFORMED

#used import from #ID 2 


#recipe_path=f"{benchmark_name}/{time}_{benchmark_name}_recipe.yaml"                                           

datasets_ad={}    
    # for storing dataset objects converted to anomaly-dataset
for dataset_name in datasets:   
    
     datasets_ad[dataset_name]= SemisupervisedAnomalyDataset(classification_dataset=datasets[dataset_name],
                                                       normal_labels=0)  
   
     normal_labels=datasets_ad[dataset_name].normal_labels 
     dataset_info_store(dataset_name,new_recipe_path,info_type='anomaly_dataset',content=normal_labels)   
                                                                            
print("datasets after adding anomaly-conversion datasets: ")    
print(datasets_ad)

datasets after adding anomaly-conversion datasets: 
{'myTabularDataset': <oab.data.semisupervised.SemisupervisedAnomalyDataset object at 0x7f4488b646d0>, 'spambase': <oab.data.semisupervised.SemisupervisedAnomalyDataset object at 0x7f4483b96f40>}


In [12]:
#ID 14(3)

!cat {new_recipe_path}

myTabularDataset:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:
spambase:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:


# **4. SAMPLING**

Here, we define the sampling parameters to sample from the datasets

In [13]:
#ID 15(4)

### ADD YOUR OWN SAMPLING PARAMETERS ###

# sampling parameters

training_split = 0.6                 # specifies the proportion of normal data points that will be used during training
max_contamination_rate = 0.4        # Maximum contamination rate of the test set. If this is exceeded, not all anomalies that exist are sampled 
n_steps = 4        # n_steps=10      # Number of samples to be taken


#These below are the possible sampling types to sample from datasets
sampling_types=['semisupervised_multiple','semisupervised_explicit_numbers_single','semisupervised_training_split_multiple','semisupervised_training_split_single']

sampling_type='semisupervised_training_split_multiple'  #by default for this run

sampling_params_current_run=[{'training_split':training_split,'max_contamination_rate':max_contamination_rate,'n_steps':n_steps},sampling_type] 

sampling=[{sampling_type:sampling_params_current_run[0]}]
print(sampling)

#storing sampling info to recipe
for dataset_name in datasets_ad:
    dataset_info_store(dataset_name,new_recipe_path,'sampling',content=sampling)

[{'semisupervised_training_split_multiple': {'training_split': 0.6, 'max_contamination_rate': 0.4, 'n_steps': 4}}]


The above sampling parameters are utilized in
<b>#ID 23(5)</b> 
for sampling the datasets before training the algorithms.
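
To make the roles of `training_split` and `max_contamination_rate` concrete, here is a rough sketch of one sampling step in plain Python. It is an assumption about how such a split could work, not `oab`'s implementation; the function `training_split_sample` is hypothetical, and the counts used in the example (245 normal and 100 anomalous points) are made up so that the resulting sizes coincide with the 147 training and 163 test instances reported for `myTabularDataset` later in this notebook.

```python
import random


def training_split_sample(normal_idx, anomaly_idx, training_split,
                          max_contamination_rate, seed=0):
    """Split indices into a clean training set and a contaminated test set.

    Assumes 0 <= max_contamination_rate < 1.
    """
    rng = random.Random(seed)

    normals = list(normal_idx)
    rng.shuffle(normals)
    n_train = round(training_split * len(normals))
    train = normals[:n_train]          # training data contains normal points only
    test_normals = normals[n_train:]   # remaining normal points go to the test set

    # Largest number of anomalies that keeps the test contamination
    # a / (n_test_normals + a) at or below max_contamination_rate.
    max_anomalies = int(max_contamination_rate * len(test_normals)
                        / (1 - max_contamination_rate))
    anomalies = list(anomaly_idx)
    rng.shuffle(anomalies)
    test_anomalies = anomalies[:max_anomalies]
    return train, test_normals, test_anomalies


train, test_normals, test_anomalies = training_split_sample(
    range(245), range(245, 345), training_split=0.6,
    max_contamination_rate=0.4, seed=42)
print(len(train), len(test_normals) + len(test_anomalies))
```

With `n_steps = 4`, a step like this would be repeated four times with different random draws, which is why the evaluation later reports a mean and a standard deviation.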

In [14]:
# ID 16(4)
!cat {new_recipe_path}

myTabularDataset:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.6
      max_contamination_rate: 0.4
      n_steps: 4
spambase:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.6
      max_contamination_rate: 0.4


Now, we will associate sampling information with each dataset loaded in the benchmark run:

In [15]:
#ID 17(4)
benchmarking_datasets={}

for (x,y) in datasets_ad.items():
    benchmarking_datasets[x]=[y,sampling_params_current_run]


print(benchmarking_datasets)    

{'myTabularDataset': [<oab.data.semisupervised.SemisupervisedAnomalyDataset object at 0x7f4488b646d0>, [{'training_split': 0.6, 'max_contamination_rate': 0.4, 'n_steps': 4}, 'semisupervised_training_split_multiple']], 'spambase': [<oab.data.semisupervised.SemisupervisedAnomalyDataset object at 0x7f4483b96f40>, [{'training_split': 0.6, 'max_contamination_rate': 0.4, 'n_steps': 4}, 'semisupervised_training_split_multiple']]}


The above dictionary <b>benchmarking_datasets</b> will be used for the benchmarking, as it contains all the required information:
    
    
    1. dataset_name
    2. final_dataset_object (preprocessed and converted to an anomaly dataset)
    3. sampling_info
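
Each value in `benchmarking_datasets` is a nested list of the form `[dataset_object, [sampling_params, sampling_type]]`. The short sketch below shows one way to unpack it; the helper `unpack_entry` is hypothetical, and a plain `object()` stands in for the anomaly dataset.

```python
def unpack_entry(entry):
    """Split one benchmarking_datasets value into its three parts."""
    dataset_object, (sampling_params, sampling_type) = entry
    return dataset_object, sampling_params, sampling_type


placeholder_dataset = object()  # stands in for a SemisupervisedAnomalyDataset
example = {
    "myTabularDataset": [
        placeholder_dataset,
        [{"training_split": 0.6, "max_contamination_rate": 0.4, "n_steps": 4},
         "semisupervised_training_split_multiple"],
    ]
}

for name, entry in example.items():
    dataset_object, sampling_params, sampling_type = unpack_entry(entry)
    print(name, sampling_type, sampling_params["n_steps"])
```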



# **5. ALGORITHM TRAINING AND TESTING**

We first download and import the algorithms used for anomaly detection.

In [16]:
#ID 18(5)



# import anomaly detection algorithms from pyod
from pyod.models.ocsvm import OCSVM # fit and decision_function
from pyod.models.iforest import IForest
from pyod.models.pca import PCA
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.vae import VAE
### ADD your algo import here ###


First, we define the hyperparameters for all algorithms and choose which ones to benchmark:

In [17]:
#ID 19(5)

  


### ADD YOUR OWN (HYPER)PARAMETERS AND THEIR VALUES FOR PRE-INSTALLED ALGOS###


 
lst_benchmark_algorithms =[  
    

    {   
       "algo_module_name": "pyod.models.pca" , 
       "algo_class_name": "PCA",
       "algo_name_in_result_table": "PCA",
       "algo_parameters":{'n_components': 0.9, 'svd_solver': 'full'},
       "fit": {'method_name': 'fit', 'params': {}}, 
       "decision_function": {'method_name': 'decision_function', 'params': {}}
    },
    {
      "algo_module_name": "pyod.models.auto_encoder"   ,
       "algo_class_name": "AutoEncoder",
       "algo_name_in_result_table": "AutoEncoder",
       "algo_parameters":   {'verbose': 0, 'hidden_neurons': [6, 3, 3, 6], 'random_state': 42},
        "fit": {'method_name': 'fit', 'params': {}}, 
       "decision_function": {'method_name': 'decision_function', 'params': {}}
       }
]


'''       ### uncomment to use these algos below for benchmarking ###
     {   
       "algo_module_name": "pyod.models.vae" , 
       "algo_class_name": "VAE",
       "algo_name_in_result_table": "VAE",
       "algo_parameters":  {'encoder_neurons': [6, 3], 'decoder_neurons': [3, 6], 'verbose': 0, 'random_state': 42},
       "fit": {'method_name': 'fit', 'params': {}}, 
       "decision_function": {'method_name': 'decision_function', 'params': {}}
    }
  ,
{   
       "algo_module_name": "pyod.models.ocsvm" , 
       "algo_class_name": "OCSVM",
       "algo_name_in_result_table": "OCSVM",
       "algo_parameters": {'degree': 3}, 
       "fit": {'method_name': 'fit', 'params': {}}, 
       "decision_function": {'method_name': 'decision_function', 'params': {}}
    },
   
       {
       "algo_module_name": "pyod.models.iforest",   
       "algo_class_name": "IForest",
       "algo_name_in_result_table": "IForest",   
       "algo_parameters": {'random_state': 42} ,
        "fit": {'method_name': 'fit', 'params': {}}, 
        "decision_function": {'method_name':'decision_function', 'params': {}}
        }

'''   
   
    
    
    
   
    

### ADD OWN ALGORITHM(S)  with algorithm specifications as shown above for OAB algorithms ###   

own_algorithms=[]   #add to this list e.g. { "algo_module_name": "own_algo" , "algo_class_name": "ownAlgoClass",.........."decision_function": {'method_name': 'decision_function', 'params': {}}} 

lst_benchmark_algorithms.extend(own_algorithms)


#seed defined for this benchmark run to obtain consistent results
seed=42
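
Each algorithm specification above names a module, a class, constructor parameters, and the methods used for fitting and scoring. The following sketch shows how such a specification could be turned into a fitted object with `importlib.import_module`. The function `build_and_score` and the dictionary `demo_spec` are hypothetical, and `collections.Counter` is used only as a stand-in for a pyod detector so that the block runs without extra dependencies.

```python
import importlib


def build_and_score(spec, train_data, test_data):
    """Instantiate the class named in spec, fit it, and return its scores."""
    module = importlib.import_module(spec["algo_module_name"])
    algo_class = getattr(module, spec["algo_class_name"])
    algo = algo_class(**spec["algo_parameters"])

    fit = getattr(algo, spec["fit"]["method_name"])
    fit(train_data, **spec["fit"]["params"])

    score = getattr(algo, spec["decision_function"]["method_name"])
    return score(test_data, **spec["decision_function"]["params"])


# Stand-in specification: Counter.update plays the role of "fit",
# Counter.get plays the role of "decision_function".
demo_spec = {
    "algo_module_name": "collections",
    "algo_class_name": "Counter",
    "algo_name_in_result_table": "Counter",
    "algo_parameters": {},
    "fit": {"method_name": "update", "params": {}},
    "decision_function": {"method_name": "get", "params": {}},
}

print(build_and_score(demo_spec, ["a", "b", "a"], "a"))  # 2
```

Keeping the method names in the specification is what allows own algorithms to be benchmarked even when their fitting or scoring methods are not called `fit` and `decision_function`.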

<b>LOAD YOUR RECIPE</b> to be reproduced and use it in the current benchmark run.

In [18]:
#ID 20(5)           # Execute this cell only when you already have a recipe file  to load from      


### ADD AN OPTIONAL RECIPE  PATH TO ADD TO THIS BENCHMARK RUN START ###   

# Note: recipes of type "semisupervised tabular(sst) " i.e. of the format: 
#               "timestamp-benchmark_name-sst-recipe.yaml"
# can only be used for benchmarking in this notebook.

recipe_path=recipes_parent_path/"Paper_B"/"20211206050133-Paper_B-sst-recipe.yaml"

### ADD AN OPTIONAL RECIPE  PATH TO ADD TO THIS BENCHMARK RUN END ###   
    
### UNCOMMENT ONLY IF NO NEW DATASETS WERE ADDED IN THE BENCHMARK EXCEPT FROM RECIPE START ###     

#benchmarking_datasets={}
#lst_benchmark_algorithms=[]

### UNCOMMENT ONLY IF NO NEW DATASETS WERE ADDED IN THE BENCHMARK EXCEPT FROM RECIPE END ###  


!cat {recipe_path} 


annthyroid:
- dataset
- standard_functions:
  - name: normalize_columns
    parameters:
      cols_to_normalize:
      - 0
      - 1
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels: 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.7
      max_contamination_rate: 0.5
      n_steps: 1
      
myTabularDataset2:
- dataset
- standard_functions:
  - name: normalize_columns
    parameters:
      cols_to_normalize:
      - 0
      - 1
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels: 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.7

In [20]:
#ID 21(5)                # Execute this cell only when you already have a recipe file  to load from   
    
recipe_algos=data_from_recipe('algos',recipe_path) # all algo names from recipe extracted
#print(f"recipe_algos:\n{recipe_algos}")

recipe_datasets=data_from_recipe('datasets',recipe_path) # all dataset info (anomaly dataset object/anomaly dataset params/sampling params) is processed and obtained
#print(f"\nrecipe_datasets:\n{recipe_datasets}")
 
recipe_seed=data_from_recipe('seed',recipe_path)  # obtained seeds to feed in this benchmark 
seed=recipe_seed   # seed of current benchmark is overwritten by recipe seed



pyod.models.vae----


annthyroid------
standard/custom preprocessing performed!
transformed to anomaly dataset!

myTabularDataset2------
standard/custom preprocessing performed!
transformed to anomaly dataset!


In [21]:
#ID 22(5)    



# adding recipe_datasets to benchmarking_datasets
for dataset_name in recipe_datasets:
    benchmarking_datasets[dataset_name]=recipe_datasets[dataset_name][:2]
#print(f"benchmarking_datasets: {benchmarking_datasets}") 
                                       
#adding algos from recipe_algos to lst_benchmarking_algos
for algo in recipe_algos:
    lst_benchmark_algorithms.append(algo)

#print(lst_benchmark_algorithms)

print("\nDatasets obtained from recipe:")    
for dataset_name in recipe_datasets:
    print(dataset_name)


print("\nAlgos obtained from recipe:") 
for algo_name in recipe_algos:
    print(algo_name['algo_module_name'])
    
  
print("\nAll Datasets for this benchmark run:")    
for dataset_name in benchmarking_datasets:
    print(dataset_name)

    
 
print("\nAll algos for this benchmark run:")
for algo in lst_benchmark_algorithms:
    #print(algo)
    print(algo['algo_module_name'])






Datasets obtained from recipe:
annthyroid
myTabularDataset2

Algos obtained from recipe:
pyod.models.vae

All Datasets for this benchmark run:
myTabularDataset
spambase
annthyroid
myTabularDataset2

All algos for this benchmark run:
pyod.models.pca
pyod.models.auto_encoder
pyod.models.vae


Now, for every benchmark dataset, we sample from that dataset to train the algorithms, predict the outcomes with each algorithm, and store the results in an evaluation object. This is then added to the comparison object to show the final benchmarking results.

In [22]:
#ID 23(5)
co = ComparisonObject()

for dataset_name in list(benchmarking_datasets.keys()):
    print(f'-------{dataset_name}-------') 
   
    #print(mvtec_ad_own_datasets_list)
    for alg in lst_benchmark_algorithms:
        
        print("------"+alg["algo_class_name"])
        eval_obj = EvaluationObject(algorithm_name=alg["algo_name_in_result_table"])
        
        
        sampling_type=benchmarking_datasets[dataset_name][1][1]
        sampling_params=benchmarking_datasets[dataset_name][1][0]
        #print(sampling_type,sampling_params)
        for (x_train, x_test, y_test),sample_config in sample_semisupervised(dataset_name,sampling_type,sampling_params,benchmarking_datasets[dataset_name][0]):
                 
                torch.manual_seed(seed)
                random.seed(seed)
                tf.random.set_seed(seed)
                np.random.seed(seed) 
                os.environ['PYTHONHASHSEED'] = str(seed)
                torch.cuda.manual_seed(seed)
                torch.cuda.manual_seed_all(seed)
                torch.backends.cudnn.deterministic = True
                torch.backends.cudnn.benchmark = False
                torch.use_deterministic_algorithms(True)
                
                
                mod = __import__(alg["algo_module_name"], fromlist=[alg["algo_class_name"]])
                algo = getattr(mod, alg["algo_class_name"])(**alg['algo_parameters'])        
                print('.', end='') # update to see progress 
                getattr(algo,alg["fit"]["method_name"])(x_train, **alg["fit"]["params"])  # fitting algo
                pred = getattr(algo,alg["decision_function"]["method_name"])(x_test, **alg["decision_function"]["params"]) # decision functions
                eval_obj.add(ground_truth=y_test, prediction=pred, description=sample_config)  
        print("\n")    
        eval_desc = eval_obj.evaluate(print=True, metrics=['roc_auc', 'adjusted_average_precision', 'precision_recall_auc'])
        co.add_evaluation(eval_desc)
        print("\n")

-------myTabularDataset-------
------PCA
....

Evaluation on dataset myTabularDataset with normal labels [0] and anomaly labels [1, 2].
Total of 4 datasets. Per dataset:
147 training instances, 163 test instances, training contamination rate 0.0, test contamination rate 0.3987730061349693.
Mean 	 Std_dev 	 Metric
0.733 	 0.032 		 roc_auc
0.385 	 0.104 		 adjusted_average_precision
0.625 	 0.065 		 precision_recall_auc


------AutoEncoder
....

Evaluation on dataset myTabularDataset with normal labels [0] and anomaly labels [1, 2].
Total of 4 datasets. Per dataset:
147 training instances, 163 test instances, training contamination rate 0.0, test contamination rate 0.3987730061349693.
Mean 	 Std_dev 	 Metric
0.806 	 0.024 		 roc_auc
0.516 	 0.097 		 adjusted_average_precision
0.704 	 0.061 		 precision_recall_auc


------VAE
....

Evaluation on dataset myTabularDataset with normal labels [0] and anomaly labels [1, 2].
Total of 4 datasets. Per dataset:
147 training instances, 163 test ins

# **6. EVALUATION**

Here, we will see how different metrics can be selected when evaluating an algorithm's performance.

In the previous section, the evaluation description in #ID 23(5) was created with three explicitly listed metrics. To evaluate with all available metrics instead, pass `all_metrics`:

     eval_desc = eval_obj.evaluate(print=False, metrics=all_metrics)
    
    

In [23]:
#ID 24(6)

# to use a subset, first see which ones are available

print(all_metrics)

['roc_auc', 'average_precision', 'adjusted_average_precision', 'precision_n', 'adjusted_precision_n', 'precision_recall_auc']


In [24]:
#ID 25(6)

#### ADD YOUR OWN NUMBER OF METRICS ###

#Then we can  select an arbitrary subset
metrics=['roc_auc', 'precision_recall_auc']
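
When choosing a subset, a misspelled metric name is easy to miss. The sketch below checks a requested subset against the list of available metrics printed in #ID 24(6). The helper `select_metrics` is hypothetical and not part of `oab`; `ALL_METRICS` simply copies the printed list.

```python
# Copy of the list printed by print(all_metrics) in this notebook.
ALL_METRICS = ['roc_auc', 'average_precision', 'adjusted_average_precision',
               'precision_n', 'adjusted_precision_n', 'precision_recall_auc']


def select_metrics(requested, available=ALL_METRICS):
    """Return the requested metrics without duplicates, or raise on unknown names."""
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    return list(dict.fromkeys(requested))  # keeps the requested order


print(select_metrics(['roc_auc', 'precision_recall_auc']))
```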

# **7. SHOW BENCHMARK RESULTS**

We compare the results of the evaluations of the different algorithm-dataset combinations by printing them.

\[LaTeX version: bold for the highest value, italics for the second highest\]
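
The highlighting rule for the LaTeX output (bold for the highest value, italics for the second highest) can be sketched for a single dataset column as follows. The function `highlight_column` is a hypothetical illustration and not the code behind `co.print_latex`; how ties are handled there is not covered by this sketch.

```python
def highlight_column(scores, digits=3):
    """Format one column: bold the best score, italicise the runner-up."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0] if ranked else None
    second = ranked[1] if len(ranked) > 1 else None

    cells = {}
    for name, value in scores.items():
        text = f"{value:.{digits}f}"
        if name == best:
            text = f"\\textbf{{{text}}}"
        elif name == second:
            text = f"\\textit{{{text}}}"
        cells[name] = text
    return cells


# roc_auc values for spambase as printed in this notebook.
spambase = {"PCA": 0.811608, "AutoEncoder": 0.811124, "VAE": 0.811939}
print(highlight_column(spambase))
```

Ranking uses the unrounded values, which is why two cells can show the same rounded number while only one of them is bold.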

In [25]:
#ID 26(7)

# print results in easily readable format
co.print_results()

For roc_auc:
             myTabularDataset  spambase  annthyroid  myTabularDataset2  \
PCA                  0.732928  0.811608    0.783386           0.713112   
AutoEncoder          0.805808  0.811124    0.809295           0.813367   
VAE                  0.776060  0.811939    0.810223           0.772462   
Average              0.771599  0.811557    0.800968           0.766314   

              Average  
PCA          0.760258  
AutoEncoder  0.809899  
VAE          0.792671  
Average           NaN  
For adjusted_average_precision:
             myTabularDataset  spambase  annthyroid  myTabularDataset2  \
PCA                  0.385346  0.570856    0.467751           0.405694   
AutoEncoder          0.515698  0.569862    0.492707           0.567927   
VAE                  0.457699  0.571325    0.492887           0.488129   
Average              0.452914  0.570681    0.484448           0.487250   

              Average  
PCA          0.457412  
AutoEncoder  0.536549  
VAE          0.502510
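
The `Average` row and column in the tables above are plain arithmetic means over algorithms and datasets, respectively. The sketch below recomputes them from the printed `roc_auc` values; the function `add_averages` is hypothetical and only mirrors what the printed table shows.

```python
from statistics import mean


def add_averages(results):
    """results maps algorithm -> {dataset: score}; returns (row_avg, col_avg)."""
    datasets = list(next(iter(results.values())))
    row_avg = {algo: mean(scores[d] for d in datasets)
               for algo, scores in results.items()}
    col_avg = {d: mean(scores[d] for scores in results.values())
               for d in datasets}
    return row_avg, col_avg


# roc_auc values copied from the table printed above.
roc_auc = {
    "PCA": {"myTabularDataset": 0.732928, "spambase": 0.811608,
            "annthyroid": 0.783386, "myTabularDataset2": 0.713112},
    "AutoEncoder": {"myTabularDataset": 0.805808, "spambase": 0.811124,
                    "annthyroid": 0.809295, "myTabularDataset2": 0.813367},
    "VAE": {"myTabularDataset": 0.776060, "spambase": 0.811939,
            "annthyroid": 0.810223, "myTabularDataset2": 0.772462},
}
row_avg, col_avg = add_averages(roc_auc)
print(round(row_avg["PCA"], 3), round(col_avg["myTabularDataset"], 3))
```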

In [26]:
#ID 28(7)

# print results in easily readable format with standard deviations
co.print_results(include_stdevs=True)

For roc_auc:
            myTabularDataset      spambase    annthyroid myTabularDataset2  \
PCA             0.733+-0.032  0.812+-0.013  0.783+-0.000      0.713+-0.000   
AutoEncoder     0.806+-0.024  0.811+-0.013  0.809+-0.000      0.813+-0.000   
VAE             0.776+-0.027  0.812+-0.013  0.810+-0.000      0.772+-0.000   
Average                0.772         0.812         0.801             0.766   

              Average  
PCA          0.760258  
AutoEncoder  0.809899  
VAE          0.792671  
Average           NaN  

For adjusted_average_precision:
            myTabularDataset      spambase    annthyroid myTabularDataset2  \
PCA             0.385+-0.104  0.571+-0.016  0.468+-0.000      0.406+-0.000   
AutoEncoder     0.516+-0.097  0.570+-0.016  0.493+-0.000      0.568+-0.000   
VAE             0.458+-0.107  0.571+-0.016  0.493+-0.000      0.488+-0.000   
Average                0.453         0.571         0.484             0.487   

              Average  
PCA          0.457412  
Auto

In [27]:
# ID 29(7)

co.print_latex(include_stdevs=True)

For roc_auc:
\begin{center}
\begin{tabular}{  c c c c c c  }
  & myTabularDataset & spambase & annthyroid & myTabularDataset2 & Average \\
  PCA & 0.733$\pm$0.032 & \textit{0.812$\pm$0.013} & 0.783$\pm$0.000 & 0.713$\pm$0.000 & 0.760 \\
  AutoEncoder & \textbf{0.806$\pm$0.024} & 0.811$\pm$0.013 & \textit{0.809$\pm$0.000} & \textbf{0.813$\pm$0.000} & \textbf{0.810} \\
  VAE & \textit{0.776$\pm$0.027} & \textbf{0.812$\pm$0.013} & \textbf{0.810$\pm$0.000} & \textit{0.772$\pm$0.000} & \textit{0.793} \\
  Average & 0.772 & 0.812 & 0.801 & 0.766 &    \\
\end{tabular}
\end{center}

For adjusted_average_precision:
\begin{center}
\begin{tabular}{  c c c c c c  }
  & myTabularDataset & spambase & annthyroid & myTabularDataset2 & Average \\
  PCA & 0.385$\pm$0.104 & \textit{0.571$\pm$0.016} & 0.468$\pm$0.000 & 0.406$\pm$0.000 & 0.457 \\
  AutoEncoder & \textbf{0.516$\pm$0.097} & 0.570$\pm$0.016 & \textit{0.493$\pm$0.000} & \textbf{0.568$\pm$0.000} & \textbf{0.537} \\
  VAE & \textit{0.458$\pm$0.1

# **8. REPRODUCIBILITY**

## **8.1 Creating recipes**

This section shows **how `oab` can be used to make sampling results easily reproducible**.
 

`yaml` files play an integral role in reproducibility, as they store the operations performed on the data together with their parameters.
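To make the role of the recipe concrete, the sketch below mirrors the layout of one recipe entry as it might look once the YAML has been parsed into Python objects. The dictionary is hand-written for illustration (it follows the recipe printed in this notebook), and `get_section` is a hypothetical helper, not part of `oab`.

```python
# Hand-written stand-in for one parsed recipe entry (illustration only).
recipe = {
    "myTabularDataset": [
        "dataset",
        {"standard_functions": [
            {"name": "treat_missing_values",
             "parameters": {"missing_value": "np.nan", "delete_attributes": True}},
            {"name": "normalize_columns", "parameters": {"cols_to_normalize": None}},
            {"name": "delete_duplicates", "parameters": {}},
        ]},
        {"anomaly_dataset": {"arguments": {"normal_labels": [0], "anomaly_labels": None}}},
        {"sampling": {"semisupervised_training_split_multiple": {
            "training_split": 0.6, "max_contamination_rate": 0.4, "n_steps": 4}}},
    ]
}


def get_section(entry, section_name):
    """Return the content of one named section of a recipe entry (hypothetical helper)."""
    for item in entry:
        if isinstance(item, dict) and section_name in item:
            return item[section_name]
    raise KeyError(f"section {section_name!r} not found")


entry = recipe["myTabularDataset"]
# Names of the preprocessing steps, in the order they were applied.
preprocessing_steps = [step["name"] for step in get_section(entry, "standard_functions")]
normal_labels = get_section(entry, "anomaly_dataset")["arguments"]["normal_labels"]
print(preprocessing_steps)
print(normal_labels)
```

Each dataset entry is a list of single-key sections, so looking a section up by name keeps the code independent of the order in which the sections were written.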

We will see how to produce a recipe (`.yaml`) for the benchmark run already performed in <b>#ID 23(5)</b>.

In <b>#ID 11(3), #ID 13(3), and #ID 15(4)</b>, we already performed operations on our own datasets and on OAB's datasets and stored the dataset information, as we can see below:

In [28]:
#ID 30(8)
!cat {new_recipe_path}

myTabularDataset:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.6
      max_contamination_rate: 0.4
      n_steps: 4
spambase:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.6
      max_contamination_rate: 0.4


Now we will store in the new recipe the dataset information from <b>Paper_B's</b> recipe, along with
the algorithm information for only the algorithms used in this benchmark:

In [29]:
#ID 31(8)


# add the datasets from the recipe used in the benchmark run in #ID 23(5)
for dataset_name in recipe_datasets:

    # store the anomaly dataset parameters
    dataset_info_store(dataset_name, new_recipe_path, info_type='anomaly_dataset', content=recipe_datasets[dataset_name][0].normal_labels)

    # store the preprocessing parameters
    dataset_info_store(dataset_name, new_recipe_path, info_type='standard_functions', content=recipe_datasets[dataset_name][3])
    #dataset_info_store(dataset_name,new_recipe_path,info_type='custom_functions',content=recipe_datasets[dataset_name][4])

    # store the sampling parameters
    sampling_data = recipe_datasets[dataset_name][1]
    dataset_info_store(dataset_name, new_recipe_path, 'sampling', content=sampling_data)
    
    

Now we will store information about the <b>algorithms and their hyperparameters</b> in the recipe (`.yaml`):

In [30]:
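#ID 32(8)
from ruamel.yaml import YAML  # round-trip YAML parser; may already be available via the star import in #ID 3(0)

recipe_file = Path("./") / new_recipe_path

for algo in lst_benchmark_algorithms:
    # recipe entry for one algorithm: constructor parameters, fit and
    # decision_function settings, followed by the class name
    algo_entry = [
        'algo_name',
        {
            'init': {'params': algo["algo_parameters"]},
            'fit': algo["fit"],
            'decision_function': algo["decision_function"],
        },
        algo["algo_class_name"],
    ]

    # load the recipe, add the algorithm under its module name, and write it back
    yaml = YAML(typ='rt')
    yaml_content = yaml.load(recipe_file)
    yaml_content[algo["algo_module_name"]] = algo_entry
    yaml_content['seed'] = [seed]  # adding seed to new recipe
    yaml.dump(yaml_content, recipe_file)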

In **f"{time}-{benchmark_name}-{benchmark_type}-recipe.yaml"**, we now see the sampling parameters, the anomaly dataset conversion parameters, and the algorithms along with their hyperparameters. The sampling parameters are stored under the key `semisupervised_training_split_multiple`. If sampling is done in a different scenario, e.g., unsupervised multiple, its parameters are stored in the same recipe file under a different key in the sampling dict.
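The scenario key in the sampling dict can be read as a label for the sampling call, with its value holding the keyword arguments of that call. A minimal sketch, assuming the sampling section has already been parsed into a dictionary; `sampling_kwargs` is a hypothetical helper and `fake_sampler` is only a stand-in for a real sampling method:

```python
# Hand-written copy of the sampling section shown in the recipe (illustration only).
sampling_section = {
    "semisupervised_training_split_multiple": {
        "training_split": 0.6,
        "max_contamination_rate": 0.4,
        "n_steps": 4,
    }
}


def sampling_kwargs(section, scenario):
    """Return a copy of the parameters stored for one sampling scenario."""
    if scenario not in section:
        raise KeyError(
            f"no parameters stored for scenario {scenario!r}; available: {sorted(section)}"
        )
    return dict(section[scenario])


def fake_sampler(training_split, max_contamination_rate, n_steps):
    """Stand-in for a sampling method; it only echoes what it received."""
    return f"{n_steps} samples, split={training_split}, contamination<={max_contamination_rate}"


kwargs = sampling_kwargs(sampling_section, "semisupervised_training_split_multiple")
summary = fake_sampler(**kwargs)
print(summary)
```

Asking for a scenario that was never stored raises a `KeyError` listing the available keys, which makes a mismatch between recipe and notebook easy to spot.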



In [31]:
#ID 33(8)
!cat {new_recipe_path}

myTabularDataset:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.6
      max_contamination_rate: 0.4
      n_steps: 4
spambase:
- dataset
- standard_functions:
  - name: treat_missing_values
    parameters:
      missing_value: np.nan
      delete_attributes: true
  - name: normalize_columns
    parameters:
      cols_to_normalize:
  - name: delete_duplicates
    parameters: {}
- anomaly_dataset:
    arguments:
      normal_labels:
      - 0
      anomaly_labels:
- sampling:
    semisupervised_training_split_multiple:
      training_split: 0.6
      max_contamination_rate: 0.4


## **8.2 Reproducing the experiment**

To reproduce the benchmark run from the recipe created in the previous section,
refer to <b>Section 5, #ID 20(5)</b>, which shows how to reproduce a run as well as how to extend benchmarks.
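Reproducing a run relies on the seed stored in the recipe. The sketch below illustrates the idea with Python's standard `random` module only; the `recipe` dictionary, its seed value, and `draw_indices` are assumptions made for this example, and a real run would also need to seed NumPy and TensorFlow.

```python
import random

# Hypothetical seed value; the recipe stores the seed as a one-element list.
recipe = {"seed": [42]}


def draw_indices(seed, n_total, n_draw):
    """Draw n_draw distinct indices from range(n_total) with a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(range(n_total), n_draw)


seed = recipe["seed"][0]
first_run = draw_indices(seed, n_total=100, n_draw=5)
second_run = draw_indices(seed, n_total=100, n_draw=5)
print(first_run == second_run)
```

Using a dedicated `random.Random(seed)` instance instead of the global generator keeps the draw unaffected by other code that consumes random numbers in between.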

# **9. EXTEND EXISTING BENCHMARK (own algorithm)**

Extending the existing benchmark here means adding our own algorithm to the benchmark and showing its results alongside those of the pre-installed algorithms, while also loading our own dataset.


1. We load the datasets. To see how to do that, refer to **Section "1. Data" and "2. Data Selection"**.
2. Then, we load our own algorithm, as we will see in the next subsection.

## **9.1 Loading own Algorithm**

In this subsection, you will see **how your own semisupervised anomaly detection algorithm** can easily be evaluated within `oab`. We will see how a class representing an algorithm can be structured and how its performance is evaluated.

Of course, this is not the only way to use the functionality provided by `oab`. We do, however, consider it to be the simplest way.

In [32]:
#ID 34(9)

# download example algorithm and inspect content
import wget
wget.download('https://raw.githubusercontent.com/jandeller/test/main/RandomGuesserSemisupervised.py',f"{recipes_parent_path}/RandomGuesserSemisupervised.py")
!cat {recipes_parent_path}/RandomGuesserSemisupervised.py

import numpy as np

class RandomGuesserSemisupervised():

    def fit(self, X_train):
        pass
      
    def decision_function(self, X_test):
        "Assign a random number to each sample from the test set"
        n_samples = X_test.shape[0]
        return np.random.randn(n_samples)


The sample `RandomGuesserSemisupervised` algorithm shown here is, as the name suggests, a random guesser, i.e., it assigns random anomaly scores to the samples.
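Because its scores carry no information about the labels, a random guesser is expected to land near a ROC AUC of 0.5, which makes it a useful baseline. The following sketch illustrates this with the standard library only; `roc_auc` is a small rank-based implementation written for this example (it is not the metric code used by `oab`), and the labels are synthetic.

```python
import random


def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of equal scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
y_true = [1] * 2000 + [0] * 3000  # synthetic labels, 40% anomalies
random_scores = [rng.gauss(0.0, 1.0) for _ in y_true]

auc_random = roc_auc(y_true, random_scores)                 # scores unrelated to the labels
auc_perfect = roc_auc(y_true, [float(y) for y in y_true])   # scores identical to the labels
print(round(auc_random, 3), auc_perfect)
```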

An algorithm used for semisupervised anomaly detection needs to specify a `fit(X_train)` method for training and a `decision_function(X_test)` method for inference that returns one anomaly score per data point in the test set.

It is of course possible to rename these methods or to expose the anomaly scores in a different way. Note that if this is done, the following code has to be changed accordingly. Adhering to the conventions described above (`fit(X_train)` and `decision_function(X_test)`) allows you to use the same interface as algorithms from [`PyOD`](https://pyod.readthedocs.io/en/latest/) as shown when [comparing algorithms using `oab`](https://colab.research.google.com/drive/1aV_itaYCJgzdZ1lQ7SUyHQ7z01xSPxDN?usp=sharing#scrollTo=QnAfCGTGL7xv).
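If an existing implementation uses other method names, a thin wrapper can map it onto the `fit`/`decision_function` convention instead of changing the evaluation code. The sketch below is one possible way to do this; `MethodNameAdapter` and `ThresholdModel` are hypothetical classes written for this example and are not part of `oab` or `PyOD`. Plain lists are used in place of arrays to keep the example dependency-free.

```python
class ThresholdModel:
    """Toy model with non-standard method names (illustration only)."""

    def train(self, rows):
        # Remember the per-column mean of the training rows.
        n_rows = len(rows)
        self.means = [sum(col) / n_rows for col in zip(*rows)]

    def score_samples(self, rows):
        # Anomaly score: summed absolute deviation from the training means.
        return [sum(abs(v - m) for v, m in zip(row, self.means)) for row in rows]


class MethodNameAdapter:
    """Expose fit/decision_function on top of differently named methods."""

    def __init__(self, model, fit_name, score_name):
        self._fit = getattr(model, fit_name)
        self._score = getattr(model, score_name)

    def fit(self, X_train):
        self._fit(X_train)
        return self

    def decision_function(self, X_test):
        return self._score(X_test)


detector = MethodNameAdapter(ThresholdModel(), fit_name="train", score_name="score_samples")
detector.fit([[0.0, 1.0], [2.0, 3.0]])
scores = detector.decision_function([[1.0, 2.0], [5.0, 2.0]])
print(scores)
```

The wrapper leaves the wrapped model untouched, so the same model can still be used through its original method names elsewhere.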

In [33]:
#ID 35(9)
# used imports from #ID 3(0),#ID 18(5)
#used sampling parameters from #ID 15(4)

# and import the RandomGuesser
from notebooks.benchmark_tabular.RandomGuesserSemisupervised import RandomGuesserSemisupervised
 
own_algorithms = [{
    ### ADD YOUR OWN ALGO DETAILS IN THIS FORM ###
    "algo_module_name": "RandomGuesserSemisupervised",
    "algo_class_name": "RandomGuesserSemisupervised",
    "algo_name_in_result_table": "RandomGuesserSemisupervised",
    "algo_parameters": {},
    "fit": {'method_name': 'fit', 'params': {}},
    "decision_function": {'method_name': 'decision_function', 'params': {}},
}]


The `own_algorithms` list from cell #ID 35(9) above can be added to the list of benchmark algorithms described in #ID 19(5) (`lst_benchmark_algorithms` in #ID 32(8)), so that this algorithm is used along with the other algorithms in a benchmark run as shown in #ID 23(5).
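Combining the lists is plain list concatenation. The sketch below additionally checks that each entry carries the keys used in #ID 35(9) before it is added; `REQUIRED_KEYS`, `check_algorithm_spec`, and the empty placeholder list are assumptions made for this example, not `oab` functionality.

```python
# Keys that appear in the algorithm description of #ID 35(9).
REQUIRED_KEYS = {
    "algo_module_name", "algo_class_name", "algo_name_in_result_table",
    "algo_parameters", "fit", "decision_function",
}


def check_algorithm_spec(spec):
    """Raise ValueError if an algorithm description lacks one of the expected keys."""
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise ValueError(f"algorithm spec is missing keys: {sorted(missing)}")
    return spec


own_algorithms = [{
    "algo_module_name": "RandomGuesserSemisupervised",
    "algo_class_name": "RandomGuesserSemisupervised",
    "algo_name_in_result_table": "RandomGuesserSemisupervised",
    "algo_parameters": {},
    "fit": {"method_name": "fit", "params": {}},
    "decision_function": {"method_name": "decision_function", "params": {}},
}]

existing_algorithms = []  # placeholder for the list of pre-installed algorithms
all_algorithms = existing_algorithms + [check_algorithm_spec(s) for s in own_algorithms]
print([a["algo_name_in_result_table"] for a in all_algorithms])
```

Validating before concatenating means a misspelled key fails immediately with a readable message rather than somewhere inside the benchmark run.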

In [34]:
#ID 36(9)


# A comparison object is created for comparing the evaluations of different algorithm-dataset combinations
co = ComparisonObject()

for dataset_name in benchmarking_datasets:
    # evaluate the random guesser on every sample drawn from this dataset
    eval_obj = EvaluationObject(algorithm_name="RandomGuesser")
    dataset = benchmarking_datasets[dataset_name][0]
    for (X_train, X_test, y_test), settings in dataset.sample_multiple_with_training_split(
            training_split=training_split,
            max_contamination_rate=max_contamination_rate,
            n_steps=n_steps):
        print(".", end=" ")  # update to see progress
        rg = RandomGuesserSemisupervised()
        rg.fit(X_train)  # fit on the training data (a no-op for the random guesser)
        pred = rg.decision_function(X_test)  # one anomaly score per test sample
        eval_obj.add(y_test, pred, settings)
    print("\n")
    eval_desc = eval_obj.evaluate(metrics=['roc_auc', 'adjusted_average_precision', 'precision_recall_auc'])
    # add the evaluation to the comparison object
    co.add_evaluation(eval_desc)
    print("\n")


. . . . 

Evaluation on dataset myTabularDataset with normal labels [0] and anomaly labels [1, 2].
Total of 4 datasets. Per dataset:
147 training instances, 163 test instances, training contamination rate 0.0, test contamination rate 0.3987730061349693.
Mean 	 Std_dev 	 Metric
0.516 	 0.023 		 roc_auc
0.047 	 0.054 		 adjusted_average_precision
0.417 	 0.034 		 precision_recall_auc


. . . . 

Evaluation on dataset spambase with normal labels [0] and anomaly labels [1].
Total of 4 datasets. Per dataset:
1516 training instances, 1686 test instances, training contamination rate 0.0, test contamination rate 0.3997627520759193.
Mean 	 Std_dev 	 Metric
0.505 	 0.009 		 roc_auc
0.005 	 0.021 		 adjusted_average_precision
0.402 	 0.012 		 precision_recall_auc


. . . . 

Evaluation on dataset annthyroid with normal labels [0] and anomaly labels [1.0].
Total of 4 datasets. Per dataset:
3916 training instances, 3146 test instances, training contamination rate 0.0, test contamination rate 0.1697

As in the above code, we store the evaluations of our own algorithm in an evaluation object, which is then added to the comparison object. Similarly, we can create evaluation objects for other algorithms and add them to the comparison object for the final benchmarking, as shown in Section 5.
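Conceptually, the evaluation step reduces the metric values obtained on the individual samples to a mean and a standard deviation, which is what the result tables report. A rough sketch of that aggregation, assuming the population standard deviation is used (the notebook does not state which variant `oab` computes); `summarize` and the metric values are made up for illustration:

```python
from statistics import mean, pstdev


def summarize(values):
    """Format metric values from repeated samples as 'mean+-std' with three decimals."""
    return f"{mean(values):.3f}+-{pstdev(values):.3f}"


roc_auc_per_sample = [0.70, 0.75, 0.80]  # made-up values, one per sampled dataset
print(summarize(roc_auc_per_sample))
```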

Finally, we show below the benchmarking results of our algorithm, as described in "**Section 7. Show Benchmark Results**".

In [35]:
#ID 37(9)

# print results in easily readable format
co.print_results()

For roc_auc:
               myTabularDataset  spambase  annthyroid  myTabularDataset2  \
PCA                    0.732928  0.811608    0.783386           0.713112   
AutoEncoder            0.805808  0.811124    0.809295           0.813367   
VAE                    0.776060  0.811939    0.810223           0.772462   
RandomGuesser          0.516209  0.505419    0.496305           0.516209   
Average                0.707751  0.735023    0.724802           0.703787   

                Average  
PCA            0.760258  
AutoEncoder    0.809899  
VAE            0.792671  
RandomGuesser  0.508535  
Average             NaN  
For adjusted_average_precision:
               myTabularDataset  spambase  annthyroid  myTabularDataset2  \
PCA                    0.385346  0.570856    0.467751           0.405694   
AutoEncoder            0.515698  0.569862    0.492707           0.567927   
VAE                    0.457699  0.571325    0.492887           0.488129   
RandomGuesser          0.046783  0.005

In [36]:
#ID 38(9)
# print results in easily readable format with standard deviations
co.print_results(include_stdevs=True)

For roc_auc:
              myTabularDataset      spambase    annthyroid myTabularDataset2  \
PCA               0.733+-0.032  0.812+-0.013  0.783+-0.000      0.713+-0.000   
AutoEncoder       0.806+-0.024  0.811+-0.013  0.809+-0.000      0.813+-0.000   
VAE               0.776+-0.027  0.812+-0.013  0.810+-0.000      0.772+-0.000   
RandomGuesser     0.516+-0.023  0.505+-0.009  0.496+-0.012      0.516+-0.023   
Average                  0.708         0.735         0.725             0.704   

                Average  
PCA            0.760258  
AutoEncoder    0.809899  
VAE            0.792671  
RandomGuesser  0.508535  
Average             NaN  

For adjusted_average_precision:
              myTabularDataset      spambase    annthyroid myTabularDataset2  \
PCA               0.385+-0.104  0.571+-0.016  0.468+-0.000      0.406+-0.000   
AutoEncoder       0.516+-0.097  0.570+-0.016  0.493+-0.000      0.568+-0.000   
VAE               0.458+-0.107  0.571+-0.016  0.493+-0.000      0.488+-0.000 

In [38]:
#ID 39(9)

co.print_latex(include_stdevs=True)

For roc_auc:
\begin{center}
\begin{tabular}{  c c c c c c  }
  & myTabularDataset & spambase & annthyroid & myTabularDataset2 & Average \\
  PCA & 0.733$\pm$0.032 & \textit{0.812$\pm$0.013} & 0.783$\pm$0.000 & 0.713$\pm$0.000 & 0.760 \\
  AutoEncoder & \textbf{0.806$\pm$0.024} & 0.811$\pm$0.013 & \textit{0.809$\pm$0.000} & \textbf{0.813$\pm$0.000} & \textbf{0.810} \\
  VAE & \textit{0.776$\pm$0.027} & \textbf{0.812$\pm$0.013} & \textbf{0.810$\pm$0.000} & \textit{0.772$\pm$0.000} & \textit{0.793} \\
  RandomGuesser & 0.516$\pm$0.023 & 0.505$\pm$0.009 & 0.496$\pm$0.012 & 0.516$\pm$0.023 & 0.509 \\
  Average & 0.708 & 0.735 & 0.725 & 0.704 &    \\
\end{tabular}
\end{center}

For adjusted_average_precision:
\begin{center}
\begin{tabular}{  c c c c c c  }
  & myTabularDataset & spambase & annthyroid & myTabularDataset2 & Average \\
  PCA & 0.385$\pm$0.104 & \textit{0.571$\pm$0.016} & 0.468$\pm$0.000 & 0.406$\pm$0.000 & 0.457 \\
  AutoEncoder & \textbf{0.516$\pm$0.097} & 0.570$\pm$0.016 & \t

This was our example algorithm. Other algorithms can be used to run and extend benchmarks; please refer to <b>#ID 19(5)</b>.