This notebook focuses on how custom preprocessing can be applied to a dataset and recorded in the `yaml` file, so that it is easily reproducible.


The notebook `Reproducibility1.ipynb` shows how `oab` can generally be used to make results easily reproducible, from preprocessing, through converting the dataset into an anomaly dataset, to sampling from this dataset.

In [1]:
import sys
sys.path.append('../..')

%load_ext autoreload
%autoreload 2

The custom preprocessing in this example is simply a scaling operation applied to all data points. (Note: Scaling can also be handled directly in `oab` using the `ClassificationDataset.scale()` operation. It serves as the custom step here only because it is a simple operation.)

In [2]:
# download custom preprocessing function and inspect content
import wget
wget.download('https://raw.githubusercontent.com/jandeller/test/main/custom_preprocessing.py', "custom_preprocessing.py")
!cat custom_preprocessing.py

import numpy as np

def scale_values(values: np.ndarray, scaling_factor: float, message: str):
    print(message)
    return values * scaling_factor


**Note that the first parameter has to be the `X` values of the dataset**, as these are automatically passed as the first argument when calling custom preprocessing functions.
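The calling convention described above can be illustrated with a minimal sketch. The helper `apply_custom_function` is hypothetical and is not part of `oab`; it only assumes that the `X` values are passed positionally and all recorded parameters are passed by name. The `scale_values` shown here is a pure-Python stand-in for the downloaded function so that the sketch runs without dependencies.

```python
import inspect


def scale_values(values, scaling_factor, message):
    # Pure-Python stand-in for the downloaded function (works on nested lists).
    print(message)
    return [[v * scaling_factor for v in row] for row in values]


def apply_custom_function(func, X, parameters):
    """Hypothetical helper: pass X first, then the recorded parameters by name."""
    # Collect every parameter after the first one that has no default value.
    signature = inspect.signature(func)
    required = [
        name
        for name, param in list(signature.parameters.items())[1:]
        if param.default is inspect.Parameter.empty
    ]
    missing = [name for name in required if name not in parameters]
    if missing:
        raise TypeError(f"missing parameters for {func.__name__}: {missing}")
    return func(X, **parameters)


X = [[120.0, 205.5], [124.0, 202.0]]
scaled = apply_custom_function(
    scale_values,
    X,
    {"scaling_factor": 0.5, "message": "Scaled all values by factor 0.5."},
)
```

Because `X` occupies the first position, the parameter names recorded for a function have to match the names of its remaining parameters.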

In `Reproducibility1.ipynb` (LINK HERE), it was shown how a `yaml` file that records all changes applied to a dataset is created. The custom preprocessing steps are added manually to that file by specifying the operation's name and parameters. In this case, the following would be added to the `yaml` file:

```
custom_functions:
  - name: scale_values
    parameters:
      scaling_factor: 0.5
      message: "Scaled all values by factor 0.5."
```
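To make the structure of this fragment concrete, the sketch below shows the Python object that a YAML loader (for example PyYAML's `yaml.safe_load`) returns for it, and one possible way to walk through the entries. The `REGISTRY` lookup and `run_custom_functions` are assumptions made for illustration; how `oab` actually resolves the function name is not shown in this notebook. The `scale_values` here is a simplified stand-in that works on a flat list.

```python
# Parsed form of the yaml fragment above: a mapping with one key whose
# value is a list of steps, each with a name and a parameter mapping.
config = {
    "custom_functions": [
        {
            "name": "scale_values",
            "parameters": {
                "scaling_factor": 0.5,
                "message": "Scaled all values by factor 0.5.",
            },
        }
    ]
}


def scale_values(values, scaling_factor, message):
    # Simplified stand-in for the downloaded function (flat list of numbers).
    print(message)
    return [v * scaling_factor for v in values]


# Hypothetical name-to-function lookup used only in this sketch.
REGISTRY = {"scale_values": scale_values}


def run_custom_functions(values, config):
    """Apply each recorded step in order, passing the values first."""
    for step in config.get("custom_functions", []):
        func = REGISTRY[step["name"]]
        values = func(values, **step.get("parameters", {}))
    return values


result = run_custom_functions([120.0, 205.5, 8.0], config)
```

Since `custom_functions` is a list, several steps can be recorded, and in this sketch they are applied in the order in which they appear.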

We next download a `yaml` file with this content.
Then, a dataset is loaded as a classification dataset without any preprocessing applied to it, and the preprocessing step described in the `yaml` file, i.e., the custom scaling function, is applied to it.

In [3]:
# download yaml and inspect content
import wget
wget.download('https://raw.githubusercontent.com/jandeller/test/main/custom_preprocessing_config.yaml', "custom_preprocessing_config.yaml")
!cat custom_preprocessing_config.yaml

custom_functions:
  - name: scale_values
    parameters:
      scaling_factor: 0.5
      message: "Scaled all values by factor 0.5."


In [4]:
# load dataset (as classification dataset without any preprocessing applied)
from oab.data.load_dataset import load_dataset
wilt_cd = load_dataset('wilt', anomaly_dataset=False, preprocess_classification_dataset=False)

Credits: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


In [5]:
# inspect values from wilt dataset
wilt_cd.values

array([[120.3627737 , 205.5       , 119.3953488 , 416.5813953 ,
         20.67631835],
       [124.7395833 , 202.8       , 115.3333333 , 354.3333333 ,
         16.70715083],
       [134.6919643 , 199.2857143 , 116.8571429 , 477.8571429 ,
         22.49671178],
       ...,
       [119.0766871 , 247.9512195 , 113.3658537 , 808.0243902 ,
         24.83005893],
       [107.9444444 , 197.        ,  90.        , 451.        ,
          8.2148874 ],
       [119.7319277 , 182.2380952 ,  74.28571429, 301.6904762 ,
         22.94427836]])

In [7]:
wilt_cd.perform_operations_from_yaml(filepath="custom_preprocessing_config.yaml")

Scaled all values by factor 0.5.


In [8]:
# inspect values from wilt dataset -> Scaled by 0.5.
wilt_cd.values

array([[ 60.18138685, 102.75      ,  59.6976744 , 208.29069765,
         10.33815917],
       [ 62.36979165, 101.4       ,  57.66666665, 177.16666665,
          8.35357541],
       [ 67.34598215,  99.64285715,  58.42857145, 238.92857145,
         11.24835589],
       ...,
       [ 59.53834355, 123.97560975,  56.68292685, 404.0121951 ,
         12.41502946],
       [ 53.9722222 ,  98.5       ,  45.        , 225.5       ,
          4.1074437 ],
       [ 59.86596385,  91.1190476 ,  37.14285715, 150.8452381 ,
         11.47213918]])
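As a quick sanity check, the first row printed before and after the operation can be compared by hand. The sketch below copies those two rows from the outputs above and confirms that each value was halved; the tolerance is an assumption chosen to absorb the rounding of the printed digits.

```python
import math

# First row of wilt_cd.values before and after scaling, copied from the outputs above.
before = [120.3627737, 205.5, 119.3953488, 416.5813953, 20.67631835]
after = [60.18138685, 102.75, 59.6976744, 208.29069765, 10.33815917]

scaling_factor = 0.5

# Compare element-wise, allowing for the rounding in the printed representation.
all_scaled = all(
    math.isclose(b * scaling_factor, a, rel_tol=1e-7)
    for b, a in zip(before, after)
)
print(all_scaled)
```

In the notebook itself, the same comparison could be made on the full array by keeping a copy of `wilt_cd.values` before calling `perform_operations_from_yaml`.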