# Get Data

Reproducible script for fetching data from [this benchmark suite](https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/).

In [1]:
# Black code formatter (pip install nb_black). Disabling this cell does not affect functionality.
# %load_ext lab_black

# Prepare Filesystem

Create all the necessary folders.

If you do not want to delete data you have already downloaded, set the following cell to 'raw' so that it is skipped.

In [2]:
%%bash
# clean dataset directories
rm -rf ../data
mkdir ../data
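
As an alternative to wiping `../data`, the folders can be created idempotently from Python. The following is a minimal sketch and not part of the original notebook; the helper name `ensure_dir` is hypothetical.

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` and any missing parents; existing content is left untouched."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Usage (assumption: the notebook runs one level below the project root):
# ensure_dir(Path("..") / "data")
```

Because `exist_ok=True` makes the call a no-op when the folder is already there, re-running the cell does not remove previously downloaded files.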

In [3]:
# %%bash
# cd ../data/

# # If data/raw does not exist => Make it.
# DIRECTORY=raw
# if [  ! -d "$DIRECTORY" ]; then
#     mkdir raw
#     echo Made empty data/raw directory
# fi

# Campos Benchmark Suite

Cf. [here for the website](https://www.dbs.ifi.lmu.de/research/outlier-evaluation/), [here for the semantic suite](https://www.dbs.ifi.lmu.de/research/outlier-evaluation/input/semantic.tar.gz) and [here for the literature suite](https://www.dbs.ifi.lmu.de/research/outlier-evaluation/input/literature.tar.gz).

This is the benchmark dataset suite from the following paper:

[1] Campos, Guilherme O., et al. "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study." Data Mining and Knowledge Discovery 30.4 (2016).

## Semantic
Extract from [1]: "Semantically meaningful datasets for outlier evaluation are those in which certain
classes can be reasonably assumed to be associated with real-world instances that are
both rare and deviating—for example, ‘sick’ patients within a population dominated
by ‘healthy’ individuals."

In [4]:
%%bash

cd ../data

# get outlier-evaluation-semantic
curl -o benchmarks.tar.gz --remote-name https://www.dbs.ifi.lmu.de/research/outlier-evaluation/input/semantic.tar.gz
tar -xf benchmarks.tar.gz
rm benchmarks.tar.gz
mv semantic outlier_evaluation_semantic

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 82.5M  100 82.5M    0     0  8439k      0  0:00:10  0:00:10 --:--:-- 9884k


## Literature
Extract from [1]: "Table 1 lists those datasets of our study that are known to have appeared in the outlier detection literature."

In [5]:
%%bash 
cd ../data
# get outlier-evaluation-literature
curl -o benchmarks.tar.gz --remote-name https://www.dbs.ifi.lmu.de/research/outlier-evaluation/input/literature.tar.gz
tar -xf benchmarks.tar.gz
rm benchmarks.tar.gz
mv literature outlier_evaluation_literature

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.1M  100 44.1M    0     0  6210k      0  0:00:07  0:00:07 --:--:-- 7203k


## Normalize Pendigits

The PenDigits dataset is not normalized: its attribute values lie in the range \[0, 100\] rather than \[0, 1\], so it would require vastly different parameter settings than the other, normalized datasets. For consistency, we rescale its attributes to the range \[0, 1\].
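
To illustrate what this rescaling does, here is a minimal pure-Python sketch of min-max scaling on a single column. The notebook itself uses scikit-learn's `MinMaxScaler`; the function `min_max_scale` and the example values are hypothetical and only meant to show the arithmetic.

```python
def min_max_scale(values):
    """Rescale numbers linearly so that the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute column on a [0, 100] scale.
column = [0, 25, 50, 100]
print(min_max_scale(column))  # [0.0, 0.25, 0.5, 1.0]
```

Note that the scaling uses the minimum and maximum observed in the data, so a column whose values never reach 0 or 100 is still stretched to cover the full \[0, 1\] range.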

### Imports

In [6]:
from sklearn import preprocessing

import pandas as pd
from pathlib import Path

from scipy.io import arff

### Functions

In [7]:
def arff_from_filepath(arff_filepath):
    data, meta = arff.loadarff(arff_filepath)
    df = pd.DataFrame(data)
    df = df.rename(columns=dict(Outlier="outlier"))
    df["outlier"] = df["outlier"].transform(lambda x: x.decode("utf8"))
    return df


def get_header_lines(filepath):
    with open(filepath) as file_in:
        header_lines = []
        for line in file_in:
            header_lines.append(line)
            if "@DATA" in line:
                break
    return header_lines


def df_to_lines(df):
    n_instances = df.shape[0]
    lines = []
    for i in range(n_instances):
        l = [str(e) for e in df.iloc[i].values.tolist()]
        l[-1] = "'{}'".format(l[-1])
        line = ",".join(l)
        line = line + "\n"
        lines.append(line)
    return lines


def write_lines(lines, filepath):
    assert isinstance(lines, list)
    with open(filepath, "a") as file_out:
        file_out.writelines(lines)
    return

In [8]:
def correct_pendigits_fp(original_fp):
    "Pendigits filepaths as downloaded from the repo do not respect the filename semantics. (Claim normalized when not.)"
    dp, stem, suffix = original_fp.parent, original_fp.stem, original_fp.suffix
    name, dupl, norm, version = stem.split("_")
    new_stem = "_".join([name, dupl, version])
    new_fp = dp / (new_stem + suffix)

    original_fp.rename(new_fp)
    return new_fp


def get_normalized_fp(original_fp):
    "Assuming a CORRECT filepath comes in."
    dp, stem, suffix = original_fp.parent, original_fp.stem, original_fp.suffix
    name, dupl, version = stem.split("_")
    new_stem = "_".join([name, dupl, "norm", version])
    new_fp = dp / (new_stem + suffix)
    return new_fp


def get_normalized_df(original_df, scaler=None):
    assert (
        scaler is not None
    ), "You need a scaler from sklearn preprocessing module, eg MinMaxScaler()"
    rescale_cols = [c for c in original_df.columns if c not in ["id", "outlier"]]
    original_X = original_df[rescale_cols].values
    rescaled_X = scaler.fit_transform(original_X)

    normalized_df = original_df.copy()
    normalized_df[rescale_cols] = rescaled_X
    normalized_df = normalized_df.round(4)
    return normalized_df


def make_normalized_arff(original_fp, scaler=None, verbose=True):
    original_df = arff_from_filepath(original_fp)
    normalized_df = get_normalized_df(original_df, scaler=scaler)
    normalized_fp = get_normalized_fp(original_fp)

    header_lines = get_header_lines(original_fp)
    data_lines = df_to_lines(normalized_df)
    lines = header_lines + data_lines
    write_lines(lines, normalized_fp)

    if verbose:
        msg = "{} DONE".format(normalized_fp.stem)
        print(msg)

    return
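
The two filename helpers above can be illustrated without touching the filesystem. This is a sketch under the assumption that stems follow the `name_dupl[_norm]_version` pattern seen in the PenDigits folder; `drop_norm_tag` and `add_norm_tag` are hypothetical stand-ins for `correct_pendigits_fp` (minus the actual rename) and `get_normalized_fp`.

```python
from pathlib import PurePosixPath


def drop_norm_tag(fp):
    """Remove the `norm` tag from a `name_dupl_norm_version` stem."""
    name, dupl, _norm, version = fp.stem.split("_")
    return fp.with_name("_".join([name, dupl, version]) + fp.suffix)


def add_norm_tag(fp):
    """Insert a `norm` tag into a `name_dupl_version` stem."""
    name, dupl, version = fp.stem.split("_")
    return fp.with_name("_".join([name, dupl, "norm", version]) + fp.suffix)


downloaded = PurePosixPath("PenDigits/PenDigits_withoutdupl_norm_v01.arff")
corrected = drop_norm_tag(downloaded)  # name for the unnormalized original
normalized = add_norm_tag(corrected)   # name for the newly normalized copy

print(corrected)   # PenDigits/PenDigits_withoutdupl_v01.arff
print(normalized)  # PenDigits/PenDigits_withoutdupl_norm_v01.arff
```

The net effect is that the downloaded file gives up its misleading `norm` tag, and that name is reused for the file that really is normalized.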

### Parameters

In [9]:
outlier_evaluation_literature_directory_name = "outlier_evaluation_literature"

### Execution

In [10]:
dp_literature = (
    Path().absolute().parent
    / "data"
    / outlier_evaluation_literature_directory_name
)
dp_literature

dp = dp_literature / "PenDigits"
assert dp.exists()

fps = list(dp.glob("*.arff"))
fps.sort()

In [11]:
fps

[PosixPath('/home/jonas/Projects/2021_ODD/anomaly-detection-evaluation/data/outlier_evaluation_literature/PenDigits/PenDigits_withoutdupl_norm_v01.arff'),
 PosixPath('/home/jonas/Projects/2021_ODD/anomaly-detection-evaluation/data/outlier_evaluation_literature/PenDigits/PenDigits_withoutdupl_norm_v02.arff'),
 PosixPath('/home/jonas/Projects/2021_ODD/anomaly-detection-evaluation/data/outlier_evaluation_literature/PenDigits/PenDigits_withoutdupl_norm_v03.arff'),
 PosixPath('/home/jonas/Projects/2021_ODD/anomaly-detection-evaluation/data/outlier_evaluation_literature/PenDigits/PenDigits_withoutdupl_norm_v04.arff'),
 PosixPath('/home/jonas/Projects/2021_ODD/anomaly-detection-evaluation/data/outlier_evaluation_literature/PenDigits/PenDigits_withoutdupl_norm_v05.arff'),
 PosixPath('/home/jonas/Projects/2021_ODD/anomaly-detection-evaluation/data/outlier_evaluation_literature/PenDigits/PenDigits_withoutdupl_norm_v06.arff'),
 PosixPath('/home/jonas/Projects/2021_ODD/anomaly-detection-evaluation

In [12]:
minmax_scaler = preprocessing.MinMaxScaler()

In [13]:
for fp in fps:
    fp = correct_pendigits_fp(fp)
    make_normalized_arff(fp, scaler=minmax_scaler, verbose=True)

PenDigits_withoutdupl_norm_v01 DONE
PenDigits_withoutdupl_norm_v02 DONE
PenDigits_withoutdupl_norm_v03 DONE
PenDigits_withoutdupl_norm_v04 DONE
PenDigits_withoutdupl_norm_v05 DONE
PenDigits_withoutdupl_norm_v06 DONE
PenDigits_withoutdupl_norm_v07 DONE
PenDigits_withoutdupl_norm_v08 DONE
PenDigits_withoutdupl_norm_v09 DONE
PenDigits_withoutdupl_norm_v10 DONE


## Fix Contaminations

Unlike the semantic part of this benchmark, the literature part has not been preprocessed to contain fixed contaminations.

Two datasets in particular (Ionosphere and WPBC) only come in highly contaminated versions. Their contaminations are much higher than those of all the other datasets in the Campos benchmark, so including them as-is means comparing apples to oranges. Therefore, we subsample the anomalies to create versions of these datasets with a fixed, lower contamination (10% by default; the values actually generated are listed under Parameters below).
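
The number of anomalies to keep follows from the definition of contamination, $c = n_{anom} / (n_{anom} + n_{norm})$, which rearranges to $n_{anom} = c \cdot n_{norm} / (1 - c)$ when all normal instances are kept. The sketch below works through this with hypothetical counts (they are not taken from Ionosphere or WPBC); `anomalies_needed` is a hypothetical name for the same formula that `_get_n_anom` implements further down.

```python
def anomalies_needed(contamination, n_norm):
    """Number of anomalies that yields `contamination` when all normals are kept."""
    return (n_norm * contamination) / (1 - contamination)


# Hypothetical dataset: 900 normal instances and 400 anomalies (about 31% contamination).
n_norm, n_anom = 900, 400
target = 0.1

n_anom_new = anomalies_needed(target, n_norm)
fraction = n_anom_new / n_anom  # share of the anomalies to keep

print(round(n_anom_new))                             # 100
print(round(fraction, 2))                            # 0.25
print(round(n_anom_new / (n_anom_new + n_norm), 2))  # 0.1
```

In this example, keeping a random quarter of the anomalies brings the contamination down to the 10% target.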

### Imports

In [14]:
import warnings

### Functions

In [15]:
def split_df(df):
    uvalues = df["outlier"].unique()

    no = [v for v in uvalues if "n" in v][0]
    yes = [v for v in uvalues if "y" in v][0]

    norm_df = df[df["outlier"] == no]
    anom_df = df[df["outlier"] == yes]
    return norm_df, anom_df


def _get_anom_subsample_df(norm_df, anom_df, fraction, random_state=42):
    anom_subsample_df = anom_df.sample(
        frac=fraction, random_state=random_state
    ).reset_index(drop=True)
    return (
        pd.concat([norm_df, anom_subsample_df])
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )


def _get_norm_subsample_df(norm_df, anom_df, fraction, random_state=42):
    norm_subsample_df = norm_df.sample(
        frac=fraction, random_state=random_state
    ).reset_index(drop=True)
    return (
        pd.concat([anom_df, norm_subsample_df])
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )


def _get_n_anom(contamination, n_norm):
    return (n_norm * contamination) / (1 - contamination)


def _get_n_norm(contamination, n_anom):
    return (n_anom * (1 - contamination)) / contamination


def get_anom_subsample_dfs(df, contamination=0.1, n_versions=1, random_state=42):
    norm_df, anom_df = split_df(df)

    n_inst_old, n_norm_old, n_anom_old = df.shape[0], norm_df.shape[0], anom_df.shape[0]
    contam_old = n_anom_old / n_inst_old

    must_subsample_anomalies = contamination / contam_old < 1.0
    must_subsample_normal = contamination / contam_old > 1.0

    if must_subsample_anomalies:
        # New DataFrames by subsampling anomalies
        dfs = []
        n_anom_new = _get_n_anom(contamination, n_norm_old)
        fraction = n_anom_new / n_anom_old
        assert fraction < 1.0, "You are subsampling so the fraction needs to be lower"

        for i in range(n_versions):
            random_state_sample = random_state * i + i
            dfs.append(
                _get_anom_subsample_df(
                    norm_df, anom_df, fraction, random_state=random_state_sample
                )
            )
    elif must_subsample_normal:
        # New DataFrames by subsampling normal instances
        dfs = []
        n_norm_new = _get_n_norm(contamination, n_anom_old)
        fraction = n_norm_new / n_norm_old
        assert fraction < 1.0, "You are subsampling so the fraction needs to be lower"

        for i in range(n_versions):
            random_state_sample = random_state * i + i
            dfs.append(
                _get_norm_subsample_df(
                    norm_df, anom_df, fraction, random_state=random_state_sample
                )
            )
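    else:
        # The dataset already has the requested contamination: nothing to subsample.
        # Define dfs here so that the return statement below does not fail.
        warnings.warn("Dataset already has the requested contamination; no subsamples made.")
        dfs = []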

    return dfs

In [16]:
def _get_version_suffix(version_id=1):
    return "v{:02d}".format(version_id)


def _get_contamination_suffix(contamination=0.1):
    assert isinstance(contamination, float)
    assert 0.0 < contamination <= 1.0
    # round() avoids float truncation artifacts, e.g. int(0.29 * 100) == 28.
    return "{:02d}".format(round(contamination * 100))


def get_subsample_fp(filepath, version_id=1, contamination=0.1):
    contamination_suffix = _get_contamination_suffix(contamination=contamination)

    dp = filepath.parent
    stem, suffix = filepath.stem, filepath.suffix

    if fp_has_version(filepath):
        # Add the old version suffix
        stem, version_suffix = stem.split("_v")
        new_fn = "{}_{}_v{}{}".format(
            stem, contamination_suffix, version_suffix, suffix
        )
    else:
        # You can add a novel version suffix
        version_suffix = _get_version_suffix(version_id=version_id)
        new_fn = "{}_{}_{}{}".format(stem, contamination_suffix, version_suffix, suffix)
    return dp / new_fn

In [17]:
def make_subsampled_arffs(
    original_fp, contamination=0.1, n_versions=1, random_state=42, verbose=True
):
    original_df = arff_from_filepath(original_fp)
    subsample_dfs = get_anom_subsample_dfs(
        original_df,
        contamination=contamination,
        n_versions=n_versions,
        random_state=random_state,
    )

    subsample_fps = [
        get_subsample_fp(original_fp, version_id=i + 1, contamination=contamination)
        for i in range(len(subsample_dfs))
    ]

    header_lines = get_header_lines(original_fp)
    print(len(header_lines))

    for df, fp in zip(subsample_dfs, subsample_fps):
        data_lines = df_to_lines(df)
        lines = header_lines + data_lines
        write_lines(lines, fp)

        if verbose:
            msg = "{} DONE".format(fp.stem)
            print(msg)
    return

In [18]:
def fp_has_version(filepath):
    "Check if filepath has version"
    return "_v" in filepath.stem

In [19]:
def add_contamination_to_original_fp(original_fp, filepath_has_version=False):
    original_df = arff_from_filepath(original_fp)
    norm_df, anom_df = split_df(original_df)
    orig_contamination = anom_df.shape[0] / original_df.shape[0]

    # New filename
    dp = original_fp.parent
    stem, suffix = original_fp.stem, original_fp.suffix
    contamination_suffix = _get_contamination_suffix(contamination=orig_contamination)

    if filepath_has_version:
        stem, version_suffix = stem.split("_v")
        new_fn = "{}_{}_v{}{}".format(
            stem, contamination_suffix, version_suffix, suffix
        )
    else:
        # Just add to back
        new_fn = "{}_{}{}".format(stem, contamination_suffix, suffix)

    new_fp = dp / new_fn
    original_fp.rename(new_fp)
    return new_fp

### Parameters

Parameters related to the subsampling procedure.

In [20]:
CONTAMINATIONS = [0.02, 0.05, 0.1]
RANDOM_STATE = 42
N_VERSIONS = 10
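
As a quick sanity check on these parameters, the sketch below lists the per-version seeds implied by the `random_state * i + i` rule in `get_anom_subsample_dfs`, and counts how many subsampled files a dataset without a version suffix produces. It only restates arithmetic from the code above; nothing is read from or written to disk.

```python
CONTAMINATIONS = [0.02, 0.05, 0.1]
RANDOM_STATE = 42
N_VERSIONS = 10

# Seed used for version i, mirroring `random_state * i + i`.
seeds = [RANDOM_STATE * i + i for i in range(N_VERSIONS)]
print(seeds[:3])  # [0, 43, 86]

# One file per (contamination, version) pair.
n_files = len(CONTAMINATIONS) * N_VERSIONS
print(n_files)  # 30
```

Each version gets a distinct seed, so the ten versions of a dataset are different random subsamples, while re-running the notebook reproduces the same ten.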

### Execution

In [21]:
dp_literature = (
    Path().absolute().parent
    / "data"
    / outlier_evaluation_literature_directory_name
)
dp_literature

PosixPath('/cw/dtaijupiter/NoCsBack/dtai/jonass/Projects/2021-ODD/anomaly-detection-evaluation/data/outlier_evaluation_literature')

In [22]:
fps_to_preprocess = [d for d in dp_literature.rglob("*.arff")]
fps_to_preprocess.sort()

In [23]:
for fp_pp in fps_to_preprocess:

    if fp_has_version(fp_pp):
        # No need to make versions of versions
        n_versions = 1
    else:
        n_versions = N_VERSIONS

    for contamination in CONTAMINATIONS:
        make_subsampled_arffs(
            fp_pp,
            contamination=contamination,
            n_versions=n_versions,
            random_state=RANDOM_STATE,
            verbose=True,
        )

    fp_pp = add_contamination_to_original_fp(
        fp_pp, filepath_has_version=fp_has_version(fp_pp)
    )

33
ALOI_02_v01 DONE
ALOI_02_v02 DONE
ALOI_02_v03 DONE
ALOI_02_v04 DONE
ALOI_02_v05 DONE
ALOI_02_v06 DONE
ALOI_02_v07 DONE
ALOI_02_v08 DONE
ALOI_02_v09 DONE
ALOI_02_v10 DONE
33
ALOI_05_v01 DONE
ALOI_05_v02 DONE
ALOI_05_v03 DONE
ALOI_05_v04 DONE
ALOI_05_v05 DONE
ALOI_05_v06 DONE
ALOI_05_v07 DONE
ALOI_05_v08 DONE
ALOI_05_v09 DONE
ALOI_05_v10 DONE
33
ALOI_10_v01 DONE
ALOI_10_v02 DONE
ALOI_10_v03 DONE
ALOI_10_v04 DONE
ALOI_10_v05 DONE
ALOI_10_v06 DONE
ALOI_10_v07 DONE
ALOI_10_v08 DONE
ALOI_10_v09 DONE
ALOI_10_v10 DONE
33
ALOI_norm_02_v01 DONE
ALOI_norm_02_v02 DONE
ALOI_norm_02_v03 DONE
ALOI_norm_02_v04 DONE
ALOI_norm_02_v05 DONE
ALOI_norm_02_v06 DONE
ALOI_norm_02_v07 DONE
ALOI_norm_02_v08 DONE
ALOI_norm_02_v09 DONE
ALOI_norm_02_v10 DONE
33
ALOI_norm_05_v01 DONE
ALOI_norm_05_v02 DONE
ALOI_norm_05_v03 DONE
ALOI_norm_05_v04 DONE
ALOI_norm_05_v05 DONE
ALOI_norm_05_v06 DONE
ALOI_norm_05_v07 DONE
ALOI_norm_05_v08 DONE
ALOI_norm_05_v09 DONE
ALOI_norm_05_v10 DONE
33
ALOI_norm_10_v01 DONE
ALOI_norm_

# HiCS High-Dimensional Benchmark Suite

Synthetic datasets from the following paper:

[2] Keller, Fabian, Emmanuel Müller, and Klemens Böhm. "HiCS: High contrast subspaces for density-based outlier ranking." 2012 IEEE 28th International Conference on Data Engineering. IEEE, 2012.

Cf. [here for the datasets](https://www.ipd.kit.edu/mitarbeiter/muellere/HiCS/synth.zip) and [here for the website](https://www.ipd.kit.edu/mitarbeiter/muellere/HiCS).

In [24]:
# %%bash

# cd ../data/raw


# # get synthetic datasets from HiCS
# curl -o synth.zip --remote-name https://www.ipd.kit.edu/mitarbeiter/muellere/HiCS/synth.zip
# unzip synth.zip -d HiCS_synthetic_datasets
# rm synth.zip

# Custom Datasets of Special Interest

Datasets that we obtained (or generated) outside of existing benchmark suites.

## Zoo dataset
This dataset has some easily interpretable features about animals.
We use this in the paper to showcase the explainability aspect of AD-Mercs.

https://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data

In [25]:
# %%bash
# cd ../data/raw
# if [  ! -d UCI ]; then
#     mkdir UCI
#     echo Made empty UCI directory
# fi
# cd UCI
# curl -o zoo.data --remote-name https://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data

## NBA-data

TODO: Consider pinning the download to a specific commit instead of `master`, so that the fetched data cannot change between runs.

In [26]:
# %%bash
# cd ../data/raw
# if [  ! -d NBA ]; then
#     mkdir NBA
#     echo Made empty NBA directory
# fi
# cd NBA

# curl -o clean-01.csv https://raw.githubusercontent.com/eliavw/nba-anomaly-generator/master/data/clean/nba-clean-01.csv
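
One way to address the TODO above is to replace the branch name in the raw-file URL with a commit SHA. The following is only a sketch: `pinned_raw_url` is a hypothetical helper, and `<commit-sha>` is a placeholder that must be replaced by a real commit hash from the repository.

```python
def pinned_raw_url(owner, repo, ref, path):
    """Build a raw.githubusercontent.com URL for `path` at `ref` (branch, tag, or commit SHA)."""
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(owner, repo, ref, path)


# "<commit-sha>" is a placeholder, not a real commit.
url = pinned_raw_url(
    "eliavw", "nba-anomaly-generator", "<commit-sha>", "data/clean/nba-clean-01.csv"
)
print(url)
```

With `ref="master"` the helper reproduces the URL used in the cell above; with a commit SHA, the same file content is fetched even if the branch moves on later.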