# Introduction

In this quickstart, we will get Data Detective running on your dataset as quickly as possible.


In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import torch
import torchvision.transforms as transforms

from torchvision.datasets import MNIST
from typing import Dict, Union

from constants import FloatTensor
from src.aggregation.rankings import RankingAggregator, RankingAggregationMethod
from src.data_detective_engine import DataDetectiveEngine
from src.datasets.tutorial_dataset import TutorialDataset
from src.enums.enums import DataType



# Step 1: Dataset Implementation

### CSV Dataset Example

The easiest way to get started with Data Detective for CSV data is to use the `CSVDataset` class. This class accepts the path for a CSV file as well as a dictionary containing the datatypes for each column in the CSV file. 

The CSV file can contain numbers, text, or images; images are represented in the CSV file as absolute paths. The datatype options available for the `datatypes` dictionary are:
- `DataType.CONTINUOUS`
- `DataType.MULTIDIMENSIONAL` 
- `DataType.CATEGORICAL` 
- `DataType.TEXT`
- `DataType.IMAGE`
- `DataType.SEQUENTIAL`

If it suits your use case, fill-in-the-blank code for creating a `CSVDataset` is available below. Otherwise, skip to `Dataset Construction` to find out how to build your own dataset.




In [None]:
from src.datasets.csv_dataset import CSVDataset

dataset = CSVDataset(
    # change filepath to your csv filepath
    filepath="your_csv_filepath.csv",
    # change dictionary to map from csv column names to data types
    datatypes={
        "column1": DataType.CONTINUOUS,
        "column2": DataType.MULTIDIMENSIONAL,
        # ...
        "column_k": DataType.IMAGE,
    }
)

Note: if there is an `IMAGE` column in the CSV dataset that contains image paths, they will automatically be loaded into the dataset via `np.load`. 
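
Since `IMAGE` columns are read with `np.load`, the paths in such a column presumably need to point to NumPy array files (for example `.npy`) rather than encoded formats such as PNG or JPEG. The sketch below shows one way to prepare such a file. The 28x28 array, the temporary directory, and the file name `image_0.npy` are all hypothetical placeholders.

```python
import os
import tempfile

import numpy as np

# Hypothetical image: a 28x28 grayscale array standing in for real pixel data.
image = np.random.default_rng(0).random((28, 28)).astype(np.float32)

# Save it as a .npy file; the absolute path is what would go in the CSV cell.
image_dir = tempfile.mkdtemp()
image_path = os.path.abspath(os.path.join(image_dir, "image_0.npy"))
np.save(image_path, image)

# np.load restores the array with the same shape, dtype, and values.
loaded = np.load(image_path)
print(loaded.shape, loaded.dtype)
```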

### Dataset Construction

If dealing with data that does not easily serialize in CSV format, it is easier to create your own dataset to work within the Data Detective framework. Your dataset needs to satisfy the following requirements: 

1. It must override the `__getitem__` method to return a dictionary mapping each data column key to its value. 
2. It must define a `datatypes` method that returns a dictionary mapping each data column key to the column's datatype. 
3. It must inherit from `torch.utils.data.Dataset`.
4. \[optional\] It is convenient, but not necessary, to define a `__len__` method.


Before diving in, let's look at a very simple dataset that consists of 10 columns of normal random variables. 

In [11]:
from pandas import DataFrame
from torch.utils.data import Dataset

class NormalDataset(Dataset):
    def __init__(self, num_cols: int = 10, dataset_size: int = 10000, loc: float = 0.):
        """
        Creates a normal dataset with column `feature_k` for k in [0, num_cols) 
        @param num_cols: number of columns to have
        @param dataset_size: number of datapoints to have
        @param loc: the mean of the data. 
        """
        self.dataset_size = dataset_size
        self.columns = [f"feature_{j}" for j in range(num_cols)]

        dataframe: DataFrame = pd.DataFrame({
            f"feature_{i}": np.random.normal(loc, 1, size=dataset_size)
            for i in range(num_cols)
        }, columns=self.columns)

        self.dataframe = dataframe

    def __getitem__(self, index: int) -> Dict[str, float]:
        """
        Returns a dict containing each column mapped to its value. 
        """
        return self.dataframe.iloc[index].to_dict()

    def __len__(self):
        return self.dataset_size

    def datatypes(self) -> Dict[str, DataType]:
        """
        Returns a dictionary mapping each column to its datatype.
        """
        return {
            column_name: DataType.CONTINUOUS
            for column_name in self.columns
        }

dataset = NormalDataset() 

Above, you can see that the dataset satisfies all three of the requirements above:

1. It overrides `__getitem__` to provide a dict mapping from each column to a single value. 
2. It overrides `datatypes` to map the same keys in `__getitem__` to their datatypes. 
3. It inherits from `torch.utils.data.Dataset`.

For complete clarity, let's take a look at the outputs of (1) and (2) below: 


In [6]:
dataset.__getitem__(0)

{'feature_0': -1.8818520390267668,
 'feature_1': 0.48079518810297267,
 'feature_2': 0.3253609247058997,
 'feature_3': -0.8423106628622142,
 'feature_4': -0.8097134235785098,
 'feature_5': -2.2650311419827744,
 'feature_6': -0.5544628629028009,
 'feature_7': 0.06647392399256227,
 'feature_8': 0.8387809291885946,
 'feature_9': -0.34279014635692356}

In [8]:
dataset.datatypes()

{'feature_0': <DataType.CONTINUOUS: 'continuous'>,
 'feature_1': <DataType.CONTINUOUS: 'continuous'>,
 'feature_2': <DataType.CONTINUOUS: 'continuous'>,
 'feature_3': <DataType.CONTINUOUS: 'continuous'>,
 'feature_4': <DataType.CONTINUOUS: 'continuous'>,
 'feature_5': <DataType.CONTINUOUS: 'continuous'>,
 'feature_6': <DataType.CONTINUOUS: 'continuous'>,
 'feature_7': <DataType.CONTINUOUS: 'continuous'>,
 'feature_8': <DataType.CONTINUOUS: 'continuous'>,
 'feature_9': <DataType.CONTINUOUS: 'continuous'>}

Note that both dictionaries contain identical keys, indicating that no datatypes are missed in the definition of the `datatypes` function. 

Below is skeleton code for constructing a dataset. Fill it in with your desired implementation of `__getitem__` and `datatypes`, and any initialization you may need to do.  

In [None]:
class YourDataset(Dataset):
    def __init__(self):
        """
        Sets up the dataset. This can include steps like:
            - loading csv paths
            - reading in text data
            - cleaning and preprocessing
        """
    
        """
        YOUR CODE HERE
        PUT YE CODE HERE, MATEY
        ARR
        """

    def __getitem__(self, index: int) -> Dict[str, float]:
        """
        Returns a dict containing each column mapped to its value. 
        """
    
        """
        YOUR CODE HERE
        AHOY, YE SCURVY CODER! WRITE YER MAGIC HERE!
        """


        return self.dataframe.iloc[index].to_dict()

    def datatypes(self) -> Dict[str, DataType]:
        """
        Returns a dictionary mapping each column to its datatype.
        """

        """
        YOUR CODE HERE
        AHOY, YE SCURVY CODER! WRITE YER MAGIC HERE!
        """

    # NOTE: convenient, but not required, to add a __len__ method
    # def __len__(self) -> int: 
    #     pass

# put initialization code here or fix if needed
dataset = YourDataset()

Now that you've written your dataset, let's make sure everything is in ship shape!

In [None]:
dataset[0]

In [None]:
dataset.datatypes()

In [9]:
assert(isinstance(dataset[0], dict))
assert(isinstance(dataset.datatypes(), dict))
assert(dataset[0].keys() == dataset.datatypes().keys())
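
The assertions above inspect only the first sample. A slightly broader check can be sketched as follows. The helper `check_dataset_keys` and the stand-in `ToyDataset` are hypothetical and not part of Data Detective; plain strings stand in for `DataType` members so the block runs on its own.

```python
from typing import Any, Dict, Iterable


def check_dataset_keys(dataset: Any, indices: Iterable[int]) -> None:
    """Raise AssertionError if any sampled item disagrees with datatypes()."""
    expected_keys = set(dataset.datatypes().keys())
    for index in indices:
        sample = dataset[index]
        assert isinstance(sample, dict), f"sample {index} is not a dict"
        assert set(sample.keys()) == expected_keys, (
            f"sample {index} has keys {sorted(sample.keys())}, "
            f"expected {sorted(expected_keys)}"
        )


class ToyDataset:
    """Stand-in dataset with two columns; strings replace DataType members."""

    def __getitem__(self, index: int) -> Dict[str, float]:
        return {"feature_0": float(index), "feature_1": float(index) * 2}

    def __len__(self) -> int:
        return 5

    def datatypes(self) -> Dict[str, str]:
        return {"feature_0": "continuous", "feature_1": "continuous"}


toy = ToyDataset()
check_dataset_keys(toy, range(len(toy)))
print("all sampled items match datatypes()")
```

With your own dataset, the same call would be `check_dataset_keys(dataset, range(len(dataset)))`, or a smaller sample of indices if loading each item is expensive.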

# Step 2: Data Object Creation

The *data object* is a dictionary that consists of the preprocessed dataset and (optionally) its splits. More information about setting up the data object is available in the tutorial; for the purpose of the quickstart, splitting and organization is done for you. 

In [15]:
inference_size: int = 20
everything_but_inference_size: int = len(dataset) - inference_size
inference_dataset, everything_but_inference_dataset = torch.utils.data.random_split(dataset, [inference_size, everything_but_inference_size])
    
train_size: int = int(0.6 * len(everything_but_inference_dataset))
val_size: int = int(0.2 * len(everything_but_inference_dataset))
test_size: int = len(everything_but_inference_dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(everything_but_inference_dataset, [train_size, val_size, test_size])

data_object = {
    "entire_set": dataset,
    "everything_but_inference_set": everything_but_inference_dataset,
    "inference_set": inference_dataset,
    "split_group_set": {
        # unordered splits belong here
        # in this example, train/val/test are included, but this dict can be as long
        # as desired and can contain an arbitrary number of named splits
        "train/val/test": {
            "training_set": train_dataset,
            "validation_set": val_dataset,
            "test_set": test_dataset,
        },
        # Example of k-fold split:
        # "fold_0": {
        #     "training_set": train_datasets[0],
        #     "test_set": test_datasets[0],
        # },
        # "fold_1": {
        #     "training_set": train_datasets[1],
        #     "test_set": test_datasets[1],
        # },
        # ...
        # "fold_k": {
        #     "training_set": train_datasets[k],
        #     "test_set": test_datasets[k],
        # }
    }
}

print(f"size of inference_dataset: {inference_dataset.__len__()}")
print(f"size of everything_but_inference_dataset: {everything_but_inference_dataset.__len__()}")
print(f"size of train_dataset: {train_dataset.__len__()}")
print(f"size of entire dataset: {dataset.__len__()}")
print(f"size of val_dataset: {val_dataset.__len__()}")
print(f"size of test_dataset: {test_dataset.__len__()}")

size of inference_dataset: 20
size of everything_but_inference_dataset: 9980
size of train_dataset: 5988
size of entire dataset: 10000
size of val_dataset: 1996
size of test_dataset: 1996
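
The commented-out k-fold entries in the data object refer to lists named `train_datasets` and `test_datasets`, which this quickstart does not define. One way to build them is sketched below with a hypothetical helper, `k_fold_indices`, that only produces index lists. With a real dataset, each list could then be wrapped as `torch.utils.data.Subset(dataset, indices)`.

```python
import random
from typing import Dict, List


def k_fold_indices(
    num_samples: int, num_folds: int, seed: int = 0
) -> Dict[str, Dict[str, List[int]]]:
    """Build a split group dict with one train/test index pair per fold."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)

    # Take every num_folds-th shuffled index, starting at a different
    # offset for each fold, so the test chunks are disjoint.
    chunks = [indices[fold::num_folds] for fold in range(num_folds)]

    split_groups: Dict[str, Dict[str, List[int]]] = {}
    for fold, test_indices in enumerate(chunks):
        held_out = set(test_indices)
        train_indices = [i for i in indices if i not in held_out]
        split_groups[f"fold_{fold}"] = {
            "training_set": train_indices,
            "test_set": test_indices,
        }
    return split_groups


folds = k_fold_indices(num_samples=10, num_folds=5)
for name, split in folds.items():
    print(name, len(split["training_set"]), len(split["test_set"]))
```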


# Step 3: Setting up a Validation Schema

## Step 3.1: Specifying Validators and Options

The validation schema specifies the checks that the Data Detective Engine will execute and the transforms that Data Detective will use. More details about creating your own validation schema are available in the tutorial; below is the validation schema that we recommend to get started. 

In [14]:
validation_schema : Dict = {
    "default_inclusion": False,
    "validators": {
        "unsupervised_anomaly_data_validator": {},
        "unsupervised_multimodal_anomaly_data_validator": {},
        "split_covariate_data_validator": {},
        "ood_inference_data_validator": {}
    }
}

## Step 3.2: Specifying Transforms

You may be working with a data modality that has little or no method support in Data Detective. The simplest way to use all of Data Detective's functionality is to apply a transform that maps this modality to a well-supported one, such as multidimensional data. In our example, we use a pretrained resnet50 backbone to map images to 2048-dimensional vectors, which lets us apply methods built for multidimensional data to our image representations. 

More information about introducing custom transforms into Data Detective and customizing the transform schema is available in the main tutorial and the ExtendingDD tutorial.


In [5]:
transform_schema : Dict = {
    "transforms": {
        "image": [{
            "name": "resnet50",
            "in_place": "False",
            "options": {},
        }],
    }
}
     
full_validation_schema: Dict = {
    **validation_schema, 
    **transform_schema
}
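
The `{**a, **b}` expression builds a new dictionary containing the entries of both schemas. The two schemas above share no top-level keys, so nothing is overwritten; if they did share a key, the value from the later dictionary would win. Note that the merge is shallow, so nested dictionaries are not combined. A small illustration with abbreviated copies of the schemas:

```python
validation_part = {
    "default_inclusion": False,
    "validators": {"unsupervised_anomaly_data_validator": {}},
}
transform_part = {
    "transforms": {
        "image": [{"name": "resnet50", "in_place": "False", "options": {}}],
    },
}

# Disjoint top-level keys: the result simply contains all three.
merged = {**validation_part, **transform_part}
print(sorted(merged.keys()))

# Colliding key: the later dictionary takes precedence.
overridden = {**validation_part, **{"default_inclusion": True}}
print(overridden["default_inclusion"])
```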

# Step 4: Running the Data Detective Engine

Now that the full validation schema and data object are prepared, we are ready to run the Data Detective Engine.

In [6]:
data_detective_engine = DataDetectiveEngine()

# 1 thread, --- 220.85648322105408 seconds ---
# multithreading (joblib), --- 149.11400604248047 seconds ---
# thread pools, --- 81.38025784492493 seconds ---
# data-level caching, clean cache, --- 75.22503590583801 seconds ---
# sample-level caching, clean cache, --- 26.184876918792725 seconds ---
# data-level caching, dirty cache, --- 22.925609827041626 seconds ---
# sample-level caching, dirty cache, --- 19.73765206336975 seconds ---


start_time = time.time()
results = data_detective_engine.validate_from_schema(full_validation_schema, data_object)
print("--- %s seconds ---" % (time.time() - start_time))

running validator class unsupervised_anomaly_data_validator...
running validator class unsupervised_multimodal_anomaly_data_validator...



running validator class split_covariate_data_validator...
running validator class ood_inference_data_validator...
thread 11393855488 entered to handle validator method iforest_anomaly_validator_methodthread 11410681856 entered to handle validator method cblof_anomaly_validator_method
thread 11410681856:    running cblof_anomaly_validator_method...

thread 11393855488:    running iforest_anomaly_validator_method...
thread 11427508224 entered to handle validator method pca_anomaly_validator_method
thread 11427508224:    running pca_anomaly_validator_method...
thread 11444334592 entered to handle validator method iforest_multimodal_anomaly_validator_method
thread 11444334592:    running iforest_multimodal_anomaly_validator_method...
thread 11461160960 entered to handle validator method pca_multimodal_anomaly_validator_method
thread 11461160960:    running pca_multimodal_anomaly_validator_method...
thread 11477987328 entered to handle validator method cblof_multimodal_anomaly_validator_met



thread 11427508224: finished




thread 11461160960: finished
thread 11511640064: finished
thread 11528466432: finished
thread 11578945536: finished
thread 11595771904: finished
thread 11444334592: finished




thread 11629424640: finished
thread 11393855488: finished
thread 11612598272: finished
thread 11477987328: finished




thread 11562119168: finished
thread 11410681856: finished
thread 11494813696: finished
thread 11545292800: finished
--- 15.769446849822998 seconds ---
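
For timing comparisons like the ones recorded in the comments above, `time.perf_counter` is generally a better fit than `time.time`, since it is monotonic and uses the highest-resolution clock available. A sketch with a hypothetical `timed` context manager (not part of Data Detective), where `time.sleep` stands in for the engine call:

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the elapsed time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("validate", timings):
    # Stand-in for data_detective_engine.validate_from_schema(...)
    time.sleep(0.01)

print(f"--- {timings['validate']:.3f} seconds ---")
```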


Great! Let's start to look at and analyze the results we've collected.

# Step 5: Interpreting Results using the Built-In Rank Aggregator

To do rank aggregation, create a `RankingAggregator` object and either aggregate everything with `aggregate_rankings` or aggregate over a single modality with `aggregate_modal_rankings`. See the example below.

In [None]:
from src.aggregation.rankings import RankingAggregationMethod, RankingAggregator

# aggregation methods to apply to the validator's results
aggregation_methods = [
    RankingAggregationMethod.LOWEST_RANK,
    RankingAggregationMethod.HIGHEST_RANK,
    RankingAggregationMethod.ROUND_ROBIN,
]

aggregator = RankingAggregator(results_object=results)
# rankings for a single data modality (here, the column "feature_0")
modal_rankings = aggregator.aggregate_modal_rankings(
    "unsupervised_anomaly_data_validator",
    aggregation_methods,
    given_data_modality="feature_0",
)
# rankings without restricting to one modality
total_rankings = aggregator.aggregate_rankings(
    "unsupervised_anomaly_data_validator",
    aggregation_methods,
)
total_rankings
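
To make the idea of rank aggregation concrete, here is a toy sketch in plain Python. It is not Data Detective's implementation: the function names, the score convention (a higher score is assumed to mean more anomalous), and the meaning given to each strategy are assumptions made for illustration only, and the library's `RankingAggregationMethod` members may be defined differently.

```python
from typing import Dict, List

# Hypothetical anomaly scores per method; higher is assumed more anomalous.
scores: Dict[str, Dict[str, float]] = {
    "method_a": {"s0": 0.9, "s1": 0.1, "s2": 0.5, "s3": 0.7},
    "method_b": {"s0": 0.2, "s1": 0.9, "s2": 0.8, "s3": 0.1},
}


def to_ranking(method_scores: Dict[str, float]) -> List[str]:
    """Order sample ids from highest to lowest score."""
    return sorted(method_scores, key=method_scores.get, reverse=True)


def aggregate_by_best_rank(rankings: List[List[str]]) -> List[str]:
    """Order samples by the best (smallest) position any method gave them."""
    best = {s: min(r.index(s) for r in rankings) for s in rankings[0]}
    # Ties on best position are broken by sample id for determinism.
    return sorted(best, key=lambda s: (best[s], s))


def aggregate_round_robin(rankings: List[List[str]]) -> List[str]:
    """Walk the rankings position by position, adding each sample the first time it appears."""
    merged: List[str] = []
    for position in range(len(rankings[0])):
        for ranking in rankings:
            if ranking[position] not in merged:
                merged.append(ranking[position])
    return merged


rankings = [to_ranking(method_scores) for method_scores in scores.values()]
print(aggregate_by_best_rank(rankings))
print(aggregate_round_robin(rankings))
```

The two strategies agree on the top of the list here but order the remaining samples differently, which is why it can be useful to request several aggregation methods at once and compare them.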

### Appendix 1A: Complete list of validator methods

| name | path | method description | data types | operable split types | 
| ---- | ---- | ------------------ | ---------- | -------------------- | 
| adbench_validator_method | src/validator_methods/validator_method_factories/adbench_validator_method_factory.py | factory generating adbench methods that perform anomaly detection | multidimensional data | entire_set | 
| adbench_multimodal_validator_method | src/validator_methods/validator_method_factories/adbench_multimodal_validator_method_factory.py | factory generating adbench methods that perform anomaly detection by concatenating all multidimensional columns first to be able to draw conclusions jointly from the data | multidimensional data | entire_set | 
| adbench_ood_inference_validator_method | src/validator_methods/validator_method_factories/adbench_ood_inference_validator_method_factory.py | factory generating methods that perform ood testing given a source set and a target/inference set using adbench methods | multidimensional data | inference_set, everything_but_inference_set | 
| chi square validator method | src/validator_methods/chi_square_validator_method.py | chi square test for testing CI assumptions between two categorical variables | categorical data | entire_set |
| diffi anomaly explanation validator method | src/validator_methods/diffi_anomaly_explanation_validator_method.py | A validator method for explainable anomaly detection using the DIFFI feature importance method. | multidimensional | entire_set |
| fcit validator method | src/validator_methods/fcit_validator_method.py | A method for determining conditional independence of two multidimensional vectors given a third. | continuous, categorical, or multidimensional | entire_set |
| kolmogorov smirnov multidimensional split validator | src/validator_methods/kolmogorov_smirnov_multidimensional_split_validator_method.py | KS testing over multidimensional data for split covariate shift. | multidimensional | entire_set |
| kolmogorov smirnov normality validator method | src/validator_methods/kolmogorov_smirnov_normality_validator_method.py | KS testing over continuous data for normality assumption. | continuous | entire_set | 
| kolmogorov smirnov split validator method | src/validator_methods/kolmogorov_smirnov_split_validator_method.py | KS testing over continuous data for split covariate shift. |  continuous | entire_set |  
| kruskal wallis multidimensional split validator method | src/validator_methods/kruskal_wallis_multidimensional_split_validator_method.py | kruskal wallis testing over multidimensional data for split covariate shift. | multidimensional | entire_set | 
| kruskal wallis split validator method | src/validator_methods/kruskal_wallis_split_validator_method.py | kruskal wallis testing over continuous data for split covariate shift. | continuous | entire_set |  
| mann whitney multidimensional split validator method | src/validator_methods/mann_whitney_multidimensional_split_validator_method.py | mann whitney testing over multidimensional data for split covariate shift. | multidimensional | entire_set |
| mann whitney split validator method | src/validator_methods/mann_whitney_split_validator_method.py | mann whitney testing over continuous data for split covariate shift. | continuous | entire_set |  
| shap tree validator method | src/validator_methods/shap_tree_validator_method.py | A validator method for explainable anomaly detection using Shapley values. | multidimensional | entire_set | 



### Appendix 1B: Complete list of validators
| name | path | method description | 
| ---- | ---- | ------------------ | 
| test | test | test | 

### Appendix 1C: Complete list of transforms

| name | path | method description | 
| ---- | ---- | ------------------ | 
| test | test | test | 