# Introduction

This quickstart gets Data Detective running on your own dataset as quickly as possible.

Sections that require your own code or inputs are **bolded**; all other sections can be customized as needed, but can also be skimmed or run as-is.
 


In [3]:
import numpy as np
import pandas as pd
import os
import sys
import time
import torch

from typing import Dict, Union

src_path = os.path.abspath(os.path.join(os.pardir, '.'))
if src_path not in sys.path:
    sys.path.insert(0, src_path)

from constants import FloatTensor
from src.data_detective_engine import DataDetectiveEngine
from src.enums.enums import DataType

## **Step 1: Dataset Implementation**

### **Option 1: CSV Dataset Example**

The easiest way to get started with Data Detective for CSV data is to use the `CSVDataset` class. This class accepts the path for a CSV file as well as a dictionary containing the datatypes for each column in the CSV file. 

The CSV file can contain numbers, text, or images (stored in the CSV as absolute paths to the image files). The datatype options available for the dictionary are:
- `DataType.CONTINUOUS`
- `DataType.MULTIDIMENSIONAL` 
- `DataType.CATEGORICAL` 
- `DataType.TEXT`
- `DataType.IMAGE`
- `DataType.SEQUENTIAL`
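
As a rough illustration of the layout described above, the sketch below writes a small CSV with one continuous, one categorical, one text, and one image-path column, then reads the header back. The column names, file names, and string type labels are hypothetical stand-ins chosen for this example; in Data Detective the dictionary values would be `DataType` members rather than strings.

```python
import csv
import os
import tempfile

# Hypothetical column-to-type mapping; the strings mirror DataType member names.
column_types = {
    "age": "CONTINUOUS",
    "sex": "CATEGORICAL",
    "notes": "TEXT",
    "scan_path": "IMAGE",
}

# Image columns hold absolute paths; the files themselves are not created here.
rows = [
    {"age": 63.5, "sex": "F", "notes": "no findings",
     "scan_path": os.path.abspath("scan_000.npy")},
    {"age": 71.0, "sex": "M", "notes": "follow-up advised",
     "scan_path": os.path.abspath("scan_001.npy")},
]

csv_path = os.path.join(tempfile.mkdtemp(), "example_dataset.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(column_types))
    writer.writeheader()
    writer.writerows(rows)

with open(csv_path, newline="") as f:
    header = next(csv.reader(f))

# Every column in the file should have a declared type, and vice versa.
assert set(header) == set(column_types)
```

Checking the header against the mapping before constructing the dataset is a cheap way to catch a misspelled or missing column name early.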

If it suits your use case, fill-in-the-blank code for creating a `CSVDataset` is available below. Otherwise, skip to **Option 2: Dataset Construction** to find out how to build your own dataset.




In [4]:
from src.datasets.csv_dataset import CSVDataset

dataset = CSVDataset(
    # change filepath to your csv filepath
    filepath="your_csv_filepath.csv",
    # change dictionary to map from csv column names to data types
    datatypes={
        "column1": DataType.CONTINUOUS,
        "column2": DataType.MULTIDIMENSIONAL,
        # ...
        "column_k": DataType.IMAGE,
    }
)

TypeError: CSVDataset.__init__() got an unexpected keyword argument 'datatypes'

Note: if there is an `IMAGE` column in the CSV dataset that contains image paths, they will automatically be loaded into the dataset via `np.load`. 

### **Option 2: Dataset Construction**

If dealing with data that does not easily serialize in CSV format, it is easier to create your own dataset to work within the Data Detective framework. Your dataset needs to satisfy the following requirements: 

1. It must override the `__getitem__` method so that it returns a dictionary mapping from each data column key to the data value. 
2. It must contain a `datatypes` method that returns a dictionary mapping from each data column key to the column's datatype. 
3. It must inherit from `torch.utils.data.Dataset`.
4. \[optional\] It is convenient, but not necessary, to define a `__len__` method.


Before diving in, let's look at a very simple dataset that consists of 10 columns of normal random variables. 

In [5]:
from pandas import DataFrame
from torch.utils.data import Dataset

from src.datasets.data_detective_dataset import DataDetectiveDataset

class NormalDataset(DataDetectiveDataset):
    def __init__(self, num_cols: int = 10, dataset_size: int = 1000, loc: float = 0.):
        """
        Creates a normal dataset with column `feature_k` for k in [0, num_cols) 
        @param num_cols: number of columns to have
        @param dataset_size: number of datapoints to have
        @param loc: the mean of the data. 
        """
        self.dataset_size = dataset_size
        self.columns = [f"feature_{j}" for j in range(num_cols)]

        dataframe: DataFrame = pd.DataFrame({
            f"feature_{i}": np.random.normal(loc, 100, size=dataset_size)
            for i in range(num_cols)
        }, columns=self.columns)

        self.dataframe = dataframe
        
        super().__init__(
            show_id=False, 
            include_subject_id_in_data=False,
            sample_ids = [str(s) for s in list(range(dataset_size))],
            subject_ids = [str(s) for s in list(range(dataset_size))]
        )

    def __getitem__(self, index: int) -> Dict[str, float]:
        """
        Returns a dict containing each column mapped to its value. 
        """
        return self.dataframe.iloc[index].to_dict()

    def __len__(self):
        return self.dataset_size

    def datatypes(self) -> Dict[str, DataType]:
        """
        Returns a dictionary mapping each column to its datatype.
        """
        return {
            column_name: DataType.CONTINUOUS
            for column_name in self.columns
        }


class BrainStudy(torch.utils.data.Dataset):
    def __init__(self, pathlist=None, **kwargs):
        super().__init__(**kwargs)
        # pathlist: CSV with one row per sample; file-based columns store paths
        self.df = pd.read_csv(pathlist)
        # built here so that __getitem__ works even if datatypes() is never called
        self.datatype_dict = {
            "age": DataType.CONTINUOUS,
            "sex": DataType.CATEGORICAL,
            "cognitive_score": DataType.CONTINUOUS,
            "clinical_history": DataType.TEXT,  # path to .txt?
            "brain_MRI": DataType.IMAGE,  # path
            "brain_PET": DataType.IMAGE,  # path
            "activity_monitor": DataType.SEQUENTIAL,  # path
            "speech_derived_features": DataType.MULTIDIMENSIONAL,  # path to .npy ?
        }

    def datatypes(self) -> Dict[str, DataType]:
        """
        Specify the datatype for each column name. Available options:
            DataType.CONTINUOUS
            DataType.CATEGORICAL
            DataType.MULTIDIMENSIONAL
            DataType.TEXT
            DataType.IMAGE
            DataType.SEQUENTIAL
        """
        return self.datatype_dict

    def __getitem__(self, idx: Union[int, slice, list]) -> Dict[str, Union[FloatTensor, int]]:
        """
        Returns a dictionary with column names and values for a specific idx. 
        """
        # for data inputs that are paths, code to load the file from the path needs to be provided in getitem
        # code to load in sequence, image, or multidimensional data
        return {k: self.df[k][idx] for k in self.datatype_dict.keys()}


# data_object = BrainStudy(pathlist='/path/to/my/pathlist.csv')

# what it would look like for several splits
# data_object = {'train': BrainStudy(pathlist='/path/to/my/train.csv'),
#                'val':BrainStudy(pathlist='/path/to/my/val.csv'),
#                'internal_test':BrainStudy(pathlist='/path/to/my/internal_test.csv'),}


data_detective_engine = DataDetectiveEngine()

dataset = NormalDataset() 

In [6]:
dataset.dataframe

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9
0,9.300063,-97.374910,-96.961273,24.775876,-76.340308,10.264906,-29.034723,-68.896520,97.494640,-135.473589
1,20.553508,68.276241,-140.235791,244.178943,27.266424,-21.111289,79.587102,-36.905937,-110.592820,-94.693051
2,-155.726528,-174.909921,77.271585,-57.098531,188.743502,74.660625,-283.491604,-86.624980,-104.280010,-138.306136
3,54.776698,76.078832,64.231359,-3.192938,46.981805,-9.770307,-41.783360,-238.702544,26.666898,-11.566353
4,-63.292024,-71.964680,23.916598,196.419677,98.804761,-36.098428,45.084206,-73.319461,-35.938992,10.728857
...,...,...,...,...,...,...,...,...,...,...
995,-50.380569,4.097705,-129.201919,-81.490865,-106.335408,-79.894530,-176.756361,-74.087531,137.382467,52.273184
996,-62.553597,-115.251967,54.033940,-27.163218,-96.409990,192.698246,13.120121,-160.596039,-94.320426,303.796196
997,109.960845,34.005859,131.514136,-80.500192,-1.429799,201.709522,-73.453879,60.290769,-194.102823,52.695278
998,65.515262,-47.671069,-107.860181,18.710771,-151.231098,-79.421653,-213.612551,-4.620410,82.456411,96.011243


Above, you can see that the dataset satisfies all three requirements listed earlier:

1. It overrides `__getitem__` to provide a dict mapping from each column to a single value. 
2. It overrides `datatypes` to map the same keys in `__getitem__` to their datatypes. 
3. It inherits from `torch.utils.data.Dataset`.

For complete clarity, let's take a look at the outputs of (1) and (2) below: 


In [None]:
dataset.__getitem__(0)

In [None]:
dataset.datatypes()

Note that both dictionaries contain identical keys, indicating that no datatypes are missed in the definition of the `datatypes` function. 

Below is the skeleton code for constructing a dataset. Fill it in with your desired implementation of `__getitem__` and `datatypes`, and any initialization you may need to do.  

In [None]:
class YourDataset(Dataset):
    def __init__(self):
        """
        Sets up the dataset. This can include steps like:
            - loading csv paths
            - reading in text data
            - cleaning and preprocessing
        """
    
        """
        YOUR CODE HERE
        PUT YE CODE HERE, MATEY
        ARR
        """

    def __getitem__(self, index: int) -> Dict[str, float]:
        """
        Returns a dict containing each column mapped to its value. 
        """
    
        """
        YOUR CODE HERE
        AHOY, YE SCURVY CODER! WRITE YER MAGIC HERE!
        """


        return self.dataframe.iloc[index].to_dict()

    def datatypes(self) -> Dict[str, DataType]:
        """
        Returns a dictionary mapping each column to its datatype.
        """

        """
        YOUR CODE HERE
        AHOY, YE SCURVY CODER! WRITE YER MAGIC HERE!
        """

    # NOTE: convenient, but not required, to add __len__ method
    # def __len__(self) -> int: 
    #     pass

# put initialization code here or fix if needed
dataset = YourDataset()

Now that you've written your dataset, let's make sure everything is in ship shape!

In [None]:
dataset[0]

In [None]:
dataset.datatypes()

In [None]:
assert(isinstance(dataset[0], dict))
assert(isinstance(dataset.datatypes(), dict))
assert(dataset[0].keys() == dataset.datatypes().keys())
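
If the last assertion fails, it does not say which keys are out of sync. The helper below is a minimal sketch of a more informative check; `diff_keys` and the toy dictionaries are hypothetical and stand in for `dataset[0]` and `dataset.datatypes()`.

```python
from typing import Any, Dict, List


def diff_keys(sample: Dict[str, Any], datatypes: Dict[str, Any]) -> Dict[str, List[str]]:
    """Report keys that appear in one dictionary but not the other."""
    sample_keys, datatype_keys = set(sample), set(datatypes)
    return {
        # columns returned by __getitem__ with no declared datatype
        "missing_datatype": sorted(sample_keys - datatype_keys),
        # columns with a declared datatype that __getitem__ does not return
        "missing_value": sorted(datatype_keys - sample_keys),
    }


# Toy stand-ins for dataset[0] and dataset.datatypes()
sample = {"feature_0": 9.3, "feature_1": -97.4, "extra": 1.0}
datatypes = {"feature_0": "CONTINUOUS", "feature_1": "CONTINUOUS"}

report = diff_keys(sample, datatypes)
print(report)
```

Both lists are empty exactly when the two dictionaries have identical keys.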

# Step 2: Data Object Creation

The *data object* is a dictionary that consists of the preprocessed dataset and (optionally) its splits. More information about setting up the data object is available in the [main tutorial](Tutorial.ipynb) and the [ExtendingDD tutorial](ExtendingDD.ipynb); for the purposes of the quickstart, splitting and organization are done for you. 

In [8]:
from src.datasets.data_detective_dataset import dd_random_split


inference_size: int = 20
everything_but_inference_size: int = len(dataset) - inference_size
inference_dataset, everything_but_inference_dataset = dd_random_split(dataset, [inference_size, everything_but_inference_size])
    
train_size: int = int(0.6 * len(everything_but_inference_dataset))
val_size: int = int(0.2 * len(everything_but_inference_dataset))
test_size: int = len(everything_but_inference_dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = dd_random_split(everything_but_inference_dataset, [train_size, val_size, test_size])

data_object = {
    "entire_set": dataset,
    "everything_but_inference_set": everything_but_inference_dataset,
    "inference_set": inference_dataset,
    # unordered splits belong here
    # in this example, train/val/test are included, but this section can be as long
    # as desired and can contain an arbitrary number of named splits 
    "train/val/test": {
        "training_set": train_dataset,
        "validation_set": val_dataset,
        "test_set": test_dataset,
    },
    # Example of k-fold split:
    # "fold_0": {
    #      "training_set": train_datasets[0],
    #      "test_set": test_datasets[0],
    # },
    # "fold_1": {
    #      "training_set": train_datasets[1],
    #      "test_set": test_datasets[1],
    # },
    # ...
    # "fold_k": {
    #      "training_set": train_datasets[j],
    #      "test_set": test_datasets[j],
    # }
}

print(f"size of inference_dataset: {inference_dataset.__len__()}")
print(f"size of everything_but_inference_dataset: {everything_but_inference_dataset.__len__()}")
print(f"size of train_dataset: {train_dataset.__len__()}")
print(f"size of entire dataset: {dataset.__len__()}")
print(f"size of val_dataset: {val_dataset.__len__()}")
print(f"size of test_dataset: {test_dataset.__len__()}")

size of inference_dataset: 20
size of everything_but_inference_dataset: 980
size of train_dataset: 588
size of entire dataset: 1000
size of val_dataset: 196
size of test_dataset: 196
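
The sizes printed above follow from simple arithmetic: 20 samples are held out for inference, and the remaining 980 are split 60/20/20, with the test split absorbing any rounding remainder. The sketch below reproduces that calculation in isolation; `split_sizes` is a hypothetical helper for illustration, not part of Data Detective.

```python
from typing import Dict


def split_sizes(total: int, inference_size: int = 20,
                train_frac: float = 0.6, val_frac: float = 0.2) -> Dict[str, int]:
    """Compute split sizes the same way the cell above does."""
    remaining = total - inference_size
    train = int(train_frac * remaining)
    val = int(val_frac * remaining)
    # the test split takes whatever is left, so the sizes always sum to `remaining`
    test = remaining - train - val
    return {"inference": inference_size, "train": train, "val": val, "test": test}


sizes = split_sizes(1000)
print(sizes)
```

Computing the test size by subtraction rather than as `int(0.2 * remaining)` ensures that no sample is dropped when the fractions do not divide the dataset evenly.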


# Step 3: Setting up a Validation Schema

## Step 3.1: Specifying Validators and Options

The validation schema contains information about the types of checks that will be executed by the Data Detective Engine and the transforms that Data Detective will use. More details about creating your own validation schema are available in the [main tutorial](Tutorial.ipynb); below is the validation schema that we recommend to get started. 

In [9]:
validation_schema : Dict = {
    "validators": {
        "unsupervised_anomaly_data_validator": {},
        "unsupervised_multimodal_anomaly_data_validator": {},
        # "split_covariate_data_validator": {},
        # "ood_inference_data_validator": {}
    }
}
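
Since the schema is a plain dictionary, validators can also be switched on programmatically instead of by uncommenting lines. The following is a minimal sketch under that assumption; `with_validators` is a hypothetical helper, and the validator names are the ones that appear (commented out) in the schema above.

```python
import copy
from typing import Dict, Iterable


def with_validators(schema: Dict, names: Iterable[str]) -> Dict:
    """Return a copy of the schema with extra validators enabled using empty options."""
    updated = copy.deepcopy(schema)
    for name in names:
        # setdefault keeps any options already configured for that validator
        updated["validators"].setdefault(name, {})
    return updated


base_schema = {
    "validators": {
        "unsupervised_anomaly_data_validator": {},
        "unsupervised_multimodal_anomaly_data_validator": {},
    }
}

extended_schema = with_validators(
    base_schema,
    ["split_covariate_data_validator", "ood_inference_data_validator"],
)
print(sorted(extended_schema["validators"]))
```

Working on a deep copy leaves the recommended starting schema untouched, which makes it easy to compare runs with different validator sets.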

## Step 3.2: Specifying Transforms

You may be using a data modality that has little to no built-in method support in Data Detective. The simplest way to make use of all of Data Detective's functionality is to apply a transform that maps this data modality to a well-supported modality in Data Detective, such as multidimensional data. In our example, we will use a pretrained resnet50 backbone to map images to 2048-dimensional vectors. This allows us to apply methods built for multidimensional data to our image representations. 

More information about introducing custom transforms into Data Detective and customizing the transform schema with pre-existing transforms is available in the [main tutorial](Tutorial.ipynb) and explanations on how to create/use your own transforms are available in the [ExtendingDD tutorial](ExtendingDD.ipynb).
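
To make the idea of a transform concrete, here is a toy sketch that maps an image-like value to a short numeric vector. It is not Data Detective's transform API: the function names, the `_embedded` column suffix, and the three summary statistics are all assumptions made for illustration, standing in for a learned backbone that would produce a much longer vector.

```python
from typing import Callable, Dict, List

Image = List[List[float]]  # toy grayscale image as nested lists


def summary_embedding(image: Image) -> List[float]:
    """Toy stand-in for a learned backbone: image -> [min, max, mean]."""
    pixels = [p for row in image for p in row]
    return [min(pixels), max(pixels), sum(pixels) / len(pixels)]


def apply_transform(sample: Dict[str, object], column: str,
                    transform: Callable) -> Dict[str, object]:
    """Return a copy of the sample with a transformed column added; the original is kept."""
    transformed = dict(sample)
    transformed[f"{column}_embedded"] = transform(sample[column])
    return transformed


sample = {"age": 63.5, "scan": [[0.0, 0.5], [1.0, 0.5]]}
out = apply_transform(sample, "scan", summary_embedding)
print(out["scan_embedded"])
```

Once the image is represented as a fixed-length vector, any method that operates on multidimensional data can be applied to it.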


In [10]:
transform_schema : Dict = {
    "transforms": {
        "IMAGE": [{
            "name": "resnet50",
            "in_place": "False",
            "options": {},
        }],
    }
}
     
full_validation_schema: Dict = {
    **validation_schema, 
    **transform_schema
}
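
The `{**a, **b}` merge above is shallow: if both dictionaries shared a top-level key, the later one would silently replace the earlier one. That is harmless here because the keys (`"validators"` and `"transforms"`) are disjoint. The sketch below shows a stricter merge that fails loudly on a collision; `merge_schemas` is a hypothetical helper, not part of Data Detective.

```python
from typing import Dict


def merge_schemas(*schemas: Dict) -> Dict:
    """Merge schema fragments, refusing to overwrite a top-level key."""
    merged: Dict = {}
    for schema in schemas:
        overlap = merged.keys() & schema.keys()
        if overlap:
            raise ValueError(f"duplicate top-level keys: {sorted(overlap)}")
        merged.update(schema)
    return merged


validators_part = {"validators": {"unsupervised_anomaly_data_validator": {}}}
transforms_part = {"transforms": {"IMAGE": [{"name": "resnet50", "options": {}}]}}

merged_schema = merge_schemas(validators_part, transforms_part)
print(sorted(merged_schema))
```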

# Step 4: Running the Data Detective Engine

Now that the full validation schema and data object are prepared, we are ready to run the Data Detective Engine.

In [11]:
data_detective_engine = DataDetectiveEngine()

start_time = time.time()
results = data_detective_engine.validate_from_schema(full_validation_schema, data_object)
print("--- %s seconds ---" % (time.time() - start_time))

File data/tmp/cache.pkl does not exist. Cache not loaded.
running validator class unsupervised_anomaly_data_validator...


Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md



running validator class unsupervised_multimodal_anomaly_data_validator...
thread 13191606272 entered to handle validator method iforest_anomaly_validator_method
thread 13191606272:    running iforest_anomaly_validator_method...
thread 13225185280 entered to handle validator method cblof_anomaly_validator_method
thread 13225185280:    running cblof_anomaly_validator_method...
thread 13258764288 entered to handle validator method pca_anomaly_validator_method
thread 13258764288:    running pca_anomaly_validator_method...
thread 13292343296 entered to handle validator method iforest_multimodal_anomaly_validator_method
thread 13292343296:    running iforest_multimodal_anomaly_validator_method...
thread 13292343296: finished
thread 13325922304 entered to handle validator method cblof_multimodal_anomaly_validator_method
thread 13325922304:    running cblof_multimodal_anomaly_validator_method...




thread 13325922304: finished
thread 13359501312 entered to handle validator method pca_multimodal_anomaly_validator_method
thread 13359501312:    running pca_multimodal_anomaly_validator_method...
thread 13359501312: finished
Cache written to data/tmp/cache.pkl
--- 16.063323974609375 seconds ---


In [12]:
results

{'unsupervised_multimodal_anomaly_data_validator': {'iforest_multimodal_anomaly_validator_method': {'results': {'0': -0.0862931329771553,
    '1': -0.028697584689485023,
    '2': 0.024358381245969285,
    '3': -0.06266998619667924,
    '4': -0.07805045791561549,
    '5': -0.06952155111372121,
    '6': -0.09832764631454627,
    '7': -0.0559590373245758,
    '8': -0.029813977323780716,
    '9': -0.09596123945299173,
    '10': -0.07686946006081441,
    '11': 0.014332544316682128,
    '12': -0.06385488282954288,
    '13': -0.06053789469310594,
    '14': -0.035212270743033236,
    '15': 0.018467543317766177,
    '16': -0.09277173223684254,
    '17': -0.07889541421652613,
    '18': -0.07788394748592248,
    '19': 0.031124417914152158,
    '20': -0.07403553527180279,
    '21': -0.05523383537745391,
    '22': -0.07549215174528662,
    '23': -0.10265810560961741,
    '24': -0.06241490573393854,
    '25': -0.04307632472170447,
    '26': 0.00039398355370018345,
    '27': -0.05762185070780662,
   

Great! Let's start to look at and analyze the results we've collected.
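
Before aggregating, it can help to flatten the nested results into rows. The sketch below assumes the nesting visible in the output above (validator name, then method name, then a `"results"` dictionary keyed by sample id); results from other validators may be nested differently. The scores are copied from the first few entries of that output, and `flatten_results` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# A few values copied from the output above; the real dictionary has one entry per sample.
example_results = {
    "unsupervised_multimodal_anomaly_data_validator": {
        "iforest_multimodal_anomaly_validator_method": {
            "results": {
                "0": -0.0862931329771553,
                "1": -0.028697584689485023,
                "2": 0.024358381245969285,
                "3": -0.06266998619667924,
            }
        }
    }
}


def flatten_results(results: Dict) -> List[Tuple[str, str, str, float]]:
    """Turn validator -> method -> 'results' -> sample id into flat rows."""
    rows = []
    for validator_name, methods in results.items():
        for method_name, payload in methods.items():
            for sample_id, score in payload["results"].items():
                rows.append((validator_name, method_name, sample_id, score))
    return rows


rows = flatten_results(example_results)
highest = max(rows, key=lambda row: row[3])
print(len(rows), highest[2], highest[3])
```

Rows in this shape are straightforward to load into a `pandas.DataFrame` for sorting or plotting.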

# Step 5: Interpreting Results using the Built-In Rank Aggregator

To do rank aggregation, create a `ResultAggregator` object and either aggregate by a single modality with `aggregate_results_modally` or aggregate across all modalities with `aggregate_results_multimodally`, as shown below.

In [13]:
from enum import Enum

import pandas as pd
import scipy
from typing import List

from src.aggregation.rankings import RankingAggregationMethod, ResultAggregator

aggregator = ResultAggregator(results_object=results)
# modal_rankings = aggregator.aggregate_results_modally("unsupervised_anomaly_data_validator", [RankingAggregationMethod.LOWEST_RANK, RankingAggregationMethod.HIGHEST_RANK, RankingAggregationMethod.ROUND_ROBIN], given_data_modality="feature_0")
total_rankings = aggregator.aggregate_results_multimodally("unsupervised_multimodal_anomaly_data_validator", [RankingAggregationMethod.LOWEST_RANK, RankingAggregationMethod.HIGHEST_RANK])
total_rankings.sort_values("lowest_rank")

Unnamed: 0,results_iforest_multimodal_anomaly_validator_method_rank,results_cblof_multimodal_anomaly_validator_method_rank,results_pca_multimodal_anomaly_validator_method_rank,lowest_rank_agg_rank,highest_rank_agg_rank,results_iforest_multimodal_anomaly_validator_method_score,results_cblof_multimodal_anomaly_validator_method_score,results_pca_multimodal_anomaly_validator_method_score
0,851,798,822,783,870,-0.086293,211.373889,262.357225
1,260,393,322,303,339,-0.028698,282.731913,355.841754
10,745,489,749,667,587,-0.076869,264.264865,276.641933
100,170,70,136,109,106,-0.014218,367.598450,407.102684
101,987,910,988,978,954,-0.109523,188.238070,192.067690
...,...,...,...,...,...,...,...,...
995,505,460,485,407,561,-0.054316,267.956917,324.040962
996,146,75,33,86,52,-0.009166,365.834453,459.635671
997,214,214,206,153,266,-0.020343,318.358166,385.141587
998,431,506,411,409,502,-0.047702,262.416929,338.370036
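
In the table above, each method's scores are converted to ranks (higher scores receive smaller rank numbers), and the per-method ranks are then combined into aggregate ranks. The sketch below illustrates the general idea with a minimum-rank and a maximum-rank aggregation on toy scores. It is not Data Detective's implementation: the function names, the re-ranking of the combined values, and the tie-breaking by sample id are assumptions.

```python
from typing import Callable, Dict, List


def rank_descending(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 goes to the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {sample_id: pos for pos, sample_id in enumerate(ordered, start=1)}


def aggregate_ranks(per_method_ranks: List[Dict[str, int]],
                    combine: Callable) -> Dict[str, int]:
    """Combine per-method ranks with `combine` (e.g. min or max), then re-rank."""
    combined = {
        sample_id: combine(ranks[sample_id] for ranks in per_method_ranks)
        for sample_id in per_method_ranks[0]
    }
    # ties are broken by sample id so the result is deterministic
    ordered = sorted(combined, key=lambda sid: (combined[sid], sid))
    return {sample_id: pos for pos, sample_id in enumerate(ordered, start=1)}


method_a = {"a": 0.9, "b": 0.1, "c": 0.5}
method_b = {"a": 0.2, "b": 0.8, "c": 0.5}
ranks = [rank_descending(method_a), rank_descending(method_b)]

lowest = aggregate_ranks(ranks, min)   # favors samples any one method ranks near the top
highest = aggregate_ranks(ranks, max)  # favors samples all methods rank near the top
print(lowest, highest)
```

A minimum-rank aggregate surfaces samples flagged strongly by at least one method, while a maximum-rank aggregate surfaces samples on which the methods agree.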



### Appendix 1A: Complete list of validator methods

| name | path | method description | data types | operable split types | 
| ---- | ---- | ------------------ | ---------- | -------------------- | 
| adbench_validator_method | src/validator_methods/validator_method_factories/adbench_validator_method_factory.py | factory generating adbench methods that perform anomaly detection | multidimensional | entire_set | 
| adbench_multimodal_validator_method | src/validator_methods/validator_method_factories/adbench_multimodal_validator_method_factory.py | factory generating adbench methods that perform anomaly detection by first concatenating all multidimensional columns, so that conclusions can be drawn jointly from the data | multidimensional | entire_set | 
| adbench_ood_inference_validator_method | src/validator_methods/validator_method_factories/adbench_ood_inference_validator_method_factory.py | factory generating methods that perform OOD testing given a source set and a target/inference set using adbench methods | multidimensional | inference_set, everything_but_inference_set | 
| chi square validator method | src/validator_methods/chi_square_validator_method.py | chi square test for testing CI assumptions between two categorical variables | categorical | entire_set |
| diffi anomaly explanation validator method | src/validator_methods/diffi_anomaly_explanation_validator_method.py | explainable anomaly detection using the DIFFI feature importance method | multidimensional | entire_set |
| fcit validator method | src/validator_methods/fcit_validator_method.py | determines conditional independence of two multidimensional vectors given a third | continuous, categorical, or multidimensional | entire_set |
| kolmogorov smirnov multidimensional split validator method | src/validator_methods/kolmogorov_smirnov_multidimensional_split_validator_method.py | KS testing over multidimensional data for split covariate shift | multidimensional | entire_set |
| kolmogorov smirnov normality validator method | src/validator_methods/kolmogorov_smirnov_normality_validator_method.py | KS testing over continuous data for normality assumption | continuous | entire_set | 
| kolmogorov smirnov split validator method | src/validator_methods/kolmogorov_smirnov_split_validator_method.py | KS testing over continuous data for split covariate shift | continuous | entire_set |  
| kruskal wallis multidimensional split validator method | src/validator_methods/kruskal_wallis_multidimensional_split_validator_method.py | Kruskal-Wallis testing over multidimensional data for split covariate shift | multidimensional | entire_set | 
| kruskal wallis split validator method | src/validator_methods/kruskal_wallis_split_validator_method.py | Kruskal-Wallis testing over continuous data for split covariate shift | continuous | entire_set |  
| mann whitney multidimensional split validator method | src/validator_methods/mann_whitney_multidimensional_split_validator_method.py | Mann-Whitney testing over multidimensional data for split covariate shift | multidimensional | entire_set |
| mann whitney split validator method | src/validator_methods/mann_whitney_split_validator_method.py | Mann-Whitney testing over continuous data for split covariate shift | continuous | entire_set |  
| duplicate high dimensional validator method | src/validator_methods/duplicate_high_dimensional_validator_method.py | | multidimensional, text, sequence | entire_set |
| duplicate sample validator method | src/validator_methods/duplicate_sample_validator_method.py | | any | entire_set |
| near duplicate multidimensional validator method | src/validator_methods/near_duplicate_multidimensional_validator_method.py | | multidimensional | entire_set |
| near duplicate sample validator method | src/validator_methods/near_duplicate_sample_validator_method.py | | multidimensional, continuous, categorical | entire_set | 

