# Introduction

In this tutorial, we explain the basics of using Data Detective to perform a data investigation. We will go through the steps of setting up and running a basic Data Detective investigation, including:
1. Configuring a dataset that works with Data Detective. 
2. Setting up a schema that defines the Data Detective investigation. 
3. Executing a Data Detective investigation. 
4. Summarizing results using the built-in Rank Aggregator.


We will also include supplemental tutorials for some of Data Detective's more advanced features, including: 
- Extending the Data Detective investigation with custom validations
- Extending the transform library to map custom datatypes to supported datatypes


Let's get started!


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import torch
import torchvision.transforms as transforms

from torchvision.datasets import MNIST
from typing import Dict, Union

from constants import FloatTensor
from src.aggregation.rankings import RankingAggregator, RankingAggregationMethod
from src.data_detective_engine import DataDetectiveEngine
from src.datasets.tutorial_dataset import TutorialDataset
from src.enums.enums import DataType



# Dataset Construction

## Requirements for a Dataset

For a dataset to work within the Data Detective framework, it needs to satisfy the following requirements: 

1. It must override the `__getitem__` method so that it returns a dictionary mapping from each data column key to the data value. 
2. It must contain a `datatypes` method that returns a dictionary mapping from each data column key to the column's datatype. 

Let's examine what this looks like in practice.
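
As a minimal sketch of these two requirements, the class below returns a dictionary from `__getitem__` and exposes a `datatypes` method with the same keys. Everything in it is an assumption made for illustration: `ToyDataType` is a hypothetical stand-in for the `DataType` enum imported above, the column names `label` and `normal_vector` are invented, and the data is synthetic. A real dataset would typically subclass `torch.utils.data.Dataset`; that is omitted here to keep the sketch dependency-free.

```python
import random
from enum import Enum
from typing import Any, Dict


class ToyDataType(Enum):
    """Hypothetical stand-in for the DataType enum imported in the tutorial."""
    IMAGE = "image"
    CATEGORICAL = "categorical"
    MULTIDIMENSIONAL = "multidimensional"


class ToyDataset:
    """Three synthetic columns that mirror the shape of the tutorial data."""

    def __init__(self, num_samples: int = 5, seed: int = 0) -> None:
        rng = random.Random(seed)
        # 4x4 "images" of random pixel intensities in [0, 1).
        self.images = [[[rng.random() for _ in range(4)] for _ in range(4)]
                       for _ in range(num_samples)]
        # Integer class labels in 0..9.
        self.labels = [rng.randrange(10) for _ in range(num_samples)]
        # 10-dimensional draws from a standard normal distribution.
        self.vectors = [[rng.gauss(0.0, 1.0) for _ in range(10)]
                        for _ in range(num_samples)]

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        # Requirement 1: map each column key to that sample's value.
        return {
            "image": self.images[idx],
            "label": self.labels[idx],
            "normal_vector": self.vectors[idx],
        }

    def datatypes(self) -> Dict[str, ToyDataType]:
        # Requirement 2: map each column key to the column's datatype.
        return {
            "image": ToyDataType.IMAGE,
            "label": ToyDataType.CATEGORICAL,
            "normal_vector": ToyDataType.MULTIDIMENSIONAL,
        }


toy_dataset = ToyDataset()
sample = toy_dataset[0]
print(sorted(sample.keys()))
```

The property worth checking in any implementation is that the keys returned by `__getitem__` and by `datatypes` agree, since the datatype mapping is what lets methods be matched to columns.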


## Dataset Implementation

In this tutorial, we will create a heterogeneous dataset that consists of the following items: 

- MNIST images
- MNIST labels
- 10-dimensional normal distribution (μ=0, σ=1)

The full dataset can be found under `src/datasets/tutorial_dataset`.


In [2]:
dataset = TutorialDataset(
    root='./data/MNIST',
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor() 
    ])
)

In [None]:
dataset.show_datapoint(0)

# Setting up the Data Object

The *data object* consists of the preprocessed dataset and (optionally) its splits. In order to make use of split validation techniques such as distribution shift between splits, the data splits must be included. 

For the purpose of this tutorial, we are going to hold out 20 samples to model "real world" samples that will be encountered at inference time. We will then perform a 60/20/20 split of the remaining samples into train, validation, and test sets.

In [7]:
inference_size: int = 20
everything_but_inference_size: int = len(dataset) - inference_size
inference_dataset, everything_but_inference_dataset = torch.utils.data.random_split(dataset, [inference_size, everything_but_inference_size])
    
train_size: int = int(0.6 * len(everything_but_inference_dataset))
val_size: int = int(0.2 * len(everything_but_inference_dataset))
test_size: int = len(everything_but_inference_dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(everything_but_inference_dataset, [train_size, val_size, test_size])

data_object: Dict[str, torch.utils.data.Dataset] = {
    "training_set": train_dataset,
    "validation_set": val_dataset,
    "test_set": test_dataset,
    "entire_set": dataset,
    "everything_but_inference_set": everything_but_inference_dataset,
    "inference_set": inference_dataset
}

new_data_object = {
    "entire_set": dataset,
    "everything_but_inference_set": everything_but_inference_dataset,
    "inference_set": inference_dataset,
    "unordered_split_groups": {
        "train/val/test": {
            "training_set": train_dataset,
            "validation_set": val_dataset,
            "test_set": test_dataset,
       },
    }
}

#### how do we want this to be applied?
# - what method types are there?
#   - methods that only operate on the entire set (easy)
#   - methods that operate on a group of unordered splits (train/val/test, perhaps)
#   - methods that operate on a group of ordered splits (inference / non-inference)
#     - these have to be ordered/named anyways, so we might as well just keep them as is, which makes this easy too
# - so, what we need to do 

print(f"size of inference_dataset: {len(inference_dataset)}")
print(f"size of everything_but_inference_dataset: {len(everything_but_inference_dataset)}")
print(f"size of train_dataset: {len(train_dataset)}")
print(f"size of entire dataset: {len(dataset)}")
print(f"size of val_dataset: {len(val_dataset)}")
print(f"size of test_dataset: {len(test_dataset)}")

size of inference_dataset: 20
size of everything_but_inference_dataset: 30
size of train_dataset: 18
size of entire dataset: 50
size of val_dataset: 6
size of test_dataset: 6
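
The sizes printed above follow directly from the split arithmetic. The sketch below reproduces that arithmetic without touching any data; `split_sizes` is a hypothetical helper written for this illustration and is not part of Data Detective. Note that the test split absorbs any remainder left over by the two `int(...)` truncations.

```python
from typing import Dict


def split_sizes(total: int,
                inference_size: int = 20,
                train_frac: float = 0.6,
                val_frac: float = 0.2) -> Dict[str, int]:
    """Compute split sizes the same way the tutorial cell does."""
    remaining = total - inference_size
    train = int(train_frac * remaining)   # truncates toward zero
    val = int(val_frac * remaining)       # truncates toward zero
    test = remaining - train - val        # takes whatever is left
    return {
        "inference": inference_size,
        "everything_but_inference": remaining,
        "train": train,
        "validation": val,
        "test": test,
    }


sizes = split_sizes(total=50)
print(sizes)
```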


# Setting up a Validation Schema

## Specifying Validators and Options

The validation schema contains information about the types of checks that will be executed by the Data Detective Engine and the transforms that Data Detective will use. Before discussing the validation schema, it is important to define two key modular components that make up Data Detective investigations. Data Detective's functionality can be divided into a modular, implementation-heavy component referred to as a *validator method* and an easily toggleable component referred to as a *validator*.

A *validator method* performs a specific type of test for a data issue on a specific data type. These validator methods primarily consist of the code needed to take a dataset and run an evaluation on it that produces either a positive or negative result or a score. Some examples of validator methods include:
- Mann-Whitney U-Test to examine distribution shift between train/test splits on tabular data
- Kernel Conditional Independence (KCI) test for validating causal assumptions on vector-valued data
- Isolation forest trained/evaluated on image histograms for identifying anomalies in imaging data
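
To give a feel for what a validator method computes, here is a rough, dependency-free sketch of the first example: a two-sided Mann-Whitney U test comparing one numeric column across two splits. This is not Data Detective's implementation. It assumes a normal approximation with no tie or continuity correction, which is only a coarse guide for small samples; in practice a library routine such as `scipy.stats.mannwhitneyu` would be the natural choice.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic, approximate two-sided p-value) for samples x and y."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # U1 counts pairs where x beats y; ties contribute one half.
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u2 = n1 * n2 - u1
    u = min(u1, u2)

    # Normal approximation of U under the null hypothesis (no tie correction).
    mean_u = n1 * n2 / 2.0
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / std_u  # z <= 0 because u is the smaller statistic

    # Two-sided p-value: 2 * Phi(z), where Phi is the standard normal CDF.
    p_value = 1.0 + math.erf(z / math.sqrt(2.0))
    return u, p_value


# Completely separated splits give the smallest possible U.
u_shifted, p_shifted = mann_whitney_u([1, 2, 3], [4, 5, 6])
# Identical splits give U equal to its null mean.
u_same, p_same = mann_whitney_u([1, 2, 3], [1, 2, 3])
print(u_shifted, p_shifted, u_same, p_same)
```

A small p-value suggests the column is distributed differently in the two splits, which is the "positive or negative result or a score" that a validator method reports.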

*Validators* are toggleable, datatype-agnostic collections of validator methods that are serially applied to the dataset. Each validator targets a single problem that may arise in data, including: 
- Shift between different data splits
- Outlier and anomaly detection
- Violation of parametric assumptions on the data
- Violation of expected causal structures / conditional independences in the data

In the validation schema, users only specify the validators that they would like to use. This abstracts away details concerning what methods should be used for which columns and simplifies the process for searching for a particular flavor of issues to a few lines of code. Below is the validation schema that we will use for the tutorial. 

In [4]:
validation_schema : Dict = {
    "default_inclusion": False,
    "validators": {
        "unsupervised_anomaly_data_validator": {},
        "unsupervised_multimodal_anomaly_data_validator": {},
        "split_covariate_data_validator": {},
        "ood_inference_data_validator": {}
    }
}

A few notes: 
- *default inclusion* refers to whether the validation should include default validators (validators with `is_default` set to `True`)
- Each validator maps to an object that specifies additional options. For this tutorial, we will use the default settings, but these options include: 
  - special kwargs to include and pass to the validator methods
  - additional filtering regarding which columns the validator should be applied to
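
The effect of `default_inclusion` can be pictured with a toy resolver. The sketch below is an assumption about the general idea, not the engine's actual logic: `TOY_REGISTRY` and its validator names are hypothetical, and the real engine may resolve validators differently.

```python
from typing import Dict, List

# Hypothetical registry: validator name -> value of its is_default flag.
TOY_REGISTRY: Dict[str, bool] = {
    "toy_default_validator": True,
    "toy_optional_validator_a": False,
    "toy_optional_validator_b": False,
}


def resolve_validators(schema: Dict, registry: Dict[str, bool]) -> List[str]:
    """Return the sorted names of validators a schema would select."""
    # Validators named explicitly in the schema are always selected.
    selected = set(schema.get("validators", {}))
    # With default inclusion on, every default validator is added as well.
    if schema.get("default_inclusion", False):
        selected |= {name for name, is_default in registry.items() if is_default}
    return sorted(selected)


explicit_only = resolve_validators(
    {"default_inclusion": False, "validators": {"toy_optional_validator_a": {}}},
    TOY_REGISTRY,
)
with_defaults = resolve_validators(
    {"default_inclusion": True, "validators": {"toy_optional_validator_a": {}}},
    TOY_REGISTRY,
)
print(explicit_only, with_defaults)
```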
 
  
  
 

## Specifying Transforms

It may be the case that you are using a data modality that has little to no method infrastructure in Data Detective. The simplest way to make use of all of Data Detective's functionality is to use a transform that maps this data modality to a well-supported modality in Data Detective such as multidimensional data. In our example, we will be making use of a pretrained resnet50 backbone to map our MNIST images to 2048 dimensional vectors. This will allow us to make use of methods built for multidimensional data on our MNIST image representations.

In [5]:
transform_schema : Dict = {
    "transforms": {
        "image": [{
            "name": "resnet50",
            "in_place": "False",
            "options": {},
        }],
    }
}
     
full_validation_schema: Dict = {
    **validation_schema, 
    **transform_schema
}
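
A short aside on the `{**a, **b}` pattern used to build `full_validation_schema`: it performs a shallow merge of top-level keys, and when both dictionaries share a key the right-hand one wins. The sketch below uses trimmed copies of the two schemas to show both properties; the variable names are chosen for this illustration only.

```python
from typing import Dict

validation_part: Dict = {
    "default_inclusion": False,
    "validators": {"split_covariate_data_validator": {}},
}
transform_part: Dict = {
    "transforms": {
        "image": [{"name": "resnet50", "in_place": "False", "options": {}}],
    },
}

# Disjoint top-level keys: the merged schema simply contains all of them.
merged: Dict = {**validation_part, **transform_part}

# Overlapping top-level key: the right-hand dictionary takes precedence.
overridden: Dict = {**validation_part, **{"default_inclusion": True}}

# The merge is shallow: nested dictionaries are shared, not copied.
shares_nested = merged["validators"] is validation_part["validators"]

print(sorted(merged), overridden["default_inclusion"], shares_nested)
```

Because the merge is shallow, mutating a nested entry of the merged schema also mutates the schema it came from; use `copy.deepcopy` first if the two need to stay independent.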

# Running the Data Detective Engine

Now that the full validation schema and data object are prepared, we are ready to run the Data Detective Engine.

In [6]:
data_detective_engine = DataDetectiveEngine()

# 1 thread, --- 220.85648322105408 seconds ---
# multithreading (joblib), --- 149.11400604248047 seconds ---
# thread pools, --- 81.38025784492493 seconds ---
# data-level caching, clean cache, --- 75.22503590583801 seconds ---
# sample-level caching, clean cache, --- 26.184876918792725 seconds ---
# data-level caching, dirty cache, --- 22.925609827041626 seconds ---
# sample-level caching, dirty cache, --- 19.73765206336975 seconds ---


start_time = time.time()
results = data_detective_engine.validate_from_schema(full_validation_schema, data_object)
print("--- %s seconds ---" % (time.time() - start_time))

running validator class unsupervised_anomaly_data_validator...
running validator class unsupervised_multimodal_anomaly_data_validator...


running validator class split_covariate_data_validator...
running validator class ood_inference_data_validator...
thread 11393855488 entered to handle validator method iforest_anomaly_validator_method
thread 11410681856 entered to handle validator method cblof_anomaly_validator_method
thread 11410681856:    running cblof_anomaly_validator_method...
thread 11393855488:    running iforest_anomaly_validator_method...
thread 11427508224 entered to handle validator method pca_anomaly_validator_method
thread 11427508224:    running pca_anomaly_validator_method...
thread 11444334592 entered to handle validator method iforest_multimodal_anomaly_validator_method
thread 11444334592:    running iforest_multimodal_anomaly_validator_method...
thread 11461160960 entered to handle validator method pca_multimodal_anomaly_validator_method
thread 11461160960:    running pca_multimodal_anomaly_validator_method...
thread 11477987328 entered to handle validator method cblof_multimodal_anomaly_validator_met



thread 11427508224: finished




thread 11461160960: finished
thread 11511640064: finished
thread 11528466432: finished
thread 11578945536: finished
thread 11595771904: finished
thread 11444334592: finished




thread 11629424640: finished
thread 11393855488: finished
thread 11612598272: finished
thread 11477987328: finished




thread 11562119168: finished
thread 11410681856: finished
thread 11494813696: finished
thread 11545292800: finished
--- 15.769446849822998 seconds ---


Great! Let's start to look at and analyze the results we've collected.

# Interpreting Results using the Built-In Rank Aggregator

In [None]:
dataset.show_datapoint(0)