# Comparing data distributions

**We want to compare the different representations and meta-features of two distributions to characterize their similarities and differences (e.g. original data vs. generated data).**

- Data format: AutoML

In [2]:
# First dataset
input_dir1 = 'sample_data/iris'
basename1 = 'iris'

# Second dataset
input_dir2 = 'sample_data/iris'
basename2 = 'iris'

## Comparison

- **Overall meta-features** (descriptors): we compute simple distances between the descriptors of each dataset.
- **Individual features/variables** (column comparison):
    - Numerical:
        - Kolmogorov-Smirnov
    - Categorical:
        - Chi-square
- **Discriminant** (row comparison): we label each row with 0 or 1 according to its dataset of origin, then train a binary classifier on the labeled data. This is the same idea as the discriminator used to train GANs. The more sophisticated the classifier has to be to separate the two datasets, the more similar they are. If the classifier can't separate the data, either the datasets are too similar or the classifier isn't good enough.
- **Landmark:** prediction performance on the target across various models and metrics.
- **Change of representation:** we train an auto-encoder on dataset A and benchmark it on dataset B (and vice versa). The intuition is that similar data will be compressible into the same latent space. This principle could be applied to other changes of representation.
- **Causal inference:** comparison of causal inference results. Do we observe the same causal links between the variables?
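
The discriminant idea can be sketched as follows. This is a minimal illustration, not the `Comparator` implementation: the function name `discriminant_score`, the choice of a random forest, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def discriminant_score(data1, data2, seed=0):
    """Mean cross-validated accuracy of a classifier separating two datasets.

    A score near 0.5 means the classifier cannot tell the datasets apart;
    a score near 1.0 means they are easy to separate.
    (Hypothetical helper, not part of the notebook's Comparator class.)
    """
    X = np.vstack([data1, data2])
    # Label each row by its dataset of origin: 0 for data1, 1 for data2.
    y = np.concatenate([np.zeros(len(data1), dtype=int),
                        np.ones(len(data2), dtype=int)])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    # Shuffle the folds because the stacked rows are ordered by label.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()


# Synthetic data: a and b share a distribution, c is shifted.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 4))
b = rng.normal(0.0, 1.0, size=(200, 4))
c = rng.normal(3.0, 1.0, size=(200, 4))

print(discriminant_score(a, b))  # expected to be near 0.5 (chance level)
print(discriminant_score(a, c))  # expected to be near 1.0 (easy to separate)
```

Swapping in a weaker or stronger classifier changes how the score should be read, which is the caveat noted in the list above.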

Draft:
- `sklearn.metrics.mutual_info_score`: mutual information, which equals the Kullback-Leibler divergence between the joint distribution and the product of the marginals.
- KL divergence
- Jensen-Shannon divergence/distance
- Wasserstein distance (minimum cost of turning one "pile of dirt" into the other)
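
The draft measures above could be computed for a single numerical column roughly as below. This is a sketch under assumptions: the helper name `histogram_divergences`, the shared 20-bin histogram, and the `eps` smoothing are illustrative choices, not something the notebook specifies.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy, wasserstein_distance


def histogram_divergences(x, y, bins=20, eps=1e-10):
    """Compare two 1-D samples with KL, Jensen-Shannon and Wasserstein.

    (Hypothetical helper; binning and smoothing are illustrative choices.)
    """
    # Shared bin edges so both histograms are defined on the same support.
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    # Smooth empty bins so the KL divergence stays finite, then normalize.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return {
        "kl": entropy(p, q),                        # KL(p || q), asymmetric
        "js": jensenshannon(p, q, base=2),          # JS distance, in [0, 1]
        "wasserstein": wasserstein_distance(x, y),  # uses the raw samples
    }


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
y = rng.normal(2.0, 1.0, size=1000)

print(histogram_divergences(x, x))  # all three are zero for identical samples
print(histogram_divergences(x, y))  # all three are positive for shifted samples
```

Note that KL and Jensen-Shannon depend on the binning, while the Wasserstein distance is computed directly from the samples.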

In [3]:
# Imports

# AutoML and Comparator
problem_dir = 'data_manager/'  
from sys import path
path.append(problem_dir)
%matplotlib inline
%load_ext autoreload
%autoreload 2

from auto_ml import AutoML
from comparator import Comparator

### Read data

In [4]:
comparator = Comparator(AutoML(input_dir1, basename1), AutoML(input_dir2, basename2))

  self.comparison_matrix.set_value('kolmogorov-smirnov', column, kolmogorov_smirnov(data1[column], data2[column]))


### Visualization

In [5]:
comparator.show_descriptors()

Ratio: 0.0
Skewness min: 0.0
Skewness max: 0.0
Skewness mean: 0.0


In [6]:
comparator.show_comparison_matrix()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| kolmogorov-smirnov | (0.0, 1.0) | (0.0, 1.0) | (0.0, 1.0) | (0.0, 1.0) |
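
The `(0.0, 1.0)` pairs are what a two-sample Kolmogorov-Smirnov test yields as (statistic, p-value) when both columns are identical, which is the case here since both datasets are iris. The sketch below assumes `kolmogorov_smirnov` behaves like `scipy.stats.ks_2samp`; that is an assumption, as the notebook does not show its definition. The `chi_square` helper for categorical columns is likewise hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

rng = np.random.default_rng(0)

# Numerical column: compare a sample with itself.
col = rng.normal(size=100)
ks_stat, ks_p = ks_2samp(col, col)
print(ks_stat, ks_p)  # statistic 0.0 and p-value 1.0 for identical samples


def chi_square(col1, col2):
    """Chi-square test on the category counts of two categorical columns.

    (Hypothetical helper, not part of the notebook's Comparator class.)
    """
    col1, col2 = list(col1), list(col2)
    categories = sorted(set(col1) | set(col2))
    # Contingency table: one row per dataset, one column per category.
    table = np.array([[col1.count(c) for c in categories],
                      [col2.count(c) for c in categories]])
    chi2, p, _, _ = chi2_contingency(table)
    return chi2, p


species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
chi2_stat, chi2_p = chi_square(species, species)
print(chi2_stat, chi2_p)  # identical counts: statistic 0, p-value 1
```

In both tests a small statistic with a large p-value means there is no evidence that the two columns come from different distributions.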


In [7]:
comparator.dcov()

1.0

In [8]:
#np.linalg.norm(np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]]) - np.array([[0, 1, 2], [1, 2, 3]]))