# Comparing data distributions

**We want to compare the representations and meta-features of two distributions to characterize their similarities and differences (e.g. original data vs. generated data).**

- Data format: AutoML

In [1]:
datasets = {'iris': ('sample_data/iris', 'iris'),
            'iris_1': ('sample_data/iris_1', 'iris'),
            'iris_2': ('sample_data/iris_2', 'iris'),
            'mimic': ('sample_data/mimic_data', 'mimic'),
            'mushrooms': ('sample_data/mushrooms', 'mushrooms'),
            'chems': ('sample_data/chems', 'chems'),
            'credit': ('sample_data/credit_data', 'credit'),
            'squares': ('sample_data/squares', 'squares'),
            'squares_2': ('sample_data/squares_2', 'squares')}

# First dataset
input_dir1, basename1 = datasets['squares']

# Second dataset
input_dir2, basename2 = datasets['squares_2']

## Comparison

- **Overall meta-features** (descriptors): we compute simple distances between the descriptors of each dataset.
- **Individual features/variables** (column comparison):

    - Numerical:
        - Kolmogorov-Smirnov test
        
    - Categorical:
        - TODO
    
    - Other:
        - Mutual information score: the Kullback-Leibler divergence between the joint distribution and the product of its marginals, $I(X;Y) = D_{KL}(P_{XY} \,\|\, P_X \otimes P_Y)$.
        - Kullback-Leibler divergence

- **Discriminant** (row comparison): we label each row 0 or 1 according to the dataset it comes from, then train a binary classifier to tell the two apart. This is the principle behind the discriminator used to train GANs. The more sophisticated the classifier needed to separate the datasets, the more similar they are. If the classifier cannot separate them, either the datasets are too similar or the classifier is not good enough.
- **Landmark:** prediction performance on the target across various models and metrics.
- **Change of representation:** we train an auto-encoder on dataset A and benchmark it on dataset B (and vice versa). The intuition is that similar data should be compressible into the same latent space. The same principle could be applied to other changes of representation.
- **Causal inference:** comparison of causal inference results. Do we observe the same causal links between the variables?
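
The discriminant idea above can be sketched as follows. This is a minimal illustration on synthetic data, not the `Comparator` implementation: the helper name `discriminant_score`, the random forest, and the 5-fold cross-validation are assumptions chosen for the example. An accuracy near 0.5 suggests the classifier cannot tell the datasets apart, while an accuracy near 1.0 suggests they are easy to separate.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def discriminant_score(X_a, X_b, seed=0):
    """Mean cross-validated accuracy of a classifier separating X_a from X_b.

    Hypothetical helper for illustration; the two arrays need the same
    number of columns but may have different numbers of rows.
    """
    X = np.vstack([X_a, X_b])
    # Label rows by their dataset of origin: 0 for X_a, 1 for X_b
    y = np.concatenate([np.zeros(len(X_a)), np.ones(len(X_b))])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()


rng = np.random.default_rng(0)
same_a = rng.normal(0.0, 1.0, size=(300, 4))
same_b = rng.normal(0.0, 1.0, size=(300, 4))   # same distribution as same_a
shifted = rng.normal(2.0, 1.0, size=(300, 4))  # mean shifted by 2 in every column

score_same = discriminant_score(same_a, same_b)
score_shifted = discriminant_score(same_a, shifted)
print(f"same distribution:    {score_same:.3f}")
print(f"shifted distribution: {score_shifted:.3f}")
```

As noted in the list, a score near chance is ambiguous on its own: it may reflect similar data or a classifier that is too weak, so trying several classifiers of increasing capacity is a reasonable follow-up.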

Draft:
- Jensen Shannon divergence/distance
- Wasserstein distance (minimum cost of turning one "pile of dirt" into the other)
- Chi-squared test
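
Several of the column-wise distances mentioned so far are available in SciPy. The sketch below compares two numerical columns with the Kolmogorov-Smirnov test, the Jensen-Shannon distance (on histograms with shared bins), and the Wasserstein distance. The helper name `column_distances` and the choice of 20 bins are assumptions for the example, and this is not how `Comparator` computes its matrix.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp, wasserstein_distance


def column_distances(a, b, bins=20):
    """Compare two 1-D numerical samples; they may differ in length.

    Hypothetical helper for illustration only.
    """
    ks_stat, ks_pvalue = ks_2samp(a, b)
    # Shared bin edges so both histograms are defined on the same support
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    return {
        "ks_statistic": ks_stat,
        "ks_pvalue": ks_pvalue,
        # jensenshannon normalizes the counts and returns the distance
        # (square root of the divergence); with base=2 it lies in [0, 1]
        "jensen_shannon": jensenshannon(p, q, base=2),
        "wasserstein": wasserstein_distance(a, b),
    }


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.0, 1.0, size=400)  # same distribution as x
z = rng.normal(3.0, 1.0, size=400)  # mean shifted by 3

close = column_distances(x, y)
far = column_distances(x, z)
print(close)
print(far)
```

The histogram-based Jensen-Shannon value depends on the binning, whereas the Kolmogorov-Smirnov and Wasserstein values are computed directly from the samples.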

In [2]:
# Imports

# AutoML and Comparator
problem_dir = 'data_manager/'  
from sys import path
path.append(problem_dir)
%matplotlib inline
%load_ext autoreload
%autoreload 2

from auto_ml import AutoML
from comparator import Comparator

### Read data

In [3]:
comparator = Comparator(AutoML(input_dir1, basename1), AutoML(input_dir2, basename2))

No info file file found.
No info file file found.


### Visualization

In [4]:
#comparator.compare_descriptors(norm='euclidean')
comparator.show_descriptors()

Ratio: 0.0029773423133451077
Skewness min: 0.09513895507332704
Skewness max: 1.2995932888495911
Skewness mean: 0.0037448375915909438


In [5]:
comparator.show_comparison_matrix()

| | 0.0 | 0.0.1 | 1.0 | 1.0.1 | 1.0.2 | 1.0.3 | 1.0.4 | 1.0.5 | 0.0.2 | 0.0.3 | ... | 0.0.54 | 0.0.55 | 0.0.56 | 0.0.57 | 0.0.58 | 0.0.59 | 0.0.60 | 0.0.61 | 0.0.62 | 0.0.63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kullback-Leibler divergence | (0.00013541724268031105, 0.0001285034541915177) | (8.081884640253364e-05, 8.003979589988252e-05) | (0.00019897880101027774, 0.00019453604415646948) | (0.00025377043430213457, 0.00025603270558830545) | (0.00028083543166161065, 0.00028517689362542046) | (0.0006014616906716763, 0.0006190892649315331) | (0.0006740598296550888, 0.0007120343901799052) | (0.00045192696409608775, 0.0004893777769819428) | (0.0005543991830757934, 0.0005917524789834155) | (0.0006608414363197725, 0.0007649625359101948) | ... | (0.0008127066900154795, 0.0007371815090398959) | (0.0008710646891021464, 0.0008067466336485521) | (0.0006655784449927458, 0.0006349118275768813) | (0.0002134127088395531, 0.00020694678379471528) | (0.00010182937070697695, 0.00010061920701327) | (5.254730968363534e-05, 5.206821135286701e-05) | (0.0001815153329762446, 0.0001781378483628083) | (0.0001144350614666115, 0.00011268865932006567) | (0.00028489675335397315, 0.00030532116859765396) | (5.073303394861081e-05, 5.290429233756474e-05) |
| Mutual information | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | ... | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 |
| Kolmogorov-Smirnov | (0.9721388769824261, 0.0) | (0.9464209172738963, 0.0) | (0.9198456922417488, 0.0) | (0.8984140591513073, 0.0) | (0.8799828546935277, 0.0) | (0.8786969567081011, 0.0) | (0.8988164694115686, 0.0) | (0.9184864144024004, 0.0) | (0.9454909151525254, 0.0) | (0.9723287214535756, 0.0) | ... | (0.9718531218745535, 0.0) | (0.9458494070581511, 0.0) | (0.9192741820260037, 0.0) | (0.8972710387198172, 0.0) | (0.8779825689384197, 0.0) | (0.8774110587226747, 0.0) | (0.894127732533219, 0.0) | (0.9165595085012145, 0.0) | (0.9443240540090015, 0.0) | (0.9709951658609768, 0.0) |


In [None]:
# Only valid if both datasets have the same number of samples
#comparator.dcov()

In [None]:
# Possibly only valid if both datasets have the same number of samples (to be confirmed)
#comparator.datasets_distance()

In [None]:
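# Shape check: arrays of shapes (3, 3) and (2, 3) cannot be broadcast, so this subtraction raises a ValueError
#np.linalg.norm(np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]]) - np.array([[0, 1, 2], [1, 2, 3]]))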