# Comparing data distributions

**We want to compare the representations and meta-features of two distributions to characterize their similarities and differences (e.g. original data vs. generated data).**

- Data format: AutoML

In [1]:
data_dir = '../../data'

datasets = {'iris': (data_dir + '/iris', 'iris'),
            'iris_1': (data_dir + '/iris_1', 'iris'),
            'iris_2': (data_dir + '/iris_2', 'iris'),
            'mimic': (data_dir + '/mimic', 'mimic'),
            'mimic_artif': (data_dir + '/mimic', 'mimic_artif'),
            'mushrooms': (data_dir + '/mushrooms', 'mushrooms'),
            'mushrooms_gen_sam': (data_dir + '/mushrooms_gen_sam', 'mushrooms_gen_sam'),
            'chems': (data_dir + '/chems', 'chems'),
            'credit': (data_dir + '/credit_data', 'credit'),
            'squares': (data_dir + '/squares', 'squares'),
            'squares_2': (data_dir + '/squares_2', 'squares'),
            'titanic': (data_dir + '/titanic', 'titanic'),
            'adult': (data_dir + '/adult', 'adult')}

# First dataset.
input_dir1, basename1 = datasets['adult']
#input_dir1, basename1 = datasets['mimic']

# Second dataset.
input_dir2, basename2 = datasets['adult']
#input_dir2, basename2 = datasets['mimic_artif']

## Comparison

- **Overall meta-features** (descriptors): we compute simple distances between the descriptors of each dataset.
- **Individual features/variables** (column comparison):

    - Numerical:
        - Kolmogorov-Smirnov test
        
    - Categorical, binary:
        - Mutual information score: this is equal to the Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions
        - Kullback-Leibler divergence
        - Jensen-Shannon divergence

- **Discriminant** (row comparison): we label each row with 0 or 1 according to its dataset of origin, then train a binary classifier to tell the two datasets apart. This is the same principle as the discriminator used to train GANs. The more sophisticated the classifier needed to separate the data, the more similar the datasets are. If the classifier cannot separate the data, either the datasets are very similar or the classifier is not good enough.
- **Landmark:** performance at predicting the target across various models and metrics.
- **Change of representation:** we train an auto-encoder on dataset A and benchmark it on dataset B (and vice versa). The intuition is that similar data will be compressible into the same latent space. This principle could be applied to other changes of representation.
- **Causal inference:** comparison of causal inference results. Do we observe the same causal links between the variables?
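
The column-level tests listed above can be tried on synthetic data without the `Comparator` class. The following is a minimal sketch that assumes SciPy is available; the helper `category_frequencies`, the toy columns, and the smoothing constant are illustrative choices, not the notebook's own implementation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy, ks_2samp

rng = np.random.default_rng(0)

# Numerical column: two-sample Kolmogorov-Smirnov test.
a_num = rng.normal(loc=0.0, scale=1.0, size=1000)
b_num = rng.normal(loc=0.0, scale=1.0, size=1000)
ks_stat, ks_pvalue = ks_2samp(a_num, b_num)


def category_frequencies(values, categories):
    """Relative frequency of each category (hypothetical helper).

    A tiny constant is added so that no probability is exactly zero,
    which would make the Kullback-Leibler divergence infinite.
    """
    counts = np.array([np.sum(values == c) for c in categories], dtype=float)
    counts += 1e-9
    return counts / counts.sum()


# Categorical column: compare the empirical category distributions.
categories = np.array(["a", "b", "c"])
a_cat = rng.choice(categories, size=1000, p=[0.5, 0.3, 0.2])
b_cat = rng.choice(categories, size=1000, p=[0.2, 0.3, 0.5])
p = category_frequencies(a_cat, categories)
q = category_frequencies(b_cat, categories)

kl_pq = entropy(p, q)  # KL(p || q), in nats
kl_qp = entropy(q, p)  # KL(q || p): the divergence is not symmetric
# jensenshannon returns the distance, i.e. the square root of the divergence.
js_div = jensenshannon(p, q) ** 2

print("KS statistic and p-value:", ks_stat, ks_pvalue)
print("KL divergences:", kl_pq, kl_qp)
print("JS divergence:", js_div)
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric and bounded (by ln 2 when computed in nats), which makes it easier to compare across columns.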

Draft:
- Wasserstein distance (minimum cost of turning one "pile of dirt" into the other)
- Chi square
- Metrics of **privacy** and **resemblance** between two datasets:
    - Area under MDA curve with threshold
    - MMD
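
As a rough illustration of the first two draft ideas, the sketch below computes a Wasserstein distance on a numerical column and a chi-square test on the category counts of a categorical column. It assumes SciPy; the samples and counts are made up for the example and do not come from the datasets above.

```python
import numpy as np
from scipy.stats import chi2_contingency, wasserstein_distance

rng = np.random.default_rng(0)

# Numerical column: the Wasserstein distance is zero for identical samples
# and grows as one distribution is shifted away from the other.
x = rng.normal(loc=0.0, scale=1.0, size=2000)
y = rng.normal(loc=0.5, scale=1.0, size=2000)
w_same = wasserstein_distance(x, x)
w_shift = wasserstein_distance(x, y)

# Categorical column: chi-square test on a contingency table with one row
# per dataset and one column per category (illustrative counts).
counts_a = np.array([500, 300, 200])
counts_b = np.array([480, 310, 210])
chi2, pvalue, dof, expected = chi2_contingency(np.vstack([counts_a, counts_b]))

print("Wasserstein (same sample):", w_same)
print("Wasserstein (shifted sample):", w_shift)
print("Chi-square statistic, p-value, dof:", chi2, pvalue, dof)
```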

In [2]:
# AutoML
import sys
main_path = '../../'
sys.path.append(main_path + 'code/auto_ml')
sys.path.append(main_path + 'code/processing')
sys.path.append(main_path + 'code/functions')
sys.path.append(main_path + 'code/models')
sys.path.append(main_path + 'data')

%matplotlib inline
%reload_ext autoreload
%autoreload 2

from auto_ml import AutoML
from comparator import Comparator

### Read data

In [3]:
ds1 = AutoML(input_dir1, basename1)
ds2 = AutoML(input_dir2, basename2)

In [5]:
from auto_ml import AutoML
from comparator import Comparator

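# With a single dataset, Comparator compares its train and test sets.
print('ds1 train/test comparator')
auto_comparator = Comparator(ds1)
# With two datasets, Comparator compares them with each other.
print('\nds1/ds2 comparator')
comparator = Comparator(ds1, ds2)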
#df1 = AutoML.from_csv(input_dir1, basename1, 'final_df_sdv.csv')
#df2 = AutoML.from_csv(input_dir2, basename2, 'artificial_df.csv')
#comparator = Comparator(df1, df2)

ds1 train/test comparator
1 dataset detected: comparison between train and test sets.

ds1/ds2 comparator
2 datasets detected: ready for comparison.
Datasets are equal


### Visualization

In [6]:
ds1.show_feat_type()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Type | Numerical | Categorical | Numerical | Categorical | Numerical | Categorical | Categorical | Categorical | Categorical | Binary | Numerical | Numerical | Numerical | Categorical |


### Processing

In [7]:
auto_comparator.process_data()
comparator.process_data()

### Distance between descriptors

In [8]:
#comparator.compare_descriptors(norm='euclidean')
auto_comparator.show_descriptors()
print()
comparator.show_descriptors()

Ratio: 0.002015355086372361
Symb ratio: 0.0
Class deviation: nan
Missing proba: 0.0
Skewness min: 0.014042678191729219
Skewness max: 0.5106436446632792
Skewness mean: 0.14262075608362323

Ratio: 0.0
Symb ratio: 0.0
Class deviation: nan
Missing proba: 0.0
Skewness min: 0.0
Skewness max: 0.0
Skewness mean: 0.0


### Individual features comparison

In [9]:
auto_comparator.show_comparison_matrix()
comparator.show_comparison_matrix()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kolmogorov-Smirnov | (0.029, 0.0) | | (0.02, 0.029) | | (0.319, 0.0) | | | | | | (0.912, 0.0) | (0.954, 0.0) | (0.476, 0.0) | |
| Kullback-Leibler divergence | | (0.001, 0.001) | | (0.001, 0.001) | | (0.0, 0.0) | (0.001, 0.001) | (0.001, 0.001) | (0.0, 0.0) | (0.0, 0.0) | | | | (0.002, 0.001) |
| Mutual information | | 2.197 | | 2.773 | | 1.946 | 2.616 | 1.792 | 1.609 | 0.693 | | | | 2.88 |
| Jensen-Shannon divergence | | 0 | | 0 | | 0 | 0 | 0 | 0 | 0 | | | | 0 |


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kolmogorov-Smirnov | (0.028, 0.0) | | (0.003, 0.996) | | (0.323, 0.0) | | | | | | (0.917, 0.0) | (0.953, 0.0) | (0.467, 0.0) | |
| Kullback-Leibler divergence | | (0.0, 0.0) | | (0.0, 0.0) | | (0.0, 0.0) | (0.0, 0.0) | (0.0, 0.0) | (0.0, 0.0) | (0.0, 0.0) | | | | (0.0, 0.0) |
| Mutual information | | 2.197 | | 2.773 | | 1.946 | 2.708 | 1.792 | 1.609 | 0.693 | | | | 3.606 |
| Jensen-Shannon divergence | | 0 | | 0 | | 0 | 0 | 0 | 0 | 0 | | | | 0 |


### Binary classification scores
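
The idea behind these scores can be reproduced outside the `Comparator` class. Below is a minimal sketch, assuming scikit-learn and two purely numerical arrays; `discriminant_score` is a hypothetical helper and not the method behind `show_classifier_score`. An accuracy near 0.5 means the classifier cannot tell the datasets apart, while an accuracy near 1.0 means they are easy to separate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def discriminant_score(data_a, data_b, clf=None, cv=5):
    """Mean cross-validated accuracy of a classifier separating two datasets."""
    X = np.vstack([data_a, data_b])
    # Label each row by its dataset of origin: 0 for data_a, 1 for data_b.
    y = np.concatenate([np.zeros(len(data_a)), np.ones(len(data_b))])
    if clf is None:
        clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()


rng = np.random.default_rng(0)
same_a = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
same_b = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
shifted = rng.normal(loc=3.0, scale=1.0, size=(500, 4))

score_same = discriminant_score(same_a, same_b)      # same distribution
score_shifted = discriminant_score(same_a, shifted)  # clearly different

print("Same distribution:", score_same)
print("Shifted distribution:", score_shifted)
```

Cross-validation is used here so that the score is measured on rows the classifier has not seen, and both classes are present in every evaluation fold.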

- Discrimination between ds1 train set and ds1 test set

In [10]:
auto_comparator.show_classifier_score()

from sklearn.ensemble import RandomForestClassifier
auto_comparator.show_classifier_score(clf=RandomForestClassifier(n_estimators=200))

from sklearn.neural_network import MLPClassifier
auto_comparator.show_classifier_score(clf=MLPClassifier(hidden_layer_sizes=(100, 100)))



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


             precision    recall  f1-score   support

  Dataset 1       1.00      0.80      0.89      6511
  Dataset 2       0.00      0.00      0.00         0

avg / total       1.00      0.80      0.89      6511



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


             precision    recall  f1-score   support

  Dataset 1       1.00      1.00      

- Discrimination between ds1 and ds2

In [11]:
comparator.show_classifier_score()

from sklearn.ensemble import RandomForestClassifier
comparator.show_classifier_score(clf=RandomForestClassifier(n_estimators=200))

from sklearn.neural_network import MLPClassifier
comparator.show_classifier_score(clf=MLPClassifier(hidden_layer_sizes=(100, 100)))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


             precision    recall  f1-score   support

  Dataset 1       0.43      0.49      0.46      5645
  Dataset 2       0.56      0.50      0.53      7379

avg / total       0.50      0.49      0.50     13024



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


             precision    recall  f1-score   support

  Dataset 1       1.00      1.00      

### Privacy/Resemblance metric
- **MDA:** Minimum Distance Accumulation
- Privacy: area above the curve to the left of the threshold
- Resemblance: area under the curve to the right of the threshold

In [None]:
comparator.compute_mda(norm='manhattan', precision=0.1, threshold=0.4)
comparator.show_mda()

**MMD:** Maximum Mean Discrepancy
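
Since `show_mmd` is still to be written, here is one possible sketch of the quantity itself: the biased estimate of the squared MMD with an RBF kernel, using NumPy and SciPy. The function names, the choice of kernel, and the value of `gamma` are assumptions made for illustration, not the notebook's implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    return np.exp(-gamma * cdist(a, b, "sqeuclidean"))


def mmd_squared(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())


rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
y_same = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
y_shift = rng.normal(loc=2.0, scale=1.0, size=(300, 2))

mmd_same = mmd_squared(x, y_same, gamma=0.5)    # small: same distribution
mmd_shift = mmd_squared(x, y_shift, gamma=0.5)  # larger: shifted distribution

print("MMD^2 (same distribution):", mmd_same)
print("MMD^2 (shifted distribution):", mmd_shift)
```

Unlike the distance functions further down, this estimate does not require the two datasets to have the same number of samples. The result depends on the kernel bandwidth, so `gamma` should be chosen with the scale of the data in mind.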

In [None]:
#comparator.show_mmd()
# TODO

In [None]:
# Only if both datasets have the same number of samples!
#comparator.dcov()

In [None]:
# Only if both datasets have the same number of samples!
# Possible values for norm:
#        'l0',
#        'manhattan' or 'l1',
#        'euclidean' or 'l2',
#        'minimum',
#        'maximum'
#comparator.datasets_distance(axis=0, norm='manhattan')