# Characterizing data distribution

In [1]:
input_dir = 'sample_data'
basename = 'iris'

### We want to present the dataset in a human-friendly format that gives a quick overall impression of it: a kind of dataset "identity card".

- Data format : autoML

### Characterization

##### Visualization
- Scatter plot matrix
- Classes distribution
- Hierarchical clustering with heatmap matrix
- Hierarchical clustering with correlation matrix
- Principal components analysis (PCA)
- Linear discriminant analysis (LDA)
- T-distributed stochastic neighbor embedding (t-SNE algorithm)
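
As a rough illustration of two of the projections listed above, the sketch below reduces a small synthetic three-class dataset to two dimensions with PCA (unsupervised) and LDA (supervised). The synthetic data and the names `X_demo`, `y_demo`, `X_pca` and `X_lda` are assumptions made for this example; they are not part of the notebook's data pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical data: 3 classes, 30 samples each, 4 features
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X_demo = np.vstack([c + rng.normal(size=(30, 4)) for c in centers])
y_demo = np.repeat([0, 1, 2], 30)

# PCA ignores the labels and keeps the directions of largest variance
X_pca = PCA(n_components=2).fit_transform(X_demo)

# LDA uses the labels and keeps the directions that best separate the classes
# (at most n_classes - 1 = 2 components here)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_demo, y_demo)

print(X_pca.shape, X_lda.shape)
```

Either 2D array can then be passed to a scatter plot colored by `y_demo` to compare how well each projection separates the classes.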

##### Meta features
- ClassProbabilityMin $= \min_{i=1 \dots n}(p(Class_i))= \min_{i=1 \dots n}(\frac{NumberOfInstances\_Class_i}{TotalNumberOfInstances}) $
- ClassEntropy $= mean(-\sum_{i=1}^{n}p(Class_i)\ln(p(Class_i)))$ where $p(Class_i)$ is the probability that an instance belongs to Class\_i
- DatasetRatio $=\frac{NumberOfFeatures}{NumberOfInstances}$
- Landmark[Some\_Model]: accuracy of [Some\_Model] applied on dataset.
- LandmarkDecisionNodeLearner \& LandmarkRandomNodeLearner: Both are decision trees with max\_depth=1. 'DecisionNode' considers all features when looking for the best split, whereas 'RandomNode' considers only 1 feature, hence the term 'random'.
- SkewnessMin: minimum skewness over all features. Skewness measures the asymmetry of a distribution. A skewness value > 0 means that the right tail of the distribution is longer or heavier than the left tail.
- NumSymbols: For each categorical feature, count the number of unique values.
- Kurtosis = Fourth central moment divided by the square of the variance $=\frac{E[(x_i-E[x_i])^4]}{[E[(x_i-E[x_i])^2]]^2}$ where $x_i$ is the ith feature.
- PCAKurtosis: Transform the dataset X by PCA, then compute the kurtosis
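
The simpler meta features above can be sketched directly in NumPy. This is a minimal illustration under a few assumptions: the labels are given as a 1D array (so the outer mean in ClassEntropy is omitted), moments are population moments, and all function names are hypothetical rather than taken from the AutoML library.

```python
import numpy as np

def class_probabilities(y):
    """Relative frequency of each class label in the 1D array y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def class_probability_min(y):
    """Smallest class probability."""
    return float(class_probabilities(y).min())

def class_entropy(y):
    """Entropy of the class distribution, using the natural logarithm."""
    p = class_probabilities(y)
    return float(-np.sum(p * np.log(p)))

def dataset_ratio(X):
    """Number of features divided by number of instances."""
    n_instances, n_features = X.shape
    return n_features / n_instances

def skewness(X):
    """Per-feature skewness: third central moment / variance ** 1.5."""
    centered = X - X.mean(axis=0)
    m2 = (centered ** 2).mean(axis=0)
    m3 = (centered ** 3).mean(axis=0)
    return m3 / m2 ** 1.5

def kurtosis(X):
    """Per-feature kurtosis: fourth central moment / variance ** 2."""
    centered = X - X.mean(axis=0)
    m2 = (centered ** 2).mean(axis=0)
    m4 = (centered ** 4).mean(axis=0)
    return m4 / m2 ** 2

# Toy example: 6 instances, 2 features, 3 balanced classes
X_toy = np.array([[-1, 0], [1, 0], [-1, 0], [1, 0], [-1, 0], [1, 6]], dtype=float)
y_toy = np.array([0, 0, 1, 1, 2, 2])

print(class_probability_min(y_toy), class_entropy(y_toy), dataset_ratio(X_toy))
print(skewness(X_toy).min(), kurtosis(X_toy))
```

In the toy data the second feature has a single large value on the right, so its skewness is positive, while the symmetric first feature has skewness 0.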

In [2]:
# Imports

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Hierarchical clustering
import matplotlib.pyplot as pylab
import matplotlib as mpl

import scipy
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as dist
import string
import time
import sys, os
import getopt

# Principal component analysis
from sklearn.decomposition import PCA

# Discriminant analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# T-SNE
from sklearn.manifold import TSNE

# AutoML
problem_dir = 'data_manager/'  
from sys import path
path.append(problem_dir)
%matplotlib inline
%load_ext autoreload
%autoreload 2

from auto_ml import AutoML

### Read data

In [3]:
D = AutoML(input_dir, basename)

In [6]:
D.show_info()

Usage: Sample dataset iris data
Name: Iris
Task: Multiclass classification
Target type: Numerical
Feat type: ['float64' 'float64' 'float64' 'float64']
Metric: Bac metric
Time budget: 1200
Feat num: 4
Target num: 3
Label num: 3
Train num: 105
Valid num: 15
Test num: 30
Has categorical: 0
Has missing: 0
Is sparse: 0


In [4]:
data_df = D.get_data_as_df()
data_df['X_train'].head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,4.4,3.0,1.3,0.2
1,4.7,3.2,1.6,0.2
2,6.1,2.6,5.6,1.4
3,6.4,3.1,5.5,1.8
4,5.8,4.0,1.2,0.2


In [5]:
data_df['y_train'].head()

Unnamed: 0,setosa,versicolor,virginica
0,0,1,0
1,0,1,0
2,0,0,1
3,0,0,1
4,0,1,0


### Simplification

- Replace missing values, NaN, Inf.
- Replace missing values in categorical variables.
- One-hot encoding for categorical variables
- Normalization
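
The steps above could look roughly like the following NumPy sketch. It is not the implementation behind `D.get_processed_data()`: the imputation strategy (column median), the placeholder category `"missing"`, the use of standardization as the normalization step, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def clean_numeric(X):
    """Replace NaN and +/-Inf by the column median (assumed strategy)."""
    X = np.array(X, dtype=float)       # work on a copy
    X[~np.isfinite(X)] = np.nan        # treat Inf like a missing value
    col_median = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_median[cols]
    return X

def one_hot(labels, missing="missing"):
    """One-hot encode a categorical column; None becomes its own category."""
    labels = np.array([missing if v is None else str(v) for v in labels])
    categories = np.unique(labels)     # sorted unique category names
    encoded = (labels[:, None] == categories[None, :]).astype(float)
    return categories, encoded

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0                # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std

# Toy example with one missing value and one infinite value
X_raw = [[1.0, np.nan], [2.0, 4.0], [np.inf, 6.0]]
X_clean = clean_numeric(X_raw)
categories, X_cat = one_hot(["a", None, "b"])
X_ready = np.hstack([standardize(X_clean), X_cat])
print(X_ready.shape)
```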

In [None]:
processed_data = D.get_processed_data()
X = processed_data['X_train']
y = processed_data['y_train']

# Visualization

In [None]:
D.show_descriptors()