In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# import data
data = pd.read_csv('../data/raw/data.csv')

**Purpose of this Notebook:** This is an example notebook for a semi-polished EDA meant primarily for development. Such a notebook follows these guidelines:
* Most computational heavy lifting is done in `run.py`. The notebook itself shouldn't take more than 1-2 minutes to run.
    - If plotting code takes a long time, consider saving image files in scripts.
* Non-trivial code is used in the notebook, but it is still _simple_ and it still _runs fast_. Import functions from library code when possible.
* There is markdown explaining what's done in each (series of) code cell(s). This serves as documentation for people wanting to understand the details of your work, as well as for you to later refactor and use in a report!
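
As an illustration of the guideline about saving image files in scripts, here is a minimal sketch of a plotting helper that a script could call. The function name `save_histogram`, the temporary output directory, and the synthetic input are assumptions made for this example; they are not part of `run.py`.

```python
from pathlib import Path
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np


def save_histogram(values, out_path, bins=30):
    """Draw a histogram of `values` and write it to `out_path` as an image.

    Hypothetical helper for illustration only.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(values, bins=bins)
    ax.set_title("distribution of values")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)  # release the figure so long-running scripts do not leak memory
    return out_path


# Synthetic data and a temporary directory stand in for the real inputs and outputs.
rng = np.random.default_rng(0)
saved_path = save_histogram(
    rng.normal(size=500),
    Path(tempfile.mkdtemp()) / "figures" / "hist.png",
)
```

A notebook can then display the saved file (for example with `IPython.display.Image(filename=...)`) instead of recomputing the plot.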

# EDA

Pooled across all variables, the data is bimodal: most of the mass sits on the left side of the distribution, followed by a long tail. This suggests a combination of more than one distribution.

In [None]:
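# sns.distplot is deprecated; histplot with a KDE overlay gives a comparable plot
ax = sns.histplot(pd.melt(data)['value'], kde=True, stat='density')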
plt.suptitle('total distribution of all variables');
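
The cell above relies on `pd.melt` to pool every column into a single series. A small sketch of that reshaping, using a made-up two-column frame (the values are illustrative, not taken from `data.csv`):

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"x_0": [0.1, 0.2], "x_1": [1.5, 1.7]})

# With no id_vars, melt stacks all columns into two columns:
# 'variable' (the original column name) and 'value' (the observation).
long = pd.melt(toy)

# Selecting 'value' gives one Series containing every observation.
pooled = long["value"]
print(long)
```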

The observed data seems to consist of independent, normally distributed variables, most of which appear to be drawn from different distributions:

In [None]:
ax = pd.plotting.scatter_matrix(data)
plt.suptitle('Independent Gaussians');
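
The visual impression of independence can be complemented with pairwise correlations. The sketch below uses synthetic, independently drawn Gaussians as a stand-in for the real data (the column names and parameters are assumptions). Note that near-zero correlation only rules out linear dependence; it implies independence only under additional assumptions such as joint normality.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: three Gaussians drawn independently of each other.
toy = pd.DataFrame({
    "x_0": rng.normal(0.0, 1.0, size=5000),
    "x_1": rng.normal(2.0, 0.5, size=5000),
    "x_2": rng.normal(-1.0, 3.0, size=5000),
})

# Pearson correlation matrix; the diagonal is 1 by construction.
corr = toy.corr()

# Keep only the off-diagonal entries and report the largest magnitude.
off_diag = corr.to_numpy()[~np.eye(len(corr), dtype=bool)]
max_abs_corr = np.abs(off_diag).max()
print(corr.round(3))
print("largest off-diagonal |correlation|:", round(max_abs_corr, 3))
```

On the real data, the same check would be `data.corr()`.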

The five distributions likely have different means and standard deviations (with the possible exception of $x_0$ and $x_1$). However, each looks Gaussian.

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12,4))
sns.violinplot(data=pd.melt(data), x='variable', y='value', ax=axes[0])
axes[0].set_title('Violin plot of each variable')
pd.melt(data).groupby('variable')['value'].plot(kind='kde', ax=axes[1])
axes[1].set_title('Distribution of each variable')
plt.tight_layout();

The mean and standard deviation of each variable are given in the table below:

In [None]:
pd.concat([data.mean().rename('means'), data.std().rename('standard deviations')], axis=1)

Using the normality test of D'Agostino and Pearson, we find no evidence against the normality of any of the variables:

In [None]:
from scipy.stats import normaltest

data.apply(
    lambda x: pd.Series(normaltest(x), index=['skew-test + kurtosis-test', 'p-value'])
).T
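
To help read the table, here is a sketch of how `normaltest` behaves on synthetic samples (both samples are assumptions for illustration). The null hypothesis is that the sample comes from a normal distribution, so a small p-value is evidence against normality, while a large p-value only means the test found no such evidence.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# One sample that is Gaussian by construction and one that is strongly skewed.
gaussian_sample = rng.normal(size=2000)
skewed_sample = rng.exponential(size=2000)

gaussian_result = normaltest(gaussian_sample)
skewed_result = normaltest(skewed_sample)

print("gaussian: statistic=%.3f, p-value=%.3g" % gaussian_result)
print("skewed:   statistic=%.3f, p-value=%.3g" % skewed_result)
```

The skewed sample should produce a much larger statistic and a p-value close to zero.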

A two-sample Kolmogorov-Smirnov (KS) test supports the conclusion that $x_0$ and $x_1$ likely come from different distributions:

In [None]:
from scipy.stats import ks_2samp
res = ks_2samp(data['x_0'], data['x_1'])

print(pd.Series({'ks statistic': res.statistic, 'p-value': res.pvalue}).to_string())
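
For reference, a sketch of how `ks_2samp` responds to a known difference, using two synthetic Gaussian samples whose means differ by 0.5 (the sample sizes and parameters are assumptions). The KS statistic is the largest gap between the two empirical CDFs, and the null hypothesis is that both samples come from the same distribution.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Two Gaussian samples with the same spread but different means.
sample_a = rng.normal(0.0, 1.0, size=1000)
sample_b = rng.normal(0.5, 1.0, size=1000)

shifted = ks_2samp(sample_a, sample_b)

# A small p-value is evidence that the two samples follow different distributions.
print("ks statistic: %.3f, p-value: %.3g" % (shifted.statistic, shifted.pvalue))
```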