# STATISTICS Applied to data science

## Exercises PART 2: Inferential statistics, effect sizes and power tests

Statistical inference is useful during the POC stage (proof of concept) and extremely helpful during model evaluation. In this notebook you'll find exercises on hypothesis testing, effect sizes, and power tests in Python.


![Image](inferential.png)

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats import power
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 8)
%matplotlib inline
# jupyter lab configs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Hypothesis testing

Write your own t-test for two independent samples (or an F-test, if you prefer) and calculate the p-value of the statistic.  

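The p-value corresponding to the **calculated t statistic** can be obtained from the cumulative distribution function of the t-distribution (see `stats.t.cdf(t_calc, df)`, where `t_calc` is the statistic and `df` is the degrees of freedom). For two independent samples with a pooled variance, `df` is the total number of observations minus 2 (`n1 + n2 - 2`). For a two-sided test, the p-value is `2 * (1 - stats.t.cdf(abs(t_calc), df))`.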

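The critical value of t for a given alpha and degrees of freedom can be obtained with the percent point function (the inverse of the CDF), for example `stats.t.ppf(0.95, df=10)` for a one-sided test with alpha = 0.05 and 10 degrees of freedom.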
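
As a minimal sketch of the two calls mentioned above, the snippet below turns a t statistic into one-sided and two-sided p-values and looks up the matching critical values. The numbers `t_calc = 2.5` and `df = 18` are made up for illustration and do not come from any dataset in this notebook.

```python
from scipy import stats

# Hypothetical inputs for illustration only
t_calc = 2.5   # calculated t statistic
df = 18        # degrees of freedom
alpha = 0.05   # significance level

# One-sided p-value: probability of a value at least as large as t_calc under H0.
# stats.t.sf is the survival function, i.e. 1 - stats.t.cdf
p_one_sided = stats.t.sf(t_calc, df)

# Two-sided p-value: probability of a value at least as extreme in either tail
p_two_sided = 2 * stats.t.sf(abs(t_calc), df)

# Critical values: reject H0 when the statistic exceeds them
t_crit_one_sided = stats.t.ppf(1 - alpha, df)
t_crit_two_sided = stats.t.ppf(1 - alpha / 2, df)

print(f"one-sided p = {p_one_sided:.4f}, critical t = {t_crit_one_sided:.4f}")
print(f"two-sided p = {p_two_sided:.4f}, critical t = {t_crit_two_sided:.4f}")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so they should always agree.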

![Image](ttest.jpg)

In [None]:
from scipy.stats import t, norm

In [None]:
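def a_ttest(x, y):
    # 1. sample sizes, means and unbiased variances (ddof=1) of x and y
    # 2. pooled variance and standard error of the difference in means
    # 3. t statistic and degrees of freedom (n1 + n2 - 2)
    # 4. two-sided p-value from the t-distribution CDF
    # (the statistic is named t_stat so it does not shadow the imported `t`)
    return t_stat, p_value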
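
One possible way to complete the exercise is sketched below. It assumes the classic Student's t-test with equal variances in both groups and a two-sided alternative. The function name `pooled_ttest` and the simulated samples are hypothetical, chosen only for illustration. The result is checked against `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

def pooled_ttest(x, y):
    """Two-sided Student's t-test for two independent samples (equal variances assumed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    # pooled variance built from the unbiased sample variances
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
    # standard error of the difference between the two sample means
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (x.mean() - y.mean()) / se
    # two-sided p-value from the survival function (1 - CDF) of the t-distribution
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, p_value

# Simulated samples, for illustration only
rng = np.random.default_rng(0)
a = rng.normal(50, 10, size=40)
b = rng.normal(55, 10, size=35)

t_stat, p_value = pooled_ttest(a, b)
ref = stats.ttest_ind(a, b, equal_var=True)
print(f"own test: t = {t_stat:.4f}, p = {p_value:.4f}")
print(f"scipy:    t = {ref.statistic:.4f}, p = {ref.pvalue:.4f}")
```

If the equal-variance assumption is doubtful, Welch's test (`stats.ttest_ind(..., equal_var=False)`) uses a different standard error and degrees of freedom, so this sketch would not match it.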

# Effect sizes and Power tests

Cohen's d
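
For reference, the pooled-standard-deviation form of Cohen's d, which the `cohen_d` function in this section appears to implement, can be written as follows. Here $s_1^2$ and $s_2^2$ are the unbiased sample variances (the values the code stores in `s1` and `s2`), and $n_1$, $n_2$ are the sample sizes.

$$
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
$$

The sign of $d$ follows the order of the arguments, so swapping the two samples flips the sign but not the magnitude.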

In [None]:
# calculate the Cohen's d between two samples
from numpy.random import randn
from numpy.random import seed
 
# function to calculate Cohen's d for independent samples
def cohen_d(d1, d2):
    # calculate the size of samples
    n1, n2 = len(d1), len(d2)
    # calculate the variance of the samples - Which variance (biased or unbiased) is being used here?
    s1, s2 = np.var(d1, ddof=1), np.var(d2, ddof=1)
    # calculate the pooled standard deviation
    s = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # calculate the means of the samples
    u1, u2 = np.mean(d1), np.mean(d2)
    # calculate the effect size
    return (u1 - u2) / s

In [None]:
# seed random number generator
seed(1)
# prepare data
data1 = 10 * randn(10000) + 50
data2 = 10 * randn(10000) + 100
# calculate cohen's d
d = cohen_d(data1, data2)
print('Mean sample 1: %.3f' % np.mean(data1), '- Mean sample 2: %.3f' % np.mean(data2))
print('Cohens d: %.3f' % d)

### POWER tests with *statsmodels*

**For a 2-sample test:**


`statsmodels.stats.power.tt_ind_solve_power(effect_size=None, nobs1=None, alpha=None, power=None, ratio=1.0, alternative='two-sided')`  


Here `effect_size` means the standardized effect size, i.e., the difference between the two means divided by the standard deviation.


You can solve for any one of the parameters `effect_size`, `nobs1`, `alpha`, or `power`: pass `None` for the one you want and numeric values for the other three.
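
As a small sketch of how a raw difference becomes the standardized effect size that the function expects, the snippet below uses made-up numbers (a difference of 5 units and a common standard deviation of 10, both hypothetical) and then solves for the sample size.

```python
import math
from statsmodels.stats import power

# Hypothetical numbers for illustration only
mean_diff = 5.0    # raw difference between the two group means
common_sd = 10.0   # standard deviation assumed equal in both groups

# Standardized effect size: difference in means divided by the standard deviation
effect_size = mean_diff / common_sd

# nobs1=None marks the parameter to solve for; the other three are given
n_per_group = float(power.tt_ind_solve_power(effect_size=effect_size, nobs1=None,
                                             alpha=0.05, power=0.80, ratio=1.0,
                                             alternative='two-sided'))

# The solver returns a fractional value, so round up to whole observations
n_required = math.ceil(n_per_group)
print(f"effect size = {effect_size:.2f}, observations per group = {n_required}")
```

With `ratio=1.0` the second group has the same size as the first, so the total sample is twice `n_required`.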

What is the number of observations per group required to detect an effect size of 2, given alpha=0.05 and power of 0.8?

In [None]:
power.tt_ind_solve_power(effect_size=2, nobs1=None, alpha=0.05, power=0.80, ratio=1.0, alternative='two-sided')

In [None]:
# smallest effect size detectable with 100 observations per group at power 0.9
power.tt_ind_solve_power(effect_size=None, nobs1=100, alpha=0.05, power=0.9, ratio=1.0, alternative='two-sided')

In [None]:
# observations per group needed to detect an effect size of 1 at power 0.8
power.tt_ind_solve_power(effect_size=1, nobs1=None, alpha=0.05, power=0.80, ratio=1.0, alternative='two-sided')

What happens if we want to have more power in the test? (increase to 0.9)

In [None]:
power.tt_ind_solve_power(effect_size=1, nobs1=None, alpha=0.05, power=0.90, ratio=1.0, alternative='two-sided')

In [None]:
# power achieved for an effect size of 0.5 with 30 observations per group
power.tt_ind_solve_power(effect_size=0.5, nobs1=30, alpha=0.05, power=None, ratio=1.0, alternative='two-sided')
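
To see how power changes with sample size, a short sketch using the `TTestIndPower` class from the same `statsmodels` module is shown below. The effect size of 0.5 and the list of sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Arbitrary sample sizes per group, for illustration only
sample_sizes = [10, 20, 30, 50, 100, 200]

# Power of a two-sided, two-sample t-test for each sample size
powers = np.array([
    analysis.power(effect_size=0.5, nobs1=n, alpha=0.05, ratio=1.0,
                   alternative='two-sided')
    for n in sample_sizes
])

for n, p in zip(sample_sizes, powers):
    print(f"nobs1 = {n:4d}  power = {p:.3f}")
```

For a fixed effect size and alpha, power grows with the number of observations, which is why small studies often fail to detect moderate effects.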

---

Graphics from http://www.luminousmen.com