# Protocol
1- Define Hypotheses

2- Assumption Check

3- Select Proper Test 

4- Check the p value and conclude

Test Types 
* 1-Normality Tests
* 1.1-Variance Tests
* 2-Correlation Tests
* 3-Stationarity Tests
* 4-Parametric Statistical Hypothesis Tests
* 5-Nonparametric Statistical Hypothesis Tests
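
The four protocol steps above can be sketched as a single function. This is only an illustration under stated assumptions: the helper name `compare_two_samples`, the 0.05 significance level and the particular fallback order (Student, then Welch, then Mann-Whitney) are choices made for this example, not part of the original notes.

```python
import numpy as np
import scipy.stats as stats

def compare_two_samples(a, b, alpha=0.05):
    """Hypothetical helper. Step 1: H0 = both samples share the same location."""
    # Step 2: assumption checks (normality of each sample, then equal variances)
    normal = stats.shapiro(a)[1] > alpha and stats.shapiro(b)[1] > alpha
    equal_var = stats.levene(a, b, center='median')[1] > alpha

    # Step 3: select the test that matches the assumptions that hold
    if normal and equal_var:
        name = "Student's t-test"
        stat, p = stats.ttest_ind(a, b)
    elif normal:
        name = "Welch's t-test"
        stat, p = stats.ttest_ind(a, b, equal_var=False)
    else:
        name = "Mann-Whitney U test"
        stat, p = stats.mannwhitneyu(a, b, alternative='two-sided')

    # Step 4: check the p-value and conclude
    return name, p, p < alpha

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
a = rng.normal(loc=60, scale=20, size=40)
b = rng.normal(loc=90, scale=10, size=40)
name, p, reject = compare_two_samples(a, b)
print(name, p, "reject H0" if reject else "fail to reject H0")
```

The function returns which test was chosen along with the p-value, so the assumption checks stay visible in the result instead of being hidden.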

---

### Modules

In [2]:
import pandas as pd 
import numpy as np 
import scipy.stats as stats


---
# 1. Normality Tests

Aim: Tests whether a data sample has a Gaussian distribution.

Condition: Observations in each sample are independent and identically distributed (iid).
* `Kolmogorov-Smirnov:` stats.kstest(data, 'norm') : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
* `Shapiro-Wilk’s W test:` stats.shapiro(data): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html#scipy.stats.shapiro
* `D’Agostino and Pearson’s:` stats.normaltest(data) : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html


In [40]:
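def normality_tests(data):
    # Each test returns (statistic, p-value); only the p-values are reported.
    s_stat, s_p_value = stats.shapiro(data)
    d_stat, d_p_value = stats.normaltest(data)
    # 'norm' without extra arguments means the standard normal N(0, 1),
    # so data with a different mean or scale is rejected regardless of its shape.
    k_stat, k_p_value = stats.kstest(data, 'norm')
    return (s_p_value, d_p_value, k_p_value)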


In [41]:
sample = stats.norm.rvs(size= 40, loc= 60, scale = 20, )   

normality_tests(sample) 

(0.48842114210128784, 0.10368551610614321, 0.0)
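
The Kolmogorov-Smirnov p-value of 0.0 above does not mean the sample is non-normal: `kstest(data, 'norm')` compares against N(0, 1), while the sample was drawn with `loc=60, scale=20`. A minimal sketch of two equivalent workarounds follows, assuming it is acceptable to estimate the mean and standard deviation from the sample itself. Doing so makes the p-value only approximate (the test becomes conservative).

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability
sample = rng.normal(loc=60, scale=20, size=40)

# Against the standard normal: rejected because of location and scale alone
_, p_standard = stats.kstest(sample, 'norm')

# Option 1: pass the estimated loc and scale to the reference distribution
mean, std = sample.mean(), sample.std(ddof=1)
_, p_fitted = stats.kstest(sample, 'norm', args=(mean, std))

# Option 2: standardize the data, then compare against N(0, 1)
z = (sample - mean) / std
_, p_standardized = stats.kstest(z, 'norm')

print(p_standard, p_fitted, p_standardized)
```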

---
## 1.1 Variance Tests
Aim: Tests whether two or more samples come from populations with equal variances.

- `Levene`    stats.levene(group1,group2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html
- `Bartlett's` stats.bartlett(group1,group2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html#scipy.stats.bartlett 

Note 1: Levene’s test is an alternative to Bartlett’s test when there are significant deviations from normality.

Note 2: Levene’s test lets you choose the center (`center='mean'`, `'median'` or `'trimmed'`); `'trimmed'` uses a trimmed mean, which reduces the influence of outliers.

In [56]:
sample_1 = stats.norm.rvs(size= 40, loc= 60, scale = 20)  
sample_2 = stats.norm.rvs(size= 40, loc= 90, scale = 10)  

def variance_tests(data1,data2):
    l_stat, l_p = stats.levene(data1,data2, center='median')  
    b_stat, b_p = stats.bartlett(data1,data2)  
    return(l_p, b_p) 

variance_tests(data1 = sample_1, data2 = sample_2)  

(0.0003680609086010507, 1.3343701836070315e-06)
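
To illustrate Note 2, the sketch below runs Levene's test with each of the three `center` options on the same data. The seed and the `proportiontocut=0.1` value are assumptions chosen for the example; `proportiontocut` only has an effect when `center='trimmed'`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)  # fixed seed, an assumption for repeatability
group1 = rng.normal(loc=60, scale=20, size=40)
group2 = rng.normal(loc=90, scale=10, size=40)

p_values = {}
for center in ('mean', 'median', 'trimmed'):
    # proportiontocut is the fraction trimmed from each end for center='trimmed'
    stat, p = stats.levene(group1, group2, center=center, proportiontocut=0.1)
    p_values[center] = p

for center, p in p_values.items():
    print(f"center={center!r}: p = {p:.4g}")
```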

---
---
# 2. Correlation Tests

H0: The two samples are independent.

H1: A dependency exists between the samples.

## Correlation Tests Part I:
---
Aim: Tests whether two samples have a linear relationship.

* `Pearson’s Correlation Coefficient:` stats.pearsonr(sample_1,sample_2) : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html 

Condition: 

-Observations in each sample are independent and identically distributed (iid).

-Observations in each sample are normally distributed.

-Observations in each sample have the same variance.


In [37]:
sample_1 = stats.norm.rvs(size= 40, loc= 60, scale = 20, )   
sample_2 = stats.norm.rvs(size= 40, loc= 100, scale = 10, )   

# pearsonr returns the correlation coefficient and the two-sided p-value
p_stat, p_p = stats.pearsonr(sample_1,sample_2) 

## Correlation Tests Part II:
---
Aim: Tests whether two samples have a monotonic relationship.

Condition: 

-Observations in each sample are independent and identically distributed (iid).

-Observations in each sample can be ranked. 
* `Spearman’s Rank Correlation:` stats.spearmanr(sample_1,sample_2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
* `Kendall’s Rank Correlation` stats.kendalltau(sample_1,sample_2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html


In [57]:
sample_1 = stats.norm.rvs(size= 40, loc= 60, scale = 20, )   
sample_2 = stats.norm.rvs(size= 40, loc= 100, scale = 10, )   

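def rank_correlation_tests(data1,data2):
    # Both tests return (correlation, p-value); only the p-values are reported.
    s_stat, s_p = stats.spearmanr(data1,data2) 
    k_stat, k_p = stats.kendalltau(data1,data2) 
    return(s_p,k_p)

rank_correlation_tests(data1 = sample_1, data2 = sample_2) 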

(0.7579920913991244, 0.6920053103832037)
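
The difference between a linear and a monotonic relationship can be seen on data where `y` is a strictly increasing but nonlinear function of `x`. In this sketch (the seed and the choice of `exp` are assumptions made for illustration), the rank correlations reach 1 while Pearson's coefficient stays below 1.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)  # fixed seed, an assumption for repeatability
x = rng.normal(loc=0, scale=1, size=40)
y = np.exp(x)  # strictly increasing in x, but not linear

pearson_r, pearson_p = stats.pearsonr(x, y)      # measures linear association
spearman_r, spearman_p = stats.spearmanr(x, y)   # measures monotonic association
kendall_tau, kendall_p = stats.kendalltau(x, y)  # also rank-based

print(pearson_r, spearman_r, kendall_tau)
```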

## Correlation Tests Part III:
---
Aim: Tests whether two categorical variables are related or independent.

Condition:

-Observations used in the calculation of the contingency table are independent.

-At least 25 examples in each cell of the contingency table (a more lenient rule of thumb is an expected count of at least 5 per cell).
* `Chi-Squared Test` stats.chi2_contingency https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

In [58]:
sample_1 = stats.norm.rvs(size= 40, loc= 60, scale = 20, )   
sample_2 = stats.norm.rvs(size= 40, loc= 100, scale = 10, )  
# chi2_contingency expects a table of non-negative observed counts,
# e.g. table = [[40,50,60],[2,7,9]]; the continuous samples here are only placeholders.
table = [sample_1, sample_2]

def chi2_test(data):
    stat, p, dof, expected = stats.chi2_contingency(data)
    return(p)

chi2_test(data = table) 

1.3654632995142594e-30
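
Since the chi-squared test is meant for counts of categorical observations, a more typical workflow builds the contingency table from two categorical columns. The sketch below uses made-up data: the column names `group` and `outcome`, the probabilities and the seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(3)  # fixed seed, an assumption for repeatability
n = 300
group = rng.choice(['A', 'B'], size=n)
# The outcome depends on the group, so H0 (independence) is false by construction
prob_yes = np.where(group == 'A', 0.7, 0.3)
outcome = np.where(rng.random(n) < prob_yes, 'yes', 'no')
df = pd.DataFrame({'group': group, 'outcome': outcome})

# Contingency table of observed counts (rows: group, columns: outcome)
table = pd.crosstab(df['group'], df['outcome'])
chi2, p, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
print("smallest expected count:", expected.min())
```

Checking `expected.min()` is a direct way to verify the cell-count condition listed above.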

---
---
# 3. Stationarity Tests
Note: these tests require `statsmodels`.

Aim: Tests whether a time series has a unit root, e.g. has a trend or more generally is autoregressive.

Condition:
-Observations are temporally ordered.

* `Augmented Dickey-Fuller Unit Root Test` adfuller(sample_1) https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html

In [5]:
from statsmodels.tsa.stattools import adfuller
 
sample_1 = np.linspace(0,100,100) 

stat, p, lags, obs, crit, t = adfuller(sample_1)
p

0.8623863863502361

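Aim: Tests whether a time series is stationary around a constant level (`regression='c'`, the default) or around a deterministic trend (`regression='ct'`). Unlike the ADF test, the null hypothesis here is that the series is stationary.

Condition:
-Observations are temporally ordered.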

* `Kwiatkowski-Phillips-Schmidt-Shin` kpss https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.kpss.html

In [8]:
from statsmodels.tsa.stattools import kpss

sample_1 = np.linspace(0,100,100) 

stat, p, lags, crit = kpss(sample_1) 

InterpolationWarning: The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is smaller than the p-value returned.



---
---
# 4. Parametric Statistical Hypothesis Tests
Aim: Tests whether the means of two or more independent or paired samples are significantly different.

Condition:

-Observations in each sample are independent/paired and identically distributed (iid).

-Observations in each sample are normally distributed.

-Observations in each sample have the same variance.

Up to two samples:
* Independent: `Student’s t-test` stats.ttest_ind(sample_1,sample_2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
* Paired: `Paired Student’s t-test` stats.ttest_rel(sample_1, sample_2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html 

More than two samples:
* Independent: `Analysis of Variance Test (ANOVA)` stats.f_oneway(sample_1, sample_2, sample_3) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
* Paired: `Repeated Measures ANOVA Test` AnovaRM https://www.statsmodels.org/stable/generated/statsmodels.stats.anova.AnovaRM.html

In [38]:
sample_1 = stats.norm.rvs(size= 40, loc= 60, scale = 20, )   
sample_2 = stats.norm.rvs(size= 40, loc= 100, scale = 10, )  
sample_3 = stats.norm.rvs(size= 40, loc= 60, scale = 20, )   

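def t_test(data1, data2=None, kind=0, popmean=None):
    """kind: 0 = one-sample, 1 = independent two-sample, 2 = paired."""
    if kind == 0:
        # Without a hypothesized mean the sample is tested against its own mean,
        # which trivially gives t = 0 and p = 1.
        if popmean is None:
            popmean = np.mean(data1)
        o_stat, o_p = stats.ttest_1samp(data1, popmean=popmean)
        return (o_stat, o_p)
    elif kind == 1:
        t_stat, t_p = stats.ttest_ind(data1, data2)
        return (t_stat, t_p)
    elif kind == 2:
        p_stat, p_p = stats.ttest_rel(data1, data2)
        return (p_stat, p_p)

# One-way ANOVA across the three samples
a_stat, a_p = stats.f_oneway(sample_1, sample_2, sample_3)

t_test(sample_1)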

(0.0, 1.0)
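
The result (0.0, 1.0) above is what a one-sample t-test gives when the sample is compared with its own mean. In practice `popmean` is a hypothesized value fixed before looking at the data. The sketch below contrasts the three cases; the seed and the hypothesized values 60 and 100 are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(6)  # fixed seed, an assumption for repeatability
sample = rng.normal(loc=60, scale=20, size=40)

# H0: population mean is 60 (true for this simulated sample)
t_true, p_true = stats.ttest_1samp(sample, popmean=60)

# H0: population mean is 100 (false for this simulated sample)
t_false, p_false = stats.ttest_1samp(sample, popmean=100)

# Comparing the sample with its own mean carries no information
t_self, p_self = stats.ttest_1samp(sample, popmean=sample.mean())

print(f"popmean=60:   t = {t_true:.3f}, p = {p_true:.3g}")
print(f"popmean=100:  t = {t_false:.3f}, p = {p_false:.3g}")
print(f"popmean=mean: t = {t_self:.3f}, p = {p_self:.3g}")
```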

Example: test whether four different drugs (`Effect`) lead to different reaction times, with each of five subjects (`Target`) measured under every drug.

In [28]:
from statsmodels.stats.anova import AnovaRM

#create data
df = pd.DataFrame({'Target': np.repeat([1, 2, 3, 4, 5], 4),'Effect': np.tile([1, 2, 3, 4], 5),
                   'Reaction': stats.norm.rvs(size = 20, loc= 25, scale = 10)  }) 
print(AnovaRM(data=df, depvar='Reaction', subject='Target', within=['Effect']).fit())

# Inspired From: https://www.statology.org/repeated-measures-anova-python/ 

               Anova
       F Value Num DF  Den DF Pr > F
------------------------------------
Effect  0.8129 3.0000 12.0000 0.5110



---
---
# 5. Nonparametric Statistical Hypothesis Tests
Aim: Tests whether the distributions of two or more independent or paired samples are equal or not.

Condition:

-Observations in each sample are independent/paired and identically distributed (iid).

-Observations in each sample can be ranked.

* Independent: `Mann-Whitney U Test` stats.mannwhitneyu(sample_1,sample_2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html 
* Independent: `Kruskal-Wallis H Test` stats.kruskal(sample_1,sample_2) (two or more samples) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html
* Paired: `Wilcoxon Signed-Rank Test` stats.wilcoxon(sample_1,sample_2) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html 
* Paired: `Friedman Test` stats.friedmanchisquare(sample_1,sample_2,sample_3) (three or more samples) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.friedmanchisquare.html

In [None]:
sample_1 = stats.norm.rvs(size= 40, loc= 60, scale = 20, )   
sample_2 = stats.norm.rvs(size= 40, loc= 100, scale = 10, )   
sample_3 = stats.norm.rvs(size= 40, loc= 100, scale = 10, )   

m_stat, m_p = stats.mannwhitneyu(sample_1,sample_2) 
k_stat, k_p = stats.kruskal(sample_1,sample_2) 
w_stat, w_p = stats.wilcoxon(sample_1,sample_2) 
f_stat, f_p = stats.friedmanchisquare(sample_1,sample_2,sample_3) 
 

---

In [5]:
Grades1 = pd.Series(stats.norm.rvs(size= 40, loc= 80, scale = 10, ) )
Grades2 = pd.Series(stats.norm.rvs(size= 40, loc= 60, scale = 20, ) )
Grades3 = pd.Series(stats.norm.rvs(size= 40, loc= 20, scale = 10, ) ) 
d = {'Class 1':Grades1, 'Class 2':Grades2, 'Class 3':Grades3}
data= pd.DataFrame( d ) 
data.head() 

|   | Class 1 | Class 2 | Class 3 |
|---|---------|---------|---------|
| 0 | 83.42817 | 78.776195 | 11.457225 |
| 1 | 93.250218 | 64.607907 | 7.837001 |
| 2 | 75.725762 | 64.599814 | 26.956862 |
| 3 | 73.218575 | 75.727275 | 26.987854 |
| 4 | 91.755958 | 58.014597 | 26.59455 |
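
A DataFrame laid out like the one above (one column per class) can be passed to the multi-sample tests by unpacking its columns. The sketch below rebuilds similar data with a fixed seed (an assumption, so the values differ from the table above) and runs both the parametric and the nonparametric comparison.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(5)  # fixed seed, an assumption for repeatability
data = pd.DataFrame({
    'Class 1': rng.normal(loc=80, scale=10, size=40),
    'Class 2': rng.normal(loc=60, scale=20, size=40),
    'Class 3': rng.normal(loc=20, scale=10, size=40),
})

# One array per class, in column order
groups = [data[col] for col in data.columns]

f_stat, f_p = stats.f_oneway(*groups)  # parametric: one-way ANOVA
h_stat, h_p = stats.kruskal(*groups)   # nonparametric: Kruskal-Wallis H

print(f"ANOVA:          F = {f_stat:.2f}, p = {f_p:.3g}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {h_p:.3g}")
```

Since the three classes were simulated with unequal variances, the Kruskal-Wallis result is the safer one to report here.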


# References
- https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/ 
- https://www.statology.org/repeated-measures-anova-python/ 