# SciPy stats Notebook

## Overview of the SciPy stats library
https://docs.scipy.org/doc/scipy/reference/stats.html

This module contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more.
1. Probability distributions
2. Summary statistics
3. Frequency statistics
4. Correlation functions: `f_oneway(*args[, axis])` performs one-way ANOVA.
5. Statistical tests

# What is ANOVA?
ANOVA, which stands for Analysis of Variance, is a statistical test of whether there is a statistically significant difference between the means of **three or more groups** (with only two groups it is equivalent to an independent-samples t-test). One-way ANOVA is used when there is one independent variable, while two-way ANOVA is used when there are two independent variables.
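
To make the test statistic concrete, here is a minimal sketch of how the one-way ANOVA F statistic can be computed by hand: the between-group mean square divided by the within-group mean square. The three groups are made-up numbers for illustration only (they are not taken from the crop dataset), and the hand-computed result is compared against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Three made-up groups (illustrative values only, NOT from crop_data.csv)
groups = [
    np.array([176.8, 177.1, 176.5, 177.0]),
    np.array([177.2, 176.9, 177.4, 177.1]),
    np.array([177.9, 177.6, 178.1, 177.7]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value from the upper tail of the F distribution with (k - 1, n_total - k) df
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

The two printed pairs should agree up to floating-point error, which shows that `f_oneway` is doing nothing more than this variance comparison.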

However, ANOVA does not tell us which specific groups differ significantly from each other. Post hoc tests are needed to find where the significant differences lie. For data that meet the assumption of homogeneity of variances, Tukey's honestly significant difference (HSD) post hoc test is a good option. For data with unequal variances, the Games-Howell post hoc test can be used.
Ref: https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php

At the time of writing, the scipy.stats module does not have a function to perform Tukey's HSD post hoc test; statsmodels is one of the packages that lets users do so. However, SciPy has put it on its development roadmap, so `tukey_hsd` should become available in an upcoming release.

https://www.statsmodels.org/dev/generated/statsmodels.stats.multicomp.pairwise_tukeyhsd.html

https://scipy.github.io/devdocs/reference/generated/scipy.stats.tukey_hsd.html?highlight=tukey_hsd#scipy.stats.tukey_hsd

## The assumptions of ANOVA (ref from Laerd)
1. The dependent variable (DV) is measured at the interval or ratio level.
2. The independent variable (IV) consists of two or more categorical, independent groups.
3. Observations are independent, i.e. there are no repeated measures.
4. There are no significant outliers, as outliers reduce the validity of the results.
5. The DV is approximately normally distributed for each category of the IV; this can be checked with the Shapiro-Wilk test of normality.
6. The variances are homogeneous across groups; this can be checked with Levene's test.


# The Dataset and Hypothesis

### The dataset
The dataset comes from an imaginary crop yield experiment, adapted from [link].

There are 4 variables in the dataset: planting location (block 1, 2, 3, 4), planting density (1 = low, 2 = high), type of fertilizer used (type 1, 2, 3), and crop yield per acre.

I am interested in whether there is a statistically significant difference in yield between the 3 types of fertilizer.

(https://www.scribbr.com/statistics/one-way-anova/)

### The hypothesis
H0: There is no difference in mean yield between the 3 types of fertilizer. <br>
H1: The mean yield of at least one type of fertilizer differs from the others.

In [2]:
import pandas as pd
import scipy.stats as stats
import seaborn as sns

df = pd.read_csv('crop_data.csv')

df.head()

| Unnamed: 0 | density | block | fertilizer | yield |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 177.228692 |
| 1 | 2 | 2 | 1 | 177.550041 |
| 2 | 1 | 3 | 1 | 176.408462 |
| 3 | 2 | 4 | 1 | 177.703625 |
| 4 | 1 | 1 | 1 | 177.125486 |


### Check whether the data meet the assumptions of ANOVA

#### Assumption 1 - Dependent variable at interval or ratio level 
In this case the dependent variable is yield. Let's have a look at its descriptive statistics.

In [5]:
# set the dependent variable as yield
dependent = df['yield']
dependent.describe()

count     96.000000
mean     177.015476
std        0.664548
min      175.360840
25%      176.468696
50%      177.058105
75%      177.398571
max      179.060899
Name: yield, dtype: float64

#### Assumption 2 - Independent variable should consist of 2 or more categorical, independent groups
In this case the independent variable is the type of fertilizer.

In [6]:
# show the distinct fertilizer types in the df
df.fertilizer.unique()

array([1, 2, 3], dtype=int64)

In [7]:
# convert the integer fertilizer codes to str for later use
fertilizer = df['fertilizer'].astype('str')
fertilizer

0     1
1     1
2     1
3     1
4     1
     ..
91    3
92    3
93    3
94    3
95    3
Name: fertilizer, Length: 96, dtype: object

In [8]:
# set independent variable as fertilizer
independent = fertilizer
independent.value_counts()

2    32
1    32
3    32
Name: fertilizer, dtype: int64

#### Assumption 3 - Experiment design

#### Assumption 4 - No significant outliers
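
One common way to screen for outliers is the 1.5 * IQR rule applied within each fertilizer group. The sketch below assumes that rule (the notebook itself does not say which method it uses) and runs on randomly generated stand-in data with the same `fertilizer` and `yield` columns, since `crop_data.csv` is not reproduced here. The names `demo` and `iqr_outliers` are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in data (made up): 3 fertilizer groups of 32 observations each.
# With the real notebook data, use df from crop_data.csv instead of demo.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "fertilizer": np.repeat(["1", "2", "3"], 32),
    "yield": np.concatenate([
        rng.normal(176.8, 0.5, 32),
        rng.normal(176.9, 0.5, 32),
        rng.normal(177.4, 0.5, 32),
    ]),
})

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]

# Apply the rule separately to each fertilizer group
outliers = {
    name: iqr_outliers(group)
    for name, group in demo.groupby("fertilizer")["yield"]
}
for name, found in outliers.items():
    print(f"fertilizer {name}: {len(found)} potential outlier(s)")
```

A box plot such as `sns.boxplot(x='fertilizer', y='yield', data=df)` gives a visual version of the same check, since its whiskers use the same 1.5 * IQR convention by default.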

#### Assumption 5 - Dependent variable should be approximately normally distributed for each category of IV
The Shapiro-Wilk test will be used to check normality for each of the 3 types of fertilizer.
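
A minimal sketch of the per-group Shapiro-Wilk check with `scipy.stats.shapiro` follows. It runs on randomly generated stand-in data (the values are made up, and `demo` is a hypothetical name), and the 0.05 cut-off is the conventional choice rather than something stated in the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Stand-in data (made up); replace demo with df from crop_data.csv in practice
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "fertilizer": np.repeat(["1", "2", "3"], 32),
    "yield": np.concatenate([
        rng.normal(176.8, 0.5, 32),
        rng.normal(176.9, 0.5, 32),
        rng.normal(177.4, 0.5, 32),
    ]),
})

# Run the Shapiro-Wilk test once per fertilizer group
shapiro_results = {}
for name, group in demo.groupby("fertilizer")["yield"]:
    statistic, p_value = stats.shapiro(group)
    shapiro_results[name] = (statistic, p_value)
    # A p-value above 0.05 means normality is not rejected for this group
    verdict = "not rejected" if p_value > 0.05 else "rejected"
    print(f"fertilizer {name}: W={statistic:.3f}, p={p_value:.3f}, normality {verdict}")
```

The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution, so a large p-value is the desired outcome here.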

#### Assumption 6 - Homogeneity of variances
Levene's test will be used to check this.
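
A minimal sketch of Levene's test with `scipy.stats.levene`, which takes one sample per group, follows. The data are randomly generated stand-ins (made-up values, hypothetical name `demo`), and the 0.05 threshold is an assumed convention.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Stand-in data (made up); replace demo with df from crop_data.csv in practice
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "fertilizer": np.repeat(["1", "2", "3"], 32),
    "yield": np.concatenate([
        rng.normal(176.8, 0.5, 32),
        rng.normal(176.9, 0.5, 32),
        rng.normal(177.4, 0.5, 32),
    ]),
})

# One array of yields per fertilizer type
samples = [group.to_numpy() for _, group in demo.groupby("fertilizer")["yield"]]

# levene() centers on the median by default, which is robust to non-normality
levene_stat, levene_p = stats.levene(*samples)
print(f"Levene statistic={levene_stat:.3f}, p={levene_p:.3f}")
if levene_p > 0.05:
    print("Equal variances are not rejected.")
else:
    print("Variances appear to differ between groups.")
```

The null hypothesis is that all groups have equal variances, so a p-value above the threshold supports using the standard one-way ANOVA and Tukey's HSD.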

## Running ANOVA
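
A minimal sketch of the one-way ANOVA call with `scipy.stats.f_oneway` follows, passing one sample of yields per fertilizer type. It uses randomly generated stand-in data (made-up values, hypothetical name `demo`), so the numbers it prints are not the results for the crop dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Stand-in data (made up); replace demo with df from crop_data.csv in practice
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "fertilizer": np.repeat(["1", "2", "3"], 32),
    "yield": np.concatenate([
        rng.normal(176.8, 0.5, 32),
        rng.normal(176.9, 0.5, 32),
        rng.normal(177.4, 0.5, 32),
    ]),
})

# Split the yields into one array per fertilizer type
samples = {
    name: group.to_numpy()
    for name, group in demo.groupby("fertilizer")["yield"]
}

# f_oneway takes each group as a separate positional argument
f_stat, p_value = stats.f_oneway(*samples.values())
print(f"F={f_stat:.3f}, p={p_value:.4f}")

# Reject H0 at the (assumed) 5% level if p < 0.05
if p_value < 0.05:
    print("Reject H0: at least one fertilizer mean differs.")
else:
    print("Fail to reject H0.")
```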

## Post hoc test

While ANOVA tells us that there is a significant difference in mean yield between the 3 fertilizers, it does not tell us which ones differ from the others. So we need to carry out a post hoc test to find out.
As mentioned at the beginning, SciPy does not have a function for Tukey's HSD yet, so we will use the statsmodels module.
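
A minimal sketch of the post hoc step with `pairwise_tukeyhsd` from statsmodels follows. It takes the dependent values and the group labels as two parallel arrays. The data are randomly generated stand-ins (made-up values, hypothetical name `demo`), so the printed table is not the result for the crop dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Stand-in data (made up); replace demo with df from crop_data.csv in practice
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "fertilizer": np.repeat(["1", "2", "3"], 32),
    "yield": np.concatenate([
        rng.normal(176.8, 0.5, 32),
        rng.normal(176.9, 0.5, 32),
        rng.normal(177.4, 0.5, 32),
    ]),
})

# endog: the dependent variable; groups: the label of each observation
tukey = pairwise_tukeyhsd(
    endog=demo["yield"].to_numpy(),
    groups=demo["fertilizer"].to_numpy(),
    alpha=0.05,
)

# One row per pair of fertilizers: mean difference, confidence interval, reject flag
print(tukey.summary())
```

Each row of the summary compares one pair of fertilizers; `reject` is `True` where the pair's mean yields differ significantly at the chosen `alpha`.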

## Reference
***


# END