In [1]:
# import libraries
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sn
import scipy.stats as st
from statistics import multimode

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd


<br>
<center><b style="color:black;font-size:40px;"> Homework 5: ANOVAs </b></center>  
<br>

**Interpreting p-values**  
<br>
**Null** hypothesis: $H_{0}: \bar{x}=\mu$  
**Left** tailed test **alternative** hypothesis: $H_{1}: \bar{x}<\mu$  
**Right** tailed test **alternative** hypothesis: $H_{1}: \bar{x}>\mu$  
<br>
$p<\alpha \rightarrow $ Reject $H_{0}$  
$p>\alpha \rightarrow$ Fail to reject $H_{0}$
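The decision rule above can be sketched in a few lines. This is only an illustration: the sample values, the reference mean `mu`, and the helper name `decide` are all made up for this example, and the `alternative` argument assumes a reasonably recent SciPy (1.6 or later).

```python
import scipy.stats as st

def decide(p_value, alpha=0.05):
    """Hypothetical helper: apply the p < alpha decision rule."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Made-up sample and reference mean, for illustration only
sample = [9.1, 9.4, 8.8, 9.0, 9.3, 8.9]
mu = 10

# Left-tailed test: H1 says the mean is below mu
left = st.ttest_1samp(sample, popmean=mu, alternative="less")
# Right-tailed test: H1 says the mean is above mu
right = st.ttest_1samp(sample, popmean=mu, alternative="greater")

print("left-tailed: ", decide(left.pvalue))
print("right-tailed:", decide(right.pvalue))
```

Because the sample mean sits well below `mu`, only the left-tailed test rejects the null; the two one-sided p-values add up to 1.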



# Problem #1:
1. Import the file “Biocharcolas.xlsx” you used for the Midterm Project into a Jupyter notebook. Plot the treatment means and standard deviations for the “Grav” variable:
  - Note: most boxplots show quartiles, which are NOT means and standard deviations. We recommend using something like the `df.plot` function in pandas.  
  - Feel free to copy relevant set-up code from what you already developed for the Midterm!

### Using the `groupby()` function as described in the homework assignment:

The `df.groupby` function has many useful applications, such as aggregated computations of statistical parameters. The basic structure of code to be used is:  
```Python
df.groupby("col_to_group_by").agg([func_1, func_2, func_3])
```
Once the dataframe and the column are chosen, the functions can be decided on:
- NumPy:
    - Mean: `np.mean`
    - Standard deviation: `np.std`
- SciPy
    - Standard error: `scipy.stats.sem`
`df.groupby(...).agg` accepts functions (or the names of built-in aggregations such as `"mean"`), not arbitrary expressions, so `np.std * 2` cannot be used; a custom function can be defined instead.  
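A minimal sketch of this pattern, using made-up numbers in place of the homework file. The column names `Biochar` and `Grav` follow the assignment, while the values and the helper `two_std` are hypothetical. The string names `"mean"`, `"std"`, and `"sem"` call pandas' own group methods, which use the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical stand-in data; the real values come from the homework file
df = pd.DataFrame({
    "Biochar": ["A", "A", "A", "B", "B", "B"],
    "Grav": [0.10, 0.12, 0.14, 0.20, 0.22, 0.27],
})

def two_std(x):
    """Custom aggregation: twice the sample standard deviation."""
    return 2 * x.std()

# One row per treatment, one column per aggregation
summary = df.groupby("Biochar")["Grav"].agg(["mean", "std", "sem", two_std])
print(summary)
```

The custom function receives each group's values as a Series, so any Series method can be used inside it.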


**Visualization:**  

The pandas plotting function `df.plot` is a useful one-liner:  
```Python
df.plot(kind="barh", y = "mean", legend = False, title = "Title")
```
This is the general structure for a horizontal bar chart (`kind="barh"`); the `y` argument names the column to plot.
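To show standard deviations alongside the means, one option is to pass them as error bars through `yerr`. The sketch below assumes a summary table with `mean` and `std` columns like the one `groupby().agg()` produces; the numbers are placeholders, not results from the homework data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Placeholder summary table (one row per treatment)
summary = pd.DataFrame(
    {"mean": [0.12, 0.23], "std": [0.02, 0.036]},
    index=pd.Index(["A", "B"], name="Biochar"),
)

# Bars show the treatment means; error bars show one standard deviation
ax = summary.plot(
    kind="bar",
    y="mean",
    yerr=summary["std"],
    capsize=4,
    legend=False,
    title="Mean Grav by treatment (error bars: 1 SD)",
)
ax.set_ylabel("Grav")
```

For the horizontal version (`kind="barh"`), the error bars would go in `xerr` instead.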


# Problem #2:
2. Perform a 1-way ANOVA for the factor “Soil” (HINT: check out the `df.dropna()` function)
  - Generate your hypotheses
  - Calculate the 1-way ANOVA p-value
  - In a markdown cell, interpret your findings including what the p-value from an ANOVA can tell you and any conclusions you can make about the factor you measured. With only 2 groups to compare, was a 1-way ANOVA the right test to perform?
  

From Studio 7:  
>  **Verification**  
Once you've completed the manual calculations, create a box plot for each of the data sets. You can type the values into a spreadsheet and save it as a CSV file, then load it into pandas.  
Finally, compare your answer to the answer from `scipy.stats`  
>```Python
>st.f_oneway(df["x1"], df["x2"], df["x3"], df["x4"])
>```
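The Studio example assumes one column per group. When the data are in long format (one column for the factor, one for the measurement), a possible approach is to drop missing measurements and split the values by group before calling `st.f_oneway`. The data below are invented for illustration; only the column names come from the assignment.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Invented long-format data with one missing measurement
df = pd.DataFrame({
    "Soil": ["S1"] * 4 + ["S2"] * 4,
    "Grav": [0.10, 0.12, np.nan, 0.14, 0.20, 0.22, 0.27, 0.25],
})

# Drop rows where the measurement is missing, then split by factor level
clean = df.dropna(subset=["Grav"])
groups = [g["Grav"].to_numpy() for _, g in clean.groupby("Soil")]

f_stat, p_value = st.f_oneway(*groups)

# With exactly two groups, the equal-variance t-test gives the same p-value
t_stat, t_p = st.ttest_ind(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, t^2 = {t_stat**2:.3f}")
```

For two groups, the F statistic equals the square of the pooled-variance t statistic, which is worth keeping in mind when judging whether an ANOVA was needed.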

# Problem #3:
3. Perform a 1-way ANOVA for the factor “Biochar”
  - Generate your hypotheses
  - Calculate the 1-way ANOVA p-value
  - In a markdown cell, interpret your findings including any conclusions you can make about the factor you measured.

# Problem #4:
4. Perform a 2-way ANOVA for the factors “Soil” and “Biochar” with the interaction
  - Generate your hypotheses
  - Calculate the 2-way ANOVA p-values
  - In a markdown cell, interpret your findings including any conclusions you can make about the factors and interaction you measured.
  
From studio 8:
>In this case we need a two-way ANOVA. We are not going to dive into the math of the two-way ANOVA, but it basically works the same as the one-way. We are ultimately comparing within-group variance to between-group variance.
>
>The syntax is a little weird.  Just go with it.
>
>```Python
>model = ols('height ~ C(water) + C(sun)', data=df).fit()
>sm.stats.anova_lm(model, typ=2)
>```

# Problem #5:
5. Perform a Tukey’s HSD posthoc test on your 2-way ANOVA (Hint: first use the df.dropna() function again, but only using the “Grav” column)
  - Generate the pair-wise comparison table for a Tukey’s HSD posthoc test (`from statsmodels.stats.multicomp import pairwise_tukeyhsd`)
  - In a markdown cell, give an overall interpretation of the findings of your posthoc test (please don’t describe every pair-wise comparison!). Are there any trends? Does this agree visually with your graph from question 1? Does this reinforce your findings from both 1-way ANOVAs?
  - In a markdown cell, describe whether every pair-wise comparison tells you something useful.

In [None]:
# tukey = pairwise_tukeyhsd(endog=df['height'],groups=df['sun'],alpha=0.05)

# print(tukey)
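`pairwise_tukeyhsd` takes a single grouping variable, so one way to compare every Soil and Biochar combination is to build a combined label first. This is a sketch on random stand-in data; the `treatment` column and the factor levels are assumptions made for the example.

```python
import itertools
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Random stand-in data: 2 soils x 2 biochar levels x 5 replicates
rng = np.random.default_rng(1)
rows = [
    {"Soil": soil, "Biochar": biochar, "Grav": rng.normal(0.2, 0.02)}
    for soil, biochar in itertools.product(["S1", "S2"], ["B0", "B1"])
    for _ in range(5)
]
df = pd.DataFrame(rows)
df.loc[0, "Grav"] = np.nan  # simulate a missing measurement

# Tukey's HSD cannot handle NaN, so drop rows missing the response only
clean = df.dropna(subset=["Grav"]).copy()

# Hypothetical combined label, one level per Soil/Biochar combination
clean["treatment"] = clean["Soil"] + "_" + clean["Biochar"]

tukey = pairwise_tukeyhsd(endog=clean["Grav"], groups=clean["treatment"], alpha=0.05)
print(tukey)
```

With four treatment combinations the table has six rows, one per pair; the `reject` column marks the pairs whose difference is significant at the chosen `alpha`.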

`dropna()` command:
```Python
DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
```
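**Arguments:**
- axis
    - 0 to drop rows with missing values
    - 1 to drop columns with missing values
- how
    - 'any': drop if any NaN/missing value is present
    - 'all': drop if all the values are missing or NaN
- thresh: minimum number of non-missing values a row (or column) needs in order to be kept
- subset: labels of the columns to check for missing values when dropping rows
- inplace: if `True`, modify the DataFrame itself instead of returning a new one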
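A short sketch of how these arguments change the result, on a small made-up table (the `Notes` column is hypothetical and exists only to hold extra missing values):

```python
import numpy as np
import pandas as pd

# Made-up table with missing values in two columns
df = pd.DataFrame({
    "Soil": ["S1", "S1", "S2", "S2"],
    "Grav": [0.10, np.nan, 0.20, 0.22],
    "Notes": [np.nan, np.nan, "ok", np.nan],
})

# Default (how='any'): drop every row containing at least one missing value
any_missing = df.dropna()

# subset: only look at "Grav" when deciding which rows to drop
grav_only = df.dropna(subset=["Grav"])

# thresh: keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

print(len(any_missing), len(grav_only), len(at_least_two))
```

Since `inplace` is left at its default of `False`, each call returns a new DataFrame and `df` itself still has all four rows.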

# Extra Credit: 
 - Extra Credit: Perform a Multifactorial ANOVA for the factors “Soil”, “Biochar”, and “day” (use the same syntax for a 2-way ANOVA but add a factor) for the variable “Grav”.
  - Generate your hypotheses
  - Calculate the Multifactorial ANOVA p-values
  - In a markdown cell, interpret your findings including any conclusions you can make about the factors and interaction you measured.