# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [31]:
# Your code here
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as st
import numpy as np


df = pd.read_csv("ToothGrowth.csv")
df

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
5,10.0,VC,0.5
6,11.2,VC,0.5
7,11.2,VC,0.5
8,5.2,VC,0.5
9,7.0,VC,0.5
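
The `Unnamed: 0` column in the preview looks like a row index that was written into the CSV. As a small sketch, assuming the first column of the file really is just that index, passing `index_col=0` to `pd.read_csv` keeps it out of the data columns. The example reads an in-memory copy of the first few rows shown above rather than the real file.

```python
import io

import pandas as pd

# First rows of the preview above, held in memory so the sketch is self-contained.
csv_text = """,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
"""

# index_col=0 treats the unnamed first column as the row index.
preview = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(preview.columns.tolist())
```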


## Generate the ANOVA table

Now generate an ANOVA table to analyze the influence of the supplement type (`supp`) and dosage (`dose`) on tooth length (`len`):

In [6]:
# Your code here

formula = 'len ~ C(supp) + dose'
lm = ols(formula,df).fit()
table = sm.stats.anova_lm(lm, typ = 2)
print(table)

               sum_sq    df           F        PR(>F)
C(supp)    205.350000   1.0   11.446768  1.300662e-03
dose      2224.304298   1.0  123.988774  6.313519e-16
Residual  1022.555036  57.0         NaN           NaN
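
To see where the `F` and `PR(>F)` columns come from, the sketch below recomputes them from the `sum_sq` and `df` columns of the table above. Each F-statistic is the term's mean square divided by the residual mean square, and the p-value is the upper tail of the F distribution. The variable names are illustrative only.

```python
import scipy.stats as st

# Values copied from the ANOVA table above.
sum_sq = {"C(supp)": 205.350000, "dose": 2224.304298}
df_effect = {"C(supp)": 1.0, "dose": 1.0}
resid_sum_sq, resid_df = 1022.555036, 57.0

# Residual mean square: the denominator of every F ratio.
ms_resid = resid_sum_sq / resid_df

f_stats, p_values = {}, {}
for term in sum_sq:
    ms_term = sum_sq[term] / df_effect[term]
    f_stats[term] = ms_term / ms_resid
    # Upper-tail probability of the F distribution gives PR(>F).
    p_values[term] = st.f.sf(f_stats[term], df_effect[term], resid_df)
    print(f"{term}: F = {f_stats[term]:.4f}, p = {p_values[term]:.3e}")
```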


## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

In [None]:
# Your comment here

"""
Based on the output, the p-values for both supplement (about 0.0013) and dose (about 6.3e-16) are
below the 5% significance level. Thus, we can conclude that both supplement and dosage have a
statistically significant effect on tooth length. Dose has the much larger F-statistic
(about 124 vs. 11.4), so it accounts for far more of the variation in tooth length.


"""

## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples:

In [41]:
# Your code here

filt_1 = df["supp"] == "VC"
filt_2 = df["supp"] == "OJ"


sample_VC = np.array(df.loc[filt_1, "len"])
sample_OJ = np.array(df.loc[filt_2, "len"])







Now run a t-test between these two groups and print the associated two-sided p-value: 

In [40]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups

tstat, pvalue = st.ttest_ind(sample_VC,sample_OJ)

print("P value = ",pvalue)




P value =  0.06039337122412849


## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is the same as performing a 2-tailed t-test! So, the p-value in the table should be identical to your calculation above.

> Note: there may be a small fractional difference (<0.001) between the two values due to rounding differences between implementations.

In [42]:
# Your code here; conduct an ANOVA F-test of the OJ and VC supplement groups.
# Compare the p-value to that of the t-test above. 
# They should match (there may be a tiny fractional difference due to rounding errors in varying implementations)

formula = 'len ~ C(supp)'
lm = ols(formula,df).fit()
table = sm.stats.anova_lm(lm, typ = 2)
print(table)


               sum_sq    df         F    PR(>F)
C(supp)    205.350000   1.0  3.668253  0.060393
Residual  3246.859333  58.0       NaN       NaN
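
The match is not a coincidence: with two groups, the F-statistic equals the square of the pooled-variance t-statistic (here, F = 3.668 corresponds to |t| of about 1.915). The sketch below checks this on synthetic data, so the numbers are hypothetical and not taken from `ToothGrowth.csv`. It relies on `st.ttest_ind` using equal variances by default; a Welch test (`equal_var=False`) would generally not match.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples (hypothetical values, not the ToothGrowth data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=17.0, scale=8.0, size=30)
group_b = rng.normal(loc=20.0, scale=7.0, size=30)

# Pooled-variance two-sided t-test (equal_var=True is the default).
t_stat, p_t = st.ttest_ind(group_a, group_b)

# One-way ANOVA F-test on the same two groups.
f_stat, p_f = st.f_oneway(group_a, group_b)

print("t^2 =", t_stat ** 2, " F =", f_stat)
print("t-test p =", p_t, " ANOVA p =", p_f)
```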


## Run multiple t-tests

While the 2-category ANOVA test is identical to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem. To investigate this, look at the various sample groups you could create from the 2 features: 

In [43]:
for group in df.groupby(['supp', 'dose'])['len']:
    group_name = group[0]
    data = group[1]
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
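
Six groups give 15 possible pairs. As a rough sketch of why that matters, the code below estimates the chance of at least one false positive across all 15 tests at a 5% significance level. It assumes the tests are independent, which is not strictly true here because the pairs share groups, so treat the figure as an illustration. It also shows the Bonferroni-adjusted threshold, one common (and conservative) correction that this lab does not otherwise cover.

```python
from math import comb

alpha = 0.05   # per-test significance level
n_groups = 6   # supplement-dose groups listed above

# Number of distinct pairs of groups.
n_tests = comb(n_groups, 2)

# Probability of at least one false positive, assuming independent tests.
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: divide alpha by the number of tests.
bonferroni_alpha = alpha / n_tests

print(f"{n_tests} tests, familywise error rate ~ {fwer:.3f}")
print(f"Bonferroni-adjusted threshold = {bonferroni_alpha:.5f}")
```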


Although it is bad practice, examine the effects of calculating multiple t-tests on the various combinations of these groups. To do this, generate all pairwise combinations of the groups above and, for each pair, calculate the p-value of a two-sided t-test. Print each group combination with its associated p-value.

In [52]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)


filt_1 = (df["supp"] == "VC") & (df["dose"] == 0.5)
filt_2 = (df["supp"] == "OJ") & (df["dose"] == 0.5)
filt_3 = (df["supp"] == "VC") & (df["dose"] == 1.0)
filt_4 = (df["supp"] == "OJ") & (df["dose"] == 1.0)
filt_5 = (df["supp"] == "VC") & (df["dose"] == 2.0)
filt_6 = (df["supp"] == "OJ") & (df["dose"] == 2.0)


sample_VC_1 = np.array(df.loc[filt_1, "len"])
sample_OJ_1 = np.array(df.loc[filt_2, "len"])
sample_VC_2 = np.array(df.loc[filt_3, "len"])
sample_OJ_2 = np.array(df.loc[filt_4, "len"])
sample_VC_3 = np.array(df.loc[filt_5, "len"])
sample_OJ_3 = np.array(df.loc[filt_6, "len"])




tstat_1, pvalue_1 = st.ttest_ind(sample_OJ_1,sample_VC_1)
tstat_2, pvalue_2 = st.ttest_ind(sample_OJ_1,sample_VC_2)
tstat_3, pvalue_3 = st.ttest_ind(sample_OJ_1,sample_VC_3)

tstat_4, pvalue_4 = st.ttest_ind(sample_OJ_2,sample_VC_1)
tstat_5, pvalue_5 = st.ttest_ind(sample_OJ_2,sample_VC_2)
tstat_6, pvalue_6 = st.ttest_ind(sample_OJ_2,sample_VC_3)

tstat_7, pvalue_7 = st.ttest_ind(sample_OJ_3,sample_VC_1)
tstat_8, pvalue_8 = st.ttest_ind(sample_OJ_3,sample_VC_2)
tstat_9, pvalue_9 = st.ttest_ind(sample_OJ_3,sample_VC_3)

tstat_10, pvalue_10 = st.ttest_ind(sample_OJ_1,sample_OJ_2)
tstat_11, pvalue_11 = st.ttest_ind(sample_OJ_1,sample_OJ_3)

tstat_12, pvalue_12 = st.ttest_ind(sample_VC_1,sample_VC_2)
tstat_13, pvalue_13 = st.ttest_ind(sample_VC_1,sample_VC_3)

tstat_14, pvalue_14 = st.ttest_ind(sample_OJ_2,sample_OJ_3)
tstat_15, pvalue_15 = st.ttest_ind(sample_VC_2,sample_VC_3)



print("P value for group ('OJ', 0.5) & group ('VC', 0.5) = ",pvalue_1)
print("P value for group ('OJ', 0.5) & group ('VC', 1.0) = ", pvalue_2)
print("P value for group ('OJ', 0.5) & group ('VC', 2.0) = ", pvalue_3)

print("P value for group ('OJ', 1.0) & group ('VC', 0.5) = ", pvalue_4)
print("P value for group ('OJ', 1.0) & group ('VC', 1.0) = ", pvalue_5)
print("P value for group ('OJ', 1.0) & group ('VC', 2.0) = ", pvalue_6)

print("P value for group ('OJ', 2.0) & group ('VC', 0.5) = ", pvalue_7)
print("P value for group ('OJ', 2.0) & group ('VC', 1.0) = ", pvalue_8)
print("P value for group ('OJ', 2.0) & group ('VC', 2.0) = ", pvalue_9)

print("P value for group ('OJ', 0.5) & group ('OJ', 1.0) = ", pvalue_10)
print("P value for group ('OJ', 0.5) & group ('OJ', 2.0) = ", pvalue_11)

print("P value for group ('VC', 0.5) & group ('VC', 1.0) = ", pvalue_12)
print("P value for group ('VC', 0.5) & group ('VC', 2.0) = ", pvalue_13)

print("P value for group ('OJ', 1.0) & group ('OJ', 2.0) = ", pvalue_14)
print("P value for group ('VC', 1.0) & group ('VC', 2.0) = ", pvalue_15)




P value for group ('OJ', 0.5) & group ('VC', 0.5) =  0.005303661339923052
P value for group ('OJ', 0.5) & group ('VC', 1.0) =  0.04223992429368205
P value for group ('OJ', 0.5) & group ('VC', 2.0) =  7.025409196997986e-06
P value for group ('OJ', 1.0) & group ('VC', 0.5) =  1.3372624230559434e-08
P value for group ('OJ', 1.0) & group ('VC', 1.0) =  0.0007807261651774468
P value for group ('OJ', 1.0) & group ('VC', 2.0) =  0.09583711277517494
P value for group ('OJ', 2.0) & group ('VC', 0.5) =  1.3381068810881244e-11
P value for group ('OJ', 2.0) & group ('VC', 1.0) =  2.3131084633597503e-07
P value for group ('OJ', 2.0) & group ('VC', 2.0) =  0.9637097790041267
P value for group ('OJ', 0.5) & group ('OJ', 1.0) =  8.357559281443774e-05
P value for group ('OJ', 0.5) & group ('OJ', 2.0) =  3.4018585295016214e-07
P value for group ('VC', 0.5) & group ('VC', 1.0) =  6.492264598157612e-07
P value for group ('VC', 0.5) & group ('VC', 2.0) =  4.957285658438862e-09
P value for group ('OJ', 1.0)
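
The fifteen hand-written comparisons above can also be generated in a loop. The sketch below is one possible approach using `itertools.combinations` over the `groupby` groups. It runs on a synthetic stand-in DataFrame (hypothetical values, same column names as the lab data) so that it is self-contained; swapping `demo_df` for the real `df` should give the same pairs, though in a different order from the cell above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic stand-in for ToothGrowth (hypothetical values, same column names).
rng = np.random.default_rng(1)
rows = []
for supp in ["OJ", "VC"]:
    for dose in [0.5, 1.0, 2.0]:
        for value in rng.normal(loc=10 + 8 * dose, scale=3.0, size=10):
            rows.append({"len": value, "supp": supp, "dose": dose})
demo_df = pd.DataFrame(rows)

# Map each (supp, dose) key to its array of tooth lengths.
groups = {
    name: data.to_numpy()
    for name, data in demo_df.groupby(["supp", "dose"])["len"]
}

# Two-sided t-test for every unordered pair of groups.
pairwise_p = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    pairwise_p[(name_a, name_b)] = st.ttest_ind(a, b).pvalue
    print(f"P value for group {name_a} & group {name_b} = {pairwise_p[(name_a, name_b)]}")
```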

## Summary

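In this lab, you generated and interpreted an ANOVA table, confirmed that a two-category ANOVA F-test gives the same p-value as a two-tailed t-test, and saw how running many pairwise t-tests leads to the multiple comparisons problem. ANOVA generalizes these testing methods to multiple groups and factors.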