# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [29]:
# Your code here
import pandas as pd
from statsmodels.formula.api import ols
import statsmodels.api as sm
df = pd.read_csv('ToothGrowth.csv')

In [32]:
df.tail()

Unnamed: 0,len,supp,dose
55,30.9,OJ,2.0
56,26.4,OJ,2.0
57,27.3,OJ,2.0
58,29.4,OJ,2.0
59,23.0,OJ,2.0


## Generate the ANOVA table

Now generate an ANOVA table to analyze the influence of the supplement type and dosage on tooth length:

In [30]:
# Your code here

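# Fit a linear model with both factors treated as categorical
lm = ols('len ~ C(supp) + C(dose)', df).fit()
# Type 2 ANOVA table; `sm` is the statsmodels.api alias imported above
table = sm.stats.anova_lm(lm, typ=2)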
print(table)

               sum_sq    df          F        PR(>F)
C(supp)    205.350000   1.0  14.016638  4.292793e-04
C(dose)   2426.434333   2.0  82.810935  1.871163e-17
Residual   820.425000  56.0        NaN           NaN
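
As a sanity check, the `F` and `PR(>F)` columns can be reproduced from the `sum_sq` and `df` columns: each F statistic is the factor's mean square divided by the residual mean square. The following is a minimal sketch that hard-codes the values printed above; the helper name `f_test_from_table` is hypothetical, and `scipy.stats.f.sf` supplies the upper-tail probability.

```python
from scipy import stats

# Values copied from the ANOVA table printed above
RESIDUAL_SS, RESIDUAL_DF = 820.425, 56.0
rows = {
    "C(supp)": (205.350000, 1.0),
    "C(dose)": (2426.434333, 2.0),
}

def f_test_from_table(sum_sq, df, resid_ss=RESIDUAL_SS, resid_df=RESIDUAL_DF):
    """Hypothetical helper: F = (factor mean square) / (residual mean square)."""
    f_stat = (sum_sq / df) / (resid_ss / resid_df)
    p_value = stats.f.sf(f_stat, df, resid_df)  # upper-tail probability
    return f_stat, p_value

results = {name: f_test_from_table(ss, df) for name, (ss, df) in rows.items()}
for name, (f_stat, p_value) in results.items():
    print(f"{name}: F = {f_stat:.4f}, p = {p_value:.3e}")
```

The printed values should agree with the table up to the rounding of the copied inputs.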


## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

In [None]:
# Your comment here

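The ANOVA table shows a very small p-value for supplement (about 4.3e-04) and an even smaller one for dose (about 1.9e-17), indicating that both factors have a statistically significant effect on the response variable, tooth length. Dose also accounts for far more of the variation (sum of squares 2426.4 versus 205.4), so it has the greater influence.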

## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples:

In [38]:
# Your code here
# Split the data by supplement type
VC = df[df['supp'] == 'VC']
OJ = df[df['supp'] == 'OJ']
# Tooth lengths for each group
VC_l = VC['len']
OJ_l = OJ['len']
# Group means, for reference
VC_mean = VC_l.mean()
OJ_mean = OJ_l.mean()

Now run a t-test between these two groups and print the associated two-sided p-value: 

In [39]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups
from scipy import stats
# equal_var=False runs Welch's t-test, which does not assume equal variances
stats.ttest_ind(VC_l, OJ_l, equal_var=False)

Ttest_indResult(statistic=-1.91526826869527, pvalue=0.06063450788093387)

## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

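Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is equivalent to a 2-tailed t-test that assumes equal variances (the F statistic is the square of the t statistic). So, the p-value in the table should closely match your calculation above.

> Note: there may be a small difference (<0.001) between the two values. The t-test above used `equal_var=False` (Welch's t-test), which adjusts the degrees of freedom, whereas the F-test pools the variances; with `equal_var=True` the two p-values agree up to floating-point error.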

In [None]:
# Your code here; conduct an ANOVA F-test of the oj and vc supplement groups.
# Compare the p-value to that of the t-test above. 
# They should closely match (a tiny difference remains because the t-test above used Welch's correction)

In [41]:
# One-factor model: supplement only
lm2 = ols('len ~ C(supp)', df).fit()
table2 = sm.stats.anova_lm(lm2, typ=2)
print(table2)

               sum_sq    df         F    PR(>F)
C(supp)    205.350000   1.0  3.668253  0.060393
Residual  3246.859333  58.0       NaN       NaN
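
The equivalence can also be checked numerically. The sketch below uses synthetic data (an assumption for illustration, not the ToothGrowth values) and compares a pooled-variance t-test against a one-way ANOVA using `scipy.stats.ttest_ind` and `scipy.stats.f_oneway`. It also runs Welch's t-test to show where a small discrepancy in the p-value can come from.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not the ToothGrowth values)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=17.0, scale=8.0, size=30)
group_b = rng.normal(loc=20.0, scale=7.0, size=30)

# Pooled-variance (Student's) t-test and one-way ANOVA on the same two groups
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=True)
f_stat, f_p = stats.f_oneway(group_a, group_b)

# Welch's t-test drops the equal-variance assumption, so its p-value can differ
welch_stat, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t^2 = {t_stat**2:.6f}, F = {f_stat:.6f}")
print(f"pooled t-test p = {t_p:.6f}, ANOVA p = {f_p:.6f}, Welch p = {welch_p:.6f}")
```

With equal group sizes, as here, the Welch and pooled t statistics coincide and only the degrees of freedom (and therefore the p-value) differ.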


## Run multiple t-tests

While the 2-category ANOVA test is equivalent to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem: each additional test raises the chance of at least one false positive. To investigate this, look at the various sample groups you could create from the 2 features:

In [44]:
pd.DataFrame(df.groupby(['supp', 'dose'])['len'])

Unnamed: 0,0,1
0,"(OJ, 0.5)",30 15.2 31 21.5 32 17.6 33 9.7 34...
1,"(OJ, 1.0)",40 19.7 41 23.3 42 23.6 43 26.4 44...
2,"(OJ, 2.0)",50 25.5 51 26.4 52 22.4 53 24.5 54...
3,"(VC, 0.5)",0 4.2 1 11.5 2 7.3 3 5.8 4 ...
4,"(VC, 1.0)",10 16.5 11 16.5 12 15.2 13 17.3 14...
5,"(VC, 2.0)",20 23.6 21 18.5 22 33.9 23 25.5 24...


In [42]:
# Each item yielded by groupby is a (group_name, data) pair
for group_name, data in df.groupby(['supp', 'dose'])['len']:
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
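
Six groups yield 15 possible pairs. As a rough sketch of why that matters, suppose each test uses a 0.05 threshold and the tests are independent (an assumption that does not hold exactly here, since the pairs share groups). The chance of at least one false positive across all tests is then far above 5%:

```python
from math import comb

n_groups = 6
alpha = 0.05  # per-test significance level (assumed)

n_pairs = comb(n_groups, 2)
# Probability of at least one false positive across independent tests
familywise_error = 1 - (1 - alpha) ** n_pairs

print(n_pairs)
print(round(familywise_error, 3))
```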


Although it is bad practice, examine the effects of calculating multiple t-tests across the various combinations of these groups. To do this, generate all pairwise combinations of the groups above. For each pair, calculate the p-value of a 2-sided t-test, then print the pair along with its p-value.

In [None]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)
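
One possible approach is sketched below, using `itertools.combinations` to enumerate the pairs. To stay self-contained it builds a synthetic stand-in, `demo_df` (a hypothetical name with made-up values), that has the same `supp`, `dose`, and `len` columns; in the lab you would use the `df` loaded earlier instead.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in with the same columns as ToothGrowth.csv (assumed values);
# in the lab, use the `df` loaded earlier instead.
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "supp": np.repeat(["OJ", "VC"], 30),
    "dose": np.tile(np.repeat([0.5, 1.0, 2.0], 10), 2),
    "len": rng.normal(loc=19.0, scale=7.0, size=60),
})

# Map each (supp, dose) group name to its tooth-length values
groups = {name: data for name, data in demo_df.groupby(["supp", "dose"])["len"]}

# Two-sided Welch t-test for every pair of groups
pairwise_p = {}
for name_a, name_b in combinations(groups, 2):
    result = stats.ttest_ind(groups[name_a], groups[name_b], equal_var=False)
    pairwise_p[(name_a, name_b)] = result.pvalue
    print(name_a, name_b, result.pvalue)
```

Because 15 tests are run, some small p-values can appear by chance alone, which is the multiple comparisons problem this section describes.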

## Summary

In this lab, you used ANOVA to generalize hypothesis testing to multiple groups and factors, and compared its results to those of t-tests.