# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [1]:
# Your code here
import numpy as np
import pandas as pd
df = pd.read_csv('ToothGrowth.csv')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 3 columns):
len     60 non-null float64
supp    60 non-null object
dose    60 non-null float64
dtypes: float64(2), object(1)
memory usage: 1.5+ KB


## Generate the ANOVA table

Now generate an ANOVA table to analyze the influence of the supplement and dosage on tooth length:

In [3]:
# Your code here
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [29]:
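# Model tooth length on supplement (categorical) and dose (numeric)
formula = 'len ~ C(supp) + dose'
lm = ols(formula, df).fit()
# Type II ANOVA table; print(table) displays it
table = sm.stats.anova_lm(lm, typ=2)
print(lm.summary())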

                            OLS Regression Results                            
Dep. Variable:                    len   R-squared:                       0.704
Model:                            OLS   Adj. R-squared:                  0.693
Method:                 Least Squares   F-statistic:                     67.72
Date:                Sat, 11 Apr 2020   Prob (F-statistic):           8.72e-16
Time:                        18:52:29   Log-Likelihood:                -170.21
No. Observations:                  60   AIC:                             346.4
Df Residuals:                      57   BIC:                             352.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         9.2725      1.282      7.231

## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

In [None]:
# Your comment here

## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples:

In [13]:
# Your code here
with_oj = df[df['supp'] == 'OJ']
with_vc = df[df['supp'] == 'VC']

In [9]:
from scipy import stats

Now run a t-test between these two groups and print the associated two-sided p-value: 

In [16]:
with_vc.head()

    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5


In [17]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups
print(stats.ttest_ind(with_oj['len'], with_vc['len']))

Ttest_indResult(statistic=1.91526826869527, pvalue=0.06039337122412849)


## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is the same as performing a 2-tailed t-test! So, the p-value in the table should be identical to your calculation above.

> Note: there may be a small fractional difference (<0.001) between the two values due to rounding differences between implementations.

In [32]:
# Your code here; conduct an ANOVA F-test of the oj and vc supplement groups.
formula = 'len ~ C(supp)'
lm = ols(formula, df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(lm.summary())
# Compare the p-value to that of the t-test above. 
# They should match (there may be a tiny fractional difference due to rounding errors in varying implementations)

                            OLS Regression Results                            
Dep. Variable:                    len   R-squared:                       0.059
Model:                            OLS   Adj. R-squared:                  0.043
Method:                 Least Squares   F-statistic:                     3.668
Date:                Sat, 11 Apr 2020   Prob (F-statistic):             0.0604
Time:                        18:54:09   Log-Likelihood:                -204.87
No. Observations:                  60   AIC:                             413.7
Df Residuals:                      58   BIC:                             417.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        20.6633      1.366     15.127
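
The equivalence can also be checked through the test statistics: for two groups, the F-statistic equals the square of the pooled-variance t-statistic. In the outputs above, 1.915² ≈ 3.668, which matches the reported F-statistic. The following is a minimal sketch of the same check using `scipy.stats.f_oneway` on synthetic samples; the arrays `group_a` and `group_b` are made-up stand-ins, not the ToothGrowth data.

```python
import numpy as np
from scipy import stats

# Two synthetic samples (made-up values, used only to illustrate the identity)
rng = np.random.default_rng(1)
group_a = rng.normal(20.0, 7.0, 30)
group_b = rng.normal(17.0, 8.0, 30)

# Two-sided, pooled-variance t-test (scipy's default for ttest_ind)
t_res = stats.ttest_ind(group_a, group_b)

# One-way ANOVA F-test on the same two groups
f_res = stats.f_oneway(group_a, group_b)

print('t squared:', t_res.statistic ** 2, ' F:', f_res.statistic)
print('t-test p: ', t_res.pvalue, ' F-test p:', f_res.pvalue)
```

Note that the identity relies on the equal-variance t-test; Welch's variant (`equal_var=False`) would generally give a different p-value.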

## Run multiple t-tests

While the 2-category ANOVA test is identical to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem: every additional test raises the chance of at least one false positive. To investigate this, look at the various sample groups you could create from the two features (`supp` and `dose`):

In [7]:
for group in df.groupby(['supp', 'dose'])['len']:
    group_name = group[0]
    data = group[1]
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
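
To get a feel for the size of the multiple comparisons problem with these six groups, the sketch below counts the possible pairs and estimates the chance of at least one false positive. It assumes a significance level of 0.05 and treats the tests as independent, which is a simplification (the pairwise tests share data); the variable names are illustrative.

```python
from math import comb

n_groups = 6   # supplement-dose groups listed above
alpha = 0.05   # assumed per-test significance level

# Number of distinct pairs that can be compared
n_pairs = comb(n_groups, 2)

# Probability of at least one false positive if all null hypotheses hold
# and the tests were independent (an approximation here)
fwer = 1 - (1 - alpha) ** n_pairs

print(n_pairs, round(fwer, 3))
```

Under these assumptions there are 15 pairwise tests, and the chance of at least one spurious "significant" result is roughly 0.54, far above the nominal 0.05.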


Although it is bad practice, examine the effect of running multiple t-tests across these supplement-dose groups. To do this, generate all pairwise combinations of the groups and, for each pair, calculate the p-value of a two-sided t-test. Print each pair of groups together with its p-value.

In [None]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)
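
One possible approach, shown here as a hedged sketch rather than the reference solution, is to collect the groups from `groupby` and loop over `itertools.combinations`. The block runs on a synthetic stand-in with the same column names; the `demo` DataFrame, its made-up values, and the names `groups` and `pairwise_p` are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in with the same columns as ToothGrowth.csv (made-up values)
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    'supp': np.repeat(['OJ', 'VC'], 30),
    'dose': np.tile(np.repeat([0.5, 1.0, 2.0], 10), 2),
})
demo['len'] = 8 + 9 * demo['dose'] + rng.normal(0, 3, len(demo))

# Map each (supp, dose) group name to its tooth-length values
groups = {name: data for name, data in demo.groupby(['supp', 'dose'])['len']}

# Two-sided t-test for every unordered pair of groups
pairwise_p = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    pairwise_p[(name_a, name_b)] = stats.ttest_ind(a, b).pvalue
    print(name_a, name_b, pairwise_p[(name_a, name_b)])
```

Swapping `demo` for the lab's `df` should apply the same loop to the real data.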

## Summary

In this lab, you used ANOVA to generalize hypothesis testing to multiple groups and factors, and compared its results with those of t-tests.