## <center> ANOVA </center>

The t-test works well when comparing two groups, but sometimes we want to compare more than two groups at the same time. For example, to test whether voter age differs by a categorical variable such as race, we have to compare the means across every level (group) of that variable. We could carry out a separate t-test for each pair of groups, but conducting many tests increases the chance of at least one false positive. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.
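To make the false-positive problem concrete, here is a minimal sketch of how the chance of at least one false positive grows with the number of pairwise tests. It assumes every test uses a significance level of 0.05 and that the tests are independent; pairwise t-tests on shared groups are not strictly independent, so treat the numbers as a rough guide. The function name `familywise_error_rate` is made up for this illustration.

```python
from math import comb


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Number of pairwise comparisons and the resulting error rate
# for 3, 4 and 5 groups
for n_groups in (3, 4, 5):
    n_pairs = comb(n_groups, 2)
    rate = familywise_error_rate(n_pairs)
    print(n_groups, "groups:", n_pairs, "tests, error rate", round(rate, 3))
```

Even with only three groups, the three pairwise tests push the overall false-positive rate well above the nominal 5%.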

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

### One-way ANOVA

The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one categorical variable. It essentially answers the question: do any of the group means differ from one another? 

The scipy library has a function for carrying out one-way ANOVA tests called scipy.stats.f_oneway().

Ex: Twenty-two patients undergoing cardiac bypass surgery were randomized to one of three ventilation groups:
    
- Group I: Patients received 50% nitrous oxide and 50% oxygen mixture continuously for 24 h.
- Group II: Patients received a 50% nitrous oxide and 50% oxygen mixture only during the operation.
- Group III: Patients received no nitrous oxide but received 35-50% oxygen for 24 h.
    
The data show red cell folate levels for the three groups after 24 h of ventilation.

In [2]:
data = pd.read_csv("altman_910.csv")
data.head()

Unnamed: 0,value,group
0,243,1
1,251,1
2,275,1
3,291,1
4,347,1


In [3]:
# Split the folate values by ventilation group
g1 = data.loc[data["group"] == 1, "value"]
g2 = data.loc[data["group"] == 2, "value"]
g3 = data.loc[data["group"] == 3, "value"]

# Do the one-way ANOVA
F_statistic, pVal = stats.f_oneway(g1, g2, g3)
print((F_statistic, pVal))

(3.7113359882669763, 0.043589334959178244)


The test yields an F-statistic of about 3.71 and a p-value of about 0.044. At the 5% significance level, this indicates that at least one group mean differs from the others. The test alone does not tell us which groups differ.
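To show what the F-statistic measures, the sketch below computes it by hand as the ratio of between-group to within-group mean squares and compares the result with `scipy.stats.f_oneway()`. The three samples are hypothetical values chosen for illustration, not the folate data, and the helper `one_way_f` is a name introduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
samples = [
    np.array([4.1, 5.0, 5.6, 6.2]),
    np.array([5.9, 6.4, 7.1, 7.7, 8.0]),
    np.array([4.8, 5.2, 6.0]),
]


def one_way_f(groups):
    """Return (F, p) from between- and within-group sums of squares."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # Upper tail of the F distribution gives the p-value
    p_value = stats.f.sf(f_stat, df_between, df_within)
    return f_stat, p_value


f_manual, p_manual = one_way_f(samples)
f_scipy, p_scipy = stats.f_oneway(*samples)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

A large F means the group means are spread out relative to the scatter within each group, which is why a large F goes with a small p-value.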

Another way to carry out an ANOVA test is to use the statsmodels library, which allows you to specify a model with a formula syntax that mirrors that used by the R programming language.

In [4]:
# Fit a linear model with group as a categorical predictor
model = ols('value ~ C(group)', data=data).fit()

anova_result = sm.stats.anova_lm(model)
anova_result

,df,sum_sq,mean_sq,F,PR(>F)
C(group),2.0,15515.766414,7757.883207,3.711336,0.043589
Residual,19.0,39716.097222,2090.320906,,
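The columns of this table are related by simple arithmetic, and the following sketch reproduces the F-statistic and p-value from the sums of squares and degrees of freedom. The numbers are copied from the table above; using `scipy.stats.f.sf` for the upper tail of the F distribution is one reasonable way to get the p-value, not necessarily what statsmodels does internally.

```python
from scipy import stats

# Values copied from the one-way ANOVA table above
ss_group, df_group = 15515.766414, 2
ss_resid, df_resid = 39716.097222, 19

# Mean square = sum of squares / degrees of freedom
ms_group = ss_group / df_group
ms_resid = ss_resid / df_resid

# F is the ratio of the group mean square to the residual mean square
f_stat = ms_group / ms_resid

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_stat, df_group, df_resid)

print(round(f_stat, 4), round(p_value, 4))
```

The result agrees with both the table and the earlier `f_oneway()` output, confirming that the two approaches run the same test.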


### Two-way ANOVA

Compared to the one-way ANOVA, the two-way ANOVA adds a new element. We can test not only whether each factor has a significant effect, but also whether the interaction between the factors has a significant influence on the response.

Ex: As an example, take measurements of fetal head circumference made by four observers on three fetuses, from a study investigating the reproducibility of ultrasonic fetal head circumference data.

In [5]:
data = pd.read_csv("altman_12_6.csv")
data.head()

Unnamed: 0,hs,fetus,observer
0,14.3,1,1
1,14.0,1,1
2,14.8,1,1
3,13.6,1,2
4,13.6,1,2


In [8]:
# Determine the ANOVA with interaction
formula = 'hs ~ C(fetus) + C(observer) + C(fetus):C(observer)'
lm = ols(formula, data=data).fit()
anovaResults = sm.stats.anova_lm(lm)
anovaResults

,df,sum_sq,mean_sq,F,PR(>F)
C(fetus),2.0,324.008889,162.004444,2113.101449,1.051039e-27
C(observer),3.0,1.198611,0.399537,5.211353,0.006497055
C(fetus):C(observer),6.0,0.562222,0.093704,1.222222,0.3295509
Residual,24.0,1.84,0.076667,,


In words: as expected, different fetuses show highly significant differences in head size (p < 0.001), and the choice of observer also has a significant effect (p < 0.05). However, the interaction between fetus and observer is not significant (p > 0.05), so no individual observer was significantly off with any individual fetus.