# ANOVA

ANOVA (analysis of variance) is a method for comparing the means of more than two populations. So far, we have considered only a single population or, at most, two populations. A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two independent variables. The statistical distribution used in ANOVA is the F-distribution, whose characteristics are as follows:

1. The F-distribution has a single tail (toward the right) and takes only non-negative values.

![](data/f_dist.png)

2. The F-statistic, which is the key statistic in ANOVA, is the ratio of the variation between the sample means to the variation within the samples. The formula is as follows.
$$F = \frac{\text{variation between sample means}}{\text{variation within the samples}}$$  


3. The different populations are referred to as treatments.
4. A high value of the F statistic implies that the variation between samples is considerable compared to variation within the samples. In other words, the populations or treatments from which the samples are drawn are actually different from one another.
5. Random variations between treatments are more likely to occur when the variation within the sample is considerable.
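
For a one-way design, the ratio in item 2 is usually written in terms of mean squares. The notation here is introduced only for illustration and is not from the text above: $k$ treatments, $n_i$ observations in treatment $i$, $N$ observations in total, treatment means $\bar{x}_i$, and grand mean $\bar{x}$.

$$MS_{between} = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2}{k - 1}, \qquad MS_{within} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}{N - k}, \qquad F = \frac{MS_{between}}{MS_{within}}$$

Under the null hypothesis, this statistic follows an F-distribution with $k - 1$ and $N - k$ degrees of freedom.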

Use a one-way ANOVA when you have collected data about one categorical independent variable and one quantitative dependent variable. The independent variable should have at least three levels (i.e. at least three different groups or categories).

ANOVA tells you if the dependent variable changes according to the level of the independent variable. For example:

+ Your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out if there is a difference in hours of sleep per night.
+ Your independent variable is brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference in the price per 100ml.

ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the treatment levels are different from the overall mean of the dependent variable.

If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.

ANOVA uses the F-test for statistical significance. This allows for comparison of multiple means at once, because the error is calculated for the whole set of comparisons rather than for each individual pairwise comparison (which would happen with a series of t-tests).
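
To make the cost of repeated pairwise t-tests concrete, the sketch below estimates the chance of at least one false positive when every pair of groups is tested separately. It assumes the tests are independent, which is a simplification (pairwise tests that share a group are correlated), and the helper name `familywise_error` is hypothetical.

```python
from math import comb


def familywise_error(num_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes the pairwise tests are independent, which is only an approximation.
    """
    num_tests = comb(num_groups, 2)  # every pair of groups gets its own t-test
    return num_tests, 1 - (1 - alpha) ** num_tests


for k in (3, 4, 5):
    tests, fwer = familywise_error(k)
    print(f"{k} groups -> {tests} pairwise tests, familywise error ~ {fwer:.3f}")
```

Under this approximation the error rate climbs well above the nominal 5% as groups are added, which is the motivation for a single overall F-test.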

The F-test compares the variation between the group means with the variation within the groups. If the variance within groups is small relative to the variance between groups, the F-test will find a higher F-value, and it becomes less likely that a difference this large would arise by chance alone if all group means were truly equal.

The assumptions of the ANOVA test are the same as the general assumptions for any parametric test:

+ **Independence of observations:** the data were collected using statistically-valid methods, and there are no hidden relationships among observations. If your data fail to meet this assumption because you have a confounding variable that you need to control for statistically, use an ANOVA with blocking variables.
+ **Normally-distributed response variable:** The values of the dependent variable follow a normal distribution.
+ **Homogeneity of variance:** The variation within each group being compared is similar for every group. If the variances are different among the groups, then ANOVA probably isn’t the right fit for the data.
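
The last two assumptions can be screened with standard tests before running ANOVA. The following is a minimal sketch, not a prescribed workflow: the data are simulated purely for illustration (they are not from this text), normality is checked on the within-group residuals with the Shapiro-Wilk test, and equal variances are checked with Levene's test.

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only: three groups with the same spread
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=5.0, size=20) for mu in (50, 52, 55)]

# Normality: test the residuals (each value minus its own group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value:       {levene_p:.3f}")
```

A small p-value from either test (for example, below 0.05) would suggest that the corresponding assumption is questionable for the data at hand.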

## One-Way ANOVA

A few agricultural research scientists have planted a new variety of cotton called “AB cotton.” They have used three different fertilizers (A, B, and C) on three separate plots of this variety. The researchers want to find out if the yield varies with the type of fertilizer used. Yields in bushels per acre are listed in the table below. Conduct an ANOVA test at a 5% level of significance to see if the researchers can conclude that there is a difference in yields.

| Fertilizer A | Fertilizer B | Fertilizer C |
|--------------|--------------|--------------|
|     40       |     45       |     55       |
|     30       |     35       |     40       |
|     35       |     55       |     30       |
|     45       |     25       |     20       |

Null hypothesis: $H_0 : \mu_1 = \mu_2 = \mu_3$  
Alternative hypothesis: $H_1$: at least one of the means $\mu_1, \mu_2, \mu_3$ differs from the others

Level of significance: $\alpha = 0.05$

```python
import scipy.stats as stats

# Yields (bushels per acre) for each fertilizer
a = [40, 30, 35, 45]
b = [45, 35, 55, 25]
c = [55, 40, 30, 20]

stats.f_oneway(a, b, c)
```

```
F_onewayResult(statistic=0.10144927536231883, pvalue=0.9045455407589628)
```

Since the calculated p-value (0.905) is greater than 0.05, we fail to reject the null hypothesis. At the 5% significance level, there is no significant difference in yield between the three fertilizers.
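
To see where the F-statistic and p-value for the fertilizer data come from, the sketch below recomputes them by hand from the between-group and within-group sums of squares and compares the result with the 5% critical value of the F-distribution. Variable names such as `ss_between` and `f_crit` are illustrative choices, not part of any library.

```python
import numpy as np
from scipy import stats

# Fertilizer yields (bushels per acre) from the table above
groups = [
    np.array([40, 30, 35, 45], dtype=float),  # Fertilizer A
    np.array([45, 35, 55, 25], dtype=float),  # Fertilizer B
    np.array([55, 40, 30, 20], dtype=float),  # Fertilizer C
]

k = len(groups)                          # number of treatments
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation between the treatment means and within each treatment
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)  # right-tail area
f_crit = stats.f.ppf(0.95, df_between, df_within)    # 5% critical value

print(f"F = {f_stat:.4f}, p = {p_value:.4f}, critical F = {f_crit:.4f}")
```

The hand-computed F-statistic (about 0.10) falls far below the critical value (roughly 4.26 for 2 and 9 degrees of freedom), which is the same conclusion as the p-value comparison.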

## Two-Way ANOVA

A botanist wants to know whether or not plant growth is influenced by sunlight exposure and watering frequency. She plants 30 seeds and lets them grow for two months under different conditions for sunlight exposure and watering frequency. After two months, she records the height of each plant, in inches.

```python
import numpy as np
import pandas as pd

# create data: 2 watering frequencies x 3 sunlight levels x 5 plants each
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})
```

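```python
df[:10]
```

|   | water | sun | height |
|---|-------|-----|--------|
| 0 | daily | low | 6 |
| 1 | daily | low | 6 |
| 2 | daily | low | 6 |
| 3 | daily | low | 5 |
| 4 | daily | low | 6 |
| 5 | daily | med | 5 |
| 6 | daily | med | 5 |
| 7 | daily | med | 6 |
| 8 | daily | med | 4 |
| 9 | daily | med | 5 |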


```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

# perform two-way ANOVA with both main effects and their interaction
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
```

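|                 | sum_sq    | df   | F       | PR(>F)   |
|-----------------|-----------|------|---------|----------|
| C(water)        | 8.533333  | 1.0  | 16.0    | 0.000527 |
| C(sun)          | 24.866667 | 2.0  | 23.3125 | 2e-06    |
| C(water):C(sun) | 2.466667  | 2.0  | 2.3125  | 0.120667 |
| Residual        | 12.8      | 24.0 |         |          |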


We can see the following p-values for each of the factors in the table:

**water:** p-value = .000527  
**sun:** p-value = .000002  
**water × sun:** p-value = .120667  

Since the p-values for water and sun are both less than .05, this means that both factors have a statistically significant effect on plant height.

And since the p-value for the interaction effect (.120667) is not less than .05, this tells us that there is no significant interaction effect between sunlight exposure and watering frequency.
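
The ANOVA table says which effects are significant but not their direction. As a rough complement, the sketch below summarises the same plant data with group means; the variable names `cell_means`, `water_means`, and `sun_means` are illustrative.

```python
import numpy as np
import pandas as pd

# Same plant-growth data as in the two-way ANOVA example
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# Mean height for each water x sun combination (one cell per treatment)
cell_means = df.pivot_table(values='height', index='water',
                            columns='sun', aggfunc='mean')

# Marginal means for each factor on its own
water_means = df.groupby('water')['height'].mean()
sun_means = df.groupby('sun')['height'].mean()

print(cell_means)
print(water_means)
print(sun_means)
```

In this data, plants watered daily average about 5.87 inches against 4.8 inches for weekly watering, and plants in high sunlight average 6.6 inches against 4.9 (low) and 4.5 (med), which is consistent with both main effects being significant.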