### ANOVA (analysis of variance)

If you have several samples that you think might come from populations with different means, you can run an ANOVA test. This is useful because comparing more than two groups pairwise would require several separate tests, and every extra test raises the chance of a false positive.

ANOVA helps us to check: is the variation between the group means large compared with the variation within the groups?

There is some terminology to be aware of:

Firstly, recall these definitions:

Degrees of freedom : $ \nu = n - 1 $, where n is the number of data points

Sum of squares : $ \sum{(x_i - \bar{x})^2} $ 

Variance (of a sample): $ s^2 = \frac{\sum{(x_i - \bar{x})^2}}{n - 1} $, i.e. the sum of squares divided by the degrees of freedom

And there are some new terms:

Overall mean $ \bar{x} $ : the mean of all data points across every group, also called the grand mean

Sum of Squares Total (SST) - Total variation: the sum of squared distances of all data points from $ \bar{x} $

Sum of Squares Treatment (SStr) - Variation between groups: the squared distance of each group mean from $ \bar{x} $, weighted by the group size

Sum of Squares Errors (SSE) - Variation inside groups: the squared distance of each group element from its own group mean
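
Written out in symbols, the three sums of squares look like this. The index notation is an assumption introduced here for illustration: there are $ k $ groups, and group $ j $ has $ n_j $ values $ x_{ij} $ with group mean $ \bar{x}_j $.

$ SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}{(x_{ij} - \bar{x})^2} $

$ SStr = \sum_{j=1}^{k}{n_j (\bar{x}_j - \bar{x})^2} $

$ SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}{(x_{ij} - \bar{x}_j)^2} $

The total variation splits exactly into the between-group and within-group parts: $ SST = SStr + SSE $.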


In [1]:
### The data

# Data frames are rectangular, so groups of unequal size must be padded. Enter the missing values as NA
# This is still more elegant than a multidimensional array...
groups = data.frame(
    "a"=c(2,2,3,1),
    "b"=c(3,4,4,5),
    "c"=c(5,6,6,7)
)
groups

sampleData = c()
for (group in groups) {
    sampleData = c(sampleData, na.omit(group)) 
}
sampleData

a,b,c
<dbl>,<dbl>,<dbl>
2,3,5
2,4,6
3,4,6
1,5,7


### Hypothesis

$ H_0 : \mu_1 = \mu_2 = \mu_3 $

$ H_a : $ at least one population mean differs from the others

$ \alpha = 0.05 $

In [2]:
alpha = 0.05

### Test statistic

$ F = \frac{SStr \space / \space DFtr}{SSE \space / \space DFE} $

These terms are defined below

In [3]:
# Sample mean (overall mean)

sampleMean = mean(sampleData)
sampleMean

In [4]:
# Sum of squares total

SST = 0
for (sampleDatum in sampleData) {
    SST = SST + (sampleDatum - sampleMean)^2
}
SST

In [5]:
# Degrees of freedom total

DFT = length(sampleData) - 1
DFT

In [6]:
# Sum of squares treatment

SStr = 0
for (group in groups) {
    group = na.omit(group) # drop the NA padding so it does not affect the mean or the count
    groupMean = mean(group)
    groupLength = length(group)
    SStr = SStr + (groupMean - sampleMean)^2 * groupLength
}
SStr

In [7]:
# Degrees of freedom treatment

DFtr = length(groups) - 1
DFtr

In [8]:
# Sum of squares errors

SSE = 0
for (group in groups) {
    group = na.omit(group) # drop the NA padding so it does not affect the mean
    groupMean = mean(group)
    for (value in group) {
        SSE = SSE + (value - groupMean)^2
    }
}
SSE

In [9]:
# Degrees of freedom errors

# With SSE, we measure each data point's distance from the mean of its own group.
# Each group therefore loses one degree of freedom for its mean: DFE = N - k

DFE = length(sampleData) - length(groups)
DFE

In [10]:
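# ACID TEST
# Both identities must hold. If they don't, something is broken
# all.equal() compares with a tolerance, since exact == on floating point sums can fail

stopifnot(isTRUE(all.equal(SST, SStr + SSE)))
stopifnot(DFT == DFtr + DFE)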
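
The loops above can also be written with R's vectorised functions. The following is a minimal sketch, not a replacement for the cells above; it rebuilds the same data, and the names `values`, `allValues`, `grandMean`, `groupMeans` and `groupSizes` are illustrative choices rather than anything defined earlier in the notebook.

```r
# Same data as in the first cell
groups <- data.frame(a = c(2, 2, 3, 1), b = c(3, 4, 4, 5), c = c(5, 6, 6, 7))

# One numeric vector per group, with any NA padding dropped
values <- lapply(groups, function(g) g[!is.na(g)])
allValues <- unlist(values)

grandMean  <- mean(allValues)
groupMeans <- sapply(values, mean)
groupSizes <- sapply(values, length)

# Sums of squares: total, treatment (between groups), errors (within groups)
SST  <- sum((allValues - grandMean)^2)
SStr <- sum(groupSizes * (groupMeans - grandMean)^2)
SSE  <- sum(sapply(values, function(v) sum((v - mean(v))^2)))

# Degrees of freedom: N - 1, k - 1 and N - k
DFT  <- length(allValues) - 1
DFtr <- length(values) - 1
DFE  <- length(allValues) - length(values)
```

Working group by group with `lapply` keeps the NA handling in one place, so groups of unequal size need no special treatment in the later lines.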

In [15]:
# And now, the test statistic

testStatistic = (SStr / DFtr) / (SSE / DFE)
testStatistic
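
As a hand check for the data in the first cell: the overall mean is 4 and the group means are 2, 4 and 6, with four values in each group.

$ SStr = 4(2 - 4)^2 + 4(4 - 4)^2 + 4(6 - 4)^2 = 32 $

$ SSE = 2 + 2 + 2 = 6 $

$ F = \frac{32 \space / \space 2}{6 \space / \space 9} = 24 $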

In [16]:
# Compare the test statistic with the critical value given by the F distribution
# If alpha is 0.05, then criticalValue is the lower bound of the region of rejection

criticalValue = qf(1-alpha, df1 = DFtr, df2 = DFE)
criticalValue

If the test statistic falls within the region of rejection (that is, testStatistic > criticalValue), we can reject the null hypothesis.
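
The decision can be cross-checked against R's built-in one-way test and expressed as a p-value. This is a sketch under stated assumptions: it rebuilds the data from the first cell, uses `oneway.test` with `var.equal = TRUE` (the classical equal-variance ANOVA, which is what the manual calculation corresponds to), and the names `long`, `check`, `fValue`, `dfs`, `pValue` and `rejectNull` are illustrative.

```r
# Same data as in the first cell
groups <- data.frame(a = c(2, 2, 3, 1), b = c(3, 4, 4, 5), c = c(5, 6, 6, 7))

# stack() reshapes to long format: `values` holds the data, `ind` the group label
long <- na.omit(stack(groups))

# Classical one-way ANOVA (var.equal = TRUE assumes equal group variances)
check  <- oneway.test(values ~ ind, data = long, var.equal = TRUE)
fValue <- unname(check$statistic)
dfs    <- unname(check$parameter)  # numerator and denominator degrees of freedom

# p-value: upper-tail area of the F distribution beyond the test statistic
pValue <- pf(fValue, df1 = dfs[1], df2 = dfs[2], lower.tail = FALSE)

# Reject H0 when the statistic exceeds the critical value (equivalently, pValue < alpha)
alpha <- 0.05
criticalValue <- qf(1 - alpha, df1 = dfs[1], df2 = dfs[2])
rejectNull <- fValue > criticalValue
```

Comparing `pValue` with `alpha` and comparing `fValue` with `criticalValue` always lead to the same decision; the p-value additionally shows how far into the rejection region the statistic lies.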