### ANOVA (analysis of variance)

If you have several samples whose population means you suspect are different, you can run an ANOVA test. This is useful because checking for a difference among more than two groups would otherwise require several pairwise tests.

ANOVA helps us to check: is the variation between the group means large compared with the variation within the groups?

There is some terminology to be aware of:

Firstly, recall these definitions:

Degrees of freedom : $ \nu = n - 1 $, where n is the length of data

Sum of squares : $ \sum{(x_i - \bar{x})^2} $ 

Variance: $ \sigma^2 = \frac{\sum{(x_i - \bar{x})^2}}{n} $ (the sample variance divides by $ \nu = n - 1 $ instead of $ n $)

And there are some new terms

Overall mean $ \bar{x} $ : the mean for all data points in the entire sample. Otherwise just the sample mean

$ m $ - the number of groups being compared

$ n $ - the total number of data points across all groups

Sum of Squares Treatment (SStr or SSR) - Variation between groups (squared distance from each group mean to the overall mean, weighted by group size). Also known as "explained variation" in linear regression.

Sum of Squares Errors (SSE) - Variation inside groups (squared distance from each group element to its group mean). Also known as "unexplained variation" in linear regression.

Sum of Squares Total (SST) - Variation of all data points from $ \bar{x} $. Sometimes known as $ S_{yy} $

Degrees of Freedom Treatment (DFtr) - $ m - 1 $

Degrees of Freedom Error (DFE) - $ n - m $

Degrees of Freedom Total (DFT) - $ n - 1 $

Mean-Squared Treatment (MStr) - $ SStr / DFtr $

Mean-Squared Error (MSE) - $ SSE / DFE $
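
Written out in full, the three sums of squares look like this. The per-group notation is an assumption of this note rather than something defined above: $ x_{ij} $ is observation $ j $ in group $ i $, $ n_i $ is the size of group $ i $, and $ \bar{x}_i $ is its mean.

$ SStr = \sum_{i=1}^{m} n_i (\bar{x}_i - \bar{x})^2 $

$ SSE = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 $

$ SST = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = SStr + SSE $

The degrees of freedom split the same way: $ (n - 1) = (m - 1) + (n - m) $.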

### ANOVA Table

A common way to represent the data used in ANOVA is with this tabular layout. Replace each cell with the appropriate value:

| Source (of variance) | Sum of Squares | Degrees of Freedom | Mean-Square        | F or Test Statistic |
| :------------------- | :------------- | :----------------- | :----------------- | :------------------ |
| Treatments           | SStr           | DFtr = $ m - 1 $   | MStr = SStr / DFtr | MStr / MSE          |
| Error or Residuals   | SSE            | DFE = $ n - m $    | MSE = SSE / DFE    |                     |
| Total                | SST = SStr + SSE | DFT = $ n - 1 $  |                    |                     |

In the case of linear regression, this table is useful for comparing the SStr and SSE. If the SSE is large relative to the SStr, the model explains little of the variation, which suggests a better model may be available.
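
As a rough sketch of how the table fits together, the cells can be filled in from just the two sums of squares, $ m $ and $ n $. The helper name `anovaTable` and the input numbers are hypothetical; they were chosen only so the arithmetic is easy to follow and do not come from the data used later.

```r
# Hypothetical helper (not part of the original notebook): assemble a one-way
# ANOVA table from the two sums of squares, the group count m, and the total
# number of observations n.
anovaTable = function(SStr, SSE, m, n) {
    DFtr = m - 1
    DFE = n - m
    MStr = SStr / DFtr
    MSE = SSE / DFE
    data.frame(
        Source = c("Treatments", "Error", "Total"),
        SumOfSquares = c(SStr, SSE, SStr + SSE),
        DegreesOfFreedom = c(DFtr, DFE, n - 1),
        MeanSquare = c(MStr, MSE, NA),
        FStatistic = c(MStr / MSE, NA, NA)
    )
}

# Made-up inputs: 3 groups, 21 observations in total
exampleTable = anovaTable(SStr = 20, SSE = 90, m = 3, n = 21)
print(exampleTable)
```

The cells that the layout leaves empty are stored as `NA`, mirroring the blank cells in the table above.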

In [1]:
### The data

# Data frame columns must all have the same length, so pad the shorter group with NA
# This is still more elegant than a multi-dimensional array...
groups = data.frame(
    "a"=c(20  ,6.5 ,21 ,16.5,12  ,18.5, NA),
    "b"=c(14.5,16.5,4.5,2.5 ,14.5,12   ,18.5),
    "c"=c(9   ,1   ,9  ,4.5 ,6.5, 2.5   ,12.5)
)
groups

# Pool every group into one vector (a, then b, then c), dropping the NA padding
sampleData = unlist(groups, use.names = FALSE)
sampleData = sampleData[!is.na(sampleData)]
sampleData

# m (number of groups) and n (total number of data points), as defined in the terminology above
m = length(groups)
n = length(sampleData)

| a (dbl) | b (dbl) | c (dbl) |
| :------ | :------ | :------ |
| 20.0    | 14.5    | 9.0     |
| 6.5     | 16.5    | 1.0     |
| 21.0    | 4.5     | 9.0     |
| 16.5    | 2.5     | 4.5     |
| 12.0    | 14.5    | 6.5     |
| 18.5    | 12.0    | 2.5     |
| NA      | 18.5    | 12.5    |


### Hypothesis

$ H_0 : \mu_1 = \mu_2 = \mu_3 $

$ H_a : $ at least one population mean differs from the others

$ \alpha = 0.05 $

In [2]:
alpha = 0.05

### Test statistic

$ F = \frac{SStr / DFtr}{SSE / DFE} = \frac{MStr}{MSE} $

Each of these terms is computed step by step below

In [3]:
# Sample mean (overall mean)

sampleMean = mean(sampleData)
sampleMean

In [4]:
# Sum of squares total

SST = 0
for (sampleDatum in sampleData) {
    SST = SST + (sampleDatum - sampleMean)^2
}
SST

In [5]:
# Degrees of freedom total

DFT = n - 1
DFT

In [6]:
# Sum of squares treatment

SStr = 0
for (group in groups) {
    group = na.omit(group) # drop the NA padding so it does not affect the mean or the count
    groupMean = mean(group)
    groupLength = length(group)
    SStr = SStr + (groupMean - sampleMean)^2 * groupLength
}
SStr

In [13]:
# Degrees of freedom treatment

DFtr = m - 1
DFtr

In [8]:
# Sum of squares errors

SSE = 0
for (group in groups) {
    group = na.omit(group) # drop the NA padding so it does not affect the group mean
    groupMean = mean(group)
    for (value in group) {
        SSE = SSE + (value - groupMean)^2
    }
}
SSE
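
The loops above can also be written without explicit iteration. The following is a minimal sketch, assuming the same data re-entered in long format (one value per observation plus a group label); the variable names are illustrative and not part of the notebook.

```r
# Same data as above, in long format: one entry per observation
values = c(20, 6.5, 21, 16.5, 12, 18.5,
           14.5, 16.5, 4.5, 2.5, 14.5, 12, 18.5,
           9, 1, 9, 4.5, 6.5, 2.5, 12.5)
groupLabels = factor(rep(c("a", "b", "c"), times = c(6, 7, 7)))

grandMean = mean(values)
groupMeans = tapply(values, groupLabels, mean)    # one mean per group
groupSizes = tapply(values, groupLabels, length)  # one count per group

# Between-group variation: group means against the overall mean, weighted by size
SStrVec = sum(groupSizes * (groupMeans - grandMean)^2)
# Within-group variation: ave() repeats each group's mean for every member
SSEVec = sum((values - ave(values, groupLabels))^2)
# Total variation: every point against the overall mean
SSTVec = sum((values - grandMean)^2)

print(c(SStr = SStrVec, SSE = SSEVec, SST = SSTVec))
```

Keeping the data in long format avoids the `NA` padding entirely, at the cost of carrying a separate label vector.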

In [14]:
# Degrees of freedom errors

# With SSE, we are measuring each data point's distance from the mean of its own group.
# Each group therefore loses one degree of freedom for its estimated mean: sum(n_i - 1) = n - m

DFE = n - m
DFE

In [10]:
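# ACID TEST
# These identities should hold. If they don't, something is broken
# all.equal() compares within a small tolerance, because exact == on floating-point sums can fail spuriously

stopifnot(isTRUE(all.equal(SST, SStr + SSE)))
stopifnot(DFT == DFtr + DFE)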

In [11]:
# And now, the test statistic

MStr = (SStr / DFtr)
MSE = (SSE / DFE)

testStatistic = MStr / MSE
testStatistic

In [12]:
# Compare the test statistic against the critical value from the F distribution
# With alpha = 0.05, criticalValue is the lower bound of the region of rejection (the upper tail of F)

criticalValue = qf(1-alpha, df1 = DFtr, df2 = DFE)
criticalValue

If the test statistic falls within the region of rejection (that is, testStatistic > criticalValue), we can reject the null hypothesis. Otherwise we fail to reject it.
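
As a cross-check on the hand calculation, the same analysis can be run with R's built-in `aov()`. This is only a sketch: it re-enters the data so that it runs on its own, and it uses the equivalent p-value form of the decision rule (reject $ H_0 $ when the p-value is below $ \alpha $). The names `longData`, `fit`, `fitTable`, `fValue` and `pValue` are illustrative.

```r
groups = data.frame(
    "a" = c(20, 6.5, 21, 16.5, 12, 18.5, NA),
    "b" = c(14.5, 16.5, 4.5, 2.5, 14.5, 12, 18.5),
    "c" = c(9, 1, 9, 4.5, 6.5, 2.5, 12.5)
)

# stack() reshapes to long format with columns `values` and `ind` (the group label);
# na.omit() then drops the row that holds the NA padding
longData = na.omit(stack(groups))

fit = aov(values ~ ind, data = longData)
fitTable = summary(fit)[[1]]  # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
print(fitTable)

# p-value approach: area under the F distribution to the right of the statistic
alpha = 0.05
fValue = fitTable[["F value"]][1]
pValue = pf(fValue, df1 = fitTable[["Df"]][1], df2 = fitTable[["Df"]][2], lower.tail = FALSE)
print(pValue < alpha)  # TRUE means reject the null hypothesis
```

The first row of `fitTable` should agree with SStr, DFtr, MStr and the test statistic computed by hand, and the second row with SSE, DFE and MSE.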