### ANOVA (analysis of variance), non-parametric test

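If you have several samples and want to know whether they come from different populations, you can run an ANOVA test.

The Kruskal-Wallis test is the non-parametric alternative: it works on ranks rather than raw values, so it is useful when the assumptions of ANOVA (such as normally distributed groups with similar variances) are not met.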


In [1]:
### The data

# All columns of a data frame must have the same length, so pad shorter groups with NA.
# This is still more elegant than a multi-dimensional array...
groups = data.frame(
    "a"=c(20  ,6.5 ,21 ,16.5,12  ,18.5, NA),
    "b"=c(14.5,16.5,4.5,2.5 ,14.5,12   ,18.5),
    "c"=c(9   ,1   ,9  ,4.5 ,6.5, 2.5   ,12)
)
groups

sampleData = c()
for (group in groups) {
    sampleData = c(sampleData, na.omit(group)) 
}
sampleData

a,b,c,d
<dbl>,<dbl>,<dbl>,<dbl>
58.2,56.3,50.1,52.9
57.2,55.4,54.2,49.9
58.4,57.0,55.4,50.0
55.8,55.3,,51.7
54.9,,,


### Hypothesis

$ H_0 : \mu_1 = \mu_2 = \mu_3 $

$ H_a : $ at least one population differs from the others

$ \alpha : 0.05 $

In [2]:
alpha = 0.05

### Samples' mean ranks

If you rank every datum in the pooled data from $ 1 $ to $ n $ and then average those ranks, you get:

The sum : $ \sum_{i=1}^n {i} = \frac{n(n+1)}{2} $

And then the average: $ \large \frac{\sum_{i=1}^n {i}}{n} $

$ \therefore \large \frac{\frac{n(n+1)}{2}}{n} $

$ \therefore \frac{(n+1)}{2} $
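
The identity can be checked numerically. The snippet below is a minimal sketch on a small made-up vector (`x` is hypothetical, not the notebook's data); it relies on `rank()` giving tied values the average of the ranks they span, which keeps the sum of ranks equal to $ \frac{n(n+1)}{2} $.

```r
# Hypothetical sample containing a tie, used only to illustrate the identity
x <- c(3.2, 1.5, 3.2, 7.0, 4.4)

r <- rank(x)                   # default ties.method = "average"
n <- length(x)

meanRank <- mean(r)            # average of the assigned ranks
expectedMean <- (n + 1) / 2    # closed form derived above

isTRUE(all.equal(meanRank, expectedMean))
```

Because the mean rank depends only on $ n $, the overall mean can be computed without ranking the data at all.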


In [3]:
# Mean of all ranks in the pooled sample (overall mean)

n = length(sampleData)

overallMean = (n+1)/2
overallMean

### Test statistic

Let:

$ k $ be the number of groups

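$ n_i $ be the number of observations in group $ i $, with $ n = \sum_{i=1}^k n_i $

$ R_i = \sum_{j=1}^{n_i} r_{ij} $ be the sum of ranks in group $ i $, where $ r_{ij} $ is the rank of the $ j $-th observation of group $ i $ within the pooled data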

$ R = \large \frac{n(n + 1)}{2} $ be the sum of ranks overall

$ \bar{R_i} = \large \frac{R_i}{n_i} $ be the average rank in a group

$ \bar{R} = \large \frac{R}{n} $ be the overall mean (of ranks)

$ V = \sum_{i=1}^k n_i \left( \bar{R_i} - \bar{R} \right)^2 $

$ H = \frac{12V}{n(n+1)} $ be the test statistic
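
As a compact alternative to accumulating the sums in loops, the definitions above can be evaluated with `tapply()`. This is only a sketch: the list `samples` restates the notebook's three groups without the `NA` padding, and the names `groupLabels`, `groupSizes` and `groupMeanRanks` are introduced here for illustration.

```r
# The notebook's three groups, without the NA padding
samples <- list(
    a = c(20, 6.5, 21, 16.5, 12, 18.5),
    b = c(14.5, 16.5, 4.5, 2.5, 14.5, 12, 18.5),
    c = c(9, 1, 9, 4.5, 6.5, 2.5, 12)
)

values <- unlist(samples, use.names = FALSE)                   # pooled data
groupLabels <- factor(rep(names(samples), lengths(samples)))   # group of each datum
ranks <- rank(values)                                          # ties get average ranks
n <- length(values)

groupSizes <- tapply(ranks, groupLabels, length)               # n_i
groupMeanRanks <- tapply(ranks, groupLabels, mean)             # R_i / n_i

# V and H exactly as defined above (no correction for ties)
V <- sum(groupSizes * (groupMeanRanks - (n + 1) / 2)^2)
H <- 12 * V / (n * (n + 1))
H
```

Keeping an explicit label for every datum avoids relying on the order in which the groups were concatenated.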

In [4]:
# Sum of ranks per group (treatment)

groupsTotalRanks = c()
groupsLengths = c()

# Rank the pooled data; tied values receive the average of their ranks
ranks = rank(sampleData)
ranks
for (group in groups) {
    group = na.omit(group) # drop the NA padding before counting
    groupLength = length(group)
    # sampleData was built group by group, so the next groupLength ranks belong to this group
    groupRanks = head(ranks, groupLength)
    ranks = tail(ranks, -groupLength)
    groupsTotalRanks = c(groupsTotalRanks, sum(groupRanks))
    groupsLengths = c(groupsLengths, groupLength)
}
groupsTotalRanks
groupsLengths

stopifnot(length(groupsTotalRanks) == length(groupsLengths))

In [5]:
V = 0

for (i in seq_along(groupsLengths)) {
    thisGroupTotalRanks = groupsTotalRanks[i]
    thisGroupLength = groupsLengths[i]
    thisGroupMean = thisGroupTotalRanks / thisGroupLength
    
    V = V + thisGroupLength*(thisGroupMean - overallMean)^2
}

V

In [6]:
### Test statistic
H = (12*V)/(n*(n+1))
H

In [7]:
### This is always an upper-tail test: H is compared with a chi^2 distribution,
### and only large values of H count as evidence against the null hypothesis.
### The degrees of freedom are the number of groups minus one.

df = length(groups)-1

criticalValue = qchisq(1-alpha, df)
criticalValue

If the test statistic falls in the rejection region (above the critical value), we reject the null hypothesis.
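
The decision can be cross-checked against R's built-in `kruskal.test()` from the `stats` package. The sketch below restates the three groups as a list (`samples`, a name introduced here). Note that `kruskal.test()` applies a correction for ties, so its statistic can differ slightly from an $ H $ computed by hand without that correction.

```r
# The notebook's three groups, without the NA padding
samples <- list(
    a = c(20, 6.5, 21, 16.5, 12, 18.5),
    b = c(14.5, 16.5, 4.5, 2.5, 14.5, 12, 18.5),
    c = c(9, 1, 9, 4.5, 6.5, 2.5, 12)
)
alpha <- 0.05

kt <- kruskal.test(samples)            # accepts a list of numeric vectors

statistic <- unname(kt$statistic)      # tie-corrected H
df <- unname(kt$parameter)             # number of groups minus one
pValue <- kt$p.value                   # upper-tail chi^2 probability

# Two equivalent ways of making the decision
criticalValue <- qchisq(1 - alpha, df)
rejectNull <- statistic > criticalValue
rejectByP <- pValue < alpha

c(statistic = statistic, criticalValue = criticalValue, pValue = pValue)
rejectNull
```

Comparing the p-value with `alpha` and comparing the statistic with the critical value always lead to the same decision, since both use the same upper tail of the $ \chi^2 $ distribution.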