# Higher-order Repeated Measures ANOVA
In the previous part of this lesson, we examined the simplest case of the repeated measures ANOVA. Despite the simplicity of the design, we saw how this analysis had several complications around the correct partition of the error, as well as the assumptions made about the covariance structure. These alone were enough to suggest that the repeated measures ANOVA framework was problematic to apply in practice. Yet, there are even more complex situations where this framework can be applied. In this final part of the lesson, we will see how the repeated measures ANOVA is used in situations where there are additional *between-subjects* factors, as well as multiple *within-subject* factors. This is not to condone the use of the repeated measures ANOVA in these situations. Rather, it is to help you understand (a) how this method should be used correctly and (b) why an approach such as mixed-effects modelling provides a much better alternative.

## Adding Between-subjects Factors
The first additional complexity we may come across is when we have a *between-subjects* factor alongside the repeated measurements. For example, the `datarium` package contains the dataset `anxiety`. Here, 3 repeated measurements of anxiety have been taken at 3 different time points. Each of the 45 subjects comes from one of 3 groups practising different exercise regimes. So, `time` is the repeated measurement and `group` is the between-subjects factor. This is effectively a $3 \times 3$ ANOVA, as shown in the means table below. 

|             | Group: 1   | Group: 2   | Group: 3   | 
|-------------|------------|------------|------------|
| **Time: 1** | $\mu_{11}$ | $\mu_{12}$ | $\mu_{13}$ |
| **Time: 2** | $\mu_{21}$ | $\mu_{22}$ | $\mu_{23}$ |
| **Time: 3** | $\mu_{31}$ | $\mu_{32}$ | $\mu_{33}$ |

As such, our interest falls on the main effect of `group`, the main effect of `time` and the `group:time` interaction. Based on what we have seen so far, the most obvious model is given by

$$
y_{ijk} = \mu + \alpha_{j} + \beta_{k} + (\alpha\beta)_{jk} + S_{i} + \eta_{ijk}.
$$

Here we have added a term for the *between-subjects* effect called $\beta_{k}$, along with the interaction $(\alpha\beta)_{jk}$, so now $i$ indexes the subject ($i = 1,\dots,45$), $j$ indexes the repeated measurements ($j = 1,2,3$) and $k$ indexes the independent groups ($k = 1,2,3$).

We can see this dataset below in its original form

In [1]:
library('datarium')
data('anxiety')
print(head(anxiety))

  id group   t1   t2   t3
1  1  grp1 14.1 14.4 14.1
2  2  grp1 14.5 14.6 14.3
3  3  grp1 15.7 15.2 14.9
4  4  grp1 16.0 15.5 15.3
5  5  grp1 16.5 15.8 15.7
6  6  grp1 16.9 16.5 16.2


and then reworked into long-format for univariate modelling

In [2]:
library('reshape2')

# repeats and number of subjects
t <- 3
n <- 45

# reshape wide -> long
anxiety.long <- melt(anxiety,                 # wide data frame
                     id.vars=c('id','group'), # what stays fixed?
                     variable.name='time',    # name for the new predictor
                     value.name='anxiety')    # name for the new outcome

anxiety.long           <- anxiety.long [order(anxiety.long$id),] # order by ID
rownames(anxiety.long) <- seq(1,n*t)                             # fix row names
anxiety.long$id        <- as.factor(anxiety.long$id)             # id as factor

In [3]:
print(head(anxiety.long))

  id group time anxiety
1  1  grp1   t1    14.1
2  1  grp1   t2    14.4
3  1  grp1   t3    14.1
4  2  grp1   t1    14.5
5  2  grp1   t2    14.6
6  2  grp1   t3    14.3


So far, nothing has changed from what we have seen previously. However, we now have to think more carefully about our possible error terms and which is most sensible to use as the denominator for each test. 

### The Between-subjects Error Term
Recall from the previous part of this lesson that a repeated measures ANOVA is effectively a linear model with *partitioned errors*. We stated earlier that by splitting the errors into $\epsilon_{ij} = S_{i} + \eta_{ij}$ we are effectively specifying a model with *two* variance terms. So, the variance function becomes

$$
\text{Var}(y_{ij}) = \sigma^{2} = \sigma^{2}_{b} + \sigma^{2}_{w}.
$$

In the one-way repeated measures ANOVA, the point of doing this was twofold. Firstly, it *removes* the correlation from the model errors and thus the model now meets the *i.i.d.* assumptions. Secondly, it removes error variance associated with the differences *between* the subjects, allowing any inferential tests to only use the variance associated with measurements *within* a subject. This is obviously the most suitable error to use for inference on the repeated measures as it captures the random fluctuations in measurements *within* an individual. By removing $S_{i}$, we have forced all the repeated measurements to represent samples from the *same* population distribution. Initially, the measurements from subject $i$ can be considered a draw from a distribution with a variance $\sigma^{2}_{w}$ and a mean $E(y_{ij}) = \mu_{j} + S_{i}$. So each subject has the *same* variance but a *different* mean that is unique to them. By removing the $S_{i}$, we line all these distributions up and can collapse them together. We can then use all the data across all the subjects to estimate $\sigma^{2}_{w}$. As this is the variance of each distribution of repeated measurements, it is the most suitable metric of uncertainty for inference about the repeated measurements.
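
To make the first of these points concrete, we can write out the covariance implied by the partitioned errors. This is only a short sketch, and it assumes that $S_{i}$ and $\eta_{ij}$ are independent of each other with variances $\sigma^{2}_{b}$ and $\sigma^{2}_{w}$. For two different measurements $j \neq j'$ from the same subject

$$
\text{Cov}(y_{ij}, y_{ij'}) = \text{Var}(S_{i}) = \sigma^{2}_{b}, \qquad \text{Corr}(y_{ij}, y_{ij'}) = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{w}}.
$$

Under these assumptions, all of the correlation between repeated measurements is carried by $S_{i}$, which is why the remaining errors $\eta_{ij}$ can be treated as uncorrelated.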

Now, what about the between-subjects terms? One way of understanding what we do here is by imagining that there were *no* repeated measurements. In that case, what would the error term represent? If we averaged the repeats for each subject, the errors would represent the difference between the *group mean* and the *subject mean*. The error variance would then be reflective of how the subject means differed across the sample. This is exactly the variance $\text{Var}(S_{i}) = \sigma^{2}_{b}$. This makes sense because it is the variation *between subjects* that is of most importance when looking at between-subject effects. We care about how much the subject-specific means change. In other words, our population of interest is the distribution of *subject means* with variance $\sigma^{2}_{b}$.    

```{figure} ./images/mixed-measures-sampling.png
---
width: 600px
name: mixed-sampling-fig
---
An illustration of the sampling model that underlies a repeated measures design with 3 measurements per subject, as taken from 2 independent groups. Two example subjects are shown for each group. The most important element here is seeing the different variance sources ($\sigma^{2}_{b}$ and $\sigma^{2}_{w}$) and how these correspond to inference for either the between-subjects or within-subject effects.
```
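
The idea of treating the subject means as the data for between-subjects inference can be checked numerically. The sketch below uses *simulated* data, so the object names, effect sizes and seed are all invented for illustration and have nothing to do with the `anxiety` dataset. It compares the `group` test from the `id` stratum of `aov()` with a one-way ANOVA fitted to the subject means.

```r
# Simulated mixed design: 15 subjects, 3 groups (between), 3 time points (within)
set.seed(1)
sim       <- expand.grid(time=factor(c('t1','t2','t3')), id=factor(1:15))
sim$group <- factor(rep(c('grp1','grp2','grp3'), each=15))  # 5 subjects per group
S         <- rnorm(15, mean=0, sd=2)                        # subject effects (hypothetical sd)
sim$y     <- 15 + S[as.integer(sim$id)] + rnorm(nrow(sim), sd=0.3)

# (1) group tested within the between-subjects stratum
strata   <- summary(aov(y ~ group*time + Error(id), data=sim))
F.strata <- strata[['Error: id']][[1]][['F value']][1]

# (2) average the repeats, then fit a one-way ANOVA to the subject means
subj.means <- aggregate(y ~ id + group, data=sim, FUN=mean)
F.means    <- summary(aov(y ~ group, data=subj.means))[[1]][['F value']][1]

print(c(strata=F.strata, means=F.means))
print(all.equal(F.strata, F.means))
```

In a balanced design like this one, the two $F$-statistics should agree, because both use the variation in subject means as their error term.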

In [4]:
anxiety.lm <- lm(anxiety ~ group*time + id, data=anxiety.long)
anova(anxiety.lm)

Analysis of Variance Table

Response: anxiety
           Df  Sum Sq Mean Sq F value    Pr(>F)    
group       2  61.992  30.996 367.701 < 2.2e-16 ***
time        2  66.579  33.289 394.909 < 2.2e-16 ***
id         42 299.146   7.123  84.494 < 2.2e-16 ***
group:time  4  37.154   9.288 110.188 < 2.2e-16 ***
Residuals  84   7.081   0.084                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



... Getting this wrong could be disastrous for our inference, because an error term that is too small could lead us to falsely claim that the between-subjects factor has an effect. Considering that these factors could correspond to something like treatment effects in patients and controls, concluding a false difference could have serious implications.

In [5]:
anxiety.aov <- aov(anxiety ~ group*time + Error(id), data=anxiety.long)
summary(anxiety.aov)


Error: id
          Df Sum Sq Mean Sq F value Pr(>F)  
group      2  61.99  30.996   4.352 0.0192 *
Residuals 42 299.15   7.123                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)    
time        2  66.58   33.29   394.9 <2e-16 ***
group:time  4  37.15    9.29   110.2 <2e-16 ***
Residuals  84   7.08    0.08                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So now, even though the effect of `group` is still significant, we can see that this is much weaker when we use the correct error term ($F = 367.70$ vs $F = 4.35$).
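
Both $F$-statistics can be reproduced by hand from the sums of squares printed in the two tables, which makes it clear that the only thing changing is the denominator. This is a small sketch where the object names are our own and the numbers are copied from the output above.

```r
# Sums of squares and degrees of freedom copied from the printed tables
ss.group <- 61.992;  df.group <- 2
ss.id    <- 299.146; df.id    <- 42   # between-subjects error
ss.resid <- 7.081;   df.resid <- 84   # within-subject error

ms.group <- ss.group / df.group

# Incorrect: group tested against the within-subject error
F.within <- ms.group / (ss.resid / df.resid)

# Correct: group tested against the between-subjects error
F.between <- ms.group / (ss.id / df.id)
p.between <- pf(F.between, df.group, df.id, lower.tail=FALSE)

print(round(c(F.within=F.within, F.between=F.between, p.between=p.between), 4))
```

The numerator is identical in both cases. The drop in the $F$-statistic comes entirely from swapping the within-subject mean square for the much larger between-subjects mean square.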

## Adding More Within-subject Factors
This is where things start to get *really* tricky. Remember, as we go through this, that this is *not* something you would ever need to do in practice because we are not condoning the use of the repeated measures ANOVA. However, this complexity is shown to help you understand why you really do *not* want to do this. This is one of the clearest cases where the automation provided by software such as SPSS actively *hides* so much of the complexity that researchers do not think twice about designing studies that require these approaches. As we will see below, we can also do this using `ezANOVA()`, but using `aov()` forces us to directly address how complex these methods become and why we really want a much more flexible and less cumbersome approach.

In [6]:
library('datarium')
library('reshape2')
data('weightloss')

# measurements per subject (2 diet x 2 exercises x 3 time) and number of subjects
t <- 12
n <- 12

# reshape wide -> long
weightloss.long <- melt(weightloss,                         # wide data frame
                        id.vars=c('id','diet','exercises'), # what stays fixed?
                        variable.name="time",               # name for the new predictor
                        value.name="weight")                # name for the new outcome

weightloss.long <- weightloss.long[order(weightloss.long$id),] # order by ID
rownames(weightloss.long) <- seq(1,n*t)              # fix row names

print(head(weightloss.long))

  id diet exercises time weight
1  1   no        no   t1  10.43
2  1   no       yes   t1  11.12
3  1  yes        no   t1  10.20
4  1  yes       yes   t1  10.43
5  1   no        no   t2  13.21
6  1   no       yes   t2  12.51


### Multiple Within-subject Error Terms

In [7]:
weightloss.aov <- aov(weight ~ diet*exercises*time + Error(id), data=weightloss.long)
summary(weightloss.aov)


Error: id
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 11   27.1   2.463               

Error: Within
                     Df Sum Sq Mean Sq F value   Pr(>F)    
diet                  1   5.11    5.11   4.069   0.0459 *  
exercises             1  71.16   71.16  56.647 1.03e-11 ***
time                  2 211.39  105.69  84.134  < 2e-16 ***
diet:exercises        1  33.40   33.40  26.586 9.96e-07 ***
diet:time             2   2.42    1.21   0.963   0.3846    
exercises:time        2  67.43   33.72  26.839 2.25e-10 ***
diet:exercises:time   2  30.77   15.39  12.249 1.43e-05 ***
Residuals           121 152.01    1.26                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So, in this scenario, `diet`, `exercises` and `time` are all using the *same* within-subject error term. But is this appropriate? If we focus on `diet`, as an example, $\sigma^{2}_{w}$ will contain variation associated with the different diet conditions within each subject, as we would expect. However, it will *also* contain variation associated with the different levels of `exercises` and the different levels of `time`. If our inferential focus is `diet` alone, it seems illogical to use the uncertainty around `exercises` and `time` in the denominator. Similarly, if we are interested in the `diet:time` interaction, $\sigma^{2}_{w}$ will contain variation associated with both `diet` and `time`, but will *also* include variation associated with `exercises`, which is of no relevance if we are only interested in the `diet:time` effect. As such, whenever there are *multiple* within-subject factors, we can *further partition* $\sigma^{2}_{w}$ into various error terms that are more suitable for each of the effects. 

Going back to `diet` as an example, the errors we want are only those associated with the different levels of `diet` for each subject. This means we want to *average-over* all other repeated measurements associated with both `time` and `exercises`. As we know, the subject-specific effect $S_{i}$ will effectively average over *all* the data from each subject in order to form a subject mean. Here, we want to do basically the same thing, except we want to average over the data for `time` and `exercises`, but leave the data for `diet` un-averaged. As such, we can specify an *interaction* between `subject` and `diet`. This will average-over all the data for each subject within each level of `diet`. This will leave us with subject-specific errors for each level of `diet`. The variance from these errors will then reflect within-subject variation associated with `diet`, ignoring both `time` and `exercises`.
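
To see what the `subject` by `diet` error term is doing, we can check it against a simpler analysis. The sketch below uses *simulated* data with the same layout as `weightloss` (12 subjects, each measured in all 2 × 2 × 3 cells). The effect sizes, the seed and the objects `sim`, `S` and `SD` are all invented for illustration. Because `diet` has only two levels, averaging over `exercises` and `time` leaves two means per subject, so the $F$-test from the `id:diet` stratum should equal the square of a paired $t$-test on those means.

```r
# Simulated fully within-subject design: 12 subjects x 2 diet x 2 exercises x 3 time
set.seed(42)
sim <- expand.grid(time=factor(c('t1','t2','t3')),
                   exercises=factor(c('no','yes')),
                   diet=factor(c('no','yes')),
                   id=factor(1:12))
S   <- rnorm(12, sd=1.5)                   # subject effects (hypothetical)
SD  <- matrix(rnorm(24, sd=0.8), nrow=12)  # subject-by-diet effects (hypothetical)
sim$y <- 12 + 0.5*(sim$diet == 'yes') +
         S[as.integer(sim$id)] +
         SD[cbind(as.integer(sim$id), as.integer(sim$diet))] +
         rnorm(nrow(sim), sd=1)

# (1) diet tested against the subject-by-diet error term
strata <- summary(aov(y ~ diet*exercises*time + Error(id + id:diet), data=sim))
F.diet <- strata[['Error: id:diet']][[1]][['F value']][1]

# (2) average over exercises and time, then run a paired t-test on diet
cell   <- aggregate(y ~ id + diet, data=sim, FUN=mean)
cell   <- cell[order(cell$diet, cell$id),]   # same subject order in both levels
t.diet <- unname(t.test(cell$y[cell$diet == 'yes'],
                        cell$y[cell$diet == 'no'],
                        paired=TRUE)$statistic)

print(c(F=F.diet, t.squared=t.diet^2))
print(all.equal(F.diet, t.diet^2))
```

Only `id:diet` is split out of the within-subject error here to keep the example short. Everything else stays in the `Within` stratum, which does not affect the test of `diet`.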

If we continue this logic for all other terms, we get an error structure where we can partition $\sigma^{2}_{w}$ into further terms by taking all possible interactions between `subject` and the within-subject factors. Using `aov()`, this then becomes

In [8]:
weightloss.aov <- aov(weight ~ diet*exercises*time + 
                               Error(id                + 
                                     id:diet           +
                                     id:exercises      +
                                     id:time           +
                                     id:exercises:time +
                                     id:exercises:diet +
                                     id:time:diet      +
                                     id:time:exercises:diet), 
                        data=weightloss.long)
summary(weightloss.aov)


Error: id
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 11   27.1   2.463               

Error: id:diet
          Df Sum Sq Mean Sq F value Pr(>F)  
diet       1  5.111   5.111   6.021  0.032 *
Residuals 11  9.337   0.849                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: id:exercises
          Df Sum Sq Mean Sq F value   Pr(>F)    
exercises  1  71.16   71.16   58.93 9.65e-06 ***
Residuals 11  13.28    1.21                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: id:time
          Df Sum Sq Mean Sq F value   Pr(>F)    
time       2 211.39  105.69   110.9 3.22e-12 ***
Residuals 22  20.96    0.95                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: id:exercises:time
               Df Sum Sq Mean Sq F value   Pr(>F)    
exercises:time  2  67.43   33.72   20.83 8.41e-06 ***
Residuals      22  35.62    1.62                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

...

which is not only hideous to specify, but also hideous to interpret, given that we now have *8* ANOVA tables to deal with! There is an easier way to write this using

In [11]:
aov(weight ~ diet*exercises*time + Error(id/(diet*exercises*time)), data=weightloss.long)


Call:
aov(formula = weight ~ diet * exercises * time + Error(id/(diet * 
    exercises * time)), data = weightloss.long)

Grand Mean: 12.68132

Stratum 1: id

Terms:
                Residuals
Sum of Squares   27.09682
Deg. of Freedom        11

Residual standard error: 1.569506

Stratum 2: id:diet

Terms:
                    diet Residuals
Sum of Squares  5.111367  9.337474
Deg. of Freedom        1        11

Residual standard error: 0.9213367
5 out of 6 effects not estimable
Estimated effects are balanced

Stratum 3: id:exercises

Terms:
                exercises Residuals
Sum of Squares   71.16328  13.28392
Deg. of Freedom         1        11

Residual standard error: 1.098922
5 out of 6 effects not estimable
Estimated effects are balanced

Stratum 4: id:time

Terms:
                     time Residuals
Sum of Squares  211.38837  20.95943
Deg. of Freedom         2        22

Residual standard error: 0.9760642
6 out of 8 effects not estimable
Estimated effects may be unbalanced

...

where the syntax `id/(diet*exercises*time)` can be read as a request to include the main effect of the term on the *left* of `/` alongside all possible interactions between the term on the *left* of `/` and the terms on the *right* of `/`. Here we also use the `*` syntax to automatically expand `diet*exercises*time` into all possible factorial combinations. Note as well that only the *within-subject* factors appear in the `Error()` syntax because our aim is to further partition $\sigma^{2}_{w}$ and *not* $\sigma^{2}_{b}$.
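
One way to convince ourselves of what the `/` shorthand produces is to ask `R` to expand the formula without fitting anything. The sketch below uses `terms()` from base `R` and compares the result with the long-hand specification used earlier. The helper `normalise()` and the `n.levels` vector are our own additions. It also counts the degrees of freedom for each error stratum, which for this design should add up to the 121 residual degrees of freedom from the single `Within` stratum we started with.

```r
# Expand the shorthand and the long-hand error formulas into their term labels
short <- attr(terms(~ id/(diet*exercises*time)), 'term.labels')
long  <- attr(terms(~ id + id:diet + id:exercises + id:time +
                      id:exercises:time + id:exercises:diet +
                      id:time:diet + id:time:exercises:diet), 'term.labels')

# Sort the variables inside each label so that ordering differences are ignored
normalise <- function(labels) {
    sapply(strsplit(labels, ':', fixed=TRUE),
           function(v) paste(sort(v), collapse=':'))
}
print(short)
print(setequal(normalise(short), normalise(long)))

# Degrees of freedom per error stratum: product of (levels - 1) over its factors
n.levels   <- c(id=12, diet=2, exercises=2, time=3)
stratum.df <- sapply(strsplit(short, ':', fixed=TRUE),
                     function(v) prod(n.levels[v] - 1))
names(stratum.df) <- short
print(stratum.df)
print(sum(stratum.df) - stratum.df[['id']])  # total within-subject error df
```

So the further partitioning does not create or lose any information. It simply divides the original within-subject error degrees of freedom among the new strata.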

In general, it is not recommended to use `aov()` like this. Not only is it *difficult* and prone to *mistakes*, but the output becomes unwieldy. A better approach if you *must* use a repeated measures ANOVA is to use `ezANOVA()`, as shown below

In [10]:
library('ez')
weightloss.ez <- ezANOVA(data=weightloss.long, dv=weight, wid=id, within=.(diet,exercises,time))
print(weightloss.ez)

$ANOVA
               Effect DFn DFd          F            p p<.05        ges
2                diet   1  11   6.021440 3.202562e-02     * 0.02774675
3           exercises   1  11  58.928078 9.650954e-06     * 0.28434954
4                time   2  22 110.941583 3.218470e-12     * 0.54133853
5      diet:exercises   1  11  75.356051 2.980284e-06     * 0.15716889
6           diet:time   2  22   0.602562 5.561945e-01       0.01332945
7      exercises:time   2  22  20.825889 8.408790e-06     * 0.27352201
8 diet:exercises:time   2  22  14.246076 1.074451e-04     * 0.14663048

$`Mauchly's Test for Sphericity`
               Effect         W          p p<.05
4                time 0.9833425 0.91944157      
6           diet:time 0.5493166 0.05001654      
7      exercises:time 0.6835227 0.14919857      
8 diet:exercises:time 0.9589434 0.81089547      

$`Sphericity Corrections`
               Effect       GGe        p[GG] p[GG]<.05       HFe        p[HF]
4                time 0.9836155 ...

The output will match the result of `aov()`, but in a much nicer format. However, this does hide the fundamental difficulty with assigning tests to different error terms and thus hides much of the complexity and disadvantages of the repeated measures ANOVA framework.
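
This correspondence can be confirmed by hand for the three strata whose sums of squares are printed in full in the `aov()` output above. The sketch below simply copies those numbers into a data frame (the object name `strata` is our own) and forms each $F$-ratio from its own error term.

```r
# Sums of squares and degrees of freedom copied from the printed aov() strata
strata <- data.frame(effect    = c('diet', 'exercises', 'time'),
                     ss.effect = c(5.111367, 71.16328, 211.38837),
                     df.effect = c(1, 1, 2),
                     ss.error  = c(9.337474, 13.28392, 20.95943),
                     df.error  = c(11, 11, 22))

# Each effect is tested against the residual mean square of its own stratum
strata$F <- (strata$ss.effect / strata$df.effect) /
            (strata$ss.error  / strata$df.error)
strata$p <- pf(strata$F, strata$df.effect, strata$df.error, lower.tail=FALSE)

print(strata[, c('effect', 'df.effect', 'df.error', 'F', 'p')])
```

The resulting values can be compared against the `DFn`, `DFd`, `F` and `p` columns of the `$ANOVA` table from `ezANOVA()`.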

## Why We Should *Not* Use RM ANOVA
Everything we have discussed above has really been an exercise in telling you why you really do not want to use RM ANOVA. All the unnecessary fiddling with error terms and different tests requiring different errors is a complication that we could simply do without. Even if we do manage to successfully work out what needs to go where (or get a function like `ezANOVA()` to sort it for us), we are still left with a method that has a number of meaningful restrictions. ... Because of this, the RM ANOVA is tricky to understand, tricky to use correctly and massively inflexible. It is no wonder that statisticians abandoned this method decades ago! And yet, this is the method that has persisted in psychology until relatively recently.

... testing assumptions and follow-up tests...

This section has largely been motivational to understand why we want to use something more flexible and more modern, but it is important to recognise that you may well end up working with someone who knows nothing beyond the RM ANOVA. In those situations, it is useful to (a) motivate the need for something better and (b) understand how to get the RM ANOVA results in `R`, in case they require further convincing. So, we do not condone the use of the RM ANOVA, but we understand its place in psychology and also understand that there are times where you may want to see what the RM ANOVA says, even if you do not wish to use it.

[^submodel-foot]: An alternative perspective here is that each error term represents a different *sub-model*. So, we can think of specifying *multiple* models, some of which require us to *average-over* certain factors. For instance, if we were to average-over the repeated measurements and then fit a model on the resultant outcome variable, this model would automatically have $\sigma^{2}_{b}$ as its error term. This does make the whole procedure feel a little less of a hack. However, it is very impractical to do this, especially when the number of factors and interactions gets larger. You can read more about this approach in [McFarquhar (2019)](https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2019.00352/full).

[^noterr-foot]: Note that these are *not* the errors from the linear model, even though the Greek letter is the same.