# From Multilevel to Mixed-effects
In the previous parts of this lesson we focussed on the multilevel framework. The argument was that this is more intuitive because we can break complex data structures into different levels of variation. This allows us to think hierarchically about where data comes from, building up our complexity layer-by-layer. We saw this when we focussed on modelling just a single subject as our experimental unit of interest. We then built in more subjects and expanded the model logically until we came to the multilevel representation. Importantly, this was all based on capturing the *data structure*, rather than thinking in terms of the *covariance structure*. Unlike GLS, where we had to reason about the form of the variance-covariance matrix, the covariance structure in the multilevel model was an emergent property of its structure. In a way, we did not need to worry about correlation, as it was taken into account *automatically*. All we needed to think about was the data structure and the rest followed.

Although this focus on *data structure* can be more intuitive, it is not how these types of models are typically fit in practice. Now, this is not so much a *technical* limitation as it is a decision about *perspective*. Software for explicitly fitting multilevel models does exist. For instance, [MLwiN](https://www.bristol.ac.uk/cmm/software/mlwin/), developed by the Centre for Multilevel Modelling at the University of Bristol, is designed *specifically* for fitting models from this perspective. As shown in the screenshot below, models are fit by writing the multilevel equations, which are then estimated from the given data. This provides a *direct* connection between the theory and the application of these models to complex datasets.

```{figure} images/mlwin-screenshot-lge.gif
---
width: 600px
---
```

Another example of this approach is the [HLM](https://ssilive.com/license/hlm) software which, as shown below, also specifies these types of models explicitly in terms of equations for the different levels. This illustrates that it *is* possible to apply the multilevel perspective in practice. It also illustrates why understanding the theory is so important, especially when faced with software that requires you to write the theoretical model down in order to use it.

```{figure} images/hlm-screenshot.png
---
width: 600px
---
```

However, this multilevel perspective is not the typical method used in `R` (or `SPSS`, `Stata` or `SAS`) for these types of models. Instead, these software packages use an *equivalent* approach in the form of *mixed-effects models*. In this part of the lesson, we will connect these two perspectives together and then show how these models can be fit in `R` using both the `nlme` package (familiar from the use of `gls()`) and the more recent `lme4` package.

## Turning a Multilevel Model into a Mixed-effects Model
To understand the equivalence between *multilevel* models and *mixed-effects* models, we will start with our multilevel model from the previous part of this lesson. We will then walk through the steps of turning it into a mixed-effects model.

### The Existing Multilevel Model
To review, we previously finished modelling our simple repeated measures data using the following multilevel specification

$$
\begin{alignat*}{2}
    y_{ij}    &= \mu_{i} + \alpha_{j} + \eta_{ij}             &\quad\text{Level 1} \\
    \mu_{i}   &= \mu + \xi_{i}                                &\quad\text{Level 2}
\end{alignat*}
$$

with

$$
\begin{alignat*}{2}
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{w}\right) \\
    \xi_{i}   &\sim \mathcal{N}\left(0,\sigma^{2}_{b}\right).
\end{alignat*}
$$

We conceptualised the *Level 1* model as capturing variation associated with *individual* subjects, where the error term $\eta_{ij}$ can be used to capture the *within-subject* variance term $\sigma^{2}_{w}$. We then conceptualised the *Level 2* model as capturing variation associated with the *group* of subjects, where the error term $\xi_{i}$ can be used to capture the *between-subjects* variance term $\sigma^{2}_{b}$. We also saw, via simulations, that this structure implied a compound symmetric variance-covariance matrix where 

$$
\begin{alignat*}{1}
    \text{Var}(y_{ij})                 &= \sigma^{2}_{b} + \sigma^{2}_{w} \\
    \text{Cov}(y_{ij},y_{ij^{\prime}}) &= \sigma^{2}_{b}.
\end{alignat*}
$$
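As a brief sketch of why this structure emerges (complementing the simulations), we can write $\mu_{i}$ out as $\mu + \xi_{i}$ inside the Level 1 equation and take variances and covariances directly. This assumes, as is standard for this model but not stated explicitly above, that $\xi_{i}$ and $\eta_{ij}$ are independent of each other and that the $\eta_{ij}$ are independent across repeats. For two different repeats $j \neq j^{\prime}$ from the same subject $i$

$$
\begin{alignat*}{1}
    \text{Var}(y_{ij})                 &= \text{Var}(\xi_{i} + \eta_{ij}) = \text{Var}(\xi_{i}) + \text{Var}(\eta_{ij}) = \sigma^{2}_{b} + \sigma^{2}_{w} \\
    \text{Cov}(y_{ij},y_{ij^{\prime}}) &= \text{Cov}(\xi_{i} + \eta_{ij},\, \xi_{i} + \eta_{ij^{\prime}}) = \text{Var}(\xi_{i}) = \sigma^{2}_{b} \\
    \text{Cor}(y_{ij},y_{ij^{\prime}}) &= \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{w}}.
\end{alignat*}
$$

The constants $\mu$ and $\alpha_{j}$ drop out because they do not vary. The only term shared by two measurements from the same subject is $\xi_{i}$, which is why the covariance is just the between-subjects variance.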

### The Mixed-effects Trick
Turning this model into a mixed-effects model only requires one step, where we *collapse* the levels together and then work with a *single* equation. Notice above that the Level 1 equation is given as

$$
y_{ij} = \mu_{i} + \alpha_{j} + \eta_{ij},
$$

with a separate Level 2 equation telling us what $\mu_{i}$ is equal to

$$
\mu_{i} = \mu + \xi_{i}.
$$

Given that we know that $\mu_{i}$ has a wider definition than given in the Level 1 equation, we can simply insert the equality from Level 2 into Level 1. This gives us

$$
y_{ij} = \overbrace{\mu + \xi_{i}}^{\mu_{i}} + \alpha_{j} + \eta_{ij}.
$$

Now, we have a single equation that contains two constants $(\mu,\alpha_{j})$ and two random error-terms $(\xi_{i},\eta_{ij})$. The constants are termed the *fixed-effects* and the errors are termed the *random-effects*. In addition, we would usually arrange like-terms together, so the more typical way of writing this would be

$$
y_{ij} = \overbrace{\mu + \alpha_{j}}^{\text{fixed}} + \overbrace{\xi_{i} + \eta_{ij}}^{\text{random}}
$$

with

$$
\begin{alignat*}{2}
    \xi_{i}   &\sim \mathcal{N}\left(0,\sigma^{2}_{b}\right) \\
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{w}\right)
\end{alignat*}
$$

which is the same as the multilevel model. So, what we have done here is *collapsed* our data hierarchy into a *single level*. Now we have *one* equation with *two* error terms.
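To make the equivalence concrete, the following is a minimal simulation sketch in base `R`. All of the parameter values (`mu`, `alpha`, `sigma.b`, `sigma.w`) and the number of subjects are illustrative assumptions rather than values taken from any dataset in this lesson. The same random draws are pushed through the two-level form and the collapsed form, which should give identical data.

```r
set.seed(1)

# Illustrative (assumed) values
n       <- 5000          # number of subjects
t       <- 3             # number of repeats
mu      <- 5             # grand mean
alpha   <- c(0, 1, 2)    # effects of the repeated measurement conditions
sigma.b <- 2             # between-subjects standard deviation
sigma.w <- 1             # within-subject standard deviation

# One draw of each error term, shared by both formulations
xi        <- rnorm(n, mean = 0, sd = sigma.b)                              # one per subject
eta       <- matrix(rnorm(n * t, mean = 0, sd = sigma.w), nrow = n, ncol = t)
alpha.mat <- matrix(alpha, nrow = n, ncol = t, byrow = TRUE)               # alpha_j in column j

# Multilevel form: Level 2 first, then Level 1
mu.i <- mu + xi                    # Level 2: subject-specific means
y.ml <- mu.i + alpha.mat + eta     # Level 1: rows = subjects, columns = repeats

# Mixed-effects form: fixed part plus random part in a single equation
y.me <- (mu + alpha.mat) + (xi + eta)

all.equal(y.ml, y.me)   # the two formulations generate the same data
round(cov(y.me), 2)     # empirical covariance between the repeats
```

With a large number of subjects, the diagonal of the empirical covariance matrix should be close to $\sigma^{2}_{b} + \sigma^{2}_{w}$ and the off-diagonals close to $\sigma^{2}_{b}$, in line with the compound symmetric structure described earlier.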

### Partitioned Errors Perspective
All of the above may be starting to sound very familiar from our discussion of the *repeated measures ANOVA* a couple of weeks ago. In fact, the repeated measures ANOVA *is a simple mixed-effects model*. This is perhaps one of the most important facts of this lesson to understand. It is not uncommon to hear people recommend a mixed-effects model over the repeated measures ANOVA, but this is something of a confusion because a repeated measures ANOVA *is already* a mixed-effects model. The differences are largely in terms of the mechanics of how this model is estimated and used. Because this is an *older* approach, it is not part of a larger or more-general framework for mixed-effects. Instead, it is an application of mixed-effects, within a more restrictive traditional framework. 

However, the key thing to recognise here is that *because* the repeated measures ANOVA *is* a mixed-effects model, modelling the *same* data in the *same* way using a mixed-effects approach can be *no better than the repeated measures ANOVA*. In fact, in some ways, it can be *worse*. So, moving to mixed-effects is not a guarantee of a better model than the repeated measures ANOVA. The framework is more flexible, yes. The estimation methods are more flexible, yes. However, the results can be *no better*. In practice, this is not always well understood.

## Why Collapse the Hierarchy?

- The model is always estimated as a single unit, with the levels informing each other
- The multilevel notation can suggest that each level is estimated separately. Doing so (known as a *summary statistics* approach) loses the whole advantage of this framework
- Software is harder to write in a multilevel fashion (though it does exist, e.g. HLM and MLwiN), whereas a single function call for a single model fits happily inside the usual `R` approach

However, this does have a distinct disadvantage in terms of intuition. Although both perspectives are *mathematically equivalent*, the model

$$
\begin{alignat*}{1}
    y_{ij}    &=    \mu + \alpha_{j} + \xi_{i} + \eta_{ij}   \\
    \xi_{i}   &\sim \mathcal{N}\left(0,\sigma^{2}_{b}\right) \\
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{w}\right)
\end{alignat*}
$$

tells us *very little* about the data structure. We do not get the same hierarchical intuition about the data-generating process when written this way. Indeed, it just looks like we have stuck an additional error term into the normal linear model. Although mathematically that *is* all that has happened, intuitively there is less of a clear sense of *why* this happened and what it implies. This is why we spent so long with the multilevel model, because jumping over the multilevel perspective straight into mixed-effects is a recipe for confusion. Indeed, this is exactly what many applied researchers do when told to use mixed-effects models. However, this can lead to a conceptual gap where mixed-effects models are just not understood as well as they could be if the route through multilevel models was taken instead. So we can think, very broadly, that the hierarchical perspective is the most useful *intuitively*, but the mixed-effects perspective is the most useful *practically*.

## Mixed-effects Using `nlme`

```r
library('datarium')
library('reshape2')

data('selfesteem')

# repeats and number of subjects
t <- 3
n <- dim(selfesteem)[1]

# reshape wide -> long
selfesteem.long <- melt(selfesteem,            # wide data frame
                        id.vars='id',          # what stays fixed?
                        variable.name="time",  # name for the new predictor
                        value.name="score")    # name for the new outcome

selfesteem.long           <- selfesteem.long[order(selfesteem.long$id),] # order by ID
rownames(selfesteem.long) <- seq(1,n*t)                                  # fix row names
selfesteem.long$id        <- as.factor(selfesteem.long$id)               # convert ID to factor
```
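As a minimal sketch of how the collapsed model might then be fit, the example below uses `lme()` from `nlme`. It assumes that the intended model is the random-intercept model from earlier in this part, with `time` playing the role of $\alpha_{j}$ and `id` indexing the subjects. The object name `lme.mod` is purely illustrative, and the data preparation is repeated in compact form so that the block runs on its own.

```r
library(nlme)      # provides lme()
library(datarium)  # provides the selfesteem data
library(reshape2)  # provides melt()

# Compact version of the wide -> long reshaping
data('selfesteem')
selfesteem.long    <- melt(selfesteem, id.vars='id',
                           variable.name='time', value.name='score')
selfesteem.long$id <- as.factor(selfesteem.long$id)

# Fixed part:  score ~ time   (mu + alpha_j)
# Random part: ~ 1 | id       (a random intercept xi_i for each subject)
# The residual error (eta_ij) is included automatically
lme.mod <- lme(score ~ time, random = ~ 1 | id, data = selfesteem.long)

summary(lme.mod)   # fixed-effects estimates and tests
VarCorr(lme.mod)   # "(Intercept)" row: between-subjects; "Residual" row: within-subject
```

The `random = ~ 1 | id` argument is read as "a random intercept, grouped by `id`", which is the single-equation counterpart of the Level 2 model $\mu_{i} = \mu + \xi_{i}$.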

## Mixed-effects Using `lme4`
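As a minimal sketch, assuming the same random-intercept model as above, the equivalent fit with `lmer()` from `lme4` places the random part inside the model formula rather than in a separate argument. The object name `lmer.mod` is illustrative, and the data preparation is repeated in compact form so that the block runs on its own.

```r
library(lme4)      # provides lmer()
library(datarium)  # provides the selfesteem data
library(reshape2)  # provides melt()

# Compact version of the wide -> long reshaping
data('selfesteem')
selfesteem.long    <- melt(selfesteem, id.vars='id',
                           variable.name='time', value.name='score')
selfesteem.long$id <- as.factor(selfesteem.long$id)

# Fixed part:  time       (mu + alpha_j)
# Random part: (1 | id)   (a random intercept xi_i for each subject)
lmer.mod <- lmer(score ~ time + (1 | id), data = selfesteem.long)

summary(lmer.mod)   # fixed effects and variance components
VarCorr(lmer.mod)   # standard deviations for id (between) and Residual (within)
```

Here `(1 | id)` plays the same role as `random = ~ 1 | id` in `nlme`. Both functions use REML estimation by default.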

## Advantages of Mixed-effects

### Modelling Data Structure
... In our previous discussion, GLS was used as a method to get around the presence of correlation within repeated measurement data. There, we imposed a covariance structure for GLS to remove. What linear mixed-effects (LME) models do instead is get us to model the *structure* of the data. We do not worry about dependence because this is accommodated *automatically* as a consequence of the structure. As long as we get the structure right, everything else follows. So this is an entirely different perspective. We are not trying to model correlation, we are trying to get the structure correct. Once we do, everything gets worked out for us, correlation and all.
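To illustrate the difference in perspective, the sketch below fits the same simulated data in two ways: with `gls()`, where a compound symmetric correlation structure is imposed explicitly, and with `lme()`, where only the data structure (repeats nested within subjects) is specified. The simulated data and all parameter values are illustrative assumptions, chosen so that the true within-subject correlation is positive.

```r
library(nlme)   # provides gls(), lme() and corCompSymm()
set.seed(42)

# Simulate from the random-intercept model (assumed values)
n   <- 50                               # subjects
t   <- 3                                # repeats
sim <- data.frame(id   = factor(rep(1:n, each = t)),
                  time = factor(rep(paste0("t", 1:t), times = n)))
xi        <- rnorm(n, mean = 0, sd = 2)              # between-subjects errors
sim$score <- 5 + c(0, 1, 2)[as.integer(sim$time)] +  # mu + alpha_j
             xi[as.integer(sim$id)] +                # xi_i
             rnorm(n * t, mean = 0, sd = 1)          # eta_ij

# GLS: we state the correlation structure ourselves
gls.mod <- gls(score ~ time, correlation = corCompSymm(form = ~ 1 | id), data = sim)

# LME: we state the data structure and the correlation follows
lme.mod <- lme(score ~ time, random = ~ 1 | id, data = sim)

# Correlation implied by the LME variance components
vc      <- as.numeric(VarCorr(lme.mod)[, "Variance"])  # (Intercept), Residual
rho.lme <- vc[1] / sum(vc)

# Correlation estimated directly by GLS
rho.gls <- coef(gls.mod$modelStruct$corStruct, unconstrained = FALSE)

c(lme = rho.lme, gls = unname(rho.gls))
```

In this simple balanced case the two estimates should agree closely, because a random intercept implies compound symmetry. The point is that with `lme()` the correlation was never specified directly. It falls out of the variance components.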

### Pooling and Shrinkage
...

### The Bias-Variance Tradeoff
... So, from this perspective, LME models provide a solution to this issue via pooling and shrinkage. GLS provides *no pooling* ....

### Regularisation