# From Multilevel to Mixed-effects
In the previous parts of this lesson we focussed on the multilevel framework. The argument was that this is more intuitive because we can break complex data structures into different levels of variation. This allows us to think hierarchically about where data comes from, building up our complexity layer-by-layer. We saw this when we focussed on modelling just a single subject as our experimental unit of interest. We then built in more subjects and expanded the model logically until we came to the multilevel representation. Importantly, this was all based on capturing the *data structure*, rather than thinking in terms of the *covariance structure*. Unlike GLS, where we had to reason about the form of the variance-covariance matrix, the covariance structure in the multilevel model was an emergent property of its structure. In a way, we did not need to worry about correlation, as it was taken into account *automatically*. All we needed to think about was the data structure and the rest followed.

Although this focus on *data structure* can be more intuitive, it is not how these types of models are typically fit in practice. Now, this is not so much a *technical* limitation as it is a decision about *perspective*. Software for explicitly fitting multilevel models does exist. For instance, [MLwiN](https://www.bristol.ac.uk/cmm/software/mlwin/), developed by the Centre for Multilevel Modelling at the University of Bristol, is designed *specifically* for fitting models from this perspective. As shown in the screenshot below, models are fit by writing the multilevel equations, which are then estimated from the given data. This provides a *direct* connection between the theory and the application of these models to complex datasets.

```{figure} images/mlwin-screenshot-lge.gif
---
width: 600px
---
```

Another example of this approach is the [HLM](https://ssilive.com/license/hlm) software which, as shown below, also specifies these types of models explicitly in terms of equations for the different levels. This illustrates that it *is* possible to apply the multilevel perspective in practice. It also illustrates why understanding the theory is so important, especially when faced with software that requires you to write the theoretical model down in order to use it.

```{figure} images/hlm-screenshot.png
---
width: 600px
---
```

However, this multilevel perspective is not the typical method used in `R` (or `SPSS`, `STATA` or `SAS`) for these types of models. Instead, these software packages use an *equivalent* approach in the form of *linear mixed-effects* (LME) models. In this part of the lesson, we will connect these two perspectives together and then show how these forms of models can be fit in `R` using both the `nlme` package (familiar from the use of `gls()`) and the more recent `lme4` package. 

## Turning a Multilevel Model into a Mixed-effects Model
To understand the equivalence between *multilevel* models and LME models, we will start with our multilevel model from the previous part of this lesson. We will then walk through the steps of turning it into an LME model.

### The Existing Multilevel Model
To review, we previously finished modelling our simple repeated measures data using the following multilevel specification

$$
\begin{alignat*}{2}
    y_{ij}    &= \mu_{i} + \alpha_{j} + \eta_{ij}             &\quad\text{Level 1} \\
    \mu_{i}   &= \mu + \xi_{i}                                &\quad\text{Level 2}
\end{alignat*}
$$

with

$$
\begin{alignat*}{2}
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{w}\right) \\
    \xi_{i}   &\sim \mathcal{N}\left(0,\sigma^{2}_{b}\right).
\end{alignat*}
$$

We conceptualised the *Level 1* model as capturing variation associated with *individual* subjects, where the error term $\eta_{ij}$ can be used to capture the *within-subject* variance term $\sigma^{2}_{w}$. We then conceptualised the *Level 2* model as capturing variation associated with the *group* of subjects, where the error term $\xi_{i}$ can be used to capture the *between-subjects* variance term $\sigma^{2}_{b}$. We also saw, via simulations, that this structure implied a compound symmetric variance-covariance matrix where 

$$
\begin{alignat*}{1}
    \text{Var}(y_{ij})                 &= \sigma^{2}_{b} + \sigma^{2}_{w} \\
    \text{Cov}(y_{ij},y_{ij^{\prime}}) &= \sigma^{2}_{b}.
\end{alignat*}
$$
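
The implied covariance structure can be checked with a short simulation that follows the two levels directly. This is only a minimal sketch in base `R`: the number of subjects, the number of conditions and every parameter value (`mu`, `alpha`, `sigma_b`, `sigma_w`) are assumptions chosen for illustration, not values taken from the earlier parts of this lesson.

```r
set.seed(42)

n_subj  <- 5000             # number of subjects (assumed; large so estimates are stable)
mu      <- 10               # population mean (assumed)
alpha   <- c(0, 1, 2)       # condition effects alpha_j (assumed)
sigma_b <- 2                # between-subjects SD (assumed)
sigma_w <- 1                # within-subject SD (assumed)
k       <- length(alpha)    # number of repeated measurements per subject

# Level 2: each subject's own mean, mu_i = mu + xi_i
mu_i <- mu + rnorm(n_subj, mean = 0, sd = sigma_b)

# Level 1: y_ij = mu_i + alpha_j + eta_ij (rows = subjects, columns = conditions)
eta <- matrix(rnorm(n_subj * k, mean = 0, sd = sigma_w), nrow = n_subj, ncol = k)
y   <- mu_i + matrix(alpha, nrow = n_subj, ncol = k, byrow = TRUE) + eta

# Empirical variance-covariance matrix of the repeated measurements
S <- cov(y)
print(round(S, 2))
```

Under these assumed values, the diagonal of `S` should sit close to $\sigma^{2}_{b} + \sigma^{2}_{w} = 5$ and the off-diagonal elements close to $\sigma^{2}_{b} = 4$, up to sampling noise.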

### The LME Trick
Turning this model into an LME model only requires a single step. All we do is *collapse* the levels together and then work with a *single* equation. To see how to do this, notice above that the Level 1 equation is given as

$$
y_{ij} = \mu_{i} + \alpha_{j} + \eta_{ij},
$$

with a separate Level 2 equation telling us what $\mu_{i}$ is equal to

$$
\mu_{i} = \mu + \xi_{i}.
$$

Given that we know that $\mu_{i}$ has a wider definition than given in the Level 1 equation, we can simply insert the equality from Level 2 into Level 1 to give

$$
y_{ij} = \overbrace{\mu + \xi_{i}}^{\mu_{i}} + \alpha_{j} + \eta_{ij}.
$$

Now, we have a single equation that contains two constants $(\mu,\alpha_{j})$ and two random error-terms $(\xi_{i},\eta_{ij})$. The constants are termed the *fixed-effects* and the errors are termed the *random-effects*. We would usually arrange like-terms together, so the more typical way of writing this would be

$$
y_{ij} = \overbrace{\mu + \alpha_{j}}^{\text{fixed}} + \overbrace{\xi_{i} + \eta_{ij}}^{\text{random}}
$$

with

$$
\begin{alignat*}{2}
    \xi_{i}   &\sim \mathcal{N}\left(0,\sigma^{2}_{b}\right) \\
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{w}\right).
\end{alignat*}
$$

So, what we have done here is *collapsed* our data hierarchy into a *single level*. This gives us *one* equation with *two* error terms.
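
The compound symmetric structure can also be read directly off this single equation. As a short worked derivation, assume that $\xi_{i}$ and $\eta_{ij}$ are independent of each other and that the $\eta_{ij}$ are independent across measurements (the standard assumption for this model, although it is not stated explicitly above). The fixed effects are constants and so contribute nothing to the variance, giving

$$
\begin{alignat*}{1}
    \text{Var}(y_{ij})                 &= \text{Var}(\xi_{i} + \eta_{ij}) = \text{Var}(\xi_{i}) + \text{Var}(\eta_{ij}) = \sigma^{2}_{b} + \sigma^{2}_{w} \\
    \text{Cov}(y_{ij},y_{ij^{\prime}}) &= \text{Cov}(\xi_{i} + \eta_{ij},\, \xi_{i} + \eta_{ij^{\prime}}) = \text{Var}(\xi_{i}) = \sigma^{2}_{b} \quad (j \neq j^{\prime}).
\end{alignat*}
$$

Two measurements from the same subject share the same $\xi_{i}$, and this shared term is the only source of their covariance. Measurements from *different* subjects share no random terms and so, under these assumptions, are uncorrelated.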

````{admonition} Partitioned Errors Perspective
:class: tip
All of the above may be starting to sound very familiar from our discussion of the *repeated measures ANOVA* a couple of weeks ago. In fact, the repeated measures ANOVA *is a simple LME model*. This is perhaps one of the most important facts of this lesson to understand. It is not uncommon to hear people recommend a mixed-effects model over the repeated measures ANOVA, but this is something of a confusion because a repeated measures ANOVA *is already* a mixed-effects model. The differences are largely in terms of the mechanics of how this model is estimated and used. Because the repeated measures ANOVA is an *older* approach, it is not part of a larger or more-general framework for mixed-effects. Instead, it is an application of mixed-effects, within a more restrictive traditional framework. 

This fits with everything we know already. We saw how we could make the repeated measures ANOVA work using `lm()`, but only by including the random effects in the model equation as a way of *forcing* the error to be split. We then had to ignore the tests on the individual coefficients and construct the ANOVA table manually so that the correct terms were treated as error (i.e. the random effects) and the other terms were treated as effects (i.e. the fixed effects). This all aligns with trying to fit the model given above, but within a framework that only has a single error term. 

Aligning the multilevel/LME models with the repeated measures ANOVA also gives another useful perspective in terms of *partitioned errors*. We could rewrite the model above as

$$
\begin{alignat*}{1}
    y_{ij}        &= \mu + \alpha_{j} + \epsilon_{ij} \\
    \epsilon_{ij} &= \xi_{i} + \eta_{ij}
\end{alignat*}
$$

where we think just in terms of a normal linear model, but with a more structured error term that can be split into multiple sources[^error-foot]. The split is exactly what the repeated measures ANOVA is trying to do. This can be helpful because it highlights that the *mean function* is no different to a normal linear model and the element that changes in a mixed-effects model is the *variance function*. We saw this previously when discussing the repeated measures ANOVA and now we see it crop up again because the repeated measures ANOVA *is* a partitioned error model and thus *is* a mixed-effects model.

Perhaps the key thing to recognise from all of this is that, because of this equivalence, modelling the *same* data in the *same* way using a mixed-effects approach can be *no better than the repeated measures ANOVA*. In fact, in some ways, it can be *worse*. We have already seen that the most basic mixed-effects model implies a compound symmetric covariance structure, so this does not fix this issue with the repeated measures ANOVA. So, moving to mixed-effects is not a guarantee of a better model. The framework is more flexible, yes. The estimation method is more flexible, yes. However, the results can be *no better*. In practice, this is not always well understood by applied researchers.
````
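
The partitioned-error view can be illustrated numerically by splitting the error by hand, in the spirit of the repeated measures ANOVA. The sketch below uses base `R` only and rests on several assumptions that are not part of the lesson's own examples: a balanced design, illustrative parameter values, and the standard expected mean squares for this design, $E[\text{MS}_{\text{subjects}}] = \sigma^{2}_{w} + k\sigma^{2}_{b}$ and $E[\text{MS}_{\text{error}}] = \sigma^{2}_{w}$, where $k$ is the number of repeated measurements.

```r
set.seed(1)

n_subj  <- 1000             # number of subjects (assumed)
alpha   <- c(0, 1, 2)       # condition effects (assumed)
k       <- length(alpha)    # repeated measurements per subject
sigma_b <- 2                # between-subjects SD (assumed)
sigma_w <- 1                # within-subject SD (assumed)

# Simulate y_ij = mu + alpha_j + xi_i + eta_ij with mu = 10 (assumed)
xi <- rnorm(n_subj, sd = sigma_b)
y  <- 10 + matrix(alpha, n_subj, k, byrow = TRUE) + xi +
      matrix(rnorm(n_subj * k, sd = sigma_w), n_subj, k)

grand     <- mean(y)
subj_mean <- rowMeans(y)
cond_mean <- colMeans(y)

# Between-subjects part of the error: mean square for subjects
ms_subj <- k * sum((subj_mean - grand)^2) / (n_subj - 1)

# Within-subject part: what is left after removing subject and condition means
res    <- y - outer(subj_mean, cond_mean, "+") + grand
ms_err <- sum(res^2) / ((n_subj - 1) * (k - 1))

# Method-of-moments variance components from the expected mean squares
sigma2_w_hat <- ms_err
sigma2_b_hat <- (ms_subj - ms_err) / k
print(c(between = sigma2_b_hat, within = sigma2_w_hat))
```

With these assumed values the two estimates should land near $\sigma^{2}_{b} = 4$ and $\sigma^{2}_{w} = 1$. The point is not the estimation method itself, but that the single error term of the normal linear model has been split into the same two sources that appear in the mixed-effects equation.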

## Why Collapse the Hierarchy?
As demonstrated above, the LME model comes from collapsing the different levels of our hierarchy into a single model equation. So, why would we want to do this? Beyond the practical reasons around software implementations, are there any other advantages to thinking in terms of mixed-effects, rather than multiple levels?

There are some arguments in favour of this, not least that you may actually find the LME perspective more intuitive than thinking in levels of variation. This is something of a personal preference and you will need to decide for yourself how you want to think about all this. The preference in this lesson is for *thinking* in a hierarchical fashion and then *translating* that into mixed-effects. However, you may find it more appealing to just jump into mixed-effects and think about linear models with multiple error terms. The choice is yours.

Conceptually, the LME perspective is useful because it makes it clear that these models are always estimated as a *single unit*, with all the levels informing each other. The multilevel perspective can give the impression that the levels are *separate entities*, rather than a part of the same whole. When it comes to estimating these models using ML/REML, they will always be expressed in a collapsed fashion. This also prevents the confusion that the levels could be estimated *separately*. Separate estimation is possible, and is known more generally as a *summary statistics* approach[^summarystat-foot], but it is not a *true* multilevel model because it prevents the levels from informing each other. It reduces the process to a collection of separate regression models, which is *not* how these methods typically work.

The argument in this lesson is that collapsing the hierarchy has a distinct *disadvantage* in terms of intuition. Although both perspectives are *mathematically equivalent*, the model

$$
\begin{alignat*}{1}
    y_{ij}    &=    \mu + \alpha_{j} + \xi_{i} + \eta_{ij}   \\
    \xi_{i}   &\sim \mathcal{N}\left(0,\sigma^{2}_{b}\right) \\
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{w}\right)
\end{alignat*}
$$

tells us *very little* about the data structure. We do not get the same intuition about the data-generating process when written this way. Indeed, it just looks like we have stuck an additional error term into the normal linear model. Although mathematically that *is* all that has happened, intuitively there is less of a clear sense of *why* this happened and what it implies. This is why we spent so long with the multilevel model, because jumping over the multilevel perspective straight into mixed-effects is a recipe for confusion. Indeed, this is exactly what many applied researchers do when told to use mixed-effects models. However, this can lead to a conceptual gap where mixed-effects models are just not understood as well as they could be, if the route through multilevel models was taken instead. So we can think, very broadly, that the hierarchical perspective is the most useful *intuitively*, but the mixed-effects perspective is the most useful *practically*.

## Advantages of Multilevel/LME Models
Before we move on to the *practical* topic of actually fitting these models in `R`, we turn to a brief discussion of the advantages of LME models. Although we have spent a good deal of time discussing the multilevel/LME framework, you may still be wondering *why* we are going to all this effort, especially when GLS appears to solve our problems more readily than anything we have discussed in this lesson. Below, we will outline the main reasons to push forward with this approach. We will only summarise these at this stage, but will provide clearer demonstrations of some of these ideas a little later.

### Modelling Data Structure
In our previous discussion, GLS was used as a method to get around the presence of correlation within repeated measurement data. We did this by imposing a structure for GLS to remove. This has the distinct disadvantage of requiring us to *know* what that structure is in order for this to work. If we do not know this, we simply have to guess. However, if we get this *wrong* then the removal we impose will be *imperfect*. We saw this already in terms of small samples, but think more carefully about the asymptotic assumptions. We said previously that we can take $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$ as the sample size gets larger. However, this is only true if we assume that the structure is correct. If we have assumed compound symmetry then the asymptotic tests (indeed, any tests in GLS) are predicated on assuming that the *form* of the covariance matrix is correct. The uncertainty lies in the values within that structure, not around its general form. So if we are *wrong* everything falls apart.

Furthermore, for repeated measurements, the correlation is a product of the *structure* of the data, yet GLS has no knowledge of this structure. It simply removes what we tell it to because GLS is entirely covariance-focussed. It does not care *where* the correlation comes from, it just treats it as a nuisance to be removed. While this may not be an issue for some types of data, for others using LME models to embed the structure of the data buys us many advantages. One such advantage is that we do not need to reason about what covariance structure is appropriate when we have complex data with multiple sources of dependency. As we will see next week, LME models can be applied to situations with many levels to the hierarchy. Because the covariance structure is an emergent property of the structure, so long as the structure is captured correctly the most appropriate covariance structure will always just fall out of the model. This changes our focus from *capturing covariance* to *capturing structure*. Once we get our head around doing this we are at an advantage because *structure* is more readily defensible. There is no argument against repeated measurements being collected *within* subjects: that is a fact of the data. However, there are plenty of arguments around compound symmetry vs sphericity vs unconstrained covariance matrices.

### Pooling and Shrinkage
Another advantage of LME models is that, by embedding the structure of the data within the model, they can actually *use* that structure to their advantage. Two phenomena that emerge from the LME framework that are *not* seen in GLS are *pooling* and *shrinkage*. In an LME model, unit-specific effects (such as subject means) are estimated using *both* the data from that unit and information about the population as a whole. This is known as *partial pooling*. Units with little data inherently provide *noisy estimates* on their own. In an LME model, this is automatically taken into account by pulling their estimates toward the population average. On the other hand, units with lots of data are allowed to differ from the population average more strongly. This pulling toward the population mean is called *shrinkage*. So, subjects with missing data or fewer observations are less likely to bias the estimates because their lack of information is automatically accommodated. Extreme values are only allowed if there is *a lot* of data supporting them. Importantly, this is an inherent property of estimation within the LME framework. It is not specified by the analyst *a priori*, it is just something that happens. This is why LME models can be so useful in the real world: missing data is not problematic, because it is something the model will automatically accommodate. This is *not* true of GLS, where *no pooling* takes place. GLS makes no adjustments for noisy or uncertain data so estimates are never *shrunk* towards a population average. Indeed, GLS knows nothing about *units* or *subjects* and so this lack of structure means this cannot happen. Subjects with missing data just contribute less to the estimation of the covariance structure, but there is no inherent weighting that uses this lack of information to balance the estimates.
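
To make the idea of shrinkage more concrete, consider the simplest case of a model with only a population mean and subject-specific deviations, $y_{ij} = \mu + \xi_{i} + \eta_{ij}$. If we treat $\mu$, $\sigma^{2}_{b}$ and $\sigma^{2}_{w}$ as *known* (an assumption made purely for illustration, since in practice they are estimated), the predicted mean for a subject with $n_{i}$ observations and raw mean $\bar{y}_{i}$ takes the standard form $\mu + \lambda_{i}(\bar{y}_{i} - \mu)$, with weight $\lambda_{i} = \sigma^{2}_{b} / (\sigma^{2}_{b} + \sigma^{2}_{w}/n_{i})$. The sketch below evaluates this weight in base `R`. The helper `shrink_weight()` and all the numbers are hypothetical.

```r
# Shrinkage weight for a subject with n_i observations (hypothetical helper)
shrink_weight <- function(n_i, sigma_b, sigma_w) {
  sigma_b^2 / (sigma_b^2 + sigma_w^2 / n_i)
}

mu      <- 10   # population mean (assumed known)
sigma_b <- 2    # between-subjects SD (assumed known)
sigma_w <- 4    # within-subject SD (assumed known)

# Three subjects with the same raw mean, but different amounts of data
n_i   <- c(1, 5, 50)
y_bar <- c(16, 16, 16)

lambda   <- shrink_weight(n_i, sigma_b, sigma_w)
shrunken <- mu + lambda * (y_bar - mu)

print(data.frame(n_i, y_bar, lambda = round(lambda, 3), shrunken = round(shrunken, 2)))
```

All three subjects have the same raw mean, yet the subject with a single observation is pulled most of the way back to the population mean (a weight of 0.2 on their own data), whereas the subject with 50 observations keeps an estimate close to their raw mean. The weight depends only on the amount of data and the two variance components, which is why nothing needs to be specified by the analyst.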

### The Bias-Variance Tradeoff
A key concept within statistical modelling is the *bias-variance* tradeoff. Here, *bias* refers to a model being *systematically wrong*. Even with more data, the model predictions are off from the true values by some amount. In other words, the model keeps *missing* in some consistent way. *Variance* then refers to *instability*. With each new sample the model estimates differ *wildly*, with no consistency from sample to sample. These are two ends of a spectrum. *Bias* comes from a model that is too *rigid*, so irrespective of the data it keeps making the same mistakes. *Variance* comes from a model that is too *flexible*. It over-reacts to new data, shifting itself around constantly. A model with *high bias* will give similar answers with new data, so it is *consistent* and *reliable*, but *inflexible* and may not fit the data very well. A model with *high variance* will keep changing its answers with new data, so it is *inconsistent* and *unreliable*, but *flexible* and may fit the data much better. The *trade-off* is that we cannot minimise *both*. We either keep our model *simple* and *rigid*, which reduces variance and increases bias, or we make our model *complex* and *flexible*, which increases variance but reduces bias. We cannot do *both*.

An LME model is useful here because it provides one particular "solution" to this issue. By applying *partial pooling* and *shrinking* values towards the population average the model is able to introduce a small amount of *bias*. By not allowing the estimates to be solely governed by the noisy data of any particular individual, the model becomes more rigid and less reactive to the data. This is a principled way of introducing a little bit of sensible bias, which then allows for a substantial reduction in variance. By applying partial pooling, new data from the same population will not dramatically shift the model estimates. The model is not so rigid that it will not change for new data, but that change has been regulated by balancing the information provided by noisy subjects and using the population *as a whole* to determine sensible estimates. This is a balancing of information that provides a sensible compromise, yet the analyst does not need to do anything. This is all automatic and built into the framework. A GLS model does not, indeed *cannot*, do this.
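
The trade described here can be seen in a small simulation. The following is a hedged sketch in base `R`, not a fitted LME model: it assumes the simple model $y_{ij} = \mu + \xi_{i} + \eta_{ij}$, treats the population mean and variance components as known, and uses illustrative values throughout. It compares the raw subject means (*no pooling*) against shrunken means (*partial pooling*) in terms of how far each lands from the true subject means.

```r
set.seed(123)

n_subj  <- 2000   # number of subjects (assumed)
n_obs   <- 3      # observations per subject (assumed; small, so raw means are noisy)
mu      <- 10     # population mean (assumed known)
sigma_b <- 1      # between-subjects SD (assumed known)
sigma_w <- 2      # within-subject SD (assumed known)

# True subject means, and noisy observations around them (rows = subjects)
true_mean <- mu + rnorm(n_subj, sd = sigma_b)
y         <- matrix(rnorm(n_subj * n_obs, mean = true_mean, sd = sigma_w),
                    nrow = n_subj, ncol = n_obs)

# No pooling: each subject's raw mean
no_pool <- rowMeans(y)

# Partial pooling: shrink each raw mean towards the population mean
lambda       <- sigma_b^2 / (sigma_b^2 + sigma_w^2 / n_obs)
partial_pool <- mu + lambda * (no_pool - mu)

# Mean squared error of each estimator around the true subject means
mse_no_pool <- mean((no_pool - true_mean)^2)
mse_partial <- mean((partial_pool - true_mean)^2)
print(c(no_pool = mse_no_pool, partial_pool = mse_partial))
```

For any individual subject the shrunken estimate is biased, because it is deliberately pulled towards `mu`. Even so, under these assumptions the partially pooled estimates should show a clearly smaller mean squared error than the raw means, because the reduction in variance outweighs the bias that was introduced.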

### Regularisation
Another perspective on *partial pooling* is as a form of *regularisation*. Regularisation is a broader concept in mathematics, but we can think of it as the use of some sort of *penalty* or *restriction* that forces our model to give us *sensible answers*. This is key because ultimately any mathematical model simply produces numbers based on some other numbers. In order for this to be useful, we need to introduce constraints in terms of what numbers the model can spit out. If it gives us values that are way off the scale of what we are measuring, or is predicting responses that are implausible, then it is not a very good model. The introduction of *penalties* or *restrictions* is known as *regularising the solution*. These penalties are not usually all-or-nothing. We do not make certain values *impossible*, rather the penalty scales with plausibility *and* the amount of data. The model is therefore allowed to consider extreme values, but *only* when these are overwhelmingly supported by the data. Bayesian methods embed this in their overarching framework. Frequentist methods do not always do this, but the behaviour of an LME model does allow this to a degree. Pooling and shrinkage are a *form* of regularisation. A subject is only allowed to have a mean that is *extreme*, in terms of its distance from the population average, if they also have *a lot* of data supporting this value. If they do not, their mean will be *shrunk* towards the population average in order to *regularise* the solution. This keeps the model *sensible* and balances the structure we have imposed with what the data is telling us.
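
One way to make the link to regularisation explicit is the *penalised least-squares* view of the model. This is a standard result rather than something derived in this lesson, and it is sketched here under the assumption that the variance components $\sigma^{2}_{b}$ and $\sigma^{2}_{w}$ are known. In that case, the fixed-effect estimates and the predicted subject deviations $\xi_{i}$ are the values that jointly minimise

$$
\sum_{i}\sum_{j} \frac{\left(y_{ij} - \mu - \alpha_{j} - \xi_{i}\right)^{2}}{\sigma^{2}_{w}} + \sum_{i} \frac{\xi_{i}^{2}}{\sigma^{2}_{b}}.
$$

The first term rewards fitting the data closely, while the second term is a *penalty* on how far each subject is allowed to sit from the population average. When $\sigma^{2}_{b}$ is small the penalty is strong and the $\xi_{i}$ are shrunk heavily towards zero. When a subject has many observations, the first term contains many contributions from that subject and so the data can outweigh the penalty. This is the same behaviour described above: extreme subject means are permitted, but only when there is enough data to support them.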

`````{topic} What do you now know?
In this section, we have explored ... . After reading this section, you should have a good sense of :

- ...
- ...
- ...

`````

[^error-foot]: This also explains why we did not use the usual symbol of $\epsilon_{ij}$ for the errors until now.

[^summarystat-foot]: In the wild, you find summary statistic approaches applied to problems such as *meta-analyses*, where the effect sizes from each study are taken as the summaries, or in *fMRI*, where estimates of signal change from each subject are taken as the summaries.