# The Multilevel Framework
Now that we have established some general topics around mixed-effects models, we can start our journey into the world of mixed-effects from the perspective of a *multilevel* model. This is primarily because we can build the pieces of a mixed-effects model very slowly from first principles, allowing the logic to become much clearer. In addition, the multilevel framework is often the most *intuitive* way to think about these models, even though they are less frequently implemented in this fashion. So, we generally advise *thinking* about these models in a multilevel fashion, even if *practically* we end up specifying them in a mixed-effects fashion. This will all become clearer once we have discussed *both* perspectives.

## Fitting a Model to One Subject
To start understanding multilevel models, let us imagine that we only have the data for *one subject*. Going back to the long-formatted `anxiety` data from `datarium`, let us extract the data associated with the first subject, who has `id == 31`.

In [9]:
library('datarium')
library('reshape2')

data('anxiety')
anxiety.grp3 <- anxiety[anxiety$group == 'grp3',]   # high exercise group
anxiety.grp3 <- subset(anxiety.grp3, select=-group) # remove group column

# repeats and number of subjects
t <- 3
n <- dim(anxiety.grp3)[1]

# reshape wide -> long
anxiety.long <- melt(anxiety.grp3,             # wide data frame
                     id.vars='id',             # what stays fixed?
                     variable.name="time",     # name for the new predictor
                     value.name="score")       # name for the new outcome

anxiety.long           <- anxiety.long[order(anxiety.long$id),] # order by ID
rownames(anxiety.long) <- seq(1,n*t)                            # fix row names
anxiety.long$id        <- as.factor(anxiety.long$id)            # convert ID to factor

In [10]:
sub.1 <- anxiety.long[anxiety.long$id == '31',]
print(sub.1)

  id time score
1 31   t1  14.6
2 31   t2  13.0
3 31   t3  11.7


So, we can see that we have 3 repeated measurements associated with the 3 values of `time`. Importantly, there are no replications at each time-point, so this is all the information we have available. Now, we know these values will be *correlated* by virtue of coming from the same subject, but we will put that to one side for now because it is a distraction. Instead, our focus here is simply *what model of these data is possible*?

We might be tempted to use

In [11]:
lm.sub.1 <- lm(score ~ time, data=sub.1)

However, there is a problem here. Because each level of `time` is associated with *one* data point, there is no sense in which the parameter estimates can be *average* effects, or *summaries* of any kind. They will simply be *identical* to the raw data. The fitting process aims to minimise the errors, so if it can make them 0 it has done the best job it can. In this example, the model will fit the data *perfectly* and we will be left with *no error*. This means we can get parameter estimates, but nothing else. So, we end up with this

In [12]:
summary(lm.sub.1)


Call:
lm(formula = score ~ time, data = sub.1)

Residuals:
ALL 3 residuals are 0: no residual degrees of freedom!

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     14.6        NaN     NaN      NaN
timet2          -1.6        NaN     NaN      NaN
timet3          -2.9        NaN     NaN      NaN

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:    NaN 
F-statistic:   NaN on 2 and 0 DF,  p-value: NA


This is an important point because the *data* are what *constrain* the model. We might *desire* something more complex, but we can only fit it if the data support it. This is important to understand as we go forward.
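
The perfect fit can also be checked directly. The following is a minimal sketch that rebuilds subject 31's data by hand from the printed values (so it does not need `datarium`) and counts observations against parameters. The object names `lm.sat`, `n.obs`, `n.par`, `df.resid` and `max.abs.resid` are our own labels for illustration.

```r
# Subject 31's three scores, copied from the printed output above
sub.1 <- data.frame(
    id    = factor(rep("31", 3)),
    time  = factor(c("t1", "t2", "t3")),
    score = c(14.6, 13.0, 11.7)
)

# One parameter per observation, so the model is saturated
lm.sat <- lm(score ~ time, data = sub.1)

n.obs    <- nrow(sub.1)            # number of observations
n.par    <- length(coef(lm.sat))   # number of estimated coefficients
df.resid <- df.residual(lm.sat)    # n.obs - n.par

# The fitted values reproduce the raw scores, so every residual is (numerically) zero
max.abs.resid <- max(abs(residuals(lm.sat)))

print(c(n.obs = n.obs, n.par = n.par, df.resid = df.resid))
```

With as many coefficients as data points there is nothing left over to estimate $\sigma^{2}$, which is why every standard error in the summary is `NaN`.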

So, if we cannot fit the model we want, what can we fit? Well, because the problem is that we have no replications within each level of `time`, we cannot use the variable `time` at all. Instead, the best we can do is just fit an intercept. The model for subject $i = 1$ from time-point $j$ is simply

$$
\begin{alignat*}{1}
    y_{1j}    &= \mu_{1} + \eta_{1j} \\
    \eta_{1j} &\sim \mathcal{N}\left(0,\sigma^{2}\right)
\end{alignat*}
$$

Where we have used $\eta$ to refer to the errors rather than the usual $\epsilon$, for reasons that will become clearer as we progress. In `R`, this model would be

In [13]:
lm.sub.1 <- lm(score ~ 1, data=sub.1)
summary(lm.sub.1)


Call:
lm(formula = score ~ 1, data = sub.1)

Residuals:
   1    2    3 
 1.5 -0.1 -1.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  13.1000     0.8386   15.62  0.00407 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.453 on 2 degrees of freedom


So now we have residuals and everything else will work. There is nothing ground-breaking or Earth-shattering about any of this. All we are concluding is that, based on only having a *single* subject from this experiment, the best we could do is model a *subject-specific constant* and nothing else. In this example, the average value of `score` for subject 1 was estimated to be $\hat{\mu}_{1} = 13.10$. That is it.
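
To see that nothing mysterious is happening, the numbers in this summary can be recovered with basic descriptive statistics. The sketch below assumes only the three scores printed for subject 31. The names `mu.hat`, `sd.hat` and `se.hat` are ours.

```r
# Scores for subject 31, copied from the printed output above
score <- c(14.6, 13.0, 11.7)

# Intercept-only model
lm.int <- lm(score ~ 1)

mu.hat <- unname(coef(lm.int))        # estimated subject mean
sd.hat <- sd(score)                   # sample standard deviation
se.hat <- sd.hat / sqrt(length(score)) # standard error of the mean

# The intercept is the sample mean, the residual standard error is the sample SD
print(c(intercept = mu.hat, mean = mean(score), sd = sd.hat, se = se.hat))
```

So the intercept-only model is just the sample mean of the three scores, with the usual standard error of a mean.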

## Extending the Model to Multiple Subjects
Of course, we do not *only* have subject 1. So let us introduce subject 2 into this framework and see where it gets us. Much like subject 1, if we extract the data for `id == '32'` and consider it in *isolation*, all we can do is the following

In [15]:
sub.2 <- anxiety.long[anxiety.long$id == '32',]
print(sub.2)

  id time score
4 32   t1  15.0
5 32   t2  13.0
6 32   t3  11.9


In [16]:
lm.sub.2 <- lm(score ~ 1, data=sub.2)
summary(lm.sub.2)


Call:
lm(formula = score ~ 1, data = sub.2)

Residuals:
   4    5    6 
 1.7 -0.3 -1.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  13.3000     0.9074   14.66  0.00462 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.572 on 2 degrees of freedom


and then conclude that the average `score` for subject 2 is $\hat{\mu}_{2} = 13.30$. In isolation, we therefore have the following two models

$$
\begin{alignat*}{1}
    y_{1j} &= \mu_{1} + \eta_{1j} \\
    y_{2j} &= \mu_{2} + \eta_{2j} \\
\end{alignat*}.
$$

Most importantly, though, we are not *really* working in isolation. We have *both* subjects together. Recall that the problem with trying to fit an effect of `time` within a single subject was that there were no replications and thus *no variance*. This is true *within* each subject, but *across* the two subjects we *do* have replications of each level of `time`. If we put the two datasets together, we get

In [8]:
rbind(sub.1,sub.2) # row-bind function

  id time score
1 31   t1  14.6
2 31   t2  13.0
3 31   t3  11.7
4 32   t1  15.0
5 32   t2  13.0
6 32   t3  11.9

So, now we have *two* values of `t1`, *two* values of `t2` and *two* values of `t3`. The key insight is that *if* we assume that both subjects share the same effect of `time`, we can estimate its effect *across* them. More formally, if we imagine that both subjects are drawn from the same population with a constant effect of `time`, we can use the data from both of them to estimate what that effect is. This is distinct from imagining that the effect of `time` is different and unique to each subject, in which case we could not pool their data because we would be mixing together two different effects. So, as long as we conceptualise `time` as something *shared* between the subjects, we *can* introduce an effect of `time`. If we call the effect of the $j$th level of `time` $\alpha_{j}$, we can then think of these two models as

$$
\begin{alignat*}{1}
    y_{1j} &= \mu_{1} + \alpha_{j} + \eta_{1j} \\
    y_{2j} &= \mu_{2} + \alpha_{j} + \eta_{2j} \\
\end{alignat*}
$$

which, across all subjects, gives us

$$
y_{ij} = \mu_{i} + \alpha_{j} + \eta_{ij}.
$$

So, we now have a *subject-specific* mean ($\mu_{i}$) and an effect of `time` ($\alpha_{j}$). But notice that there is no subject index on $\alpha_{j}$ meaning that its value is *the same* irrespective of the specific subject. This is important because it indicates that $\alpha_{j}$ captures something *universal* across *all* subjects. 
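
Before moving to all 15 subjects, it can help to see the pooling work with just the two subjects discussed so far. The sketch below rebuilds their data by hand from the printed values and fits the model with a shared effect of `time`. The names `two.subs` and `lm.two` are our own.

```r
# Subjects 31 and 32, copied from the printed output above
two.subs <- data.frame(
    id    = factor(rep(c("31", "32"), each = 3)),
    time  = factor(rep(c("t1", "t2", "t3"), times = 2)),
    score = c(14.6, 13.0, 11.7, 15.0, 13.0, 11.9)
)

# Subject-specific terms plus a single effect of time shared by both subjects
lm.two <- lm(score ~ 0 + id + time, data = two.subs)

print(coef(lm.two))        # id31, id32, timet2, timet3
print(df.residual(lm.two)) # 6 observations - 4 parameters
```

Six observations and four parameters leave two residual degrees of freedom, so the effect of `time` that was impossible to estimate within one subject becomes estimable once we assume it is shared. Note that, under `R`'s default treatment coding, the `id` coefficients here are each subject's expected `score` at `t1` rather than their average over all three time-points.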

In `R`, this complete model would be

In [18]:
lm.all.subs <- lm(score ~ 0 + id + time, data=anxiety.long)
summary(lm.all.subs)


Call:
lm(formula = score ~ 0 + id + time, data = anxiety.long)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43111 -0.15111  0.01556  0.20222  0.61556 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
id31    14.9178     0.1865   79.98   <2e-16 ***
id32    15.1178     0.1865   81.06   <2e-16 ***
id33    14.8844     0.1865   79.81   <2e-16 ***
id34    15.7844     0.1865   84.63   <2e-16 ***
id35    16.1844     0.1865   86.78   <2e-16 ***
id36    16.8178     0.1865   90.17   <2e-16 ***
id37    17.4844     0.1865   93.75   <2e-16 ***
id38    17.4511     0.1865   93.57   <2e-16 ***
id39    17.6511     0.1865   94.64   <2e-16 ***
id40    17.2178     0.1865   92.32   <2e-16 ***
id41    17.8844     0.1865   95.89   <2e-16 ***
id42    17.6178     0.1865   94.46   <2e-16 ***
id43    18.6511     0.1865  100.00   <2e-16 ***
id44    18.4844     0.1865   99.11   <2e-16 ***
id45    19.0511     0.1865  102.15   <2e-16 ***
timet2  -2.0000     0.1108  -18.05   <2e-16 ***
t

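where we have suppressed the intercept so that each `id` coefficient is a subject-specific term in its own right, rather than a difference from a reference subject. Under `R`'s default treatment coding, these are each subject's expected `score` at the reference level `t1`, which is why `id31` is 14.92 rather than the raw mean of 13.10 from earlier. Suppressing the intercept is not technically necessary, but helps conceptualise what we are doing. So, we now have *subject-specific* intercepts, as well as *universal* effects of `time`.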
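
The claim that suppressing the intercept is optional can be checked with a small sketch. Using only the two subjects printed earlier (rebuilt by hand here), we fit the model both ways and compare the fitted values. The object names are ours.

```r
# Subjects 31 and 32, copied from the printed output earlier
two.subs <- data.frame(
    id    = factor(rep(c("31", "32"), each = 3)),
    time  = factor(rep(c("t1", "t2", "t3"), times = 2)),
    score = c(14.6, 13.0, 11.7, 15.0, 13.0, 11.9)
)

lm.no.int   <- lm(score ~ 0 + id + time, data = two.subs) # one term per subject
lm.with.int <- lm(score ~ id + time,     data = two.subs) # reference subject + offsets

# Same model, different parameterisation: the fitted values agree
same.fit <- isTRUE(all.equal(fitted(lm.no.int), fitted(lm.with.int)))
print(same.fit)

print(coef(lm.no.int))   # id31 and id32 reported directly
print(coef(lm.with.int)) # (Intercept) is subject 31, id32 is the difference from 31
```

Both versions describe the data identically. The only difference is whether the subject terms are reported directly or as differences from the first subject.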

In general, what we have done here is create a model that has a *different predicted value for each subject*. Each subject gets their *own model*. Subject 1 is

$$
E(y_{1j}) = \mu_{1} + \alpha_{j} = \mu_{1j},
$$

subject 2 is

$$
E(y_{2j}) = \mu_{2} + \alpha_{j} = \mu_{2j}
$$

and so on. It is like the subjects form the *cells* of the design. Crucially, each subject's expected value is a *combination* of something specific to them $\left(\mu_{1}, \mu_{2}, \dots, \mu_{n}\right)$ and something *universal* from across subjects $\left(\alpha_{1},\alpha_{2},\alpha_{3}\right)$. Thus, this model captures two important elements of our data: the *idiosyncrasies* of the individual and the *constant effect* in the population. 

## Understanding $\mu_{i}$ as a *Random Variable*
In order to understand the next steps towards a complete multilevel model, we need to imagine that we ran the experiment *again* and collected a new subject. Take another look at the model

$$
y_{ij} = \mu_{i} + \alpha_{j} + \eta_{ij}
$$

and remember that this is the theoretical *population-level* description of the data-generating process. As such, what do we imagine changes about this description for each *new* subject? Ponder this for a moment.

If we have a new subject, we have a new value of $i$. So which of the terms above depend upon $i$? Certainly $\mu_{i}$ does. Each subject has their own unique mean that is specific to them, so a *new* subject means a *new* value of $\mu_{i}$. In addition, the errors $\eta_{ij}$ depend upon the value of $i$ and will change with every observation. So, when we sample someone new both $\mu_{i}$ and $\eta_{ij}$ will change.

Importantly, what does *not* change is $\alpha_{j}$. Why? Because this has *no* subject-specific index. $\alpha_{1}$ is the same whether $i = 1$ or $i = 1,427$. This is because this is a *universal effect* across all subjects. Its *estimate* would change with another subject (because there is now *more data*), but remember we are thinking about the true population description of the data-generating process here. In this sense, $\alpha_{j}$ is a *constant*. It does not change with each sample; it remains a *fixed* and *unwavering* element of the universe. As stated earlier, we are imagining that each subject is drawn from a *population* with a mean that depends upon $\alpha_{j}$. So $\alpha_{j}$ is a *constant* of the population distribution we are trying to estimate. If we had the whole population, we would know $\alpha_{j}$ with absolute certainty and we would not need statistics.

So, if $\alpha_{j}$ is *fixed*, what does this mean for $\mu_{i}$ and $\eta_{ij}$? Well, what do we call a variable whose value changes every time we observe it? A *random variable*. So, both $\mu_{i}$ and $\eta_{ij}$ have to be *random variables*. We already know this about $\eta_{ij}$ because we always think of the errors as random and ascribe them a probability distribution. So this is nothing new. What *is* new is having *another random variable in the model*. In fact, pretty much every complication that follows is a direct knock-on effect of having *multiple random variables* in the model.


````{admonition} Fixed-effects and Random-effects
:class: tip
Although we are currently focusing on the *multilevel* perspective, we can already see the *mixed-effects* perspective creeping in. A mixed-effects model is, by definition, a model that contains *both* population-level constants *and* random variables. These are usually referred to as *fixed-effects* and *random-effects*. So, in our example so far we have

- $\alpha_{j}$ = a *fixed-effect* 
- $\mu_{i}$ = a *random-effect*

We will discuss more about what it means to have these different types of effects in a model later in the lesson. For now, the only thing to note is that the *multilevel* and *mixed-effects* frameworks cannot be separated. A discussion of one is naturally a discussion of the other. They are simply different ways of viewing the *same thing*.
````

Because *both* $\mu_{i}$ and $\eta_{ij}$ are *random variables* they will both, by definition, have some probability distribution that describes their behaviour over repeated samples. As we know, the $\eta_{ij}$ are *errors* and thus reflect *deflections* around the expected value. As such, their distribution is the same as it always was 

$$
\eta_{ij} \sim \mathcal{N}\left(0, \sigma^{2}_{1}\right).
$$

But what about the $\mu_{i}$? 

Well, as written above, these are *means* for each subject, so their expected value will not be 0. We saw already that our estimates for the first two subjects from `anxiety` were $\hat{\mu}_{1} = 13.10$ and $\hat{\mu}_{2} = 13.30$. Clearly these are *not* 0 because they are on the same scale as `score`. So, instead, the expected value of each of these subject-specific means will be whatever their *population grand mean* is. If we assume that each subject is drawn from the *same* distribution, and assume this distribution is normal, we have

$$
\mu_{i} \sim \mathcal{N}\left(\mu, \sigma^{2}_{2}\right),
$$

where $\mu$ is the grand mean of the population. 

We can see an example of this in `R` if we take the subject-specific means from the earlier model and then treat them like new data to be modelled further

In [28]:
mu.i       <- coef(lm.all.subs)[1:n] # subject means estimated from Level 1
lm.level.2 <- lm(mu.i ~ 1)
summary(lm.level.2)


Call:
lm(formula = mu.i ~ 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1289 -1.0289  0.4378  0.7544  2.0378 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  17.0133     0.3501    48.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.356 on 14 degrees of freedom
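
Because this Level 2 model contains only an intercept, its output can be reproduced by hand. The sketch below copies the 15 `id` coefficients from the earlier printed summary (so they are rounded to 4 decimal places) and computes their mean, standard deviation and standard error. The names `mu.hat`, `sd.hat` and `se.hat` are ours.

```r
# The 15 subject-specific coefficients, copied from the printed summary above
mu.i <- c(14.9178, 15.1178, 14.8844, 15.7844, 16.1844,
          16.8178, 17.4844, 17.4511, 17.6511, 17.2178,
          17.8844, 17.6178, 18.6511, 18.4844, 19.0511)

mu.hat <- mean(mu.i)                  # matches the (Intercept) estimate
sd.hat <- sd(mu.i)                    # matches the residual standard error
se.hat <- sd.hat / sqrt(length(mu.i)) # matches the Std. Error column

print(round(c(mean = mu.hat, sd = sd.hat, se = se.hat), 4))
```

So the Level 2 intercept is just the average of the subject-specific terms, and the residual standard error is their spread around that average.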




There are several consequences of the assumptions made above. To begin with, there are now *two* probability distributions describing where our data come from. These distributions have different variances $\left(\sigma^{2}_{1},\sigma^{2}_{2}\right)$, meaning we have *two* sources of error to now consider. We also need to *estimate* both of these variances from the data in order to complete the unknowns for this model. Last semester, our focus was almost exclusively on the *mean function*, but now we can see that we are shifting focus and adding complexity to the *variance function*. This means that our mental model of where our data comes from now consists of *two layers*. This is precisely the conceptualisation that a multilevel model makes explicit.

## The Complete Multilevel Model
Given the discussions above about the nature of $\eta_{ij}$ and $\mu_{i}$, we can now write our model as

$$
\begin{alignat*}{1}
    y_{ij}        &\sim \mathcal{N}\left(\mu_{i} + \alpha_{j}, \sigma^{2}_{1}\right) \\
    \mu_{i}       &\sim \mathcal{N}\left(\mu , \sigma^{2}_{2}\right)                 \\
\end{alignat*}
$$

In the multilevel framework, these equations are considered multiple *layers* or *levels* of the data, and are usually labelled like so 

$$
\begin{alignat*}{2}
    y_{ij}        &\sim \mathcal{N}(\mu_{i} + \alpha_{j}, \sigma^{2}_{1}) &\quad\text{Level 1} \\
    \mu_{i}       &\sim \mathcal{N}(\mu , \sigma^{2}_{2})                 &\quad\text{Level 2} \\
\end{alignat*}
$$

This also implies a *hierarchy* of data-generation, where the data at Level 1 depends upon the data at Level 2. This is why these types of model are also known as *hierarchical* linear models or HLMs.

We can also write this model in a slightly different way. As we know from last semester, we can always separate a probability model into an equation for the expected value plus random error. For instance, we can write a simple regression model as

$$
y_{i} \sim \mathcal{N}(\beta_{0} + \beta_{1}x_{i}, \sigma^{2})
$$

or as

$$
\begin{alignat*}{1}
    y_{i}        &=    \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \\
    \epsilon_{i} &\sim \mathcal{N}(0, \sigma^{2})                \\
\end{alignat*}
$$

where, in the second form, we move the probabilistic behaviour into a new error term that captures the variation in $y$ around the expected value. So, if we apply the same principles here, we can rewrite the multilevel model as

$$
\begin{alignat*}{2}
    y_{ij}    &= \mu_{i} + \alpha_{j} + \eta_{ij}             &\quad\text{Level 1} \\
    \mu_{i}   &= \mu + \xi_{i}                                &\quad\text{Level 2} \\
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{1}\right), &\\
    \xi_{i}   &\sim \mathcal{N}\left(0,\sigma^{2}_{2}\right)  &\\
\end{alignat*}
$$

where we have added a new error term $\xi_{i}$[^xi-foot] at Level 2. When written this way, we can see that actually what we have is two models containing *two error terms*. Each of these error terms captures a different form of *random deviation* and thus a different *source of variance*. Importantly, these are not two *separate* models, they are *connected together*. This is something we can understand more by discussing each level in turn.
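
One way to make the connection between the levels concrete is to *simulate* from the model, working from Level 2 down to Level 1. The following is only a sketch: the values chosen for $\mu$, $\alpha_{j}$, $\sigma_{1}$ and $\sigma_{2}$ are hypothetical numbers picked for illustration and are not estimates from `anxiety`.

```r
set.seed(1)

n <- 15   # number of subjects
t <- 3    # repeated measurements per subject

# Hypothetical population values (assumptions for illustration only)
mu      <- 17                          # grand mean
alpha   <- c(t1 = 0, t2 = -2, t3 = -3) # fixed effects of time
sigma.1 <- 0.5                         # Level 1 (within-subject) SD
sigma.2 <- 1.5                         # Level 2 (between-subject) SD

# Level 2: draw one mean per subject
xi   <- rnorm(n, mean = 0, sd = sigma.2)
mu.i <- mu + xi

# Level 1: draw the repeated measurements around each subject's own mean
sim <- expand.grid(time = names(alpha), id = seq_len(n)) # time varies fastest
eta <- rnorm(n * t, mean = 0, sd = sigma.1)
sim$score <- mu.i[sim$id] + unname(alpha[as.character(sim$time)]) + eta
sim$id    <- factor(sim$id)

print(head(sim))
```

Notice the order of operations. Each $\mu_{i}$ is drawn *once* and then reused for all three measurements from that subject, whereas a fresh $\eta_{ij}$ is drawn for every single observation.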

### Level 1
At Level 1, the model is

$$
\begin{alignat*}{1}
    y_{ij}    &=    \mu_{i} + \alpha_{j} + \eta_{ij}         \\
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{1}\right) \\
\end{alignat*}
$$

So, our data are drawn from a normal distribution with a mean that depends upon two parts: 

- $\mu_{i}$ = something *unique* and *specific* to subject $i$
- $\alpha_{j}$ = something *constant* and *fixed* across subjects. 

At this level, we imagine that the effect of the experimental manipulation is the *same* for every subject. Each subject is drawn from the same population with some fixed value that we are trying to estimate. This is no different to our usual assumption about *regression slopes* or *group means* or any other phenomena we are trying to capture. The difference now is that we also imagine that each subject is *offset* by an amount unique to them. This implies that the *relative* differences in `score` between the levels of `time` are constant in the population, but that the *absolute* value of `score` depends upon the individual. This captures the idea that a single subject may have an overall *lower* or *higher* value of `score`, because each individual will have their own unique baseline level of anxiety. While the `time` manipulation may serve to *increase* or *decrease* anxiety, someone with *low* anxiety will still remain relatively *low* and someone with *high* anxiety will still remain relatively *high*. The measurements from each individual subject are therefore *connected* by the $\mu_{i}$ term.

Importantly, the errors at this level correspond to deviations in the value of `score` from the unique expected value of each subject. For instance, the errors for subject 1 correspond to the deviation 

$$
\begin{alignat*}{1}
    \eta_{1j} &= y_{1j} - (\mu_{1} + \alpha_{j})          \\
              &= y_{1j} - \mu_{1j},
\end{alignat*}
$$

the errors for subject 2 correspond to the deviations 

$$
\begin{alignat*}{1}
    \eta_{2j} &= y_{2j} - (\mu_{2} + \alpha_{j})           \\
              &= y_{2j} - \mu_{2j},
\end{alignat*}
$$ 

and so on. These errors are therefore the scattering of data around the *subject's own unique mean*. The variance captured by these deviations $\left(\sigma^{2}_{1}\right)$ therefore corresponds to the *internal consistency* of each individual subject. It tells us how much a single subject varies in their value of `score` relative to *their own average*. So, this is not how much the subjects differ from each other, this is how much each subject differs *from themselves*. As such, we usually call this the *within-subject* variance and could write $\sigma^{2}_{1} = \sigma^{2}_{w}$.
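
As a rough check on this interpretation, we can simulate data where the within-subject variance is *known* and see whether the Level 1 residuals recover it. This is a sketch under stated assumptions: all population values below are hypothetical, and we use more subjects than `anxiety` has so that the estimate is reasonably stable.

```r
set.seed(2)

n <- 200  # subjects (hypothetical, larger than the real data)
t <- 3    # repeated measurements per subject

# Hypothetical population values (assumptions for illustration only)
mu      <- 17
alpha   <- c(t1 = 0, t2 = -2, t3 = -3)
sigma.1 <- 0.5   # within-subject SD
sigma.2 <- 1.5   # between-subject SD

# Simulate from Level 2, then Level 1
mu.i <- rnorm(n, mean = mu, sd = sigma.2)
sim  <- expand.grid(time = names(alpha), id = seq_len(n))
sim$score <- mu.i[sim$id] + unname(alpha[as.character(sim$time)]) +
             rnorm(n * t, mean = 0, sd = sigma.1)
sim$id <- factor(sim$id)

# Level 1 fit: subject-specific terms plus a shared effect of time
fit <- lm(score ~ 0 + id + time, data = sim)

# Residual variance = scatter around each subject's own expected values
sigma2.w.hat <- sum(residuals(fit)^2) / df.residual(fit)
print(c(estimate = sigma2.w.hat, truth = sigma.1^2))
```

The estimate should land near $\sigma^{2}_{1} = 0.25$ even though the subjects themselves differ far more than that, because the subject-specific terms absorb the differences *between* subjects before the residuals are computed.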


````{admonition} Level 1 Summary
:class: info
In this model, Level 1 explains each subject as an individual entity. Their measured values of `score` can be decomposed into a term unique to them ($\mu_{i}$) and a population-level effect of `time` ($\alpha_{j}$). The errors at this level therefore correspond to deviations in the value of `score` from the unique expected value of each subject. As such, the variance at this level tells us how much the *repeated measurements* differ on average *within* a single subject. 
````


### Level 2
At Level 2, the model is

$$
\begin{alignat*}{1}
    \mu_{i}  &=    \mu + \xi_{i}                             \\
     \xi_{i} &\sim \mathcal{N}\left(0,\sigma^{2}_{2}\right). \\
\end{alignat*}
$$

Here, our outcome variable now consists of the individual *subject means* from Level 1. Because our outcome is always conceived as a *random variable*, this means that $\mu_{i}$ is also a random variable. This fits with our conceptualisation from Level 1, because each $\mu_{i}$ was *unique* to each subject. As such, its value will change with every new sample. Level 2 therefore explains where the individual values of $\mu_{i}$ come from. As written above, we explain each unique value of $\mu_{i}$ as a combination of a *fixed* population-level mean $(\mu)$ and random error $(\xi_{i})$. This is akin to our conceptualisation of a one-sample $t$-test model. Each subject is drawn from an overall population with some constant mean that we want to estimate. The errors at this level therefore correspond to deviations in the subject means from the population mean. For instance, the errors for the first subject mean are

$$
\xi_{1} = \mu_{1} - \mu.
$$

Notice that we are effectively back to a model with *no dependence* here. The Level 2 model is effectively just a simple regular linear model. As such, the variance at this level $\left(\sigma^{2}_{2}\right)$ corresponds to the *consistency of each subject with the group*. It tells us how much the subjects differ from *each other*. As such, we usually call this the *between-subjects* variance and could write $\sigma^{2}_{2} = \sigma^{2}_{b}$.
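
The same kind of simulated check can be made at Level 2. The sketch below uses hypothetical population values and, to keep things short, drops the effect of `time` entirely. It compares the variance of the *true* subject means with the variance of subject means *computed from noisy data*. The second is expected to be a little larger, because averaging $t$ measurements still leaves $\sigma^{2}_{1}/t$ of within-subject noise in each mean.

```r
set.seed(3)

n <- 2000 # subjects (hypothetical, large so the variances are stable)
t <- 3    # repeated measurements per subject

# Hypothetical population values (assumptions for illustration only)
mu      <- 17
sigma.1 <- 0.5   # within-subject SD
sigma.2 <- 1.5   # between-subject SD

# Level 2: true subject means
mu.i <- rnorm(n, mean = mu, sd = sigma.2)

# Level 1: t measurements per subject (rows = subjects, columns = repeats)
y <- matrix(rnorm(n * t, mean = rep(mu.i, times = t), sd = sigma.1), nrow = n)

ybar.i   <- rowMeans(y)  # subject means computed from the data
var.true <- var(mu.i)    # should be close to sigma.2^2
var.obs  <- var(ybar.i)  # should be close to sigma.2^2 + sigma.1^2 / t

print(c(var.true = var.true, var.obs = var.obs,
        sigma2.b = sigma.2^2, sigma2.b.plus.noise = sigma.2^2 + sigma.1^2 / t))
```

This is one reason the two levels need to be estimated *together* rather than one after the other: the subject means we feed into Level 2 are themselves estimates that carry some Level 1 error.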


````{admonition} Level 2 Summary
:class: info
In this model, Level 2 explains the subjects as a group. Each unique subject mean can be decomposed into a group mean $(\mu)$ plus random error. These errors therefore correspond to deviations in the subject means from the group mean. The variance at this level $\left(\sigma^{2}_{2}\right)$ therefore tells us how much the subjects differ *from each other*.
````


### Multilevel Visualisation

```{figure} images/multilevel-diagram.png
---
width: 425px
align: right
name: multilevel-fig
---
An illustration of a basic two-level model.
```

To help further conceptualise the multilevel framework, {numref}`multilevel-fig` presents a diagrammatic representation of the model we have been working with so far. To navigate this, start from the *bottom* and work *upwards*. Two sampled datapoints are illustrated as $y_{i1}$ and $y_{i2}$. These are measured values of the repeated measures conditions $j = 1$ and $j = 2$ for subject $i$. These are conceptualised as drawn from individual distributions for that specific subject. The means of these distributions are then a combination of the unique individual subject mean $\mu_{i}$ and the fixed effects of the two conditions, $\alpha_{1}$ and $\alpha_{2}$. The subject mean $\mu_{i}$ is itself conceptualised as a random draw from the population-level distribution of subjects, with a fixed mean of $\mu$.

There are a few key takeaways from this illustration. Firstly, every term that contains the index $i$ will change with each new subject. These are therefore all *random variables*. If we imagine drawing a new subject, every term that contains an $i$ changes and every term that does *not* contain an $i$ stays the same. Secondly, notice that the terms $\sigma^{2}_{1}$ and $\sigma^{2}_{2}$ describe different *kinds* of variation. $\sigma^{2}_{2}$ describes variation *between* different subjects, whereas $\sigma^{2}_{1}$ describes variation *within* a single subject. Finally, we can think of this model as a *generalisation* of everything we have done previously. When our data were independent, we were thinking only in terms of Level 2. So, the models we were looking at last semester were all representations of Level 2[^levels-foot]. The addition of repeats for each experimental unit creates the additional level of variation. So, all our previous models were really *single-level* and we can think of them as *special cases* of a multilevel model. From that perspective, the multilevel model is really *the* framework that underlies everything[^gelman-foot].

`````{topic} What do you now know?
In this section, we have explored ... . After reading this section, you should have a good sense of :

- ...
- ...
- ...

`````

[^xi-foot]: The Greek letter *xi*.

[^levels-foot]: Importantly, this is Level 2 in *this* model. In general, the levels of a multilevel model do not have any meaning outside of a given dataset. They are just labels for the levels the model happens to have. So, the normal linear model has just *one level* and it would not make sense to talk about this as "Level 2" outside of the current context.

[^gelman-foot]: This is exactly the perspective taken by Gelman & Hill (2007) in their book [*Data Analysis Using Regression and Multilevel/Hierarchical Models*](https://sites.stat.columbia.edu/gelman/arm/). Everything is effectively a regression model with one or more levels.