# The Multilevel Framework
We will start our journey into the world of mixed-effects models from the perspective of a *multilevel* model. This is primarily because we can build the pieces of a mixed-effects model very slowly from first principles, allowing the logic to become much clearer. In addition, the multilevel framework is often the most *intuitive* way to think about these models, even though they are less frequently implemented in this fashion. So, we generally advise that you *think* about these models in a multilevel fashion, even if *practically* we end up specifying them in a mixed-effects fashion. This will all become clearer once we have discussed *both* perspectives.

## Fitting a Model to One Subject
To start understanding multilevel models, let us imagine that we only have the data for *one subject*. Going back to the long-formatted `selfesteem` data from `datarium`, let us extract the data associated with subject 1.

In [1]:
library('datarium')
library('reshape2')

data('selfesteem')

# repeats and number of subjects
t <- 3
n <- dim(selfesteem)[1]

# reshape wide -> long
selfesteem.long <- melt(selfesteem,            # wide data frame
                        id.vars='id',          # what stays fixed?
                        variable.name="time",  # name for the new predictor
                        value.name="score")    # name for the new outcome

selfesteem.long           <- selfesteem.long[order(selfesteem.long$id),] # order by ID
rownames(selfesteem.long) <- seq(1,n*t)                                  # fix row names
selfesteem.long$id        <- as.factor(selfesteem.long$id)               # convert ID to factor

In [2]:
sub.1 <- selfesteem.long[selfesteem.long$id == '1',]
print(sub.1)

  id time    score
1  1   t1 4.005027
2  1   t2 5.182286
3  1   t3 7.107831


So, we can see that we have 3 repeated measurements associated with the 3 values of `time`. Importantly, there are no replications at each time-point, so this is all the information we have available. Now, we know these values will be *correlated* by virtue of coming from the same subject, but we will put that to one side for now because it is a distraction. Instead, our focus here is simply *what model of these data is possible*?

We might be tempted to use

In [3]:
lm.sub.1 <- lm(score ~ time, data=sub.1)

However, there is a problem here. Because each level of `time` is associated with *one* data point, there is no sense in which the parameter estimates can be *average* effects, or *summaries* of any kind. They will simply be *identical* to the raw data. The fitting process aims to minimise the errors, so if it can make them 0 it has done the best job it can. In this example, the model will fit the data *perfectly* and we will be left with *no error*. This means we can get parameter estimates, but nothing else. We *need* residual error for everything else to work. So, we end up with this

In [4]:
summary(lm.sub.1)


Call:
lm(formula = score ~ time, data = sub.1)

Residuals:
ALL 3 residuals are 0: no residual degrees of freedom!

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.005        NaN     NaN      NaN
timet2         1.177        NaN     NaN      NaN
timet3         3.103        NaN     NaN      NaN

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:    NaN 
F-statistic:   NaN on 2 and 0 DF,  p-value: NA


This is an important point because the *data* are what *constrain* the model. We might *desire* something more complex, but we can only fit it if the data support it. Keep this in mind as we go forward.
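As a quick check of this claim, the sketch below refits the saturated model using subject 1's scores typed in by hand from the printed output above, so it runs without `datarium`. The object names (`sub.1.demo`, `fit.saturated`) are ours and used purely for illustration.

```r
# Subject 1's scores, copied from the printed output above
sub.1.demo <- data.frame(
  time  = factor(c("t1", "t2", "t3")),
  score = c(4.005027, 5.182286, 7.107831)
)

# One parameter per observation, so the model is saturated
fit.saturated <- lm(score ~ time, data = sub.1.demo)

# Do the fitted values simply reproduce the raw data?
fitted.equal.data <- isTRUE(all.equal(unname(fitted(fit.saturated)),
                                      sub.1.demo$score))

# How many degrees of freedom are left for estimating error?
resid.df <- df.residual(fit.saturated)

print(fitted.equal.data)
print(resid.df)
```

The fitted values match the scores and no residual degrees of freedom remain, which is why every standard error in the summary is `NaN`.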

So, if we cannot fit the model we want, what can we fit? Well, because the problem is that we have no replications within each level of `time`, we cannot use the variable `time` at all. Instead, the best we can do is just fit an intercept. The model for subject $i = 1$ from time-point $j$ is simply

$$
\begin{alignat*}{1}
    y_{1j}    &= \mu_{1} + \eta_{1j} \\
    \eta_{1j} &\sim \mathcal{N}\left(0,\sigma^{2}\right)
\end{alignat*}
$$

irrespective of the value of $j$. We have used $\eta_{ij}$ to refer to the errors rather than the usual $\epsilon_{ij}$, for reasons that will become clearer as we progress. In `R`, this model would be

In [5]:
lm.sub.1 <- lm(score ~ 1, data=sub.1)
summary(lm.sub.1)


Call:
lm(formula = score ~ 1, data = sub.1)

Residuals:
      1       2       3 
-1.4267 -0.2494  1.6761 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   5.4317     0.9043   6.006   0.0266 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.566 on 2 degrees of freedom


So now we have residuals and everything else can work. There is nothing ground-breaking or Earth-shattering about any of this. All we are concluding is that, based on only having a *single* subject from this experiment, the best we could do is model a *subject-specific constant* and nothing else. In this example, the average value of `score` for subject 1 was estimated to be $\hat{\mu}_{1} = 5.43$. That is it.
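To make the intercept-only model concrete, the sketch below shows that its estimate is just the sample mean of the three scores, and that its standard error is the usual standard error of a mean. The scores are typed in from the printed output above, and the object names are illustrative only.

```r
# Subject 1's three scores, copied from the printed output above
score.1 <- c(4.005027, 5.182286, 7.107831)

# Intercept-only model for this subject
fit.1 <- lm(score.1 ~ 1)

# The intercept is the sample mean ...
mu.hat <- unname(coef(fit.1))

# ... and its standard error is sd / sqrt(number of observations)
se.hat <- sd(score.1) / sqrt(length(score.1))

print(c(mu.hat = mu.hat, sample.mean = mean(score.1)))
print(se.hat)
```

Both quantities should agree with the `Estimate` and `Std. Error` columns in the summary above.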

## Extending the Model to Multiple Subjects
Of course, we do not *only* have subject 1. So let us introduce subject 2 into this framework and see where it gets us. Much like subject 1, if we extract the data for `id == '2'` and consider it in *isolation*, all we can do is the following

In [6]:
sub.2 <- selfesteem.long[selfesteem.long$id == '2',]
print(sub.2)

  id time    score
4  2   t1 2.558124
5  2   t2 6.912915
6  2   t3 6.308434


In [7]:
lm.sub.2 <- lm(score ~ 1, data=sub.2)
summary(lm.sub.2)


Call:
lm(formula = score ~ 1, data = sub.2)

Residuals:
     4      5      6 
-2.702  1.653  1.049 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    5.260      1.362   3.862    0.061 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.359 on 2 degrees of freedom


and then conclude that the average `score` for subject 2 is $\hat{\mu}_{2} = 5.26$. In isolation, we therefore have the following two models

$$
\begin{alignat*}{1}
    y_{1j} &= \mu_{1} + \eta_{1j} \\
    y_{2j} &= \mu_{2} + \eta_{2j}.
\end{alignat*}
$$

Most importantly, though, we are not *really* working in isolation. We have *both* subjects together. Recall that the problem with trying to fit an effect of `time` within a single subject was that there were no replications and thus *no variance*. This is true *within* each subject, but *across* the two subjects we *do* have replications of each level of `time`. If we put the two datasets together, we get

In [8]:
rbind(sub.1,sub.2) # row-bind function

  id time    score
1  1   t1 4.005027
2  1   t2 5.182286
3  1   t3 7.107831
4  2   t1 2.558124
5  2   t2 6.912915
6  2   t3 6.308434

So, now we have *two* values of `t1`, *two* values of `t2` and *two* values of `t3`. This means we *can* introduce an effect of `time` that will be estimated *across* the subjects. If we call the effect of the $j$th level of `time` $\alpha_{j}$, we can think of these two models as

$$
\begin{alignat*}{1}
    y_{1j} &= \mu_{1} + \alpha_{j} + \eta_{1j} \\
    y_{2j} &= \mu_{2} + \alpha_{j} + \eta_{2j} \\
\end{alignat*}
$$

which, across all subjects, gives us

$$
y_{ij} = \mu_{i} + \alpha_{j} + \eta_{ij}.
$$

So, we now have a *subject-specific* mean ($\mu_{i}$) and an effect of `time` ($\alpha_{j}$). But notice that there is no subject index on $\alpha_{j}$. So this is *the same* irrespective of the specific subject. This is important because it indicates that $\alpha_{j}$ captures something *universal* from across all subjects. In `R`, this complete model would be

In [9]:
lm.all.subs <- lm(score ~ id + time, data=selfesteem.long)
summary(lm.all.subs)


Call:
lm(formula = score ~ id + time, data = selfesteem.long)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3509 -0.5233 -0.0888  0.5304  1.9560 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.33503    0.60780   5.487 3.28e-05 ***
id2         -0.17189    0.78466  -0.219  0.82907    
id3          0.39031    0.78466   0.497  0.62491    
id4          0.06107    0.78466   0.078  0.93882    
id5         -1.01940    0.78466  -1.299  0.21029    
id6         -0.75183    0.78466  -0.958  0.35067    
id7         -0.11610    0.78466  -0.148  0.88402    
id8         -0.30895    0.78466  -0.394  0.69840    
id9          0.02795    0.78466   0.036  0.97198    
id10        -0.06029    0.78466  -0.077  0.93960    
timet2       1.79382    0.42978   4.174  0.00057 ***
timet3       4.49622    0.42978  10.462 4.44e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.961 on 18 degrees of freedom
Multiple R-squa

So we now have *subject-specific* intercepts, as well as a *universal* effect of `time`. In general, what we have done here is create a model that has a different predicted value for each subject. So, each subject gets their *own model*. Subject 1 is

$$
E(y_{1j}) = \mu_{1} + \alpha_{j} = \mu_{1j},
$$

subject 2 is

$$
E(y_{2j}) = \mu_{2} + \alpha_{j} = \mu_{2j}
$$

and so on. It is like the *subjects* form the *cells* of the design. However, crucially, each subject's expected value is a *combination* of something specific to them $\left(\mu_{1}, \mu_{2}, \dots, \mu_{n}\right)$ and something *universal* from across subjects $\left(\alpha_{1},\alpha_{2},\alpha_{3}\right)$. Thus, this model captures two important elements of our data: the *idiosyncrasies* of the individual and the *constant effect* in the population. 
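The sketch below illustrates this combination using only subjects 1 and 2, with scores typed in from the printed output earlier. As an assumption made purely for readability, the intercept is removed (`0 + id + time`) so that each `id` coefficient is directly that subject's own term, with $\alpha_{1} = 0$ by the default coding. All object names are illustrative.

```r
# Scores for subjects 1 and 2, copied from the printed output above
two.subs <- data.frame(
  id    = factor(rep(c("1", "2"), each = 3)),
  time  = factor(rep(c("t1", "t2", "t3"), times = 2)),
  score = c(4.005027, 5.182286, 7.107831,
            2.558124, 6.912915, 6.308434)
)

# Subject-specific terms plus a shared effect of time
fit.two <- lm(score ~ 0 + id + time, data = two.subs)
b       <- coef(fit.two)   # named: id1, id2, timet2, timet3

# Rebuild each expected value by hand as mu_i + alpha_j
alpha.hat <- c(t1 = 0, t2 = unname(b["timet2"]), t3 = unname(b["timet3"]))
mu.hat    <- c("1" = unname(b["id1"]), "2" = unname(b["id2"]))
by.hand   <- mu.hat[as.character(two.subs$id)] +
             alpha.hat[as.character(two.subs$time)]

print(cbind(two.subs,
            fitted  = unname(fitted(fit.two)),
            by.hand = unname(by.hand)))
print(df.residual(fit.two))
```

The hand-built values match the fitted values, and with two subjects there are now residual degrees of freedom left over, which is what allows `time` to enter the model at all.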



````{admonition} Connections with the Repeated Measures ANOVA
:class: tip
Hopefully the connections with the repeated measures ANOVA are starting to emerge, but we will make this more explicit later in the lesson.
````

## Understanding $\mu_{i}$ as a *Random Variable*
In terms of understanding the next step towards a complete multilevel model, we need to imagine that we ran the experiment *again* and collected a new subject. Take another look at the model

$$
y_{ij} = \mu_{i} + \alpha_{j} + \eta_{ij}
$$

and remember that this is the theoretical *population-level* description of the data-generating process. As such, what do we imagine changes about this description for each *new* subject? Ponder this for a moment, before we move on.

If we have a new subject, we also have a new value of $i$. So which of the terms above depend upon $i$? Well, certainly $\mu_{i}$ does. Each subject has their own unique mean that is specific to them. So, a *new* subject means a *new* value of $\mu_{i}$. In addition, the errors $\eta_{ij}$ depend upon the value of $i$, as these will change with each observation within each subject. So in terms of what changes in the model above when we sample someone new, both $\mu_{i}$ and $\eta_{ij}$ will shift.

Importantly, what does *not* change is $\alpha_{j}$. Why? Because this has *no* subject-specific index. $\alpha_{1}$ is the same whether $i = 1$ or $i = 1,427$. This is because this is a *universal effect* across all subjects. Yes, its *estimate* would change with another subject (because there is now *more data*), but remember we are thinking about the true population description of the data-generating process here. In this sense, $\alpha_{j}$ is a *constant*. It does not change with each sample, it remains a *fixed* element of the universe.

So, if $\alpha_{j}$ is *fixed*, what does this mean for $\mu_{i}$ and $\eta_{ij}$? Well, what do we call a variable that changes every time we observe it? A *random variable*. So, both $\mu_{i}$ and $\eta_{ij}$ are *random variables*. We already know this about $\eta_{ij}$ because we already ascribed it a distribution earlier. Indeed, we have *always* treated the errors as random, so this is nothing new. What *is* new is having *another random variable in the model*.


````{admonition} Fixed-effects and Random-effects
:class: tip
Although we are currently focusing on the *multilevel* perspective, we can already see the *mixed-effects* perspective creeping up on us. A mixed-effects model is, by definition, a model that contains *both* population-level constants *and* random variables. These are usually referred to as *fixed-effects* and *random-effects*. So, in our example so far we have

- $\alpha_{j}$ - a *fixed-effect* that represents a population-level constant that we want to *estimate* and tells us something universal about our data
- $\mu_{i}$ - a *random-effect* that represents a realised value of a *random variable* that tells us something about the variability in our data

This connection between variability and random effects may not be completely clear right now, but we will get to it shortly. However, for the time being, you can think of 

- *Fixed-effects* = elements of the *mean function*
- *Random-effects* = elements of the *variance function*

We perform *inference* on the estimated fixed-effects, using the *uncertainty* encoded by the random-effects. This is the core definition that we need to keep in mind throughout this entire section of the unit. But do not worry if it is not wholly clear right now.
````

Because *both* $\mu_{i}$ and $\eta_{ij}$ are *random variables* they will both, by definition, have some probability distribution that describes their behaviour over repeated samples. As we know, the $\eta_{ij}$ are *errors* and thus reflect *deflections* around the expected value. As such, their distribution is the same as it always was, except that we now label the variance $\sigma^{2}_{w}$ to keep it distinct from a second variance we are about to introduce

$$
\eta_{ij} \sim \mathcal{N}\left(0, \sigma^{2}_{w}\right).
$$

But what about the $\mu_{i}$? 

Well, as written above, these are *means* for each subject, so their expected value will not be 0. We saw earlier that our estimates for the first two subjects from `selfesteem` were $\hat{\mu}_{1} = 5.43$ and $\hat{\mu}_{2} = 5.26$. Clearly these are *not* 0 because they are on the same scale as `score`. So, instead, the expected value of each of these subject-specific means will be whatever the *population grand mean* is. Their variance will then represent the variability of the subject means, which we will call $\sigma^{2}_{b}$. As such

$$
\mu_{i} \sim \mathcal{N}\left(\mu, \sigma^{2}_{b}\right).
$$

This means that our mental model of where our data comes from now consists of *two layers*. ...

## The Complete Multilevel Model
...As we know from our discussions last semester, we can always write a linear model as an equation for the mean function with the probabilistic behaviour of the random variable attributable to an error term. So we can write the above as

$$
\mu_{i} = \mu + S_{i}
$$

with

$$
S_{i} \sim \mathcal{N}(0,\sigma^{2}_{b})
$$


So, putting all these pieces together, our full model is now

$$
\begin{alignat*}{1}
    y_{ij}        &= \mu_{i} + \alpha_{j} + \eta_{ij}  \\
    \mu_{i}       &= \mu + S_{i} \\
\end{alignat*}
$$

with

$$
\begin{alignat*}{1}
     S_{i}    &\sim \mathcal{N}\left(0,\sigma^{2}_{b}\right)   \\
    \eta_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}_{w}\right) \\
\end{alignat*}
$$

So, these are in fact *two* models that are *linked together*. We have a *hierarchy* of models, or a single model with *multiple levels*. Indeed, from the multilevel perspective, it is typical to label these models like so

$$
\begin{alignat*}{2}
    y_{ij}        &= \mu_{i} + \alpha_{j} + \eta_{ij} &\quad\text{(Level 1)} \\
    \mu_{i}       &= \mu + S_{i}                      &\quad\text{(Level 2)} \\
\end{alignat*}
$$
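One way to see how the two levels are linked is to substitute the Level 2 equation into the Level 1 equation. This is only an algebraic rearrangement of the model above, using the same symbols, and not a new model:

$$
\begin{alignat*}{1}
    y_{ij} &= \mu_{i} + \alpha_{j} + \eta_{ij} \\
           &= \left(\mu + S_{i}\right) + \alpha_{j} + \eta_{ij} \\
           &= \underbrace{\mu + \alpha_{j}}_{\text{constants}} + \underbrace{S_{i} + \eta_{ij}}_{\text{random variables}}
\end{alignat*}
$$

Written this way, the constants and the random variables separate cleanly, which is the split between fixed-effects and random-effects described earlier.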

### The Data-generating Process
In order to fully conceptualise what the multilevel model is saying, we need to think of it as an explanation of *where* our data come from.

So, for a single subject, their mean for a given value of `time` is *unique* to them. But we can decompose this into a *subject-specific* mean and a constant effect of `time`. So each measurement from each subject is composed of three parts

1. $\mu_{i}$ - something unique and specific to individual $i$ that is true across all measurements taken from them
2. $\alpha_{j}$ - something universal about the effect of time-point $j$ that is true irrespective of the individual
3. $\eta_{ij}$ - a random perturbation of measurement $ij$ that captures all the reasons why this is not exactly $\mu_{i} + \alpha_{j}$

... So we can think of the expected value *conditional* on a specific subject. For instance, when $i = 1$ we expect

$$
E(y_{1j}) = \mu_{1} + \alpha_{j}.
$$

So each subject has their own specific mean. However, whenever we consider *all* subjects we have

$$
E(y_{ij}) = E(\mu_{i}) + \alpha_{j} = \mu + \alpha_{j},
$$

which consists of only those effects that are *universal* across all subjects. Again, this captures the idea that we have many *little* models for each individual subject as well as one *big* model for all subjects. 
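The data-generating story can also be sketched as a small simulation. Every numerical value below (`mu`, `alpha`, `sigma.b`, `sigma.w`, `n.sub`) is a hypothetical choice made only for illustration, not an estimate from `selfesteem`, and the object names are ours.

```r
set.seed(1)   # make the illustration reproducible

# Hypothetical population values (illustrative only, not estimates)
n.sub   <- 10                          # number of subjects
mu      <- 3                           # population grand mean
alpha   <- c(t1 = 0, t2 = 2, t3 = 4)   # universal effects of time
sigma.b <- 0.5                         # SD of the subject means
sigma.w <- 1                           # SD of the errors

# Level 2: draw a mean for each subject
S    <- rnorm(n.sub, mean = 0, sd = sigma.b)
mu.i <- mu + S

# Level 1: draw each measurement around mu_i + alpha_j
sim       <- expand.grid(time = names(alpha), id = seq_len(n.sub))
eta       <- rnorm(nrow(sim), mean = 0, sd = sigma.w)
sim$score <- mu.i[sim$id] + unname(alpha[as.character(sim$time)]) + eta

print(head(sim))
```

Re-running this with a different seed changes `mu.i` and `eta`, but `alpha` stays exactly as it was set, mirroring the distinction between the random and the fixed parts of the model.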

### Fitting Individual Linear Models in `R`
Each time we do this, we will collect the estimates of $\mu_{i}$ and $\eta_{ij}$, just to demonstrate that these are the elements that truly *do* change with each new subject. We will then show the distribution of these at the end to illustrate that these terms are indeed *random variables*.

In [10]:
level.1 <- lm(score ~ 0 + id + time, data=selfesteem.long)

Notice that this bears a striking similarity to how we specified the repeated measures ANOVA using `lm()` a couple of weeks ago. We will make this connection more explicit a little later in this lesson.

In [11]:
summary(level.1)


Call:
lm(formula = score ~ 0 + id + time, data = selfesteem.long)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3509 -0.5233 -0.0888  0.5304  1.9560 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
id1      3.3350     0.6078   5.487 3.28e-05 ***
id2      3.1631     0.6078   5.204 5.98e-05 ***
id3      3.7253     0.6078   6.129 8.66e-06 ***
id4      3.3961     0.6078   5.588 2.65e-05 ***
id5      2.3156     0.6078   3.810 0.001283 ** 
id6      2.5832     0.6078   4.250 0.000482 ***
id7      3.2189     0.6078   5.296 4.91e-05 ***
id8      3.0261     0.6078   4.979 9.72e-05 ***
id9      3.3630     0.6078   5.533 2.97e-05 ***
id10     3.2747     0.6078   5.388 4.04e-05 ***
timet2   1.7938     0.4298   4.174 0.000570 ***
timet3   4.4962     0.4298  10.462 4.44e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.961 on 18 degrees of freedom
Multiple R-squared:  0.9824,	Adjusted R-squared:  0.9707 
F-statistic: 83.89

So, we now have individual effects unique to each subject, as well as effects of `time` that exist *across* subjects.
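It may help to confirm that this is the same fit as the earlier `score ~ id + time` model, just written with a different coding. The sketch below uses only the coefficients printed in the two summaries above, typed in by hand, so agreement is expected only up to the rounding of the printed output. Object names are illustrative.

```r
# Coefficients copied from the `score ~ id + time` summary
intercept <- 3.33503
id.offset <- c(id2 = -0.17189, id3 = 0.39031, id4 = 0.06107,
               id5 = -1.01940, id6 = -0.75183, id7 = -0.11610,
               id8 = -0.30895, id9 = 0.02795, id10 = -0.06029)

# Coefficients copied from the `score ~ 0 + id + time` summary
cell.means <- c(id1 = 3.3350, id2 = 3.1631, id3 = 3.7253, id4 = 3.3961,
                id5 = 2.3156, id6 = 2.5832, id7 = 3.2189, id8 = 3.0261,
                id9 = 3.3630, id10 = 3.2747)

# With an intercept, subject 1 is the intercept and every other
# subject is the intercept plus their offset
rebuilt <- c(id1 = intercept, intercept + id.offset)

# Largest disagreement between the two codings
max.diff <- max(abs(rebuilt - cell.means))

print(round(rebuilt, 4))
print(max.diff)
```

Removing the intercept therefore changes nothing about the fit. It only makes each `id` coefficient directly readable as that subject's own term.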

In [12]:
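mu.i    <- coef(level.1)[1:10]  # the 10 subject-specific estimates (id1 ... id10)
level.2 <- lm(mu.i ~ 1)         # Level 2: model the subject means with a constant only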

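Due to the way the Level 1 model is coded (no overall intercept, with `t1` as the reference level of `time`), each $\hat{\mu}_{i}$ is that subject's fitted value at `t1`. The intercept of `level.2` is therefore the *group mean* for `t1`.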
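As a rough illustration of what this second-level fit returns, the sketch below repeats it using the ten subject-specific estimates typed in from the printed Level 1 summary. The names `mu.i.demo` and `level.2.demo` are ours. Note the assumption in the last step: the variance reported here is a naive one, because the $\hat{\mu}_{i}$ are themselves estimates and still carry some Level 1 noise, so it should not be read as a proper estimate of $\sigma^{2}_{b}$.

```r
# Subject-specific estimates, copied from the printed Level 1 summary above
mu.i.demo <- c(3.3350, 3.1631, 3.7253, 3.3961, 2.3156,
               2.5832, 3.2189, 3.0261, 3.3630, 3.2747)

# Level 2: an intercept-only model of the subject means
level.2.demo <- lm(mu.i.demo ~ 1)

# The intercept is simply the average of the subject means
grand.mean <- unname(coef(level.2.demo))

# The squared residual SD is simply the sample variance of the subject means
# (naive: it also contains estimation noise from Level 1)
naive.var <- sigma(level.2.demo)^2

print(c(grand.mean = grand.mean, mean.mu.i = mean(mu.i.demo)))
print(c(naive.var = naive.var, var.mu.i = var(mu.i.demo)))
```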