# Generalised Least Squares (GLS)
In the previous parts of this lesson, we discussed the idea of using more flexible covariance structures to model repeated measures data. Although desirable from a theoretical perspective, we saw that this causes problems in practice when it comes to applying the classical inferential framework. So, we already start with a degree of *tension* between our *model* and the process of *inference* based on that model. To begin with, however, we will put that tension to one side and focus only on the *model*. From this perspective, our aim is not to calculate a $p$-value or a confidence interval; it is to build something that can accurately capture the *data-generating process*. We will worry about the process of inductive inference later.

Although our general theme in this part of the unit is *mixed-effects models*, we are not quite there yet. So far, our journey has been to examine the *traditional* methods of capturing repeated measurements, in the form of the *paired $t$-test* and *repeated measures ANOVA*. Our conclusion was that these methods suffer from two problematic quirks that we would like to address:

1. They require a *partitioning* of the error within a framework that doesn't allow this, leading to the awkward inclusion of error terms (e.g. `subject`) within the mean structure, and the manual assignment of numerators and denominators across multiple ANOVA tables.
2. They make very *simplistic* assumptions about the covariance structure that are unlikely to hold in reality. 

Although mixed-effects models do address *both* of these issues (as we will see later), so too does a much more straightforward approach that we are already familiar with: Generalised Least Squares (GLS). As such, GLS presents a more logical starting point for addressing the issues we have identified. This is worth emphasising because solving the two points above is *not* the main justification for using mixed-effects models. The justification is *not* that mixed-effects models are better than GLS for every dataset; it is that mixed-effects models are generally *more flexible* for complex data structures. However, this flexibility also comes at a price. So, there are times when GLS may actually be the better solution, and we should not default to *mixed-effects* models whenever we simply want a more general covariance structure.

## GLS Theory
We previously came across GLS in the context of allowing different variances for different groups of data in ANOVA-type models. This was motivated as a way of lifting the assumption of *homogeneity of variance*. However, GLS is actually a much more general technique. To see this, note that the probability model for GLS is

$$
\mathbf{y} \sim \mathcal{N}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right),
$$

where $\boldsymbol{\Sigma}$ can take on *any structure*. In other words, GLS has exactly the same probability model as the normal linear model, except that it allows for a flexible specification of the variance-covariance matrix. Importantly, this is done *without* an explicit partitioning of the error. So, a GLS version of the one-way repeated-measures ANOVA model would be

$$
y_{ij} = \mu + \alpha_{j} + \epsilon_{ij},
$$

for subject $i$ in repeated measures condition $j$. Notice that there is *no* subject-specific term here. Indeed, the mean structure is *identical* to an independent one-way ANOVA model. This is as it should be, given that we know that correlation does not affect the *mean function*. So, the difference here lies in the *variance function*. Under GLS, the *vector* of errors is assumed to be distributed as

$$
\boldsymbol{\epsilon} \sim \mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}\right)
$$

where $\boldsymbol{\Sigma}$ can take on any structure that can be estimated from the data. For instance, we could specify a block-diagonal matrix that is unstructured within each subject, and 0 everywhere else. If we had 3 repeated measurements per subject, we could express the block associated with subject $i$ as 

$$
\boldsymbol{\epsilon}_{i} \sim \mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}_{i}\right)
$$

which can be expanded to

$$
\begin{bmatrix}
    \epsilon_{i1} \\
    \epsilon_{i2} \\
    \epsilon_{i3} \\
\end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix}
    0 \\
    0 \\
    0 \\
\end{bmatrix}, 
\begin{bmatrix}
    \sigma^{2}_{1} & \sigma_{12}    & \sigma_{13}    \\
    \sigma_{12}    & \sigma^{2}_{2} & \sigma_{23}    \\
    \sigma_{13}    & \sigma_{23}    & \sigma^{2}_{3} \\
\end{bmatrix}
\right).
$$

Crucially, there is no restriction on the form that $\boldsymbol{\Sigma}_{i}$ takes. So there is no need for this to be *compound symmetric* or *spherical*. We can let there be *any* correlation between the repeated measurement conditions, as well as *different variances* for each condition. This is because the covariance structure is formed *directly* and is an *explicit* part of the model. This is in direct opposition to the repeated measures ANOVA, where the covariance structure is simply *implied* and is formed *indirectly* from the partitioned errors. 
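To make the block-diagonal idea concrete, the following is a minimal sketch of how the full $\boldsymbol{\Sigma}$ could be assembled from a single unstructured within-subject block. The numerical values in `sigma_i`, the number of subjects, and the assumption that the data are ordered by subject and then by condition are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical unstructured within-subject block for 3 repeated measures.
# Different variances on the diagonal, different covariances off it.
sigma_i = np.array([
    [4.0, 1.2, 0.8],
    [1.2, 3.0, 1.5],
    [0.8, 1.5, 5.0],
])

n_subjects = 4  # illustrative value only

# The Kronecker product with an identity matrix places one copy of
# sigma_i on the diagonal for each subject and zeros everywhere else.
# This assumes rows are ordered subject by subject.
sigma = np.kron(np.eye(n_subjects), sigma_i)

print(sigma.shape)  # (12, 12)
```

Measurements from the same subject share the covariances in `sigma_i`, whereas measurements from different subjects have a covariance of 0.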

So, we can already see two key advantages here. Firstly, we do not need to partition the error and include messy `subject` terms within the mean function. Secondly, we do not need to make a restrictive assumption about the correlational structure of the data. Another key advantage is that the GLS standard errors are formed *using $\boldsymbol{\Sigma}$*. This means that the correlational structure is *automatically* taken into account. This is true of the standard errors of the parameter estimates, but *also* extends to ANOVA-style omnibus tests. This means the correct error term can be *automatically* determined for each test and we do not need multiple ANOVA tables with manual assignment of denominators. The model will take care of it all for us. So, GLS would appear to solve all the issues we had previously with the repeated measures ANOVA. 
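To see how $\boldsymbol{\Sigma}$ enters the standard errors, the textbook GLS expressions can be written down for the case where $\boldsymbol{\Sigma}$ is treated as known. The notation is an assumption, as it has not been introduced in this section: $\mathbf{X}$ denotes the design matrix and $\boldsymbol{\beta}$ the parameters of the mean function, so that $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$.

$$
\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{y},
\qquad
\text{Var}\left(\hat{\boldsymbol{\beta}}\right) = \left(\mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{X}\right)^{-1}.
$$

The standard errors are the square roots of the diagonal elements of $\text{Var}\left(\hat{\boldsymbol{\beta}}\right)$, so every variance and covariance in $\boldsymbol{\Sigma}$ feeds into them directly.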

### How Does GLS Work?
In order to get a sense of how GLS works, we need to start with the unrealistic assumption that we *know* $\boldsymbol{\Sigma}$. In other words, we know *exactly* what the population variances and covariances are between the repeated measures conditions. So, the only unknowns in our model are the parameters of the mean function. Although fanciful, we have seen this sort of thing before, and it is typical of how these sorts of problems are addressed. We start with some simplifying assumptions to allow us to calculate what we need. We then see what happens when those assumptions are *lifted*. 

So, let us work through what happens when we *know* $\boldsymbol{\Sigma}$ precisely. In this situation, GLS is nothing more than a *transformation* of the model and the data. This is based on the idea that it is possible to find a matrix called $\mathbf{W}$ which can be used to *pre-multiply* the data and the model to *remove* the covariance structure. This pre-multiplication is then equivalent to pre-multiplying the *errors*, which then removes all correlation and variance differences. For example, returning to subject $i$ from the one-way ANOVA example above, we originally had

$$
\text{Var}\left(\boldsymbol{\epsilon}_{i}\right) = 
\begin{bmatrix}
    \sigma^{2}_{1} & \sigma_{12}    & \sigma_{13}    \\
    \sigma_{12}    & \sigma^{2}_{2} & \sigma_{23}    \\
    \sigma_{13}    & \sigma_{23}    & \sigma^{2}_{3} \\
\end{bmatrix}.
$$

If we now pre-multiply both the *data* and the *model* by $\mathbf{W}_{i}$, the new errors are equivalent to a *transformed* version of the form

$$
\boldsymbol{\epsilon}_{i}^{\star} = \mathbf{W}_{i}\boldsymbol{\epsilon}_{i},
$$

which has the following covariance structure

$$
\text{Var}\left(\boldsymbol{\epsilon}_{i}^{\star}\right) = 
\begin{bmatrix}
    \sigma^{2} & 0          & 0          \\
    0          & \sigma^{2} & 0          \\
    0          & 0          & \sigma^{2} \\
\end{bmatrix}.
$$

So, the basic idea is that if we *know* the covariance structure, we can therefore *remove it* from the data. Once we have done that, we have independent data and can simply go back to using the normal linear model. This is a very *clean* and *simple* solution to the problem. If the covariance structure is what is causing us issues, we just *remove* it and then carry on as normal.
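The transformation can be sketched numerically. The example below is only an illustration under stated assumptions: the covariance values, the true means, the sample size, and the cell-means design matrix `X` are all hypothetical. One possible choice of $\mathbf{W}_{i}$ (not the only one) is the inverse of the Cholesky factor of $\boldsymbol{\Sigma}_{i}$. With this choice the transformed errors have an identity covariance matrix, which corresponds to the special case $\sigma^{2} = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical known within-subject covariance (illustrative values)
sigma_i = np.array([
    [4.0, 1.2, 0.8],
    [1.2, 3.0, 1.5],
    [0.8, 1.5, 5.0],
])
n_subjects, t = 50, 3

# Cell-means design: one column per repeated measures condition,
# with rows ordered subject by subject (an assumption of this sketch)
X = np.tile(np.eye(t), (n_subjects, 1))
beta_true = np.array([10.0, 12.0, 15.0])

# Cholesky factor: sigma_i = L @ L.T, so W_i = inv(L) whitens the errors
L = np.linalg.cholesky(sigma_i)
W_i = np.linalg.inv(L)

# Simulate correlated errors for each subject and build the response
eps = rng.standard_normal((n_subjects, t)) @ L.T
y = X @ beta_true + eps.ravel()

# Pre-multiply both the data and the model by the block-diagonal W
W = np.kron(np.eye(n_subjects), W_i)
X_star, y_star = W @ X, W @ y

# Ordinary least squares on the transformed data
beta_transformed, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Direct GLS formula on the original data, for comparison
sigma_inv = np.kron(np.eye(n_subjects), np.linalg.inv(sigma_i))
beta_gls = np.linalg.solve(X.T @ sigma_inv @ X, X.T @ sigma_inv @ y)

print(np.round(W_i @ sigma_i @ W_i.T, 10))     # identity matrix
print(np.allclose(beta_transformed, beta_gls))  # True
```

The two sets of estimates agree because ordinary least squares applied to the transformed data is algebraically the same calculation as GLS applied to the original data.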


### Feasible GLS (FGLS)
Now, what happens in the more *realistic* scenario when we *do not know* $\boldsymbol{\Sigma}$?

## Covariance Constraints
As well as understanding that the very process of estimating $\boldsymbol{\Sigma}$ causes problems, we also need to understand that we cannot have free rein to estimate any old covariance structure we like. One of the most important elements to recognise is that some sort of *constraint* is always needed when estimating a variance-covariance matrix. To see this, note that for a repeated measures experiment with $n$ subjects and $t$ repeated measures there are $nt \times nt$ values in this matrix. The values above and below the diagonal are a mirror image, so the true number of unknown values is $\frac{nt(nt + 1)}{2}$. For instance, if we had $n = 5$ subjects and $t = 3$ repeated measures, there would be $\frac{15 \times 16}{2} = 120$ unique values in the variance-covariance matrix. If we allowed it to be completely unstructured, we would have 120 values to estimate *just* for the covariance structure. This is not really possible unless the amount of data we have *exceeds* the number of parameters. So, the data itself imposes a *constraint* on how unstructured the covariance matrix can be.
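The counting argument above can be checked with a few lines of code. The helper name `n_unique_cov_params` is hypothetical and exists only for this illustration.

```python
def n_unique_cov_params(dim: int) -> int:
    """Number of unique elements in a symmetric dim x dim matrix."""
    return dim * (dim + 1) // 2

n, t = 5, 3  # subjects and repeated measures, as in the example above

# Completely unstructured covariance matrix over all n * t observations
print(n_unique_cov_params(n * t))  # 120

# For comparison: a single unstructured t x t within-subject block
print(n_unique_cov_params(t))      # 6
```

With only $5 \times 3 = 15$ observations available, estimating 120 covariance values is clearly out of reach, whereas a single shared $3 \times 3$ block needs only 6.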

Luckily, for most applications, we assume not only that $\boldsymbol{\Sigma}$ has a block-diagonal structure (so most off-diagonal entries are 0), but also that many of the remaining off-diagonal elements are actually *identical*. We saw this previously with the repeated measures ANOVA. Even though $\boldsymbol{\Sigma}$ may have *hundreds* of values we *could* fill in, if we assume compound symmetry within each subject, there are only *two* covariance parameters to be estimated: $\sigma^{2}_{b}$ and $\sigma^{2}_{w}$. The whole matrix can then be constructed using those two alone. This is an example of *extreme simplification*, but it does highlight that we generally do not estimate the *whole* variance-covariance matrix. We only estimate *small parts* of it. Indeed, making the covariance matrix more general is often a risky move because of the number of additional parameters needed. The more we estimate from the same data, the greater our uncertainty will become, because each element of the covariance matrix is supported by *less data*. Complexity always comes at a price.
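As a sketch of how two parameters can generate the whole matrix, the code below builds a compound symmetric $\boldsymbol{\Sigma}$. It assumes the usual interpretation in which every variance equals $\sigma^{2}_{b} + \sigma^{2}_{w}$ and every within-subject covariance equals $\sigma^{2}_{b}$. The function name and the parameter values are hypothetical.

```python
import numpy as np

def compound_symmetric_block(var_b: float, var_w: float, t: int) -> np.ndarray:
    """t x t block with var_b + var_w on the diagonal and var_b elsewhere."""
    return var_b * np.ones((t, t)) + var_w * np.eye(t)

# Illustrative values only
var_b, var_w = 2.0, 1.0
n_subjects, t = 5, 3

block = compound_symmetric_block(var_b, var_w, t)

# Same block repeated for every subject, zeros between subjects
sigma = np.kron(np.eye(n_subjects), block)

print(block)
print(sigma.shape)  # (15, 15)
```

Although `sigma` contains 225 entries, every non-zero entry is one of only two distinct values, both of which are determined by `var_b` and `var_w`.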

### Blocked Covariance Structures

### More General Covariance Structures
... For instance, one example of data that is correlated but does *not* fit our usual repeated measures structure is *time-series* data. ... For this, there are special covariance matrices called *autoregressive* (AR) structures ...