# Solutions for Inference
Given what we discussed in the previous part of this lesson, we are in a difficult situation. What we *want* is a framework where we can have any form of covariance structure to accurately represent the data-generating process. This would allow us to model any type of repeated measures experiment, irrespective of its complexity. However, the inferential devices used by the normal linear model simply *do not allow this*. The emphasis on *knowing* the sampling distribution of the estimates in order to calculate $p$-values and confidence intervals has backed us into a corner. Once the very specific conditions that allow these to be calculated are gone, so too is the whole inferential machinery. In a way, this demonstrates how *fragile* these methods are. 

So, we have two options available to us. One is to simply give up and spend our whole lives restricting inference to only those datasets where the classical results still apply. The other is that we try to find a way forward. Clearly, our plan is to push forward with the *second option*, but it is important to understand from the very beginning that we are making a *compromise*. We want to develop much better *models* that describe *where the data came from*. In order to do so, we have to accept that we have left the clean world of the normal linear model behind and our inference *will become approximate*. This can be an uncomfortable conclusion, but it is the reality of the situation we are in.

## Option 1 - Pretend that $\boldsymbol{\Sigma}$ is Known
In terms of trying to push forward, in spite of all these complexities, our first option is to simply assume that *we know* $\boldsymbol{\Sigma}$. Of course, we *do not*, that is the whole problem we are trying to address. However, if we simply assume that we have got $\boldsymbol{\Sigma}$ correct, pretty much all the problems highlighted previously *disappear*. In general, there are three ways this is realised in practice. We can choose to:

<ol type="a">
  <li>Treat our estimate as the true value and ignore everything else</li>
  <li>Treat our estimate as the true value, but <i>only</i> because we have enough data that the uncertainty has disappeared</li>
  <li>Treat our estimate as close enough to the true value that we can use it to <i>simulate</i> the uncertainty</li>
</ol>

We will now discuss each of these in more detail.

### a. Assume $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$
As a first approach, we can choose to *ignore* the problem. If we treat our estimate as *exactly* the population value, then we can carry on without any issues. As we will come to see, methods that allow for a more general covariance matrix do so by *removing* this structure from the data. Taking $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$ means that the covariance structure can be *perfectly removed* and we are back to the normal linear model. All the original inferential theory then applies and we are done. So, this approach is quite *practically* appealing because all the mess simply disappears. 

However, this is not without consequence. Simply ignoring the uncertainty about the value of $\boldsymbol{\Sigma}$ means:

- The degree to which the covariance structure can be *removed* is unknown and is very unlikely to be *perfect*. As such, the degree to which the standard errors and test statistics follow a known sampling distribution is also unknown. This means we have no sense of how accurate the $p$-values or confidence intervals actually are.
- We are pretending that degrees of freedom still work as a universal indicator of uncertainty, as they do when there is only a single variance component to estimate, but they do not.
- Because we are pretending that we got $\boldsymbol{\Sigma}$ for free, the degrees of freedom have no correction for estimating $\boldsymbol{\Sigma}$. As such, they will be *larger* than those from an equivalent repeated measures ANOVA model.

### b. Use *Asymptotic* Results
As a second approach, we can choose to *acknowledge* the problem, but assume that it does not apply to us. As discussed earlier, most of the issues here arise in *small samples* because the uncertainty around $\hat{\boldsymbol{\Sigma}}$ is much greater. We see this with the $t$-distribution in the normal linear model. A small sample means smaller degrees of freedom, which means a wider scaled $\chi^{2}(\nu)$ distribution and a null $t$-distribution with heavier tails. As the sample gets *larger*, most of the complexity disappears. The scaled $\chi^{2}(\nu)$ collapses to a single point and the $t$-statistic becomes a $z$-statistic. At this point we no longer need degrees of freedom because the null $z \sim \mathcal{N}(0,1)$ has a *fixed* rather than *dynamic* shape. So, the solution here is to simply calculate $z$-statistics instead of $t$-statistics[^chisq-foot]. We can then refer to null distributions *without* degrees of freedom and claim that the tests are *asymptotically* correct. In other words, these results are correct so long as the sample size is large enough that our uncertainty around $\hat{\boldsymbol{\Sigma}}$ is *negligible*. We can then treat $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$ not because we *know* it already, but because we have so much data that this is *effectively* true.
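
To see why the degrees of freedom drop out, it can help to write the $t$-statistic in its textbook form. This is only a sketch under the usual normal linear model assumptions, and the symbols $Z$ and $V$ are notation introduced here for illustration, with $Z$ and $V$ independent:

$$
t = \frac{Z}{\sqrt{V/\nu}}, \qquad Z \sim \mathcal{N}(0,1), \qquad V \sim \chi^{2}(\nu)
$$

$$
E\left[\frac{V}{\nu}\right] = 1, \qquad \text{Var}\left(\frac{V}{\nu}\right) = \frac{2}{\nu} \rightarrow 0 \quad \text{as } \nu \rightarrow \infty
$$

As $\nu$ grows, the denominator concentrates around 1, so $t$ behaves more and more like $Z$ and the heavier tails fade away.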

Compared to possibility **1a**, this approach has a certain *statistical purity* to it. We do not need to pretend degrees of freedom still exist nor pretend that any of the small sample theory still applies. We can also ignore the whole idea of the sampling distribution of the denominator because, if we have enough data, *this does not matter*. So we do not need to pretend it has a certain form, we can simply ignore it. However, there are still some clear issues here:

- We need to be comfortable assuming that our $n$ is *large enough* for this to work, but this is an *unanswerable* question (see box below).
- We need to be comfortable with the idea of dismissing uncertainty in the estimation of $\boldsymbol{\Sigma}$ as negligible.
- In small samples this will result in inference that is *optimistic*. However, the open use of asymptotic tests already embeds this as a caution. As such, adjusting inference for small samples becomes an *interpretative* concern, rather than a mathematical one.
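
The optimism described in the last point can be illustrated with a small simulation. The sketch below is a toy example in plain Python (chosen purely for illustration and not tied to any of the `R` functions used in this lesson). It assumes the simplest possible setting, a one-sample test with a single estimated variance rather than a full $\boldsymbol{\Sigma}$, and the function name `rejection_rate` is hypothetical. Data are generated under the null, the statistic is compared against the $z$ critical value of 1.96, and the proportion of false rejections is recorded for several sample sizes.

```python
import math
import random


def rejection_rate(n, n_sims=5000, crit=1.96, seed=2024):
    """Share of null datasets where |statistic| exceeds the z critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Data generated under the null: the true mean is exactly 0
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(y) / n
        # The standard error uses an *estimated* standard deviation
        sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
        stat = mean / (sd / math.sqrt(n))
        if abs(stat) > crit:
            rejections += 1
    return rejections / n_sims


rates = {n: rejection_rate(n) for n in (5, 50, 200)}
for n, rate in rates.items():
    print(f"n = {n:>3}: rejection rate = {rate:.3f} (nominal 0.050)")
```

If the asymptotic approximation were exact, every rate would sit at the nominal 5%. In this toy setting the smallest sample rejects noticeably more often than that, and the excess shrinks as $n$ grows, which is the *continuum* behaviour rather than a threshold.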

`````{admonition} How Large is "Large"?
:class: tip
If we want to lean on asymptotic theory, the obvious question is "how big does $n$ need to be?". The problem is that the definition is based on a *limit*, so it says that the approximation gets better and better as $n$ moves towards infinity. For our purpose, $n$ is the *number of subjects*, rather than the total amount of data. So, the answer is not that there is some magic sample size that is suddenly large enough, the answer is that the approximation will get better the larger $n$ becomes. The question then is more about what our tolerance for error is. The point of the asymptotic theory is to say that the error that comes from estimation becomes more negligible as $n$ grows, as does the penalty for estimating $\boldsymbol{\Sigma}$ from the data. So, unfortunately, there is *no honest numeric answer to this question*. The way to think about it is as a *degree of comfort*. If you have $n = 5$, you should probably feel *very uncomfortable*, whereas $n = 50$ should probably make you feel *cautious*. If you have $n = 100$, you should feel *optimistic* and if you have $n = 200$ you should probably be feeling *fairly confident*. As $n$ increases beyond that, you should probably feel *perfectly fine* about this approach. These are only ballpark figures, but the point is really to think of $n$ as a *continuum of comfort*, rather than as a *threshold*.
`````

### c. Use Simulations
As a third approach, we can again choose to *acknowledge* the problem, but assume that $\hat{\boldsymbol{\Sigma}}$ is close enough to $\boldsymbol{\Sigma}$ that we can use it to *simulate* uncertainty. So, this involves assuming $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$, but only insofar as assuming that our estimate is close enough to define a *plausible world* for simulation. We do not assume that the uncertainty in estimating $\boldsymbol{\Sigma}$ is negligible. In fact, the whole point of the simulations is to *measure it*.

The advantage here is that we do not assume the uncertainty around estimating $\boldsymbol{\Sigma}$ has a particular shape. Instead, we let the simulations build the relevant sampling distributions over their many iterations. This neither requires assuming that the classical results still hold, nor requires enough data so that all these problems disappear. Instead, we use the power of the *computer* to find a solution using a procedure known as the *parametric bootstrap*. In this method we:

1. Treat a fitted *null model* as the "truth".
2. Use this fitted model to randomly simulate new data.
3. Refit the model to the simulated dataset and save a copy of the test statistic.
4. Over many repeats of 2 and 3 (10,000+), build up the *null distribution* of the test statistic.
5. Calculate $p$-values and confidence intervals from this distribution.
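
The steps above can be sketched in a few lines of code. This is a deliberately minimal toy in plain Python, not the procedure any particular `R` package uses. It assumes a one-sample problem where the null model fixes the mean at 0 and estimates a single standard deviation, standing in for a fitted model with a full $\hat{\boldsymbol{\Sigma}}$. The data values, the function names and the choice of 2,000 repeats (far fewer than you would use in practice) are all assumptions made for illustration.

```python
import math
import random


def t_statistic(y):
    """Test statistic: mean divided by its estimated standard error."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
    return mean / (sd / math.sqrt(n))


def parametric_bootstrap_p(y, n_boot=2000, seed=1):
    rng = random.Random(seed)
    n = len(y)
    t_obs = t_statistic(y)

    # Step 1: fit the null model (mean fixed at 0) and treat it as the "truth"
    sigma_null = math.sqrt(sum(v * v for v in y) / n)

    null_stats = []
    for _ in range(n_boot):
        # Step 2: simulate a new dataset from the fitted null model
        y_star = [rng.gauss(0.0, sigma_null) for _ in range(n)]
        # Step 3: recompute the statistic on the simulated data and store it
        null_stats.append(t_statistic(y_star))

    # Steps 4-5: the stored statistics form the null distribution, and the
    # p-value is the share that are at least as extreme as the observed one
    n_extreme = sum(abs(t) >= abs(t_obs) for t in null_stats)
    return t_obs, (n_extreme + 1) / (n_boot + 1)


# Hypothetical data, invented purely for this example
y = [0.8, 1.9, -0.3, 1.2, 2.4, 0.5, 1.1, -0.2]
t_obs, p_boot = parametric_bootstrap_p(y)
print(f"observed t = {t_obs:.3f}, bootstrap p = {p_boot:.4f}")
```

Notice that no reference distribution or degrees of freedom appear anywhere in the calculation. The `+ 1` in the numerator and denominator is a common convention that counts the observed statistic as one draw from the null, which prevents a $p$-value of exactly zero.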

So this requires *zero* theory about the form of the sampling distributions. The uncertainty comes through naturally as part of the simulation. There are no *degrees of freedom* here as a distributional parameter because the distribution that is built does not even need to be mathematically tractable[^tractable-foot]. So this has some distinct advantages because we can get rid of much of the difficult approximation needed in classical approaches. However, there are still trade-offs:

- Computational burden is high, as calculating a single $p$-value can be a long process depending upon the complexity of refitting the model.
- Fundamentally, we have to assume that our model is *close enough* to the truth for this to work. Not only in terms of the covariance structure, but also in terms of the assumed population distribution. This is why this is a *parametric* bootstrap, rather than a *non-parametric* procedure.
- Assumptions about the accuracy of the covariance structure will improve with larger samples, meaning we can trust the simulations more. However, the more data we have the more *asymptotic* theory takes over, so the need for simulation becomes more questionable. Simulation is therefore most useful for that unknown middle-ground between samples that are too small for $\hat{\boldsymbol{\Sigma}}$ to be reliable and samples large enough for asymptotic theory to take over.

## Option 2 - Acknowledge that $\hat{\boldsymbol{\Sigma}}$ is an *Estimate*
The methods above were all variations on a theme where, in one way or another, we assume that $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$. This was because we were ignoring the problem, because we were assuming that we had enough data for the problem to disappear, or because we were treating our estimate as close enough to the truth to simulate from. However, another option is to avoid making this assumption entirely and fully accept that $\hat{\boldsymbol{\Sigma}}$ is an estimate. Once we do that, we acknowledge that there is some degree of uncertainty in its value and, without having any way of deriving this uncertainty exactly, we need to correct for it.

Methods that take this approach do so by creating *fictitious* degrees of freedom that allow a $p$-value to be calculated that has been approximately adjusted for the uncertainty. This is typically done as follows:

1. The test statistic is formed in the usual way
2. The properties of the fitted model are used *analytically* to approximate the mean and the variance of the test statistic over repeated samples
3. The mean and variance are then used to solve for the parameters of the null distribution, producing what are known as *effective degrees of freedom*
4. Those effective degrees of freedom are used to calculate a $p$-value and confidence intervals 

As an example, say we formed a $t$-statistic from our model. We know that this is a statistic with an unknown null distribution, so we will call it $t^{\star}$. We can use information in the model alongside the value of $t^{\star}$ to *approximate* $\text{Var}(t^{\star})$[^analytic-foot]. Once we have this, we simply note that the variance of a $t$-distribution can be expressed as $\frac{\nu}{\nu - 2}$ (for $\nu > 2$). So, we can use our approximate variance value to solve for $\nu$. This gives us the *effective degrees of freedom* of this $t$-distribution and we can calculate a $p$-value. In essence, we have approximately captured *universal uncertainty* in the same way that traditional degrees of freedom do, but within a context where this definition is no longer applicable.
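
To make the "solve for $\nu$" step concrete, writing $V$ for the approximate variance and rearranging $V = \frac{\nu}{\nu - 2}$ gives $\nu = \frac{2V}{V - 1}$. The sketch below, in plain Python, checks this on a toy case where the answer is known. It is an illustration only: the analytic approximations used in practice obtain the variance from the fitted model, whereas here a Monte Carlo estimate stands in for that step, and the function names are hypothetical. With $n = 8$ independent normal observations the $t$-statistic has exactly 7 degrees of freedom, so the recovered value should land close to 7.

```python
import math
import random


def t_statistic(y):
    """One-sample t-statistic for a null mean of 0."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
    return mean / (sd / math.sqrt(n))


def effective_df(variance):
    """Solve variance = nu / (nu - 2) for nu."""
    if variance <= 1.0:
        raise ValueError("a t-distribution always has variance greater than 1")
    return 2.0 * variance / (variance - 1.0)


rng = random.Random(7)
n, n_sims = 8, 40000

# Stand-in for the analytic step: estimate Var(t*) from null simulations
stats = [t_statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
         for _ in range(n_sims)]
mean_stat = sum(stats) / n_sims
var_stat = sum((s - mean_stat) ** 2 for s in stats) / (n_sims - 1)

nu_eff = effective_df(var_stat)
print(f"estimated Var(t*) = {var_stat:.3f}, effective df = {nu_eff:.2f}")
```

Because $\nu$ is solved from a variance rather than counted from the design, there is nothing forcing it to be a whole number. Small errors in the variance also translate into larger errors in $\nu$ when the variance is close to 1, since the denominator $V - 1$ is then small.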

This method is perhaps more appealing than simply pretending there is no problem because it at least tries to accommodate small sample adjustments. However, this also comes with some consequences:

- We are assuming that the true null distribution only differs from known null distributions (such as the $t$ and $F$) by its width, but not the general shape. This is potentially quite a big assumption to make.
- Although $p$-values can still be reported within the familiar language of $t$/$F$-statistics, we must not forget that this remains an *approximation*. 
- In principle, this should behave better in smaller samples compared to the naive degrees of freedom discussed earlier. However, this behaviour is ultimately *unknown*.
- These degrees of freedom can become fractional and no longer have a clear theoretical grounding. They are more devices to encode "tail-heaviness" within the familiar language of $t$ and $F$ distributions. 

`````{admonition} Effective Degrees of Freedom Methods
:class: tip
Although the term *effective degrees of freedom* may be new, we already saw an example of this approach last week in terms of the *non-sphericity corrections* applied to a repeated measures ANOVA. Similar methods for more general models include the Satterthwaite degrees of freedom and the Kenward-Roger degrees of freedom. This latter method also contains a *bias correction* for $\hat{\boldsymbol{\Sigma}}$, which should make it more accurate but can come with a severe computational cost, depending upon the model.
`````

## Practically, Where Does This Leave Us?
Given all the discussion above, what are we supposed to do *practically*? Although in principle we could simply decide which of the above methods sounds most justifiable to us, in reality we are bound by which of these methods have been implemented in software. So, even if we wish for a consistent inferential framework using our preferred method, we will be restricted by what is available to us. This is where more curated software like SPSS, SAS or Stata has an advantage, because there will be a centralised decision about which method(s) to provide and these will be provided *consistently*. For instance, SPSS prefers effective degrees of freedom, whereas Stata prefers asymptotic tests. The disadvantage is that if you personally disagree with this approach, there is little you can do. The advantage of an ecosystem like `R` is *choice*, but that choice can itself become a burden.

Below is a table that will not make complete sense right now, but indicates which of the above options are available when using the functions `gls()`, `lme()` and `lmer()` to fit different models in `R`. These are labelled in terms of the naive $t$/$F$-statistics, asymptotic $z$/$\chi^{2}$-statistics or corrected $t$/$F$-statistics using effective degrees of freedom. Note, however, that even when using effective degrees of freedom, there can be differences in the calculation method employed.

| Model function        | `summary()` | `anova()`    | `Anova()`               | `emmeans()`                              |
| --------------------- | ----------- | ------------ | ----------------------- | -----------------------------------------|
| `gls()`               | $t$         | $F$          | $\chi^{2}$              | $t$/$F$ <br> $z$/$\chi^{2}$ <br> $t$/$F$ (corr.) |
| `lme()`               | $t$ (corr.) | $F$ (corr.)  | $\chi^{2}$              | $z$/$\chi^{2}$ <br> $t$/$F$ (corr.)          |
| `lmer()`              | **none**    | **none**     | $\chi^{2}$ <br> $F$ (corr.) | $z$/$\chi^{2}$ <br> $t$/$F$ (corr.)          |
| `lmer()` + `lmerTest` | $t$ (corr.) | $F$ (corr.)  | $\chi^{2}$ <br> $F$ (corr.) | $z$/$\chi^{2}$ <br> $t$/$F$ (corr.)          |

As you can see, this is all a bit of a mess. Part of the problem is that there is no *single* agreed-upon solution. As discussed above, each of these methods has caveats and all are *approximations* to an inferential framework that no longer fits. Indeed, by default the `lme4` package (used to fit mixed-effects models) refuses to even play this game and gives you *no $p$-values at all*. As such, the reality is that it is up to the authors of each package to implement the methods they agree with. This has resulted in an awkward fragmentation of methods, depending upon the functions you are working with. Ultimately, this exposes the uncomfortable truth about what we are trying to do here and makes it much more difficult to use a single inferential framework *consistently*. We will be exploring all these functions and all these options across the next few weeks, so make sure you come back to this table when you want to review which methods are associated with each function. For now, *even before we have discussed a single model*, we can see that our goal is a highly uncertain and controversial one.

`````{topic} What do you now know?
In this section, we have explored ... . After reading this section, you should have a good sense of :

- ...
- ...
- ...

`````

[^chisq-foot]: The same principle can be applied to the $F$-statistic. Because an $F$ is a ratio of two variance terms, the $F$-distribution is derived from the ratio of two $\chi^{2}$ random variates. When the denominator is taken as a constant, this is just a *scaled* version of the numerator (exactly the same as with the $t$-statistic). So this just becomes a scaled $\chi^{2}$ and we can use a null $\chi^{2}$ distribution for inference.

[^tractable-foot]: In other words, there is no requirement that the distribution can be written down as an equation that depends upon parameters to control its shape and scale. It can be any wiggly shape it wants to be.

[^analytic-foot]: The methods used to do this are complicated and require a much deeper education in elements of calculus, such as the [Taylor Expansion](https://en.wikipedia.org/wiki/Taylor_series). We really do not need to concern ourselves with them.