# Solutions for Inference
Given what we discussed in the previous section, we have found ourselves in a difficult situation. What we *want* is a framework where we can have any form of covariance structure to accurately represent the data-generating process. This would allow us to model any type of repeated measures experiment, irrespective of its complexity. However, the inferential devices used by the normal linear model simply *do not allow this*. The emphasis on *knowing* the sampling distribution of the estimates in order to calculate $p$-values and confidence intervals has backed us into a corner. Once the very specific conditions that allow these to be calculated are gone, so too is the whole inferential machinery. In a way, this demonstrates how *fragile* these methods are. 

So, we have two options available to us. One is to simply give up and spend our whole lives restricting inference to only those datasets where the classical results still apply. The other is that we try and find a way forward, acknowledging that *no* perfect solution is going to exist. 

Clearly, our plan is to push forward with the *second option*, but it is important to understand from the very beginning that we are making a *compromise*. We want to develop much better *models* that describe *where the data came from* and can make better *predictions* of future data. In order to do so, we have to accept that we have left the clean world of the normal linear model and our inference *will become approximate*. This can be an uncomfortable conclusion, but it is the reality of the situation we are in.

## Option 1 - Pretend that $\boldsymbol{\Sigma}$ is Known
In terms of trying to push forward, in spite of all these complexities, our first option is to simply assume that *we know* $\boldsymbol{\Sigma}$. Of course, we *do not*; that is the whole problem we are trying to address. However, if we simply assume that we have got $\boldsymbol{\Sigma}$ correct, pretty much all the problems highlighted previously *disappear*. In general, there are three ways this is realised in practice. We can choose to:

- Treat our estimate as the true value and ignore everything else
- Treat our estimate as the true value, but *only* because we have enough data that the uncertainty has disappeared
- Treat our estimate as the true value, but only for the purpose of trying to *simulate* the uncertainty

Each of these corresponds to different approaches we will see being applied across different packages in `R`. So, given that we will see examples of each of these, we will now discuss them in more detail.

### 1a) Assume $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$
As a first approach, we can choose to *ignore* the problem. If we treat our estimate as *exactly* the population value, then we can carry on without any issues. As we will come to see, methods that allow for a more general covariance matrix do so by *removing* this structure from the data. Taking $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$ means that the covariance structure can be *perfectly removed* and we are back to the normal linear model. All the original inferential theory then applies and we are done. So, this approach is quite *practically* appealing because all the mess simply disappears. However, this is not without consequence. Simply ignoring the uncertainty about the value of $\boldsymbol{\Sigma}$ means:

- The degree to which the covariance structure can be *removed* is unknown and is very unlikely to be *perfect*. As such, the degree to which the standard errors and test statistics follow a known sampling distribution is also unknown. This means we have no sense of how accurate the $p$-values or confidence intervals actually are.
- We are pretending that degrees of freedom exist as a universal indicator of uncertainty for a single variance component, but they do not. 
- Because we are pretending that we got $\boldsymbol{\Sigma}$ for free, the degrees of freedom carry no correction for having estimated $\boldsymbol{\Sigma}$. As such, they will be *larger* than those of an equivalent repeated measures ANOVA model.

### 1b) Use *Asymptotic* Results
As a second approach, we can choose to *acknowledge* the problem, but assume that it does not apply to us. As discussed earlier, most of the issues here arise in *small samples*, because the uncertainty around $\hat{\boldsymbol{\Sigma}}$ is then much greater. We see this with the use of the $t$-distribution in the normal linear model. A small sample means fewer degrees of freedom, which means a wider scaled $\chi^{2}(\nu)$ distribution and a null $t$-distribution with heavier tails than a standard normal. As the sample gets *larger*, most of this disappears. The scaled $\chi^{2}(\nu)$ collapses to a single point and the $t$-statistic becomes a $z$-statistic. At this point we no longer need to express uncertainty in the form of degrees of freedom. The null $z \sim \mathcal{N}(0,1)$ is not parameterised by degrees of freedom because its shape is *fixed* rather than *dynamic*. So, the solution here is to calculate $z$-statistics instead of $t$-statistics and $\chi^{2}$-statistics instead of $F$-statistics. We can then refer to null distributions *without* error degrees of freedom and claim that the tests are *asymptotically* correct. In other words, these results are correct so long as the sample size is large enough that our uncertainty around $\hat{\boldsymbol{\Sigma}}$ has effectively *vanished*. We can treat $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$ not because we *know* it already, but because we have so much data that this is *effectively* true. 
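The convergence of the $t$-statistic towards a $z$-statistic can be sketched with a small simulation. This is only a toy illustration under stated assumptions: it uses a one-sample $t$-statistic on independent standard normal data (not a repeated measures model), and the function name `rejection_rate`, the sample sizes and the number of repeats are all choices made for this sketch. It counts how often a true null is rejected when the $t$-statistic is compared against the *normal* critical value of 1.96.

```python
import random

def rejection_rate(n, reps=4000, z_crit=1.96, seed=1):
    """Share of null samples whose |t| exceeds the normal critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Data generated under the null: true mean 0, true SD 1
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        se = (var / n) ** 0.5
        # Treat the t-statistic as if it were a z-statistic
        if abs(mean / se) > z_crit:
            rejections += 1
    return rejections / reps

# Hypothetical sample sizes, chosen only to show the trend
rates = {n: rejection_rate(n) for n in (5, 20, 200)}
for n, rate in rates.items():
    print(f"n = {n:>3}: rejection rate = {rate:.3f}")
```

With small $n$ the rejection rate should sit noticeably above the nominal 0.05, which is the *optimistic* inference that comes from ignoring the uncertainty in the variance estimate. As $n$ grows, the rate should move towards 0.05.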

Compared to the earlier possibility, this approach has a certain *statistical purity* to it. We do not need to pretend degrees of freedom still exist nor pretend that any of the small sample theory still applies. We can also ignore the whole idea of the sampling distribution of the denominator because, if we have enough data, *this does not matter*. So we do not need to pretend it has a certain form, we can simply ignore it. However, there are still some clear issues here:

- We need to be comfortable assuming that our $n$ is *large enough* for this to work, but this is an *unanswerable* question (see box below).
- We need to be comfortable with the idea of dismissing uncertainty in the estimation of $\boldsymbol{\Sigma}$ as negligible.
- In small samples this will result in inference that is *optimistic*. However, the open use of asymptotic tests already embeds this as a caution. As such, we shift the uncertainty in small samples from a mathematical concern to an *interpretative* concern. If we distrust these results as samples get smaller, we are doing the job of adjusting our inference accordingly.

`````{admonition} How Large is "Large"?
:class: tip
If we want to lean on asymptotic theory, the obvious question is "how big does $n$ need to be?". The problem is that the definition is based on a *limit*, so it says that the approximation gets better and better as $n$ moves towards infinity. For our purpose, $n$ is the *number of subjects*, rather than the total amount of data. So, the answer is not that there is some magic sample size that is suddenly large enough, the answer is that the approximation will get better the larger $n$ becomes. The question then is more about what our tolerance for error is. The point of the asymptotic theory is to say that the error that comes from estimation becomes more negligible as $n$ grows, as does the penalty for estimating $\boldsymbol{\Sigma}$ from the data. So, unfortunately, there is *no honest numeric answer to this question*. The way to think about it is as a *degree of comfort*. If you have $n = 5$, you should probably feel *very uncomfortable*, whereas $n = 50$ should probably make you feel *cautious*. If you have $n = 100$, you should feel *optimistic* and if you have $n = 200$ you should probably be feeling *fairly confident*. As $n$ increases beyond that, you should probably feel *perfectly fine* about this approach. These are only ballpark figures, but the point is really to think of $n$ as a *continuum of comfort*, rather than as a *threshold*. 
`````

### 1c) Use Simulations
As a third approach, we can again choose to *acknowledge* the problem, but assume that $\hat{\boldsymbol{\Sigma}}$ is a close-enough proxy to $\boldsymbol{\Sigma}$ so we can use it to *simulate* its uncertainty across multiple samples. So, this involves assuming $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$, but only insofar as assuming that our estimate is close enough to define a *plausible world* for simulation. We do not assume that the uncertainty in estimating $\boldsymbol{\Sigma}$ is negligible. In fact, the whole point of the simulations is to *measure it*. 

The advantage here is that we do not assume the uncertainty around estimating $\boldsymbol{\Sigma}$ has a particular shape. Instead, we let the simulations build the relevant sampling distributions over many iterations. This neither requires assuming that the classical results still hold, nor requires enough data so that all these problems disappear. Instead, we use the power of the *computer* to find a solution. This gets us into the world of *resampling methods*, which we encountered briefly last semester in the form of the *permutation test*. For the general problem of deriving a null distribution under an arbitrary covariance structure, the *parametric bootstrap* is most commonly employed. In this method we:

1. Treat a fitted *null model* as the "truth".
2. Use this fitted model to simulate new data.
3. Refit the model to the simulated dataset and save a copy of the test statistic.
4. Over many repeats of 2 and 3, build up a *distribution* of the test statistic under the null.
5. Calculate the $p$-value and confidence intervals from this distribution.
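The steps above can be sketched in a few lines of code. This is a deliberately minimal illustration, not how any particular package implements it. It assumes a toy two-group comparison with independent normal data rather than a model with a general $\boldsymbol{\Sigma}$, and the data values, the function names (`welch_t`, `parametric_bootstrap_p`) and the number of simulations are all hypothetical choices made for the sketch. Only the $p$-value from step 5 is computed.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

def welch_t(a, b):
    """Test statistic: difference in means over its estimated standard error."""
    se = (sd(a) ** 2 / len(a) + sd(b) ** 2 / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

def parametric_bootstrap_p(a, b, n_sim=999, seed=42):
    rng = random.Random(seed)
    t_obs = welch_t(a, b)
    # Step 1: fit the null model (one common mean and SD) and treat it as the truth
    pooled = a + b
    mu0, sigma0 = mean(pooled), sd(pooled)
    t_null = []
    for _ in range(n_sim):
        # Step 2: simulate new data from the fitted null model
        sim_a = [rng.gauss(mu0, sigma0) for _ in a]
        sim_b = [rng.gauss(mu0, sigma0) for _ in b]
        # Step 3: recompute the test statistic on the simulated data and save it
        t_null.append(welch_t(sim_a, sim_b))
    # Steps 4 and 5: compare the observed statistic with the simulated null distribution
    extreme = sum(abs(t) >= abs(t_obs) for t in t_null)
    return t_obs, (extreme + 1) / (n_sim + 1)

# Hypothetical data, invented purely for illustration
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [6.9, 7.2, 6.5, 7.8, 7.1, 6.8]
t_obs, p_value = parametric_bootstrap_p(group_a, group_b)
print(f"t = {t_obs:.2f}, bootstrap p = {p_value:.4f}")
```

Adding 1 to both the numerator and denominator is a common convention that prevents a $p$-value of exactly zero. It also makes clear that the smallest achievable $p$-value depends on the number of simulations, which is one reason the computational burden matters.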

So this requires *zero* theory about the form of the sampling distributions. The uncertainty comes through naturally as part of the simulation and we can get a $p$-value irrespective of the form of $\boldsymbol{\Sigma}$. There are no *degrees of freedom* here as a parameter of a sampling distribution because the distribution that is built does not even need to be mathematically tractable[^tractable-foot]. So this has some distinct advantages because we can get rid of much of the difficult approximation needed in classical approaches. However, there are still trade-offs:

- Computational burden is high, as calculating a single $p$-value can be a long process depending upon the complexity of refitting the model.
- Fundamentally, we have to assume that our model is *close enough* to the truth for this to work. Not only in terms of the covariance structure, but also in terms of the assumed population distribution. This is why this is a *parametric* bootstrap, rather than a *non-parametric* procedure. 
- Assumptions about the accuracy of the covariance structure will improve with larger samples, meaning we can trust the simulations more. However, the more data we have the more *asymptotic* theory takes over, so the need for simulation becomes more questionable. That being said, simulation is useful for that unknown middle-ground between samples that are too small for $\hat{\boldsymbol{\Sigma}}$ to be reliable and the point where asymptotic theory takes over.

## Option 2 - Acknowledge that $\hat{\boldsymbol{\Sigma}}$ is an *Estimate*
All the methods above were variations on a theme where, in one way or another, we assume that $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$. This was either because we were ignoring the problem, assuming that we had enough data so the problem disappeared, or assuming that our estimate was close enough to the population value to be used for simulation. However, another option is to avoid making this assumption entirely and fully accept that $\hat{\boldsymbol{\Sigma}}$ is an estimate. Once we do that, we acknowledge that there is some degree of uncertainty in its value and, without having any way of deriving this uncertainty exactly, we need to correct for it.

Methods that take this approach do so by creating *fictitious* degrees of freedom that allow a $p$-value to be calculated that has been approximately adjusted for the uncertainty. This is typically done as follows:

1. The test statistic is formed in the usual way
2. The properties of the fitted model are used *analytically* to approximate the mean and the variance of the test statistic over repeated samples
3. The mean and variance are then used to solve for the parameters of the null distribution, producing what are known as *effective degrees of freedom*
4. Those effective degrees of freedom are used to calculate a $p$-value and confidence intervals 

As an example, say we formed a $t$-statistic from our model. We know that this is a statistic with an unknown null distribution, so we will call it $t^{\star}$. We can use information in the model alongside the value of $t^{\star}$ to *approximate* $\text{Var}(t^{\star})$[^analytic-foot]. Once we have this, we simply note that the variance of a $t$-distribution can be expressed as $\frac{\nu}{\nu - 2}$ (for $\nu > 2$). So, we can use our approximate variance value to solve for $\nu$. This gives us the *effective degrees of freedom* of this $t$-distribution and we can calculate a $p$-value. In essence, we have approximated capturing *universal uncertainty* in the same way that traditional degrees of freedom do, but within a context where this definition is no longer applicable.
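To make the "solve for $\nu$" step concrete, the algebra can be written out as below. The symbol $V$ is introduced only for this sketch to stand for the approximated value of $\text{Var}(t^{\star})$, and the rearrangement assumes $V > 1$ so that a finite, positive solution exists.

$$
V = \frac{\nu}{\nu - 2}
\quad\Longrightarrow\quad
V(\nu - 2) = \nu
\quad\Longrightarrow\quad
\nu = \frac{2V}{V - 1}, \qquad V > 1.
$$

As a purely illustrative number, an approximate variance of $V = 1.25$ would give $\nu = \frac{2 \times 1.25}{0.25} = 10$ effective degrees of freedom. As $V$ shrinks towards 1, which is the variance of a standard normal, $\nu$ grows without bound and we recover the asymptotic $z$-based results.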

We have already seen an example of this approach in terms of the *non-sphericity corrections* that can be applied to a repeated measures ANOVA. Similar methods for more general models include the Satterthwaite degrees of freedom and the Kenward-Roger degrees of freedom, the latter of which also contains a *bias correction* for the fact that $\hat{\boldsymbol{\Sigma}}$ is a *biased* estimate of the true covariance structure.
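The simplest well-known instance of this idea is the Welch-Satterthwaite formula for comparing two group means with unequal variances. The sketch below is offered only as an illustration of what effective degrees of freedom look like in that special case. It is not the general mixed-model Satterthwaite calculation, and the function name and input values are hypothetical.

```python
def welch_satterthwaite_df(var_a, n_a, var_b, n_b):
    """Effective degrees of freedom for the variance estimate var_a/n_a + var_b/n_b."""
    term_a = var_a / n_a
    term_b = var_b / n_b
    # Matches the mean and variance of the combined variance estimate
    # to those of a scaled chi-squared distribution
    return (term_a + term_b) ** 2 / (
        term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1)
    )

# Balanced groups with equal variances (hypothetical values)
df_equal = welch_satterthwaite_df(4.0, 10, 4.0, 10)

# Unbalanced groups with unequal variances (hypothetical values)
df_unequal = welch_satterthwaite_df(1.0, 10, 9.0, 6)

print(f"equal case:   {df_equal:.2f}")
print(f"unequal case: {df_unequal:.2f}")
```

In the balanced, equal-variance case the formula returns the classical $2(n - 1) = 18$. Once the variances and group sizes differ, the result becomes fractional and falls between the smaller group's $n - 1$ and the pooled $n_a + n_b - 2$.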

This method is perhaps more appealing than simply pretending there is no problem because it tries to accommodate small sample adjustments and uncertainty, though it also comes with some consequences:

- We are assuming that the true null distribution only differs from known null distributions (such as the $t$ and $F$) by its width, but not the general shape.
- This still remains an *approximation*, though it should behave better in smaller samples when degrees of freedom become more necessary.
- Degrees of freedom can become fractional and no longer have a clear theoretical grounding. They are more devices to encode "tail-heaviness" within the familiar language of $t$ and $F$ distributions. 

In fact, we already saw an example of this last week in terms of the *non-sphericity corrections*.

## Practically, Where Does This Leave Us?