# The Paired *t*-test
To begin our journey into the world of models for repeated measurements, we will start with the simplest possible example: the paired $t$-test. Last semester, we saw how the humble $t$-test can be subsumed into the linear model framework through the use of *dummy variables*. This same approach can be used with paired data[^paired-foot] because the difference between the *independent* and *paired* $t$-test does not lie with the *mean function*. We spent a lot of time last semester discussing different mean functions and, you will be glad to know, this all carries over into the world of repeated measurements. The difference lies with the *variance function*. Specifically, how can we alter the variance function to accommodate non-zero correlation between the repeats? As we will see in this part of the lesson, there are *two* equivalent ways of doing this with paired data. One of them *side-steps* the issue of dependence whereas the other does not. However, it is useful to understand how both of these work because this provides some crucial insight into modelling repeated measurements that we will generalise as our models get more complex.

## Two-sample vs Paired *t*-tests
To begin with, it is useful to examine *how* the results differ between a *two-sample* and *paired* $t$-test. We can do this in `R` by comparing the results of the `t.test()` function with `paired=FALSE` and `paired=TRUE`. To do this, we use the `mice2` data set from the `datarium` package, which contains the weight of a sample of 10 mice both *before* and *after* some treatment. The experimental question concerns whether the treatment affects the weight of the mice. The data are shown below

In [1]:
library('datarium')
data('mice2')
print(mice2)

   id before after
1   1  187.2 429.5
2   2  194.2 404.4
3   3  231.7 405.6
4   4  200.5 397.2
5   5  201.7 377.9
6   6  235.0 445.8
7   7  208.7 408.4
8   8  172.4 337.0
9   9  184.6 414.3
10 10  189.6 380.3


We can compare the output from a *two-sample* $t$-test and a *paired* $t$-test by changing the `paired=` argument of `t.test()`, as shown below

In [2]:
print(t.test(mice2$before, mice2$after, var.equal=TRUE, paired=TRUE))  # paired t-test
print(t.test(mice2$before, mice2$after, var.equal=TRUE, paired=FALSE)) # two-sample t-test


	Paired t-test

data:  mice2$before and mice2$after
t = -25.546, df = 9, p-value = 1.039e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -217.1442 -181.8158
sample estimates:
mean difference 
        -199.48 


	Two Sample t-test

data:  mice2$before and mice2$after
t = -17.453, df = 18, p-value = 9.974e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -223.4926 -175.4674
sample estimates:
mean of x mean of y 
   200.56    400.04 



The output is a bit different between the two methods, so let us spend a little time unpacking what *is* and what *is not* different here. To begin with, the clearest differences between the two methods concern the $t$-statistic itself, the degrees of freedom, the $p$-value and the confidence interval. This is summarised in the table below 

| Test       | *t*-statistic | DoF | *p*-value | 95% CI             |
| ---------- | ------------- | --- | --------- | ------------------ |
| Paired     | -25.546       | 9   | 1.039e-09 | [-217.14, -181.82] |
| Two-sample | -17.453       | 18  | 9.974e-13 | [-223.49, -175.47] |

Although this may therefore seem like *everything* is different, there is actually one element that is *identical* here, though it is somewhat hidden. To see it, consider that the structure of a $t$-test is

$$
t = \frac{\mu_{1} - \mu_{2}}{\text{SE}\{\mu_{1} - \mu_{2}\}},
$$

meaning that we think of the $t$ as the ratio between the *mean difference* and the *standard error of the mean difference*. The $t$-statistic is different between the *two-sample* and the *paired* tests, but this does not necessarily mean that all elements of this ratio are also different. Indeed, if we look at the output above we can see that the *paired* test reports a mean difference of `-199.48` and the *two-sample* test reports the individual means as `200.56` and `400.04`. If we calculate the mean difference in the *two-sample* test we get 

In [3]:
print(200.56 - 400.04)

[1] -199.48


So, this is *identical* between the *paired* and *two-sample* tests. This should not be surprising, as we already established that repeated measurements do not affect the mean function. So, in either case, the group means are the same, the mean difference is the same and the *numerator* of the $t$-statistic is the same. From this, we can conclude that the *difference* between the two methods concerns the *denominator* of the $t$-statistic. In other words, *the standard error of the difference changes under repeated measurements*.
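As a quick check on the raw data rather than the rounded output, the sketch below re-enters the `mice2` values printed earlier (so that it runs without `datarium`) and confirms that the difference of the group means and the mean of the paired differences are the same number. The object names are illustrative only.

```r
# mice2 values copied from the printout earlier in this section
before <- c(187.2, 194.2, 231.7, 200.5, 201.7, 235.0, 208.7, 172.4, 184.6, 189.6)
after  <- c(429.5, 404.4, 405.6, 397.2, 377.9, 445.8, 408.4, 337.0, 414.3, 380.3)

diff.of.means <- mean(before) - mean(after)  # numerator of the two-sample test
mean.of.diffs <- mean(before - after)        # numerator of the paired test

print(c(diff.of.means, mean.of.diffs))
print(isTRUE(all.equal(diff.of.means, mean.of.diffs)))
```

Because the mean is a linear operation, these two quantities always agree whenever every subject contributes both measurements.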

Given that we know the numerator for both tests, we can recover the denominators to see that this is the case

In [4]:
mean.diff  <- -199.48
paired.t   <- -25.546
twosamp.t  <- -17.453
paired.se  <-  mean.diff / paired.t
twosamp.se <-  mean.diff / twosamp.t

print(c(paired.se, twosamp.se))

[1]  7.808659 11.429554


So, the *standard error* of the difference is much *smaller* in the *paired* test when compared to the *two-sample* test. This should not be a surprise. Thinking back to our discussion from the beginning of the lesson, we know that the variance of the difference between two random variables should get *smaller* when they are positively correlated. From this, we can conclude that, when applied to *paired* data, the *two-sample* $t$-test is using a standard error that is *too large*. This tracks with everything we have discussed so far. However, the key question for us is *how* the *paired* $t$-test is able to do this. 
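The same standard errors can be computed directly from the data, which also makes the role of the correlation explicit. The sketch below re-enters the `mice2` values and relies on the identity $\text{Var}(y_{1} - y_{2}) = \text{Var}(y_{1}) + \text{Var}(y_{2}) - 2\text{Cov}(y_{1}, y_{2})$. The variable names are illustrative, and the pooled variance is written in the simple form that applies when the two groups are the same size.

```r
# mice2 values copied from the printout earlier in this section
before <- c(187.2, 194.2, 231.7, 200.5, 201.7, 235.0, 208.7, 172.4, 184.6, 189.6)
after  <- c(429.5, 404.4, 405.6, 397.2, 377.9, 445.8, 408.4, 337.0, 414.3, 380.3)
n      <- length(before)

# Two-sample (pooled) standard error: ignores the covariance between repeats
pooled.var <- (var(before) + var(after)) / 2   # valid because group sizes are equal
twosamp.se <- sqrt(pooled.var * (1/n + 1/n))

# Paired standard error: based on the variance of the differences
paired.se  <- sd(before - after) / sqrt(n)

# The squared standard errors differ only by the covariance term
cov.term   <- 2 * cov(before, after) / n

print(c(paired.se, twosamp.se))
print(c(paired.se^2, twosamp.se^2 - cov.term))  # these two values agree
print(cor(before, after))
```

A positive covariance between the repeats is what shrinks the paired standard error relative to the two-sample one.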

As mentioned at the start of this part of the lesson, there are two equivalent ways of thinking about this. We will discuss *both* below because their equivalence provides important information for conceptualising more complex methods. 

## The Model of *Paired Differences*
The first method we can use is actually a bit of a *cheat* in order to side-step the issue of dependence entirely. However, conceptually, this is a really key step because it introduces the idea that we can correctly model repeated measurements by *removing something from the data*. In this first method, we remove it manually by subtracting the two repeats and analysing the *difference*. Further below we will see how this removal can be done *within the model itself*.

... So the key insight is that when we take $D_{i} = y_{i1} - y_{i2}$, we have *removed* some element from the data that allows us to analyse the difference between the conditions without taking correlation into account. How is this possible? Precisely because, by subtracting these two values, we have *removed* the correlation. Thus, there is some shared component between $y_{i1}$ and $y_{i2}$ that cancels-out when we subtract them. In cancelling-out, the correlation is gone and we can treat $D_{i}$ as a vector of independent measurements. 

We can get more clarity on this by writing it formally. Let us *add* a shared component to the definition of both $y_{i1}$ and $y_{i2}$. We will call this $S_{i}$

$$
\begin{alignat*}{1}
    y_{i1} &= \mu_{1} + S_{i} + \epsilon_{i1} \\
    y_{i2} &= \mu_{2} + S_{i} + \epsilon_{i2} 
\end{alignat*}.
$$

So, the reason why $y_{i1}$ and $y_{i2}$ are correlated is because they *share* the same component $S_{i}$. This captures the idea that these measurements come from the same *subject*. If we then *subtract* these values, $S_{i}$ will cancel out

$$
\begin{alignat*}{1}
    (y_{i1} - y_{i2}) = D_{i} &= (\mu_{1} - \mu_{2}) + (S_{i} - S_{i}) + (\epsilon_{i1} - \epsilon_{i2}) \\
                              &= \mu^{(D)} + \epsilon_{i}^{(D)}
\end{alignat*}.
$$

So, this tells us that $S_{i}$ precisely captures the correlation, because removing it renders the data *independent*. 
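In practical terms, this means the paired test is just a one-sample test on the differences. As a minimal sketch, again re-entering the `mice2` values with illustrative object names, a one-sample `t.test()` on $D_{i}$ and an intercept-only `lm()` on $D_{i}$ should both reproduce the paired result reported earlier.

```r
# mice2 values copied from the printout earlier in this section
before <- c(187.2, 194.2, 231.7, 200.5, 201.7, 235.0, 208.7, 172.4, 184.6, 189.6)
after  <- c(429.5, 404.4, 405.6, 397.2, 377.9, 445.8, 408.4, 337.0, 414.3, 380.3)

d <- before - after   # D_i: the shared subject component cancels here

paired.res  <- t.test(before, after, paired=TRUE)  # paired t-test
onesamp.res <- t.test(d, mu=0)                     # one-sample t-test on D_i
diff.mod    <- lm(d ~ 1)                           # mean function is a single constant

print(unname(paired.res$statistic))
print(unname(onesamp.res$statistic))
print(coef(summary(diff.mod)))
```

The intercept of `diff.mod` is the estimate of $\mu^{(D)}$, and its residual degrees of freedom are the number of pairs minus one.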

## The Model of *Partitioned Errors*

### Partitioning the Error



To begin with, let us just examine the overall variance of the data for the first 15 subjects.

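In [5]:
# `y` is assumed to be the 50 x 2 matrix (subjects x conditions) simulated earlier in the lesson
y.long  <- as.vector(t(y))                    # stack the rows: subject 1 (A, B), subject 2 (A, B), ...
cond    <- rep(c("A","B"),50)                 # condition label for each observation
subject <- as.factor(rep(seq(1,50),each=2))   # subject label for each observation

# Plot the first 30 observations (15 subjects), coloured by condition
plot(as.numeric(subject)[1:30],
     y.long[1:30],
     col=as.factor(cond)[1:30],
     xlab='Subject',
     ylab='Y',
     pch=16)

abline(h=mean(y.long))  # grand mean of all observations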

Here we can see that there are 3 main sources of variation in these data:

1. The two conditions have different means, thus measures from one condition tend to be larger than the other. In this example, the mean difference is only 0.21, so this is quite a subtle effect. But it is there. This is the *structured* variance we are interested in.
2. The data come from different subjects, thus the degree to which the data fall above or below the grand mean depends upon the individual subject. This is one source of *error variance* that explains why the data deviate from their expected value.
3. Within each subject, there is variation between the two measures. The magnitude of this difference is sometimes bigger and sometimes smaller than the condition difference of 0.21. This is *another* source of *error variance*.

So, we have one *structured* source of variation here (the difference between the experimental conditions), as well as *two* sources of error: the data come from different subjects and responses within each subject will also be different. The "subject" source captures the *internal consistency* of individual subjects (i.e. how correlated the responses are), whereas the "within-subject" source captures the natural random variation we get between measurements, irrespective of whether they come from the same subject or not.

To understand this more, we can work through *removing* each of these sources from the data to see how it can be decomposed. To begin with, we can remove the constant effect of each condition. To do so, we can simply take the residuals from `two.sample.mod` (the model `lm(y.long ~ cond)`, which is fitted and discussed later in this section), given that the predicted values from this model *are* the condition means. Once the effect of the conditions is removed, we have the plot below.

In [None]:
res.two.sample <- resid(two.sample.mod)

plot(as.numeric(subject)[1:30],
     res.two.sample[1:30],
     xlab='Subject',
     ylab='Residuals',
     col=as.factor(cond),
     pch=16,
     ylim=c(-3,3))

abline(h=0)

Now that we have removed any variation associated with the two conditions, we can turn to variability associated with the individual subjects. This is known as *between-subjects* variance. To remove it, we can calculate the mean value of each subject and then subtract it from the data. This results in the plot below.

In [None]:
res.matrix <- matrix(res.two.sample, ncol=2, nrow=50, byrow=TRUE) # one row per subject
sub.means  <- rowMeans(res.matrix)                                # mean residual of each subject
sub.idx    <- as.numeric(subject)                                 # subject index of each observation

# Subtract each subject's mean from that subject's two residuals
res.nosub  <- res.two.sample - sub.means[sub.idx]

plot(sub.idx[1:50],
     res.nosub[1:50],
     xlab='Subject',
     ylab='Residuals - Subjects',
     pch=16,
     ylim=c(-3,3))


Notice that a huge amount of the variation in these data was attributable to the variation between different subjects. Now that this has been removed, we can effectively treat our data as *one big sample from a single subject*. As such, the variability we can see between all pairs of measurements provides us with an indication of how *internally consistent* a single subject is across the different conditions of the task. With the overall effect of the conditions removed, this remaining variability is not related to the conditions themselves. Rather, it is related to other sources of random variation that cause an individual's response to change across multiple repeats of an experiment. This is known as the *within-subject variance*.

The variance associated with the two conditions is of direct interest because this captures our experimental effect of interest. Both the *between-subjects* and *within-subject* variance are effectively sources of *error* because they indicate different ways that the raw data may differ from the means of the conditions. One of these errors comes from the fact that different people may respond consistently higher or lower than the mean. The other comes from the fact that, even if an individual did not respond differently from the mean, natural variation across repeats will always be there.

This partitioning of variance can be formally stated as

$$
\text{Var}(y) = \sigma^{2} = \sigma^{2}_{b} + \sigma^{2}_{w}.
$$
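To make this partition concrete, the following sketch simulates paired data from the shared-component model $y_{ij} = \mu_{j} + S_{i} + \epsilon_{ij}$. All numerical settings (the seed, the number of subjects, the condition means and the two standard deviations) are hypothetical choices made for illustration and are not the values used to simulate the data analysed in this lesson.

```r
set.seed(1)        # arbitrary seed so the sketch is reproducible
n.sub   <- 10000   # many subjects so the estimates are stable
sigma.b <- 2       # between-subjects SD (hypothetical)
sigma.w <- 1       # within-subject SD (hypothetical)

S  <- rnorm(n.sub, mean=0, sd=sigma.b)      # shared subject component S_i
y1 <- 10.0 + S + rnorm(n.sub, sd=sigma.w)   # condition 1
y2 <- 10.5 + S + rnorm(n.sub, sd=sigma.w)   # condition 2

print(var(y1))        # should be close to sigma.b^2 + sigma.w^2 = 5
print(cov(y1, y2))    # should be close to sigma.b^2 = 4
print(var(y1 - y2))   # should be close to 2 * sigma.w^2 = 2
```

Under these assumptions, the covariance between the repeats is the between-subjects variance, and only the within-subject variance survives in the differences because the subject component is subtracted away.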

As such, we now have *three* choices when it comes to calculating standard errors. Do we use the pooled variance of $\sigma^{2}$? The between-subjects variance of $\sigma^{2}_{b}$, or the within-subject variance of $\sigma^{2}_{w}$?

Key Point
In a regular two-sample $t$-test, the error variance is built from the differences between the means of the groups and the raw data. However, when we have repeated measurements, this difference can be further divided into two sources ... This is consistent with the idea of the *between-subjects variance* $\left(\sigma^{2}_{b}\right)$ and the *within-subject variance* $\left(\sigma^{2}_{w}\right)$ ... As such, the difference with a *paired* test is that it uses the *within-subject variance* exclusively for determining the denominator of the $t$-statistic.

### Partitioning the Error as a Decomposition of the Variance-covariance Matrix
Now, we will connect what we have done above with the idea of modelling the variance-covariance matrix. Rather than doing this *explicitly*, the method above was an *implicit* modelling of the covariance structure...

### Explicit Two-sample $t$-test as a Linear Model
To see this, we first start with the familiar case of the *two-sample* model. We can fit this as an LM within `R` as follows

In [None]:
y.long <- as.vector(t(y)) # Turn y into a column
cond   <- rep(c("A","B"),50) # Create a predictor for the two conditions

two.sample.mod <- lm(y.long ~ cond)
summary(two.sample.mod)

Focussing on the coefficient and tests associated with `condB` in the table, we can see $t = 0.928$ and $p = 0.356$, which is the same[^foot1] as we saw for the two-sample test earlier. We can also see that the degrees of freedom agree at $98$. So we have managed to successfully implement the *two-sample* $t$-test as a linear model.

Now, we know that this is *incorrect* for correlated data. However, specifying the model in this form allows us to dig deeper into *why* this is wrong, which will then provide us with the insight needed to correct this and thus tell us how the *paired* method is able to accommodate correlation. This will also provide us with the grounding needed to understand the traditional repeated measures ANOVA, as well as mixed-effects models a little later in the unit.

### Where Does the Standard Error Come From?
To begin understanding what is going on here, we need to review where the value for the standard error comes from. In the context of a linear model, the standard error of $\hat{\beta}_{1}$ is given by

$$
\text{SE}\left(\hat{\beta}_{1}\right) = \sqrt{\text{Var}\left(\hat{\beta}_{1}\right)} = \sqrt{\frac{\hat{\sigma}^{2}}{\sum_{i=1}^{n}\left(x_{i1} - \bar{x}_{1}\right)^{2}}}.
$$

So, the standard error is the square-root of the variance of an estimate, and the variance of an estimate is simply a scaled version of the error variance from the model. This scaling is not entirely clear when represented in the format above. However, when $x_{1}$ is simply a dummy variable encoding a mean difference, this simplifies to the known formula for the denominator of a $t$-test assuming equal variance in the two samples

$$
\text{SE}\left(\hat{\beta}_{1}\right) = \text{SE}\left(\hat{\mu}_{1} - \hat{\mu}_{2}\right) = \sqrt{\frac{\hat{\sigma}^{2}}{\frac{1}{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}} = \sqrt{\hat{\sigma}^{2}\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}.
$$

Here, we can more easily see that the standard error depends only upon the sample sizes and the error variance.
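The simplification can be checked numerically. The sketch below builds a 0/1 dummy variable for two groups (the group sizes are arbitrary illustrative values) and confirms that the sum of squared deviations of the dummy variable from its mean equals $1 / \left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)$.

```r
n1 <- 12   # illustrative group sizes
n2 <- 30
x  <- c(rep(0, n1), rep(1, n2))   # dummy variable coding group membership

ssx     <- sum((x - mean(x))^2)   # sum of squared deviations of the dummy variable
scaling <- 1 / (1/n1 + 1/n2)      # the form that appears in the t-test denominator

print(c(ssx, scaling))
print(isTRUE(all.equal(ssx, scaling)))
```

Both quantities reduce to $n_{1}n_{2} / (n_{1} + n_{2})$, which depends only on the group sizes and not on the data.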

`````{admonition} Key Point!
:class: tip
Under multiple repeats of the same experiment, the sample sizes of the groups will remain the same. As such, this element of the standard error is simply a *constant scaling*. Whether the data are independent or not will not change this element of the standard error because the formula is *always the same*. The only element that can change is the *error variance*. As such, this must be the source of the difference between the *two-sample* and *paired* approaches.
`````

We can verify that this formula for the standard error is correct in the *two-sample* model by calculating

In [None]:
sigma2 <- summary(two.sample.mod)$sigma^2
sqrt(sigma2 * (1/50 + 1/50))

which agrees with our results so far. As the element of most interest here is the estimate of $\sigma^{2}$, the next obvious question is: where does this come from?

As a review, the error variance in a linear model is estimated using

$$
\hat{\sigma}^{2} = \frac{\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}}{n-p},
$$

which is also known as the *residual mean square* or *error mean square*. We can again verify this for our example by calculating

In [None]:
sigma2

sum(resid(two.sample.mod)^2) / two.sample.mod$df.residual



As such, if this is the element that differs between the *two-sample* and *paired* model, then our final suspect must be the *model residuals*. More specifically, the residuals must be *larger* in the *two-sample* case and *smaller* in the *paired* case. This is the only way the standard errors can differ. But this still does not explain *why* this is the case.

### Recreating the Paired $t$-test in the Linear Model

From all we discussed earlier, the aim is therefore to *remove* the between-subject variance from the residuals so that the error variance of the model only contains the *within-subject variance*. If we do not do this, then $\sigma^{2} = \sigma^{2}_{b} + \sigma^{2}_{w}$, which will be too large to accurately capture the standard error of the mean difference under repeated measurements.

In [None]:
library(car)

subject <- rep(seq(1,50),each=2)
subject <- as.factor(subject)

paired.mod <- lm(y.long ~ cond + subject)
summary(paired.mod)

print(Anova(paired.mod))

Now, the output here is a bit of a mess due to all the subject effects. However, if you look at the coefficient and test for `condB`, notice that $t = 2.495$ and $p = 0.016$, which is the same as the *paired* $t$-test from earlier. Furthermore, the degrees of freedom are now correct at $49$. As such, adding the subject effects to the model has allowed the *between-subjects* error to be partitioned out and thus the remaining variance calculated from the residuals is *only* the *within-subject* error. This is the error needed to correctly estimate the standard error of the paired difference and thus the model results are now correct.
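The same equivalence can be checked on the `mice2` data from the start of this section. The sketch below re-enters the printed values, reshapes them into long format, and fits one model with only a condition effect and one with a condition effect plus subject effects. The names `weight`, `time` and `id` are chosen here for illustration.

```r
# mice2 values copied from the printout earlier in this section
before <- c(187.2, 194.2, 231.7, 200.5, 201.7, 235.0, 208.7, 172.4, 184.6, 189.6)
after  <- c(429.5, 404.4, 405.6, 397.2, 377.9, 445.8, 408.4, 337.0, 414.3, 380.3)
n      <- length(before)

# Long format: one row per measurement
weight <- c(before, after)
time   <- factor(rep(c("before","after"), each=n), levels=c("before","after"))
id     <- factor(rep(seq_len(n), times=2))

mice.twosamp <- lm(weight ~ time)        # ignores the pairing
mice.paired  <- lm(weight ~ time + id)   # subject effects partition the error

print(coef(summary(mice.twosamp))["timeafter", ])
print(coef(summary(mice.paired))["timeafter", ])
print(c(mice.twosamp$df.residual, mice.paired$df.residual))
```

The `timeafter` row of the second model should match the paired $t$-test (up to sign, because the coefficient is *after* minus *before*) with 9 degrees of freedom, while the first model should match the two-sample test with 18.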

```{admonition} Advanced: Understanding the coefficients in the paired model
:class: warning, dropdown
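With `R`'s default treatment coding, subject 1 is the reference level (its effect is fixed at zero), $\beta_{1}$ is the condition effect and $\beta_{2},\dots,\beta_{n}$ are the effects of subjects $2,\dots,n$. Averaging the expected values over the $n$ subjects within each condition, we can see that

$$
\begin{align}
\mu_{1} &= \beta_{0} + \frac{1}{n}\sum_{k=2}^{n}\beta_{k} \\
\mu_{2} &= \beta_{0} + \beta_{1} + \frac{1}{n}\sum_{k=2}^{n}\beta_{k}.
\end{align}
$$

Solving for both $\beta_{0}$ and $\beta_{1}$ gives

$$
\begin{align}
\beta_{0} &= \mu_{1} - \frac{1}{n}\sum_{k=2}^{n}\beta_{k} \\
\beta_{1} &= \mu_{2} - \beta_{0} - \frac{1}{n}\sum_{k=2}^{n}\beta_{k} = \mu_{2} - \mu_{1}.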
\end{align}
$$

So, rather unintuitively, the intercept is actually the mean of the first group, minus the average of the subject effects. This then raises the question of what exactly the subject effects are. We can use a similar approach as above to solve for these. For instance, the expected value of the response from subject 2 in condition A is

$$
\mu_{12} = \beta_{0} + \beta_{2}.
$$

Meaning that the effect for subject 2 is

$$
\beta_{2} = \mu_{12} - \beta_{0}.
$$

As such, the subject effects are effectively *residuals* around the intercept. Importantly, these are *constant offsets*, irrespective of the condition. For instance, subject 2 is expected to lie the same distance from the mean of condition A as from the mean of condition B.
```
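The relationships between the coefficients and the condition means can be verified numerically. As an illustration, the sketch below uses the `mice2` values rather than the simulated data (the variable names are chosen for this example) and checks that the intercept plus the average subject effect recovers the mean of the first condition, with the reference subject contributing an effect of zero.

```r
# mice2 values copied from the printout earlier in this section
before <- c(187.2, 194.2, 231.7, 200.5, 201.7, 235.0, 208.7, 172.4, 184.6, 189.6)
after  <- c(429.5, 404.4, 405.6, 397.2, 377.9, 445.8, 408.4, 337.0, 414.3, 380.3)
n      <- length(before)

weight <- c(before, after)
time   <- factor(rep(c("before","after"), each=n), levels=c("before","after"))
id     <- factor(rep(seq_len(n), times=2))

mod <- lm(weight ~ time + id)
b   <- coef(mod)

# Subject effects, including the reference subject whose effect is zero
sub.effects <- c(0, b[grep("^id", names(b))])
avg.sub     <- mean(sub.effects)

# Intercept plus average subject effect vs. the observed mean of each condition
print(unname(c(b["(Intercept)"] + avg.sub, mean(before))))
print(unname(c(b["(Intercept)"] + b["timeafter"] + avg.sub, mean(after))))
```

In each printed pair the two values should agree, which shows that the intercept on its own is the expected value for the reference subject in the first condition rather than the condition mean.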

## Section Summary

[^foot2]: Here we set the variance to be the same between the samples because this (a) agrees with the simulations and (b) prevents the issue of correlation from being conflated with the issue of homogeneity of variance.

[^foot1]: Due to the way the factors are coded in `R`, the coefficient is actually the opposite comparison here, hence why the $t$-statistic is *positive* rather than *negative*.

[^paired-foot]: Paired data simply means data where each experimental unit provides a *pair* of measurements. For instance, each subject takes part in *two* conditions, or is measured at *two* separate time points.