# [The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)
## Chapter 11 - Using Regression to Test for More Differences with ANOVA and ANCOVA!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Perform ANOVA tests using `lm()` and `aov()`.
- Perform post-hoc tests to identify specific differences.
- Perform an ANCOVA test.
- Calculate *p*-values for ANOVA and ANCOVA using the null hypothesis to build a histogram.

**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)**, and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**, and have read Chapter 11 in **[The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)**.

----

# Performing ANOVA tests using `lm()` and `aov()`

If we're going to do ANOVA, then the first thing we need is some data. In this example, we'll use the dataset illustrated in Chapter 11, which has the recovery times measured for 4 different drugs.

In [None]:
## First, create the data
drug.a <- c(8, 12, 22)
drug.b <- c(20, 29, 39)
drug.c <- c(6, 17, 20)
drug.d <- c(45, 39, 30)

recovery.time <- c(drug.a, drug.b, drug.c, drug.d)
recovery.time

Now let's put `recovery.time` into a `data.frame` that uses a factor, `drug`, to organize each measurement.

In [None]:
df <- data.frame(
  time = recovery.time,
  drug = factor(c(rep("a", times=3),
                  rep("b", times=3),
                  rep("c", times=3),
                  rep("d", times=3))))
df

Now, just like we did for Simple Linear Regression, Multiple Regression, and *t*-tests, we can call the `lm()` and `summary()` functions with the `data.frame` and a simple formula, `time ~ drug`.

In [None]:
## rather than saving the output from lm() in a variable,
## we can just pass it directly to summary()...
anova.summary <- summary(lm(time ~ drug, data=df))
anova.summary

The *p*-value in the bottom right hand corner, 0.01487, tells us to reject the Null Hypothesis that there are no differences in recovery time among the 4 drugs.
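
If you want to poke at that output a little more, here is a small sketch (the names `fit`, `fit.summary`, `f.stat` and `p.from.f` are just made up for this example). It shows that the coefficients from `lm()` are the mean for Drug A, the reference level, plus the differences between each other drug's mean and Drug A's mean, and that the *p*-value at the bottom of the summary can be recalculated from the stored *F*-statistic with `pf()`.

```r
## re-create the data so that this block runs on its own
df <- data.frame(
  time = c(8, 12, 22, 20, 29, 39, 6, 17, 20, 45, 39, 30),
  drug = factor(rep(c("a", "b", "c", "d"), each=3)))

fit <- lm(time ~ drug, data=df)
fit.summary <- summary(fit)

## the intercept is the mean for drug a (the reference level)
## and the other coefficients are differences from that mean
coef(fit)
tapply(df$time, df$drug, mean)

## the overall p-value is the area under the F-distribution
## to the right of the observed F-value
f.stat <- fit.summary$fstatistic
p.from.f <- pf(f.stat["value"],
               df1=f.stat["numdf"],
               df2=f.stat["dendf"],
               lower.tail=FALSE)
p.from.f
```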

Now let's see what happens when we use the `aov()` function, which is specialized for doing ANOVA, with the same data.

In [None]:
## now using aov()...
## Again, rather than saving the output from aov() in a variable,
## we can just pass it directly to summary()
aov.summary <- summary(aov(time ~ drug, data=df))
aov.summary

And we see that the `aov()` function simplifies the output significantly, but still gives us the same *F*-value and the same *p*-value.
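
To see where that *F*-value comes from, here is a rough sketch that calculates it by hand from the variation between the drug means and the variation within each drug. The variable names (`ss.between`, `ss.within`, `f.by.hand` and so on) are only for illustration.

```r
## re-create the data so that this block runs on its own
df <- data.frame(
  time = c(8, 12, 22, 20, 29, 39, 6, 17, 20, 45, 39, 30),
  drug = factor(rep(c("a", "b", "c", "d"), each=3)))

grand.mean <- mean(df$time)

## ave() returns, for each measurement, the mean of its own drug
own.drug.mean <- ave(df$time, df$drug)

## variation of the drug means around the overall mean
ss.between <- sum((own.drug.mean - grand.mean)^2)

## variation of each measurement around its own drug mean
ss.within <- sum((df$time - own.drug.mean)^2)

df.between <- nlevels(df$drug) - 1        # 4 - 1 = 3
df.within <- nrow(df) - nlevels(df$drug)  # 12 - 4 = 8

f.by.hand <- (ss.between / df.between) / (ss.within / df.within)
f.by.hand

p.by.hand <- pf(f.by.hand, df1=df.between, df2=df.within, lower.tail=FALSE)
p.by.hand
```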

Now let's see how we can generate the same *p*-value using the histogram method.

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
set.seed(42)

## To generate random datasets, we'll use a
## normal distribution. This distribution
## will be based on our observed data, so
## we need to calculate the estimated
## mean and standard deviation.
overall.mean <- mean(df$time)
overall.sd <- sd(df$time)

## Next, we define the number of random
## datasets we want to create...
num.rand.datasets <- 10000

## ...and we define the number of data points
## per dataset
num.datapoints <- length(drug.a)

## Create a vector of NAs that is num.rand.datasets long
rand.r.squared <- rep(NA, times=num.rand.datasets)

## Here is the loop where we create a bunch of random datasets,
## each with num.datapoints values, do an ANOVA
## with the random data, then calculate and store
## the R-squared values
for(i in 1:num.rand.datasets) {
    
    ## generate random values for each drug
    r.a <- rnorm(n=num.datapoints,
                 mean=overall.mean,
                 sd=overall.sd)
    
    r.b <- rnorm(n=num.datapoints,
                 mean=overall.mean,
                 sd=overall.sd)
    
    r.c <- rnorm(n=num.datapoints,
                 mean=overall.mean,
                 sd=overall.sd)
    
    r.d <- rnorm(n=num.datapoints,
                 mean=overall.mean,
                 sd=overall.sd)

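    ## bundle the random values together in a data.frame
    ## (using a new name so we don't overwrite the original recovery.time)
    rand.recovery.time <- c(r.a, r.b, r.c, r.d)
    
    data <- data.frame(
        time = rand.recovery.time,
        drug = factor(c(rep("drug.a", times=num.datapoints),
                        rep("drug.b", times=num.datapoints),
                        rep("drug.c", times=num.datapoints),
                        rep("drug.d", times=num.datapoints))))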
    
    ## fit a linear regression line to the random data
    ## and calculate R-squared
    lm.fit <- summary(lm(time ~ drug, data=data))

    ## save the R-squared value.
    rand.r.squared[i] <- lm.fit$r.squared
}

Now let's draw a histogram of the $R^2$ values with the `hist()` function...

In [None]:
hist(rand.r.squared)

...and calculate the *p*-value as the proportion of "random" $R^2$ values greater than or equal to the one we got for our original data.

In [None]:
## the number of randomly generated r.squared >= the original r.squared
num.greater <- sum(rand.r.squared >= anova.summary$r.squared)

## calculate the p-value 
p.value <- num.greater / num.rand.datasets

## print out the p-value
p.value

Thus, the *p*-value calculated with the histogram is 0.0143. Now let's compare that to the *p*-value stored in `aov.summary`...

In [None]:
aov.summary

So, at last, we see that the two *p*-values are essentially the same.
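
If you're wondering how close "essentially the same" should be, here is a back-of-the-envelope sketch. It treats the histogram *p*-value as a proportion estimated from 10,000 independent random datasets and uses the usual standard error for a proportion. The names `p.hist`, `p.exact` and `mc.se` are made up for this example, and the two *p*-values are simply the ones reported above.

```r
## values reported above
p.hist <- 0.0143     # from the histogram of random datasets
p.exact <- 0.01487   # from lm() and aov()
num.rand.datasets <- 10000

## approximate standard error of a proportion estimated
## from num.rand.datasets independent random datasets
mc.se <- sqrt(p.hist * (1 - p.hist) / num.rand.datasets)
mc.se

## how many standard errors apart are the two p-values?
abs(p.hist - p.exact) / mc.se
```

The two *p*-values are less than one standard error apart, which is about as close as we can expect with 10,000 random datasets.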

# BAM!

Now that we have relatively high confidence that there are differences among the four drugs, let's do all pairwise *t*-tests so that we can identify which drugs are different. These tests, done after an ANOVA to identify the individual groups that are different, are called **Post-Hoc** tests.

----

# Performing post-hoc tests to identify specific differences

Performing all pairwise *t*-tests in R is super easy with the `pairwise.t.test()` function. Not only will this function do all of the pairwise *t*-tests for us, it will also adjust the *p*-values using any of the methods supported by the `p.adjust()` function. We'll start by using the Holm correction for multiple testing.

In [None]:
pairwise.t.test(df$time, df$drug, p.adjust.method="holm")

So, the output from `pairwise.t.test()` is a matrix of adjusted *p*-values. The first column contains all of the adjusted *p*-values associated with Drug A, starting with Drug B at the top and ending with Drug D at the bottom. Likewise, the second column gives us the remaining *p*-values associated with Drug B (the comparison of Drugs A and B is in the first column), and the third column gives us the remaining *p*-value associated with Drug C.

The last row has the two adjusted *p*-values less than 0.05 and both of them are associated with Drug D. Drug D appears to be significantly different from Drug A (adjusted *p*-value = 0.037) and Drug C (adjusted *p*-value = 0.037).
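
To take some of the mystery out of the Holm correction, here is a small sketch that starts with the unadjusted *p*-values (by setting `p.adjust.method="none"`) and then applies the Holm steps by hand. The variable names (`raw.p`, `holm.by.hand` and so on) are just for illustration.

```r
## re-create the data so that this block runs on its own
df <- data.frame(
  time = c(8, 12, 22, 20, 29, 39, 6, 17, 20, 45, 39, 30),
  drug = factor(rep(c("a", "b", "c", "d"), each=3)))

## unadjusted p-values for all 6 pairwise comparisons
raw <- pairwise.t.test(df$time, df$drug, p.adjust.method="none")$p.value
raw.p <- raw[!is.na(raw)]  # keep the 6 p-values, drop the empty cells

## Holm: sort the p-values from smallest to largest, multiply the
## smallest by 6, the next by 5, and so on. Then make sure no adjusted
## p-value is smaller than the one before it, and cap everything at 1.
m <- length(raw.p)
sorted.order <- order(raw.p)
holm.sorted <- pmin(1, cummax((m - seq_len(m) + 1) * raw.p[sorted.order]))

## put the adjusted p-values back in their original order
holm.by.hand <- holm.sorted[order(sorted.order)]

## compare with what p.adjust() gives us
cbind(raw.p, holm.by.hand, holm.builtin=p.adjust(raw.p, method="holm"))
```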

Now let's see what happens when we use False Discovery Rates (FDR) to adjust the *p*-values...

In [None]:
pairwise.t.test(df$time, df$drug, p.adjust.method="fdr")

...and the conclusions are the same as before. Drug D is significantly different from Drugs A and C. However, with FDR, the adjusted *p*-values tend to be a little smaller.
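
To make it easier to compare the two corrections, here is a sketch that puts the unadjusted, Holm and FDR *p*-values side by side, one row per pair of drugs. The names `pair.labels` and `comparison` are made up for this example.

```r
## re-create the data so that this block runs on its own
df <- data.frame(
  time = c(8, 12, 22, 20, 29, 39, 6, 17, 20, 45, 39, 30),
  drug = factor(rep(c("a", "b", "c", "d"), each=3)))

## unadjusted p-values for all pairwise comparisons
raw <- pairwise.t.test(df$time, df$drug, p.adjust.method="none")$p.value
keep <- !is.na(raw)

## label each comparison with the drugs it involves
pair.labels <- outer(rownames(raw), colnames(raw), paste, sep=" vs ")

comparison <- data.frame(
  pair = pair.labels[keep],
  raw = raw[keep],
  holm = p.adjust(raw[keep], method="holm"),
  fdr = p.adjust(raw[keep], method="fdr"))

## sort from the smallest to the largest unadjusted p-value
comparison[order(comparison$raw), ]
```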

# Double BAM!!!

Now let's learn how to do an ANCOVA test in R.

----

# Performing an ANCOVA test

As always, we start by creating our data. In this case, we'll use the data illustrated in Chapter 11, which has recovery times for two drugs, A and B, as well as the Height of each person in the trial.

In [None]:
drug.a.time <- c(6, 9, 12.5, 14)
drug.a.height <- c(4, 7.5, 10, 12.5)

drug.b.time <- c(4, 8, 9, 11)
drug.b.height <- c(5, 11, 12.5, 14)

Now let's package the data we have for the two drugs into a `data.frame`.

In [None]:
## create a factor to keep track of which
## recovery time pairs with which drug.
drug <- factor(c(rep("Drug A", times=4),
                 rep("Drug B", times=4)))

## package everything up in a data.frame
df <- data.frame(
  time = c(drug.a.time, drug.b.time),
  drug = drug,
  height = c(drug.a.height, drug.b.height))

## print out the data.frame
df

Now let's create a plot of the data.

In [None]:
colors <- c("#A5D3FC", #blue 
            "#FF968D") #red

plot(df$height, 
     df$time, 
     # col=df$drug,
     xlim=c(0, 17), 
     ylim=c(0, 15),
     pch=21, # set the shape
     cex=3,  # scale the size of the shape
     bg=colors[df$drug],
     col="black"
    )

Now, if we ignore the Height data, we can do a *t*-test on the two drugs, with respect to the recovery times, using the `t.test()` function.

**NOTE:** the `t.test()` function will accept formulas, just like the `lm()` function, making it easy to specify the *t*-test we want to do with our `data.frame`.

In [None]:
## compare the two drugs
t.test(time ~ drug, data=df)

The resulting *p*-value, 0.3472, is well above 0.05, so, without the Height data, we fail to reject the Null Hypothesis that the two drugs have the same average recovery time.
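
One thing worth knowing is that, by default, `t.test()` does Welch's *t*-test, which does not assume that the two drugs have equal variances. As a quick sketch, if we set `var.equal=TRUE`, we get the classic *t*-test, and its *p*-value matches what `lm()` gives us for the same formula. The names `welch`, `pooled` and `lm.p` are just for this example.

```r
## re-create the data so that this block runs on its own
df <- data.frame(
  time = c(6, 9, 12.5, 14, 4, 8, 9, 11),
  drug = factor(rep(c("Drug A", "Drug B"), each=4)),
  height = c(4, 7.5, 10, 12.5, 5, 11, 12.5, 14))

## the default: Welch's t-test (variances are not assumed to be equal)
welch <- t.test(time ~ drug, data=df)

## the classic t-test that pools the variances
pooled <- t.test(time ~ drug, data=df, var.equal=TRUE)

## the same comparison done with regression
lm.p <- summary(lm(time ~ drug, data=df))$coefficients["drugDrug B", "Pr(>|t|)"]

c(welch=welch$p.value, pooled=pooled$p.value, lm=lm.p)
```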

Now, just for fun, let's see what happens if we ignore the Drug and just use Height to predict recovery time.

In [None]:
model.simple <- lm(time ~ height, data=df)
summary(model.simple)

The *p*-value is less than 0.05, so we can reject the Null Hypothesis that using Height to predict Time is no different from using the average Time. However, this doesn't tell us if one drug is different from the other. So we need to combine both Height and Drug in a single model. This can be done by adding both column names, separated by a `+`, to the **Formula** that we pass to `lm()`.

In [None]:
## NOTE: The p-value at the bottom of this summary only
## compares the fancy model to the overall mean
model.fancy.summary <- summary(lm(time ~ drug + height, data=df))
model.fancy.summary

And when we combine both bits of information, the Drug and Height, we can predict Recovery Time much better than just using the average Recovery Time (*p*-value = 0.0001835). But the Recovery Time predictions were also much better when we only used Height. So now the question is...Is using Drug and Height to predict Recovery Time better than just using Height?

One hint at the answer to this question is to compare the Adjusted R-squared values for both models. When we compare the Adjusted R-squared values, 0.9552 for Drug + Height, compared to 0.4766 for just Height, we see that Adjusted R-squared goes way up, even though we added a variable to the model. The increase in Adjusted R-squared suggests that using both variables, Drug and Height, to predict Recovery Time is much better than just using Height. However, we can also calculate a *p*-value to compare the two models directly with the equation for *F*...

<span style="font-size: 24px;">
$F = \frac{[\textrm{SSR(simpler)} - \textrm{SSR(fancier)}] / (p_\textrm{fancier} - p_\textrm{simpler})}
    {\textrm{SSR(fancier)} / (n - p_\textrm{fancier})}$
</span>

...and then using *F* to calculate the area under the curve of a suitable *F*-distribution.

First, let's calculate the Sum of the Squared Residuals for the simpler model that only uses Height to predict Recovery Time.

In [None]:
## calculate SSR(simpler)
ssr.simpler <- sum(model.simple$residuals^2)

## print out SSR(simpler)
ssr.simpler

Now let's calculate the Sum of the Squared Residuals for the fancier model that uses Height and Drug to predict Recovery Times.

In [None]:
## calculate SSR(fancier)
ssr.fancier <- sum(model.fancy.summary$residuals^2)

## print out SSR(fancier)
ssr.fancier

Now that we have SSR(simpler) and SSR(fancier), we can calculate *F*...

In [None]:
DF1 <- 1 # p_fancier - p_simpler = 3 - 2 = 1
DF2 <- 5 # n - p_fancier = 8 - 3 = 5

f.value <- ((ssr.simpler - ssr.fancier) / DF1) / (ssr.fancier / DF2)
f.value

And now that we have *F*, we can calculate the *p*-value.

In [None]:
pf(f.value, df1=DF1, df2=DF2, lower.tail=FALSE)

And the resulting *p*-value tells us that using Drug and Height to predict Recovery Time is significantly different from just using Height to predict Recovery Time. And the Adjusted R-squared we calculated earlier for the model that uses Drug and Height, 0.9552, tells us that this model is much better.
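
As a cross-check on the calculation we just did by hand, here is a sketch that passes both fitted models to the `anova()` function, which compares nested models with the same *F*-test. The name `model.fancy` is new here (earlier we only saved the summary), and `df1`, `df2` and `f.by.hand` are just for illustration.

```r
## re-create the data so that this block runs on its own
df <- data.frame(
  time = c(6, 9, 12.5, 14, 4, 8, 9, 11),
  drug = factor(rep(c("Drug A", "Drug B"), each=4)),
  height = c(4, 7.5, 10, 12.5, 5, 11, 12.5, 14))

model.simple <- lm(time ~ height, data=df)
model.fancy <- lm(time ~ drug + height, data=df)

## compare the simpler model to the fancier model
model.comparison <- anova(model.simple, model.fancy)
model.comparison

## the degrees of freedom can be pulled out of the models
## instead of being typed in by hand
df1 <- length(coef(model.fancy)) - length(coef(model.simple))  # 3 - 2 = 1
df2 <- df.residual(model.fancy)                                # 8 - 3 = 5

## the same F-value, calculated from the residuals
ssr.simple <- sum(residuals(model.simple)^2)
ssr.fancy <- sum(residuals(model.fancy)^2)
f.by.hand <- ((ssr.simple - ssr.fancy) / df1) / (ssr.fancy / df2)
f.by.hand
```

The second row of the table that `anova()` prints out has the *F*-value and its *p*-value.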

Now let's see how we can calculate the *p*-value using the Histogram method.

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
set.seed(42)

## To generate random datasets, we'll use
## normal distributions. These distributions
## will be based on our observed data, so
## we need to calculate the estimated
## means and standard deviations.
mean.time <- mean(df$time)
sd.time <- sd(df$time)

mean.height <- mean(df$height)
sd.height <- sd(df$height)

## Next, we define the number of random
## datasets we want to create...
num.rand.sets <- 10000

## ...and we define the number of data points
## per dataset
num.datapoints <- length(drug.a.time)

## Create a vector of NAs that is num.rand.sets long
rand.r.squared <- rep(NA, times=num.rand.sets)

## Here is the loop where we create a bunch of random datasets,
## each with num.datapoints values, fit regressions to them, 
## then calculate and store the R-squared values to compare
## the residuals
for(i in 1:num.rand.sets) {

    ## generate random values for each variable

    rand.a.time <- rnorm(n=num.datapoints,
                         mean=mean.time,
                         sd=sd.time)
    rand.a.height <- rnorm(n=num.datapoints,
                           mean=mean.height,
                           sd=sd.height)
    
    rand.b.time <- rnorm(n=num.datapoints,
                         mean=mean.time,
                         sd=sd.time)
    rand.b.height <- rnorm(n=num.datapoints,
                           mean=mean.height,
                           sd=sd.height)
    
    ## bundle the random values together in a data.frame
    data <- data.frame(
        time = c(rand.a.time, rand.b.time),
        drug = drug,
        height = c(rand.a.height, rand.b.height))

    ## now do the simpler regression...
    simpler.lm <- summary(lm(time~height, data=data))
    ## ...and calculate SSR(simpler)
    simpler.ssr <- sum(simpler.lm$residuals^2)
    # print(paste("simpler.ssr:", simpler.ssr))
    
    ## now do the fancier regression...
    fancier.lm <- summary(lm(time~drug + height, data=data))
    ## ...and calculate SSR(fancier)
    fancier.ssr <- sum(fancier.lm$residuals^2)
    # print(paste("fancier.ssr:", fancier.ssr))

    ## calculate and save the R-squared value
    rand.r.squared[i] <- (simpler.ssr - fancier.ssr) / simpler.ssr
}

Now let's draw a histogram of the $R^2$ values with the `hist()` function...

In [None]:
hist(rand.r.squared)

...and calculate the *p*-value as the proportion of "random" $R^2$ values greater than or equal to the one we got for our original data.

In [None]:
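## the R-squared for the original data that compares the
## fancier model to the simpler model, calculated the same
## way as the "random" R-squared values in the loop
orig.r.squared <- (ssr.simpler - ssr.fancier) / ssr.simpler

## the number of randomly generated r.squared >= the original r.squared
num.greater <- sum(rand.r.squared >= orig.r.squared)

## calculate the p-value 
p.value <- num.greater / num.rand.sets

## print out the p-value
p.value

And at last, we see that the *p*-value from the histogram should be close to the one we calculated with `pf()`. Because the *p*-value is so small, only a handful of the 10,000 random datasets can be as extreme as the original data, so the histogram estimate will bounce around a little from one seed to the next.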
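
In case you're wondering why a histogram of $R^2$ values can stand in for an *F*-distribution, here is a small sketch. The quantity stored in the loop, (SSR(simpler) - SSR(fancier)) / SSR(simpler), can be plugged into a rearranged version of the equation for *F*, and *F* only goes up when that quantity goes up. So ranking datasets by one is the same as ranking them by the other. The names `r.squared.change` and `f.from.r.squared` are made up for this example.

```r
## re-create the data so that this block runs on its own
df <- data.frame(
  time = c(6, 9, 12.5, 14, 4, 8, 9, 11),
  drug = factor(rep(c("Drug A", "Drug B"), each=4)),
  height = c(4, 7.5, 10, 12.5, 5, 11, 12.5, 14))

ssr.simpler <- sum(residuals(lm(time ~ height, data=df))^2)
ssr.fancier <- sum(residuals(lm(time ~ drug + height, data=df))^2)

DF1 <- 1 # p_fancier - p_simpler = 3 - 2 = 1
DF2 <- 5 # n - p_fancier = 8 - 3 = 5

## the same quantity that the loop stores for each random dataset,
## but calculated for the original data
r.squared.change <- (ssr.simpler - ssr.fancier) / ssr.simpler
r.squared.change

## F calculated the usual way...
f.value <- ((ssr.simpler - ssr.fancier) / DF1) / (ssr.fancier / DF2)

## ...and F calculated from r.squared.change alone
f.from.r.squared <- (r.squared.change / DF1) / ((1 - r.squared.change) / DF2)

c(f.value=f.value, f.from.r.squared=f.from.r.squared)
```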

# TRIPLE BAM!!!

----