# [The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)
## Chapter 10 - *p*-value Pitfalls and How to Avoid Them!!!!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Observe the Multiple Testing problem first hand.
- Adjust *p*-values to counteract the effects of multiple testing.
- Observe the negative effects of Significance Chasing.

**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)**, and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**, and that you have read Chapter 10 in **[The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)**.

----

# Observing the Multiple Testing Problem

In order to observe the multiple testing problem, we're going to do a lot of *t*-tests where the null hypothesis is true because both sets of measurements come from the same distribution. We'll then save all of the *p*-values and see how many are less than 0.05, and thus, false positives. If things work as expected, we should get about 5% false positives.

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
set.seed(42)

pop.mean <- 0
pop.sd <- 1

## Next, we define the number of random
## datasets we want to create...
num.rand.datasets <- 10000

## ...and we define the number of data points
## per dataset
num.datapoints <- 3

## Create an empty array that is num.rand.datasets long
p.values <- rep(NA, times=num.rand.datasets)

## Here is the loop where we create a bunch of random datasets,
## each with num.datapoints values, do a t-test and 
## then keep track of the corresponding p-value
for(i in 1:num.rand.datasets) {
    
    group.a <- rnorm(n=num.datapoints, mean=pop.mean, sd=pop.sd)
    group.b <- rnorm(n=num.datapoints, mean=pop.mean, sd=pop.sd)

    ## t-test with equal variances = F-test
    results <- t.test(group.a, group.b, var.equal=TRUE)

    p.values[i] <- results$p.value
}

Now let's draw a histogram of the *p*-values with the `hist()` function...

In [None]:
hist(p.values)

The histogram shows that the *p*-values are uniformly distributed between 0 and 1. This makes sense because, for whatever threshold, $x$, we want to use, we should get about $x \times 100$% false positives. For example, if the threshold is 0.05, like it is here, then we should get 5% false positives and thus, 5% of the *p*-values should be < 0.05. If, on the other hand, the threshold is 0.1, then we should get 10% false positives, and thus, 10% of the *p*-values should be < 0.10.
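To make the link between a uniform distribution and the false positive rate a little more concrete, here is a minimal sketch that checks several thresholds at once. It runs its own, smaller simulation, and the names `sketch.p.values`, `thresholds` and `fraction.below` are made up for this illustration, so it does not overwrite `p.values` from the cell above.

```r
## A smaller, separate simulation (2,000 t-tests) so that
## we don't overwrite p.values from the cell above
set.seed(1)

sketch.p.values <- replicate(2000, {
    a <- rnorm(n=3, mean=0, sd=1)
    b <- rnorm(n=3, mean=0, sd=1)
    t.test(a, b, var.equal=TRUE)$p.value
})

## For each threshold, calculate the fraction of
## p-values that are less than it
thresholds <- c(0.05, 0.10, 0.25, 0.50)
fraction.below <- sapply(thresholds, function(x) mean(sketch.p.values < x))

## If the p-values are uniformly distributed, each
## fraction should be close to its threshold
data.frame(threshold=thresholds, fraction.below=fraction.below)
```

If the fractions in the second column are close to the thresholds in the first column, then the *p*-values are behaving like a uniform distribution.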

Now let's calculate the number of false positives, the number of *p*-values < 0.05...

In [None]:
num.false.positives <- sum(p.values < 0.05)

num.false.positives

...and we see we got 501 false positives.

Now calculate the percentage of false positives, the percentage of *p*-values < 0.05...

In [None]:
num.false.positives / num.rand.datasets

...and we see that about 5% of the *p*-values resulted in false positives. So, things worked as expected and we got a bunch of false positives from multiple testing, even though we know that the Null Hypothesis is true.

# Bummer!

Now let's see what happens if we adjust those *p*-values to compensate for multiple testing.

----

# Adjusting *p*-values to compensate for multiple testing

Since we created a bunch of *p*-values when we know that the Null Hypothesis is true, it should be interesting to see how many of the 501 that were < 0.05 remain < 0.05 after adjusting them for multiple testing. In this example, we'll start by using the Holm correction...

In [None]:
## First, adjust the p-values with the Holm correction
adjusted.p.values.holm <- p.adjust(p=p.values, method="holm")

## print out the first few adjusted p-values
head(adjusted.p.values.holm)

...then we'll calculate the number of false positives...

In [None]:
## Now determine how many adjusted p-values are false positives
num.false.positives.holm <- sum(adjusted.p.values.holm < 0.05)

num.false.positives.holm

...and we see that, after adjusting the *p*-values with the Holm correction, there are no longer any false positives.
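To get a feel for why the Holm correction is so strict here, it can help to calculate it by hand for a handful of *p*-values. The sketch below uses five made-up *p*-values (`example.p` is not from the simulation): it sorts them, multiplies the smallest by the number of tests, the next smallest by one fewer, and so on, and then makes sure the adjusted values never decrease and never exceed 1. The result is compared to `p.adjust()` as a check.

```r
## Five made-up p-values, just for illustration
example.p <- c(0.01, 0.04, 0.03, 0.005, 0.2)
m <- length(example.p)

## Sort the p-values from smallest to largest
ord <- order(example.p)
sorted.p <- example.p[ord]

## The smallest p-value is multiplied by m, the next
## by m-1, and so on, down to 1 for the largest
multipliers <- m - seq_len(m) + 1

## cummax() keeps the adjusted values from decreasing
## and pmin() caps them at 1
holm.sorted <- pmin(1, cummax(multipliers * sorted.p))

## Put the adjusted p-values back in the original order
holm.by.hand <- holm.sorted[order(ord)]
holm.by.hand

## Compare to R's built-in adjustment
all.equal(holm.by.hand, p.adjust(example.p, method="holm"))
```

With 10,000 tests, the smallest *p*-value gets multiplied by 10,000, so it would have to be less than 0.000005 to remain < 0.05 after the adjustment.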

Now let's see if the same thing happens when we use the False Discovery Rate, FDR, to adjust the *p*-values.

In [None]:
## First, adjust the p-values with the FDR correction
adjusted.p.values.fdr <- p.adjust(p=p.values, method="fdr")

## print out the first few adjusted p-values
head(adjusted.p.values.fdr)

In [None]:
## Now determine how many adjusted p-values are false positives
num.false.positives.fdr <- sum(adjusted.p.values.fdr < 0.05)

num.false.positives.fdr
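For comparison with the Holm correction, here is a minimal sketch of the FDR adjustment done by hand. In `p.adjust()`, `method="fdr"` is an alias for the Benjamini-Hochberg method, `"BH"`. The five *p*-values in `example.p` are made up for illustration and are not from the simulation. Each *p*-value is multiplied by the number of tests divided by its rank, and then the adjusted values are kept from increasing as we move from the largest *p*-value down to the smallest.

```r
## Five made-up p-values, just for illustration
example.p <- c(0.01, 0.04, 0.03, 0.005, 0.2)
m <- length(example.p)

## Sort the p-values from largest to smallest
ord <- order(example.p, decreasing=TRUE)
sorted.p <- example.p[ord]

## The largest p-value has rank m, the smallest has rank 1
ranks <- m:1

## Multiply each p-value by m / rank, then use cummin()
## so that a smaller p-value never ends up with a larger
## adjusted value than a bigger one, and cap at 1
fdr.sorted <- pmin(1, cummin(m / ranks * sorted.p))

## Put the adjusted p-values back in the original order
fdr.by.hand <- fdr.sorted[order(ord)]
fdr.by.hand

## Compare to R's built-in adjustment
all.equal(fdr.by.hand, p.adjust(example.p, method="fdr"))
```

Because only the smallest *p*-value is multiplied by the full number of tests, the FDR adjustment is generally less strict than the Holm correction.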

So, either way we adjust the *p*-values, we eliminate all the false positives.

# Double BAM!!

Now let's observe the negative effects of significance chasing.

----

# Observing the negative effects of Significance Chasing

Just like we did earlier when we wanted to observe the multiple testing problem, here we're going to do a lot of *t*-tests where the null hypothesis is true because both sets of measurements come from the same distribution. However, this time we'll select tests where the *p*-value is close to 0.05, but still greater than it. For those tests, we'll then add one additional measurement per group, redo the *t*-test, and save the new *p*-value with all the others. If adding the new data to the original tests that have *p*-values close to 0.05 does not result in additional false positives, we should have about 5% false positives in the end. If we have more than 5% false positives, then we will have seen the effects of significance chasing: an increase in false positives caused by adding data to tests that look promising.

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
set.seed(42)

pop.mean <- 0
pop.sd <- 1

## Next, we define the number of random
## datasets we want to create...
num.rand.datasets <- 10000

## ...and we define the number of data points
## per dataset
num.datapoints <- 3

## Create an empty array that is num.rand.datasets long
p.values <- rep(NA, times=num.rand.datasets)

## Here is the loop where we create a bunch of random datasets,
## each with num.datapoints values, do a t-test and 
## then keep track of the corresponding p-value
for(i in 1:num.rand.datasets) {
    
    group.a <- rnorm(n=num.datapoints, mean=pop.mean, sd=pop.sd)
    group.b <- rnorm(n=num.datapoints, mean=pop.mean, sd=pop.sd)

    ## t-test with equal variances = F-test
    results <- t.test(group.a, group.b, var.equal=TRUE)

    ## if the p-value is close to, but still greater than, 0.05,
    ## add one more measurement to each group and redo the t-test
    if (results$p.value > 0.05 && results$p.value < 0.08) {
        extra.a <- rnorm(n=1, mean=pop.mean, sd=pop.sd)
        extra.b <- rnorm(n=1, mean=pop.mean, sd=pop.sd)

        group.a <- c(group.a, extra.a)
        group.b <- c(group.b, extra.b)

        results <- t.test(group.a, group.b, var.equal=TRUE)
    }

    p.values[i] <- results$p.value
}

Now let's draw a histogram of the *p*-values.

In [None]:
hist(p.values)

In the histogram, we can see that the second column has a lot fewer *p*-values in it than any of the other columns. This is because the *t*-tests associated with the *p*-values in that bin were re-done with additional data. As a result, we have more than 500 *p*-values in the surrounding columns, including the first column, which suggests that we could have more than 500 false positives. So let's count the number of false positives...

In [None]:
num.false.positives <- sum(p.values < 0.05)

num.false.positives

...and calculate the percentage...

In [None]:
num.false.positives / num.rand.datasets

...and we see that adding data to promising *t*-tests increased the number of false positives. In the original illustration of the multiple testing problem, we both expected and got about 5.0% false positives. Here, we ended up with 5.6% false positives, or an additional 61 false positives.

So, the moral of the story is, if a test looks promising, don't just add additional data to the existing measurements. Instead, do a proper power analysis and start over, collecting new data.
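As a rough sketch of what a power analysis might look like in R, the built-in `power.t.test()` function can estimate how many measurements per group we need before we collect any data. The difference between the means (`delta=1`), the standard deviation (`sd=1`) and the desired power (`power=0.8`) used below are made-up values for illustration only. In practice, they would come from prior data or from knowledge about what we are measuring.

```r
## Estimate the number of measurements per group needed
## for a two-sample t-test.
## NOTE: delta, sd and power are assumptions made up
## for this example
power.results <- power.t.test(delta=1,
                              sd=1,
                              sig.level=0.05,
                              power=0.8,
                              type="two.sample",
                              alternative="two.sided")

power.results

## Round up, since we can't collect a fraction
## of a measurement
n.per.group <- ceiling(power.results$n)

n.per.group
```

The key point is that the sample size is decided before the new data are collected, rather than by adding measurements until a *p*-value drops below 0.05.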

# TRIPLE BUMMER

----