# [The StatQuest Illustrated Guide to Statistics]()
## Chapter 09 - Determining How Much Data to Collect with Power Analyses!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn...

- How sample size influences our confidence in the accuracy of an Estimated Mean.
- How to do a Power Analysis for a *t*-test.

**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)**, and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**, and have read Chapter 9 in **[The StatQuest Illustrated Guide to Statistics]()**.

----

# The effects of sample size on our confidence in the accuracy of an Estimated Mean

Here we will take samples from two distributions, the normal distribution and the exponential distribution, and show how the estimated means calculated from those samples tend to get better when the sample size is larger. In other words, our confidence in the accuracy of an estimated mean improves with larger sample size.

We'll start with showing how this works with the normal distribution.

# Normal Distribution

First, let's just draw a normal distribution with mean = 0 and standard deviation = 1.

In [None]:
curve(dnorm(x=x, mean=0, sd=1), 
      from=-5, 
      to=5, 
      n=101, 
      col="grey", 
      lwd=5)

Now let's add the estimated means, calculated with different sample sizes, to the graph. We'll start with *n* = 1.

## *n* = 1

First, calculate the means...

In [None]:
## First, make sure our results are reproducible
## by calling set.seed()
set.seed(42)

num.rand.datasets <- 20 # Number of datasets to collect
num.datapoints <- 1 # Number of points in each dataset

## means is an array of estimated mean values calculated
## from each dataset.
means <- rep(NA, times=num.rand.datasets)

## y.axis is just a bunch of 0s, one per estimated mean
## that we calculate. This is just to put the estimated
## means at the bottom of the graph.
y.axis <- rep(0, times=num.rand.datasets)

## Now create num.rand.datasets datasets
## and calculate an estimated mean with each one
for(i in 1:num.rand.datasets) {
    sample <- rnorm(n=num.datapoints, mean=0, sd=1)
    
    means[i] <- mean(sample)
}

...now add them to our graph of the normal curve.

In [None]:
## Now plot the normal distribution that the datasets
## were sampled from...
curve(dnorm(x=x, mean=0, sd=1), 
      from=-5, 
      to=5, 
      n=101, 
      col="grey", 
      lwd=5)

## ...and the estimated means.
## NOTE: The color for the estimated means is in 
## hexadecimal format where the first two digits
## represent the value for Red, the second two
## digits represent the value for Green, the third
## two digits represent the value for Blue, and the
## last two digits represent the "alpha", or how opaque
## the color should be. In this case, we're setting the
## alpha=88 so that the color is semi-transparent. This means
## that we'll be able to see overlapping means since the overlap
## will be darker than the non-overlap parts.
points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)
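
If you want to check what a hexadecimal color like `#225ea888` encodes, one option is to decode it with the `col2rgb()` function. This is just a quick sketch and not part of the chapter. Each value it returns is between 0 and 255.

```r
## col2rgb() splits a color into its separate channels.
## Setting alpha=TRUE also returns the opacity.
channels <- col2rgb("#225ea888", alpha=TRUE)
channels
```

The rows of the result are labeled `red`, `green`, `blue` and `alpha`, in that order.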

Bam! Here we see that when *n* = 1, the estimated means are spread out between **-3** and **2.5**. Now let's see what happens when we increase the sample size to *n* = 2.

## *n* = 2

**NOTE:** Increasing the sample size to **2** means setting `num.datapoints` to **2**. Everything else is the same as for *n* = 1. First, we calculate the means...

In [None]:
set.seed(42)

num.rand.datasets <- 20
num.datapoints <- 2

means <- rep(NA, times=num.rand.datasets)
y.axis <- rep(0, times=num.rand.datasets)

for(i in 1:num.rand.datasets) {
    sample <- rnorm(n=num.datapoints, mean=0, sd=1)
    
    means[i] <- mean(sample)
}

...then we add them to a graph of the normal curve.

In [None]:
## First plot the normal distribution that the datasets
## were sampled from...
curve(dnorm(x=x, mean=0, sd=1), 
      from=-5, 
      to=5, 
      n=101, 
      col="grey", 
      lwd=5)

## ...and the estimated means.
points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)

Bam! Now, when *n* = 2, the estimated means span a range of values from about -1.75 to just under 2. This is a narrower range of values than before. As a result, we should have more confidence in the accuracy of estimated means when *n* = 2 compared to when *n* = 1.

Now let's increase the sample size to *n* = 5.

## *n* = 5

In [None]:
set.seed(42)

num.rand.datasets <- 20
num.datapoints <- 5

means <- rep(NA, times=num.rand.datasets)
y.axis <- rep(0, times=num.rand.datasets)

for(i in 1:num.rand.datasets) {
    sample <- rnorm(n=num.datapoints, mean=0, sd=1)
    
    means[i] <- mean(sample)
}

In [None]:
curve(dnorm(x=x, mean=0, sd=1), 
      from=-5, 
      to=5, 
      n=101, 
      col="grey", 
      lwd=5)

points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)

Bam! Setting the sample size to *n* = 5 resulted in the estimated means forming a tighter cluster around the population mean, 0. Now let's see what happens when we increase the sample size one last time to *n* = 10.

## *n* = 10

In [None]:
set.seed(42)

num.rand.datasets <- 20
num.datapoints <- 10

means <- rep(NA, times=num.rand.datasets)
y.axis <- rep(0, times=num.rand.datasets)

for(i in 1:num.rand.datasets) {
    sample <- rnorm(n=num.datapoints, mean=0, sd=1)
    
    means[i] <- mean(sample)
}

In [None]:
curve(dnorm(x=x, mean=0, sd=1), 
      from=-5, 
      to=5, 
      n=101, 
      col="grey", 
      lwd=5)

points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)

Bam! Now we see that each time we increase the sample size, the estimated means cluster closer and closer to the true population mean. Thus, the larger the sample size, the more confidence we can have that any particular estimated mean is close to the population mean.
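
The four examples above repeat the same loop with a different value for `num.datapoints`. As a rough sketch, we could wrap that loop in a function and print the range of the estimated means for each sample size. **NOTE:** `get.estimated.means()` is a made-up helper, not something from the chapter, and because `set.seed()` is only called once here, the exact numbers will differ from the graphs above.

```r
## Hypothetical helper: create num.rand.datasets datasets, each with
## num.datapoints values sampled from a normal distribution, and
## return the estimated mean calculated from each dataset.
get.estimated.means <- function(num.rand.datasets, num.datapoints) {
    replicate(num.rand.datasets,
              mean(rnorm(n=num.datapoints, mean=0, sd=1)))
}

set.seed(42)

## For each sample size, print the smallest and largest estimated mean
for(num.datapoints in c(1, 2, 5, 10)) {
    means <- get.estimated.means(num.rand.datasets=20,
                                 num.datapoints=num.datapoints)
    cat("n =", num.datapoints,
        " range of estimated means:", round(range(means), 2), "\n")
}
```

With only **20** datasets per sample size, the ranges can bounce around a bit, so increasing `num.rand.datasets` should make the trend easier to see.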

Now that we've seen how increasing the sample size results in the estimated mean being closer to the population mean for the **Normal Distribution**, let's see what happens when we use an **Exponential Distribution**.

----

# Exponential Distribution

First, let's just draw an exponential distribution with mean = 2.

**NOTE:** The exponential distribution has one parameter that we need to set called the **rate**, which is also notated with the Greek character $\lambda$. The **rate** is defined as 1/mean. So if the mean = 2, then the **rate** = 1/mean = 0.5.
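
As a quick sanity check of the relationship between the **rate** and the mean, we can generate a lot of random values with `rexp()` and see where their average lands. This is only a sketch, and the number of values, 100000, is an arbitrary choice.

```r
set.seed(42)

exp.mean <- 2
rate <- 1 / exp.mean # rate = 0.5

## With a large number of random values, the average
## should land close to exp.mean
simulated.mean <- mean(rexp(n=100000, rate=rate))
simulated.mean
```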

**ALSO NOTE:** Since the exponential distribution is not symmetrical, we'll draw a vertical line at the mean value so it is easy to identify. We can do this with the `abline()` function. The **ab** part of the `abline()` function's name comes from its first two arguments, `a` and `b`, which are the intercept and the slope of the line to draw. In other words, we can use the `abline()` function to draw any line of the form `y = a + b*x`. However, it's also used to draw horizontal and vertical lines. To draw a horizontal line, you only need to specify its y-axis value with `h`, and to draw a vertical line, you only need to specify its x-axis location with `v`. Since we want to draw a vertical line, we do this by specifying a value for `v`.
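
Here is a small sketch of the different ways to call `abline()`. The specific values for the intercept, slope and line positions are made up for illustration.

```r
## Start with an empty plot to draw lines on
plot(x=0, y=0, type="n",
     xlim=c(0, 10), ylim=c(0, 10),
     xlab="x", ylab="y")

intercept <- 1
slope <- 0.5

## A sloped line: y = intercept + slope * x
abline(a=intercept, b=slope, col="grey", lwd=5)

## A horizontal line at y = 4
abline(h=4, col="#225ea888", lwd=5)

## A vertical line at x = 2
abline(v=2, col="#225ea888", lwd=5)
```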

In [None]:
exp.mean <- 2

In [None]:
curve(dexp(x=x, rate=1/exp.mean), 
      from=0, 
      to=10, 
      n=101, 
      col="grey", 
      lwd=5)

## NOTE: Since the exponential distribution is not symmetrical,
## we'll draw a vertical line at the mean value so it is easy
## to identify. 
abline(v=exp.mean, col="grey", lwd=5)

Bam! Now we know how to draw an exponential distribution and a vertical line at the mean value. Now let's see how the estimated means are distributed when the sample size = **1**.

## *n* = 1 

In [None]:
set.seed(42)

num.rand.datasets <- 20
num.datapoints <- 1

means <- rep(NA, times=num.rand.datasets)
y.axis <- rep(0, times=num.rand.datasets)

for(i in 1:num.rand.datasets) {
    sample <- rexp(n=num.datapoints, rate=1/exp.mean)
    
    means[i] <- mean(sample)
}

In [None]:
curve(dexp(x=x, rate=1/exp.mean), 
      from=0, 
      to=10, 
      n=101, 
      col="grey", 
      lwd=5)

## Since the exponential distribution is not symmetrical, we'll
## draw a vertical line at the mean value.
abline(v=exp.mean, col="grey", lwd=5)

## Now plot the means
points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)

Now that we can see the range of values that estimated means calculated with *n* = 1 measurement have, let's compare them to estimated means calculated with *n* = 2 measurements.

## *n* = 2

In [None]:
set.seed(42)

num.rand.datasets <- 20
num.datapoints <- 2

means <- rep(NA, times=num.rand.datasets)
y.axis <- rep(0, times=num.rand.datasets)

for(i in 1:num.rand.datasets) {
    sample <- rexp(n=num.datapoints, rate=1/exp.mean)
    
    means[i] <- mean(sample)
}

In [None]:
curve(dexp(x=x, rate=1/exp.mean), 
      from=0, 
      to=10, 
      n=101, 
      col="grey", 
      lwd=5)

## Since the exponential distribution is not symmetrical, we'll
## draw a vertical line at the mean value.
abline(v=exp.mean, col="grey", lwd=5)

## Now plot the means
points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)

...and the results are not exactly what we were hoping for. When *n* = 2 we see a wider range of values for the estimated means. However, keep in mind that we're only looking at **20** different estimated means in each case, and it could be (and, in theory, will be) different if we looked at way more estimated means.

We could test that theory by increasing the value for `num.rand.datasets`, but I'll leave that as an exercise for the reader. For now, let's see what happens when we increase the sample size to *n* = 5.

## *n* = 5

In [None]:
set.seed(42)

num.rand.datasets <- 20
num.datapoints <- 5

means <- rep(NA, times=num.rand.datasets)
y.axis <- rep(0, times=num.rand.datasets)

for(i in 1:num.rand.datasets) {
    sample <- rexp(n=num.datapoints, rate=1/exp.mean)
    
    means[i] <- mean(sample)
}

In [None]:
curve(dexp(x=x, rate=1/exp.mean), 
      from=0, 
      to=10, 
      n=101, 
      col="grey", 
      lwd=5)

## Since the exponential distribution is not symmetrical, we'll
## draw a vertical line at the mean value.
abline(v=exp.mean, col="grey", lwd=5)

## Now plot the means
points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)

Now, with the sample size set to *n* = 5, we start to see that the means have a slightly smaller range of values compared to when *n* = 1 and when *n* = 2. Now let's increase the sample size to *n* = 10.

## *n* = 10

In [None]:
set.seed(42)

num.rand.datasets <- 20
num.datapoints <- 10

means <- rep(NA, times=num.rand.datasets)
y.axis <- rep(0, times=num.rand.datasets)

for(i in 1:num.rand.datasets) {
    sample <- rexp(n=num.datapoints, rate=1/exp.mean)
    
    means[i] <- mean(sample)
}

In [None]:
curve(dexp(x=x, rate=1/exp.mean), 
      from=0, 
      to=10, 
      n=101, 
      col="grey", 
      lwd=5)

## Since the exponential distribution is not symmetrical, we'll
## draw a vertical line at the mean value.
abline(v=exp.mean, col="grey", lwd=5)

## Now plot the means
points(x=means, y=y.axis, col="#225ea888", cex=3, lwd=5)

And here is where we really see the estimated means cluster more tightly around the population mean, **2**. As a result, when *n*=10, we can have more confidence that any individual estimated mean will be closer to the population mean.
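
Earlier, increasing the value for `num.rand.datasets` was left as an exercise. One possible way to try it, as a sketch, is to calculate **10000** estimated means for each sample size and compare how spread out they are using their standard deviation. Both the number of datasets and the choice of the standard deviation as the measure of spread are assumptions made for this example.

```r
set.seed(42)

exp.mean <- 2
num.rand.datasets <- 10000 # way more than the 20 we used above
sample.sizes <- c(1, 2, 5, 10)

## For each sample size, calculate num.rand.datasets estimated means
## and then calculate the standard deviation of those means
spread <- sapply(sample.sizes, function(num.datapoints) {
    means <- replicate(num.rand.datasets,
                       mean(rexp(n=num.datapoints, rate=1/exp.mean)))
    sd(means)
})

names(spread) <- paste0("n=", sample.sizes)
round(spread, 2)
```

With this many estimated means, the spread should shrink each time the sample size increases, including going from *n* = 1 to *n* = 2.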

# BAM!

Now let's learn how to do a Power Analysis for a *t*-test.

----

# How to do a Power Analysis for a *t*-test

For pretty much every experiment, it's a good idea to determine if the sample size will be large enough to reject the null hypothesis if the null hypothesis is, indeed, false. In other words, you should always do a **Power Analysis**. The good news is that this is super easy to do in **R** with built-in functions.

For example, to do a **Power Analysis** for a *t*-test in **R**, we can use the `power.t.test()` function.

We'll start by using relatively standard parameter values. We'll set the probability that we will correctly reject the null hypothesis, if it is indeed false, to 0.8 with `power=0.8`, and the threshold for significance to 0.05 with `sig.level=0.05`.

Now, the last two things we need to specify are our estimates of the difference between the population means and the standard deviation, using the same value for both populations. For this example, let's assume the difference in the population means is 1, so we'll set that with `delta=1`, and we'll assume the standard deviation is also 1, and we'll set that with `sd=1`.

So, with all those parameters set, we have the following command that we can run:

In [None]:
power.t.test(power=0.80,
             sig.level=0.05, 
             delta=1, # difference in means
             sd=1)

The top of the output, where it says `n = 16.71477`, tells us that we need exactly **16.71477** observations in each group. Why this value isn't rounded to the nearest integer is a complete mystery to me. I'm not sure how it would be remotely possible to gather **16.71477** measurements. But, whatever, no one asked me.
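
If we want a whole number of observations, one option, sketched below, is to save the output from `power.t.test()`, pull out `n` and round it up with `ceiling()`. The variable names `result` and `n.per.group` are just made up for this example.

```r
result <- power.t.test(power=0.80,
                       sig.level=0.05,
                       delta=1, # difference in means
                       sd=1)

## Round up, since rounding down would leave us with
## slightly less power than we asked for
n.per.group <- ceiling(result$n)
n.per.group

## Now check how much power we actually get with
## that many observations in each group
achieved.power <- power.t.test(n=n.per.group,
                               sig.level=0.05,
                               delta=1,
                               sd=1)$power
achieved.power
```

Because we rounded up, the power we get should be a little higher than the 0.8 we asked for.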

Anyway, now let's compare that value to what we get when our estimate of the standard deviation is smaller, **0.5**.

In [None]:
power.t.test(power=0.80,
             sig.level=0.05, 
             delta=1, # difference in means
             sd=0.5)

So, when the estimated standard deviation is halved, the number of measurements required to get the power we want (80% probability that we will correctly reject the null hypothesis if it is false) drops by over two thirds. Instead of gathering **17** measurements per group, now we just need about **5**.

# Double BAM!!

----