# [The StatQuest Illustrated Guide to Statistics]()
## Chapter 01 - Fundamental Concepts in Statistics!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Load data, the apples for sale at every single **Spend-n-Save** store, from a file.
- Calculate the **Population Mean** and **Population Standard Deviation** from **Spend-n-Save** data.
- Randomly select a subset of the data in the file and use it to calculate the **Estimated Mean** and the **Estimated Standard Deviation**.
- Compare the **Population Mean** to the **Estimated Mean** and the **Population Standard Deviation** to the **Estimated Standard Deviation**.
- Compare the **Estimated Standard Deviation** calculated by dividing by $n-1$ to the value we get when we divide by $n$.

**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)**, and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**, and that you have read the first chapter in **[The StatQuest Illustrated Guide to Statistics]()**.

----

# Load data from a file

Just like **'Squatch** does in the book, we're going to calculate the **Population Mean** and **Population Standard Deviation** for the number of apples for sale at every single **Spend-n-Save**. So, the first thing we need to do is load in a file with all of the data.

In **R**, we can use the `read.delim()` function to read in a text file that has columns of data separated by some "delimiter". In this case, the delimiter is the `tab` character, meaning the text file has columns of data separated by a `tab` character. So we pass `read.delim()` the file we want to read in, `spend_n_save.txt`, and the delimiter, `\t`, which is how we specify the `tab` character. Just between you and me, I think it's funny that the name for the delimiter argument is `sep` and not `delim`, which would be consistent with the name of the function.

Anyway, `read.delim()` returns a `data.frame` containing all the data in the file and we'll save that data in a new variable called `spend.n.save.df`, where the `df` is short for `data.frame` and will remind us what type of data we have. **NOTE:** If you are not already familiar with what a `data.frame` is, think of it as something similar to a spreadsheet.

Once we have the data stored in `spend.n.save.df`, we can verify that `read.delim()` was successful by printing out the first few rows of the data with the `head()` function and then printing out the total number of rows with `nrow()`.

In [None]:
## First, use read.delim() to read the data in "spend_n_save.txt"
spend.n.save.df <- read.delim(file="https://raw.githubusercontent.com/StatQuest/sigs/refs/heads/main/chapter_01/spend_n_save.txt", sep="\t")

## Verify that read.delim() was successful by printing out the
## first few rows with the head() function
print("The first few rows of data in spend.n.save.df")
head(spend.n.save.df)

## Print out the number of rows in spend.n.save.df
print(paste("Number of rows in spend.n.save.df:", nrow(spend.n.save.df)))

Hooray!!! It looks like `read.delim()` worked as expected. `spend.n.save.df` has two columns, the first column has the store ID number, and the second column has the number of apples for sale at that store. Lastly, `nrow()` returned **5123**, so we have data for all **5,123** **Spend-n-Save** stores. This means we can use the data to calculate the **Population Mean** and the **Population Standard Deviation**.
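
If you want to see what `read.delim()` does without downloading anything, here is a minimal sketch that parses a tiny, made-up tab-delimited string using the `text` argument, which `read.delim()` passes along to `read.table()`. **NOTE:** The column name `store.id` and the three rows of data are assumptions made up for this illustration. Only `num.apples` is a column name used in this notebook.

```r
## A tiny, made-up tab-delimited "file" stored in a string.
## NOTE: store.id and these three rows are hypothetical.
toy.text <- "store.id\tnum.apples\n1\t20\n2\t15\n3\t25"

## By default, read.delim() treats the first line as column names
toy.df <- read.delim(text=toy.text, sep="\t")

## Print the first few rows and the number of rows
head(toy.df)
nrow(toy.df)
```

Just like with the real file, `head()` shows the columns and `nrow()` counts the rows, which here should be the three made-up stores.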

----

# Calculate the Population Mean and Population Standard Deviation

Now that we have the data, we can calculate the **Population Mean** of the number of apples for sale at each store with the `mean()` function. **NOTE:** The equation for the **Population Mean** is the same as the equation for the **Estimated Mean**. The only difference is that for the **Population Mean** we use data from the entire population and for the **Estimated Mean** we only use a subset of the data.
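
To make that **NOTE** concrete, here is a small sketch with made-up numbers. The vector `toy.population` and its values are assumptions for illustration only, and are not from the **Spend-n-Save** file. It shows that `mean()` is just the sum of the values divided by the number of values, and that the same calculation works for a whole population or for a subset.

```r
## Made-up apple counts for a hypothetical population of 6 stores
toy.population <- c(18, 22, 23, 25, 15, 17)

## The mean is the sum of the values divided by the number of values
sum(toy.population) / length(toy.population)  # by hand
mean(toy.population)                          # same thing with mean()

## The same calculation applied to a subset (the first 3 values)
## gives an Estimated Mean instead of the Population Mean
toy.subset <- toy.population[1:3]
mean(toy.subset)
```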

First, let's demonstrate how we can access the number of apples for sale at each store from our `data.frame`. We do this by adding `$num.apples` to the variable name, like this `spend.n.save.df$num.apples`. In general, we can access the values in any column in a `data.frame` by adding `$` and the column name to the variable name. For example, to print out the first few values in the `num.apples` column we would use...

In [None]:
head(spend.n.save.df$num.apples)

Now that we know how to access just the values in the `num.apples` column, we can calculate the **Population Mean** and store it in a variable called `pop.mean`.

In [None]:
## Calculate the mean of the number of apples for sale
## because we are using the data from every single store
## we are calculating the Population Mean
## Anyway, we'll save the value in a variable called pop.mean
pop.mean <- mean(spend.n.save.df$num.apples)

## Now print out the value in pop.mean
print(paste("population mean:", pop.mean))

So, **19.9230919383174** is the **Population Mean**. However, that's a mouthful. We can make it easier to talk about by rounding to the nearest 10th with the `round()` function.

In [None]:
## Round the mean to the nearest 10th so it's easier
## to read, write, and communicate in general.
round(pop.mean, digits=1)

Now that we have calculated the **Population Mean**, we can calculate the **Population Standard Deviation**. 

**NOTE:** It is so rare that we have all of the data to calculate the **Population Standard Deviation** that **R** doesn't have a function that can do it for us. So, instead of using a built-in function, we'll just use the formula from the book...

<span style="font-size: 24px;">
$\sqrt{\frac{\sum(x - \mu)^2}{N}}$
</span>

...where `x` is each individual value, $\mu$ is the **Population Mean**, and `N` is the number of values in the entire population, which we can set to the number of rows in `spend.n.save.df`.

In [None]:
## calculate the population standard deviation and save it in 
## a variable called pop.sd

## First, let's just calculate the population variance...
numerator <- sum((spend.n.save.df$num.apples - pop.mean)^2) # = sigma(x - mu)^2
denominator <- nrow(spend.n.save.df) # = N

pop.var <- numerator / denominator

## print out the population variance
print(paste("population variance:", pop.var))

## now we can calculate the population standard deviation
## by taking the square root of the population variance.
pop.sd <- sqrt(pop.var)

## print out the population standard deviation
print(paste("population sd:", pop.sd))

Again, this result is a mouthful, so we can make it easier to communicate by rounding it to the nearest 10th.

In [None]:
round(pop.sd, digits=1) # digits is the number of digits past the decimal point

# BAM!
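
Since **R** doesn't have a built-in function for the **Population Standard Deviation**, one option is to wrap the formula in a small helper function. This is just a sketch. The function name `population.sd` and the toy values are made up for illustration.

```r
## A hypothetical helper that applies the formula
## sqrt(sum((x - mu)^2) / N) to a vector of values
population.sd <- function(x) {
    mu <- mean(x)                      # the Population Mean
    sqrt(sum((x - mu)^2) / length(x))  # divide by N, not N-1
}

## Made-up values whose Population Mean is 5
toy.values <- c(2, 4, 4, 4, 5, 5, 7, 9)
population.sd(toy.values)
```

With the real data loaded, `population.sd(spend.n.save.df$num.apples)` should give the same value that we stored in `pop.sd`.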

----

# Calculate the Estimated Mean and the Estimated Standard Deviation with a randomly selected subset of the data

Now that we know how to calculate the **Population Mean** and the **Population Standard Deviation**, let's learn how to calculate an **Estimated Mean** and an **Estimated Standard Deviation**.

First, let's use the `sample()` function to create a subset of the population data by randomly selecting a few values from `spend.n.save.df`. In this case, the `sample()` function has **3** parameters we need to set. First, `x` is set to the data we want to sample from. In this case, we want to sample from the numbers of apples for sale at each store, so we set `x=spend.n.save.df$num.apples`. Then, since we want to randomly select **5** of the values, we set `size=5`. Lastly, we set `replace=FALSE`. This means that, if we start with **5,123** values, then after we randomly select the first value for our sample, the second value will be selected from the remaining **5,122** values, or one fewer than we started with. In other words, whatever we selected for the first value is removed from the pool of potential values to select from. Likewise, the third value will be selected from only **5,121** values, and so on.
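
Before we sample from the real data, here is a small sketch of what `replace=FALSE` does. The vector `toy.pool` and its values are made up for illustration. Because each selected value is removed from the pool, no position in the pool can be selected twice, and asking for every value in the pool simply shuffles it.

```r
## Made-up pool of 6 different values to sample from
toy.pool <- c(11, 12, 13, 14, 15, 16)

## replace=FALSE: each selected value is removed from the pool
toy.sample <- sample(x=toy.pool, size=5, replace=FALSE)
toy.sample

## anyDuplicated() returns 0 when no value appears more than once
anyDuplicated(toy.sample)

## Selecting all 6 values just shuffles the pool, so sorting
## the result gives us back the original values
sort(sample(x=toy.pool, size=6, replace=FALSE))
```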

In [None]:
## NOTE: set.seed() allows us to create a "random" sample that
## is reproducible. This is because "random" numbers
## are created from a starting number. Usually this starting
## number is different every single time you create a random
## number. However, you can set it to a specific value, like
## 42, and the "random" numbers will be the same each time.
set.seed(42)

## Now select a random sample from the full dataset
rand.sample <- sample(x=spend.n.save.df$num.apples, size=5, replace=FALSE)

## Now print out the random sample so we can see what it looks like
rand.sample

Now that we have our sample from 5 randomly selected **Spend-n-Save** stores, we can calculate the estimated mean with the `mean()` function:

In [None]:
## Now calculate the estimated mean
estimated.mean <- mean(rand.sample)
estimated.mean

In this case, the **Estimated Mean**, **21.2**, is a little larger than the **Population Mean**, **19.9**.

Now let's calculate the estimated standard deviation with the `sd()` function. Unlike the population standard deviation, the estimated standard deviation is calculated much, much more often, so **R** has a function for doing it.

In [None]:
## Now calculate the estimated standard deviation
estimated.sd <- sd(rand.sample)
estimated.sd

Again, since that value is a mouthful, we can round it to the nearest 10th.

In [None]:
round(estimated.sd, digits=1)

So, in the end, with our sample of **5** values randomly selected from the dataset, we end up with an **Estimated Standard Deviation**, **2.8**, that is a lot smaller than the **Population Standard Deviation**, **5**.

**NOTE:** The `sd()` function divides the **sum of the squared residuals** by `n-1`. For fun, we can see what would happen if we divided the **sum of the squared residuals** by `n`. However, to do this, we need to do the math by hand. The good news is that it's essentially the same as how we calculated the **Population Standard Deviation**, only this time we use less data and the **Estimated Mean**. To keep the value that uses `n` in the denominator separate from the actual **Estimated Standard Deviation**, we'll call it the **Biased Standard Deviation**. **ALSO NOTE:** We'll see why we are calling it **Biased** in the next section.
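
Written in the same style as the formula for the **Population Standard Deviation**, the two calculations only differ in the denominator. Here $\bar{x}$ stands for the **Estimated Mean** and $n$ is the number of values in the sample.

$$\text{Estimated SD} = \sqrt{\frac{\sum(x - \bar{x})^2}{n-1}} \qquad \text{Biased SD} = \sqrt{\frac{\sum(x - \bar{x})^2}{n}}$$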

In [None]:
numerator <- sum((rand.sample - estimated.mean)^2) # = sigma(x - x_bar)^2
denominator <- length(rand.sample) # = n

biased.var <- numerator / denominator

## print out the biased variance
print(paste("biased variance:", biased.var))

## now we can calculate the biased standard deviation
## by taking the square root of the biased variance.
biased.sd <- sqrt(biased.var)

## print out the biased standard deviation
print(paste("biased sd:", biased.sd))

Again, since that value is a mouthful, we can round it to the nearest 10th.

In [None]:
round(biased.sd, digits=1)

So, we see that when we divide the **sum of the squared residuals** by `n`, we get a worse estimate, **2.5**, of the **Population Standard Deviation**, **5**, than when we divide by `n-1`, which gave us **2.8**.
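
Because the only difference between the two calculations is the denominator, dividing by `n` gives the same value as multiplying the result of `sd()` by $\sqrt{(n-1)/n}$. That means that, for the same sample, dividing by `n` gives the smaller value unless every value in the sample is identical. Here is a quick sketch of that relationship. The vector `toy.five` and its values are made up and are not from the real dataset.

```r
## Made-up sample of 5 values (not from the real dataset)
toy.five <- c(18, 22, 23, 25, 17)
n <- length(toy.five)

## Divide the sum of the squared residuals by n, by hand
biased.by.hand <- sqrt(sum((toy.five - mean(toy.five))^2) / n)

## Rescale the result of sd(), which divides by n-1
biased.from.sd <- sd(toy.five) * sqrt((n - 1) / n)

## Both approaches should agree, and both should be smaller than sd()
all.equal(biased.by.hand, biased.from.sd)
biased.by.hand < sd(toy.five)
```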

# DOUBLE BAM!!

Now let's see how, on average, dividing the **sum of the squared residuals** by `n` gives us an estimate that is biased towards being too small.

----

# See why dividing by $n$ results in a biased estimate and dividing by $n-1$ is unbiased

Now let's see why we get a biased estimate of the standard deviation when we divide by `n` and why dividing by `n-1` results in an unbiased estimate. To do this, we are going to calculate both values, but with lots and lots of different, randomly selected samples, and see what happens on average.

Specifically, we're going to create a sample from **5** randomly selected values, then calculate the **Estimated Standard Deviation** by dividing the **sum of the squared residuals** by `n-1` and by `n`. We'll then keep track of those estimates and repeat the process **1,000** times. In the end, we'll have **1,000** values for the **Estimated Standard Deviation** calculated by dividing by `n-1` and **1,000** values for the **Estimated Standard Deviation** calculated by dividing by `n`. Lastly, we'll take the average of both sets of values and see, on average, what the effect of dividing by `n-1` is compared to dividing by `n`.

In [None]:
max.samples <- 1000 # this is how many times we'll get a sample of random values.

## we'll save all of the estimated standard deviations in these vectors.
estimated.sds.with.n.minus.1 <- vector(length=max.samples)
estimated.sds.with.n <- vector(length=max.samples)

set.seed(42)

for(i in 1:max.samples) {

    ## get a sample of random values...
    rand.values <- sample(x=spend.n.save.df$num.apples, size=5, replace=FALSE)

    ## estimate the standard deviation with `n-1`...
    estimated.sds.with.n.minus.1[i] <- sd(rand.values)

    ## estimate the standard deviation with `n`
    estimated.mean <- mean(rand.values)    
    biased.sd <- sqrt(sum((rand.values - estimated.mean)^2) / length(rand.values))

    estimated.sds.with.n[i] <- biased.sd
}

## lastly, print out the average sd when we divide by n-1...
print(paste("Average SD with n-1:", round(mean(estimated.sds.with.n.minus.1), digits=1)))

## ...and when we divide by n
print(paste("Average SD with n:", round(mean(estimated.sds.with.n), digits=1)))


Now let's compare our **Estimated Standard Deviations** to the **Population Standard Deviation**, 5.

When we divided the **sum of the squared residuals** by `n-1`, the average **Estimated Standard Deviation** was **4.8**, which is closer to the **Population Standard Deviation** than when we calculated the **Estimated Standard Deviation** by dividing by `n`, **4.3**. In other words, when we only divide by `n`, we end up underestimating, on average, the **Population Standard Deviation** more than when we divide by `n-1`.
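
If you'd like to play with this idea without the `for` loop, here is a more compact sketch that uses `replicate()`. **NOTE:** To keep the sketch self-contained, it uses a made-up stand-in population, `fake.apples`, generated with `rnorm()`, and **not** the real **Spend-n-Save** data, so the averages will not match the values above exactly. The helper name `one.round` is also made up for this illustration.

```r
## A made-up stand-in population (NOT the real Spend-n-Save data)
set.seed(42)
fake.apples <- round(rnorm(n=5123, mean=20, sd=5))

## A hypothetical helper: get one random sample and return both estimates
one.round <- function(values, size=5) {
    rand.values <- sample(x=values, size=size, replace=FALSE)
    sum.sq.resid <- sum((rand.values - mean(rand.values))^2)
    c(n.minus.1=sqrt(sum.sq.resid / (size - 1)),  # same as sd()
      n=sqrt(sum.sq.resid / size))                # biased version
}

## replicate() calls one.round() 1000 times and returns a matrix
## with one row per type of estimate and one column per sample
estimates <- replicate(1000, one.round(fake.apples))

## Average each row to compare the two types of estimates
rowMeans(estimates)
```

Because both estimates come from the same samples, the average for `n` should come out smaller than the average for `n.minus.1`, just like we saw with the real data.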

In the programming exercises associated with the next chapter in **The StatQuest Illustrated Guide to Statistics**, we'll show how we can more easily visualize these results and make more sense of them. Until then, we'll just say...

# TRIPLE BAM!!