# [The StatQuest Illustrated Guide to Statistics]()
## Chapter 01 - Fundamental Concepts in Statistics!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Load data (the number of apples for sale at every single **Spend-n-Save** store) from a file.
- Calculate the **Population Mean** and **Population Standard Deviation** from **Spend-n-Save** data.
- Randomly select a subset of the data in the file and use it to calculate the **Estimated Mean** and the **Estimated Standard Deviation**.
- Compare the **Population Mean** to the **Estimated Mean** and the **Population Standard Deviation** to the **Estimated Standard Deviation**.
- Compare the **Estimated Standard Deviation** calculated by dividing by $n-1$ to the one calculated by dividing by $n$.

**NOTE:**
This tutorial assumes that you have installed **[Python](https://www.python.org/)** and read Chapter 1 in **[The StatQuest Illustrated Guide to Statistics]()**.

----

# Load data from a file

Just like **'Squatch** does in the book, we're going to calculate the **Population Mean** and **Population Standard Deviation** for the number of apples for sale at every single **Spend-n-Save**. So, the first thing we need to do is load in a file with all of the data. However, before we get to that, we have to load in some modules that will help us load data and do math.

In [1]:
import pandas as pd # to import data into a dataframe
import numpy as np # to set a global random seed
import statistics # to calculate mean and standard deviation

Now we can load in data. We'll use the Pandas function `read_csv()`, where we need to specify the name of the file (and the path to the file, if necessary) and the character used to separate the columns of data. In this case, `spend_n_save.txt` is a tab-delimited file, so we'll set `sep='\t'`.

In [2]:
## First, use pd.read_csv() to read the data in "spend_n_save.txt"
spend_n_save_df = pd.read_csv("spend_n_save.txt", sep='\t')

## Verify that pd.read_csv() was successful by printing out the
## first few rows with the head() function
spend_n_save_df.head()

Unnamed: 0,id,num.apples
0,1,27
1,2,17
2,3,22
3,4,23
4,5,22
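
If you'd like to see what "tab-delimited" means without needing the real file, here is a minimal sketch that parses a small in-memory string with Python's built-in `csv` module instead of Pandas. The text in `sample_text` is a hypothetical stand-in that reuses the five rows printed above; the actual layout of `spend_n_save.txt` may be different.

```python
import csv
import io

# Hypothetical stand-in for the first few lines of a tab-delimited file.
# The values are the five rows printed above; the real file may differ.
sample_text = "id\tnum.apples\n1\t27\n2\t17\n3\t22\n4\t23\n5\t22\n"

# delimiter="\t" plays the same role as sep='\t' in pd.read_csv()
reader = csv.DictReader(io.StringIO(sample_text), delimiter="\t")
rows = list(reader)

# the csv module reads everything as text, so convert the counts to integers
num_apples = [int(row["num.apples"]) for row in rows]

print(num_apples)
```

Each `\t` in the string marks the boundary between two columns, which is exactly what `sep='\t'` tells `read_csv()` to look for.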


Now, one problem with our data is that the second column is called `num.apples`, with a dot, `.`, between `num` and `apples`. This dot, which is common in variable and function names in the **R** programming language, will get in the way of our **Python** code, so let's replace it with an underscore, `_`.

In [None]:
## rename num.apples to num_apples so it's easier to access the values
spend_n_save_df.rename(columns={'num.apples': 'num_apples'}, inplace=True)

## print out first few rows to verify the rename
spend_n_save_df.head()

Now that we've seen the first 5 rows of the data, let's see how many rows there are in total with the `len()` function.

In [None]:
print("Number of rows in spend_n_save_df:", len(spend_n_save_df))

Cool. Now let's use all of the data to calculate the **Population Mean** and the **Population Standard Deviation**.

----

# Calculate the Population Mean and Population Standard Deviation

Now that we have the data, we can calculate the **Population Mean** of the number of apples for sale at each store with the `mean()` function, which is part of the `statistics` module.

**NOTE:** The equation for the **Population Mean** is the same as the equation for the **Estimated Mean**. The only difference is that for the **Population Mean** we use data from the entire population, and for the **Estimated Mean** we only use a subset of the data.

In [None]:
## Calculate the mean of the number of apples for sale
## because we are using the data from every single store
## we are calculating the Population Mean
## Anyway, we'll save the value in a variable called pop_mean
pop_mean = statistics.mean(spend_n_save_df.num_apples)

## print out the population mean
pop_mean
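
To see what `mean()` is doing under the hood, here is a small sketch that calculates a mean by hand and compares it to `statistics.mean()`. The list `apples` is a hypothetical toy dataset made from the five rows printed earlier in this notebook, not the full **Spend-n-Save** population.

```python
import statistics

# Hypothetical toy data: the five rows printed earlier in this notebook
apples = [27, 17, 22, 23, 22]

# The mean is the sum of the values divided by the number of values
mean_by_hand = sum(apples) / len(apples)

# statistics.mean() does the same calculation for us
mean_from_module = statistics.mean(apples)

print("By hand:", mean_by_hand)
print("With statistics.mean():", mean_from_module)
```

The same arithmetic gives the **Population Mean** when the list holds every value in the population and an **Estimated Mean** when it only holds a sample.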

Since the **Population Mean** has so many digits after the decimal point, it's quite a mouthful. We can make it easier to talk about it if we round it to the nearest 10th with the `round()` function, which takes the number we want to round, and then we specify the number of digits past the decimal point that we want to round to with `ndigits`. In this example, we'll set `ndigits=1`.

In [None]:
round(pop_mean, ndigits=1)

Now let's calculate the **Population Standard Deviation**. We can do that with the `pstdev()` function, which is also part of the `statistics` module.

In [None]:
## calculate the population standard deviation and save it in 
## a variable called pop_sd
pop_sd = statistics.pstdev(spend_n_save_df.num_apples)

## print out the population standard deviation
pop_sd
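
Here is a minimal sketch of the calculation that `pstdev()` performs, written out step by step. It assumes, just for illustration, that the hypothetical list `apples` below is an entire population; it is not the real **Spend-n-Save** data.

```python
import math
import statistics

# Hypothetical toy data, treated here as if it were a whole population
apples = [27, 17, 22, 23, 22]

# Step 1: calculate the population mean
mu = sum(apples) / len(apples)

# Step 2: square the difference between each value and the mean
squared_diffs = [(x - mu) ** 2 for x in apples]

# Step 3: average the squared differences (divide by N) and take the square root
sd_by_hand = math.sqrt(sum(squared_diffs) / len(apples))

# statistics.pstdev() should agree with the by-hand calculation
sd_from_module = statistics.pstdev(apples)

print("By hand:", sd_by_hand)
print("With statistics.pstdev():", sd_from_module)
```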

Just like we did with the **Population Mean**, we can round the **Population Standard Deviation** to the nearest 10th with the `round()` function.

In [None]:
round(pop_sd, ndigits=1)

# BAM!

Now let's learn how to calculate the **Estimated Mean** and the **Estimated Standard Deviation**.

----

# Calculate the Estimated Mean and the Estimated Standard Deviation with a randomly selected subset of the data

Now that we know how to calculate the **Population Mean** and the **Population Standard Deviation**, let's learn how to calculate an **Estimated Mean** and an **Estimated Standard Deviation**.

The first thing we'll do is randomly select 5 values from the dataset. Specifically, we want to randomly select 5 values representing the number of apples for sale at a store. We can do this with the following command...

`rand_sample = spend_n_save_df.num_apples.sample(n=5)`

...where `n=5` specifies that we want to sample 5 values.

In [None]:
## Set seed for reproducibility
np.random.seed(42)

## Select a random sample of size 5 without replacement
rand_sample = spend_n_save_df.num_apples.sample(n=5)

## print out the 5 values in rand_sample
rand_sample
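
If you're wondering what setting a seed buys us, here is a small sketch that uses Python's built-in `random` module rather than Pandas. The list `population` is hypothetical, and `random.Random(42)` is just one way to create a seeded generator, but the idea is the same: the same seed gives the same "random" sample every time.

```python
import random

# Hypothetical population of apple counts (not the real Spend-n-Save data)
population = [27, 17, 22, 23, 22, 19, 25, 30, 21, 24]

# Two generators created with the same seed pick the same 5 values.
# sample() selects without replacement, so no store is picked twice.
first_sample = random.Random(42).sample(population, k=5)
second_sample = random.Random(42).sample(population, k=5)

print(first_sample)
print("Same sample both times?", first_sample == second_sample)
```

Without a seed, each run would select a different sample, and the numbers in the rest of this section would change from run to run.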

Now that we have our sample from 5 randomly selected **Spend-n-Save** stores, we can calculate the estimated mean with the `mean()` function, just like before.

In [None]:
## calculate the estimated mean
estimated_mean = statistics.mean(rand_sample)

## print out the estimated mean
estimated_mean

To calculate the estimated standard deviation, we can use the `stdev()` function that is part of the `statistics` module.

In [None]:
## Now calculate the estimated standard deviation
estimated_sd = statistics.stdev(rand_sample)
estimated_sd
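
Here is a minimal sketch of what `stdev()` calculates, written out by hand. The list `sample` is a hypothetical sample of 5 stores, not the one selected above, and the key detail is that we divide by $n-1$ instead of $n$.

```python
import math
import statistics

# Hypothetical sample of apple counts from 5 stores
sample = [27, 17, 22, 23, 22]
n = len(sample)

# Step 1: calculate the estimated mean
x_bar = sum(sample) / n

# Step 2: add up the squared differences from the estimated mean
sum_of_squares = sum((x - x_bar) ** 2 for x in sample)

# Step 3: divide by n - 1 (not n) and take the square root
sd_by_hand = math.sqrt(sum_of_squares / (n - 1))

# statistics.stdev() should agree with the by-hand calculation
sd_from_module = statistics.stdev(sample)

print("By hand:", sd_by_hand)
print("With statistics.stdev():", sd_from_module)
```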

And since the estimated standard deviation is a mouthful, we can round it to the nearest 10th with the `round()` function.

In [None]:
round(estimated_sd, ndigits=1)

# DOUBLE BAM!!

Now let's get a sense for why the **Population Standard Deviation** is calculated with a slightly different formula than the **Estimated Standard Deviation**.

----

# See why dividing by $n$ results in a biased estimate and dividing by $n-1$ is unbiased

First, let's remember the equation for the **Population Standard Deviation**...

<span style="font-size: 24px;">
$\sqrt{\frac{\sum(x - \mu)^2}{N}}$
</span>

...and the equation for the **Estimated Standard Deviation**...

<span style="font-size: 24px;">
$\sqrt{\frac{\sum(x - \bar{x})^2}{n-1}}$
</span>

Other than replacing the **Population Mean**, $\mu$, with the **Estimated Mean**, $\bar{x}$, the big difference is in the denominator. The **Population Standard Deviation** has $N$, the total number of measurements in the population, and the **Estimated Standard Deviation** has $n-1$, where $n$ is the number of measurements in the sample.

The reason the denominators are different is, in theory, because dividing by $n$ in the **Estimated Standard Deviation** will, on average, result in an underestimate of the **Population Standard Deviation**, and dividing by $n-1$ corrects for this bias.
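
One way to get a feel for this is to notice that, for any single sample, the two formulas only differ by a constant factor: the value we get when dividing by $n$ is the value we get when dividing by $n-1$ multiplied by $\sqrt{(n-1)/n}$, which is always less than 1. The sketch below checks this on a hypothetical sample of 5 values.

```python
import math
import statistics

# Hypothetical sample of apple counts from 5 stores
sample = [27, 17, 22, 23, 22]
n = len(sample)

# stdev() divides by n - 1, pstdev() divides by n
sd_n_minus_1 = statistics.stdev(sample)
sd_n = statistics.pstdev(sample)

# dividing by n shrinks the result by this factor, which is less than 1
shrink_factor = math.sqrt((n - 1) / n)

print("Dividing by n:", sd_n)
print("Dividing by n-1, then shrinking:", sd_n_minus_1 * shrink_factor)
```

So dividing by $n$ always gives the smaller of the two values, and the gap is biggest when $n$ is small.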

So, let's see if this theory is actually true. We'll do this by selecting a bunch of samples, each with 5 randomly selected measurements (numbers of apples for sale), and calculating the Standard Deviation both ways: using $n$ in the denominator and using $n - 1$ in the denominator.

In [None]:
# Set seed for reproducibility
np.random.seed(42)

max_samples = 1000 # this is how many times we'll get a sample of random values.

## we'll save all of the estimated standard deviations in these lists.
estimated_sds_with_n_minus_1 = [0] * max_samples
estimated_sds_with_n = [0] * max_samples

## here's a loop where we'll randomly select
## a sample of 5 measurements and then calculate
## the standard deviation two different ways.
for i in range(max_samples):
    
    ## get a sample of random values...
    rand_sample = spend_n_save_df.num_apples.sample(n=5)
    
    # Standard deviation with n-1 (estimated standard deviation)
    estimated_sds_with_n_minus_1[i] = statistics.stdev(rand_sample)

    # Standard deviation with n
    estimated_sds_with_n[i] = statistics.pstdev(rand_sample)

Now let's calculate and print the average standard deviations that we just calculated for both equations...

In [None]:
print("Average SD with n-1:", round(sum(estimated_sds_with_n_minus_1) / max_samples, 1))
print("Average SD with n:", round(sum(estimated_sds_with_n) / max_samples, 1))

...and compare those values to the actual **Population Standard Deviation**...

In [None]:
pop_sd

...and we see that, on average, when we just have $n$ in the denominator, we underestimate the **Population Standard Deviation** more than when we have $n-1$ in the denominator.
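
If you don't have `spend_n_save.txt` handy, the sketch below repeats the same experiment with a made-up population and only the Python standard library. Everything about `population` (its size and the range of values) is an assumption chosen for illustration, so the exact numbers will not match the ones above.

```python
import random
import statistics

rng = random.Random(42)  # seeded generator so the run is repeatable

# Hypothetical population: 10,000 made-up apple counts, not the real data
population = [rng.randint(10, 35) for _ in range(10_000)]
pop_sd = statistics.pstdev(population)

num_samples = 1000
sds_n_minus_1 = []
sds_n = []

for _ in range(num_samples):
    # select 5 values without replacement
    sample = rng.sample(population, k=5)

    # standard deviation with n - 1 in the denominator
    sds_n_minus_1.append(statistics.stdev(sample))

    # standard deviation with n in the denominator
    sds_n.append(statistics.pstdev(sample))

avg_n_minus_1 = statistics.mean(sds_n_minus_1)
avg_n = statistics.mean(sds_n)

print("Population SD:", round(pop_sd, 1))
print("Average SD with n-1:", round(avg_n_minus_1, 1))
print("Average SD with n:", round(avg_n, 1))
```

Try changing the sample size from 5 to something larger, like 50, to see how the gap between the two averages changes.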

# TRIPLE BAM!!!