# The StatQuest Illustrated Guide to Statistics
## Chapter 02 - Visualizing Data and Calculating Probabilities with Histograms!!!!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Build a histogram of the Number of Apples for sale at each **Spend-n-Save** store, and save the histogram as a **PDF** file.
- Change the histogram's bin sizes and see the effect this has on the insights we can draw from the data.
- Calculate probabilities from the data.
- Lastly, as a bonus, draw histograms of the **Estimated Standard Deviations** that we calculated at the end of the coding exercises for Chapter 1.



**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)** (and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**) and have read Chapter 2 in **[The StatQuest Illustrated Guide to Statistics]()**.

----

# Load the Spend-n-Save data and use it to draw a histogram

Just like we did in the exercises for Chapter 1, we start by loading in the **Spend-n-Save** data.

In [None]:
## First, use read.delim() to read the data in "spend_n_save.txt"
spend.n.save.df <- read.delim(file="https://raw.githubusercontent.com/StatQuest/sigs/refs/heads/main/chapter_01/spend_n_save.txt", sep="\t")

## Verify that read.delim() was successful by printing out the
## first few rows with the head() function
print("The first few rows of data in spend.n.save.df")
head(spend.n.save.df)

Now that we have the data, we can use the `hist()` function to draw a histogram of the Number of Apples for sale at each store.

In [None]:
hist(spend.n.save.df$num.apples)

# BAM!

That was easy! Now let's save it as a **PDF** file. We do this by first specifying the name of the **PDF** file we want to create with the `pdf()` function. In this case, we want to name the file `spend_n_save_histogram.pdf`, so we call `pdf("spend_n_save_histogram.pdf")`.

After we have specified the name of the new **PDF** file, we run the code that defines whatever we want in the **PDF**. In this case, we want the **PDF** to contain our histogram of the apple data, so we run the same `hist()` command we used before, `hist(spend.n.save.df$num.apples)`.

Lastly, we call `dev.off()` to tell `R` that we are done adding things to our **PDF** file.

In [None]:
pdf("spend_n_save_histogram.pdf")
hist(spend.n.save.df$num.apples)
dev.off()
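If the default page size doesn't suit the graph, `pdf()` also accepts `width` and `height` arguments, measured in inches. Here is a minimal sketch of that, using a small made-up vector called `toy.apples` (not the real **Spend-n-Save** data) and a temporary file name, both of which are just assumptions for illustration.

```r
## A small, made-up dataset (NOT the real Spend-n-Save data)
toy.apples <- c(3, 12, 15, 17, 18, 19, 20, 21, 22, 24, 26, 33)

## Write to R's temporary directory so we don't clutter the working directory
pdf.file <- file.path(tempdir(), "toy_histogram.pdf")

## Open the PDF device, asking for a page 6 inches wide and 4 inches tall
pdf(pdf.file, width=6, height=4)

## Everything drawn now goes into the PDF, including a title and axis label
hist(toy.apples, main="Toy apple counts", xlab="Number of apples")

## Close the PDF device so the file is completely written to disk
dev.off()

## Verify that the file was created
print(file.exists(pdf.file))
```

If we forget to call `dev.off()`, the file may be incomplete, so it's a good habit to check that the file exists and opens correctly.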

Now that we know how to create a basic histogram and save the graph to a **PDF**, let's learn how we can modify the bin sizes and see what effects that has on how we can interpret the data.

-----

# Change the bin sizes.

By default, the `hist()` function tries to automatically define the width of the bins so that the resulting graph is most helpful. However, we can manually adjust the bin sizes as well. First, let's review what the `hist()` function does without us doing anything.

In [None]:
hist(spend.n.save.df$num.apples)

The result of using the `hist()` function on our data without adjusting the bin sizes is actually pretty good. We can see that most of the stores had between **15** and **25** apples for sale, and the average is close to **20**. This is consistent with the **Population Mean**, **19**, that we calculated in the exercises for Chapter 1.

So, now that we see a relatively good looking histogram, let's see what happens when we ask for everything to fall into just **1** of **2** bins by setting `breaks=2`.

In [None]:
hist(spend.n.save.df$num.apples, breaks=2) 

Now, by setting `breaks=2`, we have a histogram that is pretty terrible. We can't really see where the average might be and it's impossible to know that stores selling close to **0** apples are quite rare.

Now let's see what happens when we increase the number of bins to **10** with `breaks=10`.

In [None]:
hist(spend.n.save.df$num.apples, breaks=10) 

Now, even with just 10 bins, we can see trends in the data. It's now clear that the average should be close to **20** and that stores with close to **0** apples for sale are relatively rare.
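One thing worth knowing is that when `breaks` is a single number, `hist()` treats it as a suggestion and picks "pretty" breakpoints close to that number of bins. If we want exact control, we can pass a vector of bin edges instead. The sketch below assumes a small made-up vector, `toy.apples`, and bins that are 5 apples wide; both are illustrative choices, not part of the original exercise.

```r
## A small, made-up dataset (NOT the real Spend-n-Save data)
toy.apples <- c(3, 12, 15, 17, 18, 19, 20, 21, 22, 24, 26, 33)

## Define the exact bin edges: 0, 5, 10, ..., 35
## NOTE: the edges must cover the full range of the data
bin.edges <- seq(from=0, to=35, by=5)

## plot=FALSE returns the histogram's numbers without drawing it,
## which lets us inspect the bins. Drop plot=FALSE to draw the graph.
toy.hist <- hist(toy.apples, breaks=bin.edges, plot=FALSE)

## The bin edges that were actually used
print(toy.hist$breaks)

## The number of values that landed in each bin
print(toy.hist$counts)
```

By default each bin includes its right edge, so a store with exactly **15** apples is counted in the bin from **10** to **15**, not the bin from **15** to **20**.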

# Double BAM!!

Now that we know one way to adjust the number of bins in a histogram, let's learn how to calculate probabilities from them, and from the data in general.

-----

# Calculating probabilities from the data

Now, because we have a histogram, in theory, we could use it to calculate probabilities. However, calculating probabilities directly from the data itself is way, way easier to do, and way, way more flexible. Instead of being limited by the bins, when we calculate probabilities directly from the data, we can pick any value.

For example, the probability of walking into a **Spend-n-Save** store that sells at least **32** apples is the number of stores that sell at least **32** apples divided by the total number of stores.

In order to do this in `R`, we use the `sum()` function to count the number of rows in the dataset with **32** or more apples. We do this with a relatively subtle command: `sum(spend.n.save.df$num.apples >= 32)`. This works because `spend.n.save.df$num.apples >= 32` returns a vector of TRUEs and FALSEs that reflect whether or not each row had `num.apples >= 32`. Numerically, `TRUE = 1` and `FALSE = 0`, so when we add up all the TRUEs, we are adding up 1s, and this gives us the number of rows where `num.apples >= 32`.

In [None]:
## count the number of rows in the dataset where num.apples >= 32
num.stores <- sum(spend.n.save.df$num.apples >= 32)

## now print out the number of rows
print(paste("The number of rows with num.apples >= 32:", num.stores))
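To see the `TRUE = 1` and `FALSE = 0` trick one step at a time, here is a tiny sketch with a made-up vector of 5 values called `toy.apples` (an assumption for illustration, not the real data).

```r
## Five made-up apple counts (NOT the real Spend-n-Save data)
toy.apples <- c(12, 35, 20, 32, 8)

## Step 1: the comparison gives one TRUE or FALSE per value
at.least.32 <- toy.apples >= 32
print(at.least.32)

## Step 2: converting to integers shows the 1s and 0s hiding underneath
print(as.integer(at.least.32))

## Step 3: sum() adds up the 1s, which counts the TRUEs
print(sum(at.least.32))
```

Only the second and fourth values are **32** or more, so the sum is **2**.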

Now we just divide `num.stores` by the total number of rows in the dataset, which we can get with the `nrow()` function.

In [None]:
num.stores / nrow(spend.n.save.df)

Now let's round that to the nearest 100th...

In [None]:
round(num.stores / nrow(spend.n.save.df), digits=2)

...and the probability that we might randomly walk into a **Spend-n-Save** store and see **32** or more apples for sale is **0.01**.
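Because the mean of a bunch of 1s and 0s is just the fraction of 1s, we can also get the probability in a single step with `mean()`. The sketch below uses a made-up vector, `toy.apples` (an assumption for illustration), and also shows how to combine two conditions with `&` to get the probability of landing in a range.

```r
## Ten made-up apple counts (NOT the real Spend-n-Save data)
toy.apples <- c(12, 35, 20, 32, 8, 19, 21, 25, 17, 23)

## The long way: count the matches, then divide by the total
p.long <- sum(toy.apples >= 32) / length(toy.apples)

## The short way: the mean of TRUEs and FALSEs is the proportion of TRUEs
p.short <- mean(toy.apples >= 32)

## The probability of a range: both conditions must be TRUE
p.range <- mean(toy.apples >= 15 & toy.apples <= 25)

print(paste("P(at least 32), the long way:", p.long))
print(paste("P(at least 32), the short way:", p.short))
print(paste("P(between 15 and 25):", p.range))
```

With the real data, the equivalent one-liner would be `mean(spend.n.save.df$num.apples >= 32)`.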

# TRIPLE BAM!!!

----

# Draw histograms of the estimated standard deviations we calculated in Chapter 1.

In Chapter 1, we ended the exercises by calculating a lot of **Estimated Standard Deviations** to compare what happens when we divide the **sum of the squared residuals** by `n` or `n-1`. Now let's pair those results with histograms to get more insight into why we divide by `n-1` instead of `n` when we estimate the **Standard Deviation**.

First, let's re-create the original data. We can do this because we use `set.seed()` to ensure that we get the same sequence of random numbers each time we run the program.

In [None]:
max.samples <- 1000 # this is how many times we'll get a sample of random values.

## we'll save all of the estimated standard deviations in these vectors.
estimated.sds.with.n.minus.1 <- vector(length=max.samples)
estimated.sds.with.n <- vector(length=max.samples)

set.seed(42)

for(i in 1:max.samples) {

    ## get a sample of random values...
    rand.values <- sample(x=spend.n.save.df$num.apples, size=5, replace=FALSE)

    ## estimate the standard deviation with `n-1`...
    estimated.sds.with.n.minus.1[i] <- sd(rand.values)

    ## estimate the standard deviation with `n`
    estimated.mean <- mean(rand.values)    
    biased.sd <- sqrt(sum((rand.values - estimated.mean)^2) / length(rand.values))

    estimated.sds.with.n[i] <- biased.sd
}

## lastly, print out the average sd when we divide by n-1...
print(paste("Average SD with n-1:", round(mean(estimated.sds.with.n.minus.1), digits=1)))

## ...and when we divide by n
print(paste("Average SD with n:", round(mean(estimated.sds.with.n), digits=1)))
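Since both estimates start with the same **sum of the squared residuals**, they only differ by a constant factor: dividing by `n` always gives the `n-1` estimate multiplied by `sqrt((n-1)/n)`. The sketch below checks this with one made-up sample of 5 values, `rand.values` (an assumption for illustration, not a sample from the real data).

```r
## One made-up sample of 5 values (NOT drawn from the Spend-n-Save data)
rand.values <- c(18, 22, 15, 27, 20)
n <- length(rand.values)

## sd() divides the sum of the squared residuals by n-1
sd.with.n.minus.1 <- sd(rand.values)

## dividing by n instead gives a smaller estimate
sd.with.n <- sqrt(sum((rand.values - mean(rand.values))^2) / n)

## The ratio of the two estimates...
print(sd.with.n / sd.with.n.minus.1)

## ...matches sqrt((n-1)/n), no matter what values are in the sample
print(sqrt((n - 1) / n))
```

When `n = 5`, that factor is about **0.89**, so every estimate that divides by `n` is roughly 11% smaller than the corresponding estimate that divides by `n-1`.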

Now let's draw histograms of the estimates we got by dividing by `n - 1` and by `n`, while also superimposing the population standard deviation on top. So, first let's calculate the population standard deviation.

In [None]:
## Calculate the mean of the number of apples for sale.
## Because we are using the data from every single store,
## we are calculating the Population Mean.
## We'll save the value in a variable called pop.mean
pop.mean <- mean(spend.n.save.df$num.apples)

## Now print out the value in pop.mean
print(paste("population mean:", pop.mean))

## calculate the population standard deviation and save it in 
## a variable called pop.sd

## First, let's just calculate the population variance...
numerator <- sum((spend.n.save.df$num.apples - pop.mean)^2) # = sigma(x - mu)^2
denominator <- nrow(spend.n.save.df) # = N

pop.var <- numerator / denominator

## print out the population variance
print(paste("population variance:", pop.var))

## now we can calculate the population standard deviation
## by taking the square root of the population variance.
pop.sd <- sqrt(pop.var)

## print out the population standard deviation
print(paste("population sd:", pop.sd))
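If we need the population standard deviation more than once, we can wrap those steps in a small helper function. This is just a sketch: the function name `pop.sd.of()` and the vector `toy.population` are made up for illustration. Remember that `R`'s built-in `sd()` function divides by `n-1`, so it does not give us the population value.

```r
## A hypothetical helper that divides by N instead of n-1.
## mean() of the squared differences is the same as sum(...) / N.
pop.sd.of <- function(x) {
    sqrt(mean((x - mean(x))^2))
}

## A made-up population of 8 values (NOT the real Spend-n-Save data)
toy.population <- c(2, 4, 4, 4, 5, 5, 7, 9)

## The mean is 5 and the squared differences add up to 32,
## so the population variance is 32 / 8 = 4 and the population sd is 2
print(pop.sd.of(toy.population))

## sd() divides by n-1 = 7 instead, so it gives a larger value
print(sd(toy.population))
```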

Now let's draw the histograms with the population standard deviation on top.

In [None]:
hist(estimated.sds.with.n.minus.1)
abline(v=pop.sd, col="red", lwd=10, lty="dotted") 

The histogram we just drew, for the `n-1` data, is relatively symmetrical around the **Population Standard Deviation**, which, if you remember, is **5**. This means that when we divide by `n-1`, the **Estimated Standard Deviations** tend to overestimate the **Population Standard Deviation** about as frequently as they underestimate it. In other words, the **Estimated Standard Deviations** are not biased toward being less than or greater than the **Population Standard Deviation**.

Now let's see what the histogram looks like when we divide the **sum of the squared residuals** by `n`.

In [None]:
hist(estimated.sds.with.n)
abline(v=pop.sd, col="red", lwd=10, lty="dotted") 

When we divide by `n` instead of `n-1`, we no longer get a histogram that is symmetrical around the **Population Standard Deviation**. Instead, it seems skewed towards lower values. In other words, when we divide by `n`, it looks like we tend to underestimate the **Population Standard Deviation** more frequently than we overestimate it, and, as a result, we would call these estimates biased.
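The two histograms are easier to compare when they sit side by side and share the same x-axis. Here is one way we might do that with `par(mfrow=c(1, 2))` and the `xlim` argument. To keep the sketch self-contained, it uses a made-up stand-in population, `toy.population`, generated with `rnorm()`; that population, the variable names and the thinner line width are all assumptions for illustration.

```r
## A made-up stand-in population (NOT the real Spend-n-Save data)
set.seed(42)
toy.population <- round(rnorm(n=1000, mean=20, sd=5))
toy.pop.sd <- sqrt(mean((toy.population - mean(toy.population))^2))

## Re-run the sampling experiment on the stand-in population
toy.sds.n.minus.1 <- numeric(1000)
toy.sds.n <- numeric(1000)
for (i in 1:1000) {
    rand.values <- sample(x=toy.population, size=5, replace=FALSE)
    toy.sds.n.minus.1[i] <- sd(rand.values)
    toy.sds.n[i] <- sqrt(mean((rand.values - mean(rand.values))^2))
}

## Use one x-axis range for both graphs so they line up
shared.xlim <- range(c(toy.sds.n.minus.1, toy.sds.n))

## Split the plotting area into 1 row and 2 columns,
## saving the old settings so we can restore them afterwards
old.par <- par(mfrow=c(1, 2))

hist(toy.sds.n.minus.1, xlim=shared.xlim, main="Divide by n-1", xlab="Estimated SD")
abline(v=toy.pop.sd, col="red", lwd=3, lty="dotted")

hist(toy.sds.n, xlim=shared.xlim, main="Divide by n", xlab="Estimated SD")
abline(v=toy.pop.sd, col="red", lwd=3, lty="dotted")

## Restore the original plotting settings
par(old.par)
```

To draw the same pair of graphs for the real data, we would swap in `estimated.sds.with.n.minus.1`, `estimated.sds.with.n` and `pop.sd`.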

# bam.