# [The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)
## Chapter 03 - Saving Time and Money with Probability Distributions and Models!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Draw statistical distributions and save the images as **PDF**s.
- Fit a statistical distribution to a histogram.
- Calculate probabilities from statistical distributions.
- Lastly, we'll learn how to generate random numbers from statistical distributions. 

**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)** (and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**) and have read Chapter 3 in **[The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)**.

----

# Drawing a statistical distribution

There are two primary ways to draw curves, like the ones made by statistical distributions, in R: a rather elegant way that uses the `curve()` function, and a kind of clunky three-step method that I used before I learned about the `curve()` function. Because both methods are commonly used (and not just by me), we'll learn both. We'll start with the clunky three-step method since it will shed some light on how things are done with the `curve()` function.

## The clunky three-step method for drawing a statistical distribution

The first thing we need to do to draw a statistical distribution with the clunky three-step method is generate an array of x-axis coordinates that span the range that we want to draw. We do this with the `seq()` function, which generates a sequence of numbers from a starting point to an ending point. For a normal distribution with mean = 0 and standard deviation = 1, we'll create a sequence of x-axis coordinates from -5 to 5, with a step size = 0.1. This will create a sequence of 101 values equally spaced between -5 and 5.

In [None]:
## create an array of x-axis coordinates
x.axis <- seq(from=-5,
              to=5,
              by=0.1) # the step size

## print out the first 10 values
x.axis[1:10]
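
As a quick sanity check on the claim that this sequence has 101 values, here is a minimal sketch that counts them with `length()`. It also shows `length.out`, an alternative `seq()` argument that asks for a number of values instead of a step size. The name `x.axis.alt` is just an illustrative name that is not used elsewhere in this notebook.

```r
## the same sequence as above, defined by step size
x.axis <- seq(from=-5, to=5, by=0.1)

## count the number of values in the sequence
length(x.axis)

## an alternative: ask for 101 equally spaced values
## instead of specifying the step size
x.axis.alt <- seq(from=-5, to=5, length.out=101)

## check that both approaches give (numerically) the same values
all.equal(x.axis, x.axis.alt)
```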

Next we need to determine the y-axis coordinates that correspond to each value in `x.axis`. For a normal distribution, we get the y-axis coordinates with the `dnorm()` function, where the **d** stands for **density** and **norm** stands for **normal distribution**. The use of the term **density** comes from the fact that the curve will tell us where the probabilities are most dense.

Anyway, `dnorm()` takes three arguments: `x` needs to be an array of x-axis coordinates, `mean` is the mean of the distribution, and `sd` is the standard deviation. In this example, the mean will be 0 and the standard deviation will be 1.

In [None]:
## create an array of y-axis coordinates that
## correspond to each value in x.axis
y.axis <- dnorm(x=x.axis, mean=0, sd=1)

## print out the first 10 values
y.axis[1:10]
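
If you're curious what `dnorm()` is doing under the hood, here is a minimal sketch that plugs a few x-axis coordinates into the standard equation for the normal density and compares the result to `dnorm()`. The names `x.vals`, `mu`, `sigma`, `by.hand` and `from.dnorm` are illustrative names chosen for this sketch.

```r
## a few x-axis coordinates to test
x.vals <- c(-1, 0, 1)
mu <- 0     # mean
sigma <- 1  # standard deviation

## the normal density calculated "by hand":
## 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
by.hand <- (1 / (sigma * sqrt(2 * pi))) * exp(-(x.vals - mu)^2 / (2 * sigma^2))
by.hand

## the same values from dnorm()
from.dnorm <- dnorm(x=x.vals, mean=mu, sd=sigma)
from.dnorm

## check that the two approaches agree
all.equal(by.hand, from.dnorm)
```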

Now that we have both the x-axis coordinates and the corresponding y-axis coordinates for a normal distribution we can draw them with the `plot()` function.

In [None]:
plot(x.axis, y.axis)

And, as we see, by default, `plot()` draws all 101 points that we have coordinates for. However, this isn't really what we want. What we want is a nice, smooth looking curve. 

The good news is that we can draw a nice, smooth looking curve by passing `type="l"` to `plot()`, where `"l"` specifies that we want lines between each point, instead of just a bunch of points.

In [None]:
plot(x.axis, y.axis, type="l")

Lastly, before we move on, I want to point out that we can change the color of the normal distribution with `col`, and we can specify the thickness of the line with `lwd`, which is short for **line width**.

In [None]:
plot(x.axis, y.axis, # x and y-axis coordinates
     type="l", # draw lines between points
     col="blue", # color of the line
     lwd=20) # line thickness = 20

Thus, we can draw a normal distribution with three steps:

- Define the x-axis coordinates with `seq()`
- Get the corresponding y-axis coordinates with `dnorm()`
- Plot the values with `plot()`.
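
The three steps above can be sketched as a single helper function. This is just one way to bundle them, and `draw.normal()` and `coords` are hypothetical names made up for this sketch (they are not part of R).

```r
## draw.normal() is a hypothetical helper that bundles the three steps
draw.normal <- function(mean=0, sd=1, from=-5, to=5, by=0.1, ...) {
  ## Step 1: define the x-axis coordinates
  x.axis <- seq(from=from, to=to, by=by)

  ## Step 2: get the corresponding y-axis coordinates
  y.axis <- dnorm(x=x.axis, mean=mean, sd=sd)

  ## Step 3: plot the values as a line, passing any extra
  ## arguments, like col and lwd, on to plot()
  plot(x.axis, y.axis, type="l", ...)

  ## quietly return the coordinates in case we want them later
  invisible(data.frame(x=x.axis, y=y.axis))
}

## draw a blue normal curve and keep the coordinates
coords <- draw.normal(mean=0, sd=1, col="blue", lwd=5)
head(coords)
```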

Now let's see how we can do all three steps at once with the `curve()` function.

## Using `curve()` to draw a normal distribution

To use the `curve()` function we have to pass it the distribution function that defines the y-axis coordinate values, which, in this example, is `dnorm()`, the `from` and `to` values we passed to `seq()`, and the number of points we want to draw lines between to see a curve (more points would give us a smoother curve, but take longer to draw).

The one thing that might seem strange about using the `curve()` function is that the call to `dnorm()` sets the x-axis values to `x`, which is an array we haven't created yet. Instead, `curve()` will create this for us.

Thus, a basic call to `curve()` looks like this:

In [None]:
curve(dnorm(x=x, mean=0, sd=1), # distribution function and parameters we want to draw
      from=-5, # minimum x-axis value
      to=5, # maximum x-axis value
      n=101) # the number of points to draw lines between
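
To see that `curve()` really does create the x-axis coordinates for us, here is a small sketch that saves what `curve()` returns. In addition to drawing the graph, `curve()` invisibly returns a list with the `x` and `y` coordinates it used. The name `curve.coords` is just an illustrative name.

```r
## draw the curve and save the coordinates that curve() created
curve.coords <- curve(dnorm(x=x, mean=0, sd=1),
                      from=-5,
                      to=5,
                      n=101)

## curve() created 101 x-axis coordinates...
length(curve.coords$x)

## ...that match what we made earlier with seq()...
all.equal(curve.coords$x, seq(from=-5, to=5, by=0.1))

## ...and the y-axis coordinates are what dnorm() gives us
all.equal(curve.coords$y, dnorm(x=curve.coords$x, mean=0, sd=1))
```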

However, we can also change the color and width of the line just like we did when we called the `plot()` function earlier.

In [None]:
curve(dnorm(x=x, mean=0, sd=1), # distribution function and parameters we want to draw
      from=-5, # minimum x-axis value
      to=5, # maximum x-axis value
      n=101, # the number of points to draw lines between
      col="blue", # color of the curve
      lwd=20) # width of the line.

Bam! Now let's learn how to save the graph as a PDF.

## Saving a graph as a PDF

To save a graph as a PDF, we first call the `pdf()` function and specify the name of the PDF file we want to create. In this example, we'll call our new file `my_normal_curve.pdf`. We can then call either `plot()` or `curve()`, whichever one we originally used to draw the graph to the screen. Lastly, we call `dev.off()` to tell **R** that we are done telling it what goes in the new file.

In [None]:
pdf("my_normal_curve.pdf") # define the new file and its name

curve(dnorm(x=x, mean=0, sd=1), # add the curve to the file
      from=-5,
      to=5,
      n=101,
      col="blue",
      lwd=20)

dev.off() # let R know that we are done telling it what goes in the new file

# BAM!
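
If you want to confirm that the PDF was actually created, here is a minimal sketch that writes the file to a temporary directory and then checks for it with `file.exists()`. The `width` and `height` arguments to `pdf()` set the size of the page in inches. The name `pdf.file`, and the choice to use `tempdir()`, are assumptions made for this sketch so that it doesn't clutter your working directory.

```r
## build a file name inside R's temporary directory
pdf.file <- file.path(tempdir(), "my_normal_curve.pdf")

## define the new file, and make the page 7 inches wide and 5 inches tall
pdf(pdf.file, width=7, height=5)

curve(dnorm(x=x, mean=0, sd=1), # add the curve to the file
      from=-5,
      to=5,
      n=101,
      col="blue",
      lwd=5)

dev.off() # let R know that we are done

## verify that the file exists and is not empty
file.exists(pdf.file)
file.size(pdf.file)
```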

Now that we know how to draw a graph of a normal curve and then save it to a PDF, let's learn how to do the same thing with the **Exponential Distribution**.

## Drawing an exponential distribution and saving it to a PDF

Drawing an exponential distribution curve is just like drawing a normal distribution, except instead of using `dnorm()` to get the y-axis coordinates, we use `dexp()`, where, once again, **d** stands for density, but now **exp** stands for **exponential distribution**.

The big difference between calling `dnorm()` and `dexp()` is that instead of specifying the mean and standard deviation, we have to specify the rate. The rate is defined as `1/mean`.
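
To get a feel for the relationship between the rate and the mean, here is a rough sketch that generates a lot of random values from an exponential distribution with `rexp()` and then averages them. The names `demo.rate` and `demo.values` are illustrative, and the rate of 0.5 is just an example value. If the rate is `1/mean`, then the average should be close to `1/0.5 = 2`.

```r
set.seed(42) # set the seed so that the results are reproducible

## an example rate
demo.rate <- 0.5

## generate 100,000 random values from an exponential
## distribution with that rate
demo.values <- rexp(n=100000, rate=demo.rate)

## the average of the random values...
mean(demo.values)

## ...should be close to 1/rate
1 / demo.rate
```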

Since I tend to think more about means than rates, I like to define the mean like this...

In [None]:
exp.mean <- 2

...and then pass `1/exp.mean`, which is the rate, to `dexp()`, like this...

In [None]:
curve(dexp(x=x, rate=1/exp.mean), 
      from=0,
      to=10,
      n=101,
      col="orange",
      lwd=20)

BAM! Now let's save that curve as a PDF.

In [None]:
pdf("my_exp_curve.pdf") # define the new file and its name

curve(dexp(x=x, rate=1/exp.mean), 
      from=0,
      to=10,
      n=101,
      col="orange",
      lwd=20)

dev.off() # let R know that we are done telling it what goes in the new file

# BAM!

**NOTE:** If we want to draw a uniform distribution, we would do things similarly except we would use `dunif()` instead of `dnorm()`. Other distributions are similarly named, with **d** preceding an abbreviation for the distribution.
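
For example, here is a minimal sketch of drawing a uniform distribution that goes from 0 to 1. `dunif()` takes `min` and `max` instead of a mean and standard deviation. The range, color and the name `unif.density` are just illustrative choices.

```r
## draw a uniform distribution that goes from 0 to 1.
## NOTE: we draw a little past both edges so we can
## see the density drop to 0 outside of the range.
curve(dunif(x=x, min=0, max=1),
      from=-0.5,
      to=1.5,
      n=101,
      col="darkgreen",
      lwd=5)

## the density is 0 outside of the range and 1 inside of it
unif.density <- dunif(x=c(-0.25, 0.5, 1.25), min=0, max=1)
unif.density
```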

Now let's learn how to fit a statistical distribution to a histogram.

-----

# Fitting a statistical distribution to a histogram

In order to fit a statistical distribution to a histogram, we first have to import the data that we will use to draw the histogram. In this example, we'll use the `spend_n_save.txt` dataset that we have used in previous chapters.

In [None]:
## First, use read.delim() to read the data in "spend_n_save.txt"
spend.n.save.df <- read.delim(file="https://raw.githubusercontent.com/StatQuest/sigs/refs/heads/main/chapter_01/spend_n_save.txt", sep="\t")

## Verify that read.delim() was successful by printing out the
## first few rows with the head() function
print("The first few rows of data in spend.n.save.df")
head(spend.n.save.df)

Now, just as we have done in the previous chapter, let's draw a histogram of the number of apples for sale at each store with the `hist()` function.

In [None]:
hist(spend.n.save.df$num.apples)

Now, because the histogram looks a little bit like a jagged normal distribution, we'll try to fit a normal distribution to it. That means we need to calculate the mean and the standard deviation of the data. Specifically, since we have all of the measurements for the entire population, we need to calculate the population mean...

In [None]:
pop.mean <- mean(spend.n.save.df$num.apples)
pop.mean

...and the population standard deviation...

In [None]:
## calculate the population standard deviation 

## First, let's just calculate the population variance...
numerator <- sum((spend.n.save.df$num.apples - pop.mean)^2) # = sigma(x - mu)^2
denominator <- nrow(spend.n.save.df) # = N

pop.var <- numerator / denominator

## print out the population variance
print(paste("population variance:", pop.var))

## now we can calculate the population standard deviation
## by taking the square root of the population variance.
pop.sd <- sqrt(pop.var)

## print out the population standard deviation
print(paste("population sd:", pop.sd))

Bam!

**NOTE:** If we didn't have all of the measurements for the entire population, we would just use the estimated mean and the estimated standard deviation. In that case, our job would actually be easier since we could just use the `sd()` function to calculate the estimated standard deviation.
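
To make the difference concrete, here is a small sketch that uses made-up numbers (not the `spend_n_save.txt` data). `sd()` divides by `n - 1`, while the population standard deviation divides by `n`, so we can convert one into the other by multiplying by `sqrt((n - 1) / n)`. The names `apples`, `n`, `pop.sd.by.hand` and `est.sd` are illustrative.

```r
## made-up measurements, just for illustration
apples <- c(12, 18, 20, 22, 25, 31)
n <- length(apples)

## the population standard deviation (divide by n)
pop.sd.by.hand <- sqrt(sum((apples - mean(apples))^2) / n)
pop.sd.by.hand

## the estimated standard deviation (sd() divides by n - 1)
est.sd <- sd(apples)
est.sd

## rescaling the estimated standard deviation
## gives us the population standard deviation
est.sd * sqrt((n - 1) / n)
```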

Anyway, now that we have the mean and standard deviation, we also need to determine the range of x-axis values we want the normal distribution to span. We can find the minimum x-axis value with the `min()` function...

In [None]:
min.val <- min(spend.n.save.df$num.apples)
min.val

...and we can find the maximum x-axis value with the `max()` function...

In [None]:
max.val <- max(spend.n.save.df$num.apples)
max.val

Now that we have the mean, standard deviation, minimum x-axis value, and the maximum x-axis value, we have everything we need to make a call to the `curve()` function. However, since we want the normal distribution to overlap the histogram, the first thing we need to do is call the `hist()` function again to redraw it. This time, when we call `hist()`, we'll set `freq=FALSE` so that the columns represent the density of the values. This will ensure that the columns are on the same scale as the density function we use to draw the normal distribution.

Then we'll call the `curve()` function like we did earlier in this tutorial. However, this time we'll also include `add=TRUE` to the list of parameters we are passing in so that the curve that it draws is added to the histogram. If `add=FALSE`, which is the default setting, then the curve will be drawn separately.

**NOTE:** Both the `hist()` function and the `curve()` function must be called in the same block of code; otherwise, Jupyter will attempt to draw them as two separate graphs.

In [None]:
## First draw the histogram
hist(spend.n.save.df$num.apples, freq=FALSE)

## Now draw the normal curve over it
## NOTE: The color for the normal curve is in
## hexadecimal format where the first two digits
## represent the values for Red, the second two
## digits represent the values for Green, the third
## two digits represent the values for Blue, and the
## last two digits represent the "alpha" or, how opaque
## the color should be. In this case, we're setting the
## alpha to 88 so that the color is semi-transparent. This means
## that we'll be able to see the parts of the histogram
## that are under the normal curve.
curve(dnorm(x=x, mean=pop.mean, sd=pop.sd), 
      from=min.val, 
      to=max.val, 
      n=101, 
      col="#225ea888", 
      lwd=20, 
      add=TRUE)

# Double BAM!!
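
If hexadecimal color codes feel cryptic, here is a minimal sketch of building the same semi-transparent color with the `rgb()` function, which takes the red, green, blue and alpha values in that order. Writing the values as `0x22`, `0x5e`, and so on tells R that they are hexadecimal numbers. The name `curve.col` is just an illustrative name.

```r
## build the color from its red, green, blue and alpha components.
## NOTE: maxColorValue=255 means each component ranges from 0 to 255.
curve.col <- rgb(red=0x22, green=0x5e, blue=0xa8, alpha=0x88,
                 maxColorValue=255)

## print out the hexadecimal version of the color
curve.col

## go the other way: break a hexadecimal color into its components
col2rgb("#225ea888", alpha=TRUE)
```

We could then pass `col=curve.col` to `curve()` instead of typing out the hexadecimal string.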

Now that we know how to fit a statistical distribution to a histogram, let's learn how to calculate probabilities with statistical distributions.

----

# Calculating probabilities with statistical distributions

If we want to calculate probabilities from a normal distribution, we use the `pnorm()` function, which, by default, returns the area under the curve from the left edge of the distribution (or negative infinity if there is no left edge) to a given x-axis coordinate.

For example, if we wanted to use the distribution that we just fit to the `spend.n.save.df$num.apples` histogram to calculate the probability of walking into a store with 10 or fewer apples for sale, we would call `pnorm()` and set `q=10`, where **q** is short for quantile (personally, I think it would be great if we could, instead, set `x=10`, since we're passing in an x-axis coordinate, but that's just me). We would also set `mean=pop.mean` and `sd=pop.sd`. And if we wanted to use a different normal distribution, then we would just specify different values for `mean` and `sd`.

**NOTE:** If we want to calculate the probability associated with the exponential distribution, we would do things similarly except we would use `pexp()` instead of `pnorm()`. Other distributions are similarly named, with **p** preceding an abbreviation for the distribution.

In [None]:
pnorm(q=10, mean=pop.mean, sd=pop.sd)

Bam! The result tells us that there is close to a 2% chance that we could walk into a random store and see, at most, 10 apples for sale.

Now let's calculate the probability of walking into a store that has 15 or fewer apples for sale by setting `q=15`.

In [None]:
pnorm(q=15, mean=pop.mean, sd=pop.sd)

Bam! The result tells us that there is close to a 16% chance that we could walk into a random store and see, at most, 15 apples for sale.

Now let's turn things around and calculate the probability of walking into a random store and seeing 30 or *more* apples for sale. This means we have to set `lower.tail=FALSE`. This will calculate the area under the curve from the x-axis coordinate, set with `q=30`, to the right edge of the distribution. And if the distribution doesn't have a right edge, it calculates the area under the curve from the x-axis coordinate to +infinity.

In [None]:
pnorm(q=30, mean=pop.mean, sd=pop.sd, lower.tail=FALSE)

The result tells us that there is close to a 2% chance that we could walk into a random store and see, *at least*, 30 apples for sale.
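
Here is a minimal sketch of two related calculations. First, it shows that setting `lower.tail=FALSE` gives the same answer as subtracting the lower tail from 1. Second, it calculates the probability of seeing between 15 and 30 apples by subtracting one lower tail from another. So that the sketch runs on its own, it uses `demo.mean` and `demo.sd`, which are made-up stand-ins for `pop.mean` and `pop.sd`.

```r
## made-up stand-ins for pop.mean and pop.sd
demo.mean <- 20
demo.sd <- 5

## the probability of 30 or more, calculated two ways
p.upper <- pnorm(q=30, mean=demo.mean, sd=demo.sd, lower.tail=FALSE)
p.complement <- 1 - pnorm(q=30, mean=demo.mean, sd=demo.sd)
all.equal(p.upper, p.complement)

## the probability of a value between 15 and 30 is the area to
## the left of 30 minus the area to the left of 15
p.between <- pnorm(q=30, mean=demo.mean, sd=demo.sd) -
  pnorm(q=15, mean=demo.mean, sd=demo.sd)
p.between
```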

# TRIPLE BAM!!! 

Now let's learn how to generate random numbers from a statistical distribution.

----

# BONUS Generating random numbers from statistical distributions

In order to generate random numbers from a normal distribution, we use the `rnorm()` function, where **r** stands for **random**. `rnorm()` is like `dnorm()` and `pnorm()` except that now, instead of specifying x-axis coordinates, we specify the number of random values we want. For example, if we want 5 random values, we set `n=5`.

**NOTE:** If we want to generate random numbers from an exponential distribution, we would do things similarly except we would use `rexp()` instead of `rnorm()`. Other distributions are similarly named, with **r** preceding an abbreviation for the distribution.

In [None]:
set.seed(42) # first, set the seed so that the results are reproducible.

## now generate 5 random values from the
## normal distribution fit to the histogram
rand.values <- rnorm(n=5, mean=pop.mean, sd=pop.sd)
rand.values
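
To see what `set.seed()` buys us, here is a small sketch that draws random values three times. Resetting the seed before a draw gives us the exact same values again, while drawing without resetting the seed gives us new values. The mean of 20 and standard deviation of 5 are made-up values, and `first.draw`, `second.draw` and `third.draw` are illustrative names.

```r
## set the seed and draw 5 random values
set.seed(42)
first.draw <- rnorm(n=5, mean=20, sd=5)

## reset the seed to the same value and draw again
set.seed(42)
second.draw <- rnorm(n=5, mean=20, sd=5)

## the two draws are identical
identical(first.draw, second.draw)

## now draw again WITHOUT resetting the seed
third.draw <- rnorm(n=5, mean=20, sd=5)

## this time the values are different
identical(first.draw, third.draw)
```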

Now, just for fun, we can calculate the estimated mean from that sample...

In [None]:
est.mean <- mean(rand.values)
est.mean

Lastly, let's add a vertical line at the estimated mean to our histogram with the overlapping normal distribution.

In [None]:
## First draw the histogram
hist(spend.n.save.df$num.apples, freq=FALSE)

## Now draw the normal curve over it
curve(dnorm(x=x, mean=pop.mean, sd=pop.sd), 
      from=min.val, 
      to=max.val, 
      n=101, 
      col="#225ea888", 
      lwd=20, 
      add=TRUE)

## now draw a vertical line at the estimated mean
abline(v=est.mean, col="red", lwd=5)

And we see that our estimated mean is to the right of the population mean (the highest point on the normal curve). For more fun, try increasing the sample size to see if the estimated mean gets closer to the highest point on the normal curve.
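
Here is one way to try that, sketched with made-up values for the mean and standard deviation (`demo.mean` and `demo.sd`) so that it runs on its own. It calculates the estimated mean for a few different sample sizes and prints how far each one is from the true mean. Because the values are random, the estimates will not necessarily improve at every single step, but larger samples tend to land closer.

```r
set.seed(42) # set the seed so that the results are reproducible

## made-up stand-ins for pop.mean and pop.sd
demo.mean <- 20
demo.sd <- 5

## the sample sizes we want to try
sample.sizes <- c(5, 50, 500, 5000)

## for each sample size, generate random values
## and calculate the estimated mean
est.means <- sapply(sample.sizes, function(n) {
  mean(rnorm(n=n, mean=demo.mean, sd=demo.sd))
})

## print out the estimated means and how far
## each one is from the true mean
data.frame(n=sample.sizes,
           est.mean=est.means,
           abs.error=abs(est.means - demo.mean))
```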

# BONUS BAM!

----