# [The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)
## Chapter 03 - Saving Time and Money with Probability Distributions and Models!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Draw statistical distributions and save the images as **PDF**s.
- Fit a statistical distribution to a histogram.
- Calculate probabilities from statistical distributions.
- Generate random numbers from statistical distributions.

**NOTE:**
This tutorial assumes that you have installed **[Python](https://www.python.org/)** and read Chapter 3 in **[The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)**.

----

Since we're using Python, the first thing we do is load in some modules that will help us load data, do math, and plot graphs.

In [None]:
import pandas as pd # to import data into a dataframe
import numpy as np # to generate sequences of numbers
import seaborn as sns # to draw graphs and have them look somewhat nice
import statistics # to calculate mean and standard deviation
from scipy.stats import norm, expon # to generate y-axis coordinates for distributions

# Drawing a statistical distribution

The first thing we need to do to draw a statistical distribution is generate an array of x-axis coordinates that span the range that we want to draw. We do this with the `np.arange()` function, which generates a sequence of numbers from a starting point up to, but not including, an ending point. For a normal distribution with mean = 0 and standard deviation = 1, we'll create a sequence of x-axis coordinates from -5 to 5, with a step size = 0.1. Because `np.arange()` excludes the `stop` value, we set `stop=5.1` so that 5 is included. This will create a sequence of 101 values equally spaced between -5 and 5.

In [None]:
## create an array of x-axis coordinates
x_axis = np.arange(start=-5, 
                   stop=5.1, 
                   step=0.1)

## print out the first 10 values
x_axis[:10]
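
One detail worth checking is that `np.arange()` leaves out the `stop` value, which is why the cell above uses `stop=5.1` rather than `stop=5`. Because floating-point step sizes can make the exact number of values hard to predict, `np.linspace()` is a common alternative that asks for the number of points directly. The sketch below compares the two. The variable names `x_arange` and `x_linspace` are only for illustration.

```python
import numpy as np

# np.arange() stops *before* the stop value, so stop=5.1 is used to include 5
x_arange = np.arange(start=-5, stop=5.1, step=0.1)

# np.linspace() includes both endpoints and takes the number of points directly
x_linspace = np.linspace(start=-5, stop=5, num=101)

# both approaches should give the same number of values...
print(len(x_arange), len(x_linspace))

# ...and the same values, up to floating-point rounding
print(np.allclose(x_arange, x_linspace))
```

Either function works for drawing a curve. `np.linspace()` is just a little more predictable when the exact number of points matters.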

Next we need to determine the y-axis coordinates that correspond to each value in `x_axis`. For a normal distribution, we get the y-axis coordinates with the `norm.pdf()` function that we imported from `scipy.stats`. Here, `norm` refers to the **normal distribution** and `pdf` stands for **probability density function**. The word **density** is used because the curve will tell us where the probabilities are most dense.

Anyway, `norm.pdf()` takes three arguments: `x` needs to be an array of x-axis coordinates, `loc` is short for **location** and is the mean of the distribution, and `scale` is the standard deviation. For the normal distribution, the mean is called a location parameter because it determines the location of the center, or highest point, of the distribution. The standard deviation is called a scale parameter because it scales the height of the distribution. The larger the standard deviation, the smaller the height.

In this example, the mean will be 0 and the standard deviation will be 1.

In [None]:
## create an array of y-axis coordinates that correspond to each value in x_axis
## NOTE: 'loc' = location = the mean. This is because the mean is also called
##       a "location parameter" since the mean determines the location of the
##       center, or highest point, of the distribution.

##       'scale' = standard deviation. This is because the standard deviation is
##       also called a "scale parameter". This is because the standard deviation
##       scales the height of the distribution. The larger the standard deviation
##       the smaller the height.
y_axis = norm.pdf(x=x_axis, 
                  loc=0, 
                  scale=1)

## print out the first 10 values
y_axis[:10]
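
As a sanity check on what `norm.pdf()` returns, the sketch below compares it to the normal density formula written out by hand, and then confirms that a larger standard deviation gives a shorter peak. The helper function `normal_density()` is hypothetical and only exists for this illustration.

```python
import numpy as np
from scipy.stats import norm

def normal_density(x, mean, sd):
    """Normal density written out by hand (illustrative helper):
    exp(-(x - mean)^2 / (2 * sd^2)) / (sd * sqrt(2 * pi))"""
    return np.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

x_axis = np.arange(start=-5, stop=5.1, step=0.1)

# norm.pdf() should agree with the hand-written formula
by_hand = normal_density(x_axis, mean=0, sd=1)
from_scipy = norm.pdf(x=x_axis, loc=0, scale=1)
print(np.allclose(by_hand, from_scipy))

# the peak is at the mean, and a larger standard deviation makes it shorter
peak_sd_1 = norm.pdf(x=0, loc=0, scale=1)
peak_sd_2 = norm.pdf(x=0, loc=0, scale=2)
print(peak_sd_1, peak_sd_2)
```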

Now that we have both the x-axis coordinates and the corresponding y-axis coordinates for a normal distribution we can draw them with the `sns.lineplot()` function.

In [None]:
sns.lineplot(x=x_axis, y=y_axis)

Now, before we move on, I want to point out that we can change the color of the normal distribution with `color`, and we can specify the thickness of the line with `linewidth`.

In [None]:
sns.lineplot(x=x_axis, y=y_axis,
             color='green',
             linewidth=10)

Thus, we can draw a normal distribution with three steps:

- Define the x-axis coordinates with `np.arange()`
- Get the corresponding y-axis coordinates with `norm.pdf()`, imported from `scipy.stats`
- Plot the values with `sns.lineplot()`.

## Saving a graph as a PDF

Saving a graph as a PDF is just like saving a histogram as a PDF, which we did in the Python coding tutorial for Chapter 2. The first thing we need to do is save the graph in a variable, so we'll save the last graph in a variable called `my_graph`.

In [None]:
my_graph = sns.lineplot(x=x_axis, y=y_axis,
                        color='green',
                        linewidth=10)

Then, just like we did in the tutorial for Chapter 2, we extract the figure from the variable with `get_figure()` and save the figure as a PDF with `savefig()`.

In [None]:
## extract the figure from our graph...
fig = my_graph.get_figure()
## Save the figure as a PDF file
fig.savefig('normal_curve.pdf') 

# BAM!

Now that we know how to draw a graph of a normal curve and then save it to a PDF, let's learn how to do the same thing with the **Exponential Distribution**.

Drawing an exponential distribution curve is just like drawing a normal distribution, except instead of using `norm.pdf()` to get the y-axis coordinates, we use `expon.pdf()`, which we imported from `scipy.stats`. Also, `expon` stands for **exponential distribution** and `pdf` stands for **probability density function**.

The big difference between calling `norm.pdf()` and `expon.pdf()` is that instead of specifying the mean and standard deviation, we just specify the mean. Also, in contrast to `norm.pdf()`, which uses the mean to define the location of the center of the distribution, `expon.pdf()` uses the mean to scale the height of the distribution. In this example, we'll set the mean to 2 with `exp_mean = 2` and then use that to scale the height of the exponential distribution by setting `scale=exp_mean`.

In [None]:
## first, create a sequence of x-axis values
x_exp = np.arange(start=0,
                  stop=10,
                  step=0.1)

## now define the mean value for our exponential distribution
exp_mean = 2

## lastly, get the corresponding y-axis coordinates for each value in x_exp
y_exp = expon.pdf(x_exp, scale=exp_mean)
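
To see what `scale` does for the exponential distribution, the sketch below checks `expon.pdf()` against the density written out by hand, (1 / mean) * exp(-x / mean), and confirms that the mean of the distribution equals `scale`. The variable names `by_hand` and `from_scipy` are only for illustration.

```python
import numpy as np
from scipy.stats import expon

x_exp = np.arange(start=0, stop=10, step=0.1)
exp_mean = 2

# exponential density written out by hand: (1 / mean) * exp(-x / mean)
by_hand = (1 / exp_mean) * np.exp(-x_exp / exp_mean)
from_scipy = expon.pdf(x_exp, scale=exp_mean)
print(np.allclose(by_hand, from_scipy))

# the curve starts at a height of 1 / mean, so a larger mean gives a shorter curve
print(from_scipy[0])

# and the mean of the distribution is the value we passed in as scale
print(expon.mean(scale=exp_mean))
```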

Now that we have the x and y-axis coordinates for our exponential distribution, we can plot them with `sns.lineplot()`.

In [None]:
my_exp_graph = sns.lineplot(x=x_exp, y=y_exp,
                            color='orange',
                            linewidth=10)

BAM! Now let's save that curve as a PDF.

In [None]:
## extract the figure from our graph...
fig = my_exp_graph.get_figure()
## Save the figure as a PDF file
fig.savefig('exp_curve.pdf') 

# BAM!

**NOTE:** If we want to draw a uniform distribution, we would do things similarly except we would use `uniform.pdf()` from `scipy.stats`. Likewise, we could import other distributions from `scipy.stats` and use them instead.
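
For instance, a minimal sketch of the uniform case might look like the following. In `scipy.stats`, `uniform` covers the range from `loc` to `loc + scale`, so `scale` is the width of the distribution rather than its upper limit. The limits 2 and 6 are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import uniform

lower = 2  # arbitrary lower limit, for illustration only
upper = 6  # arbitrary upper limit, for illustration only

x_unif = np.arange(start=0, stop=8.1, step=0.1)

# loc = the lower limit, scale = the width (NOT the upper limit)
y_unif = uniform.pdf(x_unif, loc=lower, scale=upper - lower)

# inside the limits the height is 1 / width...
inside = uniform.pdf(4, loc=lower, scale=upper - lower)
# ...and outside the limits the height is 0
outside = uniform.pdf(7, loc=lower, scale=upper - lower)
print(inside, outside)
```

The arrays `x_unif` and `y_unif` could then be passed to `sns.lineplot()` just like the coordinates for the other distributions.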

Now let's learn how to fit a statistical distribution to a histogram.

----

# Fitting a statistical distribution to a histogram

In order to fit a statistical distribution to a histogram, we first have to import the data that we'll use to draw the histogram. In this example, we'll use the `spend_n_save.txt` dataset that we have used in previous chapters.

In [None]:
## First, use pd.read_csv() to read the data in "spend_n_save.txt"
spend_n_save_df = pd.read_csv("https://raw.githubusercontent.com/StatQuest/sigs/refs/heads/main/chapter_01/spend_n_save.txt", sep="\t")

## rename num.apples to num_apples so it's easier to access the values
spend_n_save_df.rename(columns={'num.apples': 'num_apples'}, inplace=True)

# Verify that read_csv() was successful by printing out the first few rows
spend_n_save_df.head()

Now, just as we have done in the previous chapter, let's draw a histogram of the number of apples for sale at each store with the `sns.histplot()` function.

In [None]:
## Calculate bin edges using a specific rule by setting the 'bins' parameter
## Other options include 'sturges', 'sqrt', and several more.
my_hist = sns.histplot(data=spend_n_save_df, x='num_apples', bins='scott')

Now, because the histogram looks a little bit like a jagged normal distribution, we'll try to fit a normal distribution to it. That means we need to calculate the mean and the standard deviation of the data. Specifically, since we have all of the measurements for the entire population, we need to calculate the population mean...

In [None]:
## Calculate the mean of the number of apples for sale
## because we are using the data from every single store
## we are calculating the Population Mean
## Anyway, we'll save the value in a variable called pop_mean
pop_mean = statistics.mean(spend_n_save_df.num_apples)

## print out the population mean
pop_mean

...and the population standard deviation...

In [None]:
## calculate the population standard deviation and save it in 
## a variable called pop_sd
pop_sd = statistics.pstdev(spend_n_save_df.num_apples)

## print out the population standard deviation
pop_sd

Bam!

**NOTE:** If we didn't have all of the measurements for the entire population, we would just use the estimated mean and the estimated standard deviation.
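
The difference between the population and estimated standard deviations comes down to the denominator: `statistics.pstdev()` divides by n, while `statistics.stdev()` divides by n - 1. The sketch below uses a small made-up list of values (not the apple data) to show the two side by side, along with the equivalent NumPy calls.

```python
import statistics
import numpy as np

# made-up measurements, for illustration only
values = [12, 15, 18, 20, 22, 25, 28]

# population standard deviation: divides by n
pop_sd = statistics.pstdev(values)

# estimated (sample) standard deviation: divides by n - 1
est_sd = statistics.stdev(values)

print(pop_sd, est_sd)

# NumPy gives the same results through the ddof argument
print(np.std(values, ddof=0), np.std(values, ddof=1))
```

Because it divides by a smaller number, the estimated standard deviation is always a little larger than the population version calculated from the same values.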

Anyway, now that we have the mean and standard deviation, we also need to determine the range of x-axis values we want the normal distribution to span. We can find the minimum x-axis value with the `min()` method of the `num_apples` column...

In [None]:
## extract the minimum value
min_val = spend_n_save_df['num_apples'].min()

## print out the minimum value
min_val

...and we can find the maximum x-axis value with the `max()` method...

In [None]:
## extract the maximum value
max_val = spend_n_save_df['num_apples'].max()

## print out the maximum value
max_val

Now that we have the mean, standard deviation, minimum x-axis value, and the maximum x-axis value, we have everything we need to determine the x and y-axis values for a normal distribution fit to our data. So, the first thing we do is generate a sequence of x-axis values...

In [None]:
x_axis = np.arange(start=min_val,
                   stop=max_val,
                   step=0.1)

x_axis[:10]

...then we calculate the corresponding y-axis values for a normal distribution fit to the data...

In [None]:
y_axis = norm.pdf(x=x_axis,
                  loc=pop_mean, 
                  scale=pop_sd)

y_axis[:10]

Now that we have the x and y-axis values for a normal distribution fit to the data, we have everything we need to draw it. However, since we want the normal distribution to overlap the histogram, the first thing we need to do is call the `histplot()` function again to redraw it. This time, when we call `histplot()`, we'll set `stat='density'` so that the columns represent the density of the values. This will ensure that the columns are on the same scale as the density function we use to draw the normal distribution.

Anyway, we save the histogram in a variable, `my_hist`, so that we can draw a normal curve on top of it. To do this, we call `my_hist.plot()` with the x and y-axis values for the normal distribution fit to the data.

In [None]:
## First draw the histogram
my_hist = sns.histplot(data=spend_n_save_df,
                       x='num_apples',
                       bins='scott',
                       stat='density',
                       color='grey')

## Now draw the normal curve over it
## NOTE: The color for the normal curve is in
## hexadecimal format where the first two digits
## represent the value for Red, the second two
## digits represent the value for Green, the third
## two digits represent the value for Blue, and the
## last two digits represent the "alpha", or how opaque
## the color should be. In this case, we're setting the
## alpha to 88 so that the color is semi-transparent. This means
## that we'll be able to see the parts of the histogram
## that are under the normal curve.
my_hist.plot(x_axis, y_axis,
             color='#225ea888',
             linewidth=10)
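
To see why `stat='density'` puts the histogram on the same scale as the curve, it helps to know that density bars are scaled so that their total area is 1, just like the area under a probability density function. The sketch below checks this with `np.histogram()`, which has a comparable `density` option. The data are simulated, and the mean of 20 and standard deviation of 5 are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated values standing in for real data (illustrative assumption)
fake_data = rng.normal(loc=20, scale=5, size=1000)

# density=True rescales the bar heights so that the total area is 1
heights, bin_edges = np.histogram(fake_data, bins='scott', density=True)
bin_widths = np.diff(bin_edges)

# total area of the bars = sum of (height * width)
total_area = np.sum(heights * bin_widths)
print(total_area)
```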

# Double BAM!!

Now that we know how to fit a statistical distribution to a histogram, let's learn how to calculate probabilities with statistical distributions.

----

# Calculating probabilities with statistical distributions

If we want to calculate probabilities from a normal distribution, we use the `norm.cdf()` function, which returns the area under a normal curve from negative infinity to a user-specified x-axis coordinate. **cdf** stands for **cumulative distribution function** and reflects the fact that, for any specific x-axis coordinate, `norm.cdf()` returns the probability of anything happening to the left of it.

For example, if we wanted to use the distribution that we just fit to the `num_apples` histogram to calculate the probability of walking into a store with 10 or fewer apples for sale, we would call `norm.cdf()` and set `x=10`, where **x** is the x-axis coordinate we are interested in, and we would also set `loc=pop_mean` and `scale=pop_sd`. And if we wanted to use a different normal distribution, then we would just specify different values for `loc` and `scale`.

**NOTE:** If we want to calculate the probability associated with the exponential distribution, we would do things similarly except we would use `expon.cdf()` instead of `norm.cdf()`. Likewise, other distributions would just require us to use their corresponding `cdf()` methods.

In [None]:
norm.cdf(x=10, loc=pop_mean, scale=pop_sd)

Bam! The result tells us that there is close to a 2% chance that we could walk into a random store and see, at most, 10 apples for sale.
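
One way to convince ourselves that `norm.cdf()` really returns the area under the curve is to approximate that area numerically and compare the two values. The sketch below does this by adding up the areas of a lot of thin trapezoids. Since it can't rely on the dataset being loaded, it assumes a mean of 20 and a standard deviation of 5 purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# assumed parameters, for illustration only (not computed from the data)
mean = 20
sd = 5

# approximate the area under the curve to the left of 10 with trapezoids,
# starting far enough to the left that the missing tail is negligible
x = np.linspace(start=mean - 10 * sd, stop=10, num=100001)
y = norm.pdf(x=x, loc=mean, scale=sd)
approx_area = np.sum((y[:-1] + y[1:]) / 2 * np.diff(x))

# compare to the area that norm.cdf() returns
exact_area = norm.cdf(x=10, loc=mean, scale=sd)
print(approx_area, exact_area)
```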

Now let's calculate the probability of walking into a store that has 15 or fewer apples for sale by passing in 15 as the x-axis coordinate.

In [None]:
norm.cdf(15, loc=pop_mean, scale=pop_sd)

Bam! The result tells us that there is close to a 16% chance that we could walk into a random store and see, at most, 15 apples for sale.

Now let's turn things around and calculate the probability of walking into a random store and seeing 30 or *more* apples for sale. In other words, we want to calculate the probability of anything that could occur to the *right* of a specific x-axis coordinate.

To calculate the probability of observing something to the *right* of a specified x-axis coordinate, we have to remember that the total area under the curve is 1 and `norm.cdf()` only calculates cumulative probabilities to the *left* of a specified x-axis coordinate. Thus, we calculate the area of something happening to the right of the x-axis coordinate by subtracting the area to the left from 1.

In [None]:
1 - norm.cdf(x=30, loc=pop_mean, scale=pop_sd)

The result tells us that there is close to a 2% chance that we could walk into a random store and see, *at least*, 30 apples for sale.
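
Two related calculations follow from the same idea. `scipy.stats` also provides `norm.sf()`, the survival function, which returns the area to the *right* of an x-axis coordinate directly. And the probability of landing *between* two x-axis coordinates is the difference between two `norm.cdf()` values. The sketch below assumes a mean of 20 and a standard deviation of 5 purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# assumed parameters, for illustration only (not computed from the data)
mean = 20
sd = 5

# area to the right of 30, calculated two ways
right_by_subtraction = 1 - norm.cdf(x=30, loc=mean, scale=sd)
right_by_sf = norm.sf(x=30, loc=mean, scale=sd)
print(right_by_subtraction, right_by_sf)

# area between 10 and 30 = (area to the left of 30) - (area to the left of 10)
between = norm.cdf(x=30, loc=mean, scale=sd) - norm.cdf(x=10, loc=mean, scale=sd)
print(between)
```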

# TRIPLE BAM!!! 

Now let's learn how to generate random numbers from a statistical distribution.

----

# BONUS! Generating random numbers from statistical distributions

In order to generate random numbers from a normal distribution, we use the `norm.rvs()` function, where **rvs** stands for **random variates**. `norm.rvs()` is like `norm.pdf()` and `norm.cdf()` except that now, instead of specifying x-axis coordinates, we specify the number of random values we want. For example, if we want 5 random values, we set `size=5`.

**NOTE:** If we want to generate random numbers from an exponential distribution, we would do things similarly except we would use `expon.rvs()` instead of `norm.rvs()`. Likewise, other distributions would just require us to use their corresponding `rvs()` methods.

In [None]:
np.random.seed(42) # first, set the seed so that the results are reproducible.

## now generate 5 random values from the
## normal distribution fit to the histogram
rand_values = norm.rvs(size=5, loc=pop_mean, scale=pop_sd)

## print out the random values
rand_values
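
An alternative to setting the global seed with `np.random.seed()` is to pass a seed directly to `rvs()` through its `random_state` argument, which keeps the reproducibility local to that one call. The sketch below assumes a mean of 20 and a standard deviation of 5 purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# assumed parameters, for illustration only (not computed from the data)
mean = 20
sd = 5

# passing the same integer seed gives the same random values every time
first_draw = norm.rvs(size=5, loc=mean, scale=sd, random_state=42)
second_draw = norm.rvs(size=5, loc=mean, scale=sd, random_state=42)
print(np.array_equal(first_draw, second_draw))

# random_state also accepts a NumPy Generator, which can be
# shared across several calls to get a reproducible sequence of draws
rng = np.random.default_rng(42)
third_draw = norm.rvs(size=5, loc=mean, scale=sd, random_state=rng)
print(third_draw)
```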

Now, just for fun, we can calculate the estimated mean from that sample...

In [None]:
est_mean = statistics.mean(rand_values)

est_mean

Lastly, let's add a vertical line at the estimated mean to our histogram with the overlapping normal distribution.

In [None]:
## First draw the histogram
my_hist = sns.histplot(data=spend_n_save_df,
                       x='num_apples',
                       bins='scott',
                       stat='density',
                       color='grey')

## Now draw the normal curve
my_hist.plot(x_axis, y_axis,
             color='#225ea888',
             linewidth=10)

## now draw a vertical line at the estimated mean
my_hist.axvline(x=est_mean, color='red', linestyle='--', linewidth=2)

And we see that our estimated mean is to the right of the population mean (the highest point on the normal curve). For more fun, try increasing the sample size to see if the estimated mean gets closer to the highest point on the normal curve.
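
The experiment of increasing the sample size could be sketched like this. The loop draws larger and larger samples and records how far each estimated mean is from the true mean. It assumes a mean of 20 and a standard deviation of 5 purely for illustration, and the dictionary `errors` is a hypothetical name used only here.

```python
import numpy as np
from scipy.stats import norm

# assumed parameters, for illustration only (not computed from the data)
mean = 20
sd = 5

# a single Generator so that the whole experiment is reproducible
rng = np.random.default_rng(42)

# distance between the estimated mean and the true mean for each sample size
errors = {}
for sample_size in [5, 50, 500, 5000, 50000]:
    sample = norm.rvs(size=sample_size, loc=mean, scale=sd, random_state=rng)
    est_mean = np.mean(sample)
    errors[sample_size] = abs(est_mean - mean)
    print(sample_size, est_mean, errors[sample_size])
```

Larger samples tend to give estimated means that are closer to the true mean, although any single sample can buck the trend.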

# BONUS BAM!