# [The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)
## Chapter 06 - Making Decisions and Predictions with Linear Regression!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Fit a line to data with the `ols()` function.
- Calculate the Sum of the Squared Residuals and $R^2$.
- Generate a Linear Regression Summary with `results.summary()`.
- Calculate a *p*-value for the $R^2$ using the null hypothesis to build a histogram.
- Calculate an *F*-value and corresponding *p*-value using the Sum of the Squared Residuals.

**NOTE:**
This tutorial assumes that you have installed **[Python](https://www.python.org/)** and read Chapter 6 in **[The StatQuest Illustrated Guide to Statistics](https://www.amazon.com/dp/B0GMP7Z9ZL)**.

----

Since we're using Python, the first thing we do is load in some modules that will help us do math and plot graphs.

In [None]:
import numpy as np # to generate random numbers
import pandas as pd # to use DataFrames
import seaborn as sns # to draw graphs and have them look somewhat nice
import matplotlib.pyplot as plt # give us easy control of the range of values for
                                # the x and y axes.
import statsmodels.formula.api as smf ## to do linear regression with ols()...

# Fitting a line to data with the `ols()` function

If we're going to fit a line to some data, then the first thing we need is some data. In this example, we'll use the dataset illustrated in Chapter 6, which has the number of stores in one column and the revenue in another.

In [None]:
# first, let's put the data into a DataFrame
num_stores = [2, 12, 15]
revenue = [3, 12.5, 7]

data = pd.DataFrame({
    'num_stores': num_stores,
    'revenue': revenue
})

# print out the DataFrame
data

In [None]:
my_plot = sns.scatterplot(x='num_stores',
                          y='revenue',
                          data=data,
                          s=300, # scale the size of the points
                          color='salmon')  # set the color
plt.xlim(0, 16)
plt.ylim(0, 16)

Now that we've plotted our raw data, let's add a linear regression line to it. This requires two steps. First, we determine the slope and y-axis intercept for the linear regression line, and second, we use the slope and y-axis intercept to draw a line on the graph.

We'll start by determining the slope and y-axis intercept.

**NOTE:** In Python there are a ton of ways to do linear regression to determine the slope and y-axis intercept. One very popular method is with the `LinearRegression()` function that is part of the `scikit-learn` package. However, in these tutorials, we will not use the `LinearRegression()` function or many other popular plotting functions, like `regplot()`, because they don't return basic statistics like $R^2$ and its *p*-value. In other words, the `LinearRegression()` method is really only intended for people less interested in basic statistics and `regplot()` is only useful for people less interested in the y-axis intercept and the slope that the regression calculates.

Thus, in order to determine the slope and y-axis intercept, we'll use the `ols()` function that comes with the `statsmodels.formula.api` package, where **ols** stands for **Ordinary Least Squares**, which is the method we used to optimize parameters in the book. Anyway, the `ols()` function will return an object that we can then use to fit a linear regression model to our data and then print out the results.

The `ols()` function can seem a little strange in that the first thing we pass it is called a **formula**. A **formula** has the form...

**Thing we want to predict ~ Variables we use to make predictions**

...where the **Thing we want to predict** is on the left side of a **~** character, and the **Variables we use to make predictions** are on the right side. In this example, that means we will use `revenue ~ num_stores` as the formula, because we want to use `num_stores` to predict `revenue`.

Anyway, when using the `ols()` function, the other important thing to do is pass in the data, which we do with `data=data`, since all of our data is stored in a DataFrame called `data`.

**NOTE:** One bonus we get for using `ols()` from `statsmodels.formula.api` is that the formula that we use in Python is the same as the formula that we would use if we were doing this in `R`. In other words, by using the formula notation, we can, in theory, go back and forth between these two programming languages using the same notation.
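
To make the formula a little more concrete, here is a minimal sketch of the calculation that `revenue ~ num_stores` asks for: find the y-axis intercept and the slope that minimize the Sum of the Squared Residuals. It uses only NumPy and the textbook closed-form least squares solution for a line, so it illustrates the idea and is not a description of how `ols()` works internally. The variable names (`x`, `y`, `slope`, `intercept`) are made up for this sketch.

```python
import numpy as np

# the same data used in this notebook
x = np.array([2, 12, 15])     # num_stores
y = np.array([3, 12.5, 7])    # revenue

# closed-form least squares estimates for a line:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
x_centered = x - x.mean()
y_centered = y - y.mean()

slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

print(intercept, slope)
```

The y-axis intercept should come out to about 2.96 and the slope to about 0.47, which is what we should also see when we fit the model with `ols()`.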

In [None]:
## First, create the model object...
model = smf.ols("revenue ~ num_stores", data=data)

Now that we've created an ordinary least squares model called `model`, we can fit it to our data...

In [None]:
## Now fit the model to the data (optimize the 2 parameters,
## the y-axis intercept and the slope)
results = model.fit()

...and then print out the values for the y-axis intercept and the slope with `results.params`.

In [None]:
results.params

**NOTE:** We can also print out the individual parameters by using their names...

In [None]:
results.params.Intercept

In [None]:
results.params.num_stores

Now, if we want to add the linear regression line to our graph of the data, we can do that with the `axline()` method of the plot that `scatterplot()` returns. `axline()` makes this super easy: we pass it one point on the line with `xy1`, which here is the y-axis intercept at x = 0, plus the `slope`, and it will take care of all the details associated with drawing the line.

In [None]:
## First, draw the data...
my_plot = sns.scatterplot(x='num_stores',
                          y='revenue',
                          data=data,
                          s=300, # scale the size of the points
                          color='salmon')  # set the color

plt.xlim(0, 16)
plt.ylim(0, 16)

## ...now add the regression line.
my_plot.axline(xy1=(0, results.params.Intercept),
               slope=results.params.num_stores, 
               color='deepskyblue', 
               linestyle='-',
               linewidth=10)

# BAM!

Now that we have a graph of our data and its corresponding linear regression line, let's learn how to interpret the statistics associated with the linear regression.

----

# Generating a Linear Regression Summary with `summary()`

Earlier, when we fit the model and saved the output in `results`, we specifically printed out the y-axis intercept and the slope. However, `results` contains a lot more data that we can access by using its `summary()` method.

**NOTE:** This will print out a lot of stuff, some of which we talk about in the book, some of which we don't. We'll focus on the stuff in the book, which, in my opinion, is the important stuff.

In [None]:
## Now print out the associated statistics.
results.summary()

As we can see, the output from `summary()` is pretty extensive. However, right now we just want to know the $R^2$ value and its associated *p*-value. We can find the value for $R^2$ in the upper right hand corner where it says `R-squared` and see that it is 0.449. The corresponding *p*-value is a few rows down where it says `Prob (F-statistic)`, which refers to how the *p*-value is traditionally calculated, and is 0.533.

Using the standard threshold for statistical significance, 0.05, we would fail to reject the Null Hypothesis that there is no relationship between the number of stores a company has and its revenue. That's a little bit of a bummer, but it could be that there is a relationship and we just don't have enough data to be confident in saying so.

Anyway, `summary()` returns a lot of stuff, and it's not always super easy to find what we need. However, the good news is that we can access individual values, like the $R^2$ value or the *p*-value, directly.

For example, to access the $R^2$ value directly, we would use the following command...

In [None]:
results.rsquared

Likewise, we can access parameter values (the y-axis intercept and the slope), which are also called coefficients, with the following command:

In [None]:
## summary() returns 3 tables
summary_stuff = results.summary()
## print out the second one, it has 
## everything related to the coefficients
summary_stuff.tables[1]

As we can see, when we print out the coefficients, we get all kinds of information in addition to the values for the y-axis intercept and the slope, which are in the first column labeled **coef**. The next two columns, **std err** and **t**, are not super interesting right now, but the fourth column, **P>|t|**, is. This column gives us the *p*-value for each parameter, testing the Null Hypothesis that the parameter value is actually 0.

For example, the estimated y-axis intercept that we calculated from the data is 2.9622. However, the *p*-value that tests the hypothesis that it is actually equal to 0 is 0.699. This tells us that even though our estimate for the y-axis intercept is non-zero, we can't be confident that the true value is.

**NOTE:** Although the `summary()` function always returns the *p*-value for the y-axis intercept, it's not used that often. Generally speaking, we don't really care what the y-axis intercept is. What is interesting, however, is the slope and its *p*-value, because this tells us if we are confident (or not) about a relationship between the two variables we measured. In this case, that would tell us if there is a relationship between Number of Stores and Revenue.

So, in this example, when we look at the *p*-value for **num_stores**, the variable that contains the number of stores, we get 0.533, and this tells us that we fail to reject the Null Hypothesis that the slope is 0. In other words, we can't be confident that using our linear regression line is any better than just using the mean value for Revenue (which is what we would use if the slope were 0).

If we want to access just the *p*-value for **num_stores** directly (and exclude the *p*-value for the y-axis intercept), we have two options: 1) We can use the name of the variable like this...

In [None]:
results.pvalues.num_stores

...or, 2) we can use `iloc[1]`.

In [None]:
results.pvalues.iloc[1]

Anyway, now that we know how to do a linear regression with `ols()` and access and interpret the most important results, let's try to calculate some of these values by hand. We'll start by calculating $R^2$.

----

# Calculating the Sum of the Squared Residuals (SSRs) and $R^2$

Even though the `ols()` function calculated $R^2$ for us, it's also helpful to know how to calculate it by hand. So, let's start with the equation for $R^2$.

<span style="font-size: 24px;">
$R^2 = \frac{\textrm{SSR(mean)} - \textrm{SSR(fit)}}{\textrm{SSR(mean)}}$
</span>

Where SSR(mean) is the Sum of the Squared Residuals around the mean y-axis value, which, in this example, is Revenue, and SSR(fit) is the sum of the squared residuals around the fitted line. We'll start by calculating SSR(mean) and, more specifically, by calculating the mean value for Revenue:

In [None]:
## calculate mean revenue
mean_revenue = np.mean(data['revenue'])

## print out the mean revenue
mean_revenue

Now that we have the mean value for Revenue, we can calculate the Residuals around the mean by subtracting the mean from each Revenue value.

In [None]:
# Calculate the residuals from mean
mean_residuals = data['revenue'] - mean_revenue

## print out the residuals
mean_residuals

Now let's square each Residual:

In [None]:
## Square the residuals
mean_residuals_squared = mean_residuals ** 2

## print out the squared residuals
mean_residuals_squared

Now we just need to add up the squared residuals. We'll do this by passing `mean_residuals_squared` to the `np.sum()` function:

In [None]:
## Add up the squared residuals
ssr_mean = np.sum(mean_residuals_squared)

## Print out the SSR(mean)
print(ssr_mean)

Bam.

Now let's calculate SSR(fit). **NOTE:** We can extract the residuals directly from `results` or we can calculate them by hand. Here, we'll show you how to do it both ways.

We'll start by looking at the residuals stored in `results`.

In [None]:
results.resid

Now let's calculate the residuals by hand and compare our results.

To calculate the residuals by hand, we'll need the y-axis intercept and the slope of the linear regression line. However, we learned how to access those when we first drew the regression line on our graph. So, that's no problem.

Next, given the y-axis intercept and the slope, we can predict the revenue for each company in our dataset by multiplying the number of stores in `data.num_stores` by the slope and then adding the y-axis intercept.

In [None]:
fit_predictions = results.params.Intercept + (results.params.num_stores * data.num_stores)

## print out the predicted values
fit_predictions

Next, we calculate the residuals by subtracting the predicted values from the observed Revenue values in `data.revenue`.

In [None]:
fit_residuals = data.revenue - fit_predictions

## print out the residuals
fit_residuals

Bam! We just calculated the residuals around the fitted line by hand. Now let's compare those to the residuals stored in `results.resid`...

In [None]:
results.resid

...and we see that, either way we get the residuals, we get the same thing. In other words, when we calculated the residuals by hand, we didn't make a mistake.

Now let's finish calculating the SSR(fit) by squaring the residuals...

In [None]:
## Square the residuals
fit_residuals_squared = fit_residuals ** 2

## print out the squared residuals
fit_residuals_squared

...and then adding up the squared residuals.

In [None]:
ssr_fit = np.sum(fit_residuals_squared)
ssr_fit

Now that we have calculated SSR(mean) and SSR(fit), we can calculate
<span style="font-size: 18px;">
$R^2 = \frac{\textrm{SSR(mean)} - \textrm{SSR(fit)}}{\textrm{SSR(mean)}}$
</span>

In [None]:
# r squared calculation
r_squared = (ssr_mean - ssr_fit) / ssr_mean

## print out r_squared
r_squared

BAM!

Now let's compare that to the value stored in `results.rsquared`...

In [None]:
# r squared from summary
results.rsquared

...and we see that we got the same value, so we must have done all the math right.
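
If you find yourself calculating $R^2$ by hand more than once, the steps above can be bundled into one small function. This is just a minimal sketch that uses only NumPy: the function name `r_squared_by_hand()` and the `_check` variables are made up for this example, and `np.polyfit()` stands in for `ols()` to get the y-axis intercept and the slope.

```python
import numpy as np

def r_squared_by_hand(x, y):
    """Return (SSR(mean), SSR(fit), R-squared) for a line fit to x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # SSR(mean): squared residuals around the mean of y
    ssr_mean = np.sum((y - y.mean()) ** 2)

    # fit a line; polyfit returns the slope first, then the intercept
    slope, intercept = np.polyfit(x, y, deg=1)

    # SSR(fit): squared residuals around the fitted line
    ssr_fit = np.sum((y - (intercept + slope * x)) ** 2)

    return ssr_mean, ssr_fit, (ssr_mean - ssr_fit) / ssr_mean

# try it out on the data from this notebook
ssr_mean_check, ssr_fit_check, r_squared_check = r_squared_by_hand(
    [2, 12, 15], [3, 12.5, 7])

print(r_squared_check)
```

For our data, this should give us the same $R^2$ that we calculated one step at a time, which is about 0.449.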

# BAM!

Now let's learn how we can calculate a *p*-value for $R^2$ with a histogram.

----

# Calculating a *p*-value for the $R^2$ with a histogram

Now that we know how to fit a linear regression line to data with `ols()` and calculate the $R^2$ value with `summary()` (and also by hand), let's learn how we can calculate a *p*-value using a histogram. This requires us to repeat the following steps a lot of times:

- Generate random data
- Fit a line to the data with `ols()`
- Calculate the $R^2$ value for that fit with `rsquared`
- Store the $R^2$ value in an array

Once we have an array of $R^2$ values calculated from random datasets, we pass it to `histplot()` to see how they are distributed and then calculate a *p*-value by seeing how many of the "random" $R^2$ values are greater than the one for our original dataset. We'll start by generating the "random" $R^2$ values with the following code (**NOTE:** It might take a minute or so for this code to run).

In [None]:
# since we're going to generate random datasets,
# let's start by setting the seed so that the results
# are reproducible
np.random.seed(2)

# To generate random datasets, we'll use two
# normal distributions, one for the number of stores
# and one for the revenue. These distributions
# will be based on our observed data, so we
# need to calculate their estimated
# means and standard deviations.
mean_num_stores = data['num_stores'].mean()
sd_num_stores = data['num_stores'].std()

mean_revenue = data['revenue'].mean()
sd_revenue = data['revenue'].std()

# Next, we define the number of random
# datasets we want to create...
num_rand_datasets = 10_000

# ...and we define the number of data points
# per dataset
num_datapoints = len(data)

# Create an empty array that is num_rand_datasets long
rand_r_squared = np.empty(num_rand_datasets)

# Here is the loop where we create a bunch of random datasets,
# each with num_datapoints values, fit a linear regression
# line to the random data, then calculate and store
# the R-squared values
for i in range(num_rand_datasets):
    
    ## generate random values for the number of stores
    rand_num_stores = np.random.normal(loc=mean_num_stores,
                                       scale=sd_num_stores,
                                       size=num_datapoints)

    ## generate random values for the revenue
    rand_revenue = np.random.normal(loc=mean_revenue,
                                    scale=sd_revenue,
                                    size=num_datapoints)

    ## bundle the random values together in a DataFrame
    rand_data = pd.DataFrame({
        'rand_num_stores': rand_num_stores,
        'rand_revenue': rand_revenue})
    
    ## fit a linear regression line to the random data
    rand_model = smf.ols("rand_revenue ~ rand_num_stores", data=rand_data)
    rand_results = rand_model.fit()
    
    ## save the R-squared value.
    rand_r_squared[i] = rand_results.rsquared

Now let's draw a histogram of the $R^2$ values with the `histplot()` function...

In [None]:
sns.histplot(data=rand_r_squared)

...and calculate the *p*-value as the proportion of "random" $R^2$ values greater than or equal to the one we got for our original data.

In [None]:
# the number of randomly generated rsquared >= the original rsquared
num_greater = np.sum(rand_r_squared >= results.rsquared)

# calculate the p-value 
p_value = num_greater / num_rand_datasets

# print out the p-value
p_value

Thus, the *p*-value calculated with the histogram is 0.5318. Now let's compare that to the *p*-value stored in `results`...

In [None]:
# print the p-value for the slope coefficient (2nd coefficient)
results.pvalues.num_stores

So, at last, we see that the two *p*-values are essentially the same.
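
The loop above is easy to read, but fitting 10,000 models one at a time is slow. As an optional alternative, here is a sketch that leans on a fact about linear regression with a single predictor: $R^2$ is equal to the squared Pearson correlation between the two variables. That lets us calculate all of the "random" $R^2$ values at once with NumPy arrays. **NOTE:** This sketch uses `np.random.default_rng()` instead of `np.random.seed()`, so its random numbers, and therefore its *p*-value, will differ slightly from the one above. The names ending in `_fast` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([2, 12, 15])     # num_stores
y = np.array([3, 12.5, 7])    # revenue

# R-squared for the observed data (the squared Pearson correlation)
observed_r_squared = np.corrcoef(x, y)[0, 1] ** 2

num_rand_datasets = 10_000
num_datapoints = len(x)

# each row is one random dataset; ddof=1 matches the pandas std() default
rand_x = rng.normal(x.mean(), x.std(ddof=1),
                    size=(num_rand_datasets, num_datapoints))
rand_y = rng.normal(y.mean(), y.std(ddof=1),
                    size=(num_rand_datasets, num_datapoints))

# center each row, then calculate the squared correlation for every row
rand_x_c = rand_x - rand_x.mean(axis=1, keepdims=True)
rand_y_c = rand_y - rand_y.mean(axis=1, keepdims=True)
rand_r_squared_fast = (np.sum(rand_x_c * rand_y_c, axis=1) ** 2 /
                       (np.sum(rand_x_c ** 2, axis=1) *
                        np.sum(rand_y_c ** 2, axis=1)))

# the proportion of random R-squared values >= the observed one
p_value_fast = np.mean(rand_r_squared_fast >= observed_r_squared)

print(p_value_fast)
```

The result should land close to the *p*-value we got from the loop, but it won't be exactly the same because the random datasets are different.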

# BAM!

----

# BONUS: Calculating an *F*-value and *p*-value using the Sum of the Squared Residuals

The equation for *F* is...

<span style="font-size: 24px;">
$F = \frac{[\textrm{SSR(mean)} - \textrm{SSR(fit)}] / (p_\textrm{fit} - p_\textrm{mean})}
    {\textrm{SSR(fit)} / (n - p_\textrm{fit})}$
</span>

...so, all we have to do to calculate *F* is plug in the SSR(mean), the SSR(fit), $p_{\textrm{fit}}$, the number of parameters required for the fitted line, which is **2** (one for the slope and one for the y-axis intercept), $p_{\textrm{mean}}$, the number of parameters required for the mean, which is **1** (the y-axis intercept) and *n*, the number of datapoints, which is **3**.

In [None]:
F_numerator = (ssr_mean - ssr_fit) / (2 - 1)
F_denominator = ssr_fit / (3 - 2)

F = F_numerator / F_denominator
F

Now let's see if that matches the value in `results.fvalue`...

In [None]:
results.fvalue

...and it does! Now let's convert that *F*-value into a *p*-value. We do this with the `f.cdf()` function.

In [None]:
## first, import 'f'
from scipy.stats import f

## now calculate the p-value (see notes below)
p_value_f = 1-f.cdf(x=F, dfn=(2 - 1), dfd=(3 - 2))
## NOTE: dfn = degrees of freedom in the numerator, which is DF1
##       dfd = degrees of freedom in the denominator, which is DF2
##
## ALSO NOTE: To calculate the probability of observing something to the *right* 
## of a specified x-axis coordinate, we have to remember that the total area under 
## the curve is 1 and `f.cdf()` only calculates cumulative probabilities to the 
## *left* of a specified x-axis coordinate. Thus, we calculate the area of 
## something happening to the right of the x-axis coordinate by subtracting 
## the area to the left from 1.

p_value_f

Now let's check to see if the *p*-value we calculated by hand matches the value stored in `results.pvalues.num_stores`...

In [None]:
# print the p-value for the slope coefficient (2nd coefficient)
results.pvalues.num_stores

...and it does! Both are equal to **0.5327**.
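
One more connection that can be handy: because $R^2$ and *F* are both built from SSR(mean) and SSR(fit), dividing the top and the bottom of the equation for *F* by SSR(mean) lets us rewrite *F* as $\frac{R^2}{1 - R^2} \times \frac{n - p_\textrm{fit}}{p_\textrm{fit} - p_\textrm{mean}}$. The sketch below checks that both versions give the same number for our data. It uses only NumPy, with `np.polyfit()` standing in for `ols()`, and the variable names are made up for this example.

```python
import numpy as np

x = np.array([2, 12, 15])     # num_stores
y = np.array([3, 12.5, 7])    # revenue

n, p_fit, p_mean = len(y), 2, 1

# fit a line; polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)

ssr_mean_demo = np.sum((y - y.mean()) ** 2)
ssr_fit_demo = np.sum((y - (intercept + slope * x)) ** 2)

# F calculated straight from the sums of the squared residuals
F_from_ssr = (((ssr_mean_demo - ssr_fit_demo) / (p_fit - p_mean)) /
              (ssr_fit_demo / (n - p_fit)))

# F rewritten in terms of R-squared
r2 = (ssr_mean_demo - ssr_fit_demo) / ssr_mean_demo
F_from_r2 = (r2 / (1 - r2)) * ((n - p_fit) / (p_fit - p_mean))

print(F_from_ssr, F_from_r2)
```

Both values should be about 0.81, which means that once we know $R^2$, *n* and the number of parameters, we have everything we need to calculate *F*.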

# BONUS BAM!!!

-----