# The StatQuest Illustrated Guide to Statistics
## Chapter 07 - Using More Variables to Make Predictions with Multiple Regression!!!!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Use the `ols()` function for Multiple Regression.
- Use random data to create a histogram and a *p*-value for Multiple Regression.
- Understand how $R^2$ and Adjusted $R^2$ change when we add bogus variables to a model.
- Calculate an *F*-value that compares predictions from a fitted plane to a fitted line.

**NOTE:**
This tutorial assumes that you have installed **[Python](https://www.python.org/)** and read Chapter 7 in **[The StatQuest Illustrated Guide to Statistics]()**.

----

Since we're using Python, the first thing we do is load in some modules that will help us do math and plot graphs.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns # to draw graphs and have them look somewhat nice
import matplotlib.pyplot as plt # give us easy control of the range of values for
                                # the x and y axes.
import statsmodels.formula.api as smf ## to do linear regression with ols()...

# Using the `ols()` function for Multiple Regression

If we're going to fit a shape to some data, then the first thing we need is some data. In this example, we'll use the dataset illustrated in Chapter 7, which has the number of stores, the number of products, and the revenue for 5 companies.

In [None]:
## first, let's put the data into a DataFrame
num_stores = np.array([2, 12, 15, 9, 8])
num_products = np.array([1, 9, 7, 8, 7])
revenue = np.array([3, 12.5, 7, 11, 9])

data = pd.DataFrame({
    "num_stores": num_stores,
    "num_products": num_products,
    "revenue": revenue
})

data

Now that we have the data all packed up neatly in a DataFrame, we can pass it to the `ols()` function and then use `fit()` to fit a shape to it. Specifically, because we're using two variables, `num_stores` and `num_products`, to predict `revenue`, we'll fit a plane to the data.

**NOTE:** Just like in the last tutorial, we have to give the `ols()` function a **formula**. In this case, because we are using `num_stores` and `num_products` to predict `revenue`, the formula is `revenue ~ num_stores + num_products`.

In [None]:
## create a multiple regression model and save it in mr_plane,
## where mr = multiple regression.
mr_plane = smf.ols("revenue ~ num_stores + num_products", data=data)

Now that we've created an ordinary least squares model called `mr_plane`, we can fit it to our data...

In [None]:
## Now fit the model to the data (optimize the 3 parameters,
## the y-axis intercept and coefficients for num_stores and
## num_products)
mr_results = mr_plane.fit()
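
To get a feel for what "fitting" means here, the following is a minimal sketch that solves the same least squares problem with plain NumPy instead of `ols()`. It is not how `statsmodels` is implemented internally, just an illustration; the data is copied in so the block runs on its own, and the names `X` and `coefs` are only for this example.

```python
import numpy as np

# Same data as above (copied here so the sketch runs on its own)
num_stores = np.array([2, 12, 15, 9, 8])
num_products = np.array([1, 9, 7, 8, 7])
revenue = np.array([3, 12.5, 7, 11, 9])

# Design matrix: a column of 1s for the y-axis intercept,
# then one column for each independent variable
X = np.column_stack([np.ones(len(revenue)), num_stores, num_products])

# Least squares: find the 3 parameter values that minimize
# the sum of the squared residuals
coefs, _, _, _ = np.linalg.lstsq(X, revenue, rcond=None)
intercept, slope_stores, slope_products = coefs

print(intercept, slope_stores, slope_products)
```

The three printed values should agree with the coefficients that `summary()` reports for `mr_results`.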

Now, like we did in the last tutorial, we can calculate the $R^2$ and its *p*-value, and a ton of other things, with the `summary()` method.

In [None]:
mr_results.summary()

The summary tells us the $R^2$ for the plane fit to the data is 0.9657. The corresponding *p*-value for the $R^2$ value is listed as `Prob (F-statistic)`, and it is equal to 0.0343. So, in this case, we can reject the Null Hypothesis that Revenue predictions derived from the plane are no different from predictions derived from just the mean value of the Revenue.

Bam.

**NOTE:** Because we did multiple regression, the `Adj. R-squared`, which is right below the `R-squared` bit, is also of interest. In this case, the value is quite high, 0.931, suggesting that we have not included a bunch of useless variables in our model.
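
As a quick sanity check, here is a small sketch of the usual Adjusted $R^2$ formula applied to the numbers reported above. The function name `adjusted_r_squared` is made up for this example, and `p` counts every parameter in the model, including the y-axis intercept.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for a model with p parameters
    (including the intercept) fit to n data points."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p)


# R-squared = 0.9657, n = 5 companies, p = 3 parameters
# (the intercept plus slopes for num_stores and num_products)
adj = adjusted_r_squared(r_squared=0.9657, n=5, p=3)
print(round(adj, 3))
```

The result should land very close to the 0.931 in the summary; any tiny difference comes from plugging in a rounded $R^2$.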

Now let's talk about the coefficients, the parameter values for the y-axis intercept and the slopes associated with the independent variables (the ones we are using to make predictions), `num_stores` and `num_products`. The coefficients, and related statistics, are in the middle of the output generated by `summary()`. However, they will be easier to talk about if we just print them out without everything else. We can do that with the following command.

In [None]:
## summary() returns 3 tables
summary_stuff = mr_results.summary()
## print out the second one, it has
## everything related to the coefficients
summary_stuff.tables[1]

Like when we did simple linear regression, the first column gives us the parameter values: the y-axis intercept, 1.8716, the slope for `num_stores`, -0.3743, and the slope for `num_products`, 1.5738. Also, like before, the next two columns, **std err** and **t**, are not super interesting. However, the fourth column, **P>|t|**, is very interesting. It tells us the *p*-value for each parameter, testing the Null Hypothesis that the specific parameter value is actually 0.

For example, the *p*-value for `num_stores`, 0.152, is for the Null Hypothesis that the slope for `num_stores` is 0. In this case, we can't reject the Null Hypothesis. In other words, we could try omitting `num_stores` from the regression, and just do a simple linear regression with `num_products`, and we might not see a huge difference in predictions.

In contrast, the *p*-value for `num_products` is 0.026, which tells us that we can reject the Null Hypothesis that this parameter is 0. This suggests that if we omitted `num_products` from the regression, then the simple linear regression that only used `num_stores` would make much worse predictions.
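
For the curious, here is a rough sketch of how per-parameter *p*-values like these can be reproduced by hand with NumPy and SciPy, using the standard least squares formulas for standard errors and *t*-values. It is only an illustration under those textbook formulas, not a description of what `statsmodels` does internally, and the data is copied in so the block runs on its own.

```python
import numpy as np
from scipy import stats

# Same data as above (copied here so the sketch runs on its own)
num_stores = np.array([2, 12, 15, 9, 8])
num_products = np.array([1, 9, 7, 8, 7])
revenue = np.array([3, 12.5, 7, 11, 9])

# Design matrix: intercept column, then the two independent variables
X = np.column_stack([np.ones(len(revenue)), num_stores, num_products])
n, p = X.shape  # n = 5 data points, p = 3 parameters

# Fit the plane with least squares
coefs = np.linalg.lstsq(X, revenue, rcond=None)[0]

# Estimate the variance of the residuals
residuals = revenue - X @ coefs
sigma_sq = np.sum(residuals ** 2) / (n - p)

# Standard error for each parameter
std_err = np.sqrt(sigma_sq * np.diag(np.linalg.inv(X.T @ X)))

# t-values and two-sided p-values (n - p degrees of freedom)
t_values = coefs / std_err
p_values = 2 * stats.t.sf(np.abs(t_values), df=n - p)

for name, pv in zip(["Intercept", "num_stores", "num_products"], p_values):
    print(name, round(pv, 3))
```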

**NOTE:** Unlike when we did the simple linear regression, the *p*-values for all of the parameters (coefficients) are different from the *p*-value for the entire model, 0.0343, which is in the upper right hand corner of the summary. The *p*-value for the entire model is based on the Null Hypothesis that the parameters for both `num_stores` *and* `num_products` are 0. In contrast, each *p*-value in the coefficients section is based on the Null Hypothesis that one specific parameter is 0.
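
To make the whole-model *p*-value a little less mysterious, here is a small sketch that converts the reported $R^2$ into an *F*-value (comparing the plane, with 3 parameters, to the mean, with 1 parameter) and then into a *p*-value. The variable names are just for this example, and the $R^2$ is the rounded value from the summary.

```python
from scipy import stats

r_squared = 0.9657  # R-squared reported by summary()
n = 5               # number of data points
p_plane = 3         # parameters in the plane (intercept + 2 slopes)
p_mean = 1          # parameters in the mean (just the intercept)

# F compares the variation explained per extra parameter
# to the variation left over per remaining degree of freedom
F_model = (r_squared / (p_plane - p_mean)) / ((1 - r_squared) / (n - p_plane))

# area under the F-distribution to the right of F_model
p_model = stats.f.sf(F_model, dfn=p_plane - p_mean, dfd=n - p_plane)

print(round(p_model, 4))
```

The printed value should be very close to the `Prob (F-statistic)` in the summary.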

Now that we know how to do a multiple regression in Python and interpret the output, let's learn how we can calculate a *p*-value for the $R^2$ value with a histogram.

----

# Using random data to create a histogram and a p-value for Multiple Regression

Just like we did in the last tutorial, we can calculate a *p*-value for an $R^2$ value with a histogram. The only difference is that now we have to generate random data for more variables.

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
np.random.seed(42)

## To generate random datasets, we'll use
## normal distributions. These distributions
## will be based on our observed data, so we
## need to calculate their estimated
## means and standard deviations.
mean_num_stores = data["num_stores"].mean()
sd_num_stores = data["num_stores"].std(ddof=1)

mean_num_products = data["num_products"].mean()
sd_num_products = data["num_products"].std(ddof=1)

mean_revenue = data["revenue"].mean()
sd_revenue = data["revenue"].std(ddof=1)

## Next, we define the number of random
## datasets we want to create...
num_rand_datasets = 10_000

## ...and we define the number of data points
## per dataset
num_datapoints = len(data)

## Create an empty array that is num_rand_datasets long
rand_r_squared = np.empty(num_rand_datasets)

## Here is the loop where we create a bunch of random datasets,
## each with num_datapoints values, fit a multiple regression
## plane to the random data, then calculate and store
## the R-squared values
for i in range(num_rand_datasets):

    ## generate random values for each variable
    rand_num_stores = np.random.normal(loc=mean_num_stores,
                                       scale=sd_num_stores,
                                       size=num_datapoints)

    rand_num_products = np.random.normal(loc=mean_num_products,
                                         scale=sd_num_products,
                                         size=num_datapoints)

    rand_revenue = np.random.normal(loc=mean_revenue,
                                    scale=sd_revenue,
                                    size=num_datapoints)

    # bundle into a DataFrame
    rand_data = pd.DataFrame({
        "rand_num_stores": rand_num_stores,
        "rand_num_products": rand_num_products,
        "rand_revenue": rand_revenue
    })

    ## fit a multiple regression to the random data
    ## and calculate R-squared
    rand_model = smf.ols("rand_revenue ~ rand_num_stores + rand_num_products", data=rand_data)
    rand_results = rand_model.fit()

    ## save the R-squared value.
    rand_r_squared[i] = rand_results.rsquared

Now let's draw a histogram of the $R^2$ values with the `histplot()` function...

In [None]:
sns.histplot(data=rand_r_squared)

...and calculate the *p*-value as the proportion of "random" $R^2$ values greater than or equal to the one we got for our original data.

In [None]:
# the number of randomly generated rsquared >= the original rsquared
num_greater = np.sum(rand_r_squared >= mr_results.rsquared)

# calculate the p-value
p_value = num_greater / num_rand_datasets

# print out the p-value
p_value

Thus, the *p*-value calculated with the histogram is 0.0336. Now let's compare that to the *p*-value stored in `mr_results`...

In [None]:
mr_results.f_pvalue

And at last, we see that the two *p*-values are essentially the same.
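
If the loop with `ols()` feels slow, here is a sketch of the same idea using only NumPy. It is an alternative, not the method used above: the helper `r_squared()` is made up for this example, and because it uses `np.random.default_rng()` rather than `np.random.seed()`, the simulated *p*-value will be close to, but not exactly the same as, the one we just calculated.

```python
import numpy as np

def r_squared(y, *predictors):
    """R-squared for a least squares fit of y on the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr_fit = np.sum((y - X @ coefs) ** 2)
    ssr_mean = np.sum((y - y.mean()) ** 2)
    return 1 - ssr_fit / ssr_mean

# Same data as above (copied here so the sketch runs on its own)
num_stores = np.array([2, 12, 15, 9, 8])
num_products = np.array([1, 9, 7, 8, 7])
revenue = np.array([3, 12.5, 7, 11, 9])

observed = r_squared(revenue, num_stores, num_products)

rng = np.random.default_rng(42)
num_rand_datasets = 10_000
n = len(revenue)

rand_r_squared = np.empty(num_rand_datasets)
for i in range(num_rand_datasets):
    # random values based on the observed means and standard deviations
    rand_stores = rng.normal(num_stores.mean(), num_stores.std(ddof=1), n)
    rand_products = rng.normal(num_products.mean(), num_products.std(ddof=1), n)
    rand_revenue = rng.normal(revenue.mean(), revenue.std(ddof=1), n)
    rand_r_squared[i] = r_squared(rand_revenue, rand_stores, rand_products)

# proportion of random R-squared values >= the observed one
p_value = np.mean(rand_r_squared >= observed)
print(p_value)
```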

# Double BAM!!

Now let's see how $R^2$ and Adjusted $R^2$ change when we add bogus variables to a model.

----

# Understanding how $R^2$ and Adjusted $R^2$ change when we add bogus variables to a model

First, let's add a bogus variable to our existing data.

In [None]:
## make a copy of the original DataFrame
big_data = data.copy()

## add a bogus variable to big_data
## this variable, bogus_1, just has values
## randomly selected from a normal distribution
## with mean=0 and sd=1.
np.random.seed(1)
big_data["bogus_1"] = np.random.normal(loc=0, scale=1, size=len(data))

## print out big_data
big_data

Now that we have the data, which now includes a bogus variable, we can perform a multiple regression on it. This means we have to pass the `ols()` function a **formula** that includes all of the possible independent variables: `revenue ~ num_stores + num_products + bogus_1`.

In [None]:
## do multiple regression to predict revenue using
## num_stores, num_products and bogus_1
big_model = smf.ols("revenue ~ num_stores + num_products + bogus_1", data=big_data)
big_results = big_model.fit()

## print out the summary of the multiple regression
big_results.summary()

So, when we include the additional variable that is not at all correlated with Revenue, we see that, due to random chance, the value for $R^2$ goes up a little compared to before (0.971 vs. 0.966), suggesting that maybe the new predictions will, in general, be better than before. However, the adjusted $R^2$ goes down when we include the uncorrelated variable (0.883 vs 0.931), which suggests that maybe the predictions won't be better.
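
Here is a small NumPy-only sketch of the same experiment. It shows that $R^2$ can't go down when we add a variable to a least squares fit, while Adjusted $R^2$ charges a penalty for the extra parameter. The helper `r2_and_adjusted()` is made up for this example, and the bogus values come from `np.random.default_rng(1)`, so they are not the same random numbers as `bogus_1` above and the exact values will differ.

```python
import numpy as np

def r2_and_adjusted(y, X):
    """Return (R-squared, Adjusted R-squared) for a least squares
    fit of y on the columns of X (X includes the intercept column)."""
    n, p = X.shape
    coefs = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr_fit = np.sum((y - X @ coefs) ** 2)
    ssr_mean = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr_fit / ssr_mean
    adj = 1 - (1 - r2) * (n - 1) / (n - p)
    return r2, adj

# Same data as above (copied here so the sketch runs on its own)
num_stores = np.array([2, 12, 15, 9, 8])
num_products = np.array([1, 9, 7, 8, 7])
revenue = np.array([3, 12.5, 7, 11, 9])

# a bogus variable that has nothing to do with revenue
rng = np.random.default_rng(1)
bogus = rng.normal(loc=0, scale=1, size=len(revenue))

X_small = np.column_stack([np.ones(len(revenue)), num_stores, num_products])
X_big = np.column_stack([X_small, bogus])

r2_small, adj_small = r2_and_adjusted(revenue, X_small)
r2_big, adj_big = r2_and_adjusted(revenue, X_big)

print("R-squared:         ", r2_small, "->", r2_big)
print("Adjusted R-squared:", adj_small, "->", adj_big)
```

Whether Adjusted $R^2$ goes up or down depends on the random values, but $R^2$ itself never decreases.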

# TRIPLE BAM!!!

----

# BONUS: Calculate an *F*-value to compare predictions from a fitted plane to a fitted line

As a bonus, let's see if our fitted plane makes significantly better predictions than a straight line that just uses `num_stores` to predict `revenue`.

So, the first thing we need to do is a linear regression that just uses `num_stores` to predict `revenue`.

In [None]:
## First, create the model object...
model_line = smf.ols("revenue ~ num_stores", data=data)
results_line = model_line.fit()
results_line.summary()

Now we can draw a graph of the data with the linear regression line on it.

In [None]:
## First, draw the data...
my_plot = sns.scatterplot(x='num_stores',
                          y='revenue',
                          data=data,
                          s=300, # scale the size of the points
                          color='salmon')  # set the color

plt.xlim(0, 16)
plt.ylim(0, 16)

## ...now add the regression line.
my_plot.axline(xy1=(0, results_line.params.Intercept),
               slope=results_line.params.num_stores,
               color='deepskyblue',
               linestyle='-',
               linewidth=10)

Now we can print out the revenue values predicted by the line...

In [None]:
results_line.fittedvalues

...and we can print out the SSR(line)...

In [None]:
ssr_line = results_line.ssr
ssr_line

Likewise, we can print out the values predicted by the plane...

In [None]:
mr_results.fittedvalues

...and we can print out the SSR(plane)...

In [None]:
ssr_plane = mr_results.ssr
ssr_plane

Now that we have SSR(line) and SSR(plane) we can calculate the *F*-value that compares their predictions.

Since I can never remember the equation for *F*, here it is (rewritten ...

<span style="font-size: 24px;">
$F = \frac{[\textrm{SSR(Simpler)} - \textrm{SSR(Fancier)}] / (p_\textrm{Fancier} - p_\textrm{Simpler})}
    {\textrm{SSR(Fancier)} / (n - p_\textrm{Fancier})}$
</span>

...so, all we have to do to calculate *F* is plug in the SSR(line) for SSR(Simpler), the SSR(plane) for SSR(Fancier), $p_{\textrm{Simpler}}$, the number of parameters required for the fitted line, which is **2** (one for the slope and one for the y-axis intercept), $p_{\textrm{Fancier}}$, the number of parameters required for the fitted plane, which is **3** (the y-axis intercept, one for the number of stores and one for the number of products) and *n*, the number of datapoints, which is **5**.

In [None]:
## Calculate the F-value that compares the plane to the line...
F = ((ssr_line - ssr_plane) / (3 - 2)) / (ssr_plane / (5 - 3))
F
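
If you would rather not hardcode the numbers, here is a sketch of the same equation wrapped in a reusable function. The names `ssr()` and `f_value()` are made up for this example, and the SSR values are recomputed with NumPy so the block runs on its own.

```python
import numpy as np

def ssr(y, X):
    """Sum of the squared residuals for a least squares fit of y on X."""
    coefs = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ coefs) ** 2)

def f_value(ssr_simpler, ssr_fancier, p_simpler, p_fancier, n):
    """F-value comparing a simpler model to a fancier model."""
    numerator = (ssr_simpler - ssr_fancier) / (p_fancier - p_simpler)
    denominator = ssr_fancier / (n - p_fancier)
    return numerator / denominator

# Same data as above (copied here so the sketch runs on its own)
num_stores = np.array([2, 12, 15, 9, 8])
num_products = np.array([1, 9, 7, 8, 7])
revenue = np.array([3, 12.5, 7, 11, 9])

ones = np.ones(len(revenue))
X_line = np.column_stack([ones, num_stores])                 # 2 parameters
X_plane = np.column_stack([ones, num_stores, num_products])  # 3 parameters

F_sketch = f_value(ssr_simpler=ssr(revenue, X_line),
                   ssr_fancier=ssr(revenue, X_plane),
                   p_simpler=X_line.shape[1],
                   p_fancier=X_plane.shape[1],
                   n=len(revenue))
print(F_sketch)
```

Reading the parameter counts from the design matrices means the same function works if we add or remove variables.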

Lastly, let's convert that *F*-value into a *p*-value. We do this with the `f.cdf()` function.

In [None]:
## first, import 'f'
from scipy.stats import f

## now calculate the p-value (see notes below)
p_value_f = 1-f.cdf(x=F, dfn=(3 - 2), dfd=(5 - 3))
## NOTE: dfn = degrees of freedom in the numerator, which is DF1
##       dfd = degrees of freedom in the denominator, which is DF2
##
## ALSO NOTE: To calculate the probability of observing something to the *right*
## of a specified x-axis coordinate, we have to remember that the total area under
## the curve is 1 and `f.cdf()` only calculates cumulative probabilities to the
## *left* of a specified x-axis coordinate. Thus, we calculate the area of
## something happening to the right of the x-axis coordinate by subtracting
## the area to the left from 1.

p_value_f

And the *p*-value, 0.026, tells us to reject the null hypothesis that there is no difference between predictions made with the line compared to predictions made with the plane.
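
As an aside, SciPy also has a survival function, `f.sf()`, that calculates the area to the *right* directly, so we don't have to subtract from 1. The sketch below uses a made-up *F*-value, `F_example`, just to show that the two approaches agree.

```python
from scipy.stats import f

F_example = 10.0  # a made-up F-value, just for illustration

# area to the right, calculated two ways
p_from_cdf = 1 - f.cdf(F_example, dfn=1, dfd=2)
p_from_sf = f.sf(F_example, dfn=1, dfd=2)

print(p_from_cdf, p_from_sf)
```

For very small *p*-values, `sf()` can be a little more accurate than subtracting from 1.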

# BONUS BAM!!!

----