# The StatQuest Illustrated Guide to Statistics
## Chapter 07 - Using More Variables to Make Predictions with Multiple Regression!!!!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Use the `lm()` function for Multiple Regression.
- Use random data to create a histogram and a *p*-value for Multiple Regression.
- Understand how $R^2$ and Adjusted $R^2$ change when we add bogus variables to a model.
- Calculate an *F*-value that compares predictions from a fitted plane to a fitted line.

**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)**, and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**, and have read Chapter 7 in **[The StatQuest Illustrated Guide to Statistics]()**.

----

# Using the `lm()` function for Multiple Regression

If we're going to fit a shape to some data, then the first thing we need is some data. In this example, we'll use the dataset illustrated in Chapter 7, which has the number of stores, the number of products, and the revenue for 5 companies.

In [None]:
## first, let's put the data into a data.frame()
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)

data <- data.frame(num.stores, num.products, revenue)

## print out the data (there are only 5 rows)
data

Now that we have the data all packed up neatly in a data frame, we can pass it to the `lm()` function to fit a shape to it. Specifically, because we're using two variables, `num.stores` and `num.products` to predict `revenue`, we'll fit a plane to the data.

**NOTE:** Just like in the last tutorial, we have to give the `lm()` function a **formula**. In this case, because we are using `num.stores` and `num.products` to predict `revenue`, the formula is `revenue ~ num.stores + num.products`.

In [None]:
## do a multiple regression and save the results in mr.plane,
## where mr = multiple regression.
mr.plane <- lm(revenue ~ num.stores + num.products, data=data)

## print out the coefficients, or parameters, associated with
## the plane
mr.plane

Now, like we did in the last tutorial, we can calculate the $R^2$ and its *p*-value, and a ton of other things, by passing `mr.plane` to the `summary()` function.

In [None]:
## get a summary of the multiple regression
mr.summary <- summary(mr.plane)

## print out the summary
mr.summary

The summary tells us the $R^2$ for the plane fit to the data, `Multiple R-squared:  0.9657`. The name **Multiple R-squared** now makes more sense, since we did multiple regression, but it still just refers to the $R^2$ for the plane. The corresponding *p*-value for the $R^2$ value is in the bottom right-hand corner, `p-value: 0.03433`. So, in this case, we can reject the Null Hypothesis that Revenue predictions derived from the plane are no different from predictions derived from just the mean value of the Revenue.

Bam.

**NOTE:** Because we did multiple regression, the `Adjusted R-squared:  0.9313` is also of interest. In this case, the value is quite high, 0.931, suggesting that we have not included a bunch of useless variables in our model.
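To make the adjustment less mysterious, here is a minimal sketch that recomputes Adjusted $R^2$ by hand. It assumes one common way of writing the formula, $1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where *n* is the number of data points and *k* is the number of variables used to make predictions. The book may arrange the terms differently, and the names `n`, `k`, and `adj.r.squared` are just for illustration. The data is repeated so that the cell runs on its own.

```r
## the same data as above
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)
data <- data.frame(num.stores, num.products, revenue)

mr.summary <- summary(lm(revenue ~ num.stores + num.products, data=data))

n <- nrow(data) # the number of data points
k <- 2          # the number of variables used to make predictions

## Adjusted R-squared calculated by hand (assumed form of the formula)
adj.r.squared <- 1 - (1 - mr.summary$r.squared) * (n - 1) / (n - k - 1)
adj.r.squared

## ...compared to the value stored in the summary
mr.summary$adj.r.squared
```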

Now let's print out the coefficients, the parameter values for the y-axis intercept and the slopes associated with the independent variables (the ones we are using to make predictions), `num.stores` and `num.products`.

In [None]:
mr.summary$coefficients

Like when we did simple linear regression, the first column gives us the parameter values, the y-axis intercept, 1.8715847, the slope for `num.stores`, -0.3743169, and the slope for `num.products`, 1.5737705. Also, like before, the next two columns, **Std. Error** and **t value**, are not super interesting. However, the last column, **Pr(>|t|)**, is very interesting. It tells us the *p*-value for each parameter testing the Null Hypothesis that the specific parameter value is actually 0.

For example, the *p*-value for `num.stores`, 0.15182659, is for the Null Hypothesis that the slope for `num.stores` is 0. In this case, we can't reject the Null Hypothesis. In other words, we could try omitting `num.stores` from the regression, and just do a simple linear regression with `num.products`, and we might not see a huge difference in predictions.

In contrast, the *p*-value for `num.products` is 0.02565899, which tells us that we can reject the Null Hypothesis that this parameter is 0. This suggests that if we omitted `num.products` from the regression, then the simple linear regression that only used `num.stores` would make much worse predictions.
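One way to check this reasoning, sketched here with illustrative variable names, is to fit each simple linear regression and compare its $R^2$ to the one for the plane. With only 5 data points this is a sanity check, not a general rule for dropping variables.

```r
## the same data as above
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)
data <- data.frame(num.stores, num.products, revenue)

## R-squared for the plane that uses both variables
r2.plane <- summary(lm(revenue ~ num.stores + num.products, data=data))$r.squared

## R-squared for a line that only uses num.stores
r2.stores.only <- summary(lm(revenue ~ num.stores, data=data))$r.squared

## R-squared for a line that only uses num.products
r2.products.only <- summary(lm(revenue ~ num.products, data=data))$r.squared

## print them side by side
c(plane=r2.plane, stores.only=r2.stores.only, products.only=r2.products.only)
```

If the reasoning above holds, dropping `num.products` should cost much more $R^2$ than dropping `num.stores`.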

**NOTE:** Unlike when we did the simple linear regression, the *p*-values for all of the parameters (coefficients) are different from the *p*-value for the entire model, 0.03433, which is in the lower right-hand corner of the summary. The *p*-value for the entire model is based on the Null Hypothesis that the parameters for both `num.stores` *and* `num.products` are 0, whereas the *p*-values in the coefficients section are just for when the Null Hypothesis is that one specific parameter is 0.
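To work with these numbers directly, rather than reading them off the printout, we can index the coefficient matrix by row and column name. The sketch below pulls out one *p*-value and then rebuilds the plane's prediction for the first company (2 stores, 1 product) by hand. The names `coefs`, `p.products`, and `by.hand` are just for illustration.

```r
## the same data as above
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)
data <- data.frame(num.stores, num.products, revenue)

mr.plane <- lm(revenue ~ num.stores + num.products, data=data)
coefs <- summary(mr.plane)$coefficients

## the p-value for the num.products slope
p.products <- coefs["num.products", "Pr(>|t|)"]
p.products

## the predicted revenue for the first company,
## which has 2 stores and 1 product
by.hand <- coefs["(Intercept)", "Estimate"] +
    coefs["num.stores", "Estimate"] * 2 +
    coefs["num.products", "Estimate"] * 1
by.hand

## ...which should match the first fitted value
mr.plane$fitted.values[1]
```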

Now that we know how to do a multiple regression in *R* and interpret the output, let's learn how we can calculate a *p*-value for the $R^2$ value with a histogram.

-----

# Using random data to create a histogram and a p-value for Multiple Regression

Just like we did in the last tutorial, we can calculate a *p*-value for a $R^2$ value with a histogram. The only difference is that now we have to generate random data for more variables.

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
set.seed(42)

## To generate random datasets, we'll use
## normal distributions. These distributions
## will be based on our observed data, so we
## need to calculate their estimated
## means and standard deviations.
mean.num.stores <- mean(data$num.stores)
sd.num.stores <- sd(data$num.stores)

mean.num.products <- mean(data$num.products)
sd.num.products <- sd(data$num.products)

mean.revenue <- mean(data$revenue)
sd.revenue <- sd(data$revenue)

## Next, we define the number of random
## datasets we want to create...
num.rand.datasets <- 10000

## ...and we define the number of data points
## per dataset
num.datapoints <- nrow(data)

## Create a vector of NAs that is num.rand.datasets long
rand.r.squared <- rep(NA, times=num.rand.datasets)

## Here is the loop where we create a bunch of random datasets,
## each with num.datapoints values, fit a multiple regression
## plane to the random data, then calculate and store
## the R-squared values
for(i in 1:num.rand.datasets) {

    ## generate random values for each variable
    rand.num.stores <- rnorm(n=num.datapoints,
                             mean=mean.num.stores,
                             sd=sd.num.stores)

    rand.num.products <- rnorm(n=num.datapoints,
                               mean=mean.num.products,
                               sd=sd.num.products)
    
    rand.revenue <- rnorm(n=num.datapoints,
                          mean=mean.revenue,
                          sd=sd.revenue)

    ## bundle the random values together in a data.frame
    rand.data <- data.frame(
        rand.num.stores,
        rand.num.products,
        rand.revenue)
    
    ## fit a multiple regression to the random data
    ## and calculate R-squared
    rand.mr.plane <- summary(lm(rand.revenue ~ rand.num.stores + rand.num.products, data=rand.data))    
    
    ## save the R-squared value.
    rand.r.squared[i] <- rand.mr.plane$r.squared
}

Now let's draw a histogram of the $R^2$ values with the `hist()` function...

In [None]:
hist(rand.r.squared)

...and calculate the *p*-value as the proportion of "random" $R^2$ values greater than or equal to the one we got for our original data.

In [None]:
## the number of randomly generated r.squared >= the original r.squared
num.greater <- sum(rand.r.squared >= mr.summary$r.squared)

## calculate the p-value 
p.value <- num.greater / num.rand.datasets

## print out the p-value
p.value

Thus, the *p*-value calculated with the histogram is 0.0322. Now let's compare that to the *p*-value calculated when we passed `mr.plane` to `summary()`...

**NOTE:** Unlike the last tutorial, the summary has no single element that holds just the *p*-value for the whole model, so here we print the whole thing out.

In [None]:
mr.summary

And at last, we see that the two *p*-values are essentially the same.
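If we want the model's *p*-value as a number, one option is to recompute it from the *F*-statistic that the summary does store. The sketch below assumes that `mr.summary$fstatistic` holds the *F*-value and its two degrees of freedom under the names `value`, `numdf`, and `dendf`, and passes them to `pf()`. The name `model.p.value` is just for illustration.

```r
## the same data as above
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)
data <- data.frame(num.stores, num.products, revenue)

mr.summary <- summary(lm(revenue ~ num.stores + num.products, data=data))

## the F-value and degrees of freedom for the whole model
f.stat <- mr.summary$fstatistic
f.stat

## convert the F-value into a p-value
model.p.value <- unname(pf(f.stat["value"],
                           df1=f.stat["numdf"],
                           df2=f.stat["dendf"],
                           lower.tail=FALSE))
model.p.value
```

This should agree with the `p-value: 0.03433` printed at the bottom of the summary.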

# Double BAM!!

Now let's see how $R^2$ and Adjusted $R^2$ change when we add bogus variables to a model.

-----

# Understanding how $R^2$ and Adjusted $R^2$ change when we add bogus variables to a model

First, let's add a bogus variable to our existing data.

In [None]:
## make a copy of the original data.frame
big.data <- data

## add a bogus variable to big.data
## this variable, bogus.1, just has values
## randomly selected from a normal distribution
## with mean=0 and sd=1.
set.seed(42)
big.data$bogus.1 <- rnorm(n=nrow(data), mean=0, sd=1)

## print out big.data (there are only 5 rows)
big.data

Now that we have the data, which now includes a bogus variable, we can perform a multiple regression on it. This means we have to pass the `lm()` function a **formula**. We can do this two ways. We can follow the approach we used before, where we manually add together the variables we want to use to make predictions, like this, `revenue ~ num.stores + num.products`. Or we can use a shorthand notation where we use a dot, `.`, to represent the sum of all of the columns in the data.frame that we are not predicting. So, in this case, where we are trying to predict revenue, the shorthand version of the formula is `revenue ~ .`
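As a quick check that the two notations mean the same thing, this sketch fits the original data (without the bogus variable) both ways and compares the coefficients. The names `explicit` and `shorthand` are just for illustration.

```r
## the original data, without any bogus variables
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)
data <- data.frame(num.stores, num.products, revenue)

## spell out the variables used to make predictions...
explicit <- lm(revenue ~ num.stores + num.products, data=data)

## ...or let the dot stand for every column other than revenue
shorthand <- lm(revenue ~ ., data=data)

## both fits should give the same coefficients
coef(explicit)
coef(shorthand)
```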

In [None]:
## do multiple regression to predict revenue using all
## of the remaining columns in big.data to participate
## in the prediction
mr.big <- lm(revenue ~ ., data=big.data)

## print out the summary of the multiple regression
mr.big.summary <- summary(mr.big)
mr.big.summary

So, when we include the additional variable that is not at all correlated with Revenue, we see that, due to random chance, the value for $R^2$ goes up a little compared to before (0.9805 vs. 0.9657), suggesting that maybe the new predictions will, in general, be better than before. However, the adjusted $R^2$ goes down when we include the uncorrelated variable (0.9222 vs. 0.9313), which suggests that maybe the predictions won't be better.
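To see the two sets of values side by side without hunting through two printouts, here is a small sketch that collects them in a data.frame. The names `small`, `big`, and `comparison` are just for illustration, and the cell rebuilds the data so that it runs on its own.

```r
## the original data
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)
data <- data.frame(num.stores, num.products, revenue)

## a copy of the data with one bogus variable added
big.data <- data
set.seed(42)
big.data$bogus.1 <- rnorm(n=nrow(data), mean=0, sd=1)

## fit both models
small <- summary(lm(revenue ~ num.stores + num.products, data=data))
big <- summary(lm(revenue ~ ., data=big.data))

## collect R-squared and Adjusted R-squared for both models
comparison <- data.frame(
    model=c("without bogus.1", "with bogus.1"),
    r.squared=c(small$r.squared, big$r.squared),
    adj.r.squared=c(small$adj.r.squared, big$adj.r.squared))
comparison
```

Adding a variable can never make $R^2$ go down, so the `r.squared` column should not decrease from the first row to the second. Note that with only 5 data points, a second bogus variable would give the model 5 parameters, which is generally enough to fit the data perfectly, so this comparison stops at one.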

# TRIPLE BAM!!!

-----

# BONUS: Calculate an *F*-value to compare predictions from a fitted plane to a fitted line

As a bonus, let's see if our fitted plane makes significantly better predictions than a straight line that just uses `num.stores` to predict `revenue`. 

So, the first thing we need to do is a linear regression that just uses `num.stores` to predict `revenue`.

In [None]:
## do a linear regression that only uses num.stores to predict revenue
lr.line <- lm(revenue ~ num.stores, data=data)

## Print out the coefficients
lr.line


Now we can draw a graph of the data with the linear regression line on it.

In [None]:
## First, draw the data...
plot(data$num.stores, data$revenue, 
     xlim=c(0, 15), ylim=c(0, 15), 
     pch=19, cex=4, col="salmon")

## ...now add the regression line.
abline(reg=lr.line, 
       lwd=10, # use a relatively thick line
       col="deepskyblue") # match the color in the book

Now we can print out the revenue values predicted by the line...

In [None]:
## predicted values for the line
lr.line$fitted.values

...and we can use those to calculate the SSR(line)...

In [None]:
## SSR(line)
ssr.line <- sum((data$revenue - lr.line$fitted.values)^2)
ssr.line

Alternatively, we can calculate the SSR(line) with the residuals stored in `lr.line`.

In [None]:
sum(lr.line$residuals^2)

Likewise, we can print out the revenue values predicted by the plane...

In [None]:
## predicted values for the plane
mr.plane$fitted.values

...and use those to calculate the SSR(plane)...

In [None]:
## SSR(plane)
ssr.plane <- sum((data$revenue - mr.plane$fitted.values)^2)
ssr.plane

...or we can calculate the SSR(plane) with the residuals stored in `mr.plane`...

In [None]:
sum(mr.plane$residuals^2)

Now that we have SSR(line) and SSR(plane) we can calculate the *F*-value that compares their predictions.

Since I can never remember the equation for *F*, here it is (rewritten ...

<span style="font-size: 24px;">
$F = \frac{[\textrm{SSR(Simpler)} - \textrm{SSR(Fancier)}] / (p_\textrm{Fancier} - p_\textrm{Simpler})}
    {\textrm{SSR(Fancier)} / (n - p_\textrm{Fancier})}$
</span>

...so, all we have to do to calculate *F* is plug in:

- SSR(line) for SSR(Simpler) and SSR(plane) for SSR(Fancier).
- $p_{\textrm{Simpler}}$, the number of parameters required for the fitted line, which is **2** (one for the slope and one for the y-axis intercept).
- $p_{\textrm{Fancier}}$, the number of parameters required for the fitted plane, which is **3** (one for the y-axis intercept, one for the number of stores, and one for the number of products).
- *n*, the number of data points, which is **5**.

In [None]:
## Calculate the F-value that compares the plane to the line...
f.value <- ((ssr.line - ssr.plane) / (3 - 2)) / (ssr.plane / (5 - 3))
f.value

Lastly, let's convert that *F*-value into a *p*-value. We do this with the `pf()` function.

In [None]:
## Calculate the p-value with f.value and the degrees of freedom...
df1 <- (3 - 2)
df2 <- (5 - 3)
pf(f.value, df1=df1, df2=df2, lower.tail=FALSE)

And the *p*-value, 0.026, tells us to reject the null hypothesis that there is no difference between predictions made with the line compared to predictions made with the plane.
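As a cross-check on the arithmetic, R's built-in `anova()` function can compare two nested models fit with `lm()`. The sketch below is an alternative route to the same comparison, not the method used in the book, and the names `comparison`, `f.from.anova`, and `p.from.anova` are just for illustration.

```r
## the same data as above
num.stores <- c(2, 12, 15, 9, 8)
num.products <- c(1, 9, 7, 8, 7)
revenue <- c(3, 12.5, 7, 11, 9)
data <- data.frame(num.stores, num.products, revenue)

## the simpler model (line) and the fancier model (plane)
lr.line <- lm(revenue ~ num.stores, data=data)
mr.plane <- lm(revenue ~ num.stores + num.products, data=data)

## compare the two models, simpler model first
comparison <- anova(lr.line, mr.plane)
comparison

## the F-value and p-value are in the second row of the table
f.from.anova <- comparison$F[2]
p.from.anova <- comparison$`Pr(>F)`[2]
f.from.anova
p.from.anova
```

The second row of the table should report the same *F*-value and *p*-value that we calculated by hand.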

# BONUS BAM!!!

-----