# The StatQuest Illustrated Guide to Statistics
## Chapter 06 - Making Decisions and Predictions with Linear Regression!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Fit a line to data with the `lm()` function.
- Calculate the Sum of the Squared Residuals and $R^2$.
- Generate a Linear Regression Summary with `summary()`.
- Calculate a *p*-value for the $R^2$ using the null hypothesis to build a histogram.
- Calculate an *F*-value and corresponding *p*-value using the Sum of the Squared Residuals.

**NOTE:**
This tutorial assumes that you have installed **[R](https://cran.rstudio.com/)**, and possibly **[RStudio](https://posit.co/download/rstudio-desktop/)**, and read Chapter 6 in **[The StatQuest Illustrated Guide to Statistics]()**.

----

# Fitting a line to data with the `lm()` function

If we're going to fit a line to some data, then the first thing we need is some data. In this example, we'll use the dataset illustrated in Chapter 6, which has the number of stores in one column and the revenue in another.

In [None]:
## first, let's put the data into a data.frame()
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)

data <- data.frame(
    num.stores,
    revenue)

## print out the data.frame
data

Now, just to verify that the data are what we expect, let's graph them.

In [None]:
## here we are setting the domain and range of the graph
## with xlim() and ylim() so that the graph will
## look similar to what you saw in the book.
## Likewise, we're setting the symbol used for each point to a filled circle by
## setting pch=19 (pch is short for plotting character) 
## and making the filled circle bigger than it would be by default
## by setting cex=4 (cex is short for character expansion).
## Lastly, we set the color of the points with col="salmon".
plot(data$num.stores, # x-axis coordinates
     data$revenue,  # y-axis coordinates
     xlim=c(0, 15), # set the range of x-axis values
     ylim=c(0, 15), # set the range of y-axis values
     pch=19, # set the shape
     cex=4,  # scale the size of the shape
     col="salmon") # set the color

Now that we've plotted our raw data, let's add a linear regression line to it. This requires two steps. First, we determine the slope and y-axis intercept for the linear regression line, and second, we use the slope and y-axis intercept to draw a line on the graph.

We'll start by determining the slope and y-axis intercept with the `lm()` function, where **lm** is short for **linear model** (and **linear regression** is a type of **linear model**). 

The `lm()` function is a little strange in that the first thing we pass it is called a **formula**. In **R**, a **formula** has the form...

**Thing we want to predict ~ Variables we use to make predictions**

...where the **Thing we want to predict** is on the left side of a **~** character, and the **Variables we use to make predictions** are on the right side. In this example, that means we will use `revenue ~ num.stores` as the formula, because we want to use `num.stores` to predict `revenue`.

**NOTE:** In this tutorial on simple linear regression, we are only using a single variable to make predictions, so we won't go into details about how these variables are arranged right now. However, we'll dive into this topic when we learn about multiple regression in the next tutorial.

Anyway, when using the `lm()` function, the other important thing to do is pass in the data, which we do with `data=data`, since all of our data is stored in a data.frame called `data`.

In [None]:
## do a linear regression and save the results in lr.line,
## where lr = linear regression.
lr.line <- lm(revenue ~ num.stores, data=data)
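As a small aside, a formula is an ordinary **R** object, so it can be stored in a variable and inspected before it is handed to `lm()`. The sketch below retypes the data so that it runs on its own, and the names `my.formula` and `formula.fit` are just illustrative choices, not something from the book.

```r
## rebuild the data so this sketch runs on its own
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)
data <- data.frame(num.stores, revenue)

## store the formula in a variable
## (my.formula is a hypothetical name used for illustration)
my.formula <- revenue ~ num.stores

## a formula has its own class...
class(my.formula)

## ...and all.vars() lists the variables it mentions,
## with the thing we want to predict first
all.vars(my.formula)

## passing the stored formula to lm() gives the same fit
## as typing the formula directly
formula.fit <- lm(my.formula, data=data)
formula.fit
```

Storing the formula this way can be handy when the same formula is reused in several places.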

Among many other things, the `lm()` function returns an object with the slope and y-axis intercept, which we can see by just printing out the contents of `lr.line`.

In [None]:
lr.line

Now, if we want to add the linear regression line to our graph of the data, we can do that with the `abline()` function. `abline()` makes this super easy by having an argument, `reg`, that we can pass `lr.line` to, and it will take care of all the details associated with drawing the line.

In [None]:
## First, draw the data...
plot(data$num.stores, data$revenue, 
     xlim=c(0, 15), ylim=c(0, 15), 
     pch=19, cex=4, col="salmon")

## ...now add the regression line.
abline(reg=lr.line, 
       lwd=10, # use a relatively thick line
       col="deepskyblue") # match the color in the book
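If you are curious what `abline()` does with `reg`, here is a minimal sketch that draws the same line from its two numbers instead, using the `a` (y-axis intercept) and `b` (slope) arguments of `abline()`. It retypes the data and refits the line so that it runs on its own, and the names `line.intercept` and `line.slope` are just illustrative.

```r
## rebuild the data and the fit so this sketch runs on its own
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)
data <- data.frame(num.stores, revenue)
lr.line <- lm(revenue ~ num.stores, data=data)

## pull the two numbers that define the line out of the fit
## (line.intercept and line.slope are hypothetical names)
line.intercept <- coef(lr.line)[1]
line.slope <- coef(lr.line)[2]

## draw the data...
plot(data$num.stores, data$revenue,
     xlim=c(0, 15), ylim=c(0, 15),
     pch=19, cex=4, col="salmon")

## ...and draw the line from its intercept (a) and slope (b)
abline(a=line.intercept, b=line.slope,
       lwd=10, col="deepskyblue")
```

Both approaches should draw the same line, so passing `lr.line` to `reg` is simply the more convenient option.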

# BAM!

Now that we have a graph of our data and its corresponding linear regression line, let's learn how to interpret the statistics associated with the linear regression.


----

# Generating a Linear Regression Summary with `summary()`

Earlier, when we printed the output from `lm()` that we saved in `lr.line`, all we got that was interesting was the y-axis intercept and the slope. However, `lr.line` contains a lot more data that we can access by passing it to the `summary()` function.

In [None]:
## get a summary of the linear regression
lr.summary <- summary(lr.line)

## print out the summary
lr.summary

As we can see, the output from `summary()` is pretty extensive. Of specific interest to us right now, however, are two bits close to the bottom: `Multiple R-squared:  0.4488` and `p-value: 0.5327`. For now, just know that **Multiple R-squared** is just another way to say **R-squared**, and, in this case, the $R^2$ value for our linear regression is 0.4488, which rounds to 0.45. The summary also tells us that the *p*-value for that $R^2$ is 0.5327, which rounds to 0.53. In other words, using the standard threshold for statistical significance, 0.05, we would fail to reject the Null Hypothesis that there is no relationship between the number of stores a company has and its revenue. That's a little bit of a bummer, but it could be that there is a relationship and we just don't have enough data to be confident in saying so.

Anyway, since we saved the output from `summary()` in `lr.summary`, we can access the individual values, like the $R^2$ value, by adding a `$` to the variable name and the value we want to access. For example, if we wanted to access the $R^2$ value directly, we would use this command:

In [None]:
lr.summary$r.squared

Likewise, we can access parameter values (the y-axis intercept and the slope), which are also called coefficients, with the following command:

In [None]:
lr.summary$coefficients

As we can see, when we print out the coefficients, we get all kinds of information in addition to the values for the y-axis intercept and the slope, which are in the first column, labeled **Estimate**. The next two columns, **Std. Error** and **t value**, are not super interesting right now, but the last column, **Pr(>|t|)**, is. It gives us the *p*-value for each parameter, testing the Null Hypothesis that the parameter value is actually 0.

For example, the estimated y-axis intercept that we calculated from the data is 2.9622302. However, the *p*-value that tests the hypothesis that it actually is equal to 0 is 0.6994323. This tells us that even though our estimate for the y-axis intercept is non-zero, we can't be confident that the true value really is non-zero.

**NOTE:** Although the `summary()` function always returns the *p*-value for the y-axis intercept, it's not used that often. Generally speaking, we don't really care what the y-axis intercept is. What is interesting, however, is the slope and its *p*-value, because this tells us if we are confident (or not) about a relationship between the two variables we measured. In this case, that would tell us if there is a relationship between Number of Stores and Revenue.

So, in this example, when we look at the *p*-value for **num.stores**, the variable that contains the number of stores, we get 0.5326583, and this tells us that we fail to reject the Null Hypothesis that the slope is 0. In other words, we fail to reject the hypothesis that our linear regression line is no better at predicting Revenue than just using the mean value for Revenue (which is what we would use if the slope were 0).

**NOTE:** If you look carefully, you'll see that the *p*-value for **num.stores** is the same as the *p*-value in the bottom right-hand corner of the output we got when we printed the original summary with `lr.summary`. This is useful, because, while we can directly access the *p*-value for **num.stores** with the following command...

In [None]:
## access the p-value for the slope
## The [2, 4] at the end says: 
##    give me the value in the second row, fourth column
lr.summary$coefficients[2, 4]

...we can't directly access the *p*-value in the bottom right-hand corner of the output. So, getting the *p*-value out of our linear regression summary is a little awkward, but not impossible.
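One way to recover that overall *p*-value, sketched here as an option rather than the only approach, is to start from the *F*-statistic that `summary()` stores in `fstatistic` and convert it with the `pf()` function. The data are retyped so that the code runs on its own, and `f.stat` and `overall.p.value` are just illustrative names.

```r
## rebuild the fit and its summary so this sketch runs on its own
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)
lr.summary <- summary(lm(revenue ~ num.stores))

## fstatistic is a named vector with three entries:
## value (the F-value), numdf and dendf (the degrees of freedom)
f.stat <- lr.summary$fstatistic
f.stat

## convert the F-value into a p-value
## (unname() just strips the leftover "value" label)
overall.p.value <- unname(pf(f.stat["value"],
                             df1=f.stat["numdf"],
                             df2=f.stat["dendf"],
                             lower.tail=FALSE))
overall.p.value

## with a single predictor, this should match the p-value for the slope
lr.summary$coefficients[2, 4]
```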

Anyway, now that we know how to do a linear regression with `lm()` and access and interpret the most important results, let's try to calculate some of these values by hand. We'll start by calculating $R^2$.

----

# Calculating the Sum of the Squared Residuals (SSRs) and $R^2$

Even though the `lm()` function calculated $R^2$, it's also helpful to know how to calculate it by hand. So, let's start with the equation for $R^2$.

<span style="font-size: 24px;">
$R^2 = \frac{\textrm{SSR(mean)} - \textrm{SSR(fit)}}{\textrm{SSR(mean)}}$
</span>

Here, SSR(mean) is the Sum of the Squared Residuals around the mean of the y-axis values, which, in this example, are the Revenue values, and SSR(fit) is the Sum of the Squared Residuals around the fitted line. We'll start by calculating SSR(mean) and, more specifically, by calculating the mean value for Revenue:

In [None]:
## calculate the mean revenue value
mean.revenue <- mean(data$revenue)

## print out the mean revenue
mean.revenue

Now that we have the mean value for Revenue, we can calculate the Residuals around the mean by subtracting the mean from each Revenue value.

In [None]:
## Calculate the residuals
## NOTE: data$revenue contains multiple values
##       so R subtracts the mean from each one
##       and returns an array of differences that,
##       in this case, we save in mean.residuals.
mean.residuals <- data$revenue - mean.revenue

## print out the residuals
mean.residuals

Now let's square each Residual:

In [None]:
## Square the residuals
mean.residuals.squared <- mean.residuals^2

## print out the squared residuals
mean.residuals.squared

Now we just need to add up the squared residuals. We'll do this by passing `mean.residuals.squared` to the `sum()` function:

In [None]:
## Add up the squared residuals
ssr.mean <- sum(mean.residuals.squared)

## Print out the SSR(mean)
ssr.mean
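As a quick sanity check, SSR(mean) is closely related to the sample variance: `var()` divides the same sum of squared differences by *n* - 1, so multiplying the variance by *n* - 1 should give SSR(mean) back. Here is a minimal sketch, with the Revenue values retyped so that it runs on its own and `ssr.mean.check` as an illustrative name.

```r
## retype the revenue values so this sketch runs on its own
revenue <- c(3, 12.5, 7)

## SSR(mean), calculated in one line
ssr.mean <- sum((revenue - mean(revenue))^2)

## the sample variance times (n - 1) should be the same number
## (ssr.mean.check is a hypothetical name)
ssr.mean.check <- var(revenue) * (length(revenue) - 1)

ssr.mean
ssr.mean.check

## all.equal() compares the two while allowing for tiny rounding differences
all.equal(ssr.mean, ssr.mean.check)
```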

Bam.

Now let's calculate SSR(fit). **NOTE:** The `lm()` function can give us the residuals around the fitted line or we can calculate them by hand. Here, we'll show you how to do it both ways.

We'll start by seeing the residuals that `lm()` gives us.

In [None]:
lr.line$residuals

Now let's calculate the residuals by hand and compare our results.

To calculate the residuals by hand, we'll first need the y-axis intercept and the slope of the linear regression line, and we can get those from the results of the original call to `lm()`, `lr.line`, or from the summary created by `summary()` and saved in `lr.summary`. Since it's just a little easier to get them from `lr.line`, we'll use that, but first, let's just remind ourselves of what `lr.line` looks like by printing it out:

In [None]:
lr.line

And we can access the coefficients, the intercept and slope, directly with `lr.line$coefficients`:

In [None]:
lr.line$coefficients

So, now let's save the y-axis intercept in a variable called `y.int`...

In [None]:
y.int <- lr.line$coefficients[1]

## print out the y-axis intercept
y.int

...and save the slope in a variable called `slope`...

In [None]:
slope <- lr.line$coefficients[2]

## print out the slope
slope

Now, given the y-axis intercept and the slope, we can predict the revenue for each company in our dataset by multiplying the number of stores in `data$num.stores` by the slope and then adding the y-axis intercept.

In [None]:
fit.predictions <- (data$num.stores * slope) + y.int

## print out the predicted values
fit.predictions

**NOTE:** The original call to `lm()` also returns the predicted values, and, if we didn't want to calculate them ourselves, we could print them out like this:

In [None]:
lr.line$fitted.values

And, since our predicted values are the same as what `lm()` calculated, we must have done things correctly.
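The same slope and y-axis intercept can also be used to predict Revenue for a number of stores that is not in the dataset. The sketch below uses the `predict()` function for this. The company with 10 stores is a made-up example, not something from the book, and the data are retyped so that the code runs on its own.

```r
## rebuild the data and the fit so this sketch runs on its own
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)
data <- data.frame(num.stores, revenue)
lr.line <- lm(revenue ~ num.stores, data=data)

## a hypothetical company with 10 stores
## NOTE: the column name must match the name used in the formula
new.company <- data.frame(num.stores=10)

## let predict() do the work...
predicted.revenue <- predict(lr.line, newdata=new.company)
predicted.revenue

## ...or do it by hand: y-axis intercept + (slope * number of stores)
by.hand <- coef(lr.line)[1] + (coef(lr.line)[2] * 10)
by.hand
```

Both approaches should give the same number. That said, with only three data points, a prediction like this should be taken with a grain of salt.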

Next, we calculate the residuals by subtracting the predicted values from the observed Revenue values in `data$revenue`.

In [None]:
fit.residuals <- data$revenue - fit.predictions

## print out the residuals
fit.residuals

Bam! We just calculated the residuals around the fitted line by hand. Now let's compare those to the residuals that the `lm()` function calculates for us...

In [None]:
lr.line$residuals

...and we see that, either way we get the residuals, we get the same thing. In other words, when we calculated the residuals by hand, we didn't make a mistake.

Now let's finish calculating the SSR(fit) by squaring the residuals...

In [None]:
fit.residuals.squared <- fit.residuals^2
fit.residuals.squared

...and then adding up the squared residuals.

In [None]:
ssr.fit <- sum(fit.residuals.squared)
ssr.fit
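For a model fit with `lm()`, **R**'s `deviance()` function returns the sum of the squared residuals around the fitted line, so it can serve as a one-line check on SSR(fit). Here is a minimal sketch, with the data retyped so that it runs on its own and `ssr.fit.check` as an illustrative name.

```r
## rebuild the fit so this sketch runs on its own
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)
lr.line <- lm(revenue ~ num.stores)

## SSR(fit) from the residuals that lm() stores
ssr.fit <- sum(lr.line$residuals^2)

## SSR(fit) from deviance()
## (ssr.fit.check is a hypothetical name)
ssr.fit.check <- deviance(lr.line)

ssr.fit
ssr.fit.check
```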

Now that we have calculated SSR(mean) and SSR(fit), we can calculate
<span style="font-size: 18px;">
$R^2 = \frac{\textrm{SSR(mean)} - \textrm{SSR(fit)}}{\textrm{SSR(mean)}}$
</span>

In [None]:
r.squared <- (ssr.mean - ssr.fit) / ssr.mean

## print out r.squared
r.squared

BAM!

Now let's compare that to the value that **R** calculated for us when we called the `summary()` function...

In [None]:
lr.summary$r.squared

...and we see that we got the same value, so we must have done all the math right.
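To recap, the whole by-hand calculation can be bundled into one small function. This is just a sketch that assumes a single predictor. The function name `calc.r.squared` and its arguments `x` and `y` are made up for illustration and are not part of **R** or the book.

```r
## calc.r.squared() is a hypothetical helper that repeats
## the by-hand steps for any x and y
calc.r.squared <- function(x, y) {
    ## SSR(mean): squared residuals around the mean of y
    ssr.mean <- sum((y - mean(y))^2)

    ## SSR(fit): squared residuals around the fitted line
    fit <- lm(y ~ x)
    ssr.fit <- sum(fit$residuals^2)

    ## R-squared
    (ssr.mean - ssr.fit) / ssr.mean
}

## try it on the data from this tutorial
r.squared.check <- calc.r.squared(x=c(2, 12, 15), y=c(3, 12.5, 7))
r.squared.check
```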

# BAM!

Now let's learn how we can calculate a *p*-value for $R^2$ with a histogram.

----

# Calculating a *p*-value for the $R^2$ with a histogram

Now that we know how to fit a linear regression line to data with `lm()` and calculate the $R^2$ value with `summary()` (and also by hand), let's learn how we can calculate a *p*-value using a histogram. This requires us to repeat the following steps a lot of times:

- Generate random data
- Fit a line to the data with `lm()`
- Calculate the $R^2$ value for that fit with `summary()`
- Store the $R^2$ value in an array

Once we have an array of $R^2$ values calculated from random datasets, we pass it to `hist()` to see how they are distributed and then calculate a *p*-value by seeing how many of the "random" $R^2$ values are greater than or equal to the one for our original dataset. We'll start by generating the "random" $R^2$ values with the following code (**NOTE:** It might take a minute or so for this code to run).

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
set.seed(42)

## To generate random datasets, we'll use two
## normal distributions, one for the number of stores
## and one for the revenue. These distributions
## will be based on our observed data, so we
## need to calculate their estimated
## means and standard deviations.
mean.num.stores <- mean(data$num.stores)
sd.num.stores <- sd(data$num.stores)

mean.revenue <- mean(data$revenue)
sd.revenue <- sd(data$revenue)

## Next, we define the number of random
## datasets we want to create...
num.rand.datasets <- 10000

## ...and we define the number of data points
## per dataset
num.datapoints <- nrow(data)

## Create an empty array that is num.rand.datasets long
rand.r.squared <- rep(NA, times=num.rand.datasets)

## Here is the loop where we create a bunch of random datasets,
## each with num.datapoints values, fit a linear regression
## line to the random data, then calculate and store
## the R-squared values
for(i in 1:num.rand.datasets) {

    ## generate random values for the number of stores
    rand.num.stores <- rnorm(n=num.datapoints,
                             mean=mean.num.stores,
                             sd=sd.num.stores)

    ## generate random values for the revenue
    rand.revenue <- rnorm(n=num.datapoints,
                          mean=mean.revenue,
                          sd=sd.revenue)

    ## bundle the random values together in a data.frame
    rand.data <- data.frame(
        rand.num.stores,
        rand.revenue)

    ## fit a linear regression line to the random data
    ## and calculate R-squared
    rand.lr.line <- summary(lm(rand.revenue ~ rand.num.stores, data=rand.data))    

    ## save the R-squared value.
    rand.r.squared[i] <- rand.lr.line$r.squared
}

Now let's draw a histogram of the $R^2$ values with the `hist()` function...

In [None]:
hist(rand.r.squared)

...and calculate the *p*-value as the proportion of "random" $R^2$ values greater than or equal to the one we got for our original data.

In [None]:
## the number of randomly generated r.squared >= the original r.squared
num.greater <- sum(rand.r.squared >= lr.summary$r.squared)

## calculate the p-value 
p.value <- num.greater / num.rand.datasets

## print out the p-value
p.value

Thus, the *p*-value calculated with the histogram is 0.5318. Now let's compare that to the *p*-value calculated when we passed `lr.line` to `summary()`...

In [None]:
lr.summary$coefficients[2,4]

So, at last, we see that the two *p*-values are essentially the same.
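Because this is a simple linear regression with a single predictor, $R^2$ is equal to the squared correlation between the two variables, so `cor()` can stand in for `lm()` and `summary()` inside the simulation. The sketch below is an alternative take on the same idea, not the book's code: it uses `replicate()` instead of a `for` loop, retypes the data so that it runs on its own, and marks the observed $R^2$ on the histogram with `abline()`. The name `observed.r.squared` is just illustrative.

```r
## set the seed so that the results are reproducible
set.seed(42)

## retype the data so this sketch runs on its own
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)
num.datapoints <- length(revenue)

## with one predictor, R-squared = the squared correlation
observed.r.squared <- cor(num.stores, revenue)^2

## generate 10,000 random datasets and save each R-squared
rand.r.squared <- replicate(10000, {
    rand.num.stores <- rnorm(n=num.datapoints,
                             mean=mean(num.stores),
                             sd=sd(num.stores))
    rand.revenue <- rnorm(n=num.datapoints,
                          mean=mean(revenue),
                          sd=sd(revenue))
    cor(rand.num.stores, rand.revenue)^2
})

## draw the histogram and mark the observed R-squared
hist(rand.r.squared)
abline(v=observed.r.squared, lwd=3, col="deepskyblue")

## the p-value is the proportion of random R-squared values
## that are >= the observed R-squared
p.value <- mean(rand.r.squared >= observed.r.squared)
p.value
```

Skipping `lm()` and `summary()` inside the loop should make the simulation run noticeably faster, and the resulting *p*-value should land close to the one from `summary()`.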

# BAM!

----

# BONUS: Calculating an *F*-value and *p*-value using the Sum of the Squared Residuals

The equation for *F* is...

<span style="font-size: 24px;">
$F = \frac{[\textrm{SSR(mean)} - \textrm{SSR(fit)}] / (p_\textrm{fit} - p_\textrm{mean})}
    {\textrm{SSR(fit)} / (n - p_\textrm{fit})}$
</span>

...so, all we have to do to calculate *F* is plug in the SSR(mean), the SSR(fit), $p_{\textrm{fit}}$, the number of parameters required for the fitted line, which is **2** (one for the slope and one for the y-axis intercept), $p_{\textrm{mean}}$, the number of parameters required for the mean, which is **1** (the y-axis intercept), and *n*, the number of datapoints, which is **3**.

In [None]:
F.numerator <- (ssr.mean - ssr.fit) / (2 - 1)
F.denominator <- ssr.fit / (3 - 2)

F <- F.numerator / F.denominator
F

Now let's see if that matches the value that `summary()` gave us...

In [None]:
lr.summary$fstatistic

...and it does! Now let's convert that *F*-value into a *p*-value. We do this with the `pf()` function.

In [None]:
## We set lower.tail=FALSE so that we calculate
## the area under the curve from the F-value to
## positive infinity. If lower.tail=TRUE, then we
## will get the area under the curve from 0 to the F-value.
p.value <- pf(F, df1=(2-1), df2=(3-2), lower.tail=FALSE)
p.value

Now let's check to see if the *p*-value we calculated by hand matches the value we got from the call to `summary()`...

In [None]:
lr.summary

...and it does! Both are equal to **0.5327**.
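The *F* equation compares a model that only uses the mean to a model that uses the fitted line, and **R**'s `anova()` function can do that comparison for us when we hand it both models. The sketch below is one way to set that up, assuming that `revenue ~ 1` (a formula with only a y-axis intercept) is an acceptable stand-in for "just use the mean". The names `mean.model`, `fit.model` and `model.comparison` are illustrative.

```r
## rebuild the data so this sketch runs on its own
num.stores <- c(2, 12, 15)
revenue <- c(3, 12.5, 7)
data <- data.frame(num.stores, revenue)

## the "mean" model only has a y-axis intercept (1 parameter)
mean.model <- lm(revenue ~ 1, data=data)

## the "fit" model has a y-axis intercept and a slope (2 parameters)
fit.model <- lm(revenue ~ num.stores, data=data)

## compare the two models
model.comparison <- anova(mean.model, fit.model)
model.comparison

## the second row holds the F-value and the p-value
## for the comparison
model.comparison$F[2]
model.comparison[["Pr(>F)"]][2]
```

The `RSS` column in the output holds SSR(mean) in the first row and SSR(fit) in the second, so the table lines up with the terms in the equation for *F*.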

# BONUS BAM!!!

----