# The StatQuest Illustrated Guide to Statistics
## Chapter 08 - Using Regression to Test for Differences with *t*-tests!!!

Copyright 2026, Joshua Starmer

In this notebook we'll learn how to...

- Use the `ols()` function to do a *t*-test.
- Use the `ttest_ind()` function to do a *t*-test.
- Use random data to create a histogram and a *p*-value for a *t*-test.
- Understand the relationship between the *F*-distribution and the *t*-distribution.

**NOTE:**
This tutorial assumes that you have installed **[Python](https://www.python.org/)** and read Chapter 8 in **[The StatQuest Illustrated Guide to Statistics]()**.

----

Since we're using Python, the first thing we do is load in some modules that will help us do math and plot graphs.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
import seaborn as sns # to draw graphs and have them look somewhat nice

# Using the `ols()` function to do a *t*-test

If we're going to do a *t*-test, then the first thing we need is some data. In this example, we'll use the dataset illustrated in Chapter 8, which is the length of time it took people to recover from a cold. 3 of the people took drug A, and 3 other people took drug B. So, we start by defining the amount of time it took people taking drug A or B to recover.

In [None]:
## the amount of time required to recover
## using two different drugs
drug_a = [8, 12, 22]
drug_b = [20, 29, 39]

Ultimately, the times in `drug_a` and `drug_b` are the values we want to predict with the `ols()` function. In other words, the dependent variable (the variable we want to predict) should contain the times it took everyone to recover. Since we only have a single dependent variable, we can concatenate the times for drugs A and B together with the `+` operator, which joins two Python lists end to end.

In [None]:
## now concatenate the times to recover into a single variable
recovery_time = drug_a + drug_b

## print out the values
recovery_time

Now that we have the times to recover concatenated into a single variable, `recovery_time`, we need a way to keep track of which drug corresponds with which time. We'll do this by creating a **factor**. A **factor** is, essentially, a discrete variable that has a limited number of options. Here, we'll create a factor that tells us that the first 3 values in `recovery_time` should be associated with drug A and the second 3 values should be associated with drug B.

In [None]:
## create a "factor" variable that
## allows us to keep track of which drug
## is associated with which time in recovery_time
drug = (['a'] * 3) + (['b'] * 3)

## print out the values
drug

Now let's combine `recovery_time` and `drug` in a single `DataFrame` that puts all of our data in one place.

In [None]:
## create a DataFrame
df = pd.DataFrame({
    'time': recovery_time,
    'drug': pd.Categorical(drug)
})

## print it out
df

Now we have a `DataFrame`, `df`, that pairs each recovery time with a label for a drug.

Bam!

Now let's use the data to do a regression, where we use `drug` to predict `time`.

In [None]:
model = smf.ols('time ~ drug', data=df)
results = model.fit()
results.summary()

Now let's talk about the output. On the right side of the top table, we see the *p*-value for our *t*-test, `Prob (F-statistic): 0.0900`. Because 0.09 is greater than the commonly used threshold of 0.05, we fail to reject the null hypothesis that using 2 means (one per drug) to predict recovery time is no better than just using a single, overall mean.
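To make the comparison between the two models concrete, here is a minimal sketch that computes the *F*-value and its *p*-value by hand from the sums of squared residuals around the overall mean (the simple model) and around the two drug means (the fancy model). The variable names `ss_simple`, `ss_fancy`, `p_simple` and `p_fancy` are just illustrative labels, not anything that `ols()` produces.

```python
import numpy as np
from scipy import stats

drug_a = np.array([8, 12, 22])
drug_b = np.array([20, 29, 39])
all_times = np.concatenate([drug_a, drug_b])

## simple model: one overall mean for everyone
ss_simple = np.sum((all_times - all_times.mean()) ** 2)

## fancy model: one mean per drug
ss_fancy = (np.sum((drug_a - drug_a.mean()) ** 2)
            + np.sum((drug_b - drug_b.mean()) ** 2))

n = len(all_times)  # 6 recovery times in total
p_simple = 1        # parameters in the simple model (1 mean)
p_fancy = 2         # parameters in the fancy model (2 means)

## F = (improvement per extra parameter) / (leftover variation per degree of freedom)
f_value = ((ss_simple - ss_fancy) / (p_fancy - p_simple)) / (ss_fancy / (n - p_fancy))

## the p-value is the area under the F-distribution to the right of f_value
p_value = stats.f.sf(f_value, p_fancy - p_simple, n - p_fancy)

## R-squared is the fraction of the variation explained by the fancy model
r_squared = (ss_simple - ss_fancy) / ss_simple

print(f_value, p_value, r_squared)
```

If everything lines up, the printed *p*-value should agree with `Prob (F-statistic)` in the summary table.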

The other thing worth noting in the output is the `coef` column, where we see the coefficients for the two drugs. The format will look strange at first because, by default, rather than have one estimated mean for Drug A and a second estimated mean for Drug B, `ols()` estimates a y-axis intercept, `Intercept`, which is the mean recovery time for the people who took Drug A, 14.0, and an offset from the y-axis intercept for Drug B, 15.3. The offset for Drug B is the mean recovery time for the people who took Drug B minus the y-axis intercept. In other words, the mean value for Drug B is the intercept, 14.0, plus the coefficient for `drug[T.b]`, 15.3. So the mean recovery time for people who took Drug B is 14 + 15.3 = 29.3.
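As a quick sanity check on the intercept-plus-offset arithmetic, this small sketch recomputes both coefficients directly from the group means. The names `intercept` and `offset_b` are just illustrative, not attributes of the fitted model.

```python
import numpy as np

drug_a = [8, 12, 22]
drug_b = [20, 29, 39]

## the intercept is the mean recovery time for Drug A
intercept = float(np.mean(drug_a))

## the offset is the mean for Drug B minus the intercept
offset_b = float(np.mean(drug_b)) - intercept

print(round(intercept, 1))             # 14.0
print(round(offset_b, 1))              # 15.3
print(round(intercept + offset_b, 1))  # 29.3, the mean for Drug B
```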

Now that we know how to do a *t*-test using the `ols()` function, let's learn how to do a *t*-test the way most people would do it, using the `ttest_ind()` function.

----

# Using the `ttest_ind()` function to do a *t*-test

Using the `ttest_ind()` function is really straightforward. The `ind` refers to the fact that both samples are independent, which, in this case, means they were not measured from the same people. You just pass it two variables, one for each group of measurements. In this case, we pass in the times to recover for Drug A, `drug_a`, and the times to recover for Drug B, `drug_b`. We also set `equal_var=True` because we want to compare the *p*-value to what we got when we used the `ols()` function. In other words, when we use the `ols()` function, we assume that the **Population Variances** are the same for both populations: people taking drug A and people taking drug B.

In [None]:
## t-test with equal variances = F-test
stats.ttest_ind(drug_a, drug_b, equal_var=True)

As we see, the *p*-value when we use `ttest_ind()` is the same as when we use `ols()`. The big difference is that `ttest_ind()` is a little easier to use, since we didn't have to concatenate the recovery times for both drugs or create a factor variable.
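To see what `equal_var=True` means in practice, here is a minimal sketch of the equal-variance *t*-value computed by hand, using a single pooled estimate of the variance for both groups, and then compared to the value from `ttest_ind()`. The names `pooled_var`, `std_err` and `t_by_hand` are illustrative.

```python
import numpy as np
from scipy import stats

drug_a = np.array([8, 12, 22])
drug_b = np.array([20, 29, 39])
n_a, n_b = len(drug_a), len(drug_b)

## pool the two sample variances into one estimate,
## which is what assuming equal population variances means
pooled_var = (((n_a - 1) * drug_a.var(ddof=1) + (n_b - 1) * drug_b.var(ddof=1))
              / (n_a + n_b - 2))

## standard error of the difference between the two means
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

## the t-value is the difference in means divided by its standard error
t_by_hand = (drug_a.mean() - drug_b.mean()) / std_err

## compare to scipy
t_from_scipy = stats.ttest_ind(drug_a, drug_b, equal_var=True).statistic
print(t_by_hand, t_from_scipy)
```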

Now let's use random data to create a histogram that we can use to calculate a *p*-value for a *t*-test. This will verify the intuition for what the *p*-value represents.

----

# Using random data to create a histogram and a *p*-value for a *t*-test

Using random data to create a histogram and calculate a *p*-value for a *t*-test is pretty much the same as what we did in Chapters 6 and 7 for Linear and Multiple Regression.

In [None]:
## since we're going to generate random datasets,
## let's start by setting the seed so that the results
## are reproducible
np.random.seed(42)

all_data = drug_a + drug_b

## To generate random datasets, we'll use
## a normal distribution based on our observed data,
## so we need to calculate the estimated
## mean and standard deviation from all the data.
mean_all_data = np.mean(all_data)
sd_all_data = np.std(all_data, ddof=1)

## Next, we define the number of random
## datasets we want to create...
num_rand_datasets = 10_000

## ...and we define the number of data points
## per dataset
num_datapoints = 3

## Here, we're just creating a factor that keeps
## track of which drug each recovery time is associated with
drugs = ['a'] * num_datapoints + ['b'] * num_datapoints

## Create an empty array that is num_rand_datasets long
rand_r_squared = np.empty(num_rand_datasets)

## Here is the loop where we create a bunch of random datasets,
## each with num_datapoints values per drug, fit a regression
## line to the random data, then calculate and store
## the R-squared values
for i in range(num_rand_datasets):
    
    ## generate random recovery times for each drug
    rand_drug_a_recovery = np.random.normal(mean_all_data, sd_all_data, num_datapoints)
    rand_drug_b_recovery = np.random.normal(mean_all_data, sd_all_data, num_datapoints)

    ## bundle the random values together in a DataFrame
    data = pd.DataFrame({
        'x': pd.Categorical(drugs),
        'y': np.concatenate([rand_drug_a_recovery, rand_drug_b_recovery])
    })

    # fit regression and calculate R-squared
    model = smf.ols('y ~ x', data=data)
    model_results = model.fit()
    rand_r_squared[i] = model_results.rsquared

Now let's draw a histogram of the $R^2$ values with the `histplot()` function...

In [None]:
sns.histplot(data=rand_r_squared)

...and calculate the *p*-value as the proportion of "random" $R^2$ values greater than or equal to the one we got for our original data.

In [None]:
# the number of randomly generated rsquared >= the original rsquared
num_greater = np.sum(rand_r_squared >= results.rsquared)

# calculate the p-value 
p_value = num_greater / num_rand_datasets

# print out the p-value
p_value

Thus, the *p*-value calculated with the histogram is 0.0928. Now let's compare that to the *p*-value stored in `results`...

In [None]:
results.f_pvalue

So, at last, we see that the two *p*-values are essentially the same.
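Fitting 10,000 regressions one at a time can be slow, so here is a hedged sketch of the same idea done with NumPy arrays. It relies on the fact that, for this model, $R^2$ can be computed directly from the sums of squares around the overall mean and around the two group means. The helper `r_squared()` is a hypothetical function written for this sketch, and it uses `np.random.default_rng()` rather than `np.random.seed()`, so the simulated *p*-value will be close to, but not identical to, the one above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

drug_a = np.array([8, 12, 22])
drug_b = np.array([20, 29, 39])
all_data = np.concatenate([drug_a, drug_b])

def r_squared(a, b):
    """R-squared for 'one mean per group' vs. 'one overall mean'.
    a and b hold one dataset per row (or a single 1-D dataset)."""
    both = np.concatenate([a, b], axis=-1)
    ss_simple = np.sum((both - both.mean(axis=-1, keepdims=True)) ** 2, axis=-1)
    ss_fancy = (np.sum((a - a.mean(axis=-1, keepdims=True)) ** 2, axis=-1)
                + np.sum((b - b.mean(axis=-1, keepdims=True)) ** 2, axis=-1))
    return (ss_simple - ss_fancy) / ss_simple

## R-squared for the original data
observed = r_squared(drug_a, drug_b)

## generate all of the random datasets at once, one per row
num_rand_datasets = 10_000
num_datapoints = 3
mean_all = all_data.mean()
sd_all = all_data.std(ddof=1)
rand_a = rng.normal(mean_all, sd_all, size=(num_rand_datasets, num_datapoints))
rand_b = rng.normal(mean_all, sd_all, size=(num_rand_datasets, num_datapoints))
rand_r_squared = r_squared(rand_a, rand_b)

## proportion of random R-squared values >= the observed one
p_value = np.mean(rand_r_squared >= observed)
print(observed, p_value)
```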

# BAM!

Now, let's see how the *F*-distribution is related to the *t*-distribution.

----

# Understanding the relationship between the *F*-distribution and the *t*-distribution

I could just tell you that the *F*-distribution is a generalization of the *t*-distribution and that you can square any *t*-value and use it as an *F*-value to get the exact same *p*-value, but it's better to show you. So, we'll start by re-doing the *t*-test we did before, only this time we'll save the output in a variable called `t_test_stuff`.

In [None]:
## t-test with equal variances = F-test
t_test_stuff = stats.ttest_ind(drug_a, drug_b, equal_var=True)

# print out t_test_stuff
t_test_stuff

We can access the *t*-value directly like this:

In [None]:
t_test_stuff.statistic

Like the *F*-value, the *t*-value is an x-axis coordinate underneath a curve. Now, just for fun, let's square the *t*-value from the *t*-test...

In [None]:
t_test_stuff.statistic ** 2

...and compare the squared *t*-value to the *F*-value we got when we did the *t*-test with `ols()`.

In [None]:
results.fvalue

Notice that both values are the same! This is true for any *t*-test. The *t*-value that it creates can be squared and used as an *F*-value with an *F*-distribution where DF1 = 1 (meaning we are comparing a simple model with 1 parameter to a fancy model with 2 parameters) and DF2 = $n - p_{\textrm{Fancy}}$. In this example, $n = 6$, since both drugs had 3 measurements each, and $p_{\textrm{Fancy}} = 2$, since we had one mean value per drug, so DF2 = 6 - 2 = 4.

Now let's show that we can get the same *p*-values both ways. First, we'll calculate the *p*-value from the *F*-distribution using the original *F*-value...

In [None]:
## first, import 'f'
from scipy.stats import f

## p-value with f-distribution
df1 = 1
df2 = 4

1-f.cdf(x=results.fvalue, dfn=df1, dfd=df2)

...then we'll square the *t*-value and use it to calculate the *p*-value from an *F*-distribution.

In [None]:
1-f.cdf(x=(t_test_stuff.statistic ** 2), dfn=df1, dfd=df2)

And we see that both *p*-values are the same, with maybe a small rounding error way out in the decimal backwaters. So, now we have seen that the *F*-distribution is a generalization of the *t*-distribution and that you can square any *t*-value and use it as an *F*-value to get the exact same *p*-value.
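To close the loop, here is a small sketch that calculates the *p*-value a third way: directly from the *t*-distribution. Because a *t*-distribution has two tails, we double the area in one tail to get the two-sided *p*-value, and that should match the area to the right of the squared *t*-value under the *F*-distribution. The names `p_from_t` and `p_from_f` are illustrative.

```python
from scipy import stats

drug_a = [8, 12, 22]
drug_b = [20, 29, 39]

t_value = stats.ttest_ind(drug_a, drug_b, equal_var=True).statistic

## DF2 = n - p_fancy = 6 - 2 = 4
df2 = len(drug_a) + len(drug_b) - 2

## two-sided p-value from the t-distribution
p_from_t = 2 * stats.t.sf(abs(t_value), df2)

## p-value from the F-distribution using the squared t-value
p_from_f = stats.f.sf(t_value ** 2, 1, df2)

print(p_from_t, p_from_f)
```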

Bam.

----