# Guide to Stepwise Regression and Best Subset Regression

Automatic variable selection procedures are algorithms that pick the variables to include in your regression model. Stepwise regression and Best Subsets regression are two of the more common variable selection methods. In this post, I compare how these methods work and which one provides better results.

These automatic procedures can be helpful when you have many independent variables and you need some help in the investigative stages of the variable selection process. You could specify many models with different combinations of independent variables, or you can have your statistical software do this for you.

These procedures are especially useful when theory and experience provide only a vague sense of which variables you should include in the model. However, if theory and expertise are strong guides, it’s generally better to follow them than to use an automated procedure. Additionally, if you use one of these procedures, you should consider it as only the first step of the model selection process.

Here are some objectives for this tutorial:

<ul>
    <li>Show how stepwise regression and best subsets regression work differently.</li>
    <li>Use both procedures on one example dataset to compare their results.</li>
    <li>Explore whether one procedure is better.</li>
    <li>Examine the factors that affect a method’s ability to choose the correct model.</li>
</ul>

# How Stepwise Regression Works

As the name stepwise regression suggests, this procedure selects variables in a step-by-step manner. The procedure adds or removes independent variables one at a time using the variable’s statistical significance. <b>Stepwise either adds the most significant variable or removes the least significant variable.</b> It does not consider all possible models, and it produces a single regression model when the algorithm ends.

Typically, you can control the specifics of the stepwise procedure. For example, you can specify whether it can only add variables, only remove variables, or both. You can also set the <b>significance level</b> for including and excluding the independent variables.
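To make the add-and-remove logic concrete, here is a minimal sketch in Python with NumPy. It is not the algorithm that any particular statistical package uses. As a simplifying assumption, it compares each coefficient's absolute t-statistic to a fixed threshold (2.0, which is roughly the 5% significance level for moderate sample sizes) instead of computing p-values. The names `t_stats`, `stepwise`, `t_enter`, and `t_remove` are hypothetical, and the data are synthetic.

```python
import numpy as np

def t_stats(X, y):
    """t-statistics of the slopes from an OLS fit that includes an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])      # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance
    return (beta / np.sqrt(np.diag(cov)))[1:]      # drop the intercept

def stepwise(X, y, t_enter=2.0, t_remove=2.0, max_steps=50):
    """Alternate add and remove steps until neither changes the model."""
    selected = []
    for _ in range(max_steps):
        changed = False
        # Add step: among unused columns, add the most significant one.
        best_j, best_t = None, t_enter
        for j in range(X.shape[1]):
            if j in selected:
                continue
            t = abs(t_stats(X[:, selected + [j]], y)[-1])
            if t > best_t:
                best_j, best_t = j, t
        if best_j is not None:
            selected.append(best_j)
            changed = True
        # Remove step: drop the least significant column if it falls below the threshold.
        if selected:
            t = np.abs(t_stats(X[:, selected], y))
            worst = int(np.argmin(t))
            if t[worst] < t_remove:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)

# Synthetic example: only columns 0 and 2 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)
chosen = stepwise(X, y)
print(chosen)
```

Setting `t_remove` to 0 turns this into a forward-only procedure, which mirrors the "only add variables" option described above.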

# How Best Subsets Regression Works

Best subsets regression is also known as <b>“all possible regressions” and “all possible models.”</b> Again, the name of the procedure indicates how it works. Unlike stepwise, best subsets regression fits all possible models based on the independent variables that you specify.

The number of models that this procedure fits grows quickly. If you have 10 independent variables, it fits 1,024 models. However, if you have 20 variables, it fits 1,048,576 models! <b>Best subsets regression fits 2<sup>P</sup> models, where P is the number of predictors</b> in the dataset.
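If you want to convince yourself of the 2<sup>P</sup> count, this short Python sketch enumerates every subset of a list of predictors. The helper name `count_subsets` is hypothetical, and the count includes the empty subset, which corresponds to the intercept-only model.

```python
from itertools import combinations

def count_subsets(predictors):
    """Count every subset of the predictors, including the empty one."""
    return sum(
        1
        for k in range(len(predictors) + 1)
        for _ in combinations(predictors, k)
    )

names = ["Temperature", "Pressure", "Rate", "Concentration", "Time"]
print(count_subsets(names))   # 32
print(2 ** 10, 2 ** 20)       # 1024 1048576
```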

After fitting all of the models, best subsets regression then displays the best-fitting models with one independent variable, two variables, three variables, and so on. Usually, either <b>adjusted R-squared or Mallows’ Cp is the criterion for picking the best-fitting models</b> for this process.

The result is a display of the best-fitting models of different sizes up to the full model. You need to compare the models to determine which one is the best. In some cases, <b>it is not clear which model is the best, and you’ll need to use your judgment.</b>
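Here is a rough sketch of that process in Python with NumPy, assuming adjusted R-squared is the ranking criterion. The function names `adjusted_r2` and `best_subsets`, the `top` parameter, and the synthetic data are all illustrative assumptions. The sketch skips the intercept-only model, so it fits 2<sup>P</sup> − 1 models.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit with intercept on the columns of X."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

def best_subsets(X, y, names, top=2):
    """Fit every non-empty subset and keep the `top` models of each size."""
    results = {}
    for k in range(1, X.shape[1] + 1):
        scored = []
        for cols in combinations(range(X.shape[1]), k):
            score = adjusted_r2(X[:, list(cols)], y)
            scored.append((score, [names[c] for c in cols]))
        scored.sort(key=lambda item: item[0], reverse=True)
        results[k] = scored[:top]
    return results

# Synthetic example: only x0 and x2 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)
names = ["x0", "x1", "x2", "x3", "x4"]

results = best_subsets(X, y, names)
for size, models in results.items():
    for score, cols in models:
        print(size, cols, round(score, 3))
```

Keeping the top two models per size is only a display choice. The procedure still fits every subset, which is why the run time doubles with each added predictor.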

# Comparison of Stepwise to Best Subsets Regression

While both automatic variable selection procedures assess the set of independent variables that you specify, the end results can be different. Stepwise regression does not fit all models but instead assesses the statistical significance of the variables one at a time and arrives at a single model. Best subsets regression fits all possible models and displays some of the best candidates based on adjusted R-squared or Mallows’ Cp.

The single model that stepwise regression produces can be simpler for the analyst. However, best subsets regression presents more information that is potentially valuable.

Enough talk about how these procedures work. Let’s see them in action!

# Example Using Stepwise and Best Subsets on the Same Dataset

Our example scenario models a manufacturing process. We’ll determine whether the production conditions are related to the strength of a product. If you want to try this yourself, you can download the CSV data file: <a href='https://statisticsbyjim.com/wp-content/uploads/2017/05/ProductStrength.csv'>ProductStrength</a>.

For both variable selection procedures, we’ll use the same independent and dependent variables.

<b>Dependent variable:</b>  Strength

<b>Independent variables:</b> Temperature, Pressure, Rate, Concentration, Time

In [44]:
import pandas as pd
df = pd.read_csv('ProductStrength.csv')
df.head()

Unnamed: 0,Strength,Temperature,Pressure,Rate,Concentration,Time
0,271.8,783.35,33.53,40.55,16.66,13.2
1,264.0,748.45,36.5,36.19,16.46,14.11
2,238.8,684.45,34.66,37.31,17.66,15.68
3,230.7,827.8,33.13,32.52,17.5,10.53
4,251.6,860.45,35.75,33.71,16.4,11.0


In [45]:
X = df.drop('Strength',axis=1).values
y = df['Strength'].values

In [46]:
print(X.shape)
print(y.shape)

(29, 5)
(29,)


In [47]:
from sklearn.model_selection import train_test_split

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Example of Stepwise Regression

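Let’s use stepwise regression to pick the variables for our model. I’ll use the stepwise method that allows the procedure to both add and remove independent variables as needed. The full stepwise output runs to a fourth and final step. Note that the mlxtend code below is a simpler, forward-only variant (forward=True, floating=False): it adds one variable at a time, scores candidates by five-fold cross-validated R-squared, and stops at three features (k_features=3).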

In [60]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

In [61]:
# Linear regression is the estimator used to score each candidate feature set
clf = LinearRegression()

# Build the forward sequential feature selector (adds one feature per step)
sfs1 = sfs(clf, k_features=3, forward=True, floating=False, scoring='r2', cv=5)

# Run sequential forward selection (SFS) on the training data
sfs1 = sfs1.fit(X_train, y_train)

In [62]:
# Which features?
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)

[1, 2, 3]


In [63]:
from sklearn.linear_model import LinearRegression
# Refit a linear model using only the selected features
clf = LinearRegression()
clf.fit(X_train[:, feat_cols], y_train)

# The printed "accuracy" values are R-squared scores from r2_score
y_train_pred = clf.predict(X_train[:, feat_cols])
print('Training accuracy on selected features: %.3f' % r2_score(y_train, y_train_pred))

y_test_pred = clf.predict(X_test[:, feat_cols])
print('Testing accuracy on selected features: %.3f' % r2_score(y_test, y_test_pred))

Training accuracy on selected features: 0.883
Testing accuracy on selected features: 0.750


In [64]:
# Build full model on ALL features, for comparison
clf = LinearRegression()
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
print('Training accuracy on all features: %.3f' % r2_score(y_train, y_train_pred))

y_test_pred = clf.predict(X_test)
print('Testing accuracy on all features: %.3f' % r2_score(y_test, y_test_pred))

Training accuracy on all features: 0.902
Testing accuracy on all features: 0.791


For our example data, the stepwise procedure added a variable in each step. The process stopped when there were no variables it could add or remove from the model. The final column displays the model that the procedure produced.

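The four independent variables in the final stepwise model are Concentration, Rate, Pressure, and Temperature, and this model has the highest adjusted R-squared. The forward-only mlxtend run above was capped at three features, so it returned column indices [1, 2, 3] (Pressure, Rate, and Concentration); the 88.30% R-squared is the training R-squared of that three-variable fit. You also want Mallows’ Cp to be close to the number of independent variables plus the constant. Mallows’ Cp for the final model is closer to the ideal value than the other models. It all looks good!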

# Example of Best Subsets Regression

Next, I’ll perform best subsets regression on the same dataset.

The best subsets procedure fits all possible models using our five independent variables. That means it fit 2<sup>5</sup> = 32 models. Each horizontal line represents a different model. By default, this statistical software package displays the top two models for each number of independent variables that are in the model. X’s indicate the independent variables that are in each model.

Below are the results.

<img src='img/bestsubsets_1.png'>

We’re looking for a model that has a high adjusted R-squared, a small standard error of the regression, and a Mallows’ Cp close to the number of variables plus constant.
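For reference, these three measures are usually defined as follows. This is the standard textbook form, not output from the software used here. I'm assuming n observations, a candidate model with p independent variables plus a constant, SSE<sub>p</sub> as that model's residual sum of squares, SST as the total sum of squares, and the "full" error variance as the mean squared error of the model that contains every candidate variable.

```latex
\begin{aligned}
R^2_{\text{adj}} &= 1 - \frac{SSE_p / (n - p - 1)}{SST / (n - 1)} \\[4pt]
S &= \sqrt{\frac{SSE_p}{n - p - 1}} \\[4pt]
C_p &= \frac{SSE_p}{\hat{\sigma}^2_{\text{full}}} - n + 2(p + 1)
\end{aligned}
```

When a candidate model has little bias, C<sub>p</sub> is approximately p + 1, which is the “number of variables plus constant” target. For the full model, C<sub>p</sub> equals that target by construction, so the comparison is only informative for the smaller models.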


The model I circled is the one that the stepwise method produced. Based on the goodness-of-fit measures, this model appears to be a good candidate. However, the best subsets regression results provide a larger context that might help us make a choice using our subject-area knowledge and goals.

# Using Best Subsets Regression in conjunction with Our Requirements

We might have specific priorities that affect our choice for the best model.

For instance, if our top priorities are to simplify and reduce the costs of data collection, we might be interested in the models with fewer independent variables that fit the data nearly as well. The first model listed with three variables has an adjusted R-squared that is only 1.4 percentage points less than the circled model. In fact, the best two-variable model is not far behind.

On the other hand, if using the model to make accurate predictions is our top priority, we might be interested in the model with all five independent variables. Almost all of the goodness-of-fit measures are marginally better for the full model compared to the best model with four variables. However, the predicted R-squared for the full model declined slightly compared to the model with four variables.

Often, predicted R-squared starts to decline when the model becomes too complex and begins to fit the noise in the data. Sometimes simpler models can produce more precise predictions. For the most predictive model, we might use the best two-variable model because it has the highest predicted R-squared.
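Predicted R-squared is commonly computed from the PRESS statistic, which is the sum of squared leave-one-out prediction errors. The sketch below assumes that definition and uses the hat-matrix shortcut so the model is fitted only once. The function name `predicted_r2` and the synthetic data are illustrative.

```python
import numpy as np

def predicted_r2(X, y):
    """Predicted R-squared = 1 - PRESS / SST for an OLS fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.pinv(A)               # hat matrix
    resid = y - H @ y                       # ordinary residuals
    loo_resid = resid / (1 - np.diag(H))    # leave-one-out residuals
    press = np.sum(loo_resid ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - press / sst

# Synthetic example: one real predictor and two noise predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X[:, 0] + rng.normal(size=30)

print(predicted_r2(X[:, :1], y))   # model with the real predictor only
print(predicted_r2(X, y))          # model that also includes the noise predictors
```

Because each observation is predicted by a model that never saw it, this measure can fall when you add variables, unlike ordinary R-squared.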

I value this extra information that best subsets regression provides. While this procedure requires more knowledge and effort to sort through the multiple models, it helps us choose the best model based on our specific requirements. However, this method also fits many more models than stepwise regression, which increases the risk of finding chance correlations.

# Assess Your Candidate Regression Models Thoroughly

If you use stepwise regression or best subsets regression to help pick your model, you need to investigate the candidates thoroughly. That entails fitting the candidate models the normal way and <b>checking the residual plots to be sure the fit is unbiased.</b> You also need <b>to assess the signs and values of the regression coefficients</b> to be sure that they make sense. These automatic model selection procedures can find chance correlations in the sample data and produce models that don’t make sense in the real world.

Automatic variable selection procedures can be helpful tools, particularly in the exploratory stage. However, you can’t expect an automated algorithm to understand the subject area better than you! Be aware of the following potential problems.


<ul>
    <li>These procedures can sift through many different models and find correlations that exist by chance in the sample. Assess the results critically and use your expertise to determine whether they make sense.</li>
    <li>These procedures cannot take real-world knowledge into account. The model may not be right in a practical sense.</li>
    <li>Stepwise regression does not always choose the model with the largest R-squared value.</li>
</ul>
We saw how stepwise and best subsets regression compare. At this point, there is a logical question. Does one of these procedures work better?

# Which is Better, Stepwise Regression or Best Subsets Regression?

Which automatic variable selection procedure works better? Olejnik, Mills, and Keselman* performed a simulation study to compare how frequently stepwise regression and best subsets regression choose the correct model. The authors included 32 conditions in their study that differ by the number of candidate variables, number of correct variables, sample size, and amount of multicollinearity. For each condition, a computer generated 1000 datasets. The authors analyzed each dataset using both stepwise and best subsets regression. For best subsets regression, they compared the effectiveness of using the lowest Mallows’ Cp to using the highest adjusted R-squared.

Drum roll, please!

The winner is … stepwise regression, but only narrowly!

<b>However, stepwise regression does not usually pick the correct model!</b>

# How Accurate is Stepwise Regression?

Let’s take a closer look at the results. I’m going to cover only the stepwise results. However, best subsets regression using the lowest Mallows’ Cp follows the same patterns and is virtually tied.

First, let’s define some terms in this study.


<ul>
    <li>Authentic variables are the independent variables that truly have a relationship with the dependent variable.</li>
    <li>Noise variables are independent variables that do not have an actual relationship with the dependent variable.</li>
    <li>The correct model includes all of the authentic variables and excludes all of the noise variables.</li>
</ul>
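To show how these terms turn into a measurable success rate, here is a small simulation sketch in Python with NumPy. It is not a reproduction of the study: the predictors are uncorrelated, the selection rule is forward-only with a t-statistic threshold of 2.0 standing in for a significance level, and the names `forward_select` and `correct_model_rate` and all of the default settings are assumptions chosen for illustration.

```python
import numpy as np

def forward_select(X, y, t_enter=2.0):
    """Forward selection: repeatedly add the column with the largest |t| above t_enter."""
    n = len(y)
    selected = []
    while len(selected) < X.shape[1]:
        best_j, best_t = None, t_enter
        for j in range(X.shape[1]):
            if j in selected:
                continue
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            sigma2 = resid @ resid / (n - A.shape[1])
            se = np.sqrt(sigma2 * np.linalg.inv(A.T @ A)[-1, -1])
            t = abs(beta[-1] / se)           # t-statistic of the candidate column
            if t > best_t:
                best_j, best_t = j, t
        if best_j is None:
            break
        selected.append(best_j)
    return set(selected)

def correct_model_rate(n_datasets=200, n=50, n_authentic=2, n_noise=6, seed=0):
    """Share of simulated datasets where the selected set equals the authentic set."""
    rng = np.random.default_rng(seed)
    authentic = set(range(n_authentic))      # first columns are the authentic variables
    hits = 0
    for _ in range(n_datasets):
        X = rng.normal(size=(n, n_authentic + n_noise))
        y = X[:, :n_authentic].sum(axis=1) + rng.normal(size=n)
        # Correct model: every authentic variable and no noise variable.
        hits += forward_select(X, y) == authentic
    return hits / n_datasets

print(correct_model_rate())
```

Changing `n_noise`, `n`, or the strength of the authentic effects lets you explore how those factors move the success rate under these simplified conditions.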