# Extension 5 - Stepwise linear regression

Implement the stepwise linear regression discussed in class, in which variables are added to the regression model one at a time in a greedy fashion. At each step, the variable added (chosen from those not already in the model) should be the one that produces the largest increase in the adjusted $R^2$ value on the validation data.

In [4]:
# Import the required modules

import os
import random
import numpy as np
import matplotlib.pyplot as plt

from data import Data
from linear_regression import LinearRegression

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2



In [5]:
# Storing the file path of testdata52.csv
test52_file_name = os.path.join('data', 'testdata52.csv')

# Loading the data
test52_data = Data(test52_file_name)

# Creating a LinearRegression object
test52_lr = LinearRegression(test52_data)

# Printing the data
print(test52_data)

data/testdata52.csv (20x5)
Headers:
  D0	  D1	  D2	  D3	  D4	
-----------------------
Showing first 5/20 rows.
-0.3	-0.49	1.35	-1.21	0.46	
0.73	0.28	1.15	-0.2	-0.22	
0.48	-0.25	-0.83	0.46	0.35	
0.33	-0.26	1.53	0.24	-1.01	
-0.99	-1.24	0.74	-1.57	1.22	



In [6]:
# Performing stepwise linear regression with D0, D1, D2, and D3 as independent variables and D4 as the dependent variable
test52_lr.stepwise_linear_regression(['D0', 'D1', 'D2', 'D3'], 'D4')

Added variable: D0. Adjusted R^2 increased by 0.054.
Added variable: D3. Adjusted R^2 increased by 0.105.


# Discussion

In this extension, we implemented a stepwise linear regression algorithm, where variables are added to the regression model one-by-one in a greedy fashion based on the increase in adjusted $R^2$ on the validation data. We used the LinearRegression class implemented in the main notebook to perform linear regression and calculate $R^2$ and adjusted $R^2$ values.
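For reference, the standard definition of adjusted $R^2$ for a model with $k$ independent variables fit to $N$ samples is $R^2_{adj} = 1 - (1 - R^2)\frac{N - 1}{N - k - 1}$. The sketch below computes it with plain NumPy. The function names `r_squared` and `adjusted_r_squared` are hypothetical and are not taken from the LinearRegression class, whose internals may differ.

```python
import numpy as np


def r_squared(y, y_pred):
    """Proportion of variance in y explained by the predictions y_pred."""
    ss_res = np.sum((y - y_pred) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the number of predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Illustrative values only, not taken from testdata52.csv
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
r2 = r_squared(y, y_pred)
print(r2, adjusted_r_squared(r2, n_samples=len(y), n_predictors=2))
```

Because the penalty factor grows with $k$, adding a variable raises adjusted $R^2$ only if it improves $R^2$ by more than the penalty costs, which is what makes it usable as a selection criterion.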

We tested the algorithm on the test data set 'testdata52.csv' (20 samples, 5 variables), with D0, D1, D2, and D3 as candidate independent variables and D4 as the dependent variable. The algorithm first added D0, which increased adjusted $R^2$ by 0.054, and then added D3, which increased it by a further 0.105. D1 and D2 were not added, so the final model uses D0 and D3 as independent variables.

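The stepwise linear regression algorithm can be useful for selecting the most important variables in a regression model, especially when dealing with a large number of potential predictors. By adding variables one at a time and selecting the variable that results in the largest increase in adjusted $R^2$, we gradually build a more accurate model. Because adjusted $R^2$ penalizes each additional predictor and is measured on validation data, this also helps limit overfitting. The search is greedy, however, so it is not guaranteed to find the best possible subset of variables.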
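The greedy selection loop can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code of the `stepwise_linear_regression` method: the helper names (`fit_predict`, `adjusted_r2`, `forward_stepwise`), the baseline score of 0, the stopping rule (stop when no candidate raises validation adjusted $R^2$), and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_predict(X_train, y_train, X_val):
    """Fit least squares with an intercept on training rows; predict validation rows."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    coeffs, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    return A_val @ coeffs


def adjusted_r2(y, y_pred, n_predictors):
    """Adjusted R^2 of predictions y_pred against targets y."""
    n = len(y)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def forward_stepwise(X_train, y_train, X_val, y_val, names):
    """Greedily add the column that most increases validation adjusted R^2."""
    selected = []                               # column indices in the model
    remaining = list(range(X_train.shape[1]))   # candidate column indices
    best_score = 0.0                            # assumption: baseline score is 0
    while remaining:
        # Score every candidate when added to the current model
        scores = {}
        for j in remaining:
            cols = selected + [j]
            y_pred = fit_predict(X_train[:, cols], y_train, X_val[:, cols])
            scores[j] = adjusted_r2(y_val, y_pred, len(cols))
        best_j = max(scores, key=scores.get)
        if scores[best_j] <= best_score:
            break                               # no candidate improves the score
        best_score = scores[best_j]
        selected.append(best_j)
        remaining.remove(best_j)
    return [names[j] for j in selected]


# Synthetic data (not testdata52.csv): y depends on x0 and x2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)
print(forward_stepwise(X[:120], y[:120], X[120:], y[120:], ["x0", "x1", "x2", "x3"]))
```

Each pass refits one model per remaining candidate, so the cost is at most on the order of $M^2$ fits for $M$ candidates, far fewer than the $2^M$ fits an exhaustive subset search would need.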

To use this algorithm, specify the candidate independent variables and the dependent variable, then call the stepwise_linear_regression method of the LinearRegression class with these as arguments. The method prints each variable it adds along with the corresponding increase in adjusted $R^2$, and sets the attributes of the LinearRegression object to match the final model.

Overall, the implementation of the stepwise linear regression algorithm was successful, and it is a useful addition to the LinearRegression class for variable selection.