# Machine Learning A-Z: Section 10 Evaluating Regression Model Performance

When we evaluate how well a regression model performs, we often look at the r<sup>2</sup> value. The r<sup>2</sup> value measures how much of the variability in the dependent variable is explained by the independent variables. The closer r<sup>2</sup> is to one, the more variability the model explains. This sounds great, but r<sup>2</sup> has a hidden problem: it never decreases when a variable is added. Any random independent variable we add to the regression model will have some slight correlation with the dependent variable in the sample, and hence will improve the r<sup>2</sup> value on paper, while only adding dead weight to the model (or, worse, reducing its accuracy on new data).
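To make this concrete, here is a minimal sketch on synthetic data (not the startup dataset used later). The helper `r_squared` and the variable names are made up for illustration; it fits ordinary least squares with NumPy and compares r<sup>2</sup> before and after adding a column of pure noise.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    ss_res = np.sum((y - A @ coef) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return 1.0 - ss_res / ss_tot

rng = np.random.RandomState(0)                     # fixed seed, synthetic data only
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(size=50)                  # y really depends on x alone
noise = rng.normal(size=50)                        # unrelated to y by construction

r2_one = r_squared(x.reshape(-1, 1), y)            # model with the real variable
r2_two = r_squared(np.column_stack([x, noise]), y) # same model plus the noise column
print(r2_one, r2_two)
```

Up to floating-point rounding, the second value cannot be lower than the first, even though the extra column carries no real information about `y`.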

An alternative to r<sup>2</sup> is adjusted-r<sup>2</sup>. Adjusted-r<sup>2</sup> is very similar to r<sup>2</sup>, except that it includes a penalty factor that lowers its value for each variable added. Thus adjusted-r<sup>2</sup> can either increase or decrease when a variable is added: for it to increase, the variable's benefit to the fit (the gain in r<sup>2</sup>) must outweigh the penalty for adding an extra variable.
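As a sketch of how the penalty works, the usual textbook formula is adjusted-r<sup>2</sup> = 1 - (1 - r<sup>2</sup>)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The function name and the numbers below are illustrative assumptions, not results from the dataset in this section.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors independent variables."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Illustrative numbers only: 50 observations, and a fourth variable
# that nudges r^2 up from 0.9500 to 0.9502.
adj_three = adjusted_r_squared(0.9500, 50, 3)
adj_four = adjusted_r_squared(0.9502, 50, 4)
print(adj_three, adj_four)
```

Here the small gain in r<sup>2</sup> does not outweigh the penalty, so the adjusted value for the four-variable model comes out lower than for the three-variable model.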

In this section we'll write a function to do bidirectional elimination (stepwise regression). We'll use a significance level of p = 0.05 for both adding and removing variables, and we'll also look at the adjusted-r<sup>2</sup> value.

## Step 1: Import and Prepare the Data

We'll use the template we created in Section 2 to import and preprocess the data.

In [1]:
import numpy as np # Libraries for fast linear algebra and array manipulation
import pandas as pd # Import and manage datasets
from plotly import __version__ as py__version__
import plotly.express as px # Libraries for plotting data
import plotly.graph_objects as go # Libraries for plotting data
from sklearn import __version__ as skl__version__
from sklearn.preprocessing import OneHotEncoder # Libraries to do encoding of categorical variables
from sklearn.compose import ColumnTransformer # Library to transform only certain columns/features at a time
from sklearn.model_selection import train_test_split # Library to split data into training and test sets.
from sklearn.linear_model import LinearRegression # Library for creating Linear Regression Models
from statsmodels import __version__ as statsmodels__version__
import statsmodels.api as sm # Library for OLS regression with p-values and adjusted r-squared

Library versions used in this code:

In [2]:
print('Numpy: ' + np.__version__)
print('Pandas: ' + pd.__version__)
print('Plotly: ' + py__version__)
print('Scikit-learn: ' + skl__version__)
print('Stats Models: ' + statsmodels__version__)

Numpy: 1.16.4
Pandas: 0.25.1
Plotly: 4.0.0
Scikit-learn: 0.21.2
Stats Models: 0.10.1


In [4]:
def LoadData():
    dataset = pd.read_csv('50_Startups.csv')
    return dataset

dataset = LoadData()
print(dataset.head(3))
print()
print(dataset.info())

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          50 non-null float64
Administration     50 non-null float64
Marketing Spend    50 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None


In [7]:
X = dataset.iloc[:,:-1].values # All the columns except the last are features
y = dataset.iloc[:,-1].values # The last column is the dependent variable

# One-hot encode the categorical State column (column index 3).
columntransformer = ColumnTransformer(
    [('State_Category', OneHotEncoder(), [3])],
    remainder = 'passthrough')

X = np.array(columntransformer.fit_transform(X))

# Remove one of the new dummy variables to avoid the dummy variable trap.
X = X[:, 1:]

Our data is prepared and ready for creating the model.

We will create an outer loop to look through the unused variables and decide which is the best to add to the model, and an inner loop to look through the used variables and decide whether any should be removed.

In [15]:
X.shape[1]

5

In [20]:
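usedVariables = []
unUsedVariables = list(range(X.shape[1]))
X = X.astype(float) # The encoded array has dtype object; statsmodels needs floats

while len(unUsedVariables) > 0: # While there are still unused variables and we haven't exited elsewhere, look for the best one to add.
    bestIndex = -1
    bestPValue = 1
    bestAdjR2 = -1
    # Look through the unused variables and find the one to add that most improves the model
    for index in unUsedVariables:
        model = sm.OLS(y, sm.add_constant(X[:, usedVariables + [index]])).fit()
        if model.pvalues[-1] < bestPValue: # The last p-value belongs to the candidate variable
            bestIndex, bestPValue, bestAdjR2 = index, model.pvalues[-1], model.rsquared_adj
    if bestPValue >= 0.05: # No remaining variable meets the significance level for adding
        break
    usedVariables.append(bestIndex)
    unUsedVariables.remove(bestIndex)
    # The inner loop that checks the used variables for removal is not written yet
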

[[1.0 165349.2 136897.8]
 [0.0 162597.7 151377.59]
 [0.0 153441.51 101145.55]
 [1.0 144372.41 118671.85]
 [0.0 142107.34 91391.77]
 [1.0 131876.9 99814.71]
 [0.0 134615.46 147198.87]
 [0.0 130298.13 145530.06]
 [1.0 120542.52 148718.95]
 [0.0 123334.88 108679.17]
 [0.0 101913.08 110594.11]
 [0.0 100671.96 91790.61]
 [0.0 93863.75 127320.38]
 [0.0 91992.39 135495.07]
 [0.0 119943.24 156547.42]
 [1.0 114523.61 122616.84]
 [0.0 78013.11 121597.55]
 [1.0 94657.16 145077.58]
 [0.0 91749.16 114175.79]
 [1.0 86419.7 153514.11]
 [0.0 76253.86 113867.3]
 [1.0 78389.47 153773.43]
 [0.0 73994.56 122782.75]
 [0.0 67532.53 105751.03]
 [1.0 77044.01 99281.34]
 [0.0 64664.71 139553.16]
 [0.0 75328.87 144135.98]
 [1.0 72107.6 127864.55]
 [0.0 66051.52 182645.56]
 [1.0 65605.48 153032.06]
 [0.0 61994.48 115641.28]
 [1.0 61136.38 152701.92]
 [0.0 63408.86 129219.61]
 [0.0 55493.95 103057.49]
 [0.0 46426.07 157693.92]
 [1.0 46014.02 85047.44]
 [0.0 28663.76 127056.21]
 [0.0 44069.95 51283.14]
 [1.0 20229