# Machine Learning A-Z: Section 10 Evaluating Regression Model Performance

When we evaluate how well a regression model performs, we often look at the r<sup>2</sup> value. The r<sup>2</sup> value measures how much of the variability in the dependent variable is explained by the independent variables. The closer r<sup>2</sup> is to one, the more variability the model explains. This sounds great, but r<sup>2</sup> has a hidden problem. Any independent variable we add to the regression model, even a random one, will almost always have some slight correlation with the dependent variable in the sample, so r<sup>2</sup> never decreases when a variable is added. The new variable improves r<sup>2</sup> on paper, but it just adds dead weight to the model (or worse, reduces its accuracy on new data).
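
To make this concrete, here is a minimal sketch on synthetic data (not the dataset used later in this section). The helper `r_squared` and the variable names are hypothetical, and the fit uses NumPy's least-squares solver rather than any particular modelling library. It compares r<sup>2</sup> before and after appending a column of pure noise.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (an intercept is added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return 1 - ss_res / ss_tot

rng = np.random.RandomState(0)
n = 50
x = rng.normal(size=(n, 1))                  # one genuinely useful feature
y = 3 * x[:, 0] + rng.normal(size=n)         # target depends on x plus noise
noise = rng.normal(size=(n, 1))              # unrelated to y by construction

r2_base = r_squared(x, y)
r2_with_noise = r_squared(np.hstack([x, noise]), y)
print(r2_base, r2_with_noise)                # the second value is never smaller than the first
```

Because the smaller model is a special case of the larger one (set the noise coefficient to zero), the least-squares fit with the extra column cannot have a larger residual sum of squares.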

An alternative to r<sup>2</sup> is adjusted-r<sup>2</sup>. Adjusted-r<sup>2</sup> is very similar to r<sup>2</sup>, with the difference being a penalty factor that reduces adjusted-r<sup>2</sup> for each variable in the model. Thus adjusted-r<sup>2</sup> can either increase or decrease when a variable is added: for it to increase, the gain in r<sup>2</sup> must outweigh the penalty for the extra variable.
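
As a sketch of how the penalty works, the usual formula is adjusted-r<sup>2</sup> = 1 - (1 - r<sup>2</sup>)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The function name `adjusted_r2` and the example numbers below are made up for illustration and are not results from this section's dataset.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a model with an intercept and n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical case: a 4th variable nudges r^2 from 0.950 to 0.951 on 50 rows.
before = adjusted_r2(0.950, n_samples=50, n_features=3)
after = adjusted_r2(0.951, n_samples=50, n_features=4)
print(before, after)   # the tiny gain in r^2 does not cover the penalty, so 'after' is lower
```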

In this section we'll write a function to do bi-directional elimination (stepwise regression). We'll use the p = 0.05 significance level for both adding and removing variables, but we'll also consider the effect on the adjusted-r<sup>2</sup> value.

## Step 1 Import and Prepare the data.

We'll use the template we created in Section 2 to import and preprocess the data.

In [1]:
import numpy as np # Libraries for fast linear algebra and array manipulation
import pandas as pd # Import and manage datasets
from plotly import __version__ as py__version__
import plotly.express as px # Libraries for plotting data
import plotly.graph_objects as go # Libraries for plotting data
from sklearn import __version__ as skl__version__
from sklearn.preprocessing import OneHotEncoder # Libraries to do encoding of categorical variables
from sklearn.compose import ColumnTransformer # Library to transform only certain columns/features at a time
from sklearn.model_selection import train_test_split # Library to split data into training and test sets.
from sklearn.linear_model import LinearRegression # Library for creating Linear Regression Models
from statsmodels import __version__ as statsmodels__version__
import statsmodels.api as sm

Library versions used in this code:

In [2]:
print('Numpy: ' + np.__version__)
print('Pandas: ' + pd.__version__)
print('Plotly: ' + py__version__)
print('Scikit-learn: ' + skl__version__)
print('Stats Models: ' + statsmodels__version__)

Numpy: 1.16.4
Pandas: 0.25.1
Plotly: 4.0.0
Scikit-learn: 0.21.2
Stats Models: 0.10.1


In [3]:
def LoadData():
    dataset = pd.read_csv('50_Startups.csv')
    return dataset

dataset = LoadData()
print(dataset.head(3))
print()
print(dataset.info())

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          50 non-null float64
Administration     50 non-null float64
Marketing Spend    50 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None


In [17]:
X = dataset.iloc[:,:-1].values # All the columns except the last are features
y = dataset.iloc[:,-1].values # The last column is the dependent variable

#Do the One-Hot encoding on our categorical data.
columntransformer = ColumnTransformer(
    [('Country_Category', OneHotEncoder(), [3])],
    remainder = 'passthrough')

X = np.array(columntransformer.fit_transform(X))

#Remove one of the new dummy variables to avoid the dummy variable trap.
X = X [:,1:]

print(X[0:2,:])

[[0.0 1.0 165349.2 136897.8 471784.1]
 [0.0 0.0 162597.7 151377.59 443898.53]]


Our data is prepared and ready for creating the model.

We will create an outer loop that looks through the unused variables and decides which one is best to add to the model, and an inner loop that looks through the used variables and decides whether any should be removed.

In [5]:
X.shape[1]

5

In [21]:
usedVariables = []
unUsedVariables = [i for i in range(0,X.shape[1])]

currentAdjR2 = -1
padd = 0.05
premove = 0.05
addedCol = True
removedCol = True

def evaluate_OLS_model(y, variables):
    # Fit an OLS model and return its p-values and adjusted R-squared.
    # First prepend a column of ones so the model has a constant (intercept) term.
    variables = np.append(arr = np.ones((len(variables),1)).astype(int), values = variables.astype(float), axis = 1)
    # Fit the model
    ols_results = sm.OLS(endog = y, exog = variables).fit()
    # Return the p-values (skipping the constant in position 0) and the adjusted R-squared
    return (ols_results.pvalues.tolist()[1:], ols_results.rsquared_adj)

while len(unUsedVariables) > 0 and (addedCol or removedCol) : # While there are still unused variables and the last pass changed the model, look for the best variable to add.
    bestIndex = -1
    bestPValue = 1
    bestAdjR2 = -1
    addedCol = False
    removedCol = False
    #look through the variables and find the one to add that most improves the model
    for index in range(0, len(unUsedVariables)):
        test_columns = usedVariables + [unUsedVariables[index]]
        (pvals, r2) = evaluate_OLS_model(y, X[:,test_columns])
        if (bestPValue > pvals[-1]):#is this the best p-value we've found?
            bestIndex = index
            bestPValue = pvals[-1]
            bestAdjR2 = r2
    
    #Check to see if we should add the best column we found
    if (bestPValue < padd or bestAdjR2 > currentAdjR2): #The column is significant at the chosen level or it improves adjusted R2
        usedVariables.append(unUsedVariables.pop(bestIndex))
        currentAdjR2 = bestAdjR2
        addedCol = True
    else: #No Column is significant at the chosen level or improves R2
        break
    
    #Check to see if any variables should be removed; we can't test a model with fewer than 1 variable
    while len(usedVariables) > 1:
        (pvals, r2) = evaluate_OLS_model(y, X[:,usedVariables])
        if(max(pvals) > premove):
            #remove the worst column and see if the model is better. 
            worst_col = usedVariables.pop(pvals.index(max(pvals)))
            (test_pvals, test_r2) = evaluate_OLS_model(y, X[:,usedVariables])
            if(test_r2 >= r2): #the model is better without the worst column, remove it
                unUsedVariables.append(worst_col)
                removedCol = True
            else: #the model is better with the worst column even though it doesn't meet the significance level
                usedVariables.append(worst_col)
                currentAdjR2 = r2
                break
        else:
            #no p-values were large enough to be removed, stop iterating
            currentAdjR2 = r2
            break

print(f'Final Model Columns: {usedVariables}')

Final Model Columns: [2, 4]
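
The result is a list of column indices into `X`, so it helps to translate them back into names. The sketch below assumes `OneHotEncoder` ordered the `State` categories alphabetically (California, Florida, New York), which matches the encoded rows printed earlier, and that the first dummy column was the one dropped. The list `feature_names` and the `State_...` labels are hypothetical names introduced here for illustration.

```python
# Column order of X after one-hot encoding 'State' and dropping the first dummy.
# Assumption: the dropped dummy is 'California' (categories sorted alphabetically).
feature_names = ['State_Florida', 'State_New York',
                 'R&D Spend', 'Administration', 'Marketing Spend']

final_columns = [2, 4]   # the indices printed above
selected = [feature_names[i] for i in final_columns]
print(selected)          # ['R&D Spend', 'Marketing Spend']
```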


Bi-directional elimination is an interesting way to create a model. You start by finding the independent variable that has the best correlation with the dependent variable and add it to the model. Then you find the next independent variable which, when added to the current model, improves it the most. After every addition we check whether any independent variables should be removed. At first glance this appears to be redundant: why would a variable be removed if it was already important enough to get added? Consider the following example:
>We have 3 potential columns to use in our model: [__A__, __B__, __C__]
>__A__ by itself correlates better than either __B__ or __C__ by themselves
>However, __B__ and __C__ together correlate better than __A__ and make __A__ redundant (no additional improvement to adjusted-R<sup>2</sup>)
>__A__ and __C__ together also correlate well and perform better than __A__ alone.
>In this case the bi-directional elimination would first add __A__ to the model.
>In the next iteration, __C__ would be added to the model.
>Checking for columns to remove would show that nothing needs to be removed.
>In the next iteration, __B__ would be added to the model.
>Now checking whether anything needs to be removed would show that __A__ should be removed.
>One more pass through the checks would show that __A__ shouldn't be added back and nothing else needs to be removed.
>Our final model result would be [__C__, __B__]
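
The scenario above can be imitated with synthetic data. The following is only an illustrative sketch: the columns `A`, `B`, `C`, the helper `adjusted_r2`, and the way the data is generated are all assumptions made for this example, and it scores each subset directly by adjusted-R<sup>2</sup> with NumPy rather than running the stepwise loop.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on the columns of X (intercept added here)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.RandomState(42)
n = 1000
B = rng.normal(size=n)
C = rng.normal(size=n)
A = B + C + rng.normal(size=n)           # A is a noisy blend of B and C
y = B + C + 0.1 * rng.normal(size=n)     # the target is really driven by B and C

columns = {'A': A, 'B': B, 'C': C}
scores = {}
for subset in [('A',), ('B',), ('C',), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]:
    X = np.column_stack([columns[name] for name in subset])
    scores[subset] = adjusted_r2(X, y)
    print(subset, round(scores[subset], 3))
```

On data built this way, `A` alone scores higher than `B` or `C` alone, `A` with `C` beats `A` alone, and `B` with `C` beats both, while adding `A` to `B` and `C` changes the score very little. That is the pattern that leads a bi-directional search to add `A` first and drop it later.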