Backward elimination is a feature selection technique used while building a machine learning model. It removes features that do not have a significant effect on the dependent variable (the predicted output).

Steps of Backward Elimination:

Step-1: First, select a significance level (SL) that a predictor must meet to stay in the model (SL = 0.05).
Step-2: Fit the complete model with all possible predictors/independent variables.

Step-3: Choose the predictor with the highest p-value.

If p-value > SL, go to Step 4.
Otherwise, finish: the model is ready.
Step-4: Remove that predictor.

Step-5: Rebuild and fit the model with the remaining variables, then return to Step 3.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data_set=pd.read_csv('50_Startups.csv')

In [3]:
data_set

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


In [4]:
# Extracting Independent and dependent Variable  
x=data_set.iloc[:,:-1].values
y=data_set.iloc[:,4].values

In [5]:
x

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 29491

In [6]:
y

array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
       156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
       141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
       124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
       108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
        99937.59,  97483.56,  97427.84,  96778.92,  96712.8 ,  96479.51,
        90708.19,  89949.14,  81229.06,  81005.76,  78239.91,  77798.83,
        71498.49,  69758.98,  65200.33,  64926.08,  49490.75,  42559.73,
        35673.41,  14681.4 ])

In [7]:
# Categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [8]:
ct1=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[3])],remainder='passthrough')

In [9]:
encoded_data_x=ct1.fit_transform(x)

In [10]:
# Convert encoded data to dataframe
encoded_df1_x=pd.DataFrame(encoded_data_x,columns=ct1.get_feature_names_out())

In [11]:
encoded_df1_x

Unnamed: 0,encoder__x3_California,encoder__x3_Florida,encoder__x3_New York,remainder__x0,remainder__x1,remainder__x2
0,0.0,0.0,1.0,165349.2,136897.8,471784.1
1,1.0,0.0,0.0,162597.7,151377.59,443898.53
2,0.0,1.0,0.0,153441.51,101145.55,407934.54
3,0.0,0.0,1.0,144372.41,118671.85,383199.62
4,0.0,1.0,0.0,142107.34,91391.77,366168.42
5,0.0,0.0,1.0,131876.9,99814.71,362861.36
6,1.0,0.0,0.0,134615.46,147198.87,127716.82
7,0.0,1.0,0.0,130298.13,145530.06,323876.68
8,0.0,0.0,1.0,120542.52,148718.95,311613.29
9,1.0,0.0,0.0,123334.88,108679.17,304981.62


In [12]:
encoded_df1_x.columns = ['California', 'Florida','NewYork','R&D Spend','Administration','Marketing Spend']

In [13]:
encoded_df1_x

Unnamed: 0,California,Florida,NewYork,R&D Spend,Administration,Marketing Spend
0,0.0,0.0,1.0,165349.2,136897.8,471784.1
1,1.0,0.0,0.0,162597.7,151377.59,443898.53
2,0.0,1.0,0.0,153441.51,101145.55,407934.54
3,0.0,0.0,1.0,144372.41,118671.85,383199.62
4,0.0,1.0,0.0,142107.34,91391.77,366168.42
5,0.0,0.0,1.0,131876.9,99814.71,362861.36
6,1.0,0.0,0.0,134615.46,147198.87,127716.82
7,0.0,1.0,0.0,130298.13,145530.06,323876.68
8,0.0,0.0,1.0,120542.52,148718.95,311613.29
9,1.0,0.0,0.0,123334.88,108679.17,304981.62


In [14]:
#avoiding the dummy variable trap:  
encoded_df1_x= encoded_df1_x.drop(['California'],axis=1)
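
The reason for dropping one dummy column can be checked numerically. The sketch below uses a small made-up `State` column (an assumption, not the notebook's data) and shows that a constant column plus all three dummy columns is rank deficient, while dropping one dummy restores full column rank. `pd.get_dummies(..., drop_first=True)` is one way to drop the first category directly.

```python
import numpy as np
import pandas as pd

# Small illustrative sample of the categorical column (made-up rows).
states = pd.Series(
    ["New York", "California", "Florida", "New York", "Florida", "California"],
    name="State",
)

all_dummies = pd.get_dummies(states, dtype=float)                # 3 dummy columns
reduced = pd.get_dummies(states, drop_first=True, dtype=float)   # drops 'California'

const = np.ones((len(states), 1))
full_design = np.hstack([const, all_dummies.to_numpy()])         # constant + 3 dummies
reduced_design = np.hstack([const, reduced.to_numpy()])          # constant + 2 dummies

# The three dummies always sum to the constant column, so one column is redundant.
rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)
print(full_design.shape[1], rank_full)
print(reduced_design.shape[1], rank_reduced)
```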

In [15]:
encoded_df1_x = encoded_df1_x.astype(int)

In [16]:
encoded_df1_x

Unnamed: 0,Florida,NewYork,R&D Spend,Administration,Marketing Spend
0,0,1,165349,136897,471784
1,0,0,162597,151377,443898
2,1,0,153441,101145,407934
3,0,1,144372,118671,383199
4,1,0,142107,91391,366168
5,0,1,131876,99814,362861
6,0,0,134615,147198,127716
7,1,0,130298,145530,323876
8,0,1,120542,148718,311613
9,0,0,123334,108679,304981


In [17]:
# Splitting the dataset into training and test set.  
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(encoded_df1_x, y, test_size= 0.2, random_state=42)  

In [18]:
x_train.head()

Unnamed: 0,Florida,NewYork,R&D Spend,Administration,Marketing Spend
12,1,0,93863,127320,249839
4,1,0,142107,91391,366168
37,0,0,44069,51283,197029
8,0,1,120542,148718,311613
3,0,1,144372,118671,383199


In [19]:
y_train

array([141585.52, 166187.94,  89949.14, 152211.77, 182901.99, 156122.51,
        77798.83,  49490.75,  42559.73, 129917.04, 149759.96, 126992.93,
       108552.04,  96712.8 ,  97483.56, 192261.83,  65200.33, 105008.31,
        96778.92, 156991.12, 101004.64, 144259.4 ,  90708.19, 191792.06,
       111313.02, 191050.39,  69758.98,  96479.51, 108733.99,  78239.91,
       146121.95, 110352.25, 124266.9 ,  14681.4 , 118474.03, 155752.6 ,
        71498.49, 132602.65, 103282.38,  81229.06])

In [20]:
#Fitting the MLR model to the training set:  
from sklearn.linear_model import LinearRegression  
regressor= LinearRegression()  
regressor.fit(x_train, y_train)  

In [21]:
#Predicting the Test set result;  
y_pred= regressor.predict(x_test)

In [22]:
print('Train Score: ', regressor.score(x_train, y_train))  
print('Test Score: ', regressor.score(x_test, y_test))  

Train Score:  0.9537015031120023
Test Score:  0.8987245306209188


In [23]:
# The difference between both scores is 
0.9537019995248526-0.8987266414319832

0.05497535809286935

In [24]:
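import statsmodels.api as smf
# statsmodels provides estimators for statistical models such as OLS.
# Note: the alias smf is more commonly used for statsmodels.formula.api; it is kept here because later cells use it.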

Adding a column to the matrix of features: the MLR equation (a) contains a constant term b0, but the matrix of features has no column for it, and the OLS class of statsmodels does not add an intercept automatically. We therefore add a column of ones (x0 = 1) to go with the constant term b0.

To add it, we use the append function of the NumPy library to place a column of 1s in front of the features. Below is the code for it.
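
To see what the column of ones does, here is a minimal sketch on made-up data (the values and the use of `np.linalg.lstsq` are assumptions for illustration). With the ones column in front, the first fitted coefficient plays the role of b0.

```python
import numpy as np

# Made-up data following y = 10 + 2*x exactly.
x = np.arange(1.0, 6.0).reshape(-1, 1)
y = 10.0 + 2.0 * x[:, 0]

# Prepend a column of ones, in the same style as the notebook's np.append call.
X = np.append(arr=np.ones((x.shape[0], 1)), values=x, axis=1)

# Ordinary least squares on the augmented matrix: coef[0] is b0, coef[1] is b1.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = coef
print(X.shape, b0, b1)
```

statsmodels also offers `add_constant`, which adds the same column without hard-coding the number of rows.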

In [25]:
# encoded_df1_x=np.append(arr=np.ones((50,1)).astype(int),values=x,axis=1)

In [26]:
encoded_df1_x

Unnamed: 0,Florida,NewYork,R&D Spend,Administration,Marketing Spend
0,0,1,165349,136897,471784
1,0,0,162597,151377,443898
2,1,0,153441,101145,407934
3,0,1,144372,118671,383199
4,1,0,142107,91391,366168
5,0,1,131876,99814,362861
6,0,0,134615,147198,127716
7,1,0,130298,145530,323876
8,0,1,120542,148718,311613
9,0,0,123334,108679,304981


In [27]:
encoded_df1_x=np.append(arr=np.ones((50,1)).astype(int),values=encoded_df1_x,axis=1)

In [28]:
encoded_df1_x

array([[     1,      0,      1, 165349, 136897, 471784],
       [     1,      0,      0, 162597, 151377, 443898],
       [     1,      1,      0, 153441, 101145, 407934],
       [     1,      0,      1, 144372, 118671, 383199],
       [     1,      1,      0, 142107,  91391, 366168],
       [     1,      0,      1, 131876,  99814, 362861],
       [     1,      0,      0, 134615, 147198, 127716],
       [     1,      1,      0, 130298, 145530, 323876],
       [     1,      0,      1, 120542, 148718, 311613],
       [     1,      0,      0, 123334, 108679, 304981],
       [     1,      1,      0, 101913, 110594, 229160],
       [     1,      0,      0, 100671,  91790, 249744],
       [     1,      1,      0,  93863, 127320, 249839],
       [     1,      0,      0,  91992, 135495, 252664],
       [     1,      1,      0, 119943, 156547, 256512],
       [     1,      0,      1, 114523, 122616, 261776],
       [     1,      0,      0,  78013, 121597, 264346],
       [     1,      0,      1,

Step 2:

Now we apply the backward elimination process. First we create a feature matrix x_opt, which will eventually contain only the independent features that significantly affect the dependent variable.
Next, as per the backward elimination process, we choose a significance level (0.05) and fit the model with all possible predictors. To fit the model, we create a regressor_OLS object from the OLS class of the statsmodels library and fit it using the fit() method.
We then need the p-values to compare against the SL value, so we call the summary() method to get the summary table. Below is the code for it:

In [29]:
encoded_df1_x = encoded_df1_x.astype(int)

In [30]:
encoded_df1_x

array([[     1,      0,      1, 165349, 136897, 471784],
       [     1,      0,      0, 162597, 151377, 443898],
       [     1,      1,      0, 153441, 101145, 407934],
       [     1,      0,      1, 144372, 118671, 383199],
       [     1,      1,      0, 142107,  91391, 366168],
       [     1,      0,      1, 131876,  99814, 362861],
       [     1,      0,      0, 134615, 147198, 127716],
       [     1,      1,      0, 130298, 145530, 323876],
       [     1,      0,      1, 120542, 148718, 311613],
       [     1,      0,      0, 123334, 108679, 304981],
       [     1,      1,      0, 101913, 110594, 229160],
       [     1,      0,      0, 100671,  91790, 249744],
       [     1,      1,      0,  93863, 127320, 249839],
       [     1,      0,      0,  91992, 135495, 252664],
       [     1,      1,      0, 119943, 156547, 256512],
       [     1,      0,      1, 114523, 122616, 261776],
       [     1,      0,      0,  78013, 121597, 264346],
       [     1,      0,      1,

In [31]:
x_opt=encoded_df1_x[:,[0,1,2,3,4,5]]
regressor_OLS=smf.OLS(endog=y,exog=x_opt).fit()
regressor_OLS.summary()

Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,169.9
Date:,"Mon, 19 Jun 2023",Prob (F-statistic):,1.34e-27
Time:,22:25:50,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1063.0
Df Residuals:,44,BIC:,1074.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.013e+04,6884.855,7.281,0.000,3.63e+04,6.4e+04
x1,198.7542,3371.026,0.059,0.953,-6595.103,6992.611
x2,-42.0063,3256.058,-0.013,0.990,-6604.161,6520.148
x3,0.8060,0.046,17.368,0.000,0.712,0.900
x4,-0.0270,0.052,-0.517,0.608,-0.132,0.078
x5,0.0270,0.017,1.574,0.123,-0.008,0.062

Omnibus:,14.783,Durbin-Watson:,1.283
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.267
Skew:,-0.948,Prob(JB):,2.41e-05
Kurtosis:,5.572,Cond. No.,1450000.0


From the table, the two dummy variables have the highest p-values: x2 (NewYork) at 0.990 and x1 (Florida) at 0.953. Both are far greater than the SL value. Strictly, x2 has the highest p-value and would be removed first; here we remove x1 (column index 1) and refit the model, and the remaining dummy variable is removed in the step after that. Below is the code for it:



In [32]:
x_opt=encoded_df1_x[:, [0,2,3,4,5]]  
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()  
regressor_OLS.summary()  

Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,217.2
Date:,"Mon, 19 Jun 2023",Prob (F-statistic):,8.5e-29
Time:,22:25:50,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1061.0
Df Residuals:,45,BIC:,1070.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.018e+04,6747.657,7.437,0.000,3.66e+04,6.38e+04
x1,-136.6070,2801.735,-0.049,0.961,-5779.592,5506.378
x2,0.8059,0.046,17.571,0.000,0.714,0.898
x3,-0.0269,0.052,-0.521,0.605,-0.131,0.077
x4,0.0271,0.017,1.625,0.111,-0.007,0.061

Omnibus:,14.892,Durbin-Watson:,1.284
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.666
Skew:,-0.949,Prob(JB):,1.97e-05
Kurtosis:,5.608,Cond. No.,1430000.0


In [33]:
x_opt=encoded_df1_x[:, [0,3,4,5]]  
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()  
regressor_OLS.summary()  

Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,296.0
Date:,"Mon, 19 Jun 2023",Prob (F-statistic):,4.53e-30
Time:,22:25:50,Log-Likelihood:,-525.39
No. Observations:,50,AIC:,1059.0
Df Residuals:,46,BIC:,1066.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.012e+04,6572.384,7.626,0.000,3.69e+04,6.34e+04
x1,0.8057,0.045,17.846,0.000,0.715,0.897
x2,-0.0268,0.051,-0.526,0.602,-0.130,0.076
x3,0.0272,0.016,1.655,0.105,-0.006,0.060

Omnibus:,14.839,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.443
Skew:,-0.949,Prob(JB):,2.21e-05
Kurtosis:,5.587,Cond. No.,1400000.0


In [34]:
x_opt=encoded_df1_x[:, [0,3,5]]  
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()  
regressor_OLS.summary()  

Dep. Variable:,y,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,450.8
Date:,"Mon, 19 Jun 2023",Prob (F-statistic):,2.16e-31
Time:,22:25:50,Log-Likelihood:,-525.54
No. Observations:,50,AIC:,1057.0
Df Residuals:,47,BIC:,1063.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.698e+04,2689.941,17.464,0.000,4.16e+04,5.24e+04
x1,0.7966,0.041,19.265,0.000,0.713,0.880
x2,0.0299,0.016,1.927,0.060,-0.001,0.061

Omnibus:,14.678,Durbin-Watson:,1.257
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.162
Skew:,-0.939,Prob(JB):,2.54e-05
Kurtosis:,5.575,Cond. No.,532000.0


In [35]:
x_opt=encoded_df1_x[:, [0,3]]  
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()  
regressor_OLS.summary()  

Dep. Variable:,y,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Mon, 19 Jun 2023",Prob (F-statistic):,3.50e-32
Time:,22:25:50,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.903e+04,2537.900,19.320,0.000,4.39e+04,5.41e+04
x1,0.8543,0.029,29.151,0.000,0.795,0.913

Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.538
Skew:,-0.911,Prob(JB):,9.43e-05
Kurtosis:,5.361,Cond. No.,165000.0


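As we can see in the output above, the Administration variable (p = 0.602) and then the Marketing Spend variable (p = 0.060) were removed, leaving only the constant and R&D Spend. So R&D Spend is the only independent variable that is significant at SL = 0.05, and we can build the prediction model using this variable alone.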

Below is the code for building a linear regression model using only R&D Spend (with a single predictor, this is simple rather than multiple linear regression):

In [36]:
import numpy as np
import pandas as pd

In [37]:
data_set=pd.read_csv('50_Startups.csv')

In [38]:
data_set.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [39]:
x_BE=data_set.iloc[:,0]
y_BE=data_set.iloc[:,-1]

In [40]:
x_BE.head()

0    165349.20
1    162597.70
2    153441.51
3    144372.41
4    142107.34
Name: R&D Spend, dtype: float64

In [41]:
y_BE.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64

In [42]:
from sklearn.model_selection import train_test_split
x_BE_train,x_BE_test,y_BE_train,y_BE_test=train_test_split(x_BE,y_BE,test_size=0.2,random_state=0)

In [43]:
# Fitting the linear regression model to the training set
from sklearn.linear_model import LinearRegression  
regressor= LinearRegression()  
regressor.fit(np.array(x_BE_train).reshape(-1,1), y_BE_train)  

In [44]:
x_BE_train.head()

33    55493.95
35    46014.02
26    75328.87
34    46426.07
18    91749.16
Name: R&D Spend, dtype: float64

In [45]:
# Predicting the test set result
y_pred= regressor.predict(np.array(x_BE_test).reshape(-1,1))

In [46]:
# Checking the score
print('Train Score:',regressor.score(np.array(x_BE_train).reshape(-1,1), y_BE_train))
print('Test Score:',regressor.score(np.array(x_BE_test).reshape(-1,1), y_BE_test))

Train Score: 0.9449589778363044
Test Score: 0.9464587607787219


As we can see, the training score is about 0.945 and the test score is about 0.946. Note that score() returns the coefficient of determination (R²), not an accuracy percentage. The difference between the two scores is about 0.0015.
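
For a regressor such as LinearRegression, score() reports the coefficient of determination R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers; the helper name `r_squared` and the sample values are assumptions for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)       # predictions match exactly
mean_only = r_squared(y_true, [6.0] * 4)  # always predicting the mean of y_true
print(perfect, mean_only)
```

A model that always predicts the mean scores 0, and a perfect fit scores 1, so a value near 0.945 means the model explains roughly 94.5% of the variance in Profit.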