# Experiment 3 - Feature Selection using Backward Elimination

## Theory
The significance level is the threshold we choose in advance for deciding whether a feature's effect on the final output is statistically significant. It is the probability we accept of concluding that a feature matters when it actually does not. Generally, we take a 5% (0.05) significance level by default.

The p-value of a feature is the probability of seeing an effect at least as strong as the one in our data if the feature actually had no effect on the output.

Let’s say you have a friend who says that a feature is of absolutely no use (this claim is called the null hypothesis). The higher the p-value, the more consistent the data is with his claim, and vice versa.

p-value goes from 0 to 1.

So say column 1 has a p-value of 0.994: we cannot reject the null hypothesis, i.e. this column shows no noticeable effect on the output and can usually be removed without consequences.

Now, column 2 has a p-value of 0.001: we reject the null hypothesis, i.e. it has a very significant effect on the output.

Where do we draw the line between significant and not significant? That is where the significance level comes in.

In our case, we treat any feature with a p-value over 0.05 as not significant and as a candidate for removal.
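
As a small illustration of this rule, the sketch below applies the 0.05 threshold to the two example p-values from above. The dictionary and its column names are placeholders for illustration, not values computed from the housing data.

```python
# Significance level chosen in advance (5%).
SIGNIFICANCE_LEVEL = 0.05

# Example p-values from the text; the column names are hypothetical.
p_values = {"column_1": 0.994, "column_2": 0.001}

# A feature is a candidate for removal when its p-value exceeds the threshold.
to_drop = [name for name, p in p_values.items() if p > SIGNIFICANCE_LEVEL]
to_keep = [name for name, p in p_values.items() if p <= SIGNIFICANCE_LEVEL]

print("drop:", to_drop)
print("keep:", to_keep)
```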

## Importing and fitting data

Importing dependencies.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

import pandas as pd
import numpy as np

import statsmodels.api as sm

Read data from a csv file.

In [2]:
data = pd.read_csv('housing-data.csv')
data

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished


In [3]:
data.columns

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


In [5]:
data.isna().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

In [6]:
data.shape

(545, 13)

In [7]:
categoricalColumns = data.select_dtypes(include='O').keys()
categoricalColumns

Index(['mainroad', 'guestroom', 'basement', 'hotwaterheating',
       'airconditioning', 'prefarea', 'furnishingstatus'],
      dtype='object')

In [8]:
# Select the categorical columns from the frame already in memory instead of re-reading the CSV.
categoricalDf = data[categoricalColumns]
categoricalDf.head()

Unnamed: 0,mainroad,guestroom,basement,hotwaterheating,airconditioning,prefarea,furnishingstatus
0,yes,no,no,no,yes,yes,furnished
1,yes,no,no,no,yes,no,furnished
2,yes,no,yes,no,no,yes,semi-furnished
3,yes,no,yes,no,yes,yes,furnished
4,yes,yes,yes,no,yes,no,furnished


In [9]:
for column in categoricalDf.columns:
    print(column, ':', len(categoricalDf[column].unique()))

mainroad : 2
guestroom : 2
basement : 2
hotwaterheating : 2
airconditioning : 2
prefarea : 2
furnishingstatus : 3


In [10]:
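# Pass the column list via `columns`; the second positional argument of get_dummies is `prefix`.
data = pd.get_dummies(data, columns=categoricalColumns)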
data.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking,mainroad_no,mainroad_yes,guestroom_no,guestroom_yes,...,basement_yes,hotwaterheating_no,hotwaterheating_yes,airconditioning_no,airconditioning_yes,prefarea_no,prefarea_yes,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
0,13300000,7420,4,2,3,2,0,1,1,0,...,0,1,0,0,1,0,1,1,0,0
1,12250000,8960,4,4,4,3,0,1,1,0,...,0,1,0,0,1,1,0,1,0,0
2,12250000,9960,3,2,2,2,0,1,1,0,...,1,1,0,1,0,0,1,0,1,0
3,12215000,7500,4,2,2,3,0,1,1,0,...,1,1,0,0,1,0,1,1,0,0
4,11410000,7420,4,1,2,2,0,1,0,1,...,1,1,0,0,1,1,0,1,0,0
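
One thing to watch with this encoding: every yes/no column becomes a `_no`/`_yes` pair whose values always sum to 1, so one column of each group is redundant once the model has an intercept. A common remedy is `drop_first=True`, sketched below on a tiny made-up frame (the rows are illustrative, not taken from `housing-data.csv`). This is an optional variation, not what the notebook does.

```python
import pandas as pd

# Tiny made-up sample with the same kind of categorical columns.
toy = pd.DataFrame({
    "price": [100, 200, 300],
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# drop_first=True drops the first category of each encoded column;
# dtype=int keeps the dummies as 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy,
    columns=["mainroad", "furnishingstatus"],
    drop_first=True,
    dtype=int,
)
print(list(encoded.columns))
```

With this option a binary column yields a single dummy and the three-level `furnishingstatus` yields two.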


In [11]:
X = data.drop(["price"], axis=1)
t = data["price"]

In [12]:
X.head()

Unnamed: 0,area,bedrooms,bathrooms,stories,parking,mainroad_no,mainroad_yes,guestroom_no,guestroom_yes,basement_no,basement_yes,hotwaterheating_no,hotwaterheating_yes,airconditioning_no,airconditioning_yes,prefarea_no,prefarea_yes,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
0,7420,4,2,3,2,0,1,1,0,1,0,1,0,0,1,0,1,1,0,0
1,8960,4,4,4,3,0,1,1,0,1,0,1,0,0,1,1,0,1,0,0
2,9960,3,2,2,2,0,1,1,0,0,1,1,0,1,0,0,1,0,1,0
3,7500,4,2,2,3,0,1,1,0,0,1,1,0,0,1,0,1,1,0,0
4,7420,4,1,2,2,0,1,0,1,0,1,1,0,0,1,1,0,1,0,0


In [13]:
t.head()

0    13300000
1    12250000
2    12250000
3    12215000
4    11410000
Name: price, dtype: int64

In [14]:
X = MinMaxScaler().fit_transform(X)
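
With its default settings, `MinMaxScaler` rescales every column to the range [0, 1] using (x - min) / (max - min), computed per column. A minimal NumPy sketch of that formula on made-up numbers (the array `X_toy` is hypothetical):

```python
import numpy as np

# Made-up matrix: two feature columns on very different scales.
X_toy = np.array([[2000.0, 1.0],
                  [5000.0, 3.0],
                  [8000.0, 2.0]])

# Column-wise minimum and maximum.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Min-max scaling: the smallest value maps to 0, the largest to 1.
X_scaled = (X_toy - col_min) / (col_max - col_min)
print(X_scaled)
```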

In [15]:
X = X[:, :].tolist()
t = t[:].tolist()

In [16]:
X = sm.add_constant(X)
 
result = sm.OLS(t, X).fit()

print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     87.52
Date:                Tue, 14 Feb 2023   Prob (F-statistic):          9.07e-123
Time:                        09:08:10   Log-Likelihood:                -8331.5
No. Observations:                 545   AIC:                         1.669e+04
Df Residuals:                     531   BIC:                         1.675e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       8.235e+05   4.79e+04     17.207      0.0
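
The `Adj. R-squared` value in the summary penalizes R-squared for the number of fitted terms, which makes it a useful number to watch while removing features. As a check, the sketch below plugs the rounded figures from the summary into the standard formula; the function name `adjusted_r2` is our own.

```python
def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - df_model - 1)


# Values read from the summary: R-squared 0.682, 545 observations, Df Model 13.
adj = adjusted_r2(0.682, 545, 13)
print(round(adj, 3))
```

The result rounds to 0.674, which agrees with the value reported in the summary.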

In [17]:
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size = 0.30)

model = LinearRegression()  
model.fit(X_train, t_train)

score = model.score(X_test, t_test)
score

0.6548670900024449

## Improving Model Accuracy

In [18]:
X.shape

(545, 21)

In [19]:
constValues = np.arange(1, 21, 1)
constValues

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])
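
`constValues` holds the indices of the columns that are still in the model; index 0, the constant added by `sm.add_constant`, is left out. Two details matter in the cells that follow: `X[:, constValues]` selects those columns, and `np.delete` removes by position, not by value. A small sketch on a made-up array (all names here are hypothetical):

```python
import numpy as np

X_toy = np.arange(12).reshape(3, 4)   # made-up 3x4 matrix
cols = np.arange(1, 4)                # column indices 1, 2, 3

# Fancy indexing picks columns 1, 2 and 3 and skips column 0.
subset = X_toy[:, cols]
print(subset)

# np.delete works on positions: position 1 holds the value 2.
by_position = np.delete(cols, 1)

# A boolean mask removes by value instead: drop the index 3.
by_value = cols[cols != 3]

print(by_position, by_value)
```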

### Iteration 1
Checking all the features to find the highest p-valued one.

In [20]:
X_opt = X[:, constValues]

# Column indices noted for removal: 11, 20, 2, 14

regressor_OLS = sm.OLS(t, X_opt).fit()

print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     87.52
Date:                Tue, 14 Feb 2023   Prob (F-statistic):          9.07e-123
Time:                        09:08:11   Log-Likelihood:                -8331.5
No. Observations:                 545   AIC:                         1.669e+04
Df Residuals:                     531   BIC:                         1.675e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1          3.552e+06   3.53e+05     10.052      0.0

In [21]:
constValues = np.delete(constValues, np.where(constValues == 11))

X_opt = X[:, constValues]

regressor_OLS = sm.OLS(t, X_opt).fit()

print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     87.52
Date:                Tue, 14 Feb 2023   Prob (F-statistic):          9.07e-123
Time:                        09:08:11   Log-Likelihood:                -8331.5
No. Observations:                 545   AIC:                         1.669e+04
Df Residuals:                     531   BIC:                         1.675e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1          3.552e+06   3.53e+05     10.052      0.0

In [22]:
constValues = np.delete(constValues, np.where(constValues == 20))

X_opt = X[:, constValues]

regressor_OLS = sm.OLS(t, X_opt).fit()

print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     87.52
Date:                Tue, 14 Feb 2023   Prob (F-statistic):          9.07e-123
Time:                        09:08:11   Log-Likelihood:                -8331.5
No. Observations:                 545   AIC:                         1.669e+04
Df Residuals:                     531   BIC:                         1.675e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1          3.552e+06   3.53e+05     10.052      0.0

In [23]:
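# Remove column index 2 by value (it sits at position 1), consistent with the previous cells.
constValues = np.delete(constValues, np.where(constValues == 2))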

X_opt = X[:, constValues]

regressor_OLS = sm.OLS(t, X_opt).fit()

print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.680
Model:                            OLS   Adj. R-squared:                  0.673
Method:                 Least Squares   F-statistic:                     94.34
Date:                Tue, 14 Feb 2023   Prob (F-statistic):          3.14e-123
Time:                        09:08:11   Log-Likelihood:                -8332.8
No. Observations:                 545   AIC:                         1.669e+04
Df Residuals:                     532   BIC:                         1.675e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1          3.596e+06   3.53e+05     10.194      0.0

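After iteratively removing features, the R-squared score moves from 0.682 to 0.680 and the adjusted R-squared from 0.674 to 0.673, so the removed columns contributed very little to the fit.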