## Nauman Anwar (22i-0123)
## Forward Selection & Backward Elimination 

## House Prices - Advanced Regression Techniques
Predict sales prices and practice feature engineering.
#### About Dataset
 Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

The dataset is available at https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

#### Forward selection
Forward selection is an iterative method that starts with no features in the model. In each iteration, it adds the feature that most improves the model. It stops when adding another feature no longer improves performance or, as in the code further down, when a fixed number of features (`k_features`) has been selected.
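The loop described above can be sketched in a few lines. This is a minimal illustration, not the mlxtend implementation: it scores each candidate subset with an in-sample least-squares R² on synthetic data rather than with a cross-validated random forest, and it uses only the standard library. The names `solve`, `r2`, `forward_select`, and `min_gain` are hypothetical helpers introduced for this example.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def r2(rows, y, cols):
    """In-sample R^2 of a least-squares fit on the chosen columns plus an intercept."""
    Z = [[1.0] + [row[j] for j in cols] for row in rows]
    p = len(Z[0])
    A = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    g = [sum(z[a] * t for z, t in zip(Z, y)) for a in range(p)]
    beta = solve(A, g)
    mean = sum(y) / len(y)
    sse = sum((t - sum(bi * zi for bi, zi in zip(beta, z))) ** 2 for z, t in zip(Z, y))
    sst = sum((t - mean) ** 2 for t in y)
    return 1.0 - sse / sst

def forward_select(rows, y, min_gain=0.01):
    """Add the best remaining column until the gain in R^2 drops below min_gain."""
    selected, best_score = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining:
        scores = {j: r2(rows, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score < min_gain:
            break  # no candidate improves the model enough
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected, best_score

# Synthetic data (an assumption for the demo): only columns 2 and 4 drive y.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3.0 * r[2] - 2.0 * r[4] + rng.gauss(0, 0.1) for r in rows]

selected, score = forward_select(rows, y)
print(selected, round(score, 3))
```

Because in-sample R² never decreases when a column is added, the sketch needs the `min_gain` tolerance to stop. A cross-validated score, as used with `cv=3` in the notebook, can genuinely fall when a noisy feature is added.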

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

In [22]:
# load the House Prices training data

data = pd.read_csv('./train.csv', nrows=35000)


In [21]:
# check the shape of the loaded data

data.shape

(1460, 38)

In [14]:
# step forward feature selection

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

In [19]:
# here for simplicity I will use only numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(1460, 38)

In [20]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 37), (438, 37))

In [23]:
# find correlated features (they are removed in the next cell)
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

corr_features = correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )

correlated features:  3


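Here, we flag features whose absolute pairwise correlation with another independent variable exceeds the threshold of 0.8. For each such pair, the column that appears later in the DataFrame is the one marked for removal.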
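To make the behaviour of the filter concrete, here is a small standard-library sketch that mirrors the logic of `correlation` on a toy table. The helper names `pearson` and `correlated_columns` and the toy columns `a`, `b`, `c` are made up for illustration; the notebook itself relies on `DataFrame.corr()`.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def correlated_columns(table, threshold):
    """Flag each column that is strongly correlated with an earlier column."""
    names = list(table)
    flagged = set()
    for i, later in enumerate(names):
        for earlier in names[:i]:
            if abs(pearson(table[later], table[earlier])) > threshold:
                flagged.add(later)  # the later column of the pair is dropped
    return flagged

toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [3, 5, 7, 9, 11],   # b = 2 * a + 1, perfectly correlated with a
    "c": [5, 1, 4, 2, 3],    # only weakly correlated with a and b
}
flagged = correlated_columns(toy, 0.8)
print(flagged)  # {'b'}
```

Only `b` is flagged: `a` comes first in the table, so it is kept as the representative of the correlated pair.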

In [24]:
# Removing correlated features 
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((1022, 34), (438, 34))

In [25]:
# replace missing values with 0 so the estimator can be fitted
X_train.fillna(0, inplace=True)

In [26]:
# step forward feature selection

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs1 = SFS(RandomForestRegressor(), 
           k_features=10, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='r2',
           cv=3)

sfs1 = sfs1.fit(np.array(X_train), y_train)


[2023-10-03 00:56:50] Features: 1/10 -- score: 0.6687711501852845
[2023-10-03 00:57:26] Features: 2/10 -- score: 0.7206834604220509
[2023-10-03 00:58:11] Features: 3/10 -- score: 0.7468623330355473
[2023-10-03 00:58:54] Features: 4/10 -- score: 0.7679817772047693
[2023-10-03 00:59:37] Features: 5/10 -- score: 0.7714930945722646
[2023-10-03 01:00:31] Features: 6/10 -- score: 0.7815862575939766
[2023-10-03 01:01:59] Features: 7/10 -- score: 0.8236510506026145
[2023-10-03 01:03:36] Features: 8/10 -- score: 0.8405074978646484
[2023-10-03 01:05:14] Features: 9/10 -- score: 0.8488111620646994
[2023-10-03 01:06:47] Features: 10/10 -- score: 0.853585222011951

In [31]:
sfs1.k_feature_idx_

(4, 5, 6, 9, 14, 16, 17, 19, 24, 32)

In [33]:
X_train.columns[list(sfs1.k_feature_idx_)]

Index(['OverallQual', 'OverallCond', 'YearBuilt', 'BsmtFinSF1', '2ndFlrSF',
       'GrLivArea', 'BsmtFullBath', 'FullBath', 'GarageCars', 'MoSold'],
      dtype='object')

#### Forward feature selection picks the 10 columns above out of the 34 numerical features that remained after the correlation filter.

## Backward Elimination

In backward elimination, we start with all the features and, at each iteration, remove the least significant one, that is, the feature whose removal improves the model most or hurts it least. We repeat this until removing further features brings no improvement or, as in the code below, until `k_features` features remain.
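The elimination loop can be sketched as follows. This is a simplified illustration under stated assumptions, not mlxtend's code: `backward_eliminate` is a hypothetical helper, and `toy_score` stands in for the cross-validated R² with made-up per-feature contributions.

```python
def backward_eliminate(features, score, k):
    """Greedily drop one feature at a time until only k remain."""
    current = list(features)
    while len(current) > k:
        # Score every subset obtained by removing a single feature and
        # drop the feature whose removal leaves the highest score.
        drop = max(current, key=lambda f: score([g for g in current if g != f]))
        current.remove(drop)
    return current

# Hypothetical per-feature contributions, made up for illustration only.
toy_weights = {"x0": 0.50, "x1": 0.30, "x2": 0.00, "x3": 0.02}

def toy_score(subset):
    """Toy stand-in for a model score: the sum of the contributions."""
    return sum(toy_weights[f] for f in subset)

kept = backward_eliminate(toy_weights, toy_score, k=2)
print(kept)  # ['x0', 'x1']
```

With 34 starting features, each elimination round has to refit the model once per remaining feature, which is why the backward run in the notebook takes several minutes per step at the beginning.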

In [35]:
# step backward feature elimination

sfs1 = SFS(RandomForestRegressor(), 
           k_features=10, 
           forward=False, 
           floating=False, 
           verbose=2,
           scoring='r2',
           cv=3)

sfs1 = sfs1.fit(np.array(X_train), y_train)


[2023-10-03 01:18:21] Features: 33/10 -- score: 0.8559718337976711
[2023-10-03 01:24:42] Features: 32/10 -- score: 0.8550439249082022
[2023-10-03 01:29:51] Features: 31/10 -- score: 0.8564527157289809
[2023-10-03 01:34:49] Features: 30/10 -- score: 0.8568093788113665
[2023-10-03 01:40:34] Features: 29/10 -- score: 0.858351365124017
[2023-10-03 01:44:50] Features: 28/10 -- score: 0.8600240559156228
[2023-10-03 01:48:45] Features: 27/10 -- score: 0.8603260052075443
[2023-10-03 01:52:49] Features: 26/10 -- score: 0.8603617910286333
[2023-10-03 01:56:20] Features: 25/10 -- score: 0.8618478458865843
[2023-10-03 01:59:20] Features: 24/10 -- score: 0.8618486620531406
[2023-10-03 02:02:11] Features: 23/10 -- score: 0.8641185338466976
[2023-10-03 02:04:51] Features: 22/10 -- score: 0.8649673286397933
[2023-10-03 02:07:18] Features: 21/10 -- score: 0.8644207161272965
[2023-10-03 02:09:32] Features: 20/10 -- score: 0.8659028631890133
[2023-10-03 02:11:41] Features: 19/10 -- score: 0.864564935501

In [36]:
sfs1.k_feature_idx_

(4, 5, 6, 7, 11, 12, 14, 16, 19, 24)

In [37]:
X_train.columns[list(sfs1.k_feature_idx_)]

Index(['OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtUnfSF',
       'TotalBsmtSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'GarageCars'],
      dtype='object')

#### So, backward feature elimination results in the above 10 features being selected.
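The two selections can be compared directly. The snippet below copies the column names from the two outputs above and uses plain set operations, so it runs without refitting anything.

```python
forward = ['OverallQual', 'OverallCond', 'YearBuilt', 'BsmtFinSF1', '2ndFlrSF',
           'GrLivArea', 'BsmtFullBath', 'FullBath', 'GarageCars', 'MoSold']
backward = ['OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtUnfSF',
            'TotalBsmtSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'GarageCars']

common = sorted(set(forward) & set(backward))        # chosen by both methods
only_forward = sorted(set(forward) - set(backward))  # chosen by forward selection only
only_backward = sorted(set(backward) - set(forward)) # chosen by backward elimination only

print(len(common), common)
print(only_forward)
print(only_backward)
```

Seven of the ten features are shared. The methods disagree on the remaining three: forward selection adds `BsmtFinSF1`, `BsmtFullBath`, and `MoSold`, while backward elimination keeps `BsmtUnfSF`, `TotalBsmtSF`, and `YearRemodAdd`.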

# The End