---
## 7.&nbsp;Challenge 😃

Incorporate feature selection into your Housing Regression Pipeline. While you don't have to apply all the methods shown here, we suggest experimenting with 2-3 methods and ultimately choosing one. Research how to **integrate feature selection into your Scikit-Learn pipeline** and consider tuning it with GridSearchCV, which will enhance the cohesiveness of your ML workflow, though it's not mandatory. Happy experimenting!
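
As a starting point for the challenge, a feature selector can sit inside a Scikit-Learn pipeline like any other transformer, and its threshold can be tuned with `GridSearchCV`. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` instead of the housing set, and the names `demo_pipe`, `param_grid`, and `search`, as well as the grid values, are made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: not the real dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=20, n_informative=5,
                                 noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling, feature selection and the model live in one pipeline,
# so every step is fit on the training folds only
demo_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("selector", VarianceThreshold()),
    ("model", DecisionTreeRegressor(random_state=42)),
])

# Step names prefix the parameter names: <step>__<parameter>
param_grid = {
    "selector__threshold": [0.0, 0.001, 0.01],
    "model__max_depth": [3, 5, None],
}

search = GridSearchCV(demo_pipe, param_grid, cv=5, scoring="r2")
search.fit(Xd_train, yd_train)

print(search.best_params_)
print(search.score(Xd_test, yd_test))
```

In the housing pipeline, the `preprocessor` built further down in this notebook would be the first step, ahead of the scaler.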

In [1]:
import pandas as pd

In [2]:
#import data
data = pd.read_csv('/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_7 Supervised_ML/Data/housing_iteration_6_regression/housing_iteration_6_regression.csv')

In [3]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [5]:
X = data.drop('Id', axis=1).copy()

In [6]:
y = X.pop('SalePrice')

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# building preprocessor

In [9]:
# Importing the necessary libraries

from sklearn import set_config

# Setting the display configuration for scikit-learn
set_config(display='diagram', transform_output='pandas')


In [10]:
# Separate categorical and numerical features
X_cat = X.select_dtypes(exclude='number').copy()
X_num = X.select_dtypes(include='number').copy()


In [11]:
#creating pipeline for numeric columns
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
numeric_pipe = make_pipeline(SimpleImputer(strategy='mean'))

In [12]:
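# creating pipeline for categorical columns
# Columns with a natural order get ordinal encoding; the rest are one-hot encoded
ordinal_cols_names = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'KitchenQual', 'FireplaceQu']
# Positions of the ordinal columns within X_cat
ordinal_cols = X_cat.columns.get_indexer(ordinal_cols_names)
# Remaining categorical columns (a set, so their order is not guaranteed between sessions)
non_ordinal_cols_names = set(X_cat.columns).difference(ordinal_cols_names)
onehot_cols = X_cat.columns.get_indexer(list(non_ordinal_cols_names))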

ordinal_cols


array([18, 19, 21, 22, 23, 24, 30, 32])

In [13]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.pipeline import make_pipeline
ordinal_cols_rank = [['N_A', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                     ['N_A', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                     ['N_A', 'NA','Po', 'Fa', 'TA', 'Gd', 'Ex'],
                     ['N_A', 'NA','Po', 'Fa', 'TA', 'Gd', 'Ex'],
                     ['N_A', 'NA', 'No', 'Mn', 'Av', 'Gd'],
                     ['N_A', 'NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'], 
                     ['N_A', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                     ['N_A', 'NA','Po', 'Fa', 'TA', 'Gd', 'Ex']]

#define categorical preprocessor
categorical_preprocessor = make_column_transformer(
    (OrdinalEncoder(categories=ordinal_cols_rank), ordinal_cols),
    (OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), onehot_cols)
)

#create categorical_pipeline
categorical_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='N_A'),
                                 categorical_preprocessor)

#create final preprocessor
preprocessor = make_column_transformer(
    (numeric_pipe, X_num.columns),
    (categorical_pipe, X_cat.columns)
)
preprocessor

---
## .&nbsp;Baseline Model 🧱

In this codealong, we will focus on comparing the performance of two baseline models: a Decision Tree and a K-Nearest Neighbors model, both with default parameters (for K-Nearest Neighbors this means `n_neighbors=5`). For simplicity, we won't be tuning the model parameters here. However, you are encouraged to explore parameter tuning independently. After applying each new feature selection strategy, we will track and evaluate the models to understand its impact on the predictive performance.

## Decision Tree Regressor (base model)

In [16]:
from sklearn.tree import DecisionTreeRegressor

decision_tree_pipeline = make_pipeline(preprocessor,
                                       DecisionTreeRegressor())
decision_tree_pipeline.fit(X_train, y_train)

In [18]:
dt_pred = decision_tree_pipeline.predict(X_test)



### error analysis (r2_score)

In [19]:
from sklearn.metrics import r2_score
r2_dt_tree = r2_score(y_test, dt_pred)
r2_dt_tree

0.7846831873401106

## KNeighborsRegressor (base model)

In [20]:
from sklearn.neighbors import KNeighborsRegressor
knn_pipeline = make_pipeline(preprocessor,
                             KNeighborsRegressor())
knn_pipeline.fit(X_train, y_train)

In [21]:
knn_pred = knn_pipeline.predict(X_test)



In [23]:
r2_knn = r2_score(y_test, knn_pred)
r2_knn

0.703539259178418

In [24]:
#comparing two models
performances = pd.DataFrame({'decision_tree': r2_dt_tree,
                             'knn': r2_knn},
                            index=['baseline'])

performances

| | decision_tree | knn |
|---|---|---|
| baseline | 0.784683 | 0.703539 |


The Decision Tree already scores noticeably higher than K-Nearest Neighbors on this noisy dataset. A Decision Tree only splits on the features that reduce its error the most, while K-Nearest Neighbors uses every feature in its distance calculation, so uninformative or large-scale features can dominate. However, it's essential to remember that the Decision Tree might not always be the better choice; after scaling and feature selection, K-Nearest Neighbors could potentially perform better.
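
To make the point about K-Nearest Neighbors concrete, here is a minimal sketch of how an unscaled, wide-range feature can dominate a Euclidean distance. The three houses and their values are made up for illustration and are not rows from the dataset.

```python
import numpy as np

# Hypothetical houses: [LotArea in square feet, OverallQual on a 1-10 scale]
house_a = np.array([8450.0, 7.0])
house_b = np.array([8500.0, 2.0])   # similar lot, very different quality
house_c = np.array([9600.0, 7.0])   # same quality, larger lot

# Euclidean distance, the default metric of KNeighborsRegressor
dist_ab = np.linalg.norm(house_a - house_b)
dist_ac = np.linalg.norm(house_a - house_c)

print("a to b:", dist_ab)
print("a to c:", dist_ac)
```

On the raw values, house b counts as the much closer neighbor of house a, even though the two differ by five quality points, because the lot area difference swamps everything else.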



## 5.&nbsp;Feature selection based only on features 🔨

### 5.1.&nbsp;Variance Threshold

Features with low variance carry limited information, and with this transformer, we can eliminate those features by setting a threshold. Any feature with a variance below this threshold will be dropped from the dataset. 
Let's first look at the range and variance of the columns.
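
Before applying this to the housing data, a toy example may help show what the transformer does. The DataFrame `toy`, its column names, and the threshold of 0.15 are assumptions chosen purely for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Three toy columns: one constant, one rare 0/1 flag, one balanced 0/1 flag
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],
    "rare_flag": [0, 0, 0, 0, 0, 1],
    "balanced_flag": [0, 1, 0, 1, 0, 1],
})

toy_selector = VarianceThreshold(threshold=0.15)
toy_selector.fit(toy)

# get_support() returns a boolean mask of the columns that are kept
kept = toy.columns[toy_selector.get_support()].tolist()

print(toy_selector.variances_)
print(kept)
```

Note that `VarianceThreshold` works with the population variance, whereas pandas' `.var()` uses the sample variance (`ddof=1`) by default, so the two can differ slightly on small datasets.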

In [26]:
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_train_preprocessed

Unnamed: 0,pipeline-1__MSSubClass,pipeline-1__LotFrontage,pipeline-1__LotArea,pipeline-1__OverallQual,pipeline-1__OverallCond,pipeline-1__YearBuilt,pipeline-1__YearRemodAdd,pipeline-1__MasVnrArea,pipeline-1__BsmtFinSF1,pipeline-1__BsmtFinSF2,...,pipeline-2__onehotencoder__HeatingQC_Gd,pipeline-2__onehotencoder__HeatingQC_Po,pipeline-2__onehotencoder__HeatingQC_TA,pipeline-2__onehotencoder__CentralAir_Y,pipeline-2__onehotencoder__RoofMatl_CompShg,pipeline-2__onehotencoder__RoofMatl_Metal,pipeline-2__onehotencoder__RoofMatl_Roll,pipeline-2__onehotencoder__RoofMatl_Tar&Grv,pipeline-2__onehotencoder__RoofMatl_WdShake,pipeline-2__onehotencoder__RoofMatl_WdShngl
254,20.0,70.0,8400.0,5.0,6.0,1957.0,1957.0,0.0,922.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1066,60.0,59.0,7837.0,6.0,7.0,1993.0,1994.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
638,30.0,67.0,8777.0,5.0,7.0,1910.0,1950.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
799,50.0,60.0,7200.0,5.0,7.0,1937.0,1950.0,252.0,569.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
380,50.0,50.0,5000.0,5.0,6.0,1924.0,1950.0,0.0,218.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,20.0,78.0,9317.0,6.0,5.0,2006.0,2006.0,0.0,24.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1130,50.0,65.0,7804.0,4.0,3.0,1928.0,1950.0,0.0,622.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1294,20.0,60.0,8172.0,5.0,7.0,1955.0,1990.0,0.0,167.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
860,50.0,55.0,7642.0,7.0,8.0,1918.0,1998.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0


In [27]:
range_var_df = (pd.DataFrame({
'range':X_train_preprocessed.max()-X_train_preprocessed.min(),
'Variance':X_train_preprocessed.var()})
.sort_values(by='Variance'))

In [29]:
range_var_df.head()

| | range | Variance |
|---|---|---|
| pipeline-2__onehotencoder__Condition2_RRNn | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__Condition2_PosA | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__Exterior1st_ImStucc | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__HeatingQC_Po | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__RoofMatl_Metal | 1.0 | 0.000856 |


In [30]:
range_var_df.tail()

| | range | Variance |
|---|---|---|
| pipeline-1__BsmtUnfSF | 2336.0 | 199241.3 |
| pipeline-1__BsmtFinSF1 | 5644.0 | 210746.2 |
| pipeline-1__GrLivArea | 5308.0 | 275029.6 |
| pipeline-1__MiscVal | 15500.0 | 305852.9 |
| pipeline-1__LotArea | 213945.0 | 115764000.0 |


#### 5.1.1.&nbsp;Scaling the data

Some scalers give every feature the same variance: the standard scaler, for example, rescales each column to a standard deviation of 1, which would make a variance threshold meaningless. This is **not** our desired outcome. We need a scaler that puts all features on a common range while keeping their variances different, and for this purpose, we will use min-max scaling.
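
The difference between the two scalers can be checked on a small made-up array. This is only a sketch: the three columns below (a wide-range column, a balanced 0/1 flag, and a rare 0/1 flag) are assumptions for illustration, not housing features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Deterministic demo columns (hypothetical values)
wide_range = np.linspace(1_000, 20_000, 500)   # area-like column
balanced_flag = np.tile([0.0, 1.0], 250)       # half zeros, half ones
rare_flag = np.zeros(500)
rare_flag[:5] = 1.0                            # only 1% ones

demo = np.column_stack([wide_range, balanced_flag, rare_flag])

# Standard scaling: every column ends up with variance 1
std_var = StandardScaler().fit_transform(demo).var(axis=0)

# Min-max scaling: every column ends up in [0, 1], but variances still differ
minmax_var = MinMaxScaler().fit_transform(demo).var(axis=0)

print("standard scaler:", std_var)
print("min-max scaler: ", minmax_var)
```

After standard scaling no threshold could separate the three columns, while after min-max scaling the rare flag still has a much smaller variance than the balanced one.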

In [31]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
my_scaler = MinMaxScaler().set_output(transform='pandas')

# Fit the scaler to the X_train_preprocessed data and transform it
X_train_preprocessed_scaled = my_scaler.fit_transform(X_train_preprocessed)
X_train_preprocessed_scaled

Unnamed: 0,pipeline-1__MSSubClass,pipeline-1__LotFrontage,pipeline-1__LotArea,pipeline-1__OverallQual,pipeline-1__OverallCond,pipeline-1__YearBuilt,pipeline-1__YearRemodAdd,pipeline-1__MasVnrArea,pipeline-1__BsmtFinSF1,pipeline-1__BsmtFinSF2,...,pipeline-2__onehotencoder__HeatingQC_Gd,pipeline-2__onehotencoder__HeatingQC_Po,pipeline-2__onehotencoder__HeatingQC_TA,pipeline-2__onehotencoder__CentralAir_Y,pipeline-2__onehotencoder__RoofMatl_CompShg,pipeline-2__onehotencoder__RoofMatl_Metal,pipeline-2__onehotencoder__RoofMatl_Roll,pipeline-2__onehotencoder__RoofMatl_Tar&Grv,pipeline-2__onehotencoder__RoofMatl_WdShake,pipeline-2__onehotencoder__RoofMatl_WdShngl
254,0.000000,0.167808,0.033186,0.444444,0.625,0.615942,0.116667,0.000000,0.163359,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1066,0.235294,0.130137,0.030555,0.555556,0.750,0.876812,0.733333,0.000000,0.000000,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
638,0.058824,0.157534,0.034948,0.444444,0.750,0.275362,0.000000,0.000000,0.000000,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
799,0.176471,0.133562,0.027577,0.444444,0.750,0.471014,0.000000,0.182874,0.100815,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
380,0.176471,0.099315,0.017294,0.444444,0.625,0.376812,0.000000,0.000000,0.038625,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,0.000000,0.195205,0.037472,0.555556,0.500,0.971014,0.933333,0.000000,0.004252,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1130,0.176471,0.150685,0.030400,0.333333,0.250,0.405797,0.000000,0.000000,0.110206,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1294,0.000000,0.133562,0.032120,0.444444,0.750,0.601449,0.666667,0.000000,0.029589,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
860,0.176471,0.116438,0.029643,0.666667,0.875,0.333333,0.800000,0.000000,0.000000,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0


Let's examine how the scaling impacted the range and variance of our columns.

In [32]:
range_var_df_scaled = (pd.DataFrame({
'range':X_train_preprocessed_scaled.max()-X_train_preprocessed_scaled.min(),
'Variance':X_train_preprocessed_scaled.var()})
.sort_values(by='Variance'))

In [34]:
range_var_df_scaled.head()

| | range | Variance |
|---|---|---|
| pipeline-2__onehotencoder__Condition2_RRNn | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__Condition2_PosA | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__Exterior1st_ImStucc | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__HeatingQC_Po | 1.0 | 0.000856 |
| pipeline-2__onehotencoder__RoofMatl_Metal | 1.0 | 0.000856 |


In [35]:
range_var_df_scaled.tail()

| | range | Variance |
|---|---|---|
| pipeline-2__onehotencoder__GarageFinish_Unf | 1.0 | 0.242279 |
| pipeline-2__onehotencoder__MasVnrType_N_A | 1.0 | 0.243024 |
| pipeline-2__onehotencoder__Foundation_CBlock | 1.0 | 0.245519 |
| pipeline-2__onehotencoder__Foundation_PConc | 1.0 | 0.247209 |
| pipeline-2__onehotencoder__HouseStyle_1Story | 1.0 | 0.250178 |


After scaling, `pipeline-1__LotArea` no longer has the largest variance; the highest variances now belong to one-hot columns that are split roughly evenly between 0 and 1. Now, we can proceed with applying the `VarianceThreshold` transformation.

All features with a smaller variance than the `threshold` will be deleted from the dataset.

In [41]:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=.02)
X_train_var = selector.fit_transform(X_train_preprocessed_scaled)
X_train_var

Unnamed: 0,pipeline-1__MSSubClass,pipeline-1__OverallQual,pipeline-1__YearBuilt,pipeline-1__YearRemodAdd,pipeline-1__BsmtUnfSF,pipeline-1__2ndFlrSF,pipeline-1__BsmtFullBath,pipeline-1__FullBath,pipeline-1__HalfBath,pipeline-1__Fireplaces,...,pipeline-2__onehotencoder__Electrical_SBrkr,pipeline-2__onehotencoder__LotShape_IR2,pipeline-2__onehotencoder__LotShape_Reg,pipeline-2__onehotencoder__GarageFinish_N_A,pipeline-2__onehotencoder__GarageFinish_RFn,pipeline-2__onehotencoder__GarageFinish_Unf,pipeline-2__onehotencoder__HeatingQC_Fa,pipeline-2__onehotencoder__HeatingQC_Gd,pipeline-2__onehotencoder__HeatingQC_TA,pipeline-2__onehotencoder__CentralAir_Y
254,0.000000,0.444444,0.615942,0.116667,0.167808,0.000000,0.333333,0.333333,0.0,0.000000,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
1066,0.235294,0.555556,0.876812,0.733333,0.342038,0.373850,0.000000,0.666667,0.5,0.333333,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
638,0.058824,0.444444,0.275362,0.000000,0.340753,0.000000,0.000000,0.333333,0.0,0.000000,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
799,0.176471,0.444444,0.471014,0.000000,0.069349,0.381114,0.333333,0.333333,0.5,0.666667,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
380,0.176471,0.444444,0.376812,0.000000,0.345890,0.322034,0.000000,0.666667,0.0,0.333333,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,0.000000,0.555556,0.971014,0.933333,0.552226,0.000000,0.000000,0.666667,0.0,0.333333,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
1130,0.176471,0.333333,0.405797,0.000000,0.214041,0.316223,0.333333,0.666667,0.0,0.666667,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
1294,0.000000,0.444444,0.601449,0.666667,0.298373,0.000000,0.333333,0.333333,0.0,0.000000,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
860,0.176471,0.666667,0.333333,0.800000,0.390411,0.248910,0.000000,0.333333,0.5,0.333333,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


In [44]:
range_var_df_th = (pd.DataFrame({
'range':X_train_var.max()-X_train_var.min(),
'Variance':X_train_var.var()})
.sort_values(by='Variance'))

In [46]:
range_var_df_th.tail()

| | range | Variance |
|---|---|---|
| pipeline-2__onehotencoder__GarageFinish_Unf | 1.0 | 0.242279 |
| pipeline-2__onehotencoder__MasVnrType_N_A | 1.0 | 0.243024 |
| pipeline-2__onehotencoder__Foundation_CBlock | 1.0 | 0.245519 |
| pipeline-2__onehotencoder__Foundation_PConc | 1.0 | 0.247209 |
| pipeline-2__onehotencoder__HouseStyle_1Story | 1.0 | 0.250178 |


Let's check how many features we have dropped.

In [42]:
print("shape before:", X_train_preprocessed_scaled.shape)
print("shape after:", X_train_var.shape)

shape before: (1168, 233)
shape after: (1168, 121)


We have dropped 112 features using a threshold of 0.02.

The next step: make sure to transform the test set using the `transform` method. Remember to avoid using the `fit_transform` method, as it is reserved exclusively for the train set. This way, you ensure proper feature scaling without introducing data leakage from the test set into the model.
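
Putting the steps into a pipeline takes care of this bookkeeping automatically: `fit` calls `fit_transform` on each step using the training data only, and `predict` calls `transform`. The sketch below compares the manual route with a pipeline on synthetic data; the dataset and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed housing data
Xs, ys = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=0)

# Manual route: fit every step on the train set, only transform the test set
demo_scaler = MinMaxScaler().fit(Xs_train)
demo_selector = VarianceThreshold(threshold=0.001).fit(demo_scaler.transform(Xs_train))
demo_knn = KNeighborsRegressor().fit(
    demo_selector.transform(demo_scaler.transform(Xs_train)), ys_train)
manual_pred = demo_knn.predict(
    demo_selector.transform(demo_scaler.transform(Xs_test)))

# Pipeline route: the same steps, chained
demo_pipeline = make_pipeline(MinMaxScaler(),
                              VarianceThreshold(threshold=0.001),
                              KNeighborsRegressor())
demo_pipeline.fit(Xs_train, ys_train)
pipe_pred = demo_pipeline.predict(Xs_test)

print(np.allclose(manual_pred, pipe_pred))
```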

In [47]:
# Preprocess the test set
X_test_preprocessed = preprocessor.transform(X_test)



In [48]:
#scale the test set after preprocessing
X_test_preprocessed_scaled = my_scaler.transform(X_test_preprocessed)

In [49]:
#Apply variance threshold
X_test_var = selector.transform(X_test_preprocessed_scaled)

Let's check how well our model performs with the new dataset.

In [50]:
#decision tree regressor
var_tree = DecisionTreeRegressor()
var_tree.fit(X_train_var, y_train)
var_tree_pred = var_tree.predict(X_test_var)
var_tree_r2_score = r2_score(y_test, var_tree_pred)
var_tree_r2_score

0.752286576189729

In [51]:
#knn regressor
var_knn = KNeighborsRegressor()
var_knn.fit(X_train_var, y_train)
var_knn_pred = var_knn.predict(X_test_var)
var_knn_r2_score = r2_score(y_test, var_knn_pred)
var_knn_r2_score

0.7077642805381001

In [52]:
performances.loc["varThreshold_0_02", "decision_tree"] = var_tree_r2_score
performances.loc["varThreshold_0_02", "knn"] = var_knn_r2_score

performances

| | decision_tree | knn |
|---|---|---|
| baseline | 0.784683 | 0.703539 |
| varThreshold_0_02 | 0.752287 | 0.707764 |


The performance of K-Nearest Neighbors is almost unchanged (0.704 to 0.708), but the Decision Tree's score has dropped (0.785 to 0.752), suggesting that the dropped columns contained valuable information. Such outcomes are not uncommon when using the variance threshold.

Let's repeat the process with a more conservative threshold to see whether it preserves more of the important features.

#### 5.1.2.&nbsp;Variance threshold: 2nd iteration
Set the threshold to 0, meaning we will remove only those columns with zero variance. This will help us eliminate features that carry no valuable information.

In [56]:
selector_2 = VarianceThreshold(threshold=0).set_output(transform='pandas')
X_train_var2 = selector_2.fit_transform(X_train_preprocessed_scaled)
print('shape before:', X_train_preprocessed_scaled.shape)
print('shape after:', X_train_var2.shape)

shape before: (1168, 233)
shape after: (1168, 233)


We did not drop any features, so let's raise the threshold to 0.001.

In [62]:
selector_3 = VarianceThreshold(threshold=0.001).set_output(transform='pandas')
X_train_var3 = selector_3.fit_transform(X_train_preprocessed_scaled)
print('shape before:', X_train_preprocessed_scaled.shape)
print('shape after:', X_train_var3.shape)

shape before: (1168, 233)
shape after: (1168, 214)
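
For a 0/1 column, the variance depends only on how many rows contain a 1, which gives a feel for what a given threshold removes. The sketch below works this out for the 1168 training rows; the helper `binary_variance` is a hypothetical function written for this illustration, and the reasoning applies to one-hot columns only, not to the scaled numeric or ordinal ones.

```python
n_rows = 1168  # number of rows in the train set, from the shapes printed above

def binary_variance(k, n, ddof=0):
    """Variance of a 0/1 column with k ones among n rows.

    ddof=0 is the population variance (what VarianceThreshold uses),
    ddof=1 is the sample variance (the pandas .var() default).
    """
    return k * (n - k) / (n * (n - ddof))

for k in (1, 2, 23, 24):
    print(k,
          round(binary_variance(k, n_rows), 6),
          round(binary_variance(k, n_rows, ddof=1), 6))
```

A dummy column with a single 1 has a sample variance of 1/1168, which rounds to the 0.000856 seen in the variance tables, and falls below 0.001, while a column with two 1s already exceeds it. Under the same formula, the 0.02 threshold removes dummy columns with up to 23 ones.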


In [63]:
#Apply variance threshold
X_test_var3 = selector_3.transform(X_test_preprocessed_scaled)

In [64]:
# Decision tree
var3_tree = DecisionTreeRegressor()
var3_tree.fit(X_train_var3, y_train)
var3_tree_pred = var3_tree.predict(X_test_var3)

# K-Nearest Neighbors
# Note: n_neighbors=1 here, while the earlier KNN runs used the default (n_neighbors=5)
var3_knn = KNeighborsRegressor(n_neighbors=1)
var3_knn.fit(X_train_var3, y_train)
var3_knn_pred = var3_knn.predict(X_test_var3)

# Store the scores under a label that matches the 0.001 threshold
performances.loc["varThreshold_0_001", "decision_tree"] = r2_score(y_test, var3_tree_pred)
performances.loc["varThreshold_0_001", "knn"] = r2_score(y_test, var3_knn_pred)

performances

| | decision_tree | knn |
|---|---|---|
| baseline | 0.784683 | 0.703539 |
| varThreshold_0_02 | 0.752287 | 0.707764 |
| varThreshold_0_001 | 0.807749 | 0.696836 |


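With the 0.001 threshold, the Decision Tree improves over the baseline (0.808 vs. 0.785), while K-Nearest Neighbors drops slightly (0.697 vs. 0.704). Two caveats apply: this last K-Nearest Neighbors model was fit with `n_neighbors=1` rather than the default used for the earlier rows, and the Decision Trees were fit without a fixed `random_state`, so part of the difference may be run-to-run variation rather than an effect of the threshold.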
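
The scores collected in `performances` each come from a single train/test split, so gaps of a few hundredths can be hard to interpret. One option, sketched below on synthetic data rather than the housing set, is to look at the spread of cross-validated scores; the dataset, the variable names, and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data
Xr, yr = make_regression(n_samples=400, n_features=30, n_informative=8,
                         noise=20.0, random_state=1)

cv_pipeline = make_pipeline(MinMaxScaler(),
                            VarianceThreshold(threshold=0.001),
                            DecisionTreeRegressor(random_state=0))

# One R2 score per fold; the standard deviation hints at how noisy a single score is
cv_scores = cross_val_score(cv_pipeline, Xr, yr, cv=5, scoring="r2")

print("mean R2:", cv_scores.mean())
print("std R2: ", cv_scores.std())
```

If the gap between two feature selection settings is smaller than the spread across folds, it is probably too small to act on.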