# &nbsp; Supervised ML: Predicting housing prices (Phase2: Regression)
- Iteration 7: feature selection (automatically)

Feature selection is aimed at identifying and selecting the most relevant and informative features from a given dataset. With the abundance of available features, selecting the right subset of variables can significantly impact the model's performance. By pruning irrelevant or redundant features, feature selection not only enhances the accuracy and generalisation capabilities of models but also reduces computational complexity, ensuring faster and more efficient predictions.

## 1.&nbsp; Import libraries 💾

In [43]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import set_config
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression, RFECV, SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import (mean_squared_error, r2_score, mean_absolute_error,
                             mean_absolute_percentage_error, root_mean_squared_error)
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

In [2]:
# 0. Set the config so that we can view our preprocessor, and to transform output from numpy arrays to pandas dataframes
set_config(display="diagram")
set_config(transform_output="pandas")

## 2.&nbsp; Data reading 📂

In [3]:
# reading: housing_iteration_6_regression
url = "https://drive.google.com/file/d/1mOYOiWuqCybgAsb4l9pJONJKVBcyDFu-/view?usp=sharing"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]

housing_data = pd.read_csv(path)

In [4]:
housing_data.shape

(1460, 81)

In [5]:
housing_data.sample(3)

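| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 468 | 469 | 20 | RL | 98.0 | 11428 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 250000 |
| 1331 | 1332 | 80 | RL | 55.0 | 10780 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 7 | 2006 | WD | Normal | 132500 |
| 1384 | 1385 | 50 | RL | 60.0 | 9060 | Pave | | Reg | Lvl | AllPub | ... | 0 | | MnPrv | | 0 | 10 | 2009 | WD | Normal | 105000 |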


In [6]:
housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

## 3.&nbsp; Train-test split 🔀

All data transformations and feature selection should rely solely on the information from the training set, with no consideration of the test set. In feature selection, this involves deciding the usefulness of columns based only on the training set. Once we identify which columns to drop, we apply the same removal to the test set as well. 

In [7]:
X = housing_data.drop(columns=["Id"])
y = X.pop("SalePrice")

# data splitting
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=123)
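
The rule stated above (decide on the training set, then apply the same decision to the test set) can be sketched on toy data. This is a minimal illustration, not part of the housing pipeline: the arrays `X_toy` and `y_toy` are made up, and `VarianceThreshold` is used only because it is the simplest selector to reason about.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the second column is constant and carries no information
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(size=20), np.ones(20), rng.normal(size=20)])
y_toy = rng.normal(size=20)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)

selector = VarianceThreshold()       # drops zero-variance columns
selector.fit(X_tr)                   # the decision is based on the training rows only
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)  # the same columns are removed from the test set

print(selector.get_support())
print(X_tr_sel.shape, X_te_sel.shape)
```

Placing the selector inside a pipeline, as done in the sections that follow, gives the same guarantee automatically, because `fit` is only ever called on training data.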

## 4.&nbsp; Preprocessing Pipeline 🧱

In [8]:
# 1. defining categorical & numerical columns
X_cat = X_train.select_dtypes(exclude="number").copy()
X_num = X_train.select_dtypes(include="number").copy()

# 2. numerical pipeline
numeric_pipe = make_pipeline(
    SimpleImputer(strategy="mean"))

# 3. categorical pipeline

# # 3.1 defining ordinal & onehot columns
# columns are selected by name; this works because set_config(transform_output="pandas") keeps the column names after the imputer
ordinal_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond", 
                "BsmtExposure", "BsmtFinType1", "KitchenQual", 
                "FireplaceQu", "GarageFinish", "GarageQual", 
                "GarageCond", "PavedDrive", "LotShape", "Utilities", 
                "LandSlope", "BsmtFinType2", "HeatingQC"]
onehot_cols = ["MSZoning", "Condition1", "Heating", "Street", 
               "CentralAir", "Foundation", "GarageType", "SaleType", 
               "SaleCondition", "LandContour", "LotConfig", 
               "Neighborhood", "Condition2", "BldgType", "HouseStyle", 
               "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", 
               "Electrical", "Functional"]

# # 3.2. defining the categorical encoder

# # # 3.2.1. we manually establish the order of the categories for our ordinal features, from lowest to highest quality, including "N_A" where it applies
ExterQual_cats = ["Po", "Fa", "TA", "Gd", "Ex"]
ExterCond_cats = ["Po", "Fa", "TA", "Gd", "Ex"]
BsmtQual_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
BsmtCond_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
BsmtExposure_cats = ["N_A", "No", "Mn", "Av", "Gd"]
BsmtFinType1_cats = ["N_A", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]
KitchenQual_cats = ["Po", "Fa", "TA", "Gd", "Ex"]
FireplaceQu_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
GarageFinish_cats = ["N_A", "Unf", "RFn", "Fin"]
GarageQual_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
GarageCond_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
PavedDrive_cats = ["N", "P", "Y"]
LotShape_cats = ["IR3", "IR2", "IR1", "Reg"]
Utilities_cats = ["ELO", "NoSeWa", "NoSewr", "AllPub"]
LandSlope_cats = ["Sev", "Mod", "Gtl"]
BsmtFinType2_cats = ["N_A", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]
HeatingQC_cats = ["Po", "Fa", "TA", "Gd", "Ex"]

# # # 3.2.2. defining the categorical encoder: a ColumnTransformer with 2 branches: ordinal & onehot
categorical_encoder = ColumnTransformer(
    transformers=[
        ("cat_ordinal", OrdinalEncoder(
            categories=[ExterQual_cats, ExterCond_cats, BsmtQual_cats, 
                        BsmtCond_cats, BsmtExposure_cats, 
                        BsmtFinType1_cats, KitchenQual_cats, 
                        FireplaceQu_cats, GarageFinish_cats, 
                        GarageQual_cats, GarageCond_cats, 
                        PavedDrive_cats, LotShape_cats, Utilities_cats, 
                        LandSlope_cats, BsmtFinType2_cats, 
                        HeatingQC_cats], handle_unknown="use_encoded_value", unknown_value= 10), ordinal_cols),
        ("cat_onehot", OneHotEncoder(
            handle_unknown="ignore", sparse_output=False), onehot_cols),
    ]
)

# # 3.3. categorical pipeline = "N_A" imputer + categorical encoder
categorical_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"), 
        categorical_encoder
        )

# 4. full preprocessing: a ColumnTransformer with 2 branches: numeric & categorical
full_preprocessing = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, X_num.columns),
        ("cat_pipe", categorical_pipe, X_cat.columns),
    ]
)

full_preprocessing
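
To make the ordinal branch easier to follow, here is a small sketch of how `OrdinalEncoder` behaves with an explicit category order and `handle_unknown="use_encoded_value"`. The category list matches `BsmtQual_cats` above; the toy frames and the label `"Unknown"` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same order as BsmtQual_cats: position in the list becomes the encoded value
quality_cats = ["N_A", "Po", "Fa", "TA", "Gd", "Ex"]
encoder = OrdinalEncoder(categories=[quality_cats],
                         handle_unknown="use_encoded_value",
                         unknown_value=10)

# Hypothetical training column
train_col = pd.DataFrame({"BsmtQual": ["TA", "Gd", "N_A", "Ex"]})
encoder.fit(train_col)

# "Unknown" is not in the category list, so it is mapped to unknown_value
new_col = pd.DataFrame({"BsmtQual": ["Po", "Ex", "Unknown"]})
encoded = encoder.transform(new_col)
print(encoded.ravel())
```

Note that `unknown_value=10` lies outside the range of regular codes (0 to 5 here), so unseen labels cannot be confused with a real category, but they are treated as "better than Ex" by any model that reads the codes as an ordered scale.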

## 5.&nbsp; Grid Search & Cross Validation

Here, we compare two pipelines: one ending in a Decision Tree regressor and one ending in a K-Nearest Neighbors regressor. First we find the best hyperparameters for each model with a cross-validated grid search; then we apply different feature selection strategies and track how each one affects predictive performance.
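
The long parameter names used in the grids of this section follow from how `make_pipeline` names its steps (the lowercased class name), joined with double underscores. The sketch below uses a simplified, hypothetical pipeline without the `ColumnTransformer`, so the keys are shorter than in the notebook; the grid values are those of the Decision Tree grid in 5.1.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Simplified stand-in for the full pipeline (assumption: no ColumnTransformer)
pipe = make_pipeline(SimpleImputer(), MinMaxScaler(), DecisionTreeRegressor())
step_names = list(pipe.named_steps)
print(step_names)

grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "decisiontreeregressor__max_depth": list(range(8, 15, 2)),
    "decisiontreeregressor__min_samples_split": [2, 3, 4],
    "decisiontreeregressor__min_samples_leaf": [2, 3, 4, 10, 13, 15],
}

# Every key must be a valid parameter of the pipeline
all_valid = all(key in pipe.get_params() for key in grid)

# Number of candidates = product of the list lengths: 2 * 4 * 3 * 6
n_candidates = len(ParameterGrid(grid))
print(all_valid, n_candidates)
```

With `cv=5`, each candidate is fitted five times, which is where a message such as "144 candidates, totalling 720 fits" comes from.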

### 5.1.&nbsp; Decision Tree Regressor

In [10]:
dt_full_pipeline_GS = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 DecisionTreeRegressor())

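# The parameter grid for the regressor
# Note: SimpleImputer only uses `fill_value` when strategy="constant", so it has no effect with "mean"/"median"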
param_grid_dt = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    "decisiontreeregressor__max_depth": range(8, 15, 2),
    "decisiontreeregressor__min_samples_split": [2, 3, 4],
    "decisiontreeregressor__min_samples_leaf": [2, 3, 4, 10, 13, 15]
}

dt_search = GridSearchCV(
    dt_full_pipeline_GS,
    param_grid_dt,
    cv=5,
    verbose=1,
    scoring="r2"
)

# Fitting the pipeline to the training data
dt_search.fit(X_train, y_train)

# Get the best cross-validated R² score and parameters
best_r2_dt = dt_search.best_score_
best_params_dt = dt_search.best_params_

# Output results
scores_dt = {
    "best_r2_dt": best_r2_dt,
    "best_params_dt": best_params_dt
}

scores_dt

Fitting 5 folds for each of 144 candidates, totalling 720 fits


{'best_r2_dt': np.float64(0.7582413951338827),
 'best_params_dt': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'median',
  'decisiontreeregressor__max_depth': 8,
  'decisiontreeregressor__min_samples_leaf': 3,
  'decisiontreeregressor__min_samples_split': 4}}

Making predictions:

In [11]:
y_pred_dt = dt_search.predict(X_test)

In this evaluation, we will utilise R-squared to assess our models' performance and gauge the impact of our feature selection process. While having a primary metric is recommended, exploring multiple metrics can provide diverse insights into the model's behavior. Therefore, we explore alternative evaluation metrics to gain a more comprehensive understanding of our model's strengths and weaknesses.

In [13]:
assessment_df = pd.DataFrame(columns=['MAE', 'RMSE', "MAPE","R2_score"])
assessment_df.loc['Decision Tree','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_dt)
assessment_df.loc['Decision Tree','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_dt)
assessment_df.loc['Decision Tree','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_dt)
assessment_df.loc['Decision Tree','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_dt)

In [14]:
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
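
To make the four metrics concrete, the following sketch computes them by hand on four made-up prices and checks the result against scikit-learn. The numbers are hypothetical and unrelated to the housing results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score

# Hypothetical true prices and predictions
y_true = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0])
y_hat = np.array([110_000.0, 190_000.0, 330_000.0, 360_000.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))                    # average absolute error, in currency units
rmse = np.sqrt(np.mean(errors ** 2))             # penalises large errors more than MAE
mape = np.mean(np.abs(errors) / np.abs(y_true))  # average error as a fraction of the true price
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, rmse, mape, r2)

# The hand-computed values agree with scikit-learn
sk_mae = mean_absolute_error(y_true, y_hat)
sk_mape = mean_absolute_percentage_error(y_true, y_hat)
sk_r2 = r2_score(y_true, y_hat)
```

MAE and RMSE are in the units of the target, MAPE is a fraction (0.13 means 13 %), and R² is unitless.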


### 5.2.&nbsp; K Nearest Neighbors Regressor

In [18]:
knn_full_pipeline_GS = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 KNeighborsRegressor())

# The parameter grid for the regressor
param_grid_knn = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    "kneighborsregressor__n_neighbors": range(2, 15, 2),
    "kneighborsregressor__weights": ["uniform", "distance"],
    "kneighborsregressor__leaf_size": range(2, 15, 2)
}

knn_search = GridSearchCV(
    knn_full_pipeline_GS,
    param_grid_knn,
    cv=5,
    verbose=1,
    scoring="r2"
)

# Fitting the pipeline to the training data
knn_search.fit(X_train, y_train)

# Get the best R² score and parameters
best_r2_knn = knn_search.best_score_
best_params_knn = knn_search.best_params_

# Output results
scores_knn = {
    "best_r2": best_r2_knn,
    "best_params": best_params_knn
}

scores_knn

Fitting 5 folds for each of 196 candidates, totalling 980 fits


{'best_r2': np.float64(0.6956323423774793),
 'best_params': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'mean',
  'kneighborsregressor__leaf_size': 2,
  'kneighborsregressor__n_neighbors': 8,
  'kneighborsregressor__weights': 'distance'}}

Making predictions:

In [19]:
y_pred_knn = knn_search.predict(X_test)

As with the Decision Tree, we add MAE, RMSE, MAPE and R-squared for the K-Nearest Neighbors predictions to the assessment table.

In [20]:
assessment_df.loc['K Neighbors','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_knn)
assessment_df.loc['K Neighbors','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_knn)
assessment_df.loc['K Neighbors','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_knn)
assessment_df.loc['K Neighbors','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_knn)
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
| K Neighbors | 27889.113725 | 44463.870311 | 0.152491 | 0.680071 |


On the full feature set, the Decision Tree does somewhat better than K-Nearest Neighbors on the test set (R² of 0.72 against 0.68). A likely reason is that a Decision Tree only splits on the features it finds most useful, while K-Nearest Neighbors weighs every feature equally when computing distances, so noisy features hurt it more. However, it's essential to remember that the Decision Tree might not always be the better choice; after feature selection, K-Nearest Neighbors could potentially perform better.

> **Note:** R-squared can be negative. A negative score means that a model performs worse than a horizontal line at the mean of the target, i.e. it fails to capture any meaningful relationship between the input features and the target variable. Neither model is in that regime here.
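
A tiny made-up example shows where the sign of R-squared comes from: predicting the mean everywhere gives exactly 0, and predictions that are worse than the mean give a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of the target (a horizontal line)
mean_prediction = np.full_like(y_true, y_true.mean())

# Deliberately bad predictions: the true values in reverse order
reversed_prediction = y_true[::-1].copy()

r2_mean = r2_score(y_true, mean_prediction)
r2_reversed = r2_score(y_true, reversed_prediction)
print(r2_mean, r2_reversed)
```

For the reversed predictions the squared errors sum to 20 while the variation around the mean sums to 5, so R² = 1 - 20/5 = -3.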

## 6.&nbsp;Feature selection based on features and labels 🔧

### 6.1.&nbsp;K Best

`SelectKBest` allows us to use statistical tests like ANOVA or chi2 to rank and select the best features. Refer to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) and [user guide](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) to see the transformer's methods, understand its parameters, and explore examples.

We have to choose an appropriate statistical test based on the data type: for our regression problem, an f-test will be used. Detailed explanations and examples of the f-test can be found [here](https://www.youtube.com/watch?v=ie-MYQp1Nic&ab_channel=BenLambert).

In short, Scikit-Learn computes the f-statistic for each univariate linear model (one for each feature). The f-statistic measures how much better the linear model with a single feature predicts compared to using only a constant value. This "score" allows us to rank the features.

The f-test gives a ranking of the "best" features based on their individual predictive ability in a linear model. The `SelectKBest` transformer performs this test and lets you control the number of "top" features to retain with the `k` parameter.
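
As a rough illustration of the score behind this ranking, the sketch below runs `f_regression` on made-up data and reproduces the statistic by hand from the Pearson correlation `r` of each feature with the target, using F = r² / (1 - r²) · (n - 2). The data and coefficients are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Hypothetical data: feature 0 is strongly related to y, feature 1 weakly, feature 2 not at all
rng = np.random.default_rng(42)
n = 200
X_toy = rng.normal(size=(n, 3))
y_toy = 4.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(size=n)

f_scores, p_values = f_regression(X_toy, y_toy)

# The same statistic computed from the correlation of each single feature with y
r = np.array([np.corrcoef(X_toy[:, j], y_toy)[0, 1] for j in range(3)])
f_manual = r ** 2 / (1 - r ** 2) * (n - 2)

# Features ordered from highest to lowest score
ranking = np.argsort(f_scores)[::-1]
print(f_scores, ranking)
```

Because each feature is scored on its own, two strongly correlated features will both rank highly even if the second adds little once the first is in the model.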

> **Note:** In a pipeline with `GridSearchCV`, you can tune `k` by trying out several values, along with other preprocessing and modeling parameters. Machine Learning often relies on automated search or optimisation techniques to find good parameter values, so you are not expected to know the ideal values from the outset.

#### 6.1.1.&nbsp; Decision Tree Regressor

In [26]:
dt_full_pipeline_KBest = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 SelectKBest(score_func=f_regression),
                                 DecisionTreeRegressor())

# The parameter grid for the regressor
param_grid_dt = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    "selectkbest__k": range(10, 51, 10),
    "decisiontreeregressor__max_depth": range(6, 13, 2),
    "decisiontreeregressor__min_samples_split": [2, 3, 4],
    "decisiontreeregressor__min_samples_leaf": range(2, 5, 1)
}

dt_search_KBest = GridSearchCV(
    dt_full_pipeline_KBest,
    param_grid_dt,
    cv=5,
    verbose=1,
    scoring="r2"
    
)

# Fitting the pipeline to the training data
dt_search_KBest.fit(X_train, y_train)

# Get the best R² score and parameters
best_r2_KBest = dt_search_KBest.best_score_
best_params_KBest = dt_search_KBest.best_params_

# Output results
scores_dt_KBest = {
    "best_r2_KBest": best_r2_KBest,
    "best_params_KBest": best_params_KBest
}

scores_dt_KBest

Fitting 5 folds for each of 360 candidates, totalling 1800 fits


{'best_r2_KBest': np.float64(0.7551382801488181),
 'best_params_KBest': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'median',
  'decisiontreeregressor__max_depth': 6,
  'decisiontreeregressor__min_samples_leaf': 3,
  'decisiontreeregressor__min_samples_split': 4,
  'selectkbest__k': 50}}

In [27]:
y_pred_dt_KBest = dt_search_KBest.predict(X_test)

Let's see how the Decision Tree performs with these 50 "best" features:

In [28]:
assessment_df.loc['Decision Tree KBest','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_dt_KBest)
assessment_df.loc['Decision Tree KBest','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_dt_KBest)
assessment_df.loc['Decision Tree KBest','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_dt_KBest)
assessment_df.loc['Decision Tree KBest','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_dt_KBest)
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
| K Neighbors | 27889.113725 | 44463.870311 | 0.152491 | 0.680071 |
| Decision Tree KBest | 26549.676503 | 44115.381472 | 0.14367 | 0.685067 |


#### 6.1.2.&nbsp; K Nearest Neighbors Regressor

In [30]:
knn_full_pipeline_KBest = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 SelectKBest(score_func=f_regression),
                                 KNeighborsRegressor())

# The parameter grid for the regressor
param_grid_knn = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    "selectkbest__k": range(10, 61, 10),
    "kneighborsregressor__n_neighbors": range(1, 13, 3),
    "kneighborsregressor__weights": ["uniform", "distance"],
    "kneighborsregressor__leaf_size": range(2, 5, 2)
}

knn_search_KBest = GridSearchCV(
    knn_full_pipeline_KBest,
    param_grid_knn,
    cv=5,
    verbose=1,
    scoring="r2"
)

# Fitting the pipeline to the training data
knn_search_KBest.fit(X_train, y_train)

# Get the best R² score and parameters
best_r2_KBest = knn_search_KBest.best_score_
best_params_KBest = knn_search_KBest.best_params_

# Output results
scores_knn_KBest = {
    "best_r2_KBest": best_r2_KBest,
    "best_params_KBest": best_params_KBest
}

scores_knn_KBest

Fitting 5 folds for each of 192 candidates, totalling 960 fits


{'best_r2_KBest': np.float64(0.771777812626909),
 'best_params_KBest': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'mean',
  'kneighborsregressor__leaf_size': 4,
  'kneighborsregressor__n_neighbors': 10,
  'kneighborsregressor__weights': 'distance',
  'selectkbest__k': 10}}

In [31]:
y_pred_knn_KBest = knn_search_KBest.predict(X_test)

In [32]:
assessment_df.loc['K Neighbors KBest','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_knn_KBest)
assessment_df.loc['K Neighbors KBest','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_knn_KBest)
assessment_df.loc['K Neighbors KBest','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_knn_KBest)
assessment_df.loc['K Neighbors KBest','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_knn_KBest)
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
| K Neighbors | 27889.113725 | 44463.870311 | 0.152491 | 0.680071 |
| Decision Tree KBest | 26549.676503 | 44115.381472 | 0.14367 | 0.685067 |
| K Neighbors KBest | 19523.700067 | 31102.436851 | 0.110415 | 0.843459 |


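The test performance of the K Nearest Neighbors regressor has increased significantly after selecting features with `SelectKBest`: the grid search kept only `k=10` features, and the test R² rose from 0.68 to 0.84. The Decision Tree did not benefit in the same way: its test R² dropped from 0.72 to 0.69.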

### 6.2.&nbsp;Recursive Feature Elimination

Recursive Feature Elimination (RFE) is an automatic feature selection technique. It begins by training a chosen model on all features and ranking them by importance. It then removes the least important feature (with `step=1`, one at a time), refits the model on the remaining features, and repeats. `RFECV`, which we use here, runs this elimination inside cross-validation, scores each candidate number of features, and keeps the number with the best cross-validated score, so there is no need to choose the number of features manually.

However, RFE needs a model that exposes feature importances after fitting. Tree-based models are generally suitable for this purpose, unlike models like KNN. In scikit-learn, you can check whether the fitted model has an attribute called `feature_importances_` (or `coef_` for linear models). This is why the KNN pipeline in 6.2.2 also uses a Decision Tree as the estimator inside `RFECV`.

#### 6.2.1.&nbsp; Decision Tree Regressor

In [37]:
dt_full_pipeline_RFE = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 RFECV(estimator=DecisionTreeRegressor(), step=1, cv=KFold(5)),
                                 DecisionTreeRegressor())

# The parameter grid for the regressor
param_grid_dt_RFE = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["median"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    #"rfecv__estimator__max_depth": range(3, 10, 2),
    #"rfecv__estimator__min_samples_split": [2, 3],
    #"rfecv__estimator__min_samples_leaf": range(2, 5),
    "decisiontreeregressor__max_depth": [6],
    "decisiontreeregressor__min_samples_split": [4],
    "decisiontreeregressor__min_samples_leaf": [3]
}

dt_search_RFE = GridSearchCV(
    dt_full_pipeline_RFE,
    param_grid_dt_RFE,
    cv=5,
    verbose=1,
    scoring="r2"
    
)

# Fitting the pipeline to the training data
dt_search_RFE.fit(X_train, y_train)

# Get the best R² score and parameters
best_r2_RFE = dt_search_RFE.best_score_
best_params_RFE = dt_search_RFE.best_params_

# Output results
scores_dt_RFE = {
    "best_r2_RFE": best_r2_RFE,
    "best_params_RFE": best_params_RFE
}

scores_dt_RFE

Fitting 5 folds for each of 1 candidates, totalling 5 fits


{'best_r2_RFE': np.float64(0.7662560278273575),
 'best_params_RFE': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'median',
  'decisiontreeregressor__max_depth': 6,
  'decisiontreeregressor__min_samples_leaf': 3,
  'decisiontreeregressor__min_samples_split': 4}}

In [38]:
y_pred_dt_RFE = dt_search_RFE.predict(X_test)

In [39]:
assessment_df.loc['Decision Tree RFE','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_dt_RFE)
assessment_df.loc['Decision Tree RFE','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_dt_RFE)
assessment_df.loc['Decision Tree RFE','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_dt_RFE)
assessment_df.loc['Decision Tree RFE','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_dt_RFE)
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
| K Neighbors | 27889.113725 | 44463.870311 | 0.152491 | 0.680071 |
| Decision Tree KBest | 26549.676503 | 44115.381472 | 0.14367 | 0.685067 |
| K Neighbors KBest | 19523.700067 | 31102.436851 | 0.110415 | 0.843459 |
| Decision Tree RFE | 26102.146558 | 42872.915799 | 0.143453 | 0.702556 |


#### 6.2.2.&nbsp; K Nearest Neighbors Regressor

In [40]:
knn_full_pipeline_RFE = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 RFECV(estimator=DecisionTreeRegressor(), step=1, cv=KFold(5)),
                                 KNeighborsRegressor())

# The parameter grid for the regressor
param_grid_knn_RFE = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    #"rfecv__estimator__max_depth": range(3, 10, 2),
    #"rfecv__estimator__min_samples_split": [2, 3],
    #"rfecv__estimator__min_samples_leaf": range(2, 5),
    "kneighborsregressor__n_neighbors": [10],
    "kneighborsregressor__weights": ["distance"],
    "kneighborsregressor__leaf_size": [4]
}

knn_search_RFE = GridSearchCV(
    knn_full_pipeline_RFE,
    param_grid_knn_RFE,
    cv=5,
    verbose=1,
    scoring="r2"
)

# Fitting the pipeline to the training data
knn_search_RFE.fit(X_train, y_train)

# Get the best R² score and parameters
best_r2_RFE = knn_search_RFE.best_score_
best_params_RFE = knn_search_RFE.best_params_

# Output results
scores_knn_RFE = {
    "best_r2_RFE": best_r2_RFE,
    "best_params_RFE": best_params_RFE
}

scores_knn_RFE

Fitting 5 folds for each of 1 candidates, totalling 5 fits


{'best_r2_RFE': np.float64(0.6737828513362515),
 'best_params_RFE': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'mean',
  'kneighborsregressor__leaf_size': 4,
  'kneighborsregressor__n_neighbors': 10,
  'kneighborsregressor__weights': 'distance'}}

In [41]:
y_pred_knn_RFE = knn_search_RFE.predict(X_test)

In [42]:
assessment_df.loc['K Neighbors RFE','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_knn_RFE)
assessment_df.loc['K Neighbors RFE','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_knn_RFE)
assessment_df.loc['K Neighbors RFE','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_knn_RFE)
assessment_df.loc['K Neighbors RFE','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_knn_RFE)
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
| K Neighbors | 27889.113725 | 44463.870311 | 0.152491 | 0.680071 |
| Decision Tree KBest | 26549.676503 | 44115.381472 | 0.14367 | 0.685067 |
| K Neighbors KBest | 19523.700067 | 31102.436851 | 0.110415 | 0.843459 |
| Decision Tree RFE | 26102.146558 | 42872.915799 | 0.143453 | 0.702556 |
| K Neighbors RFE | 28462.869059 | 45915.133769 | 0.15734 | 0.658846 |


Note that RFE with a tree-based estimator can pick up non-linear relationships and interactions between features, because the importances come from a model fitted on all remaining features together, while `SelectKBest` relies on univariate selection (scoring each feature individually against the target). In this run, however, RFE did not improve the test scores of either model.

> **Note:** When you encounter methods like `get_feature_names_out()` in pre-made notebooks, it might seem as if you needed to know them in advance. However, what's important is to be aware that Scikit-Learn transformer objects often store valuable information after being fitted. To access such information, always refer to the documentation for available attributes and methods, and search for what you need. In such a case, a simple search like "which features are selected in SelectKBest" would also yield helpful results.

### 6.3.&nbsp;Select from model

[SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html?highlight=selectfrommodel#sklearn.feature_selection.SelectFromModel) uses the importance scores provided by a model to select the most relevant features from a given dataset. The process involves training a model on the entire feature set, obtaining the feature importances or coefficients from the model, and then keeping the features whose importance reaches a specified threshold. If you don't set a threshold, SelectFromModel uses a default: the mean of the importances for most estimators (such as the Random Forest used here), or `1e-5` for L1-penalised linear models. This approach is **particularly useful** for models that inherently provide feature importances, such as **tree-based models or linear models**, allowing us to focus on the most influential features and improve model performance while reducing complexity.

> **Note:** SelectFromModel may seem similar to RFE since both methods use a model's feature importance scores, but they operate differently. SelectFromModel fits the model once and applies the threshold to that single ranking, whereas RFE refits the model after each elimination step, so its ranking is updated as features are removed.
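
To see what happens when no threshold is given, the sketch below fits `SelectFromModel` with a Random Forest on made-up data and compares the threshold it used with the mean of the importances. The dataset and the forest settings (`n_estimators=50`, `random_state=0`) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Hypothetical data: 10 features, only 3 of which influence the target
X_toy, y_toy = make_regression(n_samples=200, n_features=10, n_informative=3,
                               noise=5.0, random_state=0)

sfm = SelectFromModel(estimator=RandomForestRegressor(n_estimators=50, random_state=0))
sfm.fit(X_toy, y_toy)

importances = sfm.estimator_.feature_importances_  # from the single fitted forest
default_threshold = sfm.threshold_                 # threshold actually applied
kept = sfm.get_support()                           # mask of the selected features

print(default_threshold, importances.mean())
print(kept)
```

A stricter or looser cut can be requested through the `threshold` parameter, for example `threshold="median"` or a scaled string such as `"1.25*mean"`, and it can be tuned in a grid search like any other pipeline parameter.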

#### 6.3.1.&nbsp; Decision Tree Regressor

In [44]:
dt_full_pipeline_SFM = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 SelectFromModel(estimator=RandomForestRegressor(n_estimators=100)),
                                 DecisionTreeRegressor())

# The parameter grid for the regressor
param_grid_dt_SFM = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    "decisiontreeregressor__max_depth": range(6, 11, 2),
    "decisiontreeregressor__min_samples_split": [2, 4],
    "decisiontreeregressor__min_samples_leaf": [2, 3, 4]
}

dt_search_SFM = GridSearchCV(
    dt_full_pipeline_SFM,
    param_grid_dt_SFM,
    cv=5,
    verbose=1,
    scoring="r2"
    
)

# Fitting the pipeline to the training data
dt_search_SFM.fit(X_train, y_train)

# Get the best R² score and parameters
best_r2_SFM = dt_search_SFM.best_score_
best_params_SFM = dt_search_SFM.best_params_

# Output results
scores_dt_SFM = {
    "best_r2_SFM": best_r2_SFM,
    "best_params_SFM": best_params_SFM
}

scores_dt_SFM

Fitting 5 folds for each of 36 candidates, totalling 180 fits


{'best_r2_SFM': np.float64(0.7597560672362771),
 'best_params_SFM': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'mean',
  'decisiontreeregressor__max_depth': 6,
  'decisiontreeregressor__min_samples_leaf': 2,
  'decisiontreeregressor__min_samples_split': 2}}

In [45]:
y_pred_dt_SFM = dt_search_SFM.predict(X_test)

In [46]:
assessment_df.loc['Decision Tree SFM','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_dt_SFM)
assessment_df.loc['Decision Tree SFM','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_dt_SFM)
assessment_df.loc['Decision Tree SFM','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_dt_SFM)
assessment_df.loc['Decision Tree SFM','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_dt_SFM)
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
| K Neighbors | 27889.113725 | 44463.870311 | 0.152491 | 0.680071 |
| Decision Tree KBest | 26549.676503 | 44115.381472 | 0.14367 | 0.685067 |
| K Neighbors KBest | 19523.700067 | 31102.436851 | 0.110415 | 0.843459 |
| Decision Tree RFE | 26102.146558 | 42872.915799 | 0.143453 | 0.702556 |
| K Neighbors RFE | 28462.869059 | 45915.133769 | 0.15734 | 0.658846 |
| Decision Tree SFM | 24418.841663 | 36446.63031 | 0.137916 | 0.785042 |


#### 6.3.2.&nbsp; K Nearest Neighbors Regressor

In [47]:
knn_full_pipeline_SFM = make_pipeline(full_preprocessing, 
                                 MinMaxScaler(), 
                                 SelectFromModel(estimator=RandomForestRegressor(n_estimators=100)),
                                 KNeighborsRegressor())

# The parameter grid for the regressor
param_grid_knn_SFM = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__num_pipe__simpleimputer__fill_value": [10],
    "kneighborsregressor__n_neighbors": [8, 10],
    "kneighborsregressor__weights": ["distance"],
    "kneighborsregressor__leaf_size": [2, 3]
}

knn_search_SFM = GridSearchCV(
    knn_full_pipeline_SFM,
    param_grid_knn_SFM,
    cv=5,
    verbose=1,
    scoring="r2"
)

# Fitting the pipeline to the training data
knn_search_SFM.fit(X_train, y_train)

# Get the best R² score and parameters
best_r2_SFM = knn_search_SFM.best_score_
best_params_SFM = knn_search_SFM.best_params_

# Output results
scores_knn_SFM = {
    "best_r2_SFM": best_r2_SFM,
    "best_params_SFM": best_params_SFM
}

scores_knn_SFM

Fitting 5 folds for each of 8 candidates, totalling 40 fits


{'best_r2_SFM': np.float64(0.7617670832232244),
 'best_params_SFM': {'columntransformer__num_pipe__simpleimputer__fill_value': 10,
  'columntransformer__num_pipe__simpleimputer__strategy': 'median',
  'kneighborsregressor__leaf_size': 3,
  'kneighborsregressor__n_neighbors': 8,
  'kneighborsregressor__weights': 'distance'}}

In [48]:
y_pred_knn_SFM = knn_search_SFM.predict(X_test)

In [49]:
assessment_df.loc['K Neighbors SFM','MAE'] = mean_absolute_error(y_true = y_test, y_pred = y_pred_knn_SFM)
assessment_df.loc['K Neighbors SFM','RMSE'] = root_mean_squared_error(y_true = y_test, y_pred = y_pred_knn_SFM)
assessment_df.loc['K Neighbors SFM','MAPE'] = mean_absolute_percentage_error(y_true = y_test, y_pred = y_pred_knn_SFM)
assessment_df.loc['K Neighbors SFM','R2_score'] = r2_score(y_true = y_test, y_pred = y_pred_knn_SFM)
assessment_df

| Model | MAE | RMSE | MAPE | R2_score |
|---|---|---|---|---|
| Decision Tree | 24659.692079 | 41641.250104 | 0.134126 | 0.719401 |
| K Neighbors | 27889.113725 | 44463.870311 | 0.152491 | 0.680071 |
| Decision Tree KBest | 26549.676503 | 44115.381472 | 0.14367 | 0.685067 |
| K Neighbors KBest | 19523.700067 | 31102.436851 | 0.110415 | 0.843459 |
| Decision Tree RFE | 26102.146558 | 42872.915799 | 0.143453 | 0.702556 |
| K Neighbors RFE | 28462.869059 | 45915.133769 | 0.15734 | 0.658846 |
| Decision Tree SFM | 24418.841663 | 36446.63031 | 0.137916 | 0.785042 |
| K Neighbors SFM | 21636.568465 | 37141.903263 | 0.11653 | 0.776763 |


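We see how algorithms can react differently to the same preprocessing and feature selection strategy. On the test set, K-Nearest Neighbors gained most from `SelectKBest` (R² of 0.84), the Decision Tree gained most from `SelectFromModel` (R² of 0.79), and RFE improved neither model. This highlights the importance of exploring various approaches to find the most suitable one for a specific dataset and model.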
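
One simple way to read the final table is to sort it by the primary metric. The sketch below re-enters the test R² values reported above into a standalone Series, so it runs without the notebook state; with the notebook loaded, `assessment_df.sort_values("R2_score", ascending=False)` should give the same order.

```python
import pandas as pd

# Test R² values copied from the assessment table above
r2_test = pd.Series({
    "Decision Tree": 0.719401,
    "K Neighbors": 0.680071,
    "Decision Tree KBest": 0.685067,
    "K Neighbors KBest": 0.843459,
    "Decision Tree RFE": 0.702556,
    "K Neighbors RFE": 0.658846,
    "Decision Tree SFM": 0.785042,
    "K Neighbors SFM": 0.776763,
}, name="R2_score")

# Best model first
ranked = r2_test.sort_values(ascending=False)
print(ranked)

# Change relative to each model's baseline without feature selection
knn_gain = r2_test.filter(like="K Neighbors") - r2_test["K Neighbors"]
dt_gain = r2_test.filter(like="Decision Tree") - r2_test["Decision Tree"]
print(knn_gain.round(3))
print(dt_gain.round(3))
```

These are single train-test-split results, so small differences between rows should be read with some caution.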