# 3. Feature selection

Feature selection is the process of choosing a subset of relevant features (attributes) from a dataset's original feature set. The goal is to keep the most informative features, which reduces dimensionality and can improve model performance.

### 3.1 SelectKBest (Filter Method)

SelectKBest is a filter method that scores each feature independently with a univariate statistical test (here f_classif, the ANOVA F-test) and keeps the k features with the highest scores against the target variable.
In our case, we selected the top 7 features.

In [11]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset
dfs = pd.read_csv('Housing_cleaned.csv')

# List of non-numeric columns that need manual conversion
non_numeric_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'furnishingstatus', 'prefarea']

# Convert non-numeric columns to numeric using LabelEncoder
label_encoder = LabelEncoder()
for col in non_numeric_columns:
    dfs[col] = label_encoder.fit_transform(dfs[col])

# X holds every column except the target column 'price'
X = dfs.drop(columns=['price'])  # All features
y = (dfs['price'] > dfs['price'].median()).astype(int)  # Binary target: 1 if price is above the median, 0 otherwise

# Select the top 7 features using SelectKBest
selector = SelectKBest(score_func=f_classif, k=7)
X_new = selector.fit_transform(X, y)

# Print the selected features
selected_features = X.columns[selector.get_support()]
print("Selected Features (SelectKBest):", selected_features)

Selected Features (SelectKBest): Index(['area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'airconditioning', 'prefarea'],
      dtype='object')
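
To see why a given feature makes the cut, it can help to look at the scores themselves rather than only the final mask. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` with hypothetical feature names `f0`..`f9` instead of the housing data, and reads the fitted selector's `scores_` and `pvalues_` attributes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the housing data (hypothetical feature names f0..f9)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

kbest = SelectKBest(score_func=f_classif, k=7).fit(X_demo, y_demo)

# scores_ holds the ANOVA F-statistic per feature, pvalues_ the matching p-values
score_table = pd.DataFrame(
    {
        "F": kbest.scores_,
        "p_value": kbest.pvalues_,
        "selected": kbest.get_support(),
    },
    index=X_demo.columns,
).sort_values("F", ascending=False)
print(score_table)
```

The rows marked as selected are simply the seven largest F-statistics, so a feature just below the cutoff may be nearly as informative as one just above it.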


### 3.2 Variance Threshold (Filter Method)

This method removes all features whose variance doesn’t meet the given threshold. Low-variance features are considered less informative for predicting the target, although the method never looks at the target itself.
We set the variance threshold to 0.2, so only features whose variance exceeds 0.2 are kept. Variance depends on a feature's scale, so this cutoff is applied to the raw, unscaled values.

In [12]:
from sklearn.feature_selection import VarianceThreshold
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset
dfs = pd.read_csv('Housing_cleaned.csv')

# List of non-numeric columns that need manual conversion
non_numeric_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'furnishingstatus', 'prefarea']

# Convert non-numeric columns to numeric using LabelEncoder
label_encoder = LabelEncoder()
for col in non_numeric_columns:
    dfs[col] = label_encoder.fit_transform(dfs[col])

# X holds every column except the target column 'price'
X = dfs.drop(columns=['price'])  # All features
y = (dfs['price'] > dfs['price'].median()).astype(int)  # Binary target: 1 if price is above the median, 0 otherwise

# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.2)  # Keep only features whose variance exceeds 0.2
X_new = selector.fit_transform(X)

# Print the selected features
selected_features = X.columns[selector.get_support()]
print("Selected Features (Variance Threshold):", selected_features)

Selected Features (Variance Threshold): Index(['area', 'bedrooms', 'bathrooms', 'stories', 'basement',
       'airconditioning', 'parking', 'furnishingstatus'],
      dtype='object')
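
A fixed cutoff of 0.2 behaves very differently for binary and continuous columns. The sketch below is a small numeric illustration, not part of the original analysis: the shares of ones and the `area_sqft` values are made-up numbers. It relies on the fact that a 0/1 feature with a share p of ones has variance p(1 - p), which can never exceed 0.25.

```python
import numpy as np

# Variance of a 0/1 feature with a share p of ones is p * (1 - p)
binary_variance = {p: p * (1 - p) for p in (0.5, 0.7, 0.8, 0.9)}
for p, var in binary_variance.items():
    verdict = "kept" if var > 0.2 else "dropped"
    print(f"share of ones = {p:.1f} -> variance = {var:.2f} ({verdict})")

# The same cutoff depends on the unit of a continuous feature
rng = np.random.default_rng(0)
area_sqft = rng.uniform(1500, 10000, size=500)  # hypothetical raw values
area_minmax = (area_sqft - area_sqft.min()) / (area_sqft.max() - area_sqft.min())

print(f"variance of raw area:     {area_sqft.var():.1f}")
print(f"variance of min-max area: {area_minmax.var():.3f}")
```

Under these assumptions a yes/no column is only kept when neither value is much rarer than roughly 28%, while a raw-scale column such as an area in square feet passes trivially and the same column rescaled to [0, 1] would not.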


### 3.3 Recursive Feature Elimination (RFE)


RFE is a wrapper method that repeatedly fits a model, ranks the features by the model’s coefficients (or feature importances), and removes the least important one until the requested number of features remains.
We used Logistic Regression with max_iter=2000 as the underlying model and selected the top 7 features.

In [13]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset
dfs = pd.read_csv('Housing_cleaned.csv')

# List of non-numeric columns that need manual conversion
non_numeric_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'furnishingstatus', 'prefarea']

# Convert non-numeric columns to numeric using LabelEncoder
label_encoder = LabelEncoder()
for col in non_numeric_columns:
    dfs[col] = label_encoder.fit_transform(dfs[col])

# X holds every column except the target column 'price'
X = dfs.drop(columns=['price'])  # All features
y = (dfs['price'] > dfs['price'].median()).astype(int)  # Binary target: 1 if price is above the median, 0 otherwise

# Use RFE with Logistic Regression as the underlying model
model = LogisticRegression(max_iter=2000)
rfe = RFE(model, n_features_to_select=7)  # Select top 7 features
X_new = rfe.fit_transform(X, y)

# Print the selected features
selected_features = X.columns[rfe.get_support()]
print("Selected Features (RFE):", selected_features)

Selected Features (RFE): Index(['bathrooms', 'stories', 'mainroad', 'guestroom', 'basement',
       'airconditioning', 'prefarea'],
      dtype='object')
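
Beyond the final mask, a fitted RFE object also records the order in which features were eliminated. The following sketch is illustrative and runs on synthetic data with hypothetical feature names. It standardizes the features first, which is an assumption added here and not a step in the cell above, because RFE with Logistic Regression ranks features by coefficient magnitude and coefficients are only comparable across features on a common scale.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (hypothetical feature names f0..f9)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(10)]

# Put all features on the same scale so coefficient magnitudes are comparable
X_scaled = StandardScaler().fit_transform(X_arr)

rfe_demo = RFE(LogisticRegression(max_iter=2000), n_features_to_select=7, step=1)
rfe_demo.fit(X_scaled, y_demo)

# ranking_ is 1 for kept features; larger values were eliminated earlier
ranking = pd.Series(rfe_demo.ranking_, index=feature_names).sort_values()
print(ranking)
```

With 10 features, 7 kept, and one feature removed per step, the three eliminated features receive ranks 2, 3, and 4, where rank 4 was dropped first.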


### 3.4 L1 Regularization (Lasso - Embedded Method)

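Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression model with L1 regularization: it adds a penalty proportional to the sum of the absolute values of the coefficients, which drives some coefficients to exactly zero.
Features with zero coefficients are removed, while those with non-zero coefficients are retained. We used alpha=0.1, and the model is fit on the binary above-median price label.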


In [15]:
from sklearn.linear_model import Lasso
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset
dfs = pd.read_csv('Housing_cleaned.csv')

# List of non-numeric columns that need manual conversion
non_numeric_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'furnishingstatus', 'prefarea']

# Convert non-numeric columns to numeric using LabelEncoder
label_encoder = LabelEncoder()
for col in non_numeric_columns:
    dfs[col] = label_encoder.fit_transform(dfs[col])

# X holds every column except the target column 'price'
X = dfs.drop(columns=['price'])  # All features
y = (dfs['price'] > dfs['price'].median()).astype(int)  # Binary target: 1 if price is above the median, 0 otherwise

# Apply L1 Regularization (Lasso)
model = Lasso(alpha=0.1, max_iter=1000)
model.fit(X, y)

# Print the selected features based on non-zero coefficients
selected_features = X.columns[model.coef_ != 0]
print("Selected Features (Lasso):", selected_features)

Selected Features (Lasso): Index(['area', 'stories'], dtype='object')
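
How many features survive depends heavily on the penalty strength `alpha`. The sketch below is an illustration on synthetic data, not a rerun of the housing analysis, and it standardizes the features first (an assumption added here) so that the penalty treats every coefficient on the same scale. It counts the non-zero coefficients for a few `alpha` values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data with a 0/1 target
X_arr, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_arr)

# Count surviving (non-zero) coefficients for increasing penalty strength
nonzero_counts = {}
for alpha in (0.001, 0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    print(f"alpha = {alpha:<5} -> {nonzero_counts[alpha]} non-zero coefficients")
```

With a 0/1 target and standardized features, `alpha=1.0` is larger than any feature's covariance with the target, so every coefficient is zero. A very short list of selected features can therefore reflect the choice of `alpha` as much as the data, and a cross-validated choice such as `LassoCV` is one way to set it less arbitrarily.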


### 3.5 Hybrid Method (SelectKBest + RFE)

This method combines the filter (SelectKBest) and wrapper (RFE) approaches. First, SelectKBest reduces the feature space to the 10 features with the highest F-scores. Then RFE, with Logistic Regression as the underlying model, reduces that set to the best 7 features.

In [16]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the dataset
dfs = pd.read_csv('Housing_cleaned.csv')

# List of non-numeric columns that need manual conversion
non_numeric_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'furnishingstatus', 'prefarea']

# Convert non-numeric columns to numeric using LabelEncoder
label_encoder = LabelEncoder()
for col in non_numeric_columns:
    dfs[col] = label_encoder.fit_transform(dfs[col])

# X holds every column except the target column 'price'
X = dfs.drop(columns=['price'])  # All features
y = (dfs['price'] > dfs['price'].median()).astype(int)  # Binary target: 1 if price is above the median, 0 otherwise

# Step 1: Use SelectKBest (Filter Method) to reduce feature space
selector = SelectKBest(score_func=f_classif, k=10)  # Select the top 10 features
X_filtered = selector.fit_transform(X, y)

# Step 2: Use RFE (Wrapper Method) to find the best candidate subset
model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=7)  # Further reduce to 7 features
X_new = rfe.fit_transform(X_filtered, y)

# Print the selected features
selected_features_kbest = X.columns[selector.get_support()]
selected_features_rfe = selected_features_kbest[rfe.get_support()]
print("Selected Features after Hybrid FS (SelectKBest + RFE):", selected_features_rfe)


Selected Features after Hybrid FS (SelectKBest + RFE): Index(['bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom',
       'airconditioning', 'prefarea'],
      dtype='object')
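
The two stages can also be chained in a scikit-learn `Pipeline`, so that both are fit in one call and always applied in the same order. This is a minimal sketch on synthetic data with hypothetical feature names `f0`..`f11`; the step names `"filter"` and `"wrapper"` are arbitrary labels chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in with 12 hypothetical features named f0..f11
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, n_redundant=0, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(12)])

hybrid = Pipeline([
    ("filter", SelectKBest(score_func=f_classif, k=10)),
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=7)),
])
hybrid.fit(X_demo, y_demo)
X_reduced = hybrid.transform(X_demo)

# Map the RFE mask back through the SelectKBest mask to the original names
kbest_mask = hybrid.named_steps["filter"].get_support()
rfe_mask = hybrid.named_steps["wrapper"].get_support()
final_names = feature_names[kbest_mask][rfe_mask]
print("Selected features:", list(final_names))
```

The RFE mask indexes the 10 columns that survived the filter step, not the original 12, which is why it has to be applied to the already filtered names.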


### Best Method for Our Data

Best method: Hybrid Method (SelectKBest + RFE)

Reason: The Hybrid Method balances univariate relevance with model-based importance. SelectKBest first discards the features with the weakest univariate F-scores against the target, and RFE then keeps the features that the Logistic Regression model ranks highest when they are considered together.
The resulting feature set combines the strengths of both techniques, which we consider the best compromise between simplicity and accuracy for this dataset: it reflects both statistical significance and model-based importance.
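
One way to check a ranking of selection methods is to compare them by cross-validated accuracy, with the selector placed inside the pipeline so that it is refit on each training fold. The sketch below is illustrative only: it uses synthetic data, so its numbers say nothing about the housing dataset, and the helper `make_clf` and the candidate labels are hypothetical names introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=5, n_redundant=2, random_state=0
)

def make_clf():
    """Fresh classifier instance for each pipeline (hypothetical helper)."""
    return LogisticRegression(max_iter=1000)

candidates = {
    "all features": make_pipeline(StandardScaler(), make_clf()),
    "SelectKBest (7)": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=7), make_clf()
    ),
    "RFE (7)": make_pipeline(
        StandardScaler(), RFE(make_clf(), n_features_to_select=7), make_clf()
    ),
    "hybrid (10 -> 7)": make_pipeline(
        StandardScaler(),
        SelectKBest(f_classif, k=10),
        RFE(make_clf(), n_features_to_select=7),
        make_clf(),
    ),
}

# Mean 5-fold accuracy per candidate; selection is refit inside every fold
cv_results = {
    name: cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, pipe in candidates.items()
}
for name, acc in cv_results.items():
    print(f"{name:>16}: {acc:.3f}")
```

Running the same comparison on the encoded housing features would show whether the hybrid subset actually outperforms the alternatives for this dataset.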