# Feature Selection

- Feature selection is the process of choosing the most relevant features (input variables) from your dataset that contribute the most to predicting the target variable.
- We remove irrelevant/redundant features
- It’s different from feature extraction (like PCA), which creates new transformed features rather than selecting existing ones.
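
The difference between selection and extraction can be seen in a minimal sketch. It assumes scikit-learn is installed and uses synthetic data with hypothetical column names `f0` to `f5`; neither is part of the dataset used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 200 rows, 6 features (hypothetical names f0..f5)
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])

# Feature selection: keeps 3 of the original columns, values unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
X_selected = selector.transform(X)
kept = feature_names[selector.get_support()]

# Feature extraction: builds 3 new columns, each a linear mix of all 6 inputs
pca = PCA(n_components=3).fit(X)
X_extracted = pca.transform(X)

print("Columns kept by selection:", list(kept))
print("PCA weight matrix shape (components x original features):", pca.components_.shape)
```

The selected matrix is a subset of the original columns, while every PCA column depends on all six inputs.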

### Why it's done:
- It reduces overfitting by removing noisy or irrelevant inputs
- It speeds up training and can improve accuracy

# Types of Feature Selection Methods

## 1) Filter Methods
- They work by applying statistical tests to measure the correlation or relevance of each feature to the target variable
- Features are ranked and selected based on scores like correlation, mutual information, chi-square, etc.
- This works well as a pre-processing step before modelling, since no model needs to be trained
- Each feature is scored individually on its own merit, so this method ignores interactions between features and can keep redundant ones
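
As a rough illustration of a filter method, the sketch below ranks features by mutual information with the target and keeps the top three. It assumes scikit-learn is available; the synthetic data, the column names `f0` to `f7`, and the choice of `k=3` are all illustrative assumptions.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data with hypothetical column names f0..f7
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Score every feature against the target independently, then keep the best 3
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=3).fit(X, y)

# One score per feature, sorted from most to least relevant
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
top_features = list(X.columns[selector.get_support()])

print(scores)
print("Selected features:", top_features)
```

No predictive model is trained here, which is what keeps filter methods cheap.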

## 2) Wrapper Methods
- Treats feature selection as a search problem
- We train a model on different subsets of the features and evaluate the model performance for each subset
- The goal is to find the subset of features that gives the best predictive performance
- This is computationally very expensive and may overfit on small datasets

### Common Techniques:
- Forward Selection → start with no features, add one at a time
- Backward Elimination → start with all features, remove one at a time
- Recursive Feature Elimination (RFE) → iteratively train a model, remove least important features
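
Two of these techniques can be sketched with scikit-learn, assuming it is installed. The logistic regression estimator, the synthetic data, and the target of three features are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 rows, 8 features
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000)

# Forward selection: start with no features and add the one that most
# improves the cross-validated score, until 3 features are chosen
forward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", cv=3
).fit(X, y)

# RFE: fit on all features, drop the least important one (step=1), repeat
rfe = RFE(model, n_features_to_select=3, step=1).fit(X, y)

forward_mask = forward.get_support()
rfe_mask = rfe.get_support()

print("Forward selection kept columns:", forward_mask.nonzero()[0])
print("RFE kept columns:", rfe_mask.nonzero()[0])
print("RFE ranking (1 = kept):", rfe.ranking_)
```

Passing `direction="backward"` to `SequentialFeatureSelector` gives backward elimination instead. Both selectors refit the model many times, which is where the computational cost comes from.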

## 3) Embedded Methods
- Feature selection takes place during model training
- The model learns which features are most important using weights, regularization or built-in feature importance

### Common Techniques
- Lasso Regression (L1 regularization) → shrinks some coefficients to zero, effectively removing them
- Ridge Regression (L2 regularization) → shrinks weights but doesn’t set them to zero, so on its own it does not remove features
- Decision Trees / Random Forest / XGBoost feature importance
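
The contrast between L1 and L2 regularization can be checked on synthetic data. This is a minimal sketch assuming scikit-learn is installed; the value `alpha=5.0` and the generated regression problem are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 10 features, only 3 carry signal
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
# Put features on the same scale so the penalty treats them equally
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Lasso sets some coefficients exactly to zero; the rest are the "selected" features
lasso_kept = np.flatnonzero(lasso.coef_)
# Ridge only shrinks coefficients, so none are expected to be exactly zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Features kept by Lasso:", lasso_kept)
print("Ridge coefficients equal to zero:", n_zero_ridge)
```

A larger `alpha` makes Lasso drop more features; in practice it is usually tuned by cross-validation.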

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [10]:
%pip install feature_engine
from feature_engine.selection import DropConstantFeatures

Defaulting to user installation because normal site-packages is not writeable
Collecting feature_engine
  Using cached feature_engine-1.9.3-py3-none-any.whl.metadata (10 kB)
Using cached feature_engine-1.9.3-py3-none-any.whl (229 kB)
Installing collected packages: feature_engine
Successfully installed feature_engine-1.9.3
Note: you may need to restart the kernel to use updated packages.


In [14]:
data = pd.read_csv('./data/dataset_1.csv')
data.head()


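| Unnamed: 0 | var_1 | var_2 | var_3 | var_4 | var_5 | var_6 | var_7 | var_8 | var_9 | var_10 | ... | var_292 | var_293 | var_294 | var_295 | var_296 | var_297 | var_298 | var_299 | var_300 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 1 | 0 | 0 | 0.0 | 3.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 2 | 0 | 0 | 0.0 | 5.88 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0 | 0 | 3 | 0 | 0 | 0 | 0.0 | 67772.7216 | 0 |
| 3 | 0 | 0 | 0.0 | 14.1 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 4 | 0 | 0 | 0.0 | 5.76 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |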


In [15]:
x = data.drop('target', axis=1)
y = data['target']

# Number of features before removing constant features
print(f'Number of features before removing constant features: {x.shape[1]}')

# tol sets the threshold for identifying constant / quasi-constant features:
# a feature is dropped when its most frequent value covers at least that share of observations
# tol=1 (the default) means only features with the same value in all observations will be removed
# tol=0.99 means features with 99% or more of the same value will be removed
se = DropConstantFeatures()
x_transformed = se.fit_transform(x)
# Number of features after removing constant features
print(f'Number of features after removing constant features: {x_transformed.shape[1]}')


Number of features before removing constant features: 300
Number of features after removing constant features: 267
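
To make the role of `tol` concrete, the check can be approximated in plain pandas on a toy frame. This is only a sketch of the idea, not feature_engine's actual implementation: the helper names `dominant_share` and `columns_to_drop` are hypothetical, and the "at least `tol`" comparison is an assumption about how the threshold is applied.

```python
import pandas as pd

# Hypothetical toy frame: 10 rows, three columns with different dominance levels
toy = pd.DataFrame({
    "constant": [0] * 10,             # one value in 100% of rows
    "quasi_constant": [0] * 9 + [1],  # most frequent value in 90% of rows
    "varied": list(range(10)),        # every value appears once (10% each)
})

def dominant_share(frame):
    """Share of rows taken by each column's most frequent value."""
    return frame.apply(lambda col: col.value_counts(normalize=True).iloc[0])

def columns_to_drop(frame, tol):
    """Columns whose most frequent value covers at least `tol` of the rows."""
    share = dominant_share(frame)
    return list(share[share >= tol].index)

print(dominant_share(toy))
print(columns_to_drop(toy, tol=1))    # only the truly constant column
print(columns_to_drop(toy, tol=0.9))  # constant and quasi-constant columns
```

Lowering `tol` makes the filter more aggressive, which is why the number of surviving features falls as the threshold moves from 1 toward 0.9.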


In [17]:
# Setting tol=0.9 to also remove quasi-constant features,
# i.e. features whose most frequent value covers at least 90% of observations
se = DropConstantFeatures(tol=0.9)
x_transformed = se.fit_transform(x)
# Number of features after removing quasi-constant features
print(f'Number of features after removing quasi-constant features: {x_transformed.shape[1]}')
# Checking if the removed features are indeed quasi-constant
removed_features = list(set(x.columns) - set(x_transformed.columns))
print(f'Removed features: {removed_features}')
# Checking the percentage of the most frequent value in the removed features
for feature in removed_features:
    most_frequent = x[feature].value_counts(normalize=True).max()
    print(f'Feature: {feature}, Most frequent value percentage: {most_frequent}')

Number of features after removing quasi-constant features: 50
Removed features: ['var_33', 'var_282', 'var_112', 'var_229', 'var_248', 'var_134', 'var_219', 'var_163', 'var_244', 'var_101', 'var_258', 'var_197', 'var_78', 'var_28', 'var_116', 'var_294', 'var_45', 'var_232', 'var_247', 'var_177', 'var_195', 'var_165', 'var_237', 'var_226', 'var_274', 'var_130', 'var_264', 'var_260', 'var_153', 'var_3', 'var_241', 'var_10', 'var_58', 'var_235', 'var_99', 'var_2', 'var_233', 'var_109', 'var_61', 'var_95', 'var_245', 'var_59', 'var_96', 'var_63', 'var_200', 'var_168', 'var_30', 'var_159', 'var_119', 'var_204', 'var_291', 'var_105', 'var_70', 'var_162', 'var_236', 'var_239', 'var_238', 'var_202', 'var_143', 'var_263', 'var_107', 'var_180', 'var_249', 'var_34', 'var_206', 'var_280', 'var_1', 'var_47', 'var_228', 'var_300', 'var_108', 'var_171', 'var_42', 'var_97', 'var_252', 'var_53', 'var_176', 'var_243', 'var_124', 'var_36', 'var_54', 'var_137', 'var_224', 'var_290', 'var_25', 'var_65', 'v