Feature selection is the process of choosing a subset of relevant features (variables, predictors) for use in model construction. It can improve a machine learning model's predictive performance, reduce overfitting, and lower computational cost, because the model is trained only on the most informative inputs.

Types of Feature Selection:
1) Filter Methods:

Variance Threshold: Removes features with low variance.
Correlation Matrix with Heatmap: Identifies pairs of highly correlated features so that one feature of each pair can be removed as redundant.
Statistical Tests: Uses statistical methods (e.g., Chi-Square, ANOVA) to select features based on their relationship with the target variable.

2) Wrapper Methods:

Recursive Feature Elimination (RFE): Repeatedly fits a model and removes the least important features, as ranked by the model's coefficients or feature importances.
Forward Selection: Starts with no features and adds, one at a time, the feature that improves the model the most.
Backward Elimination: Starts with all features and removes the least significant feature one at a time.
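
The wrapper methods listed above can be sketched with scikit-learn's `RFE` and `SequentialFeatureSelector`. This is a minimal illustration, not a tuned setup: the Iris dataset, the logistic regression estimator, and the choice of keeping 2 features are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Assumed estimator: RFE needs a model that exposes coef_ or feature_importances_
estimator = LogisticRegression(max_iter=1000)

# Recursive Feature Elimination: refit and drop the weakest feature until 2 remain
rfe = RFE(estimator, n_features_to_select=2)
rfe.fit(X, y)
print("RFE support:", rfe.support_)   # boolean mask of kept features
print("RFE ranking:", rfe.ranking_)   # rank 1 marks a selected feature

# Forward selection: start empty and add the feature that most improves the CV score
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)
print("Forward selection support:", sfs.get_support())
```

Passing `direction="backward"` to `SequentialFeatureSelector` gives backward elimination instead.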

3) Embedded Methods:

Lasso (L1 Regularization): Adds a penalty proportional to the sum of the absolute values of the coefficients, which shrinks some coefficients to exactly zero and thereby removes those features.
Ridge (L2 Regularization): Adds a penalty proportional to the sum of the squared coefficients. It shrinks coefficients toward zero but not exactly to zero, so it does not remove features on its own.
Tree-Based Methods: Uses decision tree algorithms (e.g., Random Forest, Gradient Boosting) to measure feature importance.

Each method has its own advantages and is suited to different scenarios, depending on the nature of the data and the problem at hand.
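
The embedded methods listed above can be sketched as follows. The example is illustrative only: the diabetes regression dataset bundled with scikit-learn, the value `alpha=1.0`, and the forest settings are assumptions chosen for the demonstration, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Standardize so the L1 penalty treats every feature on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Lasso: features whose coefficient is exactly zero are effectively removed
lasso = Lasso(alpha=1.0).fit(X_scaled, y)   # alpha is an assumed value
kept = np.flatnonzero(lasso.coef_)
print("Lasso coefficients:", lasso.coef_)
print("Indices of features kept by Lasso:", kept)

# Tree-based importance: larger values mean the feature was more useful for splits
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("Random forest importances:", forest.feature_importances_)
```

A larger `alpha` zeroes out more coefficients, while a smaller one keeps more features.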

Filter Methods:

i) Variance Threshold: A simple baseline approach to feature selection. It removes all features whose variance doesn’t meet a given threshold. By default (threshold of 0), it removes only zero-variance features, i.e., features that have the same value in all samples.

In [1]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
import pandas as pd

In [2]:
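# Load the Iris dataset: 150 samples, 4 numeric features, 3 classes
iris = load_iris()
X = iris.data
y = iris.target

# Wrap the features in a DataFrame for inspection and append the target
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target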

In [3]:
df

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [4]:
# Keep only the features whose variance exceeds 0.3
selector = VarianceThreshold(threshold=0.3)
X_reduced = selector.fit_transform(X)

In [5]:
X_reduced.shape

(150, 3)
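
To see which of the four Iris features was dropped, the fitted selector can be inspected. The sketch below refits the same selector and prints each feature's variance next to the keep/drop decision; `variances_` and `get_support()` are standard `VarianceThreshold` members, and the loop is only an illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()

# Same threshold as in the cell above
selector = VarianceThreshold(threshold=0.3)
selector.fit(iris.data)

# variances_ holds the per-feature variance; get_support() marks the kept features
for name, var, keep in zip(iris.feature_names, selector.variances_, selector.get_support()):
    print(f"{name}: variance={var:.3f} kept={keep}")
```

Sepal width has a variance of roughly 0.19, below the 0.3 threshold, so it is the one feature removed.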

ii) Correlation Matrix with Heatmap: Computes the pairwise correlation between features. When two features are highly correlated, they carry largely the same information, so one of them can be removed.


In [6]:
import pandas as pd
import numpy as np

# Sample data: 5 samples, 4 features
data = {
    'Feature1': [0, 0, 0, 0, 0],
    'Feature2': [2, 1, 1, 1, 1],
    'Feature3': [0, 4, 1, 0, 1],
    'Feature4': [3, 3, 3, 3, 3]
}

df = pd.DataFrame(data)

# Calculate the correlation matrix
corr_matrix = df.corr()

# Display the correlation matrix
print("Correlation Matrix:\n", corr_matrix)



# Set a threshold for high (absolute) correlation
threshold = 0.9

# Keep only the upper triangle, excluding the diagonal, so that each feature pair
# is checked once and a feature's correlation with itself (1.0) is ignored
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.abs().where(mask)

# Find columns correlated above the threshold with an earlier column
to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]

# Drop highly correlated features
# (in this sample no pair exceeds the threshold, so nothing is dropped;
# the constant columns Feature1 and Feature4 only produce NaN correlations)
df_reduced = df.drop(columns=to_drop)

print("Original DataFrame:\n", df)
print("Reduced DataFrame:\n", df_reduced)


Correlation Matrix:
           Feature1  Feature2  Feature3  Feature4
Feature1       NaN       NaN       NaN       NaN
Feature2       NaN  1.000000 -0.408248       NaN
Feature3       NaN -0.408248  1.000000       NaN
Feature4       NaN       NaN       NaN       NaN
Original DataFrame:
    Feature1  Feature2  Feature3  Feature4
0         0         2         0         3
1         0         1         4         3
2         0         1         1         3
3         0         1         0         3
4         0         1         1         3
Reduced DataFrame:
    Feature1  Feature2  Feature3  Feature4
0         0         2         0         3
1         0         1         4         3
2         0         1         1         3
3         0         1         0         3
4         0         1         1         3
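
A dataset with genuinely correlated features shows the idea more clearly. The sketch below applies a correlation filter to the Iris features; the 0.9 threshold and the names `df_iris`, `upper`, and `df_iris_reduced` are assumptions for this example. Only the part of the matrix above the diagonal is examined, so a feature's correlation with itself is never counted.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Absolute correlations, so strong negative correlations are caught as well
corr = df_iris.corr().abs()

# Keep only entries strictly above the diagonal (one entry per feature pair)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates above 0.9 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_iris_reduced = df_iris.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(df_iris_reduced.columns))
```

On Iris this drops petal width, whose correlation with petal length is about 0.96.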


iii) ANOVA (Analysis of Variance):
Used when the features are numerical and the target variable is categorical.
Compares the variance between groups (classes) with the variance within groups; the ratio is the F-statistic.
Example: In a classification problem, ANOVA can be used to determine whether the mean of a feature differs significantly between classes.

iv) Mutual Information:
Measures the mutual dependence between two variables, including non-linear dependence.
Can be used for both classification and regression problems.
Example: Higher mutual information indicates a stronger dependency between the feature and the target variable.

In [7]:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Sample data: 5 samples, 4 features
X = np.array([
    [0, 2, 0, 3],
    [0, 1, 4, 3],
    [0, 1, 1, 3],
    [0, 1, 0, 3],
    [0, 1, 1, 3]
])
y = np.array([0, 1, 0, 1, 0])  # Target variable (binary classification)

# ANOVA (f_classif)
anova_selector = SelectKBest(score_func=f_classif, k=2)
X_anova_selected = anova_selector.fit_transform(X, y)

print("ANOVA Selected features:\n", X_anova_selected)
print("ANOVA Scores:\n", anova_selector.scores_)

# Mutual Information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_mi_selected = mi_selector.fit_transform(X, y)

print("Mutual Information Selected features:\n", X_mi_selected)
print("Mutual Information Scores:\n", mi_selector.scores_)


ANOVA Selected features:
 [[2 0]
 [1 4]
 [1 1]
 [1 0]
 [1 1]]
ANOVA Scores:
 [       nan 0.60000175 0.73846143        nan]



Mutual Information Selected features:
 [[0 3]
 [4 3]
 [1 3]
 [0 3]
 [1 3]]
Mutual Information Scores:
 [0.         0.         0.         0.13333333]
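
The ANOVA score can be reproduced by hand, which makes clear what `f_classif` measures. The sketch below computes the F-statistic for the second feature of the sample data and compares it with scikit-learn's value; the variable names are hypothetical and chosen for this example.

```python
import numpy as np
from sklearn.feature_selection import f_classif

feature = np.array([2, 1, 1, 1, 1], dtype=float)   # second column of the sample X
y = np.array([0, 1, 0, 1, 0])

classes = np.unique(y)
grand_mean = feature.mean()
groups = [feature[y == c] for c in classes]

# Between-group sum of squares: spread of the class means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the samples around their own class mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(classes) - 1
df_within = len(feature) - len(classes)

# F = mean square between / mean square within
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_sklearn = f_classif(feature.reshape(-1, 1), y)[0][0]

print("Manual F:", f_manual)
print("f_classif F:", f_sklearn)
```

For a constant feature both sums of squares are zero, so the ratio is 0/0. That is why the first and fourth features receive a NaN score.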


v) Chi-Square: The Chi-Square test is a statistical test used to determine whether there is a significant association between two categorical variables. In feature selection, it is used in classification problems to select the features most strongly associated with the target variable. Note that scikit-learn's chi2 requires non-negative feature values (e.g., counts or encoded categories).

In [8]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Sample data: 5 samples, 4 categorical features
data = {
    'Feature1': [0, 1, 0, 1, 0],
    'Feature2': [2, 1, 2, 1, 2],
    'Feature3': [1, 1, 0, 0, 1],
    'Feature4': [3, 3, 3, 3, 3]
}
X = pd.DataFrame(data)
y = np.array([0, 1, 0, 1, 0])  # Target variable (binary classification)

# Initialize the Chi-Square feature selector
chi2_selector = SelectKBest(chi2, k=2)

# Fit and transform the data
X_kbest = chi2_selector.fit_transform(X, y)

# Display the selected features
print("Original features:\n", X)
print("Selected features:\n", X_kbest)
print("Chi-Square Scores:\n", chi2_selector.scores_)
print("Selected feature indices:", chi2_selector.get_support(indices=True))


Original features:
    Feature1  Feature2  Feature3  Feature4
0         0         2         1         3
1         1         1         1         3
2         0         2         0         3
3         1         1         0         3
4         0         2         1         3
Selected features:
 [[0 2]
 [1 1]
 [0 2]
 [1 1]
 [0 2]]
Chi-Square Scores:
 [3.         0.75       0.05555556 0.        ]
Selected feature indices: [0 1]
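
The score of 3.0 for Feature1 can be checked by hand. The sketch below follows the way scikit-learn's `chi2` treats feature values as counts: it sums the feature within each class (observed), compares that with the share each class would receive if the feature were independent of the target (expected), and adds up the squared differences scaled by the expected values. The variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2

feature = np.array([0, 1, 0, 1, 0], dtype=float)   # Feature1 from the sample data
y = np.array([0, 1, 0, 1, 0])

classes = np.unique(y)

# Observed: total of the feature within each class
observed = np.array([feature[y == c].sum() for c in classes])
# Expected: overall feature total split according to the class frequencies
class_prob = np.array([(y == c).mean() for c in classes])
expected = class_prob * feature.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
chi2_sklearn = chi2(feature.reshape(-1, 1), y)[0][0]

print("Manual chi2:", chi2_manual)
print("sklearn chi2:", chi2_sklearn)
```

Feature4 is constant, so its observed totals match the expected ones exactly and its score is 0.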
