<div align="justify">

Datasets in data science often contain a large number of features, but not all of them contribute equally to prediction. This is where feature selection comes in. It involves selecting a subset of relevant features from the original feature set, which reduces the feature space and the computational cost while maintaining or improving the model’s performance. When a dataset has too many features, some may be irrelevant or add noise, which can slow down training and reduce accuracy. Removing them helps in building simpler, faster and more accurate models and also reduces overfitting.

</div>

<div align="center">


![](../../images/image4.png)

![](../../images/image5.png)

![](../../images/image6.png)

![](../../images/image7.png)

</div>

<div align="justify">

Feature selection algorithms are grouped into three main categories, each with its own strengths and trade-offs depending on the use case.

</div>

### __1. Filter Methods__

<div align="justify">

Filter methods score each feature against the target variable independently, without training a model. Features with a strong statistical relationship to the target, such as a high correlation, are selected because that relationship suggests they can help in making predictions. These methods are used in the preprocessing phase to remove irrelevant or redundant features based on statistical tests or other criteria.

<div align="center">

![](../../images/filter-methods-implementation.png)

_Filter Methods Implementation_

</div>

<strong>Advantages</strong>

- Quickly evaluate features without training the model.
- Good for removing redundant or correlated features.

<strong>Limitations:</strong> These methods don't consider feature interactions, so they may miss combinations of features that improve model performance only when used together.

Some techniques used are:  

- __Information Gain:__ It is the amount of information a feature provides about the target value, measured as the reduction in entropy of the target once the feature is known. The information gain of each feature is calculated with respect to the target, and features with higher gain are preferred.
- __Chi-square test:__ It is generally used to test the relationship between categorical variables. It compares the observed frequencies of feature and target category combinations with the frequencies expected if the two were independent.
- __Fisher’s Score:__ It scores each feature independently under the Fisher criterion, which can lead to a suboptimal set of features because features are not evaluated jointly. The larger the Fisher’s score, the better the feature.
- __Pearson’s Correlation Coefficient:__ It quantifies the strength and direction of the linear association between two continuous variables, with values ranging from -1 to 1.
- __Variance Threshold:__ It removes all features whose variance doesn’t meet a specified threshold. By default, only features with zero variance are removed. The assumption is that features with higher variance are likely to contain more information.
- __Mean Absolute Difference:__ It is similar to the variance threshold method, except that the deviations are not squared. It calculates the mean absolute difference of the values from their mean.
- __Dispersion ratio:__ It is defined as the ratio of the arithmetic mean (AM) to the geometric mean (GM) of a positive-valued feature. Its value ranges from +1 to infinity because AM ≥ GM. A higher dispersion ratio implies a more relevant feature.
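
The information gain score from the list above can be sketched in a few lines of plain Python. This is a minimal illustration for categorical features only; the `outlook`, `windy` and `play` columns are hypothetical toy data, not taken from a real dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: two categorical features and a binary target.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = ["yes", "no", "yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

gain_outlook = information_gain(outlook, play)  # outlook fully determines play
gain_windy = information_gain(windy, play)      # windy tells us nothing about play
print(f"outlook: {gain_outlook:.3f}, windy: {gain_windy:.3f}")
```

Here `outlook` separates the classes perfectly, so its gain equals the full entropy of `play`, while `windy` leaves the class proportions unchanged and scores zero.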

</div>
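
As a rough sketch of a filter pipeline, the example below drops constant features with a variance threshold and then ranks the rest by absolute Pearson correlation with the target. It uses only the Python standard library; the toy columns (`size`, `rooms`, `constant`, `noise`), the zero threshold and the choice of keeping two features are assumptions made for illustration.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def mean_absolute_difference(xs):
    """Mean absolute deviation of the values from their mean."""
    mean_x = statistics.fmean(xs)
    return sum(abs(x - mean_x) for x in xs) / len(xs)


def dispersion_ratio(xs):
    """Arithmetic mean divided by geometric mean; assumes strictly positive values."""
    return statistics.fmean(xs) / statistics.geometric_mean(xs)


# Hypothetical toy data: the target is exactly 3 * size.
features = {
    "size": [50.0, 60.0, 80.0, 100.0, 120.0],
    "rooms": [2.0, 2.0, 3.0, 4.0, 4.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "noise": [3.0, 9.0, 1.0, 7.0, 4.0],
}
target = [150.0, 180.0, 240.0, 300.0, 360.0]

# Step 1: variance threshold (here zero) removes constant features.
kept = {name: col for name, col in features.items() if statistics.pvariance(col) > 0.0}

# Step 2: compute per-feature scores without training any model.
scores = {
    name: {
        "pearson": pearson(col, target),
        "mad": mean_absolute_difference(col),
        "dispersion": dispersion_ratio(col),
    }
    for name, col in kept.items()
}

# Step 3: keep the two features most correlated with the target.
ranked = sorted(kept, key=lambda name: abs(scores[name]["pearson"]), reverse=True)
top_two = ranked[:2]
print(top_two)  # ['size', 'rooms']
```

Because every score is computed per feature, the ranking cannot tell that `rooms` is largely redundant with `size`, which illustrates the limitation noted above.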

### __2. Wrapper Methods__

<div align="justify">

Wrapper methods, often described as greedy algorithms, train a model repeatedly on different combinations of features. They evaluate how well each feature subset predicts the target variable and add or remove features based on the result. The stopping criterion for selecting the best subset is usually defined in advance by the person training the model, for example stopping when the performance of the model decreases or when a specific number of features is reached.

<div align="center">

![](../../images/wrapper-methods-implementation.png)

_Wrapper Methods Implementation_

</div>

<strong>Advantages</strong>

- Can lead to better model performance since they evaluate feature subsets in the context of the model.
- They can capture feature dependencies and interactions.

<strong>Limitations:</strong> They are computationally more expensive than filter methods, especially for large datasets, because every candidate subset requires training a model.

Some techniques used are:

- __Forward selection:__ This method is an iterative approach where we start with an empty set of features and, in each iteration, add the feature that best improves the model. The process stops when adding a new feature no longer improves the performance of the model.
- __Backward elimination:__ This method is also an iterative approach where we start with all features and, in each iteration, remove the least significant feature. The process stops when removing a feature no longer improves the performance of the model.
- __Recursive elimination:__ Recursive elimination is a greedy method that selects features by recursively removing the least important ones. It trains a model, ranks features by importance and eliminates them one by one until the desired number of features is reached.

</div>
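
Forward selection as described above can be sketched with a small stand-in model. The example assumes a nearest-centroid classifier scored on a held-out validation split; the model, the split and the toy rows are all hypothetical choices made to keep the sketch dependency-free, and any estimator with a score could take their place.

```python
# Hypothetical toy data: columns 0 and 1 are informative, column 2 is noise.
X_train = [[1, 1, 5], [2, 2, 1], [3, 1, 3], [6, 5, 2], [7, 6, 5], [8, 5, 3]]
y_train = [0, 0, 0, 1, 1, 1]
X_val = [[2, 1, 4], [5, 1, 2], [7, 6, 1], [8, 3, 5]]
y_val = [0, 0, 1, 1]


def centroids(rows, labels, columns):
    """Mean of the chosen columns for each class label."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append([row[c] for c in columns])
    return {
        label: [sum(col) / len(col) for col in zip(*points)]
        for label, points in grouped.items()
    }


def validation_accuracy(columns):
    """Fit a nearest-centroid classifier on `columns` and score it on the validation rows."""
    cents = centroids(X_train, y_train, columns)
    correct = 0
    for row, label in zip(X_val, y_val):
        point = [row[c] for c in columns]
        predicted = min(
            cents,
            key=lambda lab: sum((p - q) ** 2 for p, q in zip(point, cents[lab])),
        )
        correct += predicted == label
    return correct / len(X_val)


def forward_selection(n_features, score):
    """Greedily add the feature that most improves `score`; stop when nothing improves it."""
    selected, best = [], float("-inf")
    remaining = list(range(n_features))
    while remaining:
        candidate = max(remaining, key=lambda c: score(selected + [c]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # stopping criterion: no improvement from any remaining feature
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


selected, best_score = forward_selection(3, validation_accuracy)
print(selected, best_score)
```

Each round retrains the model once per remaining feature, which is where the computational cost of wrapper methods comes from. Backward elimination follows the same loop in reverse, starting from all features and removing one per round.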

### __3. Embedded Methods__

<div align="justify">

Embedded methods perform feature selection during the model training process. They combine the benefits of both filter and wrapper methods. Because feature selection is integrated into training, the model itself determines which features are most relevant as it is being fit.

<div align="center">

![](../../images/embedded-methods-implementation.png)

_Embedded Methods Implementation_

</div>

__Advantages__

- More efficient than wrapper methods because the feature selection process is embedded within model training.
- Often more scalable than wrapper methods.

__Limitations:__ They work with a specific learning algorithm, so the selected features might not work well with other models.

Some techniques used are:

- __L1 Regularization (Lasso):__ A regression method that applies L1 regularization to encourage sparsity in the model. The penalty shrinks the coefficients of less useful features to exactly zero, so features with non-zero coefficients are considered important.
- __Decision Trees and Random Forests:__ These algorithms naturally perform feature selection by selecting the most important features for splitting nodes based on criteria like Gini impurity or information gain.
- __Gradient Boosting:__ Like random forests, gradient boosting models select important features while building trees by prioritizing the features that reduce error the most.

</div>
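
To show how L1 regularization performs selection during training, here is a minimal Lasso fitted by coordinate descent in plain Python. It is a sketch under simplifying assumptions: no intercept, a hand-picked penalty `alpha = 0.1`, and a hypothetical toy design matrix whose columns are already centered. A production workflow would normally rely on a library implementation instead.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`; values inside the band become exactly 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that leaves column j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z else 0.0
    return w


# Hypothetical toy data: y = 3 * x0 + 0.05 * x1, and x2 is unrelated to y.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [3.05, -2.95, 2.95, -3.05]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, coef in enumerate(weights) if coef != 0.0]
print(weights, selected)
```

The strong feature keeps a slightly shrunken coefficient, while the weak and unrelated features are driven to exactly zero, so the set of non-zero coefficients is the selected feature set. A larger `alpha` removes more features; the value used here is only an illustrative choice.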

### __Choosing the Right Feature Selection Method__

<div align="justify">

The choice of a feature selection method depends on several factors:

- __Dataset Size:__ Filter methods are often preferred for very large datasets due to their speed.
- __Feature Interactions:__ Wrapper and embedded methods are better for capturing complex feature interactions.
- __Model Type:__ Embedded methods are tied to a model family. For example, Lasso applies to linear models, while split-based feature importance applies to tree-based models.

Applied with these factors in mind, feature selection can improve the performance of a model and reduce its computational cost.

</div>