### What is Feature Selection?

- Feature selection is the process of reducing the number of input variables when developing a predictive model.

- It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

- Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

- As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.

- We will discover how to choose statistical measures for filter-based feature selection with numerical and categorical data.

- When implementing feature selection with pandas, numerical and categorical features must be treated differently. We discuss numerical feature selection first, so before applying the methods below, make sure the DataFrame contains only numeric features. The methods are also presented for a regression problem, which means both the input and output variables are continuous.

### Approaches for Feature Selection

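Feature selection methods generally fall into three families (filter, wrapper, and embedded). This section also covers two related techniques, univariate selection and feature importance: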

 **1. Filter Method**
 
 **2. Wrapper Method**
 
 **3. Embedded Method**
 
 **4. Univariate Selection**
 
 **5. Feature Importance**

- **Filter methods** use statistical calculation to evaluate the relevance of the predictors outside of the predictive models and keep only the predictors that pass some criterion.  Considerations when choosing filter methods are the types of data involved, both in predictors and outcome — either numerical or categorical.

- **Wrapper methods** evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.  Generally three directions of procedures are possible — forward (starts with 1 predictor and adds more iteratively), backward (starts with all predictors and eliminates one-by-one iteratively), and step-wise (bi-directional).

- **Embedded methods** are models where the feature selection procedure occurs naturally in the course of the model fitting process. Put simply, the feature selection algorithm is integrated into the machine learning algorithm. The most typical embedded techniques are tree-based algorithms, such as decision trees and random forests, where features are chosen at each split node based on a criterion such as information gain. Other examples of embedded methods are LASSO with the L1 penalty and Ridge with the L2 penalty for constructing a linear model.

![84353IMAGE1.png](attachment:84353IMAGE1.png)

## Filter Method

- As the name suggests, in this method you filter the features and keep only the relevant subset. The model is built after the features have been selected. Here the filtering is done with a correlation matrix, most commonly using the Pearson correlation.

**The correlation coefficient takes values between -1 and 1**
- A value closer to 0 implies weaker correlation (exactly 0 implying no linear correlation)
- A value closer to 1 implies stronger positive correlation
- A value closer to -1 implies stronger negative correlation

![FS.png](attachment:FS.png)
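The correlation-based filter described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's dataset: the column names `x0` to `x4`, the `target` column, and the 0.3 cut-off are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic regression data (assumption): the target depends on x0 and x1 only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"x{i}" for i in range(5)])
df["target"] = 3 * df["x0"] - 2 * df["x1"] + rng.normal(size=500)

# Pearson correlation of every feature with the target.
target_corr = df.corr(method="pearson")["target"].drop("target")

# Keep features whose absolute correlation exceeds an illustrative threshold.
threshold = 0.3
selected_features = target_corr[target_corr.abs() > threshold].index.tolist()

print(target_corr.round(2))
print("Selected:", selected_features)
```

The absolute value is used so that strongly negative correlations count as relevant too. The threshold is a judgment call and should be tuned for the data at hand.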

## Wrapper Method

- A wrapper method needs a machine learning algorithm and uses its performance as the evaluation criterion. You feed the features to the selected algorithm and add or remove features based on the model's performance. This is an iterative and computationally expensive process, but it is usually more accurate than the filter method.
- There are different wrapper methods such as **Backward Elimination**, **Forward Selection**, **Bidirectional Elimination** and **RFE**. We will discuss **Backward Elimination** and **RFE** here.

![2.1.2.4.1.2-Wrapper-Methods.png](attachment:2.1.2.4.1.2-Wrapper-Methods.png)

### i. Backward Elimination

- As the name suggests, we first feed all the possible features to the model. We check the performance of the model and then iteratively remove the worst-performing features one by one until the overall performance of the model falls within an acceptable range.


- The metric used here to evaluate each feature is its p-value. If the p-value is above 0.05 we remove the feature; otherwise we keep it.


- We will first run a single iteration to get an idea of the concept, and then run the same code in a loop, which gives the final set of features. Here we use the OLS model, which stands for **"Ordinary Least Squares"** and is used for performing linear regression.

### ii. RFE (Recursive Feature Elimination)

- The Recursive Feature Elimination (RFE) method works by recursively removing attributes and refitting the model on those that remain. At each step it ranks the features by importance, using the fitted model's coefficients or feature importances, and drops the least important ones. RFE takes the model to be used and the number of required features as input. It then returns a ranking of all the variables, with 1 being the most important, and a support mask, where True marks a selected feature and False a discarded one.
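As a small sketch of the ranking and support outputs, the example below runs scikit-learn's `RFE` with a `LinearRegression` estimator. The synthetic data and the choice of keeping two features are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data (assumption): only columns 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

# Keep the 2 most important features, removing one feature per step.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)

print("support:", rfe.support_)   # True for selected features
print("ranking:", rfe.ranking_)   # 1 for selected features, larger = dropped earlier
```

Because a linear model's coefficients depend on feature scale, features on very different scales would normally be standardized before running RFE this way.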

## Embedded Method

- Embedded methods perform feature selection during model training itself: at each iteration of the training process, they retain the features that contribute most to the fit. Regularization methods are the most commonly used embedded methods. They penalize the coefficients, so that features whose coefficients shrink to (or near) zero can be discarded.



![2.1.2.4.1.3-Embedded-Methods-1.png](attachment:2.1.2.4.1.3-Embedded-Methods-1.png)

### We can use two types of Embedded Methods


**i. Lasso Regression (uses L1 regularization)**

**ii. Ridge Regression (uses L2 regularization)**


## Ridge Regression

- Ridge Regression creates a regularized model in which the coefficients are penalised for being too large, which keeps the model from becoming too complex and overfitting. In simple terms, it adds a penalty $\alpha \lVert w \rVert_2^2$ to the loss function, where $w$ is the vector of model coefficients, $\lVert \cdot \rVert_2$ is the L2 norm, and $\alpha$ is a tunable parameter. The full objective then looks like this:

![image007-4.png](attachment:image007-4.png)
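The shrinking effect of the L2 penalty can be observed with scikit-learn's `Ridge`. This is a minimal sketch on synthetic data, and the three `alpha` values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption): only columns 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

# Fit Ridge for increasing penalty strengths and record the coefficient norm.
coef_norms = {}
for alpha in (0.1, 10.0, 1000.0):
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms[alpha] = np.linalg.norm(model.coef_)
    print(f"alpha={alpha}: {model.coef_.round(3)}")
```

The coefficients shrink toward zero as `alpha` grows but do not become exactly zero, so Ridge on its own ranks features rather than removing them.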

## Lasso Regression

- Lasso works like Ridge but performs L1 regularization: it adds the penalty $\alpha \sum_{i=1}^{n} |w_i|$ to the loss function, so the penalty is proportional to the absolute value of the coefficients rather than to their square (as in L2). This drives the coefficients of weak features to exactly zero. In effect, Lasso performs automatic feature selection, since features with a coefficient of 0 are dropped. The value of $\alpha$ matters here as well: if it is too large, Lasso drops too many variables (coefficients that are only moderately small also become 0), and the model underfits.

![image013-2.png](attachment:image013-2.png)
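A short sketch of this automatic selection, using scikit-learn's `Lasso` on synthetic data. The value `alpha=0.5` is an assumption picked so that the effect is clearly visible, not a recommended setting.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression data (assumption): only columns 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

# A fairly strong L1 penalty zeroes out the coefficients of the noise features.
lasso = Lasso(alpha=0.5).fit(X, y)

# Indices of the features whose coefficients survived the penalty.
kept = np.flatnonzero(lasso.coef_)

print("coefficients:", lasso.coef_.round(3))
print("kept feature indices:", kept)
```

In practice `alpha` is usually chosen by cross-validation, for example with `LassoCV`, rather than fixed by hand.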

## Univariate Selection

- Statistical tests can be used to select those features that have the strongest relationship with the output variable.

- The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

- For example, the chi-squared (chi²) statistical test, which requires non-negative features, can be used to select the best features from the Tips dataset.
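The `SelectKBest` workflow can be sketched as below. As an assumption for this illustration, the example uses the Iris dataset bundled with scikit-learn instead of the Tips dataset, because it loads without extra dependencies and its features are all non-negative. The choice `k=2` is also arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris is a stand-in dataset (assumption): 4 non-negative features, 3 classes.
iris = load_iris()
X, y = iris.data, iris.target

# Score every feature with the chi-squared test and keep the 2 best.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

# Map the boolean support mask back to feature names.
selected_names = [
    name for name, keep in zip(iris.feature_names, selector.get_support()) if keep
]

print("chi2 scores:", selector.scores_.round(2))
print("selected:", selected_names)
```

The chi-squared test applies to classification targets. For a regression target, a score function such as `f_regression` would be passed to `SelectKBest` instead.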

## Feature Importance

- You can get the importance of each feature of your dataset from the `feature_importances_` attribute of a fitted model.
- Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.
- Feature importance is built into tree-based classifiers. We will use the Extra Trees classifier (`ExtraTreesClassifier`) to extract the most important features of the dataset.

![1_A_qcphMdN5uWILh8LXYhAw.png](attachment:1_A_qcphMdN5uWILh8LXYhAw.png)
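A minimal sketch of reading feature importances from an Extra Trees classifier is shown below. The Iris dataset, the number of trees, and the random seed are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

# Iris is a stand-in dataset (assumption) bundled with scikit-learn.
iris = load_iris()

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

# One score per feature, available only after the model has been fitted.
importances = model.feature_importances_

# Print the features from most to least important.
for name, score in sorted(
    zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True
):
    print(f"{name}: {score:.3f}")
```

The scores are normalized to sum to 1, so they are best read as relative importances within one model rather than as absolute measures.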