# <p style='text-align: center;'> Feature Selection </p>

### Feature Selection:
- Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset, in order to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the input dataset.


<b> Three methods are used for feature selection: </b>

   1. Filter Methods
   2. Wrapper Methods
   3. Embedded Methods

### 1. Filter Methods:

- In this method, the dataset is filtered, and a subset that contains only the relevant features is taken.


- Unimportant features are identified using common sense (domain knowledge) or statistical measures.


- Filter methods are applied before modelling only, so they do not depend on any particular model.


<b> Some common techniques of filter methods are: </b>
    
   - Missing Value Ratio
   - Low Variance Filter
   - High Correlation Filter
   - Chi-Square Test
   - ANOVA
   - Information Gain, etc.

    
We will go through them one by one, as follows:

<b> Missing Value Ratio: </b>
    
- This technique sets a threshold level for missing values. If the share of missing values in a variable exceeds the threshold, the variable is dropped.


- Suppose you’re given a dataset. What would be your first step? You would naturally want to explore the data before building a model. While exploring the data, you find that your dataset has some missing values. Now what? You would try to find out the reason for these missing values and then either impute them or drop the affected variables entirely (using appropriate methods). What if a variable has too many missing values (say, more than 50%)? Should we impute the missing values or drop the variable? I would prefer to drop the variable, since it will not carry much information. However, this isn’t set in stone: we can set a threshold value, and if the percentage of missing values in any variable is more than that threshold, we drop the variable.
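
The thresholding described above can be sketched with pandas as follows. This is only a minimal illustration: the toy DataFrame, its column names, and the 50% threshold are assumptions made for the example, not values taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is 25% missing, "c" is 75% missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

threshold = 0.5  # assumed cut-off; tune it for your own data

# Fraction of missing values in each column.
missing_ratio = df.isnull().mean()

# Keep only the columns whose missing ratio does not exceed the threshold.
kept_columns = missing_ratio[missing_ratio <= threshold].index.tolist()
df_reduced = df[kept_columns]

print(missing_ratio)
print(kept_columns)
```

Here column `c` is dropped because three of its four values are missing, while `b` stays and could be imputed afterwards.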
    
    
<b> Low Variance Filter: </b>
    
- Like the Missing Value Ratio technique, the Low Variance Filter works with a threshold, but here the quantity being tested is the variance of each data column. The method calculates the variance of each variable, and all data columns with variances falling below the threshold are dropped, since a feature that barely changes carries little information about the target variable. Variance depends on the scale of the data, so the variables should be on a comparable scale before their variances are compared.


- Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable has zero variance. So, we calculate the variance of each variable we are given and then drop the variables whose variance is low compared to the other variables in our dataset. The reason for doing this, as mentioned above, is that variables with a very low variance tell us almost nothing about the target variable.
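
A small sketch of the low variance filter with pandas is given below. The data, the column names, and the threshold of 0.2 are assumptions for illustration, and the columns are assumed to be on a comparable scale already.

```python
import pandas as pd

# Hypothetical toy data on a comparable scale.
df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1],          # zero variance
    "almost_constant": [0, 0, 0, 0, 1],   # variance 0.16
    "varied": [1, 5, 2, 8, 4],            # variance 6.0
})

threshold = 0.2  # assumed cut-off

# Population variance (ddof=0) of every column.
variances = df.var(ddof=0)

# Columns whose variance falls below the threshold are dropped.
low_variance = variances[variances < threshold].index.tolist()
df_reduced = df.drop(columns=low_variance)

print(variances)
print(low_variance)
```

scikit-learn also provides `VarianceThreshold` in `sklearn.feature_selection`, which applies the same idea as a transformer.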
    
    
<b> High Correlation Filter: </b>
    
- This method applies when two variables carry nearly the same information, which can degrade the model. In this method, we identify the variables with high correlation and use the Variance Inflation Factor (VIF) to choose which one to remove. As a common rule of thumb, variables with a high VIF (VIF > 5) are candidates for removal.


- We can calculate the correlation between the independent numerical variables. If the correlation coefficient of a pair crosses a certain threshold value, we can drop one of the two variables (which variable to drop is highly subjective and should always be decided keeping the domain in mind).
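
The correlation check can be sketched like this. The synthetic data, the column names, and the 0.9 threshold are assumptions for the example; the sketch always drops the later column of a correlated pair, which in practice should be a domain decision.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)

# Hypothetical data: x2 is almost a rescaled copy of x1, x3 is unrelated.
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.01, size=200),
    "x3": rng.normal(size=200),
})

threshold = 0.9  # assumed cut-off for |correlation|

# Absolute pairwise correlations between the numerical columns.
corr = df.corr().abs()

# Keep only the upper triangle so that every pair is looked at once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair that exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)
```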
        
    
<b> Chi-Square Test: </b>
    
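- The Chi-Square test is a statistical procedure for determining whether observed data differ from the data expected under a given hypothesis. In feature selection, it can be used to test whether a categorical feature is associated with a categorical target in our data. If the two turn out to be independent, the feature is a candidate for removal.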
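
As a hedged illustration, the test can be run on a contingency table with SciPy's `chi2_contingency`. The `gender` and `purchased` columns and their counts are made up for the example, and 0.05 is an assumed significance level.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: a categorical feature and a categorical target.
df = pd.DataFrame({
    "gender": ["M"] * 50 + ["F"] * 50,
    "purchased": ["yes"] * 40 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 40,
})

# Observed counts for every (feature value, target value) combination.
table = pd.crosstab(df["gender"], df["purchased"])

# Compare the observed counts with those expected under independence.
chi2_stat, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # assumed significance level
keep_feature = p_value < alpha

print(table)
print(chi2_stat, p_value, dof)
```

A small p-value suggests that the feature and the target are not independent, so the feature is worth keeping.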
    
    
<b> ANOVA: </b>
    
- ANOVA (Analysis of Variance) checks whether the mean of the numerical response is the same across the groups of a categorical feature. It does this by comparing the variance between the groups with the variance within the groups. If the group means do not differ significantly, the feature has little or no impact on the response, and hence it (the categorical variable) can be left out of model training.
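
A one-way ANOVA can be sketched with SciPy's `f_oneway`. The `plan` feature, the `spend` response, and the 0.05 significance level are assumptions made for this example.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical data: a categorical feature and a numerical response.
df = pd.DataFrame({
    "plan": ["basic"] * 5 + ["plus"] * 5 + ["pro"] * 5,
    "spend": [20, 21, 19, 22, 20,
              25, 24, 26, 25, 27,
              30, 29, 31, 32, 30],
})

# One array of response values per category of the feature.
groups = [group["spend"].to_numpy() for _, group in df.groupby("plan")]

# Test the null hypothesis that all group means are equal.
f_stat, p_value = f_oneway(*groups)

alpha = 0.05  # assumed significance level
keep_feature = p_value < alpha

print(f_stat, p_value)
```

Because the group means here are clearly different, the p-value is small and the feature would be kept.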
    
    
<b> Information Gain: </b>
   
- Information gain can also be used for feature selection, by evaluating the gain of each variable in the context of the target variable.
    
    
- Information Gain (IG) is a popular filter technique used in feature weight scoring. It measures how much the entropy (uncertainty) of the target variable is reduced once the value of a feature is known, so features with a higher gain are more informative. However, as a basic technique, IG is still open to further research and development in feature selection.
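
To make the idea concrete, here is a minimal sketch that computes information gain by hand for categorical features. The helper names `entropy` and `information_gain` and the tiny weather-style dataset are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of a categorical series."""
    probs = labels.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(feature: pd.Series, target: pd.Series) -> float:
    """Entropy of the target minus its weighted entropy within each feature value."""
    conditional = sum(
        (len(subset) / len(target)) * entropy(subset)
        for _, subset in target.groupby(feature)
    )
    return entropy(target) - conditional

# Hypothetical data: "outlook" determines "play", "windy" tells us nothing.
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rainy", "rainy"],
    "windy": ["yes", "no", "yes", "no"],
    "play": ["no", "no", "yes", "yes"],
})

ig_outlook = information_gain(df["outlook"], df["play"])
ig_windy = information_gain(df["windy"], df["play"])

print(ig_outlook, ig_windy)
```

In this toy case `outlook` removes all uncertainty about `play` (gain of 1 bit), while `windy` removes none (gain of 0), so `outlook` would be ranked first.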

### 2. Wrapper Methods:
- The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, a subset of features is fed to the ML model and its performance is evaluated. The performance decides whether those features are added or removed in order to increase the accuracy of the model. This method is usually more accurate than the filter method, but it is more complex and computationally expensive.


- Wrapper methods are applied during (while) modelling only.

<b> Some common techniques of Wrapper Methods are: </b>
    
   - Backward Feature Elimination
   - Forward Feature Selection
   - Bi-directional Elimination
    
    
<b> Backward Feature Elimination: </b>
    
- This five-step technique finds the optimal number of features for a machine learning algorithm, given a measure of model performance and a maximum tolerable error rate.


Follow the steps below to understand and use the ‘Backward Feature Elimination’ technique:

   - We first take all the n variables present in our dataset and train the model using them
   - We then calculate the performance of the model
   - Now, we compute the performance of the model after eliminating each variable (n times), i.e., we drop one variable every time and train the model on the remaining n-1 variables
   - We identify the variable whose removal has produced the smallest (or no) change in the performance of the model, and then drop that variable
   - Repeat this process until no variable can be dropped without an unacceptable loss in performance
    
    
This method can be used when building Linear Regression or Logistic Regression models.
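
The steps above can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the synthetic data (only columns 0 and 1 influence `y`), the `cv_score` helper, the use of cross-validated R² as the performance measure, and the tolerance of 0.01 are all choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 5 features, only the first two drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

def cv_score(columns):
    """Mean 5-fold cross-validated R^2 using only the given columns."""
    model = LinearRegression()
    return cross_val_score(model, X[:, columns], y, cv=5, scoring="r2").mean()

tolerance = 0.01                     # assumed maximum tolerable drop in R^2
selected = list(range(X.shape[1]))   # start with all n variables
current = cv_score(selected)

while len(selected) > 1:
    # Performance after eliminating each remaining variable in turn.
    candidates = [
        (cv_score([c for c in selected if c != col]), col) for col in selected
    ]
    best_score, drop_col = max(candidates)
    # Stop when even the least harmful removal costs too much.
    if current - best_score > tolerance:
        break
    selected.remove(drop_col)
    current = best_score

print(sorted(selected), round(current, 3))
```

On this synthetic data the loop removes the three noise columns and stops once only the two informative columns are left.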
    
    
<b> Forward Feature Selection: </b>
    
This is the opposite process of the Backward Feature Elimination we saw above. Instead of eliminating features, we try to find the best features which improve the performance of the model. This technique works as follows:

- We start with a single feature. Essentially, we train the model n times, using each of the n features separately
- The variable giving the best performance is selected as the starting variable
- Then we repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained
- We repeat this process until no significant improvement is seen in the model’s performance
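
A minimal sketch of these steps is shown below. The synthetic data, the `cv_score` helper, the cross-validated R² metric, and the minimum gain of 0.01 that counts as a "significant improvement" are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 5 features, only the first two drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

def cv_score(columns):
    """Mean 5-fold cross-validated R^2 using only the given columns."""
    model = LinearRegression()
    return cross_val_score(model, X[:, columns], y, cv=5, scoring="r2").mean()

min_gain = 0.01      # assumed threshold for a significant improvement
selected = []        # start with no features
remaining = list(range(X.shape[1]))
current = -np.inf

while remaining:
    # Try adding each remaining feature to the current selection.
    candidates = [(cv_score(selected + [col]), col) for col in remaining]
    best_score, best_col = max(candidates)
    # Stop when the best candidate no longer improves the score enough.
    if best_score - current < min_gain:
        break
    selected.append(best_col)
    remaining.remove(best_col)
    current = best_score

print(selected, round(current, 3))
```

scikit-learn also offers `SequentialFeatureSelector` in `sklearn.feature_selection`, whose `direction` parameter accepts `"forward"` or `"backward"`.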
    
    
**NOTE:** Both Backward Feature Elimination and Forward Feature Selection are time-consuming and computationally expensive. In practice, they are only used on datasets that have a small number of input variables.
    
    
<b> Bi-directional Elimination: </b>
    
- This method uses the forward selection and backward elimination techniques together to reach one unique solution.


- Bi-directional elimination is essentially a forward selection procedure, but with the possibility of deleting an already selected variable at each stage, as in backward elimination. This is useful when there are correlations between variables. It is often used as a default approach.
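
One possible way to combine the two steps is sketched below. This is not a standard library routine but a hand-written loop under stated assumptions: synthetic data, a `cv_score` helper based on cross-validated R², a score of 0.0 for the empty model, a single threshold of 0.01 for both adding and removing, and a cap on the number of iterations as a safety measure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 5 features, only the first two drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

def cv_score(columns):
    """Mean 5-fold cross-validated R^2; 0.0 is the assumed empty-model baseline."""
    if not columns:
        return 0.0
    model = LinearRegression()
    return cross_val_score(model, X[:, columns], y, cv=5, scoring="r2").mean()

threshold = 0.01   # assumed minimum gain to add / maximum loss to remove
selected = []

for _ in range(2 * X.shape[1]):      # safety cap on the number of rounds
    changed = False
    current = cv_score(selected)

    # Forward step: add the feature that improves the score the most.
    remaining = [c for c in range(X.shape[1]) if c not in selected]
    if remaining:
        best_score, best_col = max((cv_score(selected + [c]), c) for c in remaining)
        if best_score - current >= threshold:
            selected.append(best_col)
            current = best_score
            changed = True

    # Backward step: remove a selected feature if it has become redundant.
    if len(selected) > 1:
        drop_score, drop_col = max(
            (cv_score([c for c in selected if c != col]), col) for col in selected
        )
        if current - drop_score < threshold:
            selected.remove(drop_col)
            changed = True

    if not changed:
        break

print(sorted(selected))
```

With uncorrelated features, as here, the backward step never fires; it matters when a feature added early becomes redundant after correlated features join the selection.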

### 3. Embedded Methods:
- Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature.


- It is a feature selection technique in which the important features are identified while the model/algorithm is being fitted.


- Embedded methods give their result after modelling only, because the selection happens as part of model training.

<b> Some common techniques of Embedded Methods are: </b>
    
   - LASSO
   - Elastic Net
   - Ridge Regression
   - Decision Trees
   - Random Forest, etc.
    
    
<b> LASSO: </b>
    
- LASSO, short for Least Absolute Shrinkage and Selection Operator, is a regularized regression method whose main purpose is feature selection and the regularization of models. It adds an L1 penalty on the coefficients, which can shrink some of them to exactly zero.


- Identify the features whose coefficient is equal to zero (coefficient = 0); these features can be dropped.
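
A short sketch with scikit-learn's `Lasso` follows. The synthetic data and the penalty strength `alpha=0.2` are assumptions for the example; in practice the penalty is usually tuned, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 5 features, only the first two drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Standardize so that the penalty treats all features alike.
X_scaled = StandardScaler().fit_transform(X)

# alpha is the assumed strength of the L1 penalty.
model = Lasso(alpha=0.2).fit(X_scaled, y)

# Features with a zero coefficient are dropped, the rest are kept.
selected = np.flatnonzero(model.coef_).tolist()
dropped = np.flatnonzero(model.coef_ == 0).tolist()

print(model.coef_)
print(selected, dropped)
```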
    
    
<b> Elastic Net: </b>
    
- Elastic net is a penalized linear regression model that includes both the L1 and L2 penalties during training. Using the terminology from “The Elements of Statistical Learning,” a hyperparameter “alpha” is provided to assign how much weight is given to each of the L1 and L2 penalties.
    
    
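- Identify the features whose coefficient is equal to zero. Because of the L1 part of the penalty, some coefficients become exactly zero, while the L2 part shrinks the remaining coefficients toward zero without making them exactly zero.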
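
The following sketch uses scikit-learn's `ElasticNet`. Note the naming difference: in scikit-learn, `l1_ratio` sets the mix between the L1 and L2 penalties, while `alpha` is the overall penalty strength. The synthetic data and both parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data on unit scale: only the first two features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# alpha: overall penalty strength, l1_ratio: share of the L1 penalty (assumed values).
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

# Non-zero coefficients mark the features that are kept.
selected = np.flatnonzero(model.coef_).tolist()

print(model.coef_)
print(selected)
```

The coefficients of the kept features are shrunk relative to the true values of 3 and -2, while the noise features end up at exactly zero.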
    
    
<b> Ridge Regression: </b>
    
- It essentially penalizes the least squares loss by applying a ridge (L2) penalty on the regression coefficients. The ridge penalty shrinks the regression coefficient estimates toward zero, but not exactly to zero, so, unlike LASSO, Ridge Regression does not remove features by itself.
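
The shrinking behaviour can be checked with a small sketch that compares ordinary least squares with scikit-learn's `Ridge`. The synthetic data and the penalty `alpha=100.0` are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: 5 features, only the first two drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)   # assumed, deliberately strong penalty

# Ridge coefficients are smaller in size but none of them is exactly zero.
print(ols.coef_)
print(ridge.coef_)
print(np.count_nonzero(ridge.coef_))
```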
    
    
<b> Decision Trees: </b>
    
- Decision trees are a popular supervised learning algorithm that splits data into homogeneous sets based on the input variables. This approach is relatively robust to outliers, some implementations can handle missing values, and the variables chosen for the splits indicate which features are significant.
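
As a sketch of how a tree points to the significant variables, the example below fits scikit-learn's `DecisionTreeClassifier` and reads its `feature_importances_`. The synthetic data, in which the label depends only on the first two features, is an assumption for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the label depends only on features 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based importance of every feature (the values sum to 1).
importances = tree.feature_importances_

# Rank the features from most to least important.
ranking = np.argsort(importances)[::-1].tolist()

print(importances)
print(ranking)
```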
    
    
<b> Random Forest: </b>
    
- This method is like the decision tree strategy. However, in this case, we generate a large set of trees (hence "forest") against the target variable. Then we find feature subsets with the help of the usage statistics of each attribute.


- Random Forest is one of the most widely used algorithms for feature selection. We need to convert the data into numeric form by applying one-hot encoding, as Random Forest (Scikit-Learn implementation) takes only numeric inputs. ID variables should also be dropped, as these are just unique numbers and hold no significant importance for the model.


- In Random Forest, we first need to fit the model. After fitting the model, the `feature_importances_` attribute gives the importance of each feature.
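
These steps can be sketched as follows. The DataFrame, its column names (`id`, `income`, `age`, `city`, `target`), and the rule that generates the target are all hypothetical and serve only to illustrate the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: the target depends only on "income".
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "id": np.arange(n),
    "income": rng.normal(50, 10, size=n),
    "age": rng.integers(18, 70, size=n),
    "city": rng.choice(["A", "B", "C"], size=n),
})
df["target"] = (df["income"] > 50).astype(int)

# Drop the ID column and one-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns=["id", "target"]), columns=["city"])
y = df["target"]

# Fit the forest, then read the importance of each feature.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Features at the top of the sorted list are the candidates to keep; where to cut the list is a judgment call that depends on the problem.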