# Feature Selection and Data Visualization

Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.

Irrelevant features in your data can decrease the accuracy of your models, because the model ends up learning from features that carry no real signal.

<img src='files/img/featureselection.jpg'>

## Introduction

One of the best ways I have found to learn machine learning is to benchmark myself against the best data scientists in competitions. It gives you a lot of insight into how you perform against the best on a level playing field.

Initially, I believed that machine learning was all about algorithms: know which one to apply when, and you will come out on top. When I got there, I realized that was not the case. The winners were using the same algorithms as a lot of other people.

Next, I thought surely these people would have better machines. I discovered that was not the case either. I saw competitions being won on a MacBook Air, which is hardly the most powerful machine. Over time, I realized that two things distinguish winners from others in most cases: __Feature Creation__ and __Feature Selection__.

In other words, it boils down to creating variables which capture hidden business insights and then making the right choices about which variables to use in your predictive models! Sadly or thankfully, both these skills require a ton of practice. There is also some art involved in creating new features: some people have a knack for finding trends where other people struggle.

## Importance of Feature Selection in Machine Learning

The importance of feature selection is best recognized when you are dealing with a dataset that contains a vast number of features. This type of dataset is often referred to as a _high dimensional_ dataset. High dimensionality brings several problems: it significantly increases the training time of your machine learning model, and it can make the model very complicated, which in turn may lead to overfitting.

Often, a high dimensional feature set contains several redundant features, meaning features that are nothing but extensions of other essential features. These redundant features do not contribute effectively to model training either. So, clearly, there is a need to extract the most important and most relevant features of a dataset in order to get the most effective predictive modeling performance.

_"The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."_

- [An Introduction to Variable and Feature Selection](https://www.datacamp.com/community/tutorials/feature-selection-python)

Now let's understand the difference between __dimensionality reduction__ and __feature selection__.

Sometimes, feature selection is mistaken for dimensionality reduction, but they are different. Both methods tend to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes (sometimes known as feature transformation), whereas feature selection methods include and exclude attributes present in the data without changing them.

Some examples of dimensionality reduction methods are Principal Component Analysis, Singular Value Decomposition, Linear Discriminant Analysis, etc.
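
To make the distinction concrete, here is a minimal sketch using NumPy on synthetic data. The column choice in `kept` is purely hypothetical, and PCA is computed directly from a singular value decomposition rather than through any particular library's PCA class.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 samples, 4 original features

# Feature selection: keep a subset of the original columns, unchanged.
kept = [0, 2]                            # hypothetical choice of columns
X_selected = X[:, kept]

# Dimensionality reduction (PCA via SVD): build new columns that are
# linear combinations of all the original features.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                      # shape (2, 4): one weight per original feature
X_projected = X_centered @ components.T

print(X_selected.shape, X_projected.shape)      # both have 2 columns
print(np.array_equal(X_selected, X[:, kept]))   # selected values are untouched
print(components.round(2))                      # each new axis mixes all four inputs
```

Both results have two columns, but only the selected columns can still be read as original attributes.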

Let me summarize the importance of feature selection for you:
-  It enables the machine learning algorithm to train faster.
-  It reduces the complexity of a model and makes it easier to interpret.
-  It improves the accuracy of a model if the right subset is chosen.
-  It reduces overfitting.

In the next section, you will study the different types of general feature selection methods - Filter methods, Wrapper methods, and Embedded methods.

## Filter Methods

The following image best describes filter-based feature selection methods:

<img src='files/img/filter.png'>

Filter methods rely on general characteristics of the data to evaluate and pick a feature subset, without involving any mining algorithm. They use assessment criteria such as distance, information, dependency, and consistency. Each feature is scored against one of these criteria, and variables are selected by rank ordering. Ranking is used because it is simple and tends to produce relevant features, and it filters out irrelevant features before the classification process starts.

Filter methods are generally used as a data preprocessing step. The selection of features is independent of any machine learning algorithm. Features are ranked on the basis of statistical scores which indicate each feature's correlation with the outcome variable. Correlation is a heavily contextual term, and it varies from work to work. You can refer to the following table for defining correlation coefficients for different types of data (in this case continuous and categorical).

<img src='files/img/correlation.png'>

-  __Pearson's Correlation__: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson's correlation is given as:

$$\rho_{X, Y} = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}$$

-  __LDA__: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
-  __ANOVA__: ANOVA stands for Analysis of Variance. It is similar to LDA except that it operates on one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.
-  __Chi-Square__: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
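
As a rough illustration of filter-style ranking, the sketch below scores each feature with Pearson's correlation, computed from the formula above, and keeps the two highest-ranked features. The synthetic data, the helper name `pearson`, and the choice of keeping two features are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 4))
# Synthetic target: depends on columns 0 and 1, not on columns 2 and 3.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def pearson(x, y):
    """Pearson's correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Score every feature against the target, independently of any model.
scores = np.array([pearson(X[:, j], y) for j in range(X.shape[1])])

# Rank features by absolute correlation, strongest first, and keep the top two.
ranking = np.argsort(-np.abs(scores))
top_k = ranking[:2]

print(scores.round(3))
print(top_k)
```

The same pattern applies to the other scores: scikit-learn, for instance, provides `f_classif` (ANOVA F-test) and `chi2` in `sklearn.feature_selection`, which can be plugged into `SelectKBest` instead of a hand-written score.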

One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal with multicollinearity of features as well before training models for your data.
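
One simple way to handle this, sketched below under assumptions, is to inspect the feature-to-feature correlation matrix and drop one feature from every highly correlated pair. The `0.9` threshold and the synthetic data are illustrative choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
# Column 2 is almost a copy of column 0, so the two are collinear.
X = np.column_stack([a, b, a + rng.normal(scale=0.05, size=n)])

corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-by-feature |r|
threshold = 0.9                                # assumed cut-off, tune per dataset

keep = []
for j in range(X.shape[1]):
    # Keep column j only if it is not highly correlated with a column already kept.
    if all(corr[j, k] < threshold for k in keep):
        keep.append(j)

X_reduced = X[:, keep]
print(keep)   # column 2 is dropped because it duplicates column 0
```

Pairwise correlation only catches redundancy between two features at a time; a feature that is a combination of several others would need a different check.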

## Wrapper Methods

As with filter methods, here is an infographic that will help you understand wrapper methods better:

<img src='files/img/wrapper.png'>

In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.

Some typical examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.

-  __Forward Selection__: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature which best improves our model, until adding a new variable no longer improves the performance of the model.
-  __Backward Elimination__: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
-  __Combination of forward selection and backward elimination__: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
-  __Recursive Feature Elimination__: Recursive feature elimination performs a greedy search to find the best performing feature subset. It iteratively creates models and sets aside the best or the worst performing feature at each iteration. It constructs the subsequent models with the remaining features until all the features are explored. It then ranks the features based on the order of their elimination. Because the search is greedy, RFE removing one feature per iteration fits on the order of $N$ models for a dataset with $N$ features, rather than evaluating all $2^N$ possible feature subsets as an exhaustive search would.
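
The methods above can be tried out with scikit-learn's `SequentialFeatureSelector` and `RFE`. This is a minimal sketch on synthetic data: the linear model, the 5-fold cross-validation, and the fixed target of two features are assumptions for the example. In particular, stopping at a fixed number of features is a simplification of the "stop when performance no longer improves" rule described above.

```python
import numpy as np
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
# Synthetic target: only columns 0 and 1 matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

model = LinearRegression()

# Forward selection: start empty and add the feature that most improves
# the cross-validated score, until two features are selected.
forward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start with all features and drop the one whose
# removal hurts the cross-validated score least, until two remain.
backward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

# Recursive feature elimination: refit the model, drop the feature with
# the smallest coefficient magnitude, and repeat.
rfe = RFE(model, n_features_to_select=2, step=1).fit(X, y)

print(np.flatnonzero(forward.get_support()))
print(np.flatnonzero(backward.get_support()))
print(np.flatnonzero(rfe.support_), rfe.ranking_)
```

In `rfe.ranking_`, the selected features have rank 1 and larger ranks mark features that were eliminated earlier.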

One of the best ways of implementing feature selection with wrapper methods is to use the Boruta package, which finds the importance of a feature by creating shadow features.

It works in the following steps:
1. Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
2. Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where higher means more important.
3. At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z-score than the maximum Z-score of its shadow features) and constantly removes features which are deemed highly unimportant.
4. Finally, the algorithm stops either when all features get confirmed or rejected or it reaches a specified limit of random forest runs.
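
The shadow-feature idea in steps 1 to 3 can be sketched as a single round with scikit-learn. This is not the Boruta package itself: it runs one iteration only, and it assumes the random forest's impurity-based `feature_importances_` as a stand-in for the Z-scored mean decrease in accuracy. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Synthetic labels: only columns 0 and 1 carry signal.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 1: shadow features are copies of each column with the rows shuffled
# independently, which destroys any relationship with y.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_shadow])

# Step 2: fit a random forest on the extended data and read its importances.
# Assumption: impurity-based importances replace Boruta's Z-scores here.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_extended, y)
importances = forest.feature_importances_
real, shadow = importances[: X.shape[1]], importances[X.shape[1]:]

# Step 3: a real feature scores a "hit" when it beats the best shadow feature.
hits = real > shadow.max()
print(real.round(3))
print(hits)
```

The full algorithm repeats this round many times with freshly shuffled shadows and uses the accumulated hits to confirm or reject each feature, as in step 4.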

For more information on the implementation of the Boruta package, you can refer to this [article](https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/).

For the implementation of Boruta in Python, you can refer to this [article](http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/).