<h2 id="Contents">Contents<a href="#Contents"></a></h2>
        <ol>
        <li><a class="" href="#Feature-Selection">Feature Selection</a></li>
<ol><li><a class="" href="#Removing-features-with-low-variance">Removing features with low variance</a></li>
<li><a class="" href="#Univariate-feature-selection">Univariate feature selection</a></li>
<li><a class="" href="#Recursive-feature-elimination">Recursive feature elimination</a></li>
<li><a class="" href="#Feature-selection-using-SelectFromModel">Feature selection using SelectFromModel</a></li>
<ol><li><a class="" href="#Example-1:-SVC-Based">Example 1: SVC Based</a></li>
<li><a class="" href="#Example-2:-Tree-Based">Example 2: Tree Based</a></li>
</ol><li><a class="" href="#Sequential-Feature-Selection">Sequential Feature Selection</a></li>
<ol><li><a class="" href="#Forward-SFS">Forward SFS</a></li>
<li><a class="" href="#Backward-SFS">Backward SFS</a></li>
</ol><li><a class="" href="#Feature-selection-as-part-of-a-pipeline">Feature selection as part of a pipeline</a></li>
</ol><li><a class="" href="#Classes">Classes</a></li>
<ol><li><a class="" href="#VarianceThreshold">VarianceThreshold</a></li>
<ol><li><a class="" href="#Parameters-of-VarianceThreshold">Parameters of VarianceThreshold</a></li>
<li><a class="" href="#Attributes-of-VarianceThreshold">Attributes of VarianceThreshold</a></li>
</ol><li><a class="" href="#SelectKBest">SelectKBest</a></li>
<ol><li><a class="" href="#Parameters-of-SelectKBest">Parameters of SelectKBest</a></li>
<li><a class="" href="#Attributes-of-SelectKBest">Attributes of SelectKBest</a></li>
</ol><li><a class="" href="#SelectPercentile">SelectPercentile</a></li>
<ol><li><a class="" href="#Parameters-of-SelectPercentile">Parameters of SelectPercentile</a></li>
<li><a class="" href="#Attributes-of-SelectPercentile">Attributes of SelectPercentile</a></li>
</ol><li><a class="" href="#RFE">RFE</a></li>
<ol><li><a class="" href="#Parameters-of-RFE">Parameters of RFE</a></li>
<li><a class="" href="#Attributes-of-RFE">Attributes of RFE</a></li>
</ol><li><a class="" href="#SelectFromModel">SelectFromModel</a></li>
<ol><li><a class="" href="#Parameters-of-SelectFromModel">Parameters of SelectFromModel</a></li>
<li><a class="" href="#Attributes-of-SelectFromModel">Attributes of SelectFromModel</a></li>
</ol><li><a class="" href="#SequentialFeatureSelector">SequentialFeatureSelector</a></li>
<ol><li><a class="" href="#Parameters-of-SequentialFeatureSelector">Parameters of SequentialFeatureSelector</a></li>
<li><a class="" href="#Attributes-of-SequentialFeatureSelector">Attributes of SequentialFeatureSelector</a></li>
</ol></ol></ol>

# Feature Selection

Scikit-learn provides a number of feature selection tools in the `sklearn.feature_selection` module.

## Removing features with low variance

`VarianceThreshold` is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by
$$\mathrm{Var}[X] = p(1 - p)$$

So, we can set a threshold of `0.8*(1-0.8)`:

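```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# Drop boolean features that take the same value in more than 80% of the samples.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
```

```
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
```

`VarianceThreshold` has removed the first column, which contains a zero in 5 of the 6 samples (p = 5/6 > 0.8).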
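
To make the link between the Bernoulli variance formula and the selector concrete, the following sketch recomputes p(1 - p) by hand for the same toy matrix and compares it with the fitted `variances_` attribute. The variable names `p` and `bernoulli_var` are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)  # 0.16

# For a 0/1 column, p is the fraction of ones and the variance is p * (1 - p).
p = X.mean(axis=0)
bernoulli_var = p * (1 - p)

sel = VarianceThreshold(threshold=threshold).fit(X)

print(bernoulli_var)      # hand-computed p * (1 - p) per feature
print(sel.variances_)     # variances stored by the selector
print(sel.get_support())  # True for the features that are kept
```

The first feature has variance 5/36 (about 0.139), which is below 0.16, so it is the only one dropped.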

## Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the `transform` method:

* **`SelectKBest`** removes all but the *k* highest scoring features

* **`SelectPercentile`** removes all but a user-specified highest scoring percentage of features

* **`SelectFpr`**, **`SelectFdr`** and **`SelectFwe`** apply common univariate statistical tests to each feature, selecting on the false positive rate, the false discovery rate, or the family-wise error, respectively.

* **`GenericUnivariateSelect`** performs univariate feature selection with a configurable strategy. This makes it possible to choose the best univariate selection strategy with a hyper-parameter search estimator.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
print(X.shape)

# Keep the two features with the highest chi-squared scores.
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
```

```
(150, 4)
(150, 2)
```

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for `SelectKBest` and `SelectPercentile`):

* For regression: *f_regression*, *mutual_info_regression*

* For classification: *chi2*, *f_classif*, *mutual_info_classif*
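
As a rough illustration of how these selectors relate, the sketch below fits `SelectKBest`, `SelectPercentile` and `GenericUnivariateSelect` on the iris data with the same `chi2` scoring function. The choice of `k=2`, `percentile=50` and `mode="k_best"` is an assumption made for the example, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectKBest,
    SelectPercentile,
    chi2,
)

X, y = load_iris(return_X_y=True)

# Keep the 2 best features according to the chi-squared test.
kbest = SelectKBest(chi2, k=2).fit(X, y)
print(kbest.scores_)   # one chi2 statistic per feature
print(kbest.pvalues_)  # the matching p-values

# Keep the top 50% of the features (2 of the 4 iris features).
top_half = SelectPercentile(chi2, percentile=50).fit(X, y)

# The same "k best" strategy expressed through the generic selector.
generic = GenericUnivariateSelect(chi2, mode="k_best", param=2).fit(X, y)

print(kbest.get_support())
print(top_half.get_support())
print(generic.get_support())
```

Because `GenericUnivariateSelect` exposes the strategy as the `mode` and `param` parameters, both can be tuned by a hyper-parameter search.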


## Recursive feature elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (`RFE`) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a specific attribute (such as `coef_` or `feature_importances_`) or through a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is reached.

`RFECV` performs `RFE` in a cross-validation loop to find the optimal number of features.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
print(X.shape)

# Remove one feature per iteration until two features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # selected features have rank 1
X_new = rfe.transform(X)
X_new.shape
```

```
(150, 4)
[False False  True  True]
[3 2 1 1]
(150, 2)
```
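
When the number of features to keep is not known in advance, `RFECV` can choose it by cross-validation. The following is a minimal sketch on the same iris data; the 5-fold split and the accuracy scoring are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Run recursive elimination inside a cross-validation loop and keep the
# number of features that gives the best mean validation score.
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=5,
    scoring="accuracy",
    min_features_to_select=1,
)
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # mask of the selected features
print(rfecv.ranking_)     # selected features have rank 1
```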

## Feature selection using SelectFromModel

`SelectFromModel` is a meta-transformer that can be used alongside any estimator that assigns importance to each feature through a specific attribute (such as `coef_` or `feature_importances_`) or via an `importance_getter` callable after fitting. Features are considered unimportant and removed if their importance values fall below the provided `threshold` parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these, such as “0.1*mean”. In combination with the threshold criterion, the `max_features` parameter can be used to limit the number of features selected.

### Example 1: SVC Based

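```python
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
print(X.shape)

# The L1 penalty drives some coefficients to zero; dual=False is required with penalty="l1".
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
# prefit=True reuses the already fitted estimator instead of refitting it.
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
X_new.shape
```

```
(150, 4)
(150, 3)
```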

>With SVMs and logistic-regression, the parameter C controls the sparsity: the smaller C the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.
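
The effect of `C` on sparsity can be explored with a short loop such as the one below. This is only a sketch: the grid of `C` values and the `max_iter` setting are arbitrary choices, and the exact number of selected features may differ between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

n_selected = {}
for C in (0.01, 0.1, 1.0):
    # A smaller C means stronger L1 regularization and more zero coefficients.
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    selector = SelectFromModel(lsvc, prefit=True)
    n_selected[C] = selector.transform(X).shape[1]

print(n_selected)  # maps each C to the number of features kept
```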

### Example 2: Tree Based

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
print(X.shape)

# No random_state is set, so the importances vary slightly between runs.
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
print(clf.feature_importances_)

# With a tree ensemble, the default threshold is the mean importance.
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape
```

```
(150, 4)
[0.10769421 0.07282498 0.37546013 0.44402068]
(150, 2)
```
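
The string heuristics and the `max_features` parameter described earlier can be tried on the same tree ensemble. The sketch below assumes a fixed `random_state=0` for reproducibility; the particular thresholds are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
forest = ExtraTreesClassifier(n_estimators=50, random_state=0)

# "median" keeps the features whose importance is at least the median.
by_median = SelectFromModel(forest, threshold="median").fit(X, y)
print(by_median.threshold_, by_median.transform(X).shape)

# A scaled heuristic: keep features at or above 1.25 times the mean importance.
by_scaled_mean = SelectFromModel(forest, threshold="1.25*mean").fit(X, y)
print(by_scaled_mean.threshold_, by_scaled_mean.transform(X).shape)

# Disable the threshold and keep only the single most important feature.
top_one = SelectFromModel(forest, threshold=-np.inf, max_features=1).fit(X, y)
print(top_one.transform(X).shape)
```

With four features and distinct importances, the median threshold keeps two of them, while `threshold=-np.inf` leaves `max_features` as the only criterion.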

## Sequential Feature Selection

Sequential Feature Selection (SFS) is available in the `SequentialFeatureSelector` transformer. SFS can be either forward or backward.

### Forward SFS

Forward-SFS is a greedy procedure that iteratively finds the best new feature to add to the set of selected features. Concretely, we start with zero features and find the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features is reached, as determined by the `n_features_to_select` parameter.

### Backward SFS

Backward-SFS follows the same idea but works in the opposite direction: instead of starting with no features and greedily adding features, we start with all the features and greedily remove features from the set. The `direction` parameter controls whether forward or backward SFS is used.
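
Both directions can be run with `SequentialFeatureSelector` as in the following sketch. The k-nearest-neighbors classifier and `n_neighbors=3` are arbitrary choices for illustration; any estimator with a `score` method could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start from no features and add the best one at each step.
forward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward: start from all features and remove the least useful one at each step.
backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print(forward.get_support())
print(backward.get_support())
print(forward.transform(X).shape)
```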

>In general, forward and backward selection do not yield equivalent results. Also, one may be much faster than the other depending on the requested number of selected features: if we have 10 features and ask for 7 selected features, forward selection would need to perform 7 iterations while backward selection would only need to perform 3.

SFS differs from `RFE` and `SelectFromModel` in that it does not require the underlying model to expose a `coef_` or `feature_importances_` attribute. It may however be slower considering that more models need to be evaluated, compared to the other approaches. For example in backward selection, the iteration going from *m* features to *m - 1* features using k-fold cross-validation requires fitting *m * k* models, while `RFE` would require only a single fit, and `SelectFromModel` always just does a single fit and requires no iterations.
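
The cost comparison can be made concrete with a small counting helper. `sfs_fit_counts` is a hypothetical function written for this illustration, not part of scikit-learn; it assumes that every candidate feature is scored with one k-fold cross-validation per iteration.

```python
def sfs_fit_counts(n_features, n_to_select, k_folds, direction):
    """Return (iterations, model_fits) for greedy sequential selection."""
    if direction == "forward":
        # Each iteration scores every feature that is not selected yet.
        candidates = [n_features - i for i in range(n_to_select)]
    elif direction == "backward":
        # Each iteration scores the removal of every remaining feature.
        candidates = [n_features - i for i in range(n_features - n_to_select)]
    else:
        raise ValueError("direction must be 'forward' or 'backward'")
    return len(candidates), k_folds * sum(candidates)


# 10 features, 7 to select, 5-fold cross-validation.
print(sfs_fit_counts(10, 7, 5, "forward"))   # (7, 245)
print(sfs_fit_counts(10, 7, 5, "backward"))  # (3, 135)
```

Under these assumptions, backward selection needs 3 iterations and 135 fits, against 7 iterations and 245 fits for forward selection.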

## Feature selection as part of a pipeline

Feature selection is usually used as a pre-processing step before doing the actual learning. The recommended way to do this in scikit-learn is to use a `Pipeline`:

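```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Note: on scikit-learn versions where LinearSVC defaults to dual=True,
# penalty="l1" also needs dual=False before this pipeline can be fitted.
clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1"))),
  ('classification', RandomForestClassifier())
])
clf
```

```
Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(penalty='l1'))),
                ('classification', RandomForestClassifier())])
```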
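
A pipeline like this can then be fitted and cross-validated as a single estimator, so that feature selection is learned only from each training fold. The sketch below uses the iris data; `dual=False`, `max_iter=5000` and `random_state=0` are assumptions added so that the example fits reliably and reproducibly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

clf = Pipeline([
    ("feature_selection",
     SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=5000))),
    ("classification", RandomForestClassifier(random_state=0)),
])

# The selector is refitted inside every fold, which avoids leaking
# information from the validation data into the selection step.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

clf.fit(X, y)
print(clf.named_steps["feature_selection"].get_support())
```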

# Classes

## `VarianceThreshold`

### Parameters of `VarianceThreshold`

* **threshold**: float, default=0.0
  
  Features with a training-set variance not greater than this threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.

### Attributes of `VarianceThreshold`

* **variances_**: array-like, shape (n_features,)
    
    The variance of each feature in the training set.

## `SelectKBest`

### Parameters of `SelectKBest`

* **score_func**: callable, default=f_classif
  
    Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. Default is f_classif. The default function only works with classification tasks.
* **k**: int or "all", default=10
    
    Number of top features to select. The “all” option bypasses selection, for use in a parameter search.

### Attributes of `SelectKBest`

* **scores_**: array-like, shape (n_features,)
    
    The scores associated with each feature.
* **pvalues_**: array-like, shape (n_features,)
    
    The p-values associated with each feature.

## `SelectPercentile`

### Parameters of `SelectPercentile`

* **score_func**: callable, default=f_classif
  
    Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. Default is f_classif. The default function only works with classification tasks.
* **percentile**: int, default=10
  
    Percentile of features to keep.

Available callable score functions are:
* For regression: f_regression, mutual_info_regression

* For classification: chi2, f_classif, mutual_info_classif

### Attributes of `SelectPercentile`

* **scores_**: array-like, shape (n_features,)
    
    The scores associated with each feature.
* **pvalues_**: array-like, shape (n_features,)
    
    The p-values associated with each feature.

## `RFE`

### Parameters of `RFE`

* **estimator**: estimator object.
  
    A supervised learning estimator with a `fit` method that provides information about feature importance (e.g. `coef_`, `feature_importances_`).
* **n_features_to_select**: int or float, default=None
  
    The number of features to select. If None, half of the features are selected. If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.
* **step**: int or float, default=1
  
    If greater than or equal to 1, then `step` corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then `step` corresponds to the percentage (rounded down) of features to remove at each iteration.
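
The float forms of `n_features_to_select` and `step` can be illustrated with a short sketch. The iris data and `LogisticRegression` are choices made for the example; with 4 features, a fraction of 0.5 corresponds to 2 features in both cases.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Keep half of the features, removing half of the original features per iteration.
rfe = RFE(
    LogisticRegression(max_iter=1000),
    n_features_to_select=0.5,
    step=0.5,
)
rfe.fit(X, y)

print(rfe.n_features_)  # number of selected features
print(rfe.support_)     # mask of the selected features
print(rfe.ranking_)     # 1 for selected features, 2 for those removed in the single iteration
```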

### Attributes of `RFE`

* **classes_**: array-like, shape (n_classes,)
    
    Class labels, available when `estimator` is a classifier.
* **estimator_**: estimator object.
    
    The underlying estimator.
* **n_features_**: int
  
    The number of selected features.
* **ranking_**: array-like, shape (n_features,)
    
    The feature ranking, such that `ranking_[i]` corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1.
* **support_**: array-like, shape (n_features,)
        
    The mask of selected features.

## `SelectFromModel`

### Parameters of `SelectFromModel`

* **estimator**: estimator object.
  
    The base estimator from which the transformer is built. This can be either a fitted (if `prefit` is set to `True`) or a non-fitted estimator. The estimator should have a `feature_importances_` or `coef_` attribute after fitting. Otherwise, the `importance_getter` parameter should be used.
* **threshold** : str or float, default=None

    The threshold value to use for feature selection. Features whose importance is greater than or equal to the threshold are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If `None` and the estimator has a `penalty` parameter set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
* **prefit**: bool, default=False
  
    Whether a prefit model is expected to be passed into the constructor directly or not. If `True`, `estimator` must be a fitted estimator. If `False`, `estimator` is fitted and updated by calling `fit` and `partial_fit`, respectively.
* **norm_order**: non-zero int, inf, -inf, default=1

    Order of the norm used to filter the vectors of coefficients below threshold in the case where the `coef_` attribute of the estimator is of dimension 2.
* **max_features**: int or callable, default=None

    The maximum number of features to select.

    If an integer, then it specifies the maximum number of features to allow.

    If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of `max_features(X)`.

    If `None`, then all features are kept.

    To only select based on `max_features`, set `threshold=-np.inf`.
* **importance_getter**: str or callable, default=’auto’

    If ‘auto’, uses the feature importance either through a `coef_` attribute or a `feature_importances_` attribute of the estimator.

    Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with `attrgetter`).

    If a callable, it is passed the fitted estimator and should return the importance of each feature.
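
As an illustration of `importance_getter`, the sketch below wraps a `Pipeline`, which has no `coef_` of its own, and points the selector at the coefficients of the final step. The step names `scale` and `logreg` are hypothetical names chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Attribute path, resolved on the fitted pipeline.
by_path = SelectFromModel(
    pipe, importance_getter="named_steps.logreg.coef_"
).fit(X, y)

# Equivalent callable that receives the fitted pipeline.
by_callable = SelectFromModel(
    pipe, importance_getter=lambda est: est[-1].coef_
).fit(X, y)

print(by_path.get_support())
print(by_callable.get_support())
```

Since `coef_` is two-dimensional for this multiclass problem, the per-feature importance is the norm of each coefficient column, as controlled by `norm_order`.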

### Attributes of `SelectFromModel`

* **estimator_**: estimator object.
    
    The base estimator from which the transformer is built. This attribute exists only when `fit` has been called.

    * If `prefit=True`, it is a deep copy of `estimator`.

    * If `prefit=False`, it is a clone of `estimator`, fitted on the data passed to `fit` or `partial_fit`.
* **max_features_**: int
  
    Maximum number of features calculated during fit. Only defined if `max_features` is not `None`.

    * If `max_features` is an int, then `max_features_ = max_features`.

    * If `max_features` is a callable, then `max_features_ = max_features(X)`.
* **threshold_**: float

    Threshold value used for feature selection.

## `SequentialFeatureSelector`

### Parameters of `SequentialFeatureSelector`

* **estimator**: estimator instance
    
    An unfitted estimator.

* **n_features_to_select**: “auto”, int or float, default=’warn’
  
    If "auto", the behaviour depends on the `tol` parameter:

    * if `tol` is not None, then features are selected until the score improvement does not exceed `tol`.

    * otherwise, half of the features are selected.

    If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.
* **tol**: float, default=None
  
    If the score is not incremented by at least `tol` between two consecutive feature additions or removals, stop adding or removing. `tol` is enabled only when `n_features_to_select` is "auto".
* **direction**: {‘forward’, ‘backward’}, default=’forward’
  
    If ‘forward’, features are added. If ‘backward’, features are removed.
* **scoring**:  str, callable, list/tuple or dict, default=None
  
    A single str or a callable to evaluate the predictions on the test set.

    If None, the estimator’s score method is used.
* **cv**: int, cross-validation generator or an iterable, default=None
    
    Determines the cross-validation splitting strategy. Possible inputs for `cv` are:

    * `None`, to use the default 5-fold cross validation,

    * integer, to specify the number of folds in a `(Stratified)KFold`,

    * CV splitter,

    * An iterable yielding (train, test) splits as arrays of indices.
* **n_jobs**: int, default=None

    Number of jobs to run in parallel.

### Attributes of `SequentialFeatureSelector`

* **n_features_in_**: int
  
    Number of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.
* **n_features_to_select_**: int
  
   The number of features that were selected.
* **support_**: array-like, shape (n_features,)
  
   The mask of selected features.