# Feature Selection

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.

## Variance Features

## Variance
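In statistics, the variance of a variable is the expected squared deviation of that variable from its mean.

It can equivalently be computed as the mean of the squares minus the square of the mean:

$$\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2, \quad \mu = E[X]$$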
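As a quick check of the two formulas above, here is a minimal sketch on a made-up list of numbers (the values in `x` are only for illustration). It also shows the two functions in the `statistics` module: `pvariance` divides by n, as in the formula above, while `variance` divides by n - 1.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data
mu = sum(x) / len(x)

# Definition: mean of the squared deviations from the mean
var_def = sum((v - mu) ** 2 for v in x) / len(x)

# Shortcut: mean of the squares minus the square of the mean
var_short = sum(v ** 2 for v in x) / len(x) - mu ** 2

print(var_def, var_short)        # population variance, computed two ways
print(statistics.pvariance(x))   # population variance (divides by n)
print(statistics.variance(x))    # sample variance (divides by n - 1)
```

The sample variance is always slightly larger than the population variance on the same data, which matters when comparing numbers from `statistics.variance` with a threshold meant for the population variance.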

In [1]:
# Only needed on Python 2; on Python 3.4+ statistics is part of the standard library
pip install statistics



In [2]:
import statistics 
  
sample = [2.74, 1.23, 2.63, 2.22, 3, 1.98]
  
# statistics.variance() returns the sample variance (divides by n - 1)
print("Variance of sample set is ", statistics.variance(sample))

('Variance of sample set is ', 0.40924000000000005)


In [3]:
sample2 = [1.5,1.6,1.7,1.55,1.64]

print("Variance of sample set is %s" % statistics.variance(sample2))

Variance of sample set is 0.00602


In [4]:
sample3 = [.5,15,157,1505,264]

print("Variance of sample set is %s" % statistics.variance(sample3))

Variance of sample set is 401427.7


## Removing Low-Variance Features

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
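A minimal sketch of the default behaviour, assuming a small made-up array (`X_toy` is not part of the original notebook) whose first column is constant:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data, made up for illustration: the first column never changes
X_toy = np.array([[7, 0.1, 3],
                  [7, 0.4, 1],
                  [7, 0.2, 5]])

sel_default = VarianceThreshold()   # threshold defaults to 0.0
X_reduced = sel_default.fit_transform(X_toy)

print(sel_default.get_support())    # boolean mask of the features that were kept
print(X_reduced.shape)              # one column fewer than X_toy
```

`get_support()` is handy for mapping the reduced array back to the original column positions.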


As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such a variable is given by:

$$\mathrm{Var}(X) = p(1 - p)$$

where p is the fraction of samples in which the feature is one. So we choose the threshold as 0.8 * (1 - 0.8) = 0.16.
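The Bernoulli formula can be checked by hand before using the selector. The sketch below uses the same boolean array as the next cell (named `X_bool` here to avoid clashing with it) and compares p(1 - p) with the population variance from NumPy:

```python
import numpy as np

X_bool = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                   [0, 1, 1], [0, 1, 0], [0, 1, 1]])

p = X_bool.mean(axis=0)           # fraction of ones in each column
bernoulli_var = p * (1 - p)       # Var(X) = p(1 - p)
threshold = 0.8 * (1 - 0.8)

print(bernoulli_var)
print(np.var(X_bool, axis=0))     # population variance, same values
print(bernoulli_var > threshold)  # columns whose variance exceeds the threshold
```

The first column is one in only 1 of 6 samples, so its variance falls below the threshold and it is the one expected to be dropped.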

In [8]:
from sklearn.feature_selection import VarianceThreshold
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

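sel = VarianceThreshold(threshold=(0.8*0.2))

# fit_transform() fits the selector and returns the reduced array in one call,
# so a separate sel.fit(X) beforehand is not needed
X2 = sel.fit_transform(X)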

In [9]:
X

array([[0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 1],
       [0, 1, 0],
       [0, 1, 1]])

In [10]:
X2

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

In [53]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X[:15]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2]])

In [27]:
sel = VarianceThreshold(threshold=0.3)

# fit_transform() fits the selector and returns the reduced array in one call
X2 = sel.fit_transform(X)
X2[:15]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2],
       [5.4, 1.7, 0.4],
       [4.6, 1.4, 0.3],
       [5. , 1.5, 0.2],
       [4.4, 1.4, 0.2],
       [4.9, 1.5, 0.1],
       [5.4, 1.5, 0.2],
       [4.8, 1.6, 0.2],
       [4.8, 1.4, 0.1],
       [4.3, 1.1, 0.1],
       [5.8, 1.2, 0.2]])
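To see why the second column disappeared in the output above, the fitted selector can be inspected. This is a small sketch using the `variances_` attribute and `get_support()`; the variable names are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
sel_iris = VarianceThreshold(threshold=0.3).fit(iris.data)

# Print each feature's variance next to whether it passed the threshold
for name, var, kept in zip(iris.feature_names,
                           sel_iris.variances_,
                           sel_iris.get_support()):
    print("%-20s variance=%.3f kept=%s" % (name, var, kept))
```

Sepal width is the only feature whose variance is below 0.3, which matches the three columns that remain in X2.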


## Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests, which score each feature individually against the target variable. Scikit-learn exposes feature selection routines as objects that implement the transform method:
* SelectKBest removes all but the k highest-scoring features, where the scores come from the chosen statistical test.

In [40]:
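from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2  # one of the available statistical tests
# Score function used in the cells below; since iris has class labels,
# mutual_info_classif would be the more usual choice here
from sklearn.feature_selection import mutual_info_regression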

In [36]:
X, y = load_iris(return_X_y=True)

X[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [37]:
sel = SelectKBest(mutual_info_regression, k=2)

sel.fit(X, y)

SelectKBest(k=2,
      score_func=<function mutual_info_regression at 0x7f7c85a188d0>)

In [38]:
X2 = sel.transform(X)
X2[:15]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.7, 0.4],
       [1.4, 0.3],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.5, 0.1],
       [1.5, 0.2],
       [1.6, 0.2],
       [1.4, 0.1],
       [1.1, 0.1],
       [1.2, 0.2]])

In [39]:
sel.scores_

array([0.54027185, 0.20064398, 0.98538293, 0.99355373])

I also encourage you to check out other statistical tests and read about them:

For regression: f_regression, mutual_info_regression

For classification: chi2, f_classif, mutual_info_classif
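As one example of swapping the test, here is a sketch that uses f_classif (the ANOVA F-test) on iris. The variable names are illustrative, and the choice of k=2 simply mirrors the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X_iris, y_iris = load_iris(return_X_y=True)

kbest = SelectKBest(score_func=f_classif, k=2).fit(X_iris, y_iris)

print(kbest.scores_)         # F-statistic for each feature
print(kbest.pvalues_)        # matching p-values
print(kbest.get_support())   # mask of the k selected features
```

Note that chi2 requires non-negative feature values, while f_classif and mutual_info_classif do not have that restriction.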

## Recursive feature elimination - RFE

Recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained from the fitted model (through its coef_ or feature_importances_ attribute). The least important features are then pruned, and the procedure is repeated on the pruned set until the desired number of features is reached.

1) Choose a model to train on the features.

2) Use the fitted model to assign an importance to each feature.

3) Eliminate the least important features.

4) Repeat the procedure until the desired number of features is reached.
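The four steps above can be sketched as a plain loop. This is a hand-written illustration and not scikit-learn's implementation; the target of two features, the forest size and the random seed are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)
n_features_wanted = 2                       # illustrative target
remaining = list(range(X_iris.shape[1]))    # column indices still in play

while len(remaining) > n_features_wanted:
    # Steps 1-2: train the model and read off the feature importances
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_iris[:, remaining], y_iris)
    # Step 3: drop the least important of the remaining features
    weakest = int(np.argmin(model.feature_importances_))
    del remaining[weakest]
    # Step 4: loop again until the desired number of features is left

print(remaining)   # indices of the surviving columns
```

In practice the RFE class in sklearn.feature_selection wraps this loop, with n_features_to_select and step playing the roles of the target count and the number of features dropped per round.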

In [41]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV  # RFE with the number of features chosen by cross-validation

In [61]:
X,y = load_iris(return_X_y=True)

In [62]:
m = RFECV(RandomForestClassifier(), scoring='accuracy')

In [63]:
m.fit(X, y)

RFECV(cv='warn',
   estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='accuracy', step=1,
   verbose=0)

In [67]:
X[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [68]:
X2 = m.transform(X)
X2[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [69]:
m.score(X, y)  # accuracy on the training data itself, so not a measure of generalization

1.0
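Since the transformed array above is identical to X, it helps to ask the fitted selector what it decided. The sketch below refits an RFECV with an explicit cv and a fixed random seed (both assumptions added here for reproducibility) and prints its selection attributes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X_iris, y_iris = load_iris(return_X_y=True)

rfecv = RFECV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    cv=5,                 # explicit number of cross-validation folds
    scoring="accuracy",
)
rfecv.fit(X_iris, y_iris)

print(rfecv.n_features_)   # number of features chosen by cross-validation
print(rfecv.support_)      # boolean mask of the selected features
print(rfecv.ranking_)      # rank 1 marks a selected feature
```

If n_features_ equals the number of input columns, cross-validation found no benefit in dropping any feature, and transform returns the data unchanged.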

In [70]:
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
y = np.array([0,0,0,1,0,1])

In [71]:
m.fit(X, y)

RFECV(cv='warn',
   estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='accuracy', step=1,
   verbose=0)

In [72]:
X

array([[0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 1],
       [0, 1, 0],
       [0, 1, 1]])

In [73]:
m.transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

In [74]:
m.score(X,y)

1.0

## Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. Features are considered unimportant and removed if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter.
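To make the role of the threshold concrete, here is a sketch that uses a tree-based estimator (so the selection is driven by feature_importances_) with threshold="median". The estimator settings and variable names are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_iris, y_iris = load_iris(return_X_y=True)

sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median",   # keep features at or above the median importance
)
sfm.fit(X_iris, y_iris)

importances = sfm.estimator_.feature_importances_
print(importances)          # importance of each feature in the fitted forest
print(sfm.threshold_)       # numeric threshold actually applied
print(sfm.get_support())    # mask of the features that were kept
```

The threshold can also be a number or a string such as "mean" or "1.25*mean", which makes it easy to tighten or loosen the selection without changing the estimator.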

In [75]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [80]:
X,y = load_iris(return_X_y=True)

In [81]:
m = SelectFromModel(LinearSVC(C=0.01, penalty='l1', dual=False))

m.fit(X, y)



SelectFromModel(estimator=LinearSVC(C=0.01, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
     verbose=0),
        max_features=None, norm_order=1, prefit=False, threshold=None)

In [83]:
X2 = m.transform(X)
X2[:15]

array([[5.1, 3.5, 1.4],
       [4.9, 3. , 1.4],
       [4.7, 3.2, 1.3],
       [4.6, 3.1, 1.5],
       [5. , 3.6, 1.4],
       [5.4, 3.9, 1.7],
       [4.6, 3.4, 1.4],
       [5. , 3.4, 1.5],
       [4.4, 2.9, 1.4],
       [4.9, 3.1, 1.5],
       [5.4, 3.7, 1.5],
       [4.8, 3.4, 1.6],
       [4.8, 3. , 1.4],
       [4.3, 3. , 1.1],
       [5.8, 4. , 1.2]])

In [84]:
X[:15]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2]])