The simplest form of feature selection is to remove features with very 
low variance. If a feature has a very low variance (i.e. very close to 0), it 
is close to being constant and thus adds little or no value to any model. It 
is best to get rid of such features and hence lower the complexity. Please note 
that the variance also depends on the scaling of the data. Scikit-learn has an 
implementation, VarianceThreshold, that does precisely this.

from sklearn.feature_selection import VarianceThreshold

data = ...

var_thresh = VarianceThreshold(threshold=0.1)

transformed_data = var_thresh.fit_transform(data)

#### transformed data will have all columns with variance less than 0.1 removed
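
To make the point about scaling concrete, here is a minimal sketch on synthetic data. The array `X_demo`, the thresholds, and the choice of `MinMaxScaler` are assumptions made for illustration and do not come from the original example. It shows that a small-scale but non-constant column is dropped on raw data, yet survives once all columns are brought to the same range.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
n_rows = 1000

# three hypothetical features on very different scales
X_demo = np.column_stack([
    rng.uniform(0.0, 10.0, size=n_rows),  # variance around 8.3
    rng.uniform(0.0, 0.1, size=n_rows),   # variance around 0.0008
    np.full(n_rows, 5.0),                 # constant, variance 0
])

# on the raw data, only the large-scale column passes the threshold
raw_mask = VarianceThreshold(threshold=0.1).fit(X_demo).get_support()

# after min-max scaling, both non-constant columns have a similar variance
# (roughly 1/12), so a threshold chosen on that scale keeps both of them
X_scaled = MinMaxScaler().fit_transform(X_demo)
scaled_mask = VarianceThreshold(threshold=0.05).fit(X_scaled).get_support()

print("kept on raw data:   ", raw_mask)
print("kept on scaled data:", scaled_mask)
```

In both cases the constant column is removed. What changes is whether the small-scale column is treated as "almost constant", which is why the threshold should be chosen with the scale of the data in mind.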

We can also remove features that are highly correlated with other features. 
For calculating the correlation between different numerical features, you can 
use the Pearson correlation. 

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
# fetch a regression dataset
data = fetch_california_housing()
X = data["data"]
col_names = data["feature_names"]
y = data["target"]
# convert to pandas dataframe
df = pd.DataFrame(X, columns=col_names)
# introduce a highly correlated column
df.loc[:, "MedInc_Sqrt"] = df.MedInc.apply(np.sqrt)
# get correlation matrix (pearson)
df.corr()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedInc_Sqrt |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1.0 | -0.119034 | 0.326895 | -0.06204 | 0.004834 | 0.018766 | -0.079809 | -0.015176 | 0.984329 |
| HouseAge | -0.119034 | 1.0 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 | -0.132797 |
| AveRooms | 0.326895 | -0.153277 | 1.0 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.02754 | 0.326688 |
| AveBedrms | -0.06204 | -0.077747 | 0.847621 | 1.0 | -0.066197 | -0.006181 | 0.069721 | 0.013344 | -0.06691 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.0 | 0.069863 | -0.108785 | 0.099773 | 0.018415 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.0 | 0.002366 | 0.002476 | 0.015266 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.0 | -0.924664 | -0.084303 |
| Longitude | -0.015176 | -0.108197 | -0.02754 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.0 | -0.015569 |
| MedInc_Sqrt | 0.984329 | -0.132797 | 0.326688 | -0.06691 | 0.018415 | 0.015266 | -0.084303 | -0.015569 | 1.0 |


#### We see that the feature MedInc_Sqrt has a very high correlation with MedInc. We can thus remove one of them.
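
Spotting such pairs by eye does not scale to many columns. One possible way to automate it is sketched below on a small synthetic frame. The helper name `drop_highly_correlated`, the 0.95 cut-off, and the rule of dropping the later column of each pair are all assumptions for illustration, not something prescribed above.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of every pair whose absolute Pearson
    correlation exceeds `threshold` (hypothetical helper)."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair is checked once
    # and the diagonal (always 1.0) is ignored
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# synthetic stand-in for a real dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.uniform(1, 10, 500),
    "b": rng.normal(size=500),
})
# introduce a highly correlated column, as in the example above
demo["a_sqrt"] = np.sqrt(demo["a"])

reduced, dropped = drop_highly_correlated(demo, threshold=0.95)
print("dropped:", dropped)
print("remaining:", list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call. This sketch simply keeps whichever column appears first in the frame.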

### Univariate feature selection
Univariate feature selection is nothing but a scoring of each feature against a given target. Mutual information, ANOVA F-test and chi2 are some of the most popular methods for univariate feature selection. There are two ways of using these in scikit-learn.

- SelectKBest: It keeps the top-k scoring features
- SelectPercentile: It keeps the top-scoring features within a percentage specified by the user
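
As a rough illustration of the two selectors, the sketch below fits both on synthetic regression data in which only two columns drive the target. The data, the choice of `f_regression`, and the values `k=2` and `percentile=20` are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
# the target depends only on columns 0 and 3, plus a little noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# keep the 2 best-scoring features
k_best = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# keep the top 20 percent of features (2 out of 10 here)
top_pct = SelectPercentile(f_regression, percentile=20).fit(X_demo, y_demo)

kept_by_k = np.flatnonzero(k_best.get_support())
kept_by_pct = np.flatnonzero(top_pct.get_support())
print("SelectKBest kept columns:     ", kept_by_k)
print("SelectPercentile kept columns:", kept_by_pct)
```

Note that `percentile` is expressed on a 0 to 100 scale, not as a fraction.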

It must be noted that you can use chi2 only for data which is non-negative in nature. 
This is a particularly useful feature selection technique in natural language 
processing when we have bag-of-words or tf-idf based features. It’s best to create 
a wrapper for univariate feature selection that you can use for almost any new 
problem.

In [11]:
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import SelectPercentile
class UnivariateFeatureSelction:
    def __init__(self, n_features, problem_type, scoring):
        """
        Custom univariate feature selection wrapper on
        different univariate feature selection models from
        scikit-learn.
        :param n_features: SelectPercentile if float else SelectKBest
        :param problem_type: classification or regression
        :param scoring: scoring function, string
        """
        # for a given problem type, there are only
        # a few valid scoring methods
        # you can extend this with your own custom
        # methods if you wish
        if problem_type == "classification":
            valid_scoring = {
                "f_classif": f_classif,
                "chi2": chi2,
                "mutual_info_classif": mutual_info_classif,
            }
        else:
            valid_scoring = {
                "f_regression": f_regression,
                "mutual_info_regression": mutual_info_regression,
            }
 
        # raise exception if we do not have a valid scoring method
        if scoring not in valid_scoring:
            raise Exception("Invalid scoring function")

        # if n_features is int, we use SelectKBest
        # if n_features is float, we use SelectPercentile
        # please note that sklearn expects an int in both cases, so the
        # fraction is converted to a whole-number percentile; round() avoids
        # floating point truncation (e.g. int(0.29 * 100) gives 28)
        if isinstance(n_features, int):
            self.selection = SelectKBest(valid_scoring[scoring], k=n_features)
        elif isinstance(n_features, float):
            self.selection = SelectPercentile(
                valid_scoring[scoring],
                percentile=int(round(n_features * 100)),
            )
        else:
            raise Exception("Invalid type of feature")
 
    # same fit function
    def fit(self, X, y):
        return self.selection.fit(X, y)

    # same transform function
    def transform(self, X):
        return self.selection.transform(X)

    # same fit_transform function
    def fit_transform(self, X, y):
        return self.selection.fit_transform(X, y)


In [12]:
#use the class

ufs = UnivariateFeatureSelction(n_features=0.1, problem_type="regression", scoring="f_regression")
ufs.fit(X, y)
X_transformed = ufs.transform(X)