<center><font size="7">Feature selection</font></center>

<font size="4">Copyright Christophe PERE 02/05/2017</font>

<b><i>What is the goal of this study?</i></b>

Simple: import a dataframe with many columns and determine the importance of each column, or identify the most important columns given a number passed to the feature selection method. For example, if I pass 3 to a PCA (detailed below), the algorithm computes a PCA and helps determine which 3 columns matter most for describing the data.

Having a good understanding of feature selection/ranking can be a great asset for a data scientist or machine learning practitioner. A good grasp of these methods leads to better-performing models, a better understanding of the underlying structure and characteristics of the data, and better intuition about the algorithms that underlie many machine learning models.
There are in general two reasons why feature selection is used:
1. Reducing the number of features, to reduce overfitting and improve the generalization of models.
2. To gain a better understanding of the features and their relationship to the response variables.

These two goals are often at odds with each other and thus require different approaches: depending on the data at hand a feature selection method that is good for goal (1) isn’t necessarily good for goal (2) and vice versa. What seems to happen often though is that people use their favourite method (or whatever is most conveniently accessible from their tool of choice) indiscriminately, especially methods more suitable for (1) for achieving (2).

At the same time, feature selection is not particularly thoroughly covered in machine learning or data mining textbooks, partly because it is often looked at as a natural side effect of learning algorithms that doesn’t require separate coverage.

<span style="color:blue"><font size="5">1 - Univariate feature selection</font></span>

Univariate feature selection examines each feature individually to determine the strength of the relationship of the feature with the response variable. These methods are simple to run and understand and are in general particularly good for gaining a better understanding of data (but not necessarily for optimizing the feature set for better generalization). There are a lot of different options for univariate selection.

In [1]:
%matplotlib inline 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# ---- First, import the required modules ----------------------------------------------------------------------------
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import feature_selection

In [None]:
df = pd.read_csv('myfile.csv',sep=';')                        # import a csv file 

In [3]:
# k is the number of features to keep (1 <= k <= n_features).
# y1 is the target (response) variable; it must be defined before running this cell.
# Note: chi2 requires non-negative feature values.
model = SelectKBest(score_func=chi2, k=4)                     # score each column with the chi2 test and keep the 4 best
fit = model.fit(df, y1)                                       # compute the score of each column


In [None]:
np.set_printoptions(precision=3)                              # precision of the results
print(fit.scores_)                                            # score for each column in the df (same order)
features = fit.transform(df)                                  # keep only the k selected columns
# summarize selected features
print(features[0:5,:])

The univariate method keeps the k highest-scoring features, where k is the number passed to SelectKBest.
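
As a self-contained illustration, the sketch below runs the same univariate selection on synthetic data. The dataset, the column names (`col_0` ... `col_5`) and the choice of `f_classif` instead of `chi2` are assumptions made for the example: `chi2` only accepts non-negative features, whereas the ANOVA F-test also works with negative values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the CSV file: 200 rows, 6 columns, 3 of them informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
demo_df = pd.DataFrame(X, columns=[f"col_{i}" for i in range(6)])

# Score every column independently with the ANOVA F-test and keep the 3 best.
selector = SelectKBest(score_func=f_classif, k=3).fit(demo_df, y)

# Pair each score with its column name and sort from strongest to weakest.
scores = pd.Series(selector.scores_, index=demo_df.columns).sort_values(ascending=False)

# get_support() returns a boolean mask over the original columns.
selected_columns = demo_df.columns[selector.get_support()].tolist()

print(scores)
print(selected_columns)
```

Reading the scores together with the column names is usually more informative than the raw array printed by `fit.scores_`.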

<span style="color:blue"><font size="5">2 - Recursive feature elimination (RFE)</font></span>

The Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.

It uses the weights the fitted model assigns to each attribute (coefficients or feature importances) to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.

You can learn more about the RFE class in the scikit-learn documentation.

The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

In [2]:
# ---- First, import the required modules ----------------------------------------------------------------------------
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)                  # 3 is the number of top features to keep
fit = rfe.fit(df, y1)

In [None]:
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and a weight is assigned to each of them. Then, the features whose absolute weights are the smallest are pruned from the current set of features. That procedure is repeated recursively on the pruned set until the desired number of features is reached. RFECV performs RFE in a cross-validation loop to find the optimal number of features.
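
The cross-validated variant described above can be sketched as follows. The synthetic dataset and the parameter values (`step=1`, `cv=5`, `max_iter=1000`, accuracy scoring) are assumptions chosen for illustration, not settings from the original study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 columns, of which 3 carry signal.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# step=1 removes one feature per iteration; cv=5 evaluates each
# candidate number of features with 5-fold cross-validation.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
              step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
print("Selected mask:", rfecv.support_)
print("Ranking (1 = kept):", rfecv.ranking_)
```

Unlike plain RFE, the number of features to keep is not passed in advance: it is the subset size with the best cross-validated score.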

<span style="color:blue"><font size="5">3 - Principal Component Analysis (PCA) </font></span>

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the example below, we use PCA and select 3 principal components.

Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math behind PCA on the Principal Component Analysis Wikipedia article.

In [None]:
# feature extraction
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
fit = pca.fit(df)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.
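
Because each principal component is a linear combination of the original columns, PCA does not directly name the "most important" columns. One possible heuristic, shown here as a sketch and not as a standard definition, is to weight the absolute loadings in `components_` by the explained variance ratio of each component. The Iris dataset and the prior standardization step are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
demo_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# PCA is sensitive to scale, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(demo_df)
pca = PCA(n_components=3).fit(X_scaled)

# Rows are components, columns are the original features.
loadings = pd.DataFrame(np.abs(pca.components_),
                        columns=demo_df.columns,
                        index=["PC1", "PC2", "PC3"])

# Heuristic: weight each component's loadings by its share of variance,
# then sum over components to get one score per original column.
contribution = loadings.mul(pca.explained_variance_ratio_, axis=0).sum()
print(contribution.sort_values(ascending=False))
```

The resulting ranking depends on the number of components retained and on the scaling, so it is best read as a rough guide.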

<span style="color:blue"><font size="5">4 - Incremental PCA </font></span>

Incremental principal component analysis (IPCA) is typically used as a replacement for principal component analysis (PCA) when the dataset to be decomposed is too large to fit in memory. IPCA builds a low-rank approximation for the input data using an amount of memory which is independent of the number of input data samples. It is still dependent on the input data features, but changing the batch size allows for control of memory usage.

In [4]:
from sklearn.decomposition import IncrementalPCA

In [None]:
ipca = IncrementalPCA(n_components=3, batch_size=3)
ipca.fit(df)

In [None]:
ipca.transform(df)

In [None]:
print(ipca.explained_variance_ratio_)
print(ipca.components_)
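
When the data truly does not fit in memory, the model can be fed one batch at a time with `partial_fit`. The sketch below uses a random array split into 10 batches as a stand-in for chunks read from disk; the data and batch count are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Random stand-in for a large dataset: 1000 rows, 10 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))

ipca_batched = IncrementalPCA(n_components=3)

# Feed the data in 10 batches of 100 rows; each batch must contain
# at least n_components rows.
for batch in np.array_split(X, 10):
    ipca_batched.partial_fit(batch)

# Project the data onto the 3 components learned incrementally.
X_reduced = ipca_batched.transform(X)
print(X_reduced.shape)
print(ipca_batched.explained_variance_ratio_)
```

In practice the loop would iterate over chunks loaded from a file, so that only one batch is held in memory at a time.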

<span style="color:blue"><font size="5">5 - Feature Importance</font></span>

Ensembles of decision trees such as Random Forest and Extra Trees can be used to estimate the importance of features.
In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

In [3]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
model = ExtraTreesClassifier()
model.fit(df, y1)
print(model.feature_importances_)                           # return the weight of each column 

You can see that we are given an importance score for each attribute: the larger the score, the more important the attribute.
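
To read the scores more easily, they can be paired with the column names and sorted. This is a minimal sketch on synthetic data; the dataset, the column names and the `n_estimators=100` setting are assumptions for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: 6 columns, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
demo_df = pd.DataFrame(X, columns=[f"col_{i}" for i in range(6)])

# random_state makes the randomized trees reproducible.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(demo_df, y)

# Importances are normalized to sum to 1 across all columns.
importances = pd.Series(forest.feature_importances_,
                        index=demo_df.columns).sort_values(ascending=False)
print(importances)
```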

<span style="color:blue"><font size="5">6 - SelectFromModel</font></span>

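Use the SelectFromModel meta-transformer along with an L1-penalized linear model (here, LinearSVC) to select the best features from the dataset. The L1 penalty drives some coefficients to zero, and SelectFromModel keeps only the features whose coefficients are non-zero.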

In [None]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [None]:
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(df, y1)

In [None]:
model = SelectFromModel(lsvc, prefit=True)
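
The selector still has to be applied to the data to obtain the reduced feature set. A minimal sketch of the full sequence, assuming the Iris dataset as a stand-in for `df` and `y1`:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# The L1 penalty pushes the coefficients of weak features to zero.
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)

# prefit=True reuses the already-fitted estimator instead of refitting it.
selector = SelectFromModel(lsvc, prefit=True)

# Keep only the columns selected by the model.
X_new = selector.transform(X)
print(X.shape, "->", X_new.shape)
print(selector.get_support())   # boolean mask over the original columns
```

A smaller `C` means a stronger penalty and therefore fewer selected features.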