![logo](https://user-images.githubusercontent.com/59526258/124226124-27125b80-db3b-11eb-8ba1-488d88018ebb.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Feature Selection
Feature selection is the **process of selecting the features in your data that contribute most to the prediction**, or equivalently of removing the irrelevant and insignificant features. This helps to improve generalization, reduce overfitting, and in some cases even improve the accuracy of the model. It also saves computational resources, because you can train with a smaller set of features.

There are three types of feature selection: **Filter methods** (univariate statistics, Pearson correlation, variance thresholding), **Wrapper methods** (forward, backward, and exhaustive selection), and **Embedded methods** (Lasso, Ridge, Decision Tree). 

We will go into an explanation of each with examples below.

In [2]:
# load data
import pandas as pd
from sklearn.datasets import load_wine

# models
from sklearn.svm import LinearSVC
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# filter
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, SelectFromModel

# wrapper
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from mlxtend.feature_selection import ExhaustiveFeatureSelector

# embedded
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import ExtraTreesClassifier

In [3]:
# Load data
wine_data = load_wine()
df = pd.DataFrame(data=wine_data.data,
                  columns=wine_data.feature_names)

# Adding the target variable
df["target"] = wine_data.target
df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [4]:
# Input and output features
X = df.drop("target", axis=1)
y = df["target"]

In [5]:
# Check data shape
X.shape

(178, 13)

We have 178 samples and 13 variables in the dataset.

## Filter Methods

### Univariate Statistics
Univariate feature selection works by selecting the best features based on univariate statistical tests. 

Scikit-learn univariate feature selection:

- `SelectKBest` removes all but the `k` highest scoring features.

- `SelectPercentile` removes all but a user-specified highest scoring percentage of features.

- `SelectFpr`, `SelectFdr`, and `SelectFwe` select features using common univariate statistical tests for each feature: false positive rate, false discovery rate, or family-wise error, respectively.

- `GenericUnivariateSelect` performs univariate feature selection with a configurable strategy, which makes it possible to choose the best univariate selection strategy with a hyper-parameter search estimator.

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for `SelectKBest` and `SelectPercentile`):

- For regression: `f_regression`, `mutual_info_regression`
- For classification: `chi2`, `f_classif`, `mutual_info_classif`
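
As a rough illustration of the selectors listed above, the sketch below applies `SelectPercentile` and `GenericUnivariateSelect` to the wine data with `f_classif` as the scoring function. The percentile (25) and the `k_best` setting of 4 are arbitrary values chosen for the example, not values taken from this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectPercentile,
    f_classif,
)

X, y = load_wine(return_X_y=True)

# Keep roughly the top 25% of features ranked by the ANOVA F-statistic
# (25 is an arbitrary example value).
percentile_selector = SelectPercentile(score_func=f_classif, percentile=25)
X_percentile = percentile_selector.fit_transform(X, y)

# Same scoring function, but the strategy is chosen through `mode` and `param`,
# which is what makes it convenient to tune in a hyper-parameter search.
generic_selector = GenericUnivariateSelect(
    score_func=f_classif, mode="k_best", param=4
)
X_generic = generic_selector.fit_transform(X, y)

print(X_percentile.shape, X_generic.shape)
```

Both selectors expose `scores_` and `pvalues_` after fitting, so the ranking behind the selection can be inspected.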

We will show an example of `SelectKBest` below.

In [6]:
#TODO : Use SelectKBest to select 4 features
kbest = 
kbest.

# Feature scores
print("Feature score: ", kbest.scores_)

new_features = kbest.transform(X)

new_features[:5, :]

Feature score:  [5.44549882e+00 2.80686046e+01 7.43380598e-01 2.93836955e+01
 4.50263809e+01 1.56230759e+01 6.33343081e+01 1.81548480e+00
 9.36828307e+00 1.09016647e+02 5.18253981e+00 2.33898834e+01
 1.65400671e+04]


array([[ 127.  ,    3.06,    5.64, 1065.  ],
       [ 100.  ,    2.76,    4.38, 1050.  ],
       [ 101.  ,    3.24,    5.68, 1185.  ],
       [ 113.  ,    3.49,    7.8 , 1480.  ],
       [ 118.  ,    2.69,    4.32,  735.  ]])
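
One possible way to complete the `SelectKBest` exercise above is sketched here. It assumes `chi2` as the scoring function (it is imported at the top of the notebook and requires non-negative features, which holds for this dataset); other scoring functions such as `f_classif` would be equally valid choices and may select different features.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y = wine_data.target

# chi2 is an assumed choice of scoring function; k=4 comes from the exercise.
kbest = SelectKBest(score_func=chi2, k=4)
kbest.fit(X, y)

# Map the boolean support mask back to column names.
selected = list(X.columns[kbest.get_support()])
print("Feature score: ", kbest.scores_)
print(selected)

new_features = kbest.transform(X)
```

Reading the names off `get_support()` is usually easier than matching columns of the transformed array by eye.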

### Pearson Correlation Coefficient
The Pearson correlation coefficient measures the strength of the linear relationship between two variables. We would assume that the **good variables** are **highly correlated** with the target. It is also common to remove one of two features that are highly correlated with each other, because they carry largely redundant information.
<br><br>
<div align="center">
  <img alt="Several sets of (x, y) points, with the correlation coefficient of x and y for each set." src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png" width="400" height="200"><br>
  <sup>Sample datasets and their Pearson correlation coefficients.</sup>
</div>
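
Removing one of two highly correlated features can be sketched as follows. The cutoff of 0.8 is a hypothetical value picked for illustration, and dropping the later column of each pair is just one simple convention.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Absolute pairwise Pearson correlations between the features.
corr_matrix = X.corr().abs()

# Keep only the strict upper triangle so each pair is considered once.
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(upper_mask)

# Hypothetical cutoff: drop the later feature of any pair with |r| > 0.8.
cutoff = 0.8
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
X_reduced = X.drop(columns=to_drop)

print(to_drop, X_reduced.shape)
```

After this step no remaining pair of features has an absolute correlation above the cutoff.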
      
We will show an example that drops the variables with a low correlation coefficient with the target variable. We need to set a threshold on the absolute value of the coefficient, for example 0.4, for selecting the variables.

In [7]:
# Pearson correlation coefficient
corr = df.corr()["target"].sort_values(ascending=False)[1:]

# Absolute for positive values
abs_corr = abs(corr)

#TODO : Threshold for features to keep
new_features = 
new_features

# new_df = df[new_features.index]
# new_df.head()

hue                             0.617369
proline                         0.633717
total_phenols                   0.719163
od280/od315_of_diluted_wines    0.788230
flavanoids                      0.847498
Name: target, dtype: float64
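
A possible completion of the thresholding step is sketched below. The cutoff value is an assumption: the five features listed above all have an absolute correlation above 0.6, so the sketch uses 0.6, and a lower cutoff such as 0.4 would be applied in exactly the same way.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()
df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
df["target"] = wine_data.target

# Absolute Pearson correlation of every feature with the target.
abs_corr = df.corr()["target"].drop("target").abs()

# Assumed cutoff (see the note above); keep features with |r| above it.
threshold = 0.6
new_features = abs_corr[abs_corr > threshold].sort_values()
print(new_features)

# Reduced DataFrame containing only the selected columns.
new_df = df[new_features.index]
```

Dropping the `target` row by name avoids relying on it being the first entry after sorting.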

### Variance Threshold
`VarianceThreshold` is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

In [8]:
#TODO : VarianceThreshold to keep features with variance higher than 0.7 
vt = 
new_features = vt.

print(new_features.shape)
new_features[:5, :]

(178, 6)


array([[   1.71,   15.6 ,  127.  ,    3.06,    5.64, 1065.  ],
       [   1.78,   11.2 ,  100.  ,    2.76,    4.38, 1050.  ],
       [   2.36,   18.6 ,  101.  ,    3.24,    5.68, 1185.  ],
       [   1.95,   16.8 ,  113.  ,    3.49,    7.8 , 1480.  ],
       [   2.59,   21.  ,  118.  ,    2.69,    4.32,  735.  ]])
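
The exercise above could be completed along these lines. The threshold of 0.7 comes from the TODO comment; the extra `variances` series is only there to make the selection easier to inspect.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import VarianceThreshold

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Keep only the features whose variance is higher than 0.7.
vt = VarianceThreshold(threshold=0.7)
new_features = vt.fit_transform(X)

# Variances are computed on the raw (unscaled) values of each feature.
variances = pd.Series(vt.variances_, index=X.columns)
kept = list(X.columns[vt.get_support()])

print(new_features.shape)
print(kept)
```

Because variance depends on the units of each feature, features measured on larger scales (such as `proline`) pass the threshold far more easily than features on small scales, which is worth keeping in mind when choosing a threshold for unscaled data.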

### Feature Importance


*Scikit-learn* provides `SelectFromModel` which is a meta-transformer that can be used alongside any estimator that assigns importance to each feature through a specific attribute (such as `coef_`, `feature_importances_`).

L1-based models and tree-based models can be used along with `SelectFromModel` to keep the features with non-zero coefficients (or sufficiently high importances) and discard the irrelevant ones.

1. L1-based feature selection
- Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. 

In [9]:
#TODO : Define a LinearSVC model with "l1" penalty
lsvc = 

#TODO : Select features from L1-based model with SelectFromModel
model = 
new_features = model.

print(new_features.shape)
new_features[:5, :]

(178, 5)


array([[  15.6 ,  127.  ,    3.06,    5.64, 1065.  ],
       [  11.2 ,  100.  ,    2.76,    4.38, 1050.  ],
       [  18.6 ,  101.  ,    3.24,    5.68, 1185.  ],
       [  16.8 ,  113.  ,    3.49,    7.8 , 1480.  ],
       [  21.  ,  118.  ,    2.69,    4.32,  735.  ]])
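
A minimal sketch of the L1-based selection follows. The values `C=0.01` and `max_iter=10000` are assumptions made for this example, not settings taken from the notebook, so the number of selected features may differ from the output shown above. Note that `penalty="l1"` requires `dual=False` in `LinearSVC`.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y = wine_data.target

# C and max_iter are assumed values; a smaller C gives a sparser solution.
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=10000)

# SelectFromModel fits the estimator and keeps features with non-negligible weights.
model = SelectFromModel(lsvc)
new_features = model.fit_transform(X, y)

print(new_features.shape)
print(list(X.columns[model.get_support()]))
```

The strength of the penalty controls how many features survive, so `C` is the main knob to tune here.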

2. Tree-based feature selection
- Tree-based estimators (see the *sklearn.tree* module and forest of trees in the *sklearn.ensemble* module) can be used to compute impurity-based feature importances.

In [10]:
#TODO : Define an ExtraTreesClassifier model
clf = ExtraTreesClassifier(n_estimators=50, random_state=1).fit(X, y)

#TODO : Select features from tree-based model with SelectFromModel
model = 
new_features = model.

print("Feature importance: ", clf.feature_importances_, end="\n\n")
print(new_features.shape)
new_features[:5, :]

Feature importance:  [0.10609269 0.03175796 0.02467003 0.04995935 0.03373059 0.06044565
 0.12563847 0.0200969  0.02791315 0.13603443 0.08649171 0.12532612
 0.17184295]

(178, 6)


array([[1.423e+01, 3.060e+00, 5.640e+00, 1.040e+00, 3.920e+00, 1.065e+03],
       [1.320e+01, 2.760e+00, 4.380e+00, 1.050e+00, 3.400e+00, 1.050e+03],
       [1.316e+01, 3.240e+00, 5.680e+00, 1.030e+00, 3.170e+00, 1.185e+03],
       [1.437e+01, 3.490e+00, 7.800e+00, 8.600e-01, 3.450e+00, 1.480e+03],
       [1.324e+01, 2.690e+00, 4.320e+00, 1.040e+00, 2.930e+00, 7.350e+02]])
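
The tree-based selection could be sketched as below. Unlike the cell above, this version lets `SelectFromModel` fit the classifier itself instead of passing an already fitted one; that is a choice made for the example. When no threshold is given for an estimator like this one, `SelectFromModel` keeps the features whose importance is at least the mean importance, which the sketch checks by hand.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y = wine_data.target

clf = ExtraTreesClassifier(n_estimators=50, random_state=1)

# SelectFromModel fits a clone of clf, available afterwards as model.estimator_.
model = SelectFromModel(clf)
new_features = model.fit_transform(X, y)

importances = model.estimator_.feature_importances_

# Reproduce the default rule by hand: keep features at or above the mean importance.
manual_mask = importances >= importances.mean()

print("Threshold: ", model.threshold_)
print(new_features.shape)
print(list(X.columns[manual_mask]))
```

Passing an explicit `threshold` (for example `"median"` or a float) changes how many features are kept.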

## Wrapper Methods

### Recursive Feature Elimination (RFE)
The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. 

First, the estimator is trained on the initial set of features and the importance of each feature is obtained. Then, the least important features are pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features is reached.

In [11]:
# Defining model to build
lin_reg = LinearRegression()

#TODO : Create a RFE and select 6 features
rfe = 
rfe.

# Summarize the selection of the attributes
print("Num Features: %s" % (rfe.n_features_))
print("Selected Features: %s" % (rfe.support_))
print("Feature Ranking: %s" % (rfe.ranking_))

new_features = rfe.transform(X)
new_features[:5, :]

Num Features: 6
Selected Features: [ True False  True False False  True  True False False False  True  True
 False]
Feature Ranking: [1 5 1 3 8 1 1 2 6 4 1 1 7]


array([[14.23,  2.43,  2.8 ,  3.06,  1.04,  3.92],
       [13.2 ,  2.14,  2.65,  2.76,  1.05,  3.4 ],
       [13.16,  2.67,  2.8 ,  3.24,  1.03,  3.17],
       [14.37,  2.5 ,  3.85,  3.49,  0.86,  3.45],
       [13.24,  2.87,  2.8 ,  2.69,  1.04,  2.93]])
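
The procedure described above can be sketched both with `RFE` and as an explicit loop. The manual loop is only an illustration of the idea (refit, drop the feature with the smallest absolute coefficient, repeat) and assumes one feature is removed per iteration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y = wine_data.target

# Library version: eliminate one feature per iteration until 6 remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=6)
rfe.fit(X, y)
rfe_selected = list(X.columns[rfe.support_])

# Manual version of the same loop, for illustration only.
remaining = list(X.columns)
while len(remaining) > 6:
    coefs = LinearRegression().fit(X[remaining], y).coef_
    # The feature with the smallest absolute coefficient is the least important.
    weakest = remaining[int(np.argmin(np.abs(coefs)))]
    remaining.remove(weakest)

print(rfe_selected)
print(remaining)
```

Since the coefficients are compared directly, the result depends on the scale of each feature; standardizing the inputs first is a common precaution.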

### Forward Feature Selection
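The procedure starts with an empty set of features. The best of the original features is determined and added to the reduced set. At each subsequent step, the best of the remaining features is added, until the desired number of features is reached.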

In [12]:
#TODO : Create a forward feature selector and select 4 features
ffs = 
ffs.

new_features = ffs.transform(X)
new_features[:5, :]

array([[  15.6 ,    3.06,    5.64, 1065.  ],
       [  11.2 ,    2.76,    4.38, 1050.  ],
       [  18.6 ,    3.24,    5.68, 1185.  ],
       [  16.8 ,    3.49,    7.8 , 1480.  ],
       [  21.  ,    2.69,    4.32,  735.  ]])
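
One way to complete the forward selection exercise is with scikit-learn's `SequentialFeatureSelector`. The estimator (a 3-nearest-neighbours classifier) and `cv=5` are assumptions for this sketch, so the selected features may not match the output shown above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y = wine_data.target

# Assumed estimator; candidate features are scored by cross-validation.
knn = KNeighborsClassifier(n_neighbors=3)

ffs = SequentialFeatureSelector(
    knn, n_features_to_select=4, direction="forward", cv=5
)
ffs.fit(X, y)

new_features = ffs.transform(X)
print(list(X.columns[ffs.get_support()]))
```

Each step adds the feature that gives the best cross-validated score when combined with the features already chosen.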

### Backward Feature Elimination
The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set, until the desired number of features is left.

In [13]:
#TODO : Create a backward feature selector and select 4 features
bfs = 
bfs.

new_features = bfs.transform(X)
new_features[:5, :]

array([[  15.6 ,    3.06,    5.64, 1065.  ],
       [  11.2 ,    2.76,    4.38, 1050.  ],
       [  18.6 ,    3.24,    5.68, 1185.  ],
       [  16.8 ,    3.49,    7.8 , 1480.  ],
       [  21.  ,    2.69,    4.32,  735.  ]])
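
To make the elimination step concrete, here is a hand-written sketch of backward elimination. The estimator, the 5-fold cross-validated accuracy used as the criterion, and the helper name `cv_score` are all assumptions of this example. In practice, `SequentialFeatureSelector` with `direction="backward"` implements the same idea.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y = wine_data.target
knn = KNeighborsClassifier(n_neighbors=3)


def cv_score(columns):
    """Mean 5-fold cross-validated accuracy using only `columns` (hypothetical helper)."""
    return cross_val_score(knn, X[columns], y, cv=5).mean()


remaining = list(X.columns)
while len(remaining) > 4:
    # Score the feature set with each candidate removed in turn.
    scores_without = {
        col: cv_score([c for c in remaining if c != col]) for col in remaining
    }
    # The "worst" feature is the one whose removal leaves the best score.
    worst = max(scores_without, key=scores_without.get)
    remaining.remove(worst)

print(remaining)
```

Backward elimination needs more model fits than forward selection when only a few features are kept, because it starts from the full set.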

### Exhaustive Feature Selection
This is a brute-force evaluation of each feature subset. It tries every possible combination of the variables and returns the best performing subset, but it also takes much longer to run.

In [14]:
%%time
knn = KNeighborsClassifier(n_neighbors=3)

#TODO : Create an exhaustive feature selector and select 4 features
efs = 
efs.

new_features = efs.transform(X)
new_features[:5, :]

Features: 1079/1079

Wall time: 9.47 s


array([[14.23,  2.8 ,  3.06,  5.64],
       [13.2 ,  2.65,  2.76,  4.38],
       [13.16,  2.8 ,  3.24,  5.68],
       [14.37,  3.85,  3.49,  7.8 ],
       [13.24,  2.8 ,  2.69,  4.32]])
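
The cost of the exhaustive search comes from the number of subsets to evaluate. The progress line `Features: 1079/1079` above is consistent with searching all subsets of 2 to 4 of the 13 features; the notebook does not state that range, so it is an assumption in this sketch.

```python
from itertools import combinations
from math import comb

n_features = 13

# Assumed search range: subsets containing 2, 3 or 4 features.
sizes = range(2, 5)
n_subsets = sum(comb(n_features, k) for k in sizes)
print(n_subsets)  # 1079

# Enumerating the subsets themselves (as feature indices) gives the same count.
subsets = [s for k in sizes for s in combinations(range(n_features), k)]
assert len(subsets) == n_subsets

# Without a size limit, every non-empty subset would have to be evaluated.
print(2 ** n_features - 1)  # 8191
```

With mlxtend's `ExhaustiveFeatureSelector`, this range would correspond to `min_features=2` and `max_features=4`. Each subset is evaluated with cross-validation, so the run time grows quickly with the number of features.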

## Embedded Methods

### LASSO (Least Absolute Shrinkage and Selection Operator)
This type of regularization (L1) can lead to zero coefficients. Lasso keeps only some of the features while shrinking the coefficients of the others to zero.
<div align="center">
  <img alt="" src="https://user-images.githubusercontent.com/79887667/134531133-7bd90082-ace5-47f9-a4b4-20ab6fca1505.png" width="400" height="200"><br>
  <sup>Cost function for Lasso regression</sup>
</div>

In [15]:
#TODO : Train a Lasso model which includes feature selection internally
lasso = 
lasso.

# Perform feature selection
new_features = [feature for feature, weight in zip(
    X.columns.values, lasso.coef_) if weight != 0]

print(len(new_features))
new_features

3


['alcalinity_of_ash', 'color_intensity', 'proline']
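
A possible completion of the Lasso exercise is sketched below, with a Ridge model fitted alongside for comparison. The regularization strength `alpha=1.0` is an assumption, so the surviving features may differ from the list above. The comparison illustrates that the L2 penalty of Ridge shrinks coefficients without setting them exactly to zero, so it does not select features by itself.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import Lasso, Ridge

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y = wine_data.target

# alpha=1.0 is an assumed regularization strength for both models.
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Features whose coefficient is not exactly zero.
lasso_kept = [name for name, w in zip(X.columns, lasso.coef_) if w != 0]
ridge_kept = [name for name, w in zip(X.columns, ridge.coef_) if w != 0]

print(len(lasso_kept), lasso_kept)
print(len(ridge_kept))
```

Increasing `alpha` makes the Lasso solution sparser, while a very small `alpha` keeps almost every feature.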

To better understand the effect of regularization, here is a helper function that formats the function fit by the regression model so that it can be printed.

In [16]:
# A helper method for pretty-printing the coefficients
import numpy as np


def pretty_print_coefs(coefs, names=None, sort=False):
    # Fall back to generic names X0, X1, ... when none are given
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        # Largest absolute coefficient first
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

In [17]:
print("Lasso model:", pretty_print_coefs(lasso.coef_))

Lasso model: 0.0 * X0 + 0.0 * X1 + 0.0 * X2 + 0.004 * X3 + 0.0 * X4 + -0.0 * X5 + -0.0 * X6 + 0.0 * X7 + -0.0 * X8 + 0.068 * X9 + -0.0 * X10 + -0.0 * X11 + -0.002 * X12


### Tree-based Model

In [18]:
#TODO : Train an ExtraTreesClassifier which calculates the feature importances internally
model = ExtraTreesClassifier(n_estimators=10, random_state=1)
model.fit(X, y)

print(model.feature_importances_)

# Perform feature selection
new_features = [feature for feature, weight in zip(
    X.columns.values, model.feature_importances_) if weight > 0.1]

print(len(new_features))
new_features

[0.12173001 0.04046114 0.0191245  0.03167711 0.02425761 0.0941524
 0.0882276  0.0408133  0.0304245  0.11278127 0.09579309 0.10365887
 0.1968986 ]
4


['alcohol', 'color_intensity', 'od280/od315_of_diluted_wines', 'proline']