# Feature Selection Techniques

In machine learning, feature selection is the process of choosing the subset of input features that contribute most to predicting the output, for use in model construction. It matters most for high-dimensional datasets (i.e., datasets with a large number of features), which take longer to train on and carry a higher risk of overfitting. Feature selection mitigates these problems by keeping the features that matter most to the model, so that dimensionality can be reduced without losing much information. Some benefits of feature selection are:

- Reduce training time
- Reduce the risk of overfitting
- Potentially increase the model's performance
- Reduce the model's complexity so that interpretation becomes easier

## Table of Contents

1. Filter Methods
2. Wrapper Methods
3. Embedded Methods

Before we get started, let's import the necessary Python libraries.

In [1]:
import pandas as pd
import numpy as np

## 1) Filter Methods

<img src='https://www.analyticsvidhya.com/wp-content/uploads/2016/11/Filter_1.png'>

In filter methods, features are selected independently of any machine learning algorithm. Filter methods generally use a specific criterion, such as a statistical test score or the feature's variance, to rank the importance of individual features.

They are generally computationally efficient.
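
Variance is the simplest such criterion: a column that never changes cannot help tell samples apart. The sketch below assumes scikit-learn's `VarianceThreshold` and a small made-up DataFrame (the column names are hypothetical); it is an illustration, not part of the housing workflow used later.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: 'constant' never changes, so it carries no information.
toy = pd.DataFrame({
    "height": [1.2, 3.4, 2.2, 5.1],
    "constant": [7.0, 7.0, 7.0, 7.0],
    "weight": [10.0, 12.5, 9.8, 11.1],
})

# threshold=0.0 drops only zero-variance columns; no target y is needed.
vt = VarianceThreshold(threshold=0.0)
vt.fit(toy)

kept_columns = toy.columns[vt.get_support()]
print(list(kept_columns))
```

Note that the selector is fitted on the features alone, which is what makes it usable before any target or model has been chosen.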

The main weakness of filter methods is that they do not consider the relationships among features. That is why they are mainly used as a pre-processing step in a feature selection pipeline. We will cover the following filter methods:

1. Feature Selection for Regression (Numeric Output)
    - Pearson’s Correlation Coefficient using `f_regression()`
    - Mutual Information for regression using `mutual_info_regression()`
2. Feature Selection for Classification (Categorical Output)
    - ANOVA for numeric features using `f_classif()`
    - Chi-squared for categorical features using `chi2()`
    - Mutual Information for classification using `mutual_info_classif()`
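
Because a filter only needs the feature matrix and the target, it can sit as one step ahead of whatever model comes next. Here is a minimal sketch, assuming scikit-learn's `make_pipeline` and `cross_val_score` with a logistic regression as the downstream model (that model choice is an assumption for illustration, not something this notebook uses). Putting the selector inside the pipeline means it is refitted on each training fold, so the held-out fold never influences which features are kept.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale, keep the 10 highest-scoring features, then fit the classifier.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=10),
    LogisticRegression(max_iter=1000),
)

# Each fold refits the whole pipeline, including the feature selection step.
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(cv_scores.mean())
```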

In [2]:
from sklearn.feature_selection import f_regression, mutual_info_regression
from sklearn.feature_selection import f_classif, chi2, mutual_info_classif
from sklearn.feature_selection import SelectKBest, SelectPercentile

### 1. Feature Selection for Regression (Numeric Output)

#### 1.1 Pearson’s Correlation Coefficient with `f_regression()`

In [3]:
df = pd.read_csv('../dastasets/housing.csv')
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [4]:
df = pd.get_dummies(df, columns=['ocean_proximity'], drop_first=True)
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,1,0,0,0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,1,0,0,0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,1,0,0,0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,1,0,0,0


In [5]:
df.dropna(inplace=True)

In [6]:
x = df.drop('median_house_value', axis=1)
y = df['median_house_value']

In [7]:
all_features = x.columns
all_features

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity_INLAND', 'ocean_proximity_ISLAND',
       'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN'],
      dtype='object')

In [8]:
selector = SelectKBest(k=7, score_func=f_regression)

In [9]:
selector.fit(x, y)

SelectKBest(k=7, score_func=<function f_regression at 0x0000020B66E50C18>)

In [10]:
selector.scores_

array([4.21952209e+01, 4.36553651e+02, 2.34089601e+02, 3.69570525e+02,
       5.05631731e+01, 1.30857779e+01, 8.64023259e+01, 1.83988964e+04,
       6.27683156e+03, 1.13133885e+01, 5.40400961e+02, 4.10703675e+02])

In [11]:
pd.DataFrame(selector.scores_, index=x.columns)

Unnamed: 0,0
longitude,42.195221
latitude,436.553651
housing_median_age,234.089601
total_rooms,369.570525
total_bedrooms,50.563173
population,13.085778
households,86.402326
median_income,18398.896423
ocean_proximity_INLAND,6276.831559
ocean_proximity_ISLAND,11.313389
ocean_proximity_NEAR BAY,540.400961
ocean_proximity_NEAR OCEAN,410.703675


In [12]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

array([ 1,  2,  3,  7,  8, 10, 11], dtype=int64)

In [13]:
selected_features = all_features[selected_features_idx]
selected_features

Index(['latitude', 'housing_median_age', 'total_rooms', 'median_income',
       'ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'],
      dtype='object')

In [14]:
x[selected_features]

Unnamed: 0,latitude,housing_median_age,total_rooms,median_income,ocean_proximity_INLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,37.88,41.0,880.0,8.3252,0,1,0
1,37.86,21.0,7099.0,8.3014,0,1,0
2,37.85,52.0,1467.0,7.2574,0,1,0
3,37.85,52.0,1274.0,5.6431,0,1,0
4,37.85,52.0,1627.0,3.8462,0,1,0
...,...,...,...,...,...,...,...
20635,39.48,25.0,1665.0,1.5603,1,0,0
20636,39.49,18.0,697.0,2.5568,1,0,0
20637,39.43,17.0,2254.0,1.7000,1,0,0
20638,39.43,18.0,1860.0,1.8672,1,0,0


#### 1.2 Mutual Information for regression using `mutual_info_regression()`
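
Mutual information measures any statistical dependence between a feature and the target, not only a linear one, so it can flag features that a correlation-based score misses. The sketch below uses synthetic data (an assumption for illustration) in which the target depends on the square of a feature: the Pearson correlation is close to zero, while the mutual information estimate is clearly larger than that of a pure-noise column. Because `mutual_info_regression` uses a nearest-neighbor estimator with some randomness, `random_state` is fixed here for repeatable numbers.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x_quadratic = rng.uniform(-1, 1, size=n)  # y depends on this column, but not linearly
x_noise = rng.uniform(-1, 1, size=n)      # unrelated to y
X = np.column_stack([x_quadratic, x_noise])
y = x_quadratic**2 + 0.01 * rng.normal(size=n)

pearson_r = np.corrcoef(x_quadratic, y)[0, 1]
mi = mutual_info_regression(X, y, random_state=0)

print(f"Pearson r of quadratic feature: {pearson_r:.3f}")
print(f"Mutual information: quadratic={mi[0]:.3f}, noise={mi[1]:.3f}")
```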

In [15]:
df = pd.read_csv('../dastasets/housing.csv')
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [16]:
df = pd.get_dummies(df, columns=['ocean_proximity'], drop_first=True)
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,1,0,0,0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,1,0,0,0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,1,0,0,0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,1,0,0,0


In [17]:
df.dropna(inplace=True)

In [18]:
x = df.drop('median_house_value', axis=1)
y = df['median_house_value']

In [19]:
all_features = x.columns
all_features

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity_INLAND', 'ocean_proximity_ISLAND',
       'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN'],
      dtype='object')

In [20]:
selector = SelectPercentile(percentile=50, score_func=mutual_info_regression)

In [21]:
selector.fit(x, y)

SelectPercentile(percentile=50,
                 score_func=<function mutual_info_regression at 0x0000020B671E08B8>)

In [22]:
selector.scores_

array([0.40081176, 0.37188996, 0.03226773, 0.04197172, 0.02486006,
       0.0224751 , 0.02987377, 0.38944072, 0.20221641, 0.00197068,
       0.01275666, 0.0139668 ])

In [23]:
pd.DataFrame(selector.scores_, index=x.columns)

Unnamed: 0,0
longitude,0.400812
latitude,0.37189
housing_median_age,0.032268
total_rooms,0.041972
total_bedrooms,0.02486
population,0.022475
households,0.029874
median_income,0.389441
ocean_proximity_INLAND,0.202216
ocean_proximity_ISLAND,0.001971
ocean_proximity_NEAR BAY,0.012757
ocean_proximity_NEAR OCEAN,0.013967


In [24]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

array([0, 1, 2, 3, 7, 8], dtype=int64)

In [25]:
selected_features = all_features[selected_features_idx]
selected_features

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'median_income', 'ocean_proximity_INLAND'],
      dtype='object')

In [26]:
x[selected_features]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,median_income,ocean_proximity_INLAND
0,-122.23,37.88,41.0,880.0,8.3252,0
1,-122.22,37.86,21.0,7099.0,8.3014,0
2,-122.24,37.85,52.0,1467.0,7.2574,0
3,-122.25,37.85,52.0,1274.0,5.6431,0
4,-122.25,37.85,52.0,1627.0,3.8462,0
...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,1.5603,1
20636,-121.21,39.49,18.0,697.0,2.5568,1
20637,-121.22,39.43,17.0,2254.0,1.7000,1
20638,-121.32,39.43,18.0,1860.0,1.8672,1


### 2. Feature Selection for Classification (Categorical Output)

#### 2.1 ANOVA for numeric features using `f_classif()`
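
`f_classif` computes a one-way ANOVA F-statistic per feature: it compares how far apart the class means are with how much the feature varies within each class. A feature whose values differ clearly between classes gets a high score. As a sanity check under that reading, the sketch below compares `f_classif` with SciPy's `scipy.stats.f_oneway` applied to one feature at a time (using SciPy here is an addition for illustration, not something the notebook does).

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True)

f_scores, p_values = f_classif(X, y)

# One-way ANOVA by hand: split each feature's values by class label.
manual_f = np.array([
    f_oneway(*[X[y == label, j] for label in np.unique(y)]).statistic
    for j in range(X.shape[1])
])

print(np.allclose(f_scores, manual_f))
```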

In [27]:
from sklearn.datasets import load_breast_cancer

In [28]:
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

In [29]:
x

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [30]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [31]:
all_features = x.columns
all_features

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [32]:
selector = SelectPercentile(percentile=50, score_func=f_classif)

In [33]:
selector.fit(x, y)

SelectPercentile(percentile=50)

In [34]:
selector.scores_

array([6.46981021e+02, 1.18096059e+02, 6.97235272e+02, 5.73060747e+02,
       8.36511234e+01, 3.13233079e+02, 5.33793126e+02, 8.61676020e+02,
       6.95274435e+01, 9.34592949e-02, 2.68840327e+02, 3.90947023e-02,
       2.53897392e+02, 2.43651586e+02, 2.55796780e+00, 5.32473391e+01,
       3.90144816e+01, 1.13262760e+02, 2.41174067e-02, 3.46827476e+00,
       8.60781707e+02, 1.49596905e+02, 8.97944219e+02, 6.61600206e+02,
       1.22472880e+02, 3.04341063e+02, 4.36691939e+02, 9.64385393e+02,
       1.18860232e+02, 6.64439606e+01])

In [35]:
pd.DataFrame(selector.scores_, index=x.columns)

Unnamed: 0,0
mean radius,646.981021
mean texture,118.096059
mean perimeter,697.235272
mean area,573.060747
mean smoothness,83.651123
mean compactness,313.233079
mean concavity,533.793126
mean concave points,861.67602
mean symmetry,69.527444
mean fractal dimension,0.093459


In [36]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

array([ 0,  2,  3,  5,  6,  7, 10, 12, 13, 20, 22, 23, 25, 26, 27],
      dtype=int64)

In [37]:
selected_features = all_features[selected_features_idx]
selected_features

Index(['mean radius', 'mean perimeter', 'mean area', 'mean compactness',
       'mean concavity', 'mean concave points', 'radius error',
       'perimeter error', 'area error', 'worst radius', 'worst perimeter',
       'worst area', 'worst compactness', 'worst concavity',
       'worst concave points'],
      dtype='object')

In [38]:
x[selected_features]

Unnamed: 0,mean radius,mean perimeter,mean area,mean compactness,mean concavity,mean concave points,radius error,perimeter error,area error,worst radius,worst perimeter,worst area,worst compactness,worst concavity,worst concave points
0,17.99,122.80,1001.0,0.27760,0.30010,0.14710,1.0950,8.589,153.40,25.380,184.60,2019.0,0.66560,0.7119,0.2654
1,20.57,132.90,1326.0,0.07864,0.08690,0.07017,0.5435,3.398,74.08,24.990,158.80,1956.0,0.18660,0.2416,0.1860
2,19.69,130.00,1203.0,0.15990,0.19740,0.12790,0.7456,4.585,94.03,23.570,152.50,1709.0,0.42450,0.4504,0.2430
3,11.42,77.58,386.1,0.28390,0.24140,0.10520,0.4956,3.445,27.23,14.910,98.87,567.7,0.86630,0.6869,0.2575
4,20.29,135.10,1297.0,0.13280,0.19800,0.10430,0.7572,5.438,94.44,22.540,152.20,1575.0,0.20500,0.4000,0.1625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,142.00,1479.0,0.11590,0.24390,0.13890,1.1760,7.673,158.70,25.450,166.10,2027.0,0.21130,0.4107,0.2216
565,20.13,131.20,1261.0,0.10340,0.14400,0.09791,0.7655,5.203,99.04,23.690,155.00,1731.0,0.19220,0.3215,0.1628
566,16.60,108.30,858.1,0.10230,0.09251,0.05302,0.4564,3.425,48.55,18.980,126.70,1124.0,0.30940,0.3403,0.1418
567,20.60,140.10,1265.0,0.27700,0.35140,0.15200,0.7260,5.772,86.22,25.740,184.60,1821.0,0.86810,0.9387,0.2650


#### 2.2 Chi-squared for categorical features using `chi2()`
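
For non-negative features such as one-hot indicators or counts, `chi2` compares the observed total of each feature within each class against the total expected if feature and class were independent. The sketch below reproduces that calculation by hand on a tiny made-up indicator matrix (the data and variable names are hypothetical) and checks it against scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical one-hot style features (rows = samples) and a binary target.
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])

sk_scores, sk_pvalues = chi2(X, y)

# Observed: per-class column sums. Expected: class frequency times column total.
classes = np.unique(y)
observed = np.array([X[y == c].sum(axis=0) for c in classes])
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, X.sum(axis=0))
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

print(np.allclose(sk_scores, manual_scores))
```

This is also why categorical columns are one-hot encoded before calling `chi2`: the test works on non-negative totals, and scikit-learn rejects negative inputs.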

In [39]:
df = pd.read_csv('../dastasets/tennis.csv')
df

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes
5,rainy,cool,normal,True,no
6,overcast,cool,normal,True,yes
7,sunny,mild,high,False,no
8,sunny,cool,normal,False,yes
9,rainy,mild,normal,False,yes


In [40]:
df = pd.get_dummies(df, columns=df.columns, drop_first=True)
df

Unnamed: 0,outlook_rainy,outlook_sunny,temp_hot,temp_mild,humidity_normal,windy_True,play_yes
0,0,1,1,0,0,0,0
1,0,1,1,0,0,1,0
2,0,0,1,0,0,0,1
3,1,0,0,1,0,0,1
4,1,0,0,0,1,0,1
5,1,0,0,0,1,1,0
6,0,0,0,0,1,1,1
7,0,1,0,1,0,0,0
8,0,1,0,0,1,0,1
9,1,0,0,1,1,0,1


In [41]:
x = df.drop('play_yes', axis=1)
y = df['play_yes']

In [42]:
all_features = x.columns
all_features

Index(['outlook_rainy', 'outlook_sunny', 'temp_hot', 'temp_mild',
       'humidity_normal', 'windy_True'],
      dtype='object')

In [43]:
selector = SelectKBest(k=3, score_func=chi2)

In [44]:
selector.fit(x, y)

SelectKBest(k=3, score_func=<function chi2 at 0x0000020B66E50798>)

In [45]:
selector.scores_

array([0.04      , 1.28444444, 0.35555556, 0.01481481, 1.4       ,
       0.53333333])

In [46]:
pd.DataFrame(selector.scores_, index=x.columns)

Unnamed: 0,0
outlook_rainy,0.04
outlook_sunny,1.284444
temp_hot,0.355556
temp_mild,0.014815
humidity_normal,1.4
windy_True,0.533333


In [47]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

array([1, 4, 5], dtype=int64)

In [48]:
selected_features = all_features[selected_features_idx]
selected_features

Index(['outlook_sunny', 'humidity_normal', 'windy_True'], dtype='object')

In [49]:
x[selected_features]

Unnamed: 0,outlook_sunny,humidity_normal,windy_True
0,1,0,0
1,1,0,1
2,0,0,0
3,0,0,0
4,0,1,0
5,0,1,1
6,0,1,1
7,1,0,0
8,1,1,0
9,0,1,0


#### 2.3 Mutual Information for classification using `mutual_info_classif()`

In [50]:
from sklearn.datasets import load_breast_cancer

In [51]:
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

In [52]:
x

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [53]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [54]:
all_features = x.columns
all_features

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [55]:
selector = SelectPercentile(percentile=80, score_func=mutual_info_classif)

In [56]:
selector.fit(x, y)

SelectPercentile(percentile=80,
                 score_func=<function mutual_info_classif at 0x0000020B671E0A68>)

In [57]:
selector.scores_

array([0.36790479, 0.09499555, 0.40225126, 0.36073177, 0.08191249,
       0.21232085, 0.37377277, 0.43948957, 0.06509397, 0.01025557,
       0.2480213 , 0.00081942, 0.27476618, 0.33969713, 0.01475167,
       0.07627261, 0.11499053, 0.12640156, 0.0169378 , 0.0375776 ,
       0.45177571, 0.11982822, 0.47518323, 0.46470383, 0.09422606,
       0.22663482, 0.31507528, 0.43587429, 0.09398874, 0.06876209])

In [58]:
pd.DataFrame(selector.scores_, index=x.columns)

Unnamed: 0,0
mean radius,0.367905
mean texture,0.094996
mean perimeter,0.402251
mean area,0.360732
mean smoothness,0.081912
mean compactness,0.212321
mean concavity,0.373773
mean concave points,0.43949
mean symmetry,0.065094
mean fractal dimension,0.010256


In [59]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

array([ 0,  1,  2,  3,  4,  5,  6,  7, 10, 12, 13, 15, 16, 17, 20, 21, 22,
       23, 24, 25, 26, 27, 28, 29], dtype=int64)

In [60]:
selected_features = all_features[selected_features_idx]
selected_features

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'radius error', 'perimeter error', 'area error',
       'compactness error', 'concavity error', 'concave points error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [61]:
x[selected_features]

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,radius error,perimeter error,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,1.0950,8.589,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.5435,3.398,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.7456,4.585,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.4956,3.445,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.7572,5.438,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,1.1760,7.673,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.7655,5.203,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.4564,3.425,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.7260,5.772,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


## 2) Wrapper Methods

<img src='https://www.analyticsvidhya.com/wp-content/uploads/2016/11/Wrapper_1.png'>

Wrapper methods try to find the subset of features that yields the best performance for a model by training, evaluating, and comparing the model on different combinations of features.

Wrapper methods enable the detection of relationships among features. However, they can be computationally expensive, especially if the number of features is high. The risk of overfitting is also high if the number of instances in the dataset is insufficient. 
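
To make the cost concrete, here is a brute-force wrapper written out by hand: every non-empty subset of features is scored with cross-validation and the best one is kept. This is a sketch under assumptions chosen for illustration (the iris dataset, a logistic regression, 5-fold accuracy). With `n` features there are `2**n - 1` subsets, which is only feasible here because iris has four features.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

results = {}
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        model = LogisticRegression(max_iter=1000)
        # Mean 5-fold accuracy using only the columns in this subset.
        results[subset] = cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

best_subset = max(results, key=results.get)
print(len(results), best_subset, round(results[best_subset], 3))
```

Practical wrapper methods avoid this exhaustive loop by searching greedily, which is faster but no longer guaranteed to find the best subset.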

There are some differences between filter and wrapper methods:
- Filter methods do not use a machine learning model to decide whether a feature is good or bad, whereas wrapper methods train a model on candidate features to decide whether they are essential.
- Filter methods are much faster than wrapper methods because they do not involve training models. Wrapper methods are computationally costly, and for massive datasets they are often not the most practical choice.
- Filter methods may fail to find the best subset of features when there is not enough data to estimate the statistical relationship between features and target reliably. Wrapper methods evaluate subsets directly against model performance and so often find a better subset, although only an exhaustive search over all subsets guarantees the best one; greedy procedures such as recursive feature elimination do not.
- Features chosen by a wrapper method can lead to overfitting in the final model, because the selection itself was tuned on model performance over the same data. Features chosen by filter methods are less prone to this in most cases.

We will discuss a very popular wrapper method:

**Recursive Feature Elimination `RFE()`**

### Recursive Feature Elimination

Recursive feature elimination performs a greedy search to find the best-performing feature subset. It iteratively fits models and identifies the best or the worst performing feature at each iteration. It then builds the subsequent models with the remaining features until all the features are explored, and finally ranks the features based on the order of their elimination.

* Uses an external estimator to assign weights to features.
* First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through either a `coef_` attribute or a `feature_importances_` attribute.
* Then, the least important features are pruned from the current set of features.
* This procedure is repeated recursively on the pruned set until the desired number of features is reached.
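
The loop described in these steps can be sketched by hand. The version below is an illustration rather than scikit-learn's implementation: it assumes synthetic data from `make_classification`, a decision tree with a fixed `random_state`, and removal of one feature per round. It then runs `RFE` on the same data to show the `ranking_` attribute, where rank 1 marks the selected features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
n_features_to_select = 3

# Hand-written elimination loop: refit, drop the least important feature, repeat.
remaining = list(range(X.shape[1]))
while len(remaining) > n_features_to_select:
    tree = DecisionTreeClassifier(random_state=0).fit(X[:, remaining], y)
    weakest = int(np.argmin(tree.feature_importances_))
    remaining.pop(weakest)

# The same idea through scikit-learn's RFE (step=1 removes one feature per round).
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=n_features_to_select, step=1).fit(X, y)

print("manual loop kept:", remaining)
print("RFE ranking_:   ", rfe.ranking_)
```

Fixing `random_state` matters here: an unseeded decision tree can break ties differently between runs, so the selected features may change from one execution to the next.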

In [62]:
df = pd.read_csv("../dastasets/diabetes.csv")
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [63]:
x = df.drop('Outcome', axis=1)
y = df['Outcome']

In [64]:
all_features = x.columns
all_features

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

In [65]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

In [66]:
model = DecisionTreeClassifier()
selector = RFE(estimator=model, n_features_to_select=3)

In [67]:
selector.fit(x, y)

RFE(estimator=DecisionTreeClassifier(), n_features_to_select=3)

In [68]:
selector.get_support(indices=True)

array([1, 5, 6], dtype=int64)

In [69]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

array([1, 5, 6], dtype=int64)

In [70]:
selected_features = all_features[selected_features_idx]
selected_features

Index(['Glucose', 'BMI', 'DiabetesPedigreeFunction'], dtype='object')

In [71]:
x[selected_features]

Unnamed: 0,Glucose,BMI,DiabetesPedigreeFunction
0,148,33.6,0.627
1,85,26.6,0.351
2,183,23.3,0.672
3,89,28.1,0.167
4,137,43.1,2.288
...,...,...,...
763,101,32.9,0.171
764,122,36.8,0.340
765,121,26.2,0.245
766,126,30.1,0.349


## 3) Embedded Methods

<img src='https://www.analyticsvidhya.com/wp-content/uploads/2016/11/Embedded_1.png'>

Embedded methods combine the strong points of filter and wrapper methods by taking advantage of machine learning algorithms that have their own built-in feature selection process. They integrate feature selection into training (i.e., feature selection and training are performed simultaneously). Embedded methods are generally more efficient than wrapper methods because they do not need to retrain a model for every feature subset being examined. Some machine learning algorithms that can be used for feature selection are:
- LASSO regression
- Ridge regression
- Decision tree
- Random forest
- Support vector machine
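
LASSO regression is a compact illustration of the idea: its L1 penalty pushes the coefficients of unhelpful features to exactly zero during training, so the fitted model doubles as a selector. Here is a minimal sketch, assuming synthetic data in which only columns 0 and 3 influence the target and an arbitrarily chosen `alpha=0.2`:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Synthetic target: only columns 0 and 3 matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=200)

# Training and selection happen in the same fit: the L1 penalty zeroes weak coefficients.
selector = SelectFromModel(estimator=Lasso(alpha=0.2))
selector.fit(X, y)

print("coefficients:", np.round(selector.estimator_.coef_, 3))
print("selected columns:", selector.get_support(indices=True))
```

The value of `alpha` controls how aggressive the selection is; in practice it would be tuned, for example by cross-validation, rather than fixed by hand as it is here.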

Below, we focus on feature selection using a random forest.

In [72]:
df = pd.read_csv("../dastasets/diabetes.csv")
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [73]:
x = df.drop('Outcome', axis=1)
y = df['Outcome']

In [74]:
all_features = x.columns
all_features

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

In [75]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [76]:
rfc = RandomForestClassifier(random_state=0, criterion='gini') # Use gini criterion to define feature importance

In [77]:
selector = SelectFromModel(estimator=rfc)

In [78]:
selector.fit(x, y)

SelectFromModel(estimator=RandomForestClassifier(random_state=0))

In [79]:
selector.get_support(indices=True)

array([1, 5, 6, 7], dtype=int64)

In [80]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

array([1, 5, 6, 7], dtype=int64)

In [81]:
selected_features = all_features[selected_features_idx]
selected_features

Index(['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age'], dtype='object')

In [82]:
x[selected_features]

Unnamed: 0,Glucose,BMI,DiabetesPedigreeFunction,Age
0,148,33.6,0.627,50
1,85,26.6,0.351,31
2,183,23.3,0.672,32
3,89,28.1,0.167,21
4,137,43.1,2.288,33
...,...,...,...,...
763,101,32.9,0.171,63
764,122,36.8,0.340,27
765,121,26.2,0.245,30
766,126,30.1,0.349,47
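
`SelectFromModel` kept four of the eight features even though no cutoff was passed. When `threshold` is left unset and the estimator is not L1-penalized, scikit-learn uses the mean feature importance as the cutoff. The sketch below checks this on synthetic data (the dataset parameters are assumptions for illustration) by reading the fitted `threshold_` and `estimator_.feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

selector = SelectFromModel(estimator=RandomForestClassifier(random_state=0))
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
# With no explicit threshold, the cutoff is the mean importance.
print("cutoff:", selector.threshold_, "mean importance:", importances.mean())
print("kept:", selector.get_support(indices=True))
```

Passing `threshold="median"` or a number, or capping the count with `max_features`, changes how many features survive.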


# Great Work!