# Machine Learning with Python Cookbook
# Ch 10: Dimensionality Reduction Using Feature *Selection*

## 10.1 Thresholding Numerical Feature Variance
You have a set of numerical features and want to remove those with low variance (i.e., those likely to carry little information).

### Select a subset of features with variances above a given threshold:

In [1]:
# Load libraries
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

In [2]:
# Import some data to play with
iris = datasets.load_iris()

In [3]:
# Create features and target
features = iris.data
target = iris.target

In [4]:
# Create thresholder
thresholder = VarianceThreshold(threshold=.5)

In [5]:
# Create high variance feature matrix
features_high_variance = thresholder.fit_transform(features)

In [6]:
# View high variance feature matrix
features_high_variance[0:3]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])

We can see the variance for each feature using `variances_`:

In [7]:
# View variances
thresholder.fit(features).variances_

array([0.68112222, 0.18871289, 3.09550267, 0.57713289])
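As a rough check on what the thresholder is doing, the sketch below recomputes per-column variances with NumPy and applies the same cutoff by hand. The toy matrix and the names `toy_features`, `column_variances`, and `keep_mask` are illustrative assumptions, not part of the recipe. It assumes the thresholder uses the population variance (`ddof=0`) and keeps only features strictly above the threshold.

```python
import numpy as np

# Hypothetical toy matrix: column 1 is nearly constant, columns 0 and 2 vary more
toy_features = np.array([[1.0, 5.0, 0.0],
                         [3.0, 5.1, 4.0],
                         [5.0, 4.9, 8.0],
                         [7.0, 5.0, 2.0]])

threshold = 0.5

# Population variance of each column (np.var defaults to ddof=0)
column_variances = np.var(toy_features, axis=0)

# Keep only the columns whose variance is strictly above the threshold
keep_mask = column_variances > threshold
toy_high_variance = toy_features[:, keep_mask]

print(column_variances)
print(toy_high_variance.shape)
```

Because variance is expressed in squared units of the feature, a single threshold is only meaningful when the features share a scale. Standardizing every feature to unit variance beforehand would make the threshold useless, since all variances would then be equal.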

## 10.2 Thresholding Binary Feature Variance
You have a set of binary categorical features and want to remove those with low variance

### Select the features whose Bernoulli variance, $p(1 - p)$, is above a given threshold:

In [8]:
# Load library
from sklearn.feature_selection import VarianceThreshold

In [9]:
# Create feature matrix with:
# Feature 0: 80% class 0
# Feature 1: 80% class 1
# Feature 2: 60% class 0, 40% class 1
features = [[0, 1, 0], 
            [0, 1, 1], 
            [0, 1, 0], 
            [0, 1, 1], 
            [1, 0, 0]]

In [10]:
# Keep features whose variance exceeds .75 * (1 - .75)
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))
thresholder.fit_transform(features)

array([[0],
       [1],
       [0],
       [1],
       [0]])

In [11]:
# View the threshold value
(.75 * (1 - .75))

0.1875

In [12]:
# View variances
thresholder.fit(features).variances_

array([0.16, 0.16, 0.24])
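The variances above can be reproduced directly from the Bernoulli formula $p(1 - p)$, where $p$ is the proportion of ones in a column. The following is a minimal sketch using the same matrix; the names `binary_features`, `p`, and `kept_columns` are introduced here for illustration only.

```python
import numpy as np

# Same binary feature matrix as in the recipe
binary_features = np.array([[0, 1, 0],
                            [0, 1, 1],
                            [0, 1, 0],
                            [0, 1, 1],
                            [1, 0, 0]])

# Proportion of ones in each column
p = binary_features.mean(axis=0)

# Variance of a Bernoulli random variable: p * (1 - p)
bernoulli_variances = p * (1 - p)

# Threshold used in the recipe
threshold = 0.75 * (1 - 0.75)

# Indices of the columns that survive the threshold
kept_columns = np.flatnonzero(bernoulli_variances > threshold)

print(bernoulli_variances)
print(kept_columns)
```

Only the third column (index 2), with a 60/40 split, has a variance above 0.1875, which matches the single column returned by `fit_transform`.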

## 10.3 Handling Highly Correlated Features
You have a feature matrix and suspect some features are highly correlated

### Use a correlation matrix to check for highly correlated features. If they exist, consider dropping one of them:

In [13]:
# Load libraries
import pandas as pd
import numpy as np

In [14]:
# Create feature matrix with two highly correlated features
features = np.array([[1, 1, 1], 
                     [2, 2, 0], 
                     [3, 3, 1], 
                     [4, 4, 0], 
                     [5, 5, 1], 
                     [6, 6, 0], 
                     [7, 7, 1], 
                     [8, 7, 0], 
                     [9, 7, 1]])

In [15]:
# Convert feature matrix into a DataFrame
df = pd.DataFrame(features)

In [16]:
# Create correlation matrix (absolute values, so strong negative correlations also count)
corr_matrix = df.corr().abs()

In [17]:
corr_matrix

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 0.976103 | 0.0 |
| 1 | 0.976103 | 1.0 | 0.034503 |
| 2 | 0.0 | 0.034503 | 1.0 |


In [18]:
# View the signed correlations for comparison
df.corr()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 0.976103 | 0.0 |
| 1 | 0.976103 | 1.0 | -0.034503 |
| 2 | 0.0 | -0.034503 | 1.0 |


In [19]:
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), 
                                  k=1).astype(bool))

In [20]:
upper

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | NaN | 0.976103 | 0.0 |
| 1 | NaN | NaN | 0.034503 |
| 2 | NaN | NaN | NaN |


In [21]:
# Find index of feature columns with correlation > 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

In [22]:
to_drop

[1]

In [23]:
# Drop the flagged features by column label
df_reduced = df.drop(columns=to_drop)
df_reduced.head(3)

|   | 0 | 2 |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 0 |
| 2 | 3 | 1 |
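The steps in this recipe can be collected into one helper. The following is a sketch only: the function name `drop_highly_correlated` and its `threshold` parameter are hypothetical, not part of pandas. It uses the built-in `bool` type for the mask, since the `np.bool` alias is no longer available in recent NumPy releases.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop one column from each pair whose absolute correlation exceeds threshold.

    Hypothetical helper that mirrors the recipe's steps.
    """
    corr_matrix = df.corr().abs()
    # Boolean mask for the strict upper triangle (diagonal excluded)
    upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(upper_mask)
    # A column is flagged if it is highly correlated with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Same feature matrix as in the recipe
example_df = pd.DataFrame(np.array([[1, 1, 1],
                                    [2, 2, 0],
                                    [3, 3, 1],
                                    [4, 4, 0],
                                    [5, 5, 1],
                                    [6, 6, 0],
                                    [7, 7, 1],
                                    [8, 7, 0],
                                    [9, 7, 1]]))

reduced_df, dropped = drop_highly_correlated(example_df)
print(dropped)
print(list(reduced_df.columns))
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one removed. Which member of a pair to keep is a judgment call that this sketch does not attempt to make.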


## 10.4 Removing Irrelevant Features for Classification
You have a categorical target vector and want to remove uninformative features

### If the features are categorical, calculate a chi-square ($\chi^2$) statistic between each feature and the target vector:

In [24]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif

In [25]:
# Load data
iris = load_iris()
features = iris.data
target = iris.target

In [26]:
# Cast features to integers so they can be treated as categorical
# (chi2 requires non-negative feature values)
features = features.astype(int)

In [27]:
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)

In [28]:
# Show results
print(f"Original # of features: {features.shape[1]}")
print(f"Reduced # of features: {features_kbest.shape[1]}")

Original # of features: 4
Reduced # of features: 2
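To see which features were kept, and how strongly each one scored, the fitted selector can be inspected through its `scores_` attribute and `get_support` method. The following is a minimal sketch that reloads the data so it runs on its own; the names `features_int` and `selected_indices` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Reload the data and apply the same integer cast as in the recipe
iris = load_iris()
features_int = iris.data.astype(int)
target = iris.target

# Fit the selector without transforming, so the scores can be inspected
chi2_selector = SelectKBest(chi2, k=2).fit(features_int, target)

# Print the chi-squared statistic for each feature and whether it was kept
for name, score, kept in zip(iris.feature_names,
                             chi2_selector.scores_,
                             chi2_selector.get_support()):
    print(f"{name}: chi2={score:.2f}, kept={kept}")

# Column indices of the selected features
selected_indices = chi2_selector.get_support(indices=True)
```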


### If the features are quantitative, compute the ANOVA F-value between each feature and the target vector:

In [29]:
# Select two features with highest F-values
# (note: `features` still holds the integer-cast values from In [26])
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)

In [30]:
# Show results
print(f"Original # of features: {features.shape[1]}")
print(f"Reduced # of features: {features_kbest.shape[1]}")

Original # of features: 4
Reduced # of features: 2
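Since the ANOVA F-value is intended for quantitative features, it may be preferable to run it on the original floating-point measurements rather than on integer-cast values. The following is a hedged sketch of that variant; it reloads the data so it is self-contained, and the names `features_float` and `selected_names` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Reload the data and keep the original floating-point measurements
iris = load_iris()
features_float = iris.data
target = iris.target

# Select the two features with the highest ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest_float = fvalue_selector.fit_transform(features_float, target)

# Map the selected column indices back to feature names
selected_names = [iris.feature_names[i]
                  for i in fvalue_selector.get_support(indices=True)]
print(selected_names)
```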


### Instead of selecting a specific number of features, we can also use `SelectPercentile` to select the top *n* percent of features:

In [31]:
# Load library
from sklearn.feature_selection import SelectPercentile

In [32]:
# Select top 75% of features with highest F-values
fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(features, target)

In [33]:
# Show results
print(f"Original # of features: {features.shape[1]}")
print(f"Reduced # of features: {features_kbest.shape[1]}")

Original # of features: 4
Reduced # of features: 3


## 10.5 Recursively Eliminating Features
You want to automatically select the best features to keep

### Use sklearn's `RFECV` to conduct recursive feature elimination (RFE) using cross-validation (CV):
That is, repeatedly train a model, removing the least important feature each time, and keep the number of features that gives the best cross-validated performance (e.g., accuracy or mean squared error).

In [34]:
# Load libraries
import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import linear_model

In [35]:
# Suppress an annoying but harmless warning
# warnings.filterwarnings(action="ignore", module="scipy",
#                         message="^internal gelsd")

In [36]:
# Generate features matrix and target vector
features, target = make_regression(n_samples=10000,
                                   n_features=100,
                                   n_informative=2,
                                   random_state=1)

In [37]:
# Create a linear regression
ols = linear_model.LinearRegression()

In [38]:
# Recursively eliminate features
rfecv = RFECV(estimator=ols, step=1, scoring="neg_mean_squared_error")
rfecv.fit(features, target)
rfecv.transform(features)

array([[ 0.00850799,  0.7031277 ],
       [-1.07500204,  2.56148527],
       [ 1.37940721, -1.77039484],
       ...,
       [-0.80331656, -1.60648007],
       [ 0.39508844, -1.34564911],
       [-0.55383035,  0.82880112]])

In [39]:
# Number of best features
rfecv.n_features_

2

In [40]:
# Which features are selected (True = kept)
rfecv.support_

array([False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

In [41]:
# Rank features best (1) to worst
rfecv.ranking_

array([93, 64, 74, 57, 66,  1, 80, 75, 44, 25, 52, 84, 36, 77, 14, 24, 94,
       56, 12,  4, 72, 90, 41, 63, 31, 82,  5, 39,  9, 69, 33, 51, 29,  6,
       34, 62, 50, 78, 19,  1, 59, 38, 20, 87, 68, 89, 83,  7, 47, 28, 86,
       30, 54, 15, 43, 17,  8, 11, 40, 85,  3, 23, 48, 35, 18,  2, 70, 71,
       60, 45, 96, 46, 65, 58, 53, 32, 81, 10, 61, 99, 97, 26, 95, 13, 37,
       98, 67, 91, 73, 55, 27, 16, 42, 92, 49, 22, 76, 21, 88, 79])
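The elimination loop behind this recipe can be sketched by hand. The version below is a simplified illustration, not scikit-learn's implementation: it fits ordinary least squares with NumPy, drops the feature with the smallest absolute coefficient, and repeats until a fixed number of features remains. It omits the cross-validation step that `RFECV` uses to choose that number. The function `backward_eliminate` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def backward_eliminate(X, y, n_features_to_keep):
    """Hypothetical helper: drop the weakest feature until the target count remains."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_keep:
        # Least-squares fit on the remaining columns plus an intercept column
        design = np.column_stack([X[:, remaining], np.ones(len(X))])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Rank features by absolute coefficient, ignoring the intercept
        weakest = int(np.argmin(np.abs(coefs[:-1])))
        remaining.pop(weakest)
    return remaining


# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

kept = backward_eliminate(X, y, n_features_to_keep=2)
print(kept)
```

Ranking by coefficient magnitude is only sensible when the features are on comparable scales, as they are in this synthetic example.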