
FEATURE SELECTION TECHNIQUES AND WHY FEATURE SELECTION IS IMPORTANT.
Importance of feature selection:
1. Helps prevent overfitting.
2. Improves model efficiency.
3. Reduces training time.
4. Helps models generalize better.
5. Reduces collinearity and improves interpretability.
6. Helps avoid the curse of dimensionality.

In [1]:
#Univariate selection.
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [2]:
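path = 'pima-indians-diabetes.csv'
# read_csv treats the first row as a header by default; pass header=None
# if the file has no header row.
df = read_csv(path)
array = df.values

# Separate the array into input (columns 0-7) and output (column 8) components.
X = array[:,0:8]
Y = array[:,8]

# Score each feature with the chi-squared test and keep the 4 highest-scoring ones.
# chi2 requires non-negative feature values.
test = SelectKBest(score_func = chi2, k=4)
fit = test.fit(X,Y)
set_printoptions(precision = 2)
print(fit.scores_)

# Keep only the selected columns and show the first four rows.
featured_data = fit.transform(X)
print("\n Featured Data: \n", featured_data[0:4])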

[ 110.73 1406.59   17.5    51.01 2219.4   127.67    5.36  178.01]

 Featured Data: 
 [[ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
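The transformed array above no longer shows which columns were kept. As a minimal sketch, get_support() returns a boolean mask over the input columns that can be matched to names. The data and the feature names (f0 to f4) here are hypothetical stand-ins, not the Pima dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical stand-in data: 100 rows, 5 non-negative features
# (chi2 requires non-negative values).
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X_demo = rng.integers(0, 10, size=(100, 5)).astype(float)
# The label depends only on f1 and f3.
y_demo = (X_demo[:, 1] + X_demo[:, 3] > 9).astype(int)

selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# get_support() is True for every column that was kept.
mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
print(selected)
```

The same mask can be applied to df.columns[0:8] when the features come from a DataFrame.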


For feature selection, we will use scikit-learn (commonly imported as sklearn).
In this tutorial we will use sklearn to:
1. Remove features with low variance.
2. Select the k best features.
3. Select features using another model.

In [3]:
#Remove low variance features.
import sklearn.feature_selection as fs

# X is your feature matrix.
# threshold sets the minimum variance a feature needs in order to be kept.
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)

In [1]:
# Let's see an actual implementation of VarianceThreshold.
import sklearn.feature_selection as fs
import numpy as np 

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)
print("The original data")
print(X)
print("The processed data by variance threshold")
print(X_trans)

The original data
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]]
The processed data by variance threshold
[[0 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]]
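To see why the first column was dropped, it can help to look at the per-column variances. The sketch below repeats the same small matrix under the name X_small (a name introduced here for illustration) and reads the fitted variances_ attribute. The variances are 5/36 (about 0.139), 2/9 (about 0.222) and 0.25, so only the first column falls below the 0.2 threshold.

```python
import numpy as np
import sklearn.feature_selection as fs

X_small = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                    [0, 1, 1], [0, 1, 0], [0, 1, 1]])

selector = fs.VarianceThreshold(threshold=0.2).fit(X_small)

# Variance of each column, as computed during fit.
print(selector.variances_)
# True marks a column that is kept.
print(selector.get_support())
```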


In [13]:
import sklearn.datasets as datasets
from sklearn.feature_selection import SelectKBest, f_classif
X, y = datasets.make_classification(n_samples=300, n_features=10, n_informative=4)
# Use f_classif (ANOVA F-value) as the scoring function and keep the top k=3 features.
bk = SelectKBest(f_classif,k=3)
bk.fit(X, y)
X_trans = bk.transform(X)
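After fitting, the selector exposes the score of every feature, which makes it possible to check which ones were kept and by what margin. This is a small sketch on the same kind of synthetic data. The random_state and the names X_demo, y_demo, bk_demo and kept are assumptions added here so the result is reproducible.

```python
import sklearn.datasets as datasets
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = datasets.make_classification(n_samples=300,
                                              n_features=10,
                                              n_informative=4,
                                              random_state=0)

bk_demo = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)

# Column indices of the 3 selected features.
kept = bk_demo.get_support(indices=True)
print(kept)
# F-scores of the selected features; these are the 3 largest values in scores_.
print(bk_demo.scores_[kept])
```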

In [1]:
import os

import sklearn.datasets as datasets
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

X, y = datasets.make_classification(n_samples=500,
                                    n_features=20,
                                    n_informative=8,
                                    random_state=42)

# Split once, before selecting features, so the test set does not
# influence which features are chosen.
train_x, test_x, train_y, test_y = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

f1_list = []
k_values = range(1, 15)
for k in k_values:
    # Fit the selector on the training data only, then apply it to both sets.
    bk = SelectKBest(f_classif, k=k)
    train_x_k = bk.fit_transform(train_x, train_y)
    test_x_k = bk.transform(test_x)
    lr = LogisticRegression()
    lr.fit(train_x_k, train_y)
    y_pred = lr.predict(test_x_k)
    f1_list.append(metrics.f1_score(test_y, y_pred))

# Plot the F1-score against the number of selected features.
os.makedirs("output", exist_ok=True)  # savefig fails if the folder is missing
fig, axe = plt.subplots(dpi=300)
axe.plot(k_values, f1_list)
axe.set_xlabel("best k features")
axe.set_ylabel("F1-score")
fig.savefig("output/img.png")
plt.close(fig)
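The third approach listed earlier, selecting features with another model, is not demonstrated in the cells above. One possible sketch uses SelectFromModel with a random forest, keeping the features whose importance is at least the mean importance. The choice of RandomForestClassifier, the "mean" threshold and the names X_demo, y_demo, sfm and X_sel are assumptions made for illustration; any estimator that exposes feature_importances_ or coef_ could be used instead.

```python
import sklearn.datasets as datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = datasets.make_classification(n_samples=500,
                                              n_features=20,
                                              n_informative=8,
                                              random_state=42)

# The forest is fitted inside SelectFromModel; its feature importances
# decide which columns are kept.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
sfm = SelectFromModel(forest, threshold="mean").fit(X_demo, y_demo)

X_sel = sfm.transform(X_demo)
print(X_sel.shape)
# Indices of the columns that passed the threshold.
print(sfm.get_support(indices=True))
```

Unlike SelectKBest, the number of features kept is not fixed in advance; it depends on how the importances compare with the threshold.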