### Feature Selection For Machine Learning in Python with scikit-learn
Irrelevant or partially relevant features can hurt model performance. Selecting the features in your data that contribute most to the prediction target **reduces overfitting, improves accuracy, and reduces training time**.

Data set: SPECTF heart data, intended for a binary classification task.
267 instances (train + test) are described by 45 attributes (44 continuous independent + 1 binary dependent).
All fields are numeric and there is no header line.
Source: http://archive.ics.uci.edu/ml/machine-learning-databases/spect/

In [6]:
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
names = ["class","F1R", "F1S","F2R","F2S","F3R","F3S","F4R","F4S","F5R","F5S","F6R","F6S","F7R","F7S","F8R","F8S","F9R",
         "F9S","F10R","F10S","F11R","F11S","F12R","F12S","F13R","F13S","F14R","F14S","F15R","F15S","F16R","F16S",
         "F17R","F17S","F18R","F18S","F19R","F19S","F20R","F20S","F21R","F21S","F22R","F22S"] # class label + 44 feature names
# read train dataset
filename_train = 'SPECTF.train.txt'
dataframe_train = read_csv(filename_train, names=names)
array_train = dataframe_train.values # convert to arrays from pandas.DataFrame
print("Shape of train set: %s" % str(array_train.shape))
k_full = array_train.shape[1]-1
X_train = array_train[:, 1:45]
Y_train = array_train[:, 0:1]
# read test dataset
filename_test = 'SPECTF.test.txt'
dataframe_test = read_csv(filename_test, names=names)
array_test = dataframe_test.values # convert to arrays from pandas.DataFrame
print("Shape of test set: %s" % str(array_test.shape))
X_test = array_test[:, 1:45]
Y_test = array_test[:, 0:1]


Shape of train set: (187, 45)
Shape of test set: (80, 45)


#### 1. Univariate Selection
The chi-squared statistical test scores each non-negative feature against the class label; `SelectKBest` then keeps the k highest-scoring features.

In [14]:
# feature selection
k_truncated = int(round(k_full*0.90)) # keep 90% of the features
selector = SelectKBest(score_func=chi2, k=k_truncated) # keep the k_truncated features with the highest Chi-Squared statistics
fit = selector.fit(X_train, Y_train.ravel())
# summarize scores
set_printoptions(precision=3)
print(fit.scores_) # Chi-Squared statistics for each feature - https://www.khanacademy.org/math/ap-statistics/chi-square-tests/chi-square-goodness-fit/v/chi-square-statistic
features = fit.transform(X_train)

# apply the same selection to the test set
features_test = fit.transform(X_test)
# apply the logistic model to the full feature set
model_full = LogisticRegression()
model_full.fit(X_train, Y_train.ravel())
result_full = model_full.score(X_test, Y_test.ravel())
print("Accuracy using %d features in Logistic model: %.2f%%" % (k_full,(result_full*100.0)))

# apply the logistic model to the selected features
model_truncated = LogisticRegression()
model_truncated.fit(features, Y_train.ravel())
result_truncated = model_truncated.score(features_test, Y_test.ravel())
print("Accuracy using %d features in Logistic model: %.2f%%" % (k_truncated, (result_truncated*100.0)))
print("Eliminating %d%% of features increased model accuracy by %.1f%%" % (int(round((k_full-k_truncated)*100/k_full)), (result_truncated-result_full)*100.0))

[ 0.253  1.573  0.286  2.263  7.039 14.417  1.732  4.195  3.29   3.636
  0.247  1.527  1.74   5.17   9.4   11.533  1.367  0.903  2.395  0.973
  0.809  0.302  2.775  3.092 21.587 30.918  0.531  0.189  5.724  9.958
  0.295  3.155  1.06   3.024  2.381  6.533  0.676  1.18  12.682 18.267
 29.944 45.701 32.785 39.676]
Accuracy using 44 features in Logistic model: 52.50%
Accuracy using 40 features in Logistic model: 60.00%
Eliminating 9% of features increased model accuracy by 7.5%
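
To see which columns a chi-squared selection actually keeps, the boolean mask from `get_support()` can be mapped back to feature names. The following is a minimal sketch on synthetic data, not the SPECTF files: the array `X_demo`, the labels `y_demo`, and the names `f0`..`f5` are all made up for illustration, and column `f2` is deliberately shifted so that it depends on the class.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
# Synthetic stand-in data (assumption): 100 rows, 6 non-negative columns.
X_demo = rng.randint(0, 100, size=(100, 6)).astype(float)
y_demo = rng.randint(0, 2, size=100)
# Make column 2 informative by shifting it upward for the positive class.
X_demo[:, 2] += 50 * y_demo
demo_names = ["f0", "f1", "f2", "f3", "f4", "f5"]  # hypothetical names

# chi2 requires non-negative inputs; keep the 3 highest-scoring columns.
selector = SelectKBest(score_func=chi2, k=3).fit(X_demo, y_demo)
kept = [name for name, keep in zip(demo_names, selector.get_support()) if keep]
X_reduced = selector.transform(X_demo)

print("Kept features:", kept)
print("Reduced shape:", X_reduced.shape)
```

Because the selector is fitted once and then reused through `transform`, the same columns are kept for any later data set, which is what the notebook does for the test split.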




#### 2. Recursive Feature Elimination (RFE)
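RFE repeatedly fits the model, ranks the features by the model's coefficients (or feature importances), and removes the weakest ones until the requested number of features remains. This identifies which attributes (and combinations of attributes) contribute the most to predicting the target attribute.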

In [15]:
from sklearn.feature_selection import RFE

model_full = LogisticRegression()
k_full = array_train.shape[1]-1
k_truncated=int(round(k_full*0.9))
rfe = RFE(model_full, n_features_to_select=k_truncated) # recent scikit-learn versions require the keyword argument
fit = rfe.fit(X_train, Y_train.ravel())
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
features = fit.transform(X_train)
features_test = fit.transform(X_test)
print(features_test.shape)

# apply the logistic model to the full feature set
model_full.fit(X_train, Y_train.ravel())
result_full = model_full.score(X_test, Y_test.ravel())
print("Accuracy using %d features in Logistic model: %.2f%%" % (k_full,(result_full*100.0)))

# apply the logistic model to the selected features
model_truncated = LogisticRegression()
model_truncated.fit(features, Y_train.ravel())
result_truncated = model_truncated.score(features_test, Y_test.ravel())
print("Accuracy using %d features in Logistic model: %.2f%%" % (k_truncated, (result_truncated*100.0)))
print("Eliminating %d%% of features increased model accuracy by %.2f%%" % (int(round((k_full-k_truncated)*100/k_full)), (result_truncated-result_full)*100.0))


Num Features: 40
Selected Features: [ True  True  True  True  True  True  True  True  True  True  True False
  True  True  True  True  True  True False  True  True  True False  True
  True False  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True]
Feature Ranking: [1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 5 1 1 1 3 1 1 4 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1]
(80, 40)
Accuracy using 44 features in Logistic model: 52.50%
Accuracy using 40 features in Logistic model: 53.75%
Eliminating 9% of features increased model accuracy by 1.25%
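
The ranking printed above follows from how RFE works: selected features get rank 1, and each eliminated feature gets a higher rank the earlier it was dropped. A small sketch on synthetic data illustrates this; `make_classification`, the sizes, and `max_iter=1000` are illustrative assumptions rather than settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Drop one feature per iteration (step=1) until 4 remain.
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    step=1,
)
rfe.fit(X_demo, y_demo)
X_reduced = rfe.transform(X_demo)

print("Support mask:", rfe.support_)
print("Ranking:", rfe.ranking_)  # 1 = kept; larger = eliminated earlier
print("Reduced shape:", X_reduced.shape)
```

With 10 features, 4 kept, and one removal per step, six features are eliminated, so the ranks run from 1 up to 7.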




#### 3. Principal Component Analysis
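PCA uses linear algebra to project the dataset onto a smaller set of orthogonal directions (principal components) that capture the most variance. It is a dimensionality reduction technique rather than feature selection in the strict sense: each component is a linear combination of all the original features.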

In [20]:
from sklearn.decomposition import PCA
k_full = array_train.shape[1]-1
k_truncated=int(round(k_full*0.8))
pca = PCA(n_components=k_truncated)
fit = pca.fit(X_train)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
#print(fit.components_)
features = fit.transform(X_train)
features_test = fit.transform(X_test)

# apply the logistic model to the full feature set
model_full = LogisticRegression()
model_full.fit(X_train, Y_train.ravel())
result_full = model_full.score(X_test, Y_test.ravel())
print("Accuracy using %d features in Logistic model: %.2f%%" % (k_full,(result_full*100.0)))

# apply the logistic model to the principal components
model_truncated = LogisticRegression()
model_truncated.fit(features, Y_train.ravel())
result_truncated = model_truncated.score(features_test, Y_test.ravel())
print("Accuracy using %d features in Logistic model: %.2f%%" % (k_truncated, (result_truncated*100.0)))
print("Eliminating %d%% of features increased model accuracy by %.2f%%" % (int(round((k_full-k_truncated)*100/k_full)), (result_truncated-result_full)*100.0))


Explained Variance: [0.4   0.149 0.072 0.049 0.037 0.033 0.028 0.024 0.021 0.02  0.017 0.015
 0.013 0.011 0.01  0.009 0.009 0.008 0.007 0.007 0.006 0.005 0.005 0.004
 0.004 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002]
Accuracy using 44 features in Logistic model: 52.50%
Accuracy using 35 features in Logistic model: 62.50%
Eliminating 20% of features increased model accuracy by 10.00%
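
Instead of fixing the number of components in advance, the explained variance ratios can guide the choice. The sketch below, on synthetic data, asks `PCA` for enough components to cover 95% of the variance by passing a float as `n_components`; the data, the 95% threshold, and the split sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic stand-in data (assumption): 3 latent factors mixed into
# 8 observed columns, plus a small amount of noise.
latent = rng.normal(size=(150, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(150, 8))
X_fit, X_new = X_demo[:100], X_demo[100:]

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95)
Z_fit = pca.fit_transform(X_fit)  # fit on the first split only
Z_new = pca.transform(X_new)      # reuse the fitted projection

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", cumulative)
```

As in the notebook, the projection is learned from the training portion and only applied to the held-out portion.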


#### 4. Feature Importance
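An Extra-Trees classifier fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The fitted ensemble exposes a per-feature importance score that can be used to rank the features.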

In [23]:
from sklearn.ensemble import ExtraTreesClassifier
print(X_train.shape)
model_full = ExtraTreesClassifier()
model_full.fit(X_train, Y_train.ravel())
print(model_full.n_features_in_) # number of features seen during fit (n_features_ was removed in scikit-learn 1.2)
print(model_full.feature_importances_)
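# keep the features whose importance is above the mean importance
# (the previous `fit` object belonged to the PCA cell, not to this model)
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(model_full, prefit=True)
features = selector.transform(X_train)
features_test = selector.transform(X_test)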

(187, 44)
44
[0.031 0.033 0.022 0.016 0.027 0.024 0.022 0.023 0.037 0.041 0.019 0.006
 0.015 0.008 0.027 0.007 0.014 0.022 0.018 0.026 0.039 0.012 0.036 0.015
 0.013 0.042 0.035 0.01  0.015 0.018 0.007 0.03  0.026 0.013 0.022 0.001
 0.005 0.024 0.013 0.022 0.055 0.042 0.031 0.034]
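
The importance scores sum to 1 and can be sorted to pick the top-ranked columns. Below is a minimal sketch on synthetic data in which the label depends only on columns 0 and 3; the data, `n_estimators=200`, and the choice of keeping two features are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
# Synthetic stand-in data (assumption): 8 columns, label driven by 0 and 3.
X_demo = rng.normal(size=(300, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X_demo, y_demo)

# Sort feature indices from most to least important and keep the top 2.
order = np.argsort(forest.feature_importances_)[::-1]
top_k = order[:2]
X_reduced = X_demo[:, top_k]

print("Importances:", np.round(forest.feature_importances_, 3))
print("Top features:", top_k)
print("Reduced shape:", X_reduced.shape)
```

Fixing `random_state` makes the ranking reproducible; without it, importances of weak features can change order between runs.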
