## Ranking and selecting features

In this example, we'll demonstrate some of scikit-learn's scoring functions for ranking features by importance. We'll reuse our running example, the Adult dataset from the first exercise.

In [1]:
def read_names(filename):
    names = []
    types = []
    with open(filename) as f:
        for l in f:
            if l[0] == '|' or ':' not in l:
                continue
            cols = l.split(':')
            names.append(cols[0])
            if cols[1].startswith(' continuous.'):
                types.append(float)
            else:
                types.append(str)
    return names, types

def read_data(filename, col_names, col_types):
    X = []
    Y = []
    with open(filename) as f:
        for l in f:
            cols = l.strip('\n.').split(', ')
            if len(cols) < len(col_names):
                continue
            X.append( { n:t(c) for n, t, c in zip(col_names, col_types, cols) } )
            Y.append(cols[-1])
    return X, Y

col_names, col_types = read_names('datasets/adult.names')

Xtrain, Ytrain = read_data('datasets/adult.data', col_names, col_types)
Xtest, Ytest = read_data('datasets/adult.test', col_names, col_types)

As you might recall, the instances in this dataset consist of several features describing each individual.

In [2]:
Xtrain[0]

{'age': 39.0,
 'capital-gain': 2174.0,
 'capital-loss': 0.0,
 'education': 'Bachelors',
 'education-num': 13.0,
 'fnlwgt': 77516.0,
 'hours-per-week': 40.0,
 'marital-status': 'Never-married',
 'native-country': 'United-States',
 'occupation': 'Adm-clerical',
 'race': 'White',
 'relationship': 'Not-in-family',
 'sex': 'Male',
 'workclass': 'State-gov'}

We first convert the training set into numerical vectors.

In [3]:
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
dv.fit(Xtrain)

X_vec = dv.transform(Xtrain)
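
To make the conversion step concrete, here is a minimal pure-Python sketch of the idea behind `DictVectorizer`: numerical features keep their value in a single column, while each (feature, value) pair of a string-valued feature gets its own 0/1 column. The function name `one_hot_vectorize` and the toy records are assumptions made for illustration; the real class returns a sparse matrix and handles more cases.

```python
def one_hot_vectorize(records):
    """Toy version of the idea behind DictVectorizer (dense output)."""
    # One column per numerical feature, and one column per
    # (feature, value) pair of a string-valued feature.
    columns = set()
    for rec in records:
        for name, value in rec.items():
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    columns = sorted(columns)
    index = {col: i for i, col in enumerate(columns)}

    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for name, value in rec.items():
            if isinstance(value, str):
                row[index[f"{name}={value}"]] = 1.0   # indicator column
            else:
                row[index[name]] = float(value)       # numerical column
        rows.append(row)
    return columns, rows

# Hypothetical two-instance dataset with one numerical and one string feature.
toy = [
    {'age': 39.0, 'sex': 'Male'},
    {'age': 50.0, 'sex': 'Female'},
]
columns, rows = one_hot_vectorize(toy)
print(columns)
print(rows)
```

This is why the feature lists below contain names such as `sex=Male` next to plain numerical names such as `age`.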

The first scoring function we'll investigate is [mutual information](https://en.wikipedia.org/wiki/Mutual_information). The [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) describes how this scoring function works.

(To see the formula used to compute the mutual information score, see the [description](https://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html) in the book *Introduction to Information Retrieval* by Manning, Raghavan, and Schütze.)

We apply the scoring function to all the features and then print the 10 highest-scoring features. Please refer back to the perceptron example in the previous lecture for an explanation of the step where we sort the features by importance.

In [4]:
from sklearn.feature_selection import mutual_info_classif

feature_scores = mutual_info_classif(X_vec, Ytrain)

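# get_feature_names() was removed in scikit-learn 1.2; versions before 1.0
# only provide dv.get_feature_names().
for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)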

fnlwgt 0.393714477479
marital-status=Married-civ-spouse 0.105432234254
capital-gain 0.0833823721234
relationship=Husband 0.0808768411074
age 0.0687725396789
education-num 0.0648722276268
marital-status=Never-married 0.0619507241042
hours-per-week 0.0422833222022
relationship=Own-child 0.0382161042027
capital-loss 0.0369804845104
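
To illustrate what the score measures, here is a minimal sketch that computes the mutual information between a discrete feature and the class label directly from counts. It uses the natural logarithm, so the result is in nats. The function name `mutual_information` and the toy data are assumptions for illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to n_xy * n / (n_x * n_y).
        mi += (n_xy / n) * math.log(n_xy * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: four instances, two classes.
labels        = ['>50K', '>50K', '<=50K', '<=50K']
informative   = [1, 1, 0, 0]          # perfectly aligned with the label
uninformative = [1, 0, 1, 0]          # independent of the label
unique_ids    = [101, 102, 103, 104]  # a different value for every instance

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
print(mutual_information(unique_ids, labels))
```

Note that the feature with a unique value per instance scores as high as the perfectly informative one. This may help explain why `fnlwgt` tops the list above: as far as we can tell from the documentation of the `discrete_features` parameter, `mutual_info_classif` treats all features as discrete by default when the input is a sparse matrix, so a numerical feature with many distinct values can look deceptively informative.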


The second scoring function uses the $F$-statistic from an [ANOVA test](https://en.wikipedia.org/wiki/Analysis_of_variance).

The top-10 list produced by this scorer overlaps with the previous list, but the two are not identical.

In [5]:
from sklearn.feature_selection import f_classif

feature_scores = f_classif(X_vec, Ytrain)[0]

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

marital-status=Married-civ-spouse 8025.84206159
relationship=Husband 6240.01827621
education-num 4120.09577971
marital-status=Never-married 3674.20014657
age 1886.70731372
hours-per-week 1813.38628222
relationship=Own-child 1794.15748936
capital-gain 1709.15006374
sex=Female 1593.10790745
sex=Male 1593.10790745
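
As a rough illustration of the $F$-statistic, the following sketch computes the one-way ANOVA ratio for a single feature: the variance between the class means divided by the variance within the classes. The function name `f_statistic` and the toy data are assumptions for illustration.

```python
def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a single feature, grouped by label."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    ss_between = 0.0  # spread of the class means around the grand mean
    ss_within = 0.0   # spread of the values around their own class mean
    for group in groups.values():
        mean = sum(group) / len(group)
        ss_between += len(group) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in group)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical toy data: three instances per class.
labels    = ['>50K', '>50K', '>50K', '<=50K', '<=50K', '<=50K']
hours     = [50.0, 60.0, 55.0, 30.0, 40.0, 35.0]
is_male   = [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
is_female = [1.0 - v for v in is_male]

print(f_statistic(hours, labels))
print(f_statistic(is_male, labels), f_statistic(is_female, labels))
```

The last line shows why `sex=Female` and `sex=Male` receive the same score in the output above: the two indicator columns are complements of each other, so their between-class and within-class variances coincide.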


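The third feature scoring function is based on the well-known [$\chi^2$ statistical test](https://en.wikipedia.org/wiki/Chi-squared_test). Note that scikit-learn's `chi2` expects non-negative feature values, such as counts or Boolean indicators.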

In [6]:
from sklearn.feature_selection import chi2

feature_scores = chi2(X_vec, Ytrain)[0]

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

capital-gain 82192467.1415
capital-loss 1372145.8902
fnlwgt 171147.682865
age 8600.61182156
hours-per-week 6476.40899593
marital-status=Married-civ-spouse 3477.51587745
relationship=Husband 3114.94154603
education-num 2401.4217772
marital-status=Never-married 2218.52197657
relationship=Own-child 1435.87301604
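
The very large scores of `capital-gain`, `capital-loss` and `fnlwgt` are worth a closer look. Below is a minimal sketch, assuming that the statistic compares the per-class sum of a feature with the sum we would expect if the feature were independent of the class (which is our understanding of how `chi2` handles non-negative features). The function name `chi2_score` and the toy data are assumptions for illustration.

```python
def chi2_score(values, labels):
    """Chi-squared score of one non-negative feature against the labels."""
    n = len(values)
    total = sum(values)
    score = 0.0
    for c in set(labels):
        # Observed: sum of the feature values among instances of class c.
        observed = sum(v for v, y in zip(values, labels) if y == c)
        # Expected under independence: class proportion times overall sum.
        expected = total * labels.count(c) / n
        score += (observed - expected) ** 2 / expected
    return score

# Hypothetical toy data: the same quantity in two different units.
labels            = ['>50K', '>50K', '>50K', '<=50K', '<=50K', '<=50K']
gain_in_thousands = [5.0, 3.0, 4.0, 1.0, 0.0, 1.0]
gain_in_dollars   = [1000.0 * v for v in gain_in_thousands]

print(chi2_score(gain_in_thousands, labels))
print(chi2_score(gain_in_dollars, labels))
```

Under this assumption, multiplying a feature by 1000 multiplies its score by 1000 as well, so features measured on a large numerical scale can dominate the ranking regardless of how informative they are. Scores of this kind are easiest to compare between features on the same scale, such as the 0/1 indicator columns.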


In practice, when we want to use feature selection in scikit-learn, we simply plug a selector into our pipeline. `SelectKBest` and `SelectPercentile` are the most common selectors. They use a feature scoring function (such as the ones above) to rank the features; by default, the `f_classif` scoring function is used.

In [7]:
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.tree import DecisionTreeClassifier

pipeline = make_pipeline(
        DictVectorizer(),
        SelectKBest(k=100), # or SelectPercentile(...)
        DecisionTreeClassifier()
)
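
Conceptually, a selector such as `SelectKBest` scores every feature, ranks the features by score, and keeps only the top `k` columns. The ranking step can be sketched in a few lines of plain Python; the helper name `select_k_best` is hypothetical, and the scores are rounded from the `f_classif` output shown earlier.

```python
def select_k_best(scores, feature_names, k):
    """Return the names of the k highest-scoring features (toy sketch)."""
    ranked = sorted(zip(scores, feature_names), reverse=True)
    return [name for _, name in ranked[:k]]

# F-scores rounded from the f_classif output shown earlier.
names = ['age', 'education-num',
         'marital-status=Married-civ-spouse', 'relationship=Husband']
scores = [1886.7, 4120.1, 8025.8, 6240.0]

selected = select_k_best(scores, names, k=2)
print(selected)
```

Inside a pipeline, the selector is fitted on the training data only, so the same set of columns is then kept when the pipeline is applied to new instances.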