<h1>Feature Importances</h1>

<p>
    We often want to know whether every feature contributes equally to a model and, if not, which subset of features we should use.<br/>
    <strong>Answering this question is what we call feature selection.</strong>
</p>

<p>
    One common measure is the mean decrease in impurity.<br/>
    Recall that a random forest consists of many decision trees, and that at each node of a tree the split is chosen to maximize the decrease in impurity, typically Gini impurity or entropy in classification.<br/>
    For a single tree, we can therefore compute how much each feature decreases impurity, summed over the nodes that split on it.<br/>
    For a forest, the impurity decrease from each feature is then averaged over the trees.<br/>
    Treating this average as a measure of each feature's importance, we can rank and select features accordingly.
</p>
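<p>
    The computation described above can be sketched by hand. The example below is an illustration under stated assumptions, not scikit-learn's internal code: the helper <code>impurity_importances</code> and the <code>*_demo</code> variable names are hypothetical, and the forest is fitted on the full breast cancer dataset only to keep the sketch short. It walks each fitted tree, sums the weighted impurity decrease per feature, averages over the trees, and checks the result against <code>feature_importances_</code>.
</p>

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier


def impurity_importances(fitted_tree, n_features):
    """Hypothetical helper: normalized impurity decrease per feature for one tree."""
    t = fitted_tree.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf nodes have no children and do not split
            continue
        # Weighted impurity of the parent minus that of its two children
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += decrease
    total = importances.sum()
    return importances / total if total > 0 else importances


X_demo, y_demo = load_breast_cancer(return_X_y=True)
forest_demo = RandomForestClassifier(n_estimators=15, random_state=111)
forest_demo.fit(X_demo, y_demo)

# Average the per-tree importances over the forest, then rescale to sum to 1
per_tree = np.array(
    [impurity_importances(est, X_demo.shape[1]) for est in forest_demo.estimators_]
)
manual_importances = per_tree.mean(axis=0)
manual_importances /= manual_importances.sum()

# Compare the hand-computed values with the attribute scikit-learn exposes
matches = np.allclose(manual_importances, forest_demo.feature_importances_)
print(matches)
```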

<p>
    Scikit-learn exposes a <strong>feature_importances_</strong> attribute on the fitted model, which shows the relative importance of each feature. The scores are normalized so that they sum to 1.
</p>

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load the breast cancer dataset into a DataFrame with named columns
cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data['data'], columns=cancer_data['feature_names'])
df['target'] = cancer_data['target']

# Feature matrix (all 30 features) and target vector
X = df[cancer_data.feature_names].values
y = df['target'].values

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, random_state=111)

# Fit a small forest of 15 trees
rf = RandomForestClassifier(n_estimators=15, random_state=111)
rf.fit(X_train, y_train)

# Rank the impurity-based importances from highest to lowest
ft_imp = pd.Series(rf.feature_importances_, index=cancer_data.feature_names).sort_values(ascending=False)
print(ft_imp.head(10))
 

worst radius           0.162142
worst area             0.136277
mean concave points    0.132861
mean radius            0.089364
mean area              0.087997
worst perimeter        0.068513
mean concavity         0.065367
worst concavity        0.060150
radius error           0.030149
area error             0.023505
dtype: float64


<p>
    From the output, we can see that among all features, <strong>worst radius</strong> is the most important (about 0.16), followed by <strong>worst area</strong> and <strong>mean concave points</strong>.
</p>
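<p>
    As a rough sketch of how such a ranking could be turned into a feature subset, the example below keeps the <code>k</code> highest-ranked features. The cutoff <code>k = 5</code> is an arbitrary assumption made for illustration, and the variable names are hypothetical; the importances are computed on the training split only.
</p>

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)

# Rank features using the training split only, mirroring the cell above
feat_train, feat_test, target_train, target_test = train_test_split(
    features, data.target, random_state=111
)
model = RandomForestClassifier(n_estimators=15, random_state=111)
model.fit(feat_train, target_train)

ranked = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)

k = 5  # arbitrary cutoff, chosen only for illustration
top_features = ranked.head(k).index.tolist()

# Restrict the full DataFrame to the selected columns
features_top = features[top_features]
print(top_features)
print(features_top.shape)
```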

<strong>Note! In regression, feature importance is computed from the reduction in variance rather than in Gini impurity or entropy.</strong>
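
<p>
    A minimal sketch of the regression case follows. The diabetes dataset, the number of trees, and the variable names are assumptions chosen for illustration; the regressor is left on its default split criterion, which is based on squared error and so measures the variance of the target within a node.
</p>

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

diabetes = load_diabetes()

# Default criterion: squared error, i.e. splits are scored by variance reduction
reg = RandomForestRegressor(n_estimators=50, random_state=111)
reg.fit(diabetes.data, diabetes.target)

# The attribute has the same name and normalization as in classification
reg_imp = pd.Series(
    reg.feature_importances_, index=diabetes.feature_names
).sort_values(ascending=False)
print(reg_imp)
```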

<h3>New Model on Selected Features</h3>

<p>Why should we perform feature selection?</p>
<ul>
    <li>it enables us to train a model faster</li>
    <li>it reduces the complexity of a model, which makes it easier to interpret</li>
    <li>if the right subset is chosen, it can improve the accuracy of a model</li>
</ul>

<strong>Choosing the right subset often relies on domain knowledge, some art, and a bit of luck.</strong>
<p>
    In our dataset, we notice that features with "worst" in their names seem to have higher importances. We therefore build a new model below using only these features and see whether it improves accuracy.
</p>
<p>We first find the features whose names include the word "worst":</p>

In [2]:
worst_cols = [col for col in df.columns if 'worst' in col]
print(worst_cols)

['worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']


<p>
    There are ten such features. Now we create another dataframe with the selected features, followed by a train-test split with a fixed random state.
</p>

In [3]:
X_worst = df[worst_cols]
X_train, X_test, y_train, y_test = train_test_split(X_worst, y, random_state=101)

<p>Finally, we fit the model and output the accuracy.</p>

In [4]:
rf = RandomForestClassifier(random_state=101)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))

0.965034965034965
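
<p>
    To judge whether the subset helps, both feature sets need to be scored on the same split. The sketch below is one way to set up that comparison; the helper <code>holdout_accuracy</code> and the variable names are hypothetical, and a single hold-out split gives only a rough estimate.
</p>

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

bc = load_breast_cancer(as_frame=True)
all_features = bc.data
labels = bc.target
subset_cols = [col for col in all_features.columns if 'worst' in col]

# Split row positions once so both models see identical train and test rows
train_idx, test_idx = train_test_split(
    np.arange(len(all_features)), random_state=101
)


def holdout_accuracy(columns):
    """Hypothetical helper: fit on the shared training rows, score on the test rows."""
    clf = RandomForestClassifier(random_state=101)
    clf.fit(all_features[columns].iloc[train_idx], labels.iloc[train_idx])
    return clf.score(all_features[columns].iloc[test_idx], labels.iloc[test_idx])


acc_all = holdout_accuracy(list(all_features.columns))
acc_worst = holdout_accuracy(subset_cols)
print('all features:  ', acc_all)
print('worst features:', acc_worst)
```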


<strong>
    Here, using a subset of features did not improve the accuracy by much.<br/>
    But given that we used only a third of the features, and may have removed some noisy or redundant ones, we get a simpler model with fewer features, an advantage that becomes more pronounced when the sample size is large.
</strong>

<strong>Some conclusions around feature selection</strong>
<ul>
    <li>There is no best feature selection method, at least not universally.</li>
    <li>Instead, we must discover what works best for the specific problem and leverage domain expertise to build a good model.</li>
    <li>Scikit-learn provides an easy way to discover the feature importances.</li>
</ul>