
added ImportanceVisualizer for tree-based models #195

Closed · wants to merge 2 commits into base: develop

Conversation


tsterbak commented Apr 11, 2017

This PR follows #194 and adds:

  • a basic implementation of tree-based feature importances, as shown in the sklearn examples

TODO

  • create an example and visualize the output
  • make sure it is only used for tree-based models (RandomForest, ExtraTrees, GradientBoosting, DecisionTree, ...); see the type-check sketch after this list
  • unit tests
  • is the Visualizer in the right place?
  • all module rules satisfied
  • proper inheritance
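For the "tree-based models only" item, a minimal sketch of what such a guard could look like (this is just one option; duck-typing on the feature_importances_ attribute would be another, and none of this is the PR's actual code):

from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    ExtraTreesClassifier, ExtraTreesRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# whitelist of estimator types the visualizer would accept (an assumption,
# mirroring the models named in the TODO list above)
TREE_MODELS = (
    DecisionTreeClassifier, DecisionTreeRegressor,
    RandomForestClassifier, RandomForestRegressor,
    ExtraTreesClassifier, ExtraTreesRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
)

def check_tree_model(model):
    # raise early if the wrapped estimator is not tree-based
    if not isinstance(model, TREE_MODELS):
        raise TypeError(
            "{} is not a tree-based model".format(type(model).__name__)
        )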

@bbengfort added the ready label Apr 11, 2017

bbengfort commented Apr 12, 2017

@tsterbak would you mind posting an example and the resulting visualization in the comment thread?

As for the visualizer being in the right place: it absolutely is. However, we're planning to refactor the classifier module into a package (similar to regressor and cluster), so it would actually make things easier if you could put your visualizer in a file called tree.py that is a sibling to classifier.py; I'll then move it into the classifier directory when we refactor (which will probably happen right after we merge your PR). That way you can also write your tests in test_tree.py.
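For concreteness, a sketch of the layout described above (only tree.py, classifier.py, and test_tree.py come from the comment; the surrounding directory names are assumptions):

package/
    classifier.py   # existing classifier visualizers
    tree.py         # new ImportanceVisualizer goes here for now
tests/
    test_tree.py    # tests for the new visualizer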

I'll address the class hierarchy in the other issue since you brought it up there.

Looks good so far!

ianozsvald commented Jun 22, 2017

At the risk of adding noise: I'll note that I commonly write the following code to look at the most important features in a RandomForest:
[screenshot: horizontal bar chart of the top 20 feature importances, produced by the code below]

import matplotlib.pyplot as plt
import pandas as pd

# clf is a fitted RandomForest, X is the training DataFrame
df_feature_importances = pd.DataFrame(
    list(zip(clf.feature_importances_, X.columns)),
    columns=['importance', 'feature'],
)
df_feature_importances = df_feature_importances.sort_values(
    'importance', ascending=False
).set_index('feature')
nbr_features = 20

fig, ax = plt.subplots(figsize=(8, 6))
df_feature_importances[:nbr_features].plot(
    kind="barh", title="Feature importances", legend=None, ax=ax
)
ax.invert_yaxis()  # most important feature at the top
ax.set_xlabel("Importance")

# remove right/top border to make things slightly neater
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

# visual tidy-up: offset the left and bottom axes so small values
# on the left axis are slightly easier to read
ax.spines['bottom'].set_position(('axes', -0.05))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('axes', -0.05))

In #194 the sklearn visualisation was noted (http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html); it adds a standard deviation to each bar. A measure of spread sounds very useful, but I'm not sure that assuming a Gaussian is right (I only hold a weak opinion on this). Can we guarantee a Gaussian distribution for feature importances? Might it be skewed? Maybe a boxplot per feature is actually a better choice, as it is non-parametric and really quick to compute. I'm wondering if occasional outliers might skew the distribution and make it non-Normal.
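For illustration, a minimal sketch of the per-feature boxplot idea (reusing clf and X from the snippet above; the top-20 cut-off is an arbitrary assumption):

import pandas as pd

# per-tree importances: one row per estimator, one column per feature
per_tree = pd.DataFrame(
    [tree.feature_importances_ for tree in clf.estimators_],
    columns=X.columns,
)

# order features by mean importance and boxplot the top 20
top = per_tree.mean().sort_values(ascending=False).index[:20]
ax = per_tree[top].plot(kind="box", vert=False, figsize=(8, 6))
ax.invert_yaxis()  # most important feature at the top
ax.set_xlabel("Importance")

Being non-parametric, the boxes and whiskers stay honest even when a feature's per-tree importances are skewed or contain outliers.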

Looking at a problem I'm working on right now, with a RandomForest of 58 estimators: extracting the most important feature (X314 in the picture above) and plotting each tree's importance for that feature, I get:
[screenshot: histogram of the 58 per-tree importances for the top feature X314]

but if I take the second most important feature, then the mean and standard deviation would paint the wrong summarised picture (a subset of my estimators think this feature has very low value indeed):
[screenshot: histogram of per-tree importances for the second-ranked feature; a cluster of trees assigns it near-zero importance]

# per-tree importances for a single feature (column 288, labelled 'X315')
vals = pd.Series([tree.feature_importances_[288] for tree in clf.estimators_])
vals.plot(
    kind="hist",
    title="Mean importance for feature 288 'X315' == {:0.2f}".format(vals.mean()),
)

Edit: Olivier Grisel of sklearn agrees that, if we want to show a measure of spread, percentiles make more sense: https://mail.python.org/pipermail/scikit-learn/2017-June/001615.html
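A sketch of what that might look like with matplotlib, using the interquartile range for the whiskers (the 25th/75th percentiles and the top-20 cut are my arbitrary choices; clf and X are reused from above):

import numpy as np
import matplotlib.pyplot as plt

imps = np.array([tree.feature_importances_ for tree in clf.estimators_])
median = np.percentile(imps, 50, axis=0)
p25, p75 = np.percentile(imps, [25, 75], axis=0)

# horizontal bars at the median, whiskers spanning the 25th-75th percentiles
order = np.argsort(median)[::-1][:20]  # top 20 features by median importance
fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(
    range(len(order)), median[order],
    xerr=[median[order] - p25[order], p75[order] - median[order]],
)
ax.set_yticks(range(len(order)))
ax.set_yticklabels(X.columns[order])
ax.invert_yaxis()  # most important feature at the top
ax.set_xlabel("Importance")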

I also wonder if some additional tree information (e.g. n_estimators, max_depth) might be a useful addition to the plot, perhaps in the title?


bbengfort commented Mar 1, 2018

@tsterbak I'd like to merge this as an undocumented feature, then open an issue to carry it through to completion; let me know if you have any thoughts.


bbengfort commented Mar 2, 2018

See #58


tsterbak commented Mar 2, 2018

@bbengfort I guess this makes sense. I would like to finish it, since it has been lying around for nearly a year now... I'm not sure what still needs to be done; maybe an issue would be the right fit.

@bbengfort referenced this pull request Mar 3, 2018

Merged: FeatureImportances Visualizer #317 (2 of 5 tasks complete)
bbengfort commented Mar 3, 2018

@tsterbak I've actually put this together quickly, with tests and documentation, fitting it into the new structure; see #317. It is not a tree-specific visualizer, but it could use tree-specific things like the standard deviation or percentiles that @ianozsvald mentioned, as well as things like depth or n_estimators. I'll post this in the #194 issue that you started; perhaps you could subclass it and take it to the end.
