added ImportanceVisualizer for tree-based models #195
This PR follows #194 and adds:
@tsterbak would you mind posting an example and the resulting visualization in the comment thread?
As for the visualizer being in the right place, it absolutely is. However, we're planning on refactoring the classifier module into a package (similar to regressor and cluster), so it would actually be easier if you could put your visualizer in a file called tree.py that is a sibling to classifier.py. I'll then move it into the classifier directory when we refactor (which will probably occur right after we merge your PR). That way you can also write your tests in test_tree.py.
I'll address the class hierarchy in the other issue since you brought it up there.
Looks good so far!
In #194 the sklearn visualisation at http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html was noted; it adds a standard deviation to each bar. A measure of spread sounds very useful, but I'm not sure that assuming a Gaussian is right (I only have a weak opinion on this). Can we guarantee a Gaussian distribution for feature importances? Might it be skewed? Maybe a boxplot per feature is actually a better choice (it is non-parametric and really quick)? I'm wondering if occasional outliers might skew the distribution and make it non-Normal.
Looking at a problem I'm working on right now, with a RandomForest with 58 estimators, extracting the most important feature (X314 in the picture above) and plotting each tree's feature importance for this feature I get:
but if I take the second most important feature, then the mean and standard deviation would paint the wrong summarised picture (as a subset of my estimators think this feature has very low value indeed):
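For anyone who wants to look at the same thing on their own model, here is a minimal sketch of pulling out per-tree importances from a fitted forest. The synthetic dataset from `make_classification` is an assumption standing in for the real problem, and 58 trees is used only to mirror the number mentioned above; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data is an assumption; it stands in for the real dataset.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=58, random_state=0).fit(X, y)

# One row per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Rank features by mean importance across trees, most important first.
order = np.argsort(per_tree.mean(axis=0))[::-1]

# Compare parametric (mean/std) and non-parametric (median/min/max) summaries
# for the two top-ranked features.
for idx in order[:2]:
    values = per_tree[:, idx]
    print(
        f"feature {idx}: mean={values.mean():.3f} std={values.std():.3f} "
        f"median={np.median(values):.3f} "
        f"min={values.min():.3f} max={values.max():.3f}"
    )
```

If the mean and median for a feature differ noticeably, or the minimum sits far from the bulk of the values, that is a hint that a symmetric mean-plus-or-minus-std bar may be a poor summary for it.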
Edit - Olivier Grisel of sklearn agrees that using percentiles, if we want to show a measure of spread, makes more sense: https://mail.python.org/pipermail/scikit-learn/2017-June/001615.html
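A rough sketch of what a percentile-based spread could look like, assuming the per-tree importances are already available as an array with one row per tree and one column per feature. The helper name `percentile_summary`, the 25th/75th percentile defaults, and the toy numbers are all illustrative assumptions, not part of the PR.

```python
import numpy as np


def percentile_summary(per_tree, lower=25, upper=75):
    """Return (median, lower percentile, upper percentile) per feature.

    `per_tree` has shape (n_trees, n_features). Hypothetical helper.
    """
    per_tree = np.asarray(per_tree, dtype=float)
    median = np.median(per_tree, axis=0)
    lo, hi = np.percentile(per_tree, [lower, upper], axis=0)
    return median, lo, hi


# Toy importances (made up): the second feature is rated near zero by most
# trees and highly by a few, so its mean is pulled away from the median.
per_tree = np.array([
    [0.30, 0.01],
    [0.32, 0.02],
    [0.28, 0.01],
    [0.31, 0.40],
    [0.29, 0.42],
])

median, lo, hi = percentile_summary(per_tree)

# Asymmetric error bars: row 0 is the distance below the median,
# row 1 is the distance above it.
yerr = np.vstack([median - lo, hi - median])
print(median, lo, hi)
```

The `yerr` array has shape `(2, n_features)`, which is the form matplotlib's `bar(..., yerr=yerr)` accepts for asymmetric error bars, so the bars need not be symmetric around the centre value.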
I also wonder if some additional tree information (e.g. n_estimators, max_depth) might be a useful addition to the plot, perhaps in the title?
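One possible way to build such a title, sketched with scikit-learn's `get_params()`. The helper name `importance_title` and the title wording are hypothetical; parameters that a given model does not have are skipped.

```python
from sklearn.ensemble import RandomForestClassifier


def importance_title(model, params=("n_estimators", "max_depth")):
    """Build a plot title listing selected hyperparameters.

    Hypothetical helper: names missing from the model are left out.
    """
    settings = model.get_params()
    shown = ", ".join(
        f"{name}={settings[name]}" for name in params if name in settings
    )
    base = f"Feature importances: {type(model).__name__}"
    return f"{base} ({shown})" if shown else base


title = importance_title(RandomForestClassifier(n_estimators=58, max_depth=5))
print(title)
# Feature importances: RandomForestClassifier (n_estimators=58, max_depth=5)
```

Reading the values through `get_params()` means the title works on an unfitted estimator and does not depend on tree-specific attributes.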
@tsterbak I've actually quickly put this together with tests and documentation, fitting in with the new structure; see #317. This is not a tree-specific decorator, but it could use tree-specific things like the standard deviation or percentiles that @ianozsvald mentioned, and things like depth or n_estimators. I'll post this in the #194 issue that you started, and perhaps you could subclass this and take it to the end.