explain_prediction for XGBoost #117
Conversation
It turned out that they go as ABCABC, not as AABBCC
Also show them separately from BIAS (in a hacky way for now)
It cannot originate in the FeatureUnion
So they can be used by other code that wishes to support libraries that provide sklearn wrappers.
And remove FIXMEs in text: xgboost really does poorly with non-binary vectorizer and a small number of examples.
Current coverage is 97.26% (diff: 98.03%)

@@            master    #117    diff @@
========================================
  Files           34      34
  Lines         1641    1792    +151
  Methods          0       0
  Messages         0       0
  Branches       315     342     +27
========================================
+ Hits          1593    1743    +150
+ Misses          25      24      -1
- Partials        23      25      +2
Uh, I just discovered an important problem with how missing values are handled. It would also be nice to support XGBRegressor.
The only way to know whether a value is missing is to check X, but that requires XGBClassifier.missing to be set correctly.
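A minimal sketch of the check being discussed, assuming the xgboost convention that the `missing` parameter defaults to NaN (the `is_missing` helper is hypothetical, not part of eli5):

```python
import numpy as np

def is_missing(x, missing=np.nan):
    """Hypothetical check: a value counts as missing if it equals the
    classifier's ``missing`` marker, which is NaN by default, so a plain
    equality test is not enough."""
    if np.isnan(missing):
        return np.isnan(x)
    return x == missing

# NaN is missing under the default marker; a custom marker like -999
# must be compared explicitly.
assert is_missing(float('nan'))
assert not is_missing(1.0)
assert is_missing(-999.0, missing=-999.0)
```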
This way weights will always sum to score
@kmike this is ready for review; sorry for posting it too early - I had made everything much more complicated initially. The sum of all feature weights equals the score, as for linear models, and I think the produced explanations make sense in all cases (including intervals), except for the XOR tree. For example, in the case of intervals (
So at x0=0 and x0=2, x0 is a negative feature (the example is classified as 0 because of it), and it is positive at x0=1.
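The "weights sum to score" property comes from attributing, to each split feature along a decision path, the change in predicted value caused by taking that split (the idea from http://blog.datadive.net/interpreting-random-forests/). A minimal sketch of that bookkeeping, with a hypothetical `path_weights` helper rather than the actual eli5 code:

```python
def path_weights(path, leaf_value):
    """Attribute a leaf score to features along one decision path.

    ``path`` is a list of (split_feature, node_value) pairs from the root
    down to the last internal node; ``leaf_value`` is the leaf's score.
    Each split feature receives the change in value caused by taking that
    split, so bias + sum(weights) telescopes to the leaf score.
    """
    bias = path[0][1]  # the root's value acts as the bias term
    weights = {}
    for i, (feature, value) in enumerate(path):
        next_value = path[i + 1][1] if i + 1 < len(path) else leaf_value
        weights[feature] = weights.get(feature, 0.0) + (next_value - value)
    return bias, weights

bias, w = path_weights([('x0', 0.5), ('x1', 0.8)], leaf_value=1.2)
# The invariant under discussion: bias plus all feature weights
# reproduces the score exactly.
assert abs(bias + sum(w.values()) - 1.2) < 1e-9
```

With repeated splits on the same feature, contributions accumulate per feature, which is what makes interval-style trees explainable but leaves the XOR tree pathological.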
    [fs.get(f, 0.) for f in b.feature_names], dtype=np.float32)
return all_features / all_features.sum()

XGBRegressor.feature_importances_ = property(xgb_feature_importances)
hm, what do you think about making it a utility function that gets feature importances from XGBRegressor, instead of patching XGBRegressor directly?
Right, it's not great to patch it here. Feature importances are also used to determine the number of features, but it's better to pass them explicitly; I'll fix that, thanks!
Fixed in e92334d: the xgb_feature_importances function is now used directly, and num_features is passed explicitly.
res = Explanation(
    estimator=repr(xgb),
    method='decision paths',
    targets=[],
It'd be great to have a description as well, because the method is not obvious and there are caveats.
Yes, thanks for spotting it! We can also mention the problem with the XOR tree and that the weights sum to the score.
Looks great, thanks @lopuhin!
Would you mind updating the docs and README as well?
Thanks! Sure, I'll update them.
If you have a bit more time, there are two more "pony" requests :)
Yes, fixing the seed gives the same feature weights, but maybe it just means that xgboost is deterministic in this case :)
Right, I like that! I added this and the "single" clarifications in 7fe062e; will make another PR soon.
    x[len(vec_prefix):] if x.startswith(vec_prefix) else None)

def feature_fn(x):
    if (not isinstance(x, FormattedFeatureName)
Do you recall why FormattedFeatureName is needed? I rewrote the if statement into two statements in #110, and coverage shows that feature_fn never sees a FormattedFeatureName in our tests.
Right, there was no test for feature union text highlighting where missing features appeared; I added the missing test in #124.
The additional test showed my refactoring was incorrect, thanks! ;)
This follows an idea from http://blog.datadive.net/interpreting-random-forests/ (see also #114)
I did not check the issues from #114 (comment) yet, but apart from that it feels ready.
I use some functions from eli5.sklearn - I think it's fair to use them here because we are supporting explain_prediction for an sklearn wrapper of XGBoost. I just made some of them public. Maybe in the future, when we support more different vectorizers, it will make sense to use singledispatch for them too.
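For readers unfamiliar with the singledispatch idea mentioned above, here is a minimal sketch of how per-vectorizer dispatch could look. The `get_feature_names` function and `CountVectorizerLike` class are hypothetical stand-ins, not eli5 or sklearn APIs:

```python
from functools import singledispatch

@singledispatch
def get_feature_names(vec):
    """Fallback for vectorizer types we do not know how to handle."""
    raise TypeError('unsupported vectorizer: %r' % (vec,))

class CountVectorizerLike:
    """Stand-in mimicking a fitted sklearn CountVectorizer's
    ``vocabulary_`` attribute (term -> column index)."""
    vocabulary_ = {'spam': 0, 'ham': 1}

@get_feature_names.register(CountVectorizerLike)
def _(vec):
    # Order feature names by their column index in the vocabulary.
    return sorted(vec.vocabulary_, key=vec.vocabulary_.get)

names = get_feature_names(CountVectorizerLike())
```

Each new vectorizer type then only needs its own `@get_feature_names.register(...)` implementation, with no isinstance chains in the calling code.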