
explain_prediction for XGBoost #117

Merged · 28 commits into master · Dec 22, 2016

Conversation

@lopuhin (Contributor) commented Dec 21, 2016

This follows an idea from http://blog.datadive.net/interpreting-random-forests/ (see also #114)

I did not check the issues from #114 (comment) yet, but apart from that it feels ready.

I use some functions from eli5.sklearn - I think it's fair to use them here because we are supporting explain_prediction for an sklearn wrapper of XGBoost; I just made some of them public. In the future, when we support more kinds of vectorizers, it may make sense to use singledispatch for them too.
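
For readers who skip the blog post: the method distributes a tree's prediction over the features used along the decision path. A minimal sketch of the idea (the Node structure here is hypothetical; the real implementation works on XGBoost's own trees):

```python
# A minimal sketch of the decision-paths idea; the Node structure is
# hypothetical, eli5's actual code works on XGBoost's dumped trees.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: float                        # expected score at this node
    feature: Optional[int] = None       # split feature index (None for leaves)
    threshold: float = 0.0
    left: Optional['Node'] = None
    right: Optional['Node'] = None

def explain_tree_prediction(root: Node, x) -> dict:
    """Follow x down one tree; credit each change in expected score
    to the feature that was split on."""
    contributions = {}
    node = root
    while node.left is not None:        # internal node
        child = node.left if x[node.feature] < node.threshold else node.right
        contributions[node.feature] = (
            contributions.get(node.feature, 0.0) + child.value - node.value)
        node = child
    # bias + sum of contributions == leaf value == the tree's score for x
    return {'<BIAS>': root.value, **contributions}
```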

@codecov-io commented Dec 21, 2016

Current coverage is 97.26% (diff: 98.03%)

Merging #117 into master will increase coverage by 0.19%

@@             master       #117   diff @@
==========================================
  Files            34         34          
  Lines          1641       1792   +151   
  Methods           0          0          
  Messages          0          0          
  Branches        315        342    +27   
==========================================
+ Hits           1593       1743   +150   
+ Misses           25         24     -1   
- Partials         23         25     +2   
Diff coverage by file:

 91%  eli5/_feature_names.py
 93%  eli5/sklearn/utils.py
 99%  eli5/xgboost.py
100%  eli5/sklearn/explain_prediction.py
100%  eli5/sklearn/text.py
100%  eli5/formatters/html.py
100%  eli5/formatters/utils.py

Last update 27609f9...e9154dd

@lopuhin (Contributor, Author) commented Dec 21, 2016

Uh, I just discovered an important problem with how missing values are handled.

And also it would be nice to support XGBRegressor too.

@lopuhin (Contributor, Author) commented Dec 21, 2016

@kmike this is ready for review - sorry for posting it too early; I had initially made everything much more complicated.

The sum of all feature weights is equal to the score, as for linear models, and I think the produced explanations make sense in all cases (including intervals), except for the XOR tree.

For example, in the case of intervals (py.test tests/test_xgboost.py::test_explain_prediction_clf_interval -s), where it is good to have x0 close to 1 and x1 is irrelevant, the following explanations are produced:

[0, 1]
Explained as: decision paths
y=True (probability=0.002, score=-6.481) top features
----------------
  -0.569  <BIAS>
  -0.591  x1    
  -5.321  x0    

[1, 1]
Explained as: decision paths
y=True (probability=0.996, score=5.648) top features
----------------
  +6.567  x0    
  -0.351  x1    
  -0.569  <BIAS>

[2, 1]
Explained as: decision paths
y=True (probability=0.001, score=-6.537) top features
----------------
  -0.490  x1    
  -0.569  <BIAS>
  -5.478  x0    

So at x0=0 and x0=2, x0 is a negative feature (the example is classified as 0 because of it), while at x0=1 it is positive.
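
An explanation like the above can be reproduced along these lines (the dataset, seed, and parameters are illustrative rather than copied from the test, and the feature_names argument is assumed to be accepted here):

```python
import numpy as np
import eli5
from xgboost import XGBClassifier

rng = np.random.RandomState(42)
X = rng.randint(0, 3, size=(1000, 2)).astype(np.float32)
y = (X[:, 0] == 1).astype(int)   # good to have x0 == 1; x1 is irrelevant

clf = XGBClassifier().fit(X, y)
expl = eli5.explain_prediction(clf, X[0], feature_names=['x0', 'x1'])
print(eli5.format_as_text(expl))
```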

@lopuhin changed the title from "WIP: explain_prediction for XGBoost" to "explain_prediction for XGBoost" on Dec 21, 2016
        [fs.get(f, 0.) for f in b.feature_names], dtype=np.float32)
    return all_features / all_features.sum()

XGBRegressor.feature_importances_ = property(xgb_feature_importances)
@kmike (Contributor):

hm, what do you think about making it a utility function that gets feature importances from XGBRegressor, instead of patching XGBRegressor directly?

@lopuhin (Contributor, Author):

Right, it's not great to patch it here. Feature importances are also used to determine the number of features, but it's better to pass them explicitly - I'll fix that, thanks!

@lopuhin (Contributor, Author):

Fixed in e92334d: the xgb_feature_importances function is now used directly, and num_features is passed explicitly.
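
For context, the resulting utility function looks roughly like this (reconstructed from the diff excerpt above; Booster.get_fscore() and Booster.feature_names are real xgboost attributes, while the surrounding wiring is assumed):

```python
import numpy as np

def xgb_feature_importances(xgb):
    b = xgb.booster()
    # get_fscore() maps feature name -> number of times the feature
    # is used for splitting across all trees
    fs = b.get_fscore()
    all_features = np.array(
        [fs.get(f, 0.) for f in b.feature_names], dtype=np.float32)
    return all_features / all_features.sum()

# called directly instead of patching XGBRegressor; `reg` is a fitted
# XGBRegressor (hypothetical variable)
importances = xgb_feature_importances(reg)
num_features = len(importances)
```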

res = Explanation(
    estimator=repr(xgb),
    method='decision paths',
    targets=[],
@kmike (Contributor):

it'd be great to have a description as well, because the method is not obvious, and there are caveats

@lopuhin (Contributor, Author):

Yes, thanks for spotting it! We can also mention the problem with the XOR tree and that the weights sum to the score.

@lopuhin (Contributor, Author):

@kmike I added a description and the caveats in a9c692e - does it make sense? Is there anything else we should add?
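
The shape of that change, with illustrative wording (the actual string committed in a9c692e may differ):

```python
DESCRIPTION = (                 # illustrative text, not the committed string
    'Feature weights are calculated by following decision paths in trees '
    'of an ensemble; the contribution of a feature is the change in the '
    'expected score at each node where it is used for splitting. Weights '
    'of all features sum to the output score, but the method may be '
    'unreliable when most trees split on the same feature at the top.'
)
res = Explanation(
    estimator=repr(xgb),
    method='decision paths',
    description=DESCRIPTION,
    targets=[],
)
```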

@kmike (Contributor) commented Dec 22, 2016

Looks great, thanks @lopuhin!

@kmike (Contributor) commented Dec 22, 2016

Would you mind updating docs and README as well?

@lopuhin (Contributor, Author) commented Dec 22, 2016

Thanks! Sure, I'll update them.

@kmike (Contributor) commented Dec 22, 2016

If you have a bit more time, there are two more "pony" requests :)
First, it'd be nice to support explain_prediction for sklearn ensembles as well, while you're at it. Second, it'd be great to have a docs chapter that explains the method you're using in more detail (similar to how the blog post explains it), with a link to the blog post.
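
On the first request: the same computation ports naturally to sklearn, since its trees expose their structure directly. A sketch for one tree of a fitted binary classifier (tree_, children_left, children_right, feature, threshold, and value are real sklearn attributes; this is not the code that later landed in eli5):

```python
def tree_contributions(estimator, x):
    t = estimator.tree_
    proba = lambda n: t.value[n][0] / t.value[n][0].sum()  # class fractions
    node, contrib = 0, {}
    while t.children_left[node] != -1:   # -1 marks a leaf
        f = t.feature[node]
        child = (t.children_left[node] if x[f] <= t.threshold[node]
                 else t.children_right[node])
        # credit the change in P(class=1) to the split feature
        contrib[f] = contrib.get(f, 0.0) + proba(child)[1] - proba(node)[1]
        node = child
    return contrib
```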

@lopuhin (Contributor, Author) commented Dec 22, 2016

> Can you get different variables on top by changing the dataset (changing the random seed used to generate the dataset)?

Yes, fixing the seed gives the same feature weights, but maybe that just means xgboost is deterministic in this case :)

> Maybe we should just say that in xor problems it could work, but it is not reliable if most trees happen to choose the same feature at the top?

Right, I like that! I added this and the "single" clarifications in 7fe062e, will make another PR soon.
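
A worked example of why xor is the hard case (not from the PR): suppose a single tree fits y = x0 XOR x1 and happens to split on x0 first. Both children of the root still have mean y = 0.5, so the x0 split changes nothing and all the credit lands on x1:

```python
# contributions for x = (0, 1) under the decision-paths rule,
# for the xor tree described above (illustrative numbers)
bias = 0.5                    # root expected value
c_x0 = 0.5 - 0.5              # root -> "x0=0" child: mean unchanged, so 0.0
c_x1 = 1.0 - 0.5              # "x0=0" child -> leaf y=1: +0.5 credited to x1
score = bias + c_x0 + c_x1    # 1.0 == leaf value; x0 looks irrelevant
```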

        x[len(vec_prefix):] if x.startswith(vec_prefix) else None)

def feature_fn(x):
    if (not isinstance(x, FormattedFeatureName)
@kmike (Contributor) commented Dec 27, 2016:

Do you recall why FormattedFeatureName is needed? I rewrote the if statement into two statements in #110, and coverage shows feature_fn never sees FormattedFeatureName in our tests.

@lopuhin (Contributor, Author):

Right, there was no test for feature union text highlighting where missing features appeared; I added the missing test in #124.

@kmike (Contributor):

The additional test showed my refactoring was incorrect, thanks! ;)

@kmike added this to the 0.3 milestone on Jan 9, 2017