
explain_prediction for XGBoost #117

Merged · 28 commits into master · Dec 22, 2016

Conversation

@lopuhin (Contributor) commented Dec 21, 2016

This follows an idea from http://blog.datadive.net/interpreting-random-forests/ (see also #114)

I did not check the issues from #114 (comment) yet, but apart from that it feels ready.

I use some functions from eli5.sklearn - I think it's fair to use them here because we are supporting explain_prediction for an sklearn wrapper of XGBoost; I just made some of them public. In the future, when we support more kinds of vectorizers, it may make sense to use singledispatch for them too.
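
For readers who skip the blog post: the method distributes a tree's prediction over the features used along the decision path. A minimal sketch of the idea (the Node structure here is hypothetical; the real implementation works on XGBoost's own trees):

```python
# A minimal sketch of the decision-paths idea; the Node structure is
# hypothetical, eli5's actual code works on XGBoost's dumped trees.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: float                        # expected score at this node
    feature: Optional[int] = None       # split feature index (None for leaves)
    threshold: float = 0.0
    left: Optional['Node'] = None
    right: Optional['Node'] = None

def explain_tree_prediction(root: Node, x) -> dict:
    """Follow x down one tree; credit each change in expected score
    to the feature that was split on."""
    contributions = {}
    node = root
    while node.left is not None:        # internal node
        child = node.left if x[node.feature] < node.threshold else node.right
        contributions[node.feature] = (
            contributions.get(node.feature, 0.0) + child.value - node.value)
        node = child
    # bias + sum of contributions == leaf value == the tree's score for x
    return {'<BIAS>': root.value, **contributions}
```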

@codecov-io commented Dec 21, 2016

Current coverage is 97.26% (diff: 98.03%)

Merging #117 into master will increase coverage by 0.19%

@@             master       #117   diff @@
==========================================
  Files            34         34          
  Lines          1641       1792   +151   
  Methods           0          0          
  Messages          0          0          
  Branches        315        342    +27   
==========================================
+ Hits           1593       1743   +150   
+ Misses           25         24     -1   
- Partials         23         25     +2   
Diff coverage by file:

 91%  eli5/_feature_names.py
 93%  eli5/sklearn/utils.py
 99%  eli5/xgboost.py
100%  eli5/sklearn/explain_prediction.py
100%  eli5/sklearn/text.py
100%  eli5/formatters/html.py
100%  eli5/formatters/utils.py

Last update 27609f9...e9154dd

@lopuhin (Contributor, Author) commented Dec 21, 2016

Uh, I just discovered an important problem with how missing values are handled.

And also it would be nice to support XGBRegressor too.

@lopuhin (Contributor, Author) commented Dec 21, 2016

@kmike this is ready for review - sorry for posting it too early; I had initially made everything much more complicated.

The sum of all feature weights is equal to the score, as for linear models, and I think the produced explanations make sense in all cases (including intervals), except for the XOR tree.

For example, in the case of intervals (py.test tests/test_xgboost.py::test_explain_prediction_clf_interval -s), where it is good to have x0 close to 1 and x1 is irrelevant, the following explanations are produced:

[0, 1]
Explained as: decision paths
y=True (probability=0.002, score=-6.481) top features
----------------
  -0.569  <BIAS>
  -0.591  x1    
  -5.321  x0    

[1, 1]
Explained as: decision paths
y=True (probability=0.996, score=5.648) top features
----------------
  +6.567  x0    
  -0.351  x1    
  -0.569  <BIAS>

[2, 1]
Explained as: decision paths
y=True (probability=0.001, score=-6.537) top features
----------------
  -0.490  x1    
  -0.569  <BIAS>
  -5.478  x0    

So at x0=0 and x0=2, x0 is a negative feature (the example is classified as 0 because of it), while at x0=1 it is positive.
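
An explanation like the above can be reproduced along these lines (the dataset, seed, and parameters are illustrative rather than copied from the test, and the feature_names argument is assumed to be accepted here):

```python
import numpy as np
import eli5
from xgboost import XGBClassifier

rng = np.random.RandomState(42)
X = rng.randint(0, 3, size=(1000, 2)).astype(np.float32)
y = (X[:, 0] == 1).astype(int)   # good to have x0 == 1; x1 is irrelevant

clf = XGBClassifier().fit(X, y)
expl = eli5.explain_prediction(clf, X[0], feature_names=['x0', 'x1'])
print(eli5.format_as_text(expl))
```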

@lopuhin changed the title from "WIP: explain_prediction for XGBoost" to "explain_prediction for XGBoost" on Dec 21, 2016
        [fs.get(f, 0.) for f in b.feature_names], dtype=np.float32)
    return all_features / all_features.sum()

XGBRegressor.feature_importances_ = property(xgb_feature_importances)
@kmike (Contributor):

hm, what do you think about making it a utility function that gets feature importances from XGBRegressor, instead of patching XGBRegressor directly?

@lopuhin (Contributor, Author):

Right, it's not great to patch it here. Feature importances are also used to determine the number of features, but it's better to pass them explicitly - I'll fix that, thanks!

@lopuhin (Contributor, Author):

Fixed in e92334d: the xgb_feature_importances function is now used directly, and num_features is passed explicitly.
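
For context, the resulting utility function looks roughly like this (reconstructed from the diff excerpt above; Booster.get_fscore() and Booster.feature_names are real xgboost attributes, while the surrounding wiring is assumed):

```python
import numpy as np

def xgb_feature_importances(xgb):
    b = xgb.booster()
    # get_fscore() maps feature name -> number of times the feature
    # is used for splitting across all trees
    fs = b.get_fscore()
    all_features = np.array(
        [fs.get(f, 0.) for f in b.feature_names], dtype=np.float32)
    return all_features / all_features.sum()

# called directly instead of patching XGBRegressor; `reg` is a fitted
# XGBRegressor (hypothetical variable)
importances = xgb_feature_importances(reg)
num_features = len(importances)
```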

res = Explanation(
    estimator=repr(xgb),
    method='decision paths',
    targets=[],
@kmike (Contributor):

it'd be great to have a description as well, because the method is not obvious, and there are caveats

@lopuhin (Contributor, Author):

Yes, thanks for spotting it! We can also mention the problem with the XOR tree and that the weights sum to the score.

@lopuhin (Contributor, Author):

@kmike I added a description and the caveats in a9c692e - does it make sense? Is there anything else we should add?
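
The shape of that change, with illustrative wording (the actual string committed in a9c692e may differ):

```python
DESCRIPTION = (                 # illustrative text, not the committed string
    'Feature weights are calculated by following decision paths in trees '
    'of an ensemble; the contribution of a feature is the change in the '
    'expected score at each node where it is used for splitting. Weights '
    'of all features sum to the output score, but the method may be '
    'unreliable when most trees split on the same feature at the top.'
)
res = Explanation(
    estimator=repr(xgb),
    method='decision paths',
    description=DESCRIPTION,
    targets=[],
)
```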

@kmike (Contributor) commented Dec 22, 2016

Looks great, thanks @lopuhin!

@kmike (Contributor) commented Dec 22, 2016

Would you mind updating docs and README as well?

@lopuhin (Contributor, Author) commented Dec 22, 2016

Thanks! Sure, I'll update them.

@kmike (Contributor) commented Dec 22, 2016

If you have a bit more time, there are two more "pony" requests :)
First, it'd be nice to support explain_prediction for sklearn ensembles as well, while you're at it. Second, it'd be great to have a docs chapter that explains the method you're using in more detail (similar to how the blog post explains it), with a link to the blog post.
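
On the first request: the same computation ports naturally to sklearn, since its trees expose their structure directly. A sketch for one tree of a fitted binary classifier (tree_, children_left, children_right, feature, threshold, and value are real sklearn attributes; this is not the code that later landed in eli5):

```python
def tree_contributions(estimator, x):
    t = estimator.tree_
    proba = lambda n: t.value[n][0] / t.value[n][0].sum()  # class fractions
    node, contrib = 0, {}
    while t.children_left[node] != -1:   # -1 marks a leaf
        f = t.feature[node]
        child = (t.children_left[node] if x[f] <= t.threshold[node]
                 else t.children_right[node])
        # credit the change in P(class=1) to the split feature
        contrib[f] = contrib.get(f, 0.0) + proba(child)[1] - proba(node)[1]
        node = child
    return contrib
```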

@lopuhin (Contributor, Author) commented Dec 22, 2016

> Can you get different variables on top by changing the dataset (changing the random seed used to generate the dataset)?

Yes, fixing the seed gives the same feature weights, but maybe that just means xgboost is deterministic in this case :)

> Maybe we should just say that in xor problems it could work, but it is not reliable if most trees happen to choose the same feature at the top?

Right, I like that! I added this and the "single" clarifications in 7fe062e, will make another PR soon.
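
A worked example of why xor is the hard case (not from the PR): suppose a single tree fits y = x0 XOR x1 and happens to split on x0 first. Both children of the root still have mean y = 0.5, so the x0 split changes nothing and all the credit lands on x1:

```python
# contributions for x = (0, 1) under the decision-paths rule,
# for the xor tree described above (illustrative numbers)
bias = 0.5                    # root expected value
c_x0 = 0.5 - 0.5              # root -> "x0=0" child: mean unchanged, so 0.0
c_x1 = 1.0 - 0.5              # "x0=0" child -> leaf y=1: +0.5 credited to x1
score = bias + c_x0 + c_x1    # 1.0 == leaf value; x0 looks irrelevant
```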

        x[len(vec_prefix):] if x.startswith(vec_prefix) else None)

def feature_fn(x):
    if (not isinstance(x, FormattedFeatureName)
@kmike (Contributor) commented Dec 27, 2016:

Do you recall why FormattedFeatureName is needed? I rewrote the if statement into two statements in #110, and coverage shows feature_fn never sees FormattedFeatureName in our tests.

@lopuhin (Contributor, Author):

Right, there was no test for feature union text highlighting where missing features appeared; I added the missing test in #124.

@kmike (Contributor):

The additional test showed my refactoring was incorrect, thanks! ;)

@kmike added this to the 0.3 milestone on Jan 9, 2017