
Shapley Values #52

Closed
kmedved opened this issue Sep 9, 2021 · 2 comments

Comments

kmedved commented Sep 9, 2021

As I'm sure you're aware, Shapley values, via the shap package, are a common way of understanding GBDT model outputs, which is often key for stakeholder buy-in.

I'm having some difficulty getting xgboost-distribution to work well with the shap package. I've put together an example here on Colab. While shap will accept the xgboost-distribution model (if you extract the underlying booster), the predictions generated by xgboost-distribution don't align with the Shapley plots for the individual predictions. You can see this in cell 7 of the Colab notebook above.

I've also put in a comparison with ngboost's functionality. As you can see in the notebook, the ngboost outputs match the Shap plots for the individual predictions.

I don't have a good understanding of what's driving this. My best guess is that shap is getting tripped up somewhere by the variance estimates which xgboost-distribution outputs. Note also that I am using model.get_booster() here, since shap will not accept a native xgboost-distribution object (it will accept an ngboost object). Also, if helpful, ngboost added support for shap in the pull requests mentioned here: stanfordmlgroup/ngboost#5

Thanks - any assistance here would be helpful.

@CDonnerer
Owner

Hi @kmedved,

Thanks for raising!

The reason the outputs don't match up is that the XGBDistribution model holds internal base values (self._starting_params), which the booster does not keep. These need to be passed to SHAP to get correct values.

Example:

import shap

from xgboost_distribution import XGBDistribution

model = XGBDistribution()
model.fit(X_train, y_train)

booster = model.get_booster()
explainer = shap.TreeExplainer(booster, X_train)
shap_values = explainer.shap_values(X_train)

# add XGBDistribution base values to get correct SHAP values
base_value = model._starting_params[0] + explainer.expected_value[0]  

shap.initjs()
shap.force_plot(
    base_value=base_value,
    shap_values=shap_values[0][0, :], 
    features=X_train.iloc[0, :]
)
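To see why this shift is needed: SHAP values satisfy an additivity identity, where the base value plus the sum of the per-feature contributions equals the model's raw prediction. Since XGBDistribution's prediction is the booster's margin plus its internal starting param, the SHAP base value has to absorb that offset. A minimal numeric sketch (the numbers are made up for illustration, not real model output):

```python
# Illustrative numbers only -- not real model outputs.
starting_param = 2.5  # XGBDistribution's internal base value (self._starting_params[0])
booster_base = 0.3    # explainer.expected_value[0], computed from the raw booster
contributions = [0.4, -0.1, 0.2]  # per-feature SHAP values for one sample

# Reconstructing a prediction from the booster alone misses the internal base value:
booster_pred = booster_base + sum(contributions)

# The full XGBDistribution prediction adds the starting param on top:
model_pred = starting_param + booster_pred

# Shifting the SHAP base value by starting_param restores additivity:
base_value = starting_param + booster_base
assert abs(base_value + sum(contributions) - model_pred) < 1e-12
```

This is exactly why the force plot looks misaligned with the booster's expected value alone: every prediction is offset by the starting param, so the corrected `base_value` re-centers the plot.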

There's a small subtlety to get correct results when using early stopping:

...

model = XGBDistribution()
model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        early_stopping_rounds=10,
)
booster = model.get_booster()
booster = booster[:model.best_ntree_limit]  # slice booster to best ntree limit

...

I hope this helps! Ideally, the above logic should go into the SHAP package so that it works natively with XGBDistribution; I'll have a look into this.

@kmedved
Author

kmedved commented Sep 13, 2021

Thank you - this is very helpful.
