KernelShap redundant/incorrect API for OHE of the categorical features - grouping vs aggregation by summation #879

RobertSamoilescu opened this issue Feb 23, 2023 · 0 comments
There are multiple ways to deal with one-hot encoding of the categorical features when using the KernelShap explainer.

  • The first method is straightforward: incorporate the preprocessor into the prediction function, so that the explainer operates directly on the raw (untransformed) features. This approach looks like this:

```python
def predictor(X):
    return model.predict_proba(preprocessor.transform(X))
```
  • The second method is to use grouping on the one-hot-encoded representation. This approach can group multiple columns and treat them as one. For example, if we have a categorical column fc with 3 categories, its OHE representation results in 3 columns fc_1, fc_2, fc_3. Without grouping, KernelShap treats each fc_i as an individual player/feature and computes 3 values instead of one. With grouping, we can tell KernelShap to treat the 3 columns [fc_1, fc_2, fc_3] as a single player. Note that the resulting Shap values should match the ones from the first approach (i.e., incorporating the preprocessor). A sketch of both the grouping and the aggregation APIs follows this list.

  • alibi exposes a third method to compute the Shap values, using aggregation by summation. Following the example from the second bullet point, given the OHE representation as input, KernelShap computes the Shap values for each fc_i and then aggregates them into a single value by summing them up (i.e., fc = fc_1 + fc_2 + fc_3). Unfortunately, this method is not correct, since the values obtained by summation do not converge to the true Shap values, unlike the first two approaches. This is probably a heuristic borrowed from TreeShap, which cannot use the first two approaches. That being said, we should consider removing this method, since it is both redundant and incorrect: the same result can be achieved correctly via grouping (the second bullet point).
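To make the difference concrete, below is a minimal sketch of the grouping API (second bullet) next to the summation-based aggregation (third bullet). It assumes a single categorical column fc with 3 categories occupying OHE columns 0–2, followed by one numerical column; model, X_ohe and X_background_ohe are hypothetical placeholders.

```python
# Minimal sketch, assuming a single categorical column `fc` with 3 categories
# occupying OHE columns 0-2, followed by one numerical column in column 3.
# `model`, `X_ohe` and `X_background_ohe` are hypothetical placeholders.
from alibi.explainers import KernelShap

def predictor_ohe(X):
    # prediction function operating directly on the OHE representation
    return model.predict_proba(X)

# Second bullet: grouping. The 3 OHE columns are treated as a single
# player, so the Shap value computed for `fc` converges to the true one.
grouped_explainer = KernelShap(predictor_ohe)
grouped_explainer.fit(
    X_background_ohe,
    group_names=['fc', 'num'],
    groups=[[0, 1, 2], [3]],  # column indices belonging to each group
)
grouped_explanation = grouped_explainer.explain(X_ohe)

# Third bullet (the method this issue proposes removing): aggregation by
# summation. Shap values are computed per OHE column and summed post hoc,
# which does not converge to the true Shap value for `fc`.
summed_explainer = KernelShap(predictor_ohe)
summed_explainer.fit(X_background_ohe)
summed_explanation = summed_explainer.explain(
    X_ohe,
    summarise_result=True,
    cat_vars_start_idx=[0],  # the `fc` OHE block starts at column 0
    cat_vars_enc_dim=[3],    # and spans 3 columns
)
```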

Furthermore, the cat_vars_start_idx and cat_vars_enc_dim parameters, which perform the aggregation for both KernelShap and TreeShap, currently live in the explain method. We should consider moving those parameters into the fit or __init__ method for the following reasons:

  • if we remove them from KernelShap, we would have some symmetry between TreeShap and KernelShap in terms of dealing with categorical features (for KernelShap, groups and group_names are already arguments of the fit method).
  • the arguments are not actually used in the computation of the Shap values, only in the _build_explanation method.
  • those arguments should probably be fixed once the explainer is initialized or fitted. If someone wants to experiment with various groupings, another explainer can be initialized (if we decide to move those to __init__) or the explainer can be refitted. A hypothetical sketch of this change follows the list.
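For illustration, a hypothetical sketch of the proposed refactor is shown below; the fit signature accepting cat_vars_start_idx/cat_vars_enc_dim is an assumption about the proposal, not the current alibi API.

```python
# Current API (simplified): the aggregation parameters are passed at
# explain time, so they can vary between calls on the same explainer.
explanation = explainer.explain(
    X_ohe,
    summarise_result=True,
    cat_vars_start_idx=[0],
    cat_vars_enc_dim=[3],
)

# Hypothetical refactor: fix the categorical-variable layout when the
# explainer is fitted (mirroring KernelShap's groups/group_names), so a
# different grouping requires a refit rather than a per-explain argument.
explainer.fit(
    X_background_ohe,
    cat_vars_start_idx=[0],  # hypothetical: moved from explain() to fit()
    cat_vars_enc_dim=[3],
)
explanation = explainer.explain(X_ohe)
```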