fix for shap.maskers.Impute class #3379

Open · stompsjo wants to merge 7 commits into base: master

Conversation

@stompsjo (Contributor) commented Nov 3, 2023

In brief, Impute now uses Tabular as a parent and uses the sklearn.impute objects for imputation. See #3378 for more discussion (h/t @CloseChoice).

Overview

Closes #3378

Description of the changes proposed in this pull request:

This addresses the TypeError generated when shap.maskers.Impute() is used in an Explainer object. Impute currently inherits from Masker instead of Tabular and therefore does not have a __call__ method.

Instead, Impute now inherits Tabular as a parent (including its __call__ method and functionality). Upon Impute.__init__, an sklearn.impute object is used for imputing masked data in Tabular.__call__. I added four supported keywords to method: "mean", "median", "mode", and "knn". The first three refer to the different strategies used by sklearn.impute.SimpleImputer; the last one instead creates an sklearn.impute.KNNImputer object with default inputs. There are more Scikit-Learn imputers (e.g. sklearn.impute.IterativeImputer, which is experimental) and many more settings, but if that amount of control is needed, I have also added the functionality for method to be a pre-initialized sklearn.impute object supplied by the user.
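
For illustration, usage on this branch would look roughly like the following sketch (the KNNImputer settings shown are arbitrary examples, not defaults from the PR):

    import numpy as np
    import shap
    from sklearn.impute import KNNImputer

    background = np.random.rand(100, 5)

    # keyword form: "mean"/"median"/"mode" select SimpleImputer strategies; "knn" builds a default KNNImputer
    masker = shap.maskers.Impute(background, method="mean")

    # pre-initialized form, for finer control over the imputer's settings
    masker = shap.maskers.Impute(background, method=KNNImputer(n_neighbors=3))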

Checklist

I have added two additional tests similar to those for Independent and Partition. I recognize that there's probably a place for making all of these tabular tests more meaningful for each type of masker.

  • All pre-commit checks pass.
  • Unit tests added (if fixing a bug or adding a new feature)

@stompsjo (Contributor, Author) left a comment:


A couple of questions/comments.

shap/maskers/_tabular.py (outdated, resolved)
Comment on lines 347 to 350
elif method == "knn":
    impute = KNNImputer(missing_values=0)
else:
    impute = SimpleImputer(missing_values=0, strategy=method)
@stompsjo (Contributor, Author):


The default behavior of sklearn.impute objects is to have a missing_value used to determine what elements to impute. AFAIK, 0 is valid given a standardized mask above, right?
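
For reference, this is how a plain sklearn imputer behaves when 0 is the missing-value marker (a standalone check, not code from this PR):

    import numpy as np
    from sklearn.impute import SimpleImputer

    # masked entries arrive as 0, so 0 plays the role of the missing marker
    X = np.array([[1.0, 0.0, 3.0],
                  [4.0, 5.0, 6.0]])
    imp = SimpleImputer(missing_values=0, strategy="mean")
    print(imp.fit_transform(X))
    # [[1. 5. 3.]
    #  [4. 5. 6.]] -- the 0 is replaced by the column mean of the non-missing values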

@stompsjo marked this pull request as ready for review November 6, 2023 21:40
@stompsjo changed the title from "[In-Progress] initializing fix for shap.maskers.Impute class" to "initializing fix for shap.maskers.Impute class" on Nov 10, 2023
@stompsjo changed the title from "initializing fix for shap.maskers.Impute class" to "fix for shap.maskers.Impute class" on Nov 17, 2023
@stompsjo (Contributor, Author) commented Dec 6, 2023

Hm, I missed that (legacy) code/tests specified a "linear" impute method, which was missing from my implementation. (I'm kind of surprised unit tests are passing when method='linear', although I guess it doesn't do anything as is?) I added a simple LinearImpute class that linearly interpolates missing values using the pandas.Series.interpolate() method. I am sure there are other ways to do this, happy to hear suggestions.
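
For context, such a class could look roughly like this (an illustrative sketch only, not the PR's actual code; the column-wise axis and the sklearn-style fit/transform API are assumptions):

    import numpy as np
    import pandas as pd

    class LinearImpute:
        """Sketch: linearly interpolate missing values with pandas, as described above."""

        def __init__(self, missing_values=np.nan):
            self.missing_values = missing_values

        def fit(self, X):
            # nothing to learn for pure interpolation; kept for parity with sklearn imputers
            return self

        def transform(self, X):
            df = pd.DataFrame(np.asarray(X, dtype=float))
            if not np.isnan(self.missing_values):
                df = df.replace(self.missing_values, np.nan)
            # interpolate down each column; "both" also fills leading/trailing gaps
            return df.interpolate(method="linear", axis=0, limit_direction="both").to_numpy()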

@CloseChoice (Collaborator):

@connortann could this PR get your approval to run the workflows?

@CloseChoice self-requested a review December 8, 2023 09:33
@CloseChoice (Collaborator) left a comment:


Sorry for the late review. I added myself as a reviewer, so please ping me whenever you want another review. Thanks for the PR; this looks really promising. I added a couple of comments; feel free to ignore the NIT-prefixed one.

shap/maskers/_tabular.py (three outdated, resolved review threads)
@@ -54,7 +54,7 @@ def test_serialization_independent_masker_numpy():
     temp_serialization_file.seek(0)

     # deserialize masker
-    new_independent_masker = shap.maskers.Masker.load(temp_serialization_file)
+    new_independent_masker = shap.maskers.Independent.load(temp_serialization_file)
@CloseChoice (Collaborator):


why this change?

@stompsjo (Contributor, Author):


Maybe I misinterpreted what this test is doing. I assumed that we want to use the load method specifically of the masker being tested, rather than the parent Masker class. Happy to revert it if needed. This and line 110 below are changed:

new_partition_masker = shap.maskers.Partition.load(temp_serialization_file)

temp_serialization_file.seek(0)

# deserialize masker
new_partition_masker = shap.maskers.Impute.load(temp_serialization_file)

@CloseChoice (Collaborator):


Could you please add some tests where you actually impute something? The housing dataset does not have missing data, AFAIK. Parameterizing the tests for each method would be great as well.

@stompsjo (Contributor, Author):


Here's a start:

    def test_imputation():
        # toy data
        x = np.full((5, 5), np.arange(1,6)).T
        methods = ["linear", "mean", "median", "most_frequent", "knn"]
        # toy background data
        bckg = np.full((5, 5), np.arange(1,6)).T
        for method in methods:
            # toy sample to impute
            x = np.arange(1, 6)
            masker = shap.maskers.Impute(np.full((1,5), 1), method=method)
            # only mask the second value
            mask = np.ones_like(bckg[0])
            mask[1] = 0
            # masker should impute the original value (toy data is predictable)
            imputed = masker(mask.astype(bool), x)
            assert np.all(x == imputed)

I tried to come up with something reproducible for all methods. Let me know if you have feedback.

@CloseChoice (Collaborator):

Please also add the reproducible example from issue #3378 as a test

Jordan Stomps and others added 2 commits December 11, 2023 09:24
Co-authored-by: Tobias Pitters <31857876+CloseChoice@users.noreply.github.com>
@stompsjo (Contributor, Author):

Alright, changes made. @CloseChoice, feel free to give it another review when you're ready.

@connortann (Collaborator):

Thanks for the PR! Chipping in with my two cents.

It's not clear to me what the original implementation intended; it does indeed look half-finished. So, it would be great to get this sorted. Two thoughts:

1. Imputing strategy

The current docstring for the Impute states that Gaussian imputing is used:

class Impute(Masker): # we should inherit from Tabular once we add support for arbitrary masking
    """ This imputes the values of missing features using the values of the observed features.
    Unlike Independent, Gaussian imputes missing values based on correlations with observed data points.

This doesn't seem to be consistent with the new implementation; should we look to add it in?

2. Use from Explainer

I noticed the Imputer is used in the LinearExplainer here. Will this need to be rewritten?

if isinstance(masker, pd.DataFrame) or ((isinstance(masker, np.ndarray) or issparse(masker)) and len(masker.shape) == 2):
    if self.feature_perturbation == "correlation_dependent":
        masker = maskers.Impute(masker)
    else:
        masker = maskers.Independent(masker)
elif issubclass(type(masker), tuple) and len(masker) == 2:
    if self.feature_perturbation == "correlation_dependent":
        masker = maskers.Impute({"mean": masker[0], "cov": masker[1]}, method="linear")

@@ -61,6 +63,16 @@ def __init__(self, data, max_samples=100, clustering=None):
         self.clustering = clustering
         self.max_samples = max_samples

+        # prepare by fitting sklearn imputer
@CloseChoice (Collaborator):


I would rather see this in the Impute class than in the generic Tabular class.

@stompsjo (Contributor, Author):


I'm not sure I understand. Do you suggest overwriting the __call__ method for Impute? I agree that this shaping check could be moved to the LinearImpute and Impute classes (sklearn imputers expect 2-dimensional arrays, so I kept that consistent in LinearImpute). However, some form of self.impute.fit() and self.impute.transform() still need to be called during the main algorithm as written.
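
For concreteness, the suggested restructuring might look roughly like this (a hypothetical sketch, not code from this PR; the return shapes are simplified assumptions, and Tabular is assumed importable from shap.maskers):

    import numpy as np
    from shap.maskers import Tabular

    class Impute(Tabular):
        def __call__(self, mask, x):
            # mark hidden features as missing instead of substituting background rows
            masked = np.where(mask, x, np.nan).reshape(1, -1)  # imputers expect 2-D input
            self.impute.fit(self.data)                         # fit on the background data
            varying_rows = np.ones((1, len(x)), dtype=bool)
            return (self.impute.transform(masked),), varying_rows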

@@ -17,7 +19,7 @@ class Tabular(Masker):
     """ A common base class for Independent and Partition.
     """

-    def __init__(self, data, max_samples=100, clustering=None):
+    def __init__(self, data, max_samples=100, clustering=None, impute=None):
@CloseChoice (Collaborator):


I would add this keyword only for the Impute class and not for Tabular. See my comment below.

@@ -117,6 +129,14 @@ def __call__(self, mask, x):

            return (masked_inputs_out,), varying_rows_out

+        if self.impute is not None:
@CloseChoice (Collaborator):


This should then also just be initialized in Impute.


    mask = np.ones(X.shape[1]).astype(int)
    mask[0] = 0
    mask[4] = 0

    # comparing masked values
    assert np.array_equal(original_partition_masker(mask, X[0])[0], new_partition_masker(mask, X[0])[0])

def test_imputation():
@CloseChoice (Collaborator):


I think we should get another test where we have values to impute (so explicitly set some values to np.nan).

@CloseChoice (Collaborator):


Also, can we do:

@pytest.mark.parametrize("method", ["linear", "mean", "median", "most_frequent", "knn"])
def test_imputation(method):
    ...

and then get rid of the loop within the function? That would help us locate which method throws an error.

if method not in methods:
    raise NotImplementedError(f"Given imputation method is not supported. Please provide one of the following methods: {', '.join(methods)}")
elif method == "knn":
    impute = KNNImputer(missing_values=0)
@CloseChoice (Collaborator):


IMO the missing values should be np.nan; otherwise, replacing 0 can lead to unintended results.

@stompsjo (Contributor, Author) commented Dec 11, 2023


I think line 142 may cause issues:

self._masked_data[:] = x * mask + self.data * np.invert(mask)

If the missing value is np.nan, and that is present in x, then x*mask will still leave the np.nan in place. I think the current behavior is to have masked values set to 0 via the mask.

(That said, I recognize that as is, every time 0 is present in the data it will be imputed (or replaced by the other maskers like Partition or Independent), regardless of whether that 0 was masked or part of the inherent data.)
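
A quick standalone illustration of that propagation:

    import numpy as np

    x = np.array([1.0, np.nan, 3.0])
    mask = np.array([True, True, False])
    background = np.array([9.0, 9.0, 9.0])

    # np.nan * True is still np.nan, so an inherent nan in x survives the masking,
    # while the genuinely masked slot simply takes the background value
    masked = x * mask + background * np.invert(mask)
    print(masked)  # [ 1. nan  9.]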

@stompsjo (Contributor, Author):

@connortann Thanks for the comments! Responding,

  1. Yes, the docstring should be updated to better reflect that this class is imputing based on some "interpolation" method. I will make that update.
  2. I don't believe any change should be necessary(?) since I have added a "linear" method with LinearImpute.

@CloseChoice (Collaborator):

There seems to be something wrong with the implementation, since it is not correctly imputing NaNs. I just changed test_imputation to:

def test_imputation():
    # toy data
    x = np.full((5, 5), np.arange(1,6)).T

    methods = ["mean", "median", "most_frequent", "knn"]
    # toy background data
    bckg = np.full((5, 5), np.arange(1,6)).T
    for method in methods:
        # toy sample to impute
        x = np.arange(1, 6)
        x = x.astype(np.float32)
        x[3] = np.nan
        masker = shap.maskers.Impute(np.full((1,5), 1), method=method)
        # only mask the second value
        mask = np.ones_like(bckg[0])
        mask[1] = 0
        # masker should impute the original value (toy data is predictable)
        imputed = masker(mask.astype(bool), x)
        # assert np.all(x == imputed)

and the imputed result was almost always ((array([[ 1, 2, 3, -2147483648, 5]]),)). I also had to change the missing_values=0 parameter to missing_values=np.nan.

@stompsjo (Contributor, Author):

Yeah, check my comment above, it has to do with line 142 and the way that masks are applied.

@connortann (Collaborator):

> @connortann Thanks for the comments! Responding,
>
>   1. Yes, the docstring should be updated to better reflect that this class is imputing based on some "interpolation" method. I will make that update.
>   2. I don't believe any change should be necessary(?) since I have added a "linear" method with LinearImpute.

On 1., my thinking was actually more along the lines that perhaps the code should implement what was originally described in the docstring: using a Gaussian model with means and a covariance matrix to impute the missing values. The Gaussian imputing strategy seems particularly useful in the context of Shapley values, as it accounts for the correlations between variables; perhaps it should be the default? (See the sketch at the end of this comment.)

Regarding the other imputing strategies: I'm a little uncomfortable with the "interpolation" method. I think the rest of the code base generally assumes that samples are I.I.D. and exchangeable, so that any two rows in the dataset could be swapped. The "interpolation" strategy breaks this assumption, as it depends on the row order. @CloseChoice what do you think about this one, is the "interpolation" strategy problematic?

Thank you again for helping us get this into shape! Apologies that this PR has a fair bit of discussion; that's just a consequence of the original code being quite poorly defined and documented.
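
For reference, the conditional-Gaussian imputation the docstring alludes to could be sketched like this (illustrative only; mean and cov would come from the background data, and the function name is made up):

    import numpy as np

    def gaussian_impute(x, observed, mean, cov):
        # conditional mean of the missing block given the observed block:
        # mu_m + cov_mo @ inv(cov_oo) @ (x_o - mu_o)
        o = np.flatnonzero(observed)
        m = np.flatnonzero(~observed)
        cov_oo = cov[np.ix_(o, o)]
        cov_mo = cov[np.ix_(m, o)]
        out = x.astype(float).copy()
        out[m] = mean[m] + cov_mo @ np.linalg.solve(cov_oo, x[o] - mean[o])
        return out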

@stompsjo (Contributor, Author):

> On 1., my thinking was actually more along the lines that perhaps the code should implement what was originally described in the docstring: using a Gaussian model with means and a covariance matrix to impute the missing values. The Gaussian imputing strategy seems particularly useful in the context of Shapley values, as it accounts for the correlations between variables; perhaps it should be the default?

This is where I think it might be useful to review the implementation presented in #1489 that was merged into the shap:benchmark_utility branch. That seems to do as you describe, but I have not tested it (or know the history behind it/that branch). Maybe it's time to include that in shap:master? I like what you suggest: having that behavior as a default, with the other options (sans interpolation) as additional methods.

@CloseChoice (Collaborator):

> > @connortann Thanks for the comments! Responding,
> >
> >   1. Yes, the docstring should be updated to better reflect that this class is imputing based on some "interpolation" method. I will make that update.
> >   2. I don't believe any change should be necessary(?) since I have added a "linear" method with LinearImpute.
>
> On 1., my thinking was actually more along the lines that perhaps the code should implement what was originally described in the docstring: using a Gaussian model with means and a covariance matrix to impute the missing values. The Gaussian imputing strategy seems particularly useful in the context of Shapley values, as it accounts for the correlations between variables; perhaps it should be the default?
>
> Regarding the other imputing strategies: I'm a little uncomfortable with the "interpolation" method. I think the rest of the code base generally assumes that samples are I.I.D. and exchangeable, so that any two rows in the dataset could be swapped. The "interpolation" strategy breaks this assumption, as it depends on the row order. @CloseChoice what do you think about this one, is the "interpolation" strategy problematic?
>
> Thank you again for helping us get this into shape! Apologies that this PR has a fair bit of discussion; that's just a consequence of the original code being quite poorly defined and documented.

Sorry, I somehow missed this. So yes, I think it's not ideal to break the IID + exchangeability assumption. I would change this in such a fashion that we create a linear model that takes the well-defined subset of X (where no np.nan values are present), fit the model to each column with missing values, and replace the missing ones with the model output. This might be tedious if there are a lot of columns with missing values, but IMO it is the most straightforward approach.

I can even pick this one up and implement that approach if need be.

@stompsjo (Contributor, Author):

@CloseChoice That sounds like a plan. If you can get to this in the near-term, feel free to work on an implementation. I can otherwise try to pick this back up soonish. Going with your approach, presumably the columns the model is fitting to would have to be complete (i.e. no missing values)? Otherwise how would the model estimate a value for that row? I concede I probably didn't fully comprehend your comment 😃

@CloseChoice (Collaborator):

> @CloseChoice That sounds like a plan. If you can get to this in the near-term, feel free to work on an implementation. I can otherwise try to pick this back up soonish. Going with your approach, presumably the columns the model is fitting to would have to be complete (i.e. no missing values)? Otherwise how would the model estimate a value for that row? I concede I probably didn't fully comprehend your comment 😃

Yes, my approach would be to first filter for the subset of rows that have null values only in the target column or no null values at all, and then fit the model on it.
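
A rough sketch of that approach (illustrative names, not committed code):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def regression_impute(X):
        # for each column with missing entries, fit a linear model on rows that
        # are complete in all other columns and predict the missing values
        X = np.array(X, dtype=float)
        for j in range(X.shape[1]):
            missing = np.isnan(X[:, j])
            if not missing.any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            predictors_ok = ~np.isnan(X[:, others]).any(axis=1)
            train = predictors_ok & ~missing   # rows with no nulls at all
            fill = predictors_ok & missing     # rows null only in the target column
            if train.any() and fill.any():
                model = LinearRegression().fit(X[train][:, others], X[train, j])
                X[fill, j] = model.predict(X[fill][:, others])
        return X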

Successfully merging this pull request may close these issues.

BUG: shap.maskers.Impute() throws TypeError