fix for shap.maskers.Impute class #3379

Open · stompsjo wants to merge 7 commits into base: master

Conversation

@stompsjo (Contributor) commented Nov 3, 2023

In brief, Impute now uses Tabular as a parent and uses the sklearn.impute objects for imputation. See #3378 for more discussion (h/t @CloseChoice).

Overview

Closes #3378

Description of the changes proposed in this pull request:

This addresses the TypeError generated when shap.maskers.Impute() is used in an Explainer object. Impute currently inherits from Masker instead of Tabular and therefore does not have a __call__ method.

Instead, Impute now inherits Tabular as a parent (including its __call__ method and functionality). Upon Impute.__init__, an sklearn.impute object is used for imputing masked data in Tabular.__call__. I added four supported keywords to method: "mean", "median", "mode", and "knn". The first three refer to the different strategies used by sklearn.impute.SimpleImputer; the last one instead creates an sklearn.impute.KNNImputer object with default inputs. There are more Scikit-Learn imputers (e.g. sklearn.impute.IterativeImputer, which is experimental) and many more settings, but if that amount of control is needed, I have also added the functionality for method to be a pre-initialized sklearn.impute object supplied by the user.
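
For illustration, usage on this branch would look roughly like the following sketch (the KNNImputer settings shown are arbitrary examples, not defaults from the PR):

    import numpy as np
    import shap
    from sklearn.impute import KNNImputer

    background = np.random.rand(100, 5)

    # keyword form: "mean"/"median"/"mode" select SimpleImputer strategies; "knn" builds a default KNNImputer
    masker = shap.maskers.Impute(background, method="mean")

    # pre-initialized form, for finer control over the imputer's settings
    masker = shap.maskers.Impute(background, method=KNNImputer(n_neighbors=3))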

Checklist

I have added two additional tests similar to those for Independent and Partition. I recognize that there's probably a place for making all of these tabular tests more meaningful for each type of masker.

  • All pre-commit checks pass.
  • Unit tests added (if fixing a bug or adding a new feature)

@stompsjo (Contributor, Author) left a comment:


A couple of questions/comments.

shap/maskers/_tabular.py (outdated, resolved)
Comment on lines 347 to 350
elif method == "knn":
    impute = KNNImputer(missing_values=0)
else:
    impute = SimpleImputer(missing_values=0, strategy=method)
@stompsjo (Contributor, Author):


The default behavior of sklearn.impute objects is to have a missing_value used to determine what elements to impute. AFAIK, 0 is valid given a standardized mask above, right?
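
For reference, this is how a plain sklearn imputer behaves when 0 is the missing-value marker (a standalone check, not code from this PR):

    import numpy as np
    from sklearn.impute import SimpleImputer

    # masked entries arrive as 0, so 0 plays the role of the missing marker
    X = np.array([[1.0, 0.0, 3.0],
                  [4.0, 5.0, 6.0]])
    imp = SimpleImputer(missing_values=0, strategy="mean")
    print(imp.fit_transform(X))
    # [[1. 5. 3.]
    #  [4. 5. 6.]] -- the 0 is replaced by the column mean of the non-missing values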

@stompsjo marked this pull request as ready for review November 6, 2023 21:40
@stompsjo changed the title from "[In-Progress] initializing fix for shap.maskers.Impute class" to "initializing fix for shap.maskers.Impute class" on Nov 10, 2023
@stompsjo changed the title from "initializing fix for shap.maskers.Impute class" to "fix for shap.maskers.Impute class" on Nov 17, 2023
@stompsjo (Contributor, Author) commented Dec 6, 2023

Hm, I missed that (legacy) code/tests specified a "linear" impute method, which was missing from my implementation. (I'm kind of surprised unit tests are passing when method='linear', although I guess it doesn't do anything as is?) I added a simple LinearImpute class that linearly interpolates missing values using the pandas.Series.interpolate() method. I am sure there are other ways to do this, happy to hear suggestions.
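
For context, such a class could look roughly like this (an illustrative sketch only, not the PR's actual code; the column-wise axis and the sklearn-style fit/transform API are assumptions):

    import numpy as np
    import pandas as pd

    class LinearImpute:
        """Sketch: linearly interpolate missing values with pandas, as described above."""

        def __init__(self, missing_values=np.nan):
            self.missing_values = missing_values

        def fit(self, X):
            # nothing to learn for pure interpolation; kept for parity with sklearn imputers
            return self

        def transform(self, X):
            df = pd.DataFrame(np.asarray(X, dtype=float))
            if not np.isnan(self.missing_values):
                df = df.replace(self.missing_values, np.nan)
            # interpolate down each column; "both" also fills leading/trailing gaps
            return df.interpolate(method="linear", axis=0, limit_direction="both").to_numpy()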

@CloseChoice (Collaborator):

@connortann could this PR get your approval to run the workflows?

@CloseChoice self-requested a review December 8, 2023 09:33
@CloseChoice (Collaborator) left a comment:


Sorry for the late review. I added myself as a reviewer, so please ping me whenever you want another review. Thanks for the PR; this looks really promising. I added a couple of comments; feel free to ignore the NIT-prefixed one.

shap/maskers/_tabular.py (three outdated, resolved review threads)
@@ -54,7 +54,7 @@ def test_serialization_independent_masker_numpy():
     temp_serialization_file.seek(0)

     # deserialize masker
-    new_independent_masker = shap.maskers.Masker.load(temp_serialization_file)
+    new_independent_masker = shap.maskers.Independent.load(temp_serialization_file)
@CloseChoice (Collaborator):


why this change?

@stompsjo (Contributor, Author):


Maybe I misinterpreted what this test is doing. I assumed that we want to use the load method specifically of the masker being tested, rather than the parent Masker class. Happy to revert it if needed. This and line 110 below are changed:

new_partition_masker = shap.maskers.Partition.load(temp_serialization_file)

temp_serialization_file.seek(0)

# deserialize masker
new_partition_masker = shap.maskers.Impute.load(temp_serialization_file)

@CloseChoice (Collaborator):


Could you please add some tests where you actually impute something? The housing dataset does not have missing data, AFAIK. Parameterizing the tests for each method would be great as well.

@stompsjo (Contributor, Author):


Here's a start:

    def test_imputation():
        # toy data
        x = np.full((5, 5), np.arange(1,6)).T
        methods = ["linear", "mean", "median", "most_frequent", "knn"]
        # toy background data
        bckg = np.full((5, 5), np.arange(1,6)).T
        for method in methods:
            # toy sample to impute
            x = np.arange(1, 6)
            masker = shap.maskers.Impute(np.full((1,5), 1), method=method)
            # only mask the second value
            mask = np.ones_like(bckg[0])
            mask[1] = 0
            # masker should impute the original value (toy data is predictable)
            imputed = masker(mask.astype(bool), x)
            assert np.all(x == imputed)

I tried to come up with something reproducible for all methods. Let me know if you have feedback.

@CloseChoice (Collaborator):

Please also add the reproducible example from issue #3378 as a test

Jordan Stomps and others added 2 commits December 11, 2023 09:24
Co-authored-by: Tobias Pitters <31857876+CloseChoice@users.noreply.github.com>
@stompsjo (Contributor, Author):

Alright, changes made. @CloseChoice, feel free to give it another review when you're ready.

@connortann (Collaborator):

Thanks for the PR! Chipping in with my two cents.

It's not clear to me what the original implementation intended; it does indeed look half-finished. So, it would be great to get this sorted. Two thoughts:

1. Imputing strategy

The current docstring for the Impute states that Gaussian imputing is used:

class Impute(Masker): # we should inherit from Tabular once we add support for arbitrary masking
    """ This imputes the values of missing features using the values of the observed features.
    Unlike Independent, Gaussian imputes missing values based on correlations with observed data points.

This doesn't seem to be consistent with the new implementation; should we look to add it in?

2. Use from Explainer

I noticed the Imputer is used in the LinearExplainer here. Will this need to be rewritten?

if isinstance(masker, pd.DataFrame) or ((isinstance(masker, np.ndarray) or issparse(masker)) and len(masker.shape) == 2):
    if self.feature_perturbation == "correlation_dependent":
        masker = maskers.Impute(masker)
    else:
        masker = maskers.Independent(masker)
elif issubclass(type(masker), tuple) and len(masker) == 2:
    if self.feature_perturbation == "correlation_dependent":
        masker = maskers.Impute({"mean": masker[0], "cov": masker[1]}, method="linear")

@@ -61,6 +63,16 @@ def __init__(self, data, max_samples=100, clustering=None):
         self.clustering = clustering
         self.max_samples = max_samples

+        # prepare by fitting sklearn imputer
@CloseChoice (Collaborator):


I would rather see this in the Impute class than in the generic Tabular class.

@stompsjo (Contributor, Author):


I'm not sure I understand. Do you suggest overwriting the __call__ method for Impute? I agree that this shaping check could be moved to the LinearImpute and Impute classes (sklearn imputers expect 2-dimensional arrays, so I kept that consistent in LinearImpute). However, some form of self.impute.fit() and self.impute.transform() still need to be called during the main algorithm as written.
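
For concreteness, the suggested restructuring might look roughly like this (a hypothetical sketch, not code from this PR; the return shapes are simplified assumptions, and Tabular is assumed importable from shap.maskers):

    import numpy as np
    from shap.maskers import Tabular

    class Impute(Tabular):
        def __call__(self, mask, x):
            # mark hidden features as missing instead of substituting background rows
            masked = np.where(mask, x, np.nan).reshape(1, -1)  # imputers expect 2-D input
            self.impute.fit(self.data)                         # fit on the background data
            varying_rows = np.ones((1, len(x)), dtype=bool)
            return (self.impute.transform(masked),), varying_rows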

@@ -17,7 +19,7 @@ class Tabular(Masker):
     """ A common base class for Independent and Partition.
     """

-    def __init__(self, data, max_samples=100, clustering=None):
+    def __init__(self, data, max_samples=100, clustering=None, impute=None):
@CloseChoice (Collaborator):


I would add this keyword only for the Impute class and not for Tabular. See my comment below.

@@ -117,6 +129,14 @@ def __call__(self, mask, x):

            return (masked_inputs_out,), varying_rows_out

+        if self.impute is not None:
@CloseChoice (Collaborator):


This should then also just be initialized in Impute.


    mask = np.ones(X.shape[1]).astype(int)
    mask[0] = 0
    mask[4] = 0

    # comparing masked values
    assert np.array_equal(original_partition_masker(mask, X[0])[0], new_partition_masker(mask, X[0])[0])

def test_imputation():
@CloseChoice (Collaborator):


I think we should get another test where we have values to impute (so explicitly set some values to np.nan).

@CloseChoice (Collaborator):


Also, can we do:

@pytest.mark.parametrize("method", ["linear", "mean", "median", "most_frequent", "knn"])
def test_imputation(method):
    ...

and then get rid of the loop within the function? That would help us locate which method throws an error.

if method not in methods:
    raise NotImplementedError(f"Given imputation method is not supported. Please provide one of the following methods: {', '.join(methods)}")
elif method == "knn":
    impute = KNNImputer(missing_values=0)
@CloseChoice (Collaborator):


IMO the missing values should be np.nan; otherwise, replacing 0 can lead to unintended results.

@stompsjo (Contributor, Author) commented Dec 11, 2023


I think line 142 may cause issues:

self._masked_data[:] = x * mask + self.data * np.invert(mask)

If the missing value is np.nan, and that is present in x, then x*mask will still leave the np.nan in place. I think the current behavior is to have masked values set to 0 via the mask.

(That said, I recognize that as is, every time 0 is present in the data it will be imputed (or replaced by the other maskers like Partition or Independent), regardless of whether that 0 was masked or part of the inherent data.)
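
A quick standalone illustration of that propagation:

    import numpy as np

    x = np.array([1.0, np.nan, 3.0])
    mask = np.array([True, True, False])
    background = np.array([9.0, 9.0, 9.0])

    # np.nan * True is still np.nan, so an inherent nan in x survives the masking,
    # while the genuinely masked slot simply takes the background value
    masked = x * mask + background * np.invert(mask)
    print(masked)  # [ 1. nan  9.]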

@stompsjo (Contributor, Author):

@connortann Thanks for the comments! Responding,

  1. Yes, the docstring should be updated to better reflect that this class is imputing based on some "interpolation" method. I will make that update.
  2. I don't believe any change should be necessary(?) since I have added a "linear" method with LinearImpute.

@CloseChoice (Collaborator):

There seems to be something wrong with the implementation, since it is not correctly imputing NaNs. I just changed test_imputation to:

def test_imputation():
    # toy data
    x = np.full((5, 5), np.arange(1,6)).T

    methods = ["mean", "median", "most_frequent", "knn"]
    # toy background data
    bckg = np.full((5, 5), np.arange(1,6)).T
    for method in methods:
        # toy sample to impute
        x = np.arange(1, 6)
        x = x.astype(np.float32)
        x[3] = np.nan
        masker = shap.maskers.Impute(np.full((1,5), 1), method=method)
        # only mask the second value
        mask = np.ones_like(bckg[0])
        mask[1] = 0
        # masker should impute the original value (toy data is predictable)
        imputed = masker(mask.astype(bool), x)
        # assert np.all(x == imputed)

and the imputed result was almost always ((array([[ 1, 2, 3, -2147483648, 5]]),)). I also had to change the missing_values=0 parameter to missing_values=np.nan.

@stompsjo (Contributor, Author):

Yeah, check my comment above, it has to do with line 142 and the way that masks are applied.

@connortann (Collaborator):

> @connortann Thanks for the comments! Responding,
>
>   1. Yes, the docstring should be updated to better reflect that this class is imputing based on some "interpolation" method. I will make that update.
>   2. I don't believe any change should be necessary(?) since I have added a "linear" method with LinearImpute.

On 1., my thinking was actually more along the lines that perhaps the code should implement what was originally described in the docstring: using a Gaussian model with means and a covariance matrix to impute the missing values. The Gaussian imputing strategy seems particularly useful in the context of Shapley values, as it accounts for the correlations between variables; perhaps it should be the default? (See the sketch at the end of this comment.)

Regarding the other imputing strategies: I'm a little uncomfortable with the "interpolation" method. I think the rest of the code base generally assumes that samples are I.I.D. and exchangeable, so that any two rows in the dataset could be swapped. The "interpolation" strategy breaks this assumption, as it depends on the row order. @CloseChoice what do you think about this one, is the "interpolation" strategy problematic?

Thank you again for helping us get this into shape! Apologies that this PR has a fair bit of discussion; that's just a consequence of the original code being quite poorly defined and documented.
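
For reference, the conditional-Gaussian imputation the docstring alludes to could be sketched like this (illustrative only; mean and cov would come from the background data, and the function name is made up):

    import numpy as np

    def gaussian_impute(x, observed, mean, cov):
        # conditional mean of the missing block given the observed block:
        # mu_m + cov_mo @ inv(cov_oo) @ (x_o - mu_o)
        o = np.flatnonzero(observed)
        m = np.flatnonzero(~observed)
        cov_oo = cov[np.ix_(o, o)]
        cov_mo = cov[np.ix_(m, o)]
        out = x.astype(float).copy()
        out[m] = mean[m] + cov_mo @ np.linalg.solve(cov_oo, x[o] - mean[o])
        return out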

@stompsjo (Contributor, Author):

> On 1., my thinking was actually more along the lines that perhaps the code should implement what was originally described in the docstring: using a Gaussian model with means and a covariance matrix to impute the missing values. The Gaussian imputing strategy seems particularly useful in the context of Shapley values, as it accounts for the correlations between variables; perhaps it should be the default?

This is where I think it might be useful to review the implementation presented in #1489 that was merged into the shap:benchmark_utility branch. That seems to do as you describe, but I have not tested it (or know the history behind it/that branch). Maybe it's time to include that in shap:master? I like what you suggest: having that behavior as a default, with the other options (sans interpolation) as additional methods.

@CloseChoice (Collaborator):

> > @connortann Thanks for the comments! Responding,
> >
> >   1. Yes, the docstring should be updated to better reflect that this class is imputing based on some "interpolation" method. I will make that update.
> >   2. I don't believe any change should be necessary(?) since I have added a "linear" method with LinearImpute.
>
> On 1., my thinking was actually more along the lines that perhaps the code should implement what was originally described in the docstring: using a Gaussian model with means and a covariance matrix to impute the missing values. The Gaussian imputing strategy seems particularly useful in the context of Shapley values, as it accounts for the correlations between variables; perhaps it should be the default?
>
> Regarding the other imputing strategies: I'm a little uncomfortable with the "interpolation" method. I think the rest of the code base generally assumes that samples are I.I.D. and exchangeable, so that any two rows in the dataset could be swapped. The "interpolation" strategy breaks this assumption, as it depends on the row order. @CloseChoice what do you think about this one, is the "interpolation" strategy problematic?
>
> Thank you again for helping us get this into shape! Apologies that this PR has a fair bit of discussion; that's just a consequence of the original code being quite poorly defined and documented.

Sorry, I somehow missed this. So yes, I think it's not ideal to break the IID + exchangeability assumption. I would change this in such a fashion that we create a linear model that takes the well-defined subset of X (where no np.nan values are present), fit the model to each column with missing values, and replace the missing ones with the model output. This might be tedious if there are a lot of columns with missing values, but IMO it is the most straightforward approach.

I can even pick this one up and implement that approach if need be.

@stompsjo (Contributor, Author):

@CloseChoice That sounds like a plan. If you can get to this in the near-term, feel free to work on an implementation. I can otherwise try to pick this back up soonish. Going with your approach, presumably the columns the model is fitting to would have to be complete (i.e. no missing values)? Otherwise how would the model estimate a value for that row? I concede I probably didn't fully comprehend your comment 😃

@CloseChoice (Collaborator):

> @CloseChoice That sounds like a plan. If you can get to this in the near-term, feel free to work on an implementation. I can otherwise try to pick this back up soonish. Going with your approach, presumably the columns the model is fitting to would have to be complete (i.e. no missing values)? Otherwise how would the model estimate a value for that row? I concede I probably didn't fully comprehend your comment 😃

Yes, my approach would be to first filter for the subset of rows that have null values only in the target column or no null values at all, and then fit the model on it.
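
A rough sketch of that approach (illustrative names, not committed code):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def regression_impute(X):
        # for each column with missing entries, fit a linear model on rows that
        # are complete in all other columns and predict the missing values
        X = np.array(X, dtype=float)
        for j in range(X.shape[1]):
            missing = np.isnan(X[:, j])
            if not missing.any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            predictors_ok = ~np.isnan(X[:, others]).any(axis=1)
            train = predictors_ok & ~missing   # rows with no nulls at all
            fill = predictors_ok & missing     # rows null only in the target column
            if train.any() and fill.any():
                model = LinearRegression().fit(X[train][:, others], X[train, j])
                X[fill, j] = model.predict(X[fill][:, others])
        return X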

Successfully merging this pull request may close these issues.

BUG: shap.maskers.Impute() throws TypeError