fix for shap.maskers.Impute class #3379
base: master
Conversation
A couple of questions/comments.
shap/maskers/_tabular.py (Outdated):

```python
elif method == "knn":
    impute = KNNImputer(missing_values=0)
else:
    impute = SimpleImputer(missing_values=0, strategy=method)
```
The default behavior of `sklearn.impute` objects is to have a `missing_values` argument used to determine which elements to impute. AFAIK, `0` is valid given a standardized mask above, right?
Hm, I missed that the (legacy) code/tests specified a "linear" impute method, which was missing from my implementation. (I'm kind of surprised unit tests are passing when …)
@connortann could this PR get your approval to run the workflows?
Sorry for the late review; I added myself as reviewer, so please ping me whenever you want another review. Thanks for the PR, this looks really promising. I added a couple of comments; feel free to ignore the NIT-prefixed one.
```diff
@@ -54,7 +54,7 @@ def test_serialization_independent_masker_numpy():
     temp_serialization_file.seek(0)

     # deserialize masker
-    new_independent_masker = shap.maskers.Masker.load(temp_serialization_file)
+    new_independent_masker = shap.maskers.Independent.load(temp_serialization_file)
```
why this change?
Maybe I misinterpreted what this test is doing. I assumed that we want to use the `load` method specifically of the masker being tested, rather than the parent `Masker` class. Happy to revert it if needed. This and line 110 below are changed:
shap/tests/maskers/test_tabular.py, line 110 in 43d47f6:

```python
new_partition_masker = shap.maskers.Partition.load(temp_serialization_file)
```
```python
    temp_serialization_file.seek(0)

    # deserialize masker
    new_partition_masker = shap.maskers.Impute.load(temp_serialization_file)
```
Could you please add some tests where you actually impute something? The housing dataset does not have missing data, AFAIK. Parameterizing the tests for each method would be great as well.
Here's a start:

shap/tests/maskers/test_tabular.py, lines 175 to 191 in 75c88e8:
```python
def test_imputation():
    # toy data
    x = np.full((5, 5), np.arange(1, 6)).T
    methods = ["linear", "mean", "median", "most_frequent", "knn"]
    # toy background data
    bckg = np.full((5, 5), np.arange(1, 6)).T
    for method in methods:
        # toy sample to impute
        x = np.arange(1, 6)
        masker = shap.maskers.Impute(np.full((1, 5), 1), method=method)
        # only mask the second value
        mask = np.ones_like(bckg[0])
        mask[1] = 0
        # masker should impute the original value (toy data is predictable)
        imputed = masker(mask.astype(bool), x)
        assert np.all(x == imputed)
```
I tried to come up with something reproducible for all methods. Let me know if you have feedback.
Please also add the reproducible example from issue #3378 as a test.
Co-authored-by: Tobias Pitters <31857876+CloseChoice@users.noreply.github.com>
Alright, changes made. @CloseChoice, feel free to give it another review when you're ready.
Thanks for the PR! Chipping in with my two cents. It's not clear to me what the original implementation intended; it does indeed look half-finished. So, it would be great to get this sorted. Two thoughts:

1. Imputing strategy

The current docstring for the `Impute` masker (shap/maskers/_tabular.py, lines 302 to 305 in 63223e1) doesn't seem to be consistent with the new implementation; should we look to add it in?

2. Use from Explainer

I noticed the Imputer is used in the `LinearExplainer` here (shap/shap/explainers/_linear.py, lines 76 to 83 in 63223e1). Will this need to be rewritten?
```diff
@@ -61,6 +63,16 @@ def __init__(self, data, max_samples=100, clustering=None):
         self.clustering = clustering
         self.max_samples = max_samples

+        # prepare by fitting sklearn imputer
```
I would rather see this in the `Impute` class than in the generic `Tabular` class.
I'm not sure I understand. Do you suggest overwriting the `__call__` method for `Impute`? I agree that this shaping check could be moved to the `LinearImpute` and `Impute` classes (sklearn imputers expect 2-dimensional arrays, so I kept that consistent in `LinearImpute`). However, some form of `self.impute.fit()` and `self.impute.transform()` still needs to be called during the main algorithm as written.
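For concreteness, here's a rough sketch of how an `Impute`-owned `__call__` could wrap the fit/transform cycle. The class and names here are hypothetical illustrations of the idea being discussed, not the actual shap implementation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

class ImputeSketch:
    """Hypothetical sketch: the masker owns the sklearn imputer and applies
    fit/transform itself, keeping the generic Tabular.__call__ imputation-free."""

    def __init__(self, data, method="mean"):
        self.data = np.asarray(data, dtype=float)
        self.imputer = SimpleImputer(missing_values=np.nan, strategy=method)
        # fit once on the background data
        self.imputer.fit(self.data)

    def __call__(self, mask, x):
        # mark masked entries as missing, then let sklearn fill them in;
        # sklearn imputers expect 2-D input, hence the reshape
        masked = np.where(mask, np.asarray(x, dtype=float), np.nan)
        return self.imputer.transform(masked.reshape(1, -1))
```

With a constant background `[[1, 2, 3], [1, 2, 3]]`, masking the second feature of `x = [1, 9, 3]` makes the imputer fill it from the background column mean.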
```diff
@@ -17,7 +19,7 @@ class Tabular(Masker):
     """ A common base class for Independent and Partition.
     """

-    def __init__(self, data, max_samples=100, clustering=None):
+    def __init__(self, data, max_samples=100, clustering=None, impute=None):
```
I would add this keyword only for the `Impute` class and not for `Tabular`. See my comment below.
```diff
@@ -117,6 +129,14 @@ def __call__(self, mask, x):

         return (masked_inputs_out,), varying_rows_out

+        if self.impute is not None:
```
This should then also just be initialized in `Impute`.
```python
    mask = np.ones(X.shape[1]).astype(int)
    mask[0] = 0
    mask[4] = 0

    # comparing masked values
    assert np.array_equal(original_partition_masker(mask, X[0])[0], new_partition_masker(mask, X[0])[0])


def test_imputation():
```
I think we should get another test where we have values to impute (so explicitly set some values to `np.nan`).
Also, can we do

```python
@pytest.mark.parametrize("method", ["linear", "mean", "median", "most_frequent", "knn"])
def test_imputation(method):
    ...
```

and then get rid of the loop within the function? That helps us locate which method throws an error.
```python
if method not in methods:
    raise NotImplementedError(f"Given imputation method is not supported. Please provide one of the following methods: {', '.join(methods)}")
elif method == "knn":
    impute = KNNImputer(missing_values=0)
```
IMO the missing values should be `np.nan`; otherwise, replacing `0` can lead to unintended results.
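A small sklearn-only illustration of the concern (this assumes nothing about shap itself; the arrays are made up for the demo):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# With missing_values=0, a legitimate zero in the data is treated as
# missing and silently replaced by the column statistic.
X = np.array([[0.0, 2.0],
              [4.0, 2.0],
              [8.0, 2.0]])
imp0 = SimpleImputer(missing_values=0, strategy="mean")
out0 = imp0.fit_transform(X)  # the real 0.0 becomes the mean of 4 and 8

# With missing_values=np.nan, only explicitly-missing entries are imputed.
Xn = np.array([[np.nan, 2.0],
               [4.0, 2.0],
               [8.0, 2.0]])
impn = SimpleImputer(missing_values=np.nan, strategy="mean")
outn = impn.fit_transform(Xn)
```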
I think line 142 may cause issues:

Line 141 in 75c88e8:

```python
self._masked_data[:] = x * mask + self.data * np.invert(mask)
```

If the missing value is `np.nan`, and that is present in `x`, then `x * mask` will still leave the `np.nan` in place. I think the current behavior is to have masked values set to `0` via the mask.

(That said, I recognize that as is, every time `0` is present in the data it will be imputed (or replaced by the other maskers like `Partition` or `Independent`), regardless of whether that `0` was masked or part of the inherent data.)
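The nan-propagation behavior described above can be checked directly (toy arrays for illustration):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])       # sample with a genuinely missing value
data = np.array([10.0, 20.0, 30.0])    # background row
mask = np.array([True, True, False])   # keep the first two features, mask the third

# the Tabular-style blend from line 141:
blended = x * mask + data * np.invert(mask)
# nan * True is still nan, so the nan in x survives the blend unimputed
```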
@connortann Thanks for the comments! Responding,
There seems to be something wrong with the implementation since it is not correctly imputing nans. I just changed the test to:

```python
def test_imputation():
    # toy data
    x = np.full((5, 5), np.arange(1, 6)).T
    methods = ["mean", "median", "most_frequent", "knn"]
    # toy background data
    bckg = np.full((5, 5), np.arange(1, 6)).T
    for method in methods:
        # toy sample to impute
        x = np.arange(1, 6)
        x = x.astype(np.float32)
        x[3] = np.nan
        masker = shap.maskers.Impute(np.full((1, 5), 1), method=method)
        # only mask the second value
        mask = np.ones_like(bckg[0])
        mask[1] = 0
        # masker should impute the original value (toy data is predictable)
        imputed = masker(mask.astype(bool), x)
        # assert np.all(x == imputed)
```

and the imputed result was almost always (…)
Yeah, check my comment above, it has to do with line 142 and the way that masks are applied.
On 1., my thinking was actually more along the lines that perhaps the code should implement what was originally described in the docstring: using a Gaussian model with means and a covariance matrix to impute the missing values. The Gaussian imputing strategy seems particularly useful in the context of Shapley values as it accounts for the correlations between variables; perhaps it should be the default?

Regarding the other imputing strategies: I'm a little uncomfortable with the "interpolation" method. I think the rest of the code base generally assumes that samples are I.I.D. and exchangeable, so that any two rows in the dataset could be swapped. The "interpolation" strategy breaks this assumption as it depends on the row order. @CloseChoice what do you think about this one, is the "interpolation" strategy problematic?

Thank you again for helping us get this into shape! Apologies that this PR has a fair bit of discussion; that's just a consequence of the original code being quite poorly defined and documented.
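For reference, the Gaussian strategy alluded to boils down to filling missing entries with the conditional mean of a multivariate normal, mu_m + S_mo S_oo^-1 (x_o - mu_o). A minimal numpy sketch (a hypothetical helper for illustration, not shap code; `mean` and `cov` would be estimated from the background data):

```python
import numpy as np

def gaussian_impute(x, mean, cov):
    """Fill np.nan entries of x with their conditional mean given the
    observed entries, under a multivariate normal model."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    filled = x.copy()
    # conditional mean: mu_m + S_mo @ inv(S_oo) @ (x_o - mu_o)
    S_mo = cov[np.ix_(miss, obs)]
    S_oo = cov[np.ix_(obs, obs)]
    filled[miss] = mean[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mean[obs])
    return filled
```

Unlike the per-column strategies, this uses the correlation structure: with two strongly correlated features, observing one pulls the imputed value of the other toward it.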
This is where I think it might be useful to review the implementation presented in #1489 that was merged into the …
Sorry, I somehow missed this. So yes, I think it's not ideal to break the i.i.d. + exchangeability assumption. I would change this in such a fashion that we create a linear model that takes the well-defined subset of X (where no np.nan values are present), and then we fit the model to each column with missing values and replace the missing ones with the model output. This might be tedious if there are a lot of columns with missing values, but IMO it is the most straightforward approach. I can even pick this one up and implement that approach if need be.
@CloseChoice That sounds like a plan. If you can get to this in the near-term, feel free to work on an implementation. I can otherwise try to pick this back up soonish. Going with your approach, presumably the columns the model is fitting to would have to be complete (i.e. no missing values)? Otherwise how would the model estimate a value for that row? I concede I probably didn't fully comprehend your comment 😃
Yes, my approach would be to first filter for the subset of rows that have null values in only the target column or no null values at all and then fit the model on it. |
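A minimal sketch of that per-column approach (a hypothetical helper for illustration, not shap code; it fits each incomplete column on the fully-observed rows and predicts the missing entries):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def linear_model_impute(X):
    """For each column with missing values, fit a linear model on the
    rows with no missing values and predict the missing entries."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    # rows with no missing values form the training subset
    complete = ~np.isnan(X).any(axis=1)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        others = [k for k in range(X.shape[1]) if k != j]
        model = LinearRegression().fit(X[np.ix_(complete, others)], X[complete, j])
        # only predict for rows whose other columns are fully observed
        predictable = miss & ~np.isnan(X[:, others]).any(axis=1)
        filled[predictable, j] = model.predict(X[np.ix_(predictable, others)])
    return filled
```

On toy data where column 1 is exactly twice column 0, a missing entry in column 1 is recovered exactly.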
In brief, `Impute` now uses `Tabular` as a parent and uses the `sklearn.impute` objects for imputation. See #3378 for more discussion (h/t @CloseChoice).

Overview

Closes #3378

Description of the changes proposed in this pull request:

This addresses the `TypeError` generated when `shap.maskers.Impute()` is used in an `Explainer` object. It currently calls `Masker` as a parent instead of `Tabular` and therefore does not have a `__call__` method.

Instead, `Impute` now inherits `Tabular` as a parent (including its `__call__` method and functionality). Upon `Impute.__init__`, an `sklearn.impute` object is used for imputing masked data in `Tabular.__call__`. I added four supported keywords to `method`: "mean", "median", "mode", and "knn". The first three refer to the different strategies used by `sklearn.impute.SimpleImputer`. The last one creates an `sklearn.impute.KNNImputer` object with default inputs instead. There are more scikit-learn imputers (e.g. `sklearn.impute.IterativeImputer`, which is experimental) and many more settings, but if that amount of control is needed, I have added the functionality for `method` to be a pre-initialized `sklearn.impute` object supplied by the user.

Checklist
I have added two additional tests similar to those for `Independent` and `Partition`. I recognize that there's probably a place for making all of these tabular tests more meaningful for each type of masker.
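For illustration, the `method` keywords described in this PR map onto plain sklearn imputers roughly like this (an sklearn-only sketch on made-up data; note that "mode" corresponds to sklearn's "most_frequent" strategy):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# toy data with one missing entry in the second column
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

results = {}
# "mean", "median", "mode" -> SimpleImputer strategies
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    results[strategy] = imp.fit_transform(X)[1, 1]

# "knn" -> KNNImputer (the PR uses default inputs; n_neighbors shown for clarity)
results["knn"] = KNNImputer(n_neighbors=2).fit_transform(X)[1, 1]
```

A pre-initialized imputer (e.g. `KNNImputer(n_neighbors=3)`) could then be passed straight through as the `method` argument for finer control, as the description proposes.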