
Strong dependence on using kmeans background samples for SHAP #1

Closed · slundberg opened this issue Apr 21, 2020 · 5 comments

slundberg commented Apr 21, 2020

Hey! I finally got around to playing with the examples you have here, and I noticed that you were using shap.kmeans to get the background data. Since I typically use a random sample rather than kmeans (unless I am really trying to optimize run time), I just swapped

background_distribution = shap.kmeans(xtrain,10)

for

background_distribution = shap.sample(xtrain,10)

When I did this, all the adversarial results for SHAP seemed to fall apart for COMPAS... meaning 79% of the time race is still the top SHAP feature in the test dataset for the adversarial model.
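
For reference, here is a minimal, self-contained sketch of how the two background choices plug into KernelSHAP (a toy model and random data stand in for the notebook's COMPAS setup, so nothing here is the repo's code):

import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the notebook's data and model (shapes only, not the real COMPAS set)
rng = np.random.RandomState(0)
xtrain = rng.normal(size=(500, 9))
ytrain = rng.randint(0, 2, size=500)
model = LogisticRegression().fit(xtrain, ytrain)

# The two background choices being compared
background_kmeans = shap.kmeans(xtrain, 10)  # 10 weighted k-means centers
background_random = shap.sample(xtrain, 10)  # 10 rows drawn at random

explainer = shap.KernelExplainer(model.predict, background_random)
shap_values = explainer.shap_values(xtrain[:100])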

This very strong dependence on using kmeans was surprising to me, since it seems to imply SHAP is much more robust to these adversarial attacks when using a typical random background sample. Have you noticed this before, or do you have any thoughts on this? I think it is worth pointing out, but I wanted to get your feedback before suggesting to users that a random sample provides better adversarial robustness.

Thanks!

dylan-slack commented Apr 23, 2020

Hi Scott, excited to have you engage with this project! The thing here is that the adversarial attack is being trained with kmeans in mind. So if you switch to a different sampling strategy (random sample), it's likely the attack won't work as well because the OOD/IID detection is looking for substitutions that came from kmeans sampling and wouldn't be found in the data distribution.
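
(For anyone skimming: the mechanism being described boils down to routing queries that look like explainer perturbations to an innocuous model, and everything else to the biased model. The following is a conceptual sketch only, not the repo's Adversarial_Kernel_SHAP_Model; the three arguments are assumed to be sklearn-style models with a predict method.)

import numpy as np

class AdversarialSketch:
    """Conceptual sketch of the attack: hide the biased model behind an OOD detector."""

    def __init__(self, ood_detector, biased_model, innocuous_model):
        self.ood_detector = ood_detector        # flags rows that look like SHAP substitutions
        self.biased_model = biased_model        # the model the attacker actually deploys
        self.innocuous_model = innocuous_model  # what perturbed (explainer) queries see

    def predict(self, X):
        is_ood = self.ood_detector.predict(X) == 1
        return np.where(is_ood, self.innocuous_model.predict(X), self.biased_model.predict(X))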

It's worth verifying whether random sampling alone is enough to break the attacks, however. It could be the case that random sampling produces enough IID perturbations that the attacks are ineffective. But when I run with random sampling for COMPAS, race shows up as the top feature 16% of the time while the uncorrelated column shows up 83% of the time. It's worth noting that fidelity falls to 83%. Additionally, I bumped up the adversarial classifier training to 4e4 draws from the data provided.

Explicitly my changes were:

# draw 10 random rows from the training data as the background
# (np.random.choice samples indices with replacement by default)
background_distribution = xtrain[np.random.choice(xtrain.shape[0], 10)]

# train the adversarial model against that background with 4e4 perturbation samples
adv_shap = Adversarial_Kernel_SHAP_Model(racist_model_f(), innocuous_model_psi()).train(xtrain, ytrain, background_distribution=background_distribution, feature_names=features, n_samples=4e4)

I got an error when I called shap.sample, but I believe what I did here is equivalent?

Further, a clever attacker could train the attack with a few plausible sampling strategies (sample, kmeans, zeros, etc.) to make the attacks resilient to sampling changes. So long as the substitutions SHAP produces aren't found in the data distribution, this attack is plausible to some degree.
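
A rough sketch of what such a multi-strategy substitution pool might look like (the pooling itself is my extrapolation; xtrain is the training matrix from the notebook):

import numpy as np
import shap

# Candidate substitution values drawn from several plausible background strategies,
# so the OOD detector sees all of them during training
kmeans_centers = shap.kmeans(xtrain, 10).data  # the k-means summary points (DenseData.data)
random_rows = shap.sample(xtrain, 10)          # randomly sampled rows
zeros = np.zeros((1, xtrain.shape[1]))         # an all-zeros baseline

substitution_pool = np.vstack([kmeans_centers, random_rows, zeros])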

Please let me know if you have any other questions!

slundberg commented Apr 23, 2020

Thanks for getting back to me, Dylan! It looks like it all has to do with how many samples we give the adversarial classifier. The default of 2e4 doesn't seem to work very well with a random background, but increasing it to 4e4 helps, and 4e5 makes things work like in the paper. This is what I did:

On line 2 of the SHAP experiment box I made your change:

background_distribution = xtrain[np.random.choice(xtrain.shape[0], 10)]#shap.kmeans(xtrain,10)

Then I computed the results right after that as:

biased_shap_values = biased_kernel_explainer.shap_values(xtest[:100])
adv_shap_values = adv_kerenel_explainer.shap_values(xtest[:100])

print("Fraction race on top for biased", (np.argsort(-np.abs(biased_shap_values), axis=1)[:,0] == 8).mean())
print("Fraction race on top for biased adversarial", (np.argsort(-np.abs(adv_shap_values), axis=1)[:,0] == 8).mean())

Which produced:

Fraction race on top for biased 1.0
Fraction race on top for biased adversarial 0.77

Then I increased the number of samples to 4e4 and 0.77 dropped to 0.45... much better (or worse) but still not close to what you got. Then I upped the samples to 4e5 and got 0.16.

So I looped over many sample sizes and got a general trend, but fidelity also drops (note the left side of this plot is the default # of samples in the code right now), and I am not sure how good 0.7-0.8 fidelity is since 0.5 is random (thoughts?).

[plot: fraction of the time race is the top SHAP feature, and fidelity, across sample sizes]
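
Roughly, the sweep behind this plot looks like the following (a sketch only: the explainer construction, the adversarial model's predict method, and the fidelity definition, here simple prediction agreement with the biased model, are my assumptions rather than the notebook's exact code):

results = []
for n in [2e4, 4e4, 1e5, 4e5]:
    adv_shap = Adversarial_Kernel_SHAP_Model(racist_model_f(), innocuous_model_psi()).train(
        xtrain, ytrain, background_distribution=background_distribution,
        feature_names=features, n_samples=n)
    explainer = shap.KernelExplainer(adv_shap.predict, background_distribution)
    sv = explainer.shap_values(xtest[:100])
    race_on_top = (np.argsort(-np.abs(sv), axis=1)[:, 0] == 8).mean()
    fidelity = (adv_shap.predict(xtest) == racist_model_f().predict(xtest)).mean()
    results.append((n, race_on_top, fidelity))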

Since % match is hard to interpret I reran this again using R-squared for fidelity, and it looks like we need to drop to about R^2 = 0.5 to consistently knock race out of the top feature importance rank:

[plot: the same sweep, with R^2 used as the fidelity measure]

dylan-slack commented

This is cool! It definitely depends on the number of samples you're considering, because this is just the number of substitutions you're training on --- we'd expect a better OOD classifier if we see more combinations.

The significance of the fidelity number really depends on the target application of the attack IMO. We were thinking about this in terms of presenting a biased classifier in the real world while going undetected. If we're saying that sacrificing 20-30% fidelity allows us to go "undetected", then we're still being perfectly discriminatory on 70-80% of our population. Assessing this from a demographic parity perspective, on the 4e4 classifier I'm seeing 20% DP, which is pretty discriminatory. I could see this being less meaningful in other applications though.
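
For reference, one standard way to compute that demographic parity gap (a sketch; it assumes the race column is the binary indicator at index 8 used in the snippets above, and a model with an sklearn-style predict):

import numpy as np

def demographic_parity_gap(model, X, race_col=8):
    # difference in positive-prediction rates between the two race groups
    preds = model.predict(X)
    return abs(preds[X[:, race_col] == 1].mean() - preds[X[:, race_col] == 0].mean())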

slundberg commented Apr 23, 2020

Makes sense. Though I would caution that you are not being "perfectly discriminatory" on 70-80% of the population, since if it was 50% it would be random guessing and hence could have no discriminatory effects. That's why I think R^2 is much easier to interpret since it is the % of variance you are able to predict.
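
One plausible way to compute that R^2 fidelity (my reading: the variance of the biased model's output explained by the adversarial model's output; the use of predicted probabilities is an assumption, not from the thread):

from sklearn.metrics import r2_score

# R^2 of the adversarial model's predicted probability of the positive class
# against the biased model's, over the test set
fidelity_r2 = r2_score(racist_model_f().predict_proba(xtest)[:, 1],
                       adv_shap.predict_proba(xtest)[:, 1])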

To your point of being "undetected", I think the actual importance given to the feature is probably more informative than "it was the top feature". To that end I re-ran things (with a somewhat different range of sample sizes), and it looks like the actual weight of the race feature drops by about a factor of 2 (which is also roughly how much the R^2 drops by, so it might not be doing much more than moving weight from one feature to another):

[plot: SHAP weight of the race feature across sample sizes, random background]
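
(The race weight plotted here can be read off as the mean absolute SHAP value for the race column, assuming the same column index as the earlier snippets:)

race_weight = np.abs(adv_shap_values)[:, 8].mean()  # global importance of the race feature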

If we run the same experiment with the kmeans background then we get a much more effective attack (though still at a cost to fidelity), which seems to imply that, at least for the COMPAS example, kmeans is important for this attack:

[plot: the same experiment with the kmeans background]

(note the left end of these plots starts at n_samples = 1000)

dylan-slack commented

Yeah, I do think you raise a good point here about the distinction between using kmeans and random sampling. KernelSHAP does seem to be more robust given this analysis and data set!

The only other thing I'd consider here is that by increasing n_samples, the IID/OOD data set is starting to become pretty unbalanced, given that there are only ~5,500 COMPAS training instances, I believe. It's set up right now to address this by resampling the original instances (set through the perturbation_multiplier parameter, which is at 10x right now). It could be worthwhile to look at changing this as the number of samples increases, but I'm unsure how much that will really affect this analysis.
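
Illustratively, the balancing being described amounts to something like this (a sketch of the idea only, not the repo's implementation; the perturbed rows are whatever n_samples substituted points the attack generates):

import numpy as np

# Up-sample the ~5,500 in-distribution COMPAS rows by the multiplier so the OOD
# detector's training set stays roughly balanced against the perturbed rows
perturbation_multiplier = 10
iid_rows = np.tile(xtrain, (perturbation_multiplier, 1))   # labeled in-distribution
# perturbed_rows would be the n_samples substituted points, labeled out-of-distribution
# detector_X = np.vstack([iid_rows, perturbed_rows])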
