
CFRL library generating counterfactual with same label #878

Open
Himanshu-1988 opened this issue Feb 21, 2023 · 3 comments
Labels
Type: Question (User questions)

Comments

@Himanshu-1988

I have followed this example: https://docs.seldon.io/projects/alibi/en/stable/examples/cfrl_adult.html
I am generating counterfactuals for two classes, 0 and 1.

But in my case, in cell 18, cf_pd (the generated counterfactuals) has a label column containing both 0 and 1. Why is that? If I am generating counterfactuals for class 1, then cf_pd should contain only class 0 in the label column.

Please let me know where I am going wrong.

@RobertSamoilescu
Collaborator

Thank you for your question!

If you are generating counterfactuals for class 1 (i.e., in orig_pd) and some counterfactuals still belong to class 1 (i.e., in cf_pd), then it means that the method couldn't find a counterfactual for your given input. This has to do with the validity score, which all counterfactual methods can struggle with. For more details on the topic, see the paper here.

If you look at the paper, you will see that the objective function is a mixture of a divergence loss (i.e., the one that flips the class), a sparsity loss (i.e., which keeps the counterfactual close to the original input), and a consistency loss. There might be some difficult cases in which the method favours sparsity over divergence, which leads to what you've described.
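Schematically (hedging on the exact form, which is spelled out in the paper), the objective balances the three terms, with the coefficients corresponding to the notebook's COEFF_SPARSITY and COEFF_CONSISTENCY constants:

$$\mathcal{L} \;=\; \mathcal{L}_{\text{divergence}} \;+\; \lambda_{\text{sparsity}}\,\mathcal{L}_{\text{sparsity}} \;+\; \lambda_{\text{consistency}}\,\mathcal{L}_{\text{consistency}}$$

Lowering $\lambda_{\text{sparsity}}$ (and, while tuning, setting $\lambda_{\text{consistency}} = 0$) shifts the balance toward flipping the class.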

That being said, to increase the validity of your counterfactuals I recommend the following (a configuration sketch follows the list):

  • Increase the number of TRAIN_STEPS. In the example notebook it is set to a very low value for demonstration purposes. You can set it to TRAIN_STEPS=100000 or even TRAIN_STEPS=150000; the more the better.
  • Decrease COEFF_SPARSITY to a lower value. Currently it is set to 0.5; the lower it is, the better your chances of getting a higher validity score. In the beginning you might also set COEFF_CONSISTENCY=0 for ease of hyper-parameter tuning.
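A minimal sketch of how those settings slot into the explainer construction, assuming the objects from the linked cfrl_adult notebook (predictor, the heterogeneous autoencoder heae, its preprocessors, LATENT_DIM, adult, X_train) are already defined in earlier cells; double-check the keyword names against your alibi version:

```python
from alibi.explainers import CounterfactualRLTabular

# Hyper-parameters adjusted as suggested above (0.1 is just an illustrative lower value).
TRAIN_STEPS = 100_000       # raised from the demo value; more steps -> higher validity
COEFF_SPARSITY = 0.1        # lowered from 0.5 to favour flipping the class
COEFF_CONSISTENCY = 0.0     # optionally switched off while tuning the other two

# Objects such as predictor, heae, heae_preprocessor, heae_inv_preprocessor,
# LATENT_DIM, adult and X_train come from the cfrl_adult example notebook.
explainer = CounterfactualRLTabular(
    predictor=predictor,
    encoder=heae.encoder,
    decoder=heae.decoder,
    latent_dim=LATENT_DIM,
    encoder_preprocessor=heae_preprocessor,
    decoder_inv_preprocessor=heae_inv_preprocessor,
    coeff_sparsity=COEFF_SPARSITY,
    coeff_consistency=COEFF_CONSISTENCY,
    feature_names=adult.feature_names,
    category_map=adult.category_map,
    train_steps=TRAIN_STEPS,
    batch_size=100,
    backend="tensorflow",
)
explainer.fit(X=X_train)
```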

I also recommend using the logging functionality through callbacks presented at the very end of the notebook. This will give you a much better sense of how the training evolves, and you can actually see how the validity increases during training.

There might be other parameters to try if none of the above works, such as the action noise (act_noise), replay_buffer_size, batch_size, etc., but I would probably not go there yet.

You can also try to train it without an autoencoder (see example here).

If none of the above works and you really need a counterfactual, then probably the easiest way to get one is to search your training set for an instance that is classified as your intended target and is close to the input instance w.r.t. some metric of your choice (probably a combination of L1 and L0). This should work every time, provided you didn't impose any constraints on the feature values of the counterfactual.
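A rough sketch of this fallback (not an alibi API); clf, X_train_num, X_train_cat and the split of the query instance into x_num/x_cat are placeholders for your own model and data:

```python
import numpy as np

def nearest_counterfactual(clf, X_train_num, X_train_cat, x_num, x_cat,
                           target_class, alpha=1.0):
    # keep only training instances the classifier assigns to the target class
    preds = clf.predict(np.hstack([X_train_num, X_train_cat]))
    candidates = np.where(preds == target_class)[0]
    if candidates.size == 0:
        return None  # no instance of the target class available

    # L1 over (ideally standardized) numerical features + L0 over categorical ones
    l1 = np.abs(X_train_num[candidates] - x_num).sum(axis=1)
    l0 = (X_train_cat[candidates] != x_cat).sum(axis=1)
    dists = alpha * l1 + l0

    return candidates[np.argmin(dists)]  # index of the closest valid "counterfactual"
```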

Although this might not be the case for you, it may also be worth looking at this issue, just to avoid an error when computing the validity.

Hope this helps. If you still encounter difficulties, please let me know.

@Himanshu-1988
Author
commented Feb 23, 2023

Thanks for your response, I will try to change the configuration.

Meanwhile regarding your comment

"If none of the above works, and you really need a counterfactual, then probably the easiest way to get one is to search in your training set an instance that is classified as your intended target and it is close to the input instance w.r.t. some metric of your choice (probably a combination of L1 and L0). This should work all the time if you didn't impose any constraints on the feature values of the counterfactual."
To measure closeness when the data combines both categorical and numerical values across 30 features, I tried cosine similarity earlier, but the results are not good (the similarity is on the higher side, > 0.95, for almost all data points).

Is there any metric I should look at in this case?

@RobertSamoilescu
Collaborator

I would suggest trying the metric proposed in this paper, at the bottom of page 5. The metric is as follows:

  • Numeric features: the absolute value of the difference between the two data points, divided by the standard deviation of that feature across the entire dataset.
  • Categorical features: data points with the same value have a distance of 0; otherwise the distance is set to the probability that any two examples across the entire dataset would share the same value for that feature.

For categorical features, even a simpler distance, equal to 0 if the data points have the same value and 1 otherwise, should work fine.
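A sketch of that mixed-type distance under the assumptions above (array names are illustrative; X_num/X_cat are the full dataset splits, used only to estimate standard deviations and value frequencies):

```python
import numpy as np

def mixed_distance(a_num, b_num, a_cat, b_cat, X_num, X_cat, use_probability=True):
    # numerical part: |a - b| / std(feature), per feature
    std = X_num.std(axis=0)
    std[std == 0] = 1.0                              # guard against constant features
    d_num = np.abs(a_num - b_num) / std

    # categorical part
    d_cat = np.zeros(a_cat.shape[0])
    for j in range(a_cat.shape[0]):
        if a_cat[j] == b_cat[j]:
            continue                                 # same value -> distance 0
        if use_probability:
            # probability that two random examples share the same value for feature j
            _, counts = np.unique(X_cat[:, j], return_counts=True)
            p = counts / counts.sum()
            d_cat[j] = np.sum(p ** 2)
        else:
            d_cat[j] = 1.0                           # simpler 0/1 variant
    return d_num.sum() + d_cat.sum()
```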

I don't think that cosine similarity is suited to your problem. One reason is that two data points can have a cosine distance of 0 but be perceptually very different. For example, the cosine distance between [0, 1] and [0, 1000] is 0, while the cosine distance between [0, 1] and [0.2, 1.2] is about 0.014. In most cases (though of course it depends on your problem) one would say that [0.2, 1.2] is much closer to [0, 1] than [0, 1000] is; note that the cosine distance is not able to capture that. Anyway, if you think for some reason that the cosine distance is what you need, I would first apply some simple preprocessing to the dataset: standardize/normalize the numerical features and transform the categorical ones into their one-hot-encoded representation. After that I would try to apply the cosine distance.
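To see those numbers concretely, a quick check with plain numpy:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance(np.array([0.0, 1.0]), np.array([0.0, 1000.0])))  # 0.0
print(cosine_distance(np.array([0.0, 1.0]), np.array([0.2, 1.2])))     # ~0.014
```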

@jklaise added the Type: Question (User questions) label on Mar 13, 2023