PierreBlanchart/CMBDTC
Counterfactual explanations for XGBoost and tree ensemble models

A package for XGBoost model interpretability. It addresses the challenging problem of fault detection and diagnosis by using a counterfactual approach. More specifically, given a faulty data point, we compute its closest counterfactual example, i.e. the closest virtual point in the input space that is classified as normal by the model. This point is called "virtual" since it is built from the model parameters alone and, as such, is not necessarily an existing point from the training set (most of the time it is not). Based on the computed counterfactual example, we make a recommendation on the actions to take, at a minimum, to correct the fault. The package applies to two-class, multi-class and regression XGBoost models. The supported model types, in XGBoost nomenclature, are: "binary:logistic", "reg:logistic", "reg:squarederror", and "multi:softprob".
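To illustrate the counterfactual objective itself (this is a toy brute-force sketch, not the package's exact tree-traversal algorithm described in the paper, and the model below is a made-up stand-in for a trained classifier):

```python
import numpy as np

# Hypothetical stand-in for a trained model: returns P(faulty) for a 2-D input.
# The actual package operates on real XGBoost tree ensembles.
def p_faulty(x):
    return 1.0 / (1.0 + np.exp(-(x[0] + x[1] - 1.0) * 5.0))

def closest_counterfactual(query, threshold=0.5, step=0.02):
    # Brute-force search on a grid around the query: keep the closest point
    # whose predicted probability of being "faulty" falls below the threshold.
    grid = np.arange(-1.0, 2.0, step)
    best, best_dist = None, np.inf
    for a in grid:
        for b in grid:
            x = np.array([a, b])
            if p_faulty(x) < threshold:
                d = np.linalg.norm(x - query)
                if d < best_dist:
                    best, best_dist = x, d
    return best, best_dist

query = np.array([0.8, 0.7])      # classified as faulty: p_faulty(query) > 0.5
cf, dist = closest_counterfactual(query)
print(cf, dist, p_faulty(cf))
```

The gap between `cf` and `query` is the "minimal correction" the package turns into a recommendation; the exact method computes it directly from the tree-ensemble geometry instead of grid search.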

This R/C++ (Rcpp) package implements the method described in the paper "An exact counterfactual-example-based approach to tree-ensemble models interpretability".

If you use this package, please cite:

@article{blanchart2021exact,
  title={An exact counterfactual-example-based approach to tree-ensemble models interpretability},
  author={Blanchart, Pierre},
  journal={arXiv preprint arXiv:2105.14820},
  year={2021}
}

For a quick overview of the implemented method, you can read "Explaining the decisions of XGBoost models using counterfactual examples".

Install instructions

To install the R dependencies, run (if needed) from the R command line:

install.packages(c("xgboost", "data.table", "Rcpp", "RcppArmadillo", "cpp11"))

The C++ dependencies are Boost and TBB. Check the paths for these two libraries in the file "./src/Makevars".

Once the correct paths are specified, install the package from a shell:

cd PACKAGE_SOURCE_FOLDER
R CMD INSTALL -l /INSTALL_PATH ./ --preclean

Results

Example 1

MNIST dataset: an image from class 5 misclassified into class 6. We find the CF example of this image, i.e. the closest point that the model classifies in class 5.

[figure: query image, a 5 misclassified as a 6]

Then, we compute the CF example associated with the query image above. Visually, we see that the original misclassified image is transformed into something that looks more like a 5.

[figure: computed CF example, classified as a 5]

Beam search

We decrease the decision threshold from 0.5 to 0.2 in steps of 0.01. The CF examples get closer and closer to the class center, resembling a "5" more and more. To be read row-wise.

[figures: CF examples along the threshold sweep, read row-wise]
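The threshold sweep above can be sketched as follows (a toy 1-D reimplementation with a made-up score function and a brute-force CF search, purely to show why the distortion grows as the threshold drops; the package performs the exact search on the real model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D score: probability that x belongs to the faulty class (illustrative).
def p_faulty(x):
    return sigmoid(5.0 * (x - 1.0))

def closest_cf(query, threshold, grid):
    # Closest grid point whose faulty probability falls below the threshold.
    ok = grid[p_faulty(grid) < threshold]
    return ok[np.argmin(np.abs(ok - query))]

query = 1.5                                  # p_faulty(query) ≈ 0.92: faulty
grid = np.arange(-2.0, 3.0, 0.001)
for t in np.arange(0.5, 0.19, -0.01):        # threshold 0.5 -> 0.2, step 0.01
    cf = closest_cf(query, t, grid)
    print(f"threshold={t:.2f}  cf={cf:.3f}  distortion={query - cf:.3f}")
```

Each lower threshold demands a CF classified with more confidence in the target class, so the CF moves further from the query and the printed distortion increases monotonically, mirroring the row-wise progression in the figures.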

Demos

Demo scripts are in the folder ./demos. There are five demo scripts which illustrate the main functionalities of the software:

  1. ./demos/demo_surrogate_binary.R: this script illustrates the search for the closest CF example on a binary classification problem between two classes of the MNIST dataset. The search range is initialized via an approximate CF search that uses a differentiable surrogate of the tree-ensemble/XGBoost model.
  2. ./demos/demo_surrogate_multiclass.R: same as before, but on a multiclass problem with all 10 classes of MNIST.
  3. ./demos/demo_CF_MNIST_beam_search.R: this script illustrates the search for the closest CF examples on a binary classification problem between two classes of the MNIST dataset. We vary the decision threshold, which forces the algorithm to search for CF examples that are classified with more and more confidence inside the class targeted for the CF example. For each value of the decision threshold, we plot the corresponding CF example and report the induced distortion (distance to the original query). We also assess visually the changes brought to the initial query in the successive CF examples, to see whether the ambiguities in the query image are "resolved" so that it looks more like an image of the targeted class.
  4. ./demos/demo_restricted_CF.R: example of fixing the values of some input variables over which the user has no control; the CF example is computed on the remaining variables. Here, we use a dataset of consumer credit approvals/denials based on a set of 20 exogenous variables describing the applicant's situation. We diagnose the cases of credit denial: we determine which minimal changes in the input characteristics would allow the applicant to obtain the credit they applied for, knowing that there are characteristics over which they have no control (and which are forced to stay fixed by using a restricted CF query).
  5. ./demos/demo_restricted_CF_regression.R: extension of the CF algorithm to a regression problem. We use a dataset describing the sale of individual residential properties in Ames, Iowa, from 2006 to 2010. The dataset contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values. These variables describe the characteristics of the house, as well as those of the neighborhood. We apply our algorithm to an XGBoost regression model trained to predict the sale price of a house given a set of explanatory variables, and we try to answer the question from the seller's point of view: what changes/repairs should be performed, at a minimum, to raise the sale price to a target price? We answer it by considering only the variables that can actually be changed. For instance, it is not possible to change variables such as the total area, the date of construction, or the quality of the neighborhood, so we keep these variables fixed and apply the CF approach to the remaining set of variables.
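The restricted-CF idea of demos 4 and 5, freezing the variables the user cannot act on, can be sketched like this (a toy with an illustrative linear score and a hypothetical brute-force helper, not the package's API):

```python
import numpy as np

# Illustrative score: higher means "credit approved" (made-up coefficients).
def score(x):
    return 0.8 * x[0] + 0.3 * x[1] - 0.5 * x[2]

def restricted_cf(query, free_idx, target, step=0.05):
    # Search perturbations of the free variables only; all other variables
    # stay fixed at their query values. Keep the closest point whose score
    # reaches the target (here: crosses the approval boundary).
    deltas = np.arange(-2.0, 2.0 + step, step)
    best, best_dist = None, np.inf
    for d0 in deltas:
        for d1 in deltas:
            x = query.copy()
            x[free_idx] += np.array([d0, d1])
            if score(x) >= target:
                dist = np.linalg.norm(x - query)
                if dist < best_dist:
                    best, best_dist = x, dist
    return best, best_dist

query = np.array([0.2, 0.1, 1.0])   # score = -0.31: credit denied
free_idx = [0, 1]                    # variable 2 is beyond the user's control
cf, dist = restricted_cf(query, free_idx, target=0.0)
print(cf, dist)
```

Restricting the search to `free_idx` is exactly what makes the resulting recommendation actionable: the frozen coordinate of `cf` is identical to the query's, and only the controllable variables move.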
