
Are there any suggestions/best practices on what value to use for n_combinations? #226

Closed
mlds2020 opened this issue Aug 12, 2020 · 10 comments


@mlds2020

I am asking mainly regarding the ctree method, but I guess this would be applicable to any approach?

n_combinations obviously has a big impact on the memory usage, runtime, and performance of the method.
Are there any rules of thumb on what value to set?
I have seen you have implemented a warning:

Due to computational complexity, we recommend setting n_combinations = 10 000 if the number of features is larger than 12.

Would that mean your recommendation is to set exact = TRUE when the number of features is <= 12, and to set n_combinations = 10 000 when it is higher?

I think a study of the trade-off between n_combinations and the accuracy of the predicted Shapley values would be an interesting undertaking - one for which I, unfortunately, don't have the necessary hardware (my program crashes at as little as n_combinations = 1000 using the ctree approach on the Credit Card Default dataset, i.e. 23 features). Maybe the Norwegian Computing Center would be better equipped for that? :-)

@martinju
Member

Yes, this is applicable to all approaches, and yes, it is really just about trading accuracy against memory usage and runtime.
Our recommendation is to use exact = TRUE if you can, given your memory and how long you are willing to wait for your results. 10 000 is of course just a number, but that is where we felt memory/runtime started to become an issue on our computers (for the empirical and gaussian methods), so I think it is an OK value.

Note that the cutoff for triggering the warning has been changed from 12 to 13 in d3f39f3 (it will be in master by next week). So for 14 or more features we recommend n_combinations = 10 000.

Note also that n_combinations currently does not give you the specified number of unique combinations, so if you want 1000 unique combinations you may have to increase n_combinations to, say, 2000 or 3000, depending on how many duplicates are sampled. You can check how many are actually being used with nrow(explainer$X), where explainer is the output from shapr. See #227
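
For example, a minimal check along those lines (the 2000 is just an illustrative value, and x_train/model are assumed to be your training data and fitted model, as in the script later in this thread):

explainer <- shapr(x_train, model, n_combinations = 2000)
nrow(explainer$X)  # number of unique feature combinations actually used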

I agree that it would be useful to study this trade-off between accuracy and runtime. The problem is that it will potentially vary quite a bit depending on the method you use (empirical, gaussian, ctree) and the type of predictive model you are explaining. Thus, I think it would be hard to reach a general answer to this question.

Sorry to hear your program crashes. Ctree is indeed heavier to run. Depending on where the crash happens, it might help to reduce the number of test samples (predictions you are explaining) and then just run the explain function repeatedly. There are also settings to reduce the complexity of the trees that ctree builds; see ?simulateAllTrees.

@mlds2020
Author

Depending on where the crash happens, it might help to reduce the number of test samples (predictions you are explaining), and then just run the explain function repeatedly.

That's a smart suggestion! How would one go about doing this?
Suppose one's hardware can only generate at most 1k samples - would the procedure be to run shapr 10 times with different random seeds, just saving the Shapley values after each 'iteration', and ultimately calculating the mean for each sample/feature pair?

@martinju
Member

martinju commented Aug 19, 2020

I was merely thinking of just looping over batches of the predictions you want to explain, like

explanation_list <- list()  # store each batch's result so it is not overwritten
for (i in 1:10) {
  set.seed(123)
  explanation_list[[i]] <- explain(
    x_test[batch_list[[i]], ],
    approach = "ctree",
    explainer = explainer,
    prediction_zero = p
  )
}

where batch_list is a list containing vectors of row indices for x_test. In the extreme case you could run them one by one.
This will cause the trees to be re-fitted every time, which is obviously inefficient, but if it keeps you from running out of memory, it is worth the extra wait.
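
For completeness, one way to construct such a batch_list (this particular construction is my own illustration, not something from the package):

# Split the row indices of x_test into, say, 10 roughly equal batches
idx <- seq_len(nrow(x_test))
batch_list <- split(idx, cut(idx, 10, labels = FALSE))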

Suppose one's hardware can only generate at most 1k samples - would the procedure be to run shapr 10 times with different random seeds, just saving the Shapley values after each 'iteration', and ultimately calculating the mean for each sample/feature pair?

This will give you an estimate of the same thing, and it is straightforward (see the sketch below). A better way would be to call explain with just a subset of all the samples of combinations and store only the conditional expectations in each of the 10 repeated runs. Finally, you would glue those together and do the computations with the Shapley formula. This is not possible with the current code base, though. It would require a little bit of hacking, but should not be extremely hard.
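
A minimal sketch of that straightforward averaging approach (the loop, seeds and object names are my own; the shapr()/explain() calls and the dt element holding the Shapley values are used as elsewhere in this thread):

# Re-sample the feature combinations and recompute the Shapley values several times,
# then average the estimates for each sample/feature pair
shapley_runs <- list()
for (s in 1:10) {
  set.seed(s)
  explainer_s <- shapr(x_train, model, n_combinations = 1000)
  explanation_s <- explain(x_test, approach = "ctree", explainer = explainer_s, prediction_zero = p)
  shapley_runs[[s]] <- as.matrix(explanation_s$dt)
}
shapley_mean <- Reduce(`+`, shapley_runs) / length(shapley_runs)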

At least for the ctree and empirical approaches, reducing the number of training observations would also directly reduce memory usage.

@mlds2020
Author

A better way would be to call explain with just a subset of all the samples of combinations and store only the conditional expectations in each of the 10 repeated runs. Finally, you would glue those together and do the computations with the Shapley formula. This is not possible with the current code base, though. It would require a little bit of hacking, but should not be extremely hard.

Good call! I will try this one out!
Consider my attempt successful if the issue isn't reopened!! :-)

@martinju
Member

A better way would be to call explain with just a subset of all the samples of combinations and store only the conditional expectations in each of the 10 repeated runs. Finally, you would glue those together and do the computations with the Shapley formula. This is not possible with the current code base, though. It would require a little bit of hacking, but should not be extremely hard.

Good call! I will try this one out!
Consider my attempt successful if the issue isn't reopened!! :-)

Great! Please also report back if successful. PRs are welcome :-)

@martinju martinju reopened this Aug 19, 2020
@mlds2020
Author

I didn't quite know how to deal with the different-sized (due to random sampling) weight matrices W :-(

For now, I just loop over the whole process with different seeds, record the Shapley Values, and just calculate the mean.

@martinju
Member

My take on this is that you need to run the shapr function as usual with the full set of, say, 10*1000 samples, then create 10 copies of it, each with just 1000 rows of the S matrix. Then you pass each of those copies to explain and keep the res_mat that is created in the last part of the predictions call (you need to modify that function in the package to do this, as res_mat is currently not saved). Finally, you merge all the res_mats and run the last few lines of code in predictions.

What you are doing right now might also be OK, although I believe what I am suggesting above is more sample-efficient, as you avoid (unnecessary) re-sampling of the same feature combinations in repeated runs.

@martinju
Member

martinju commented Aug 20, 2020

@mlds2020
I just hacked together a version of what I suggested above: a tiny modification to the predictions function and a script showing how to use it.
Since I assume you are working with the ctree branch, I built it on top of that. You will need the ctree_manual_batching branch version of the package to be able to do this. Below is an illustration of how to do it (I have not checked this for other methods or whether it actually saves a lot of memory, but in theory it should). Please let me know how it works if you give it a spin :-)

# Example on how to manually run batches of feature combinations through
# explain() and thereby do just a portion of the heavy computations at a time to save memory.

# The idea is to store dt_res every time, combine those in the end and run
# KernelSHAP to compute the full Shapley values in the end.

# This is just an illustration of the concept. For very memory intensive cases,
# it might be wise to save the dt_mat to disk, clear out unneeded objects from time to time,
# run the garbage collector and possibly even restart R from time to time.

library(xgboost)
library(shapr)
library(data.table)

data("Boston", package = "MASS")

x_var <- c("lstat", "rm", "dis", "indus", "nox", "age", "tax", "ptratio")
y_var <- "medv"

x_train <- as.matrix(Boston[-1:-6, x_var])
y_train <- Boston[-1:-6, y_var]
x_test <- as.matrix(Boston[1:6, x_var])

# Fitting a basic xgboost model to the training data
model <- xgboost(
  data = x_train,
  label = y_train,
  nround = 20,
  verbose = FALSE
)


# Prepare the data for explanation
explainer <- shapr(x_train, model, n_combinations = 100)

# Decides how to split the feature combinations in explainer into chunks to be passed one by one
no_batches = 10
set.seed(123)
no_samples = nrow(explainer$S)
p = mean(y_train)

x0 = 2:(no_samples-1)
x = x0[sample(1:length(x0), size = length(x0), replace = FALSE)]
S_groups = split(x, cut(seq_along(x), no_batches, labels = FALSE))
dt_mat_list = list()
for (i in 1:no_batches){
  S_groups[[i]] = sort(S_groups[[i]])
  tmp_explainer = explainer

  # Subsetting the required tables in the explainer
  these_samples = sort(c(1, S_groups[[i]], no_samples)) # Need to always include the first and last row due to a coding thing
  tmp_explainer$S = tmp_explainer$S[these_samples, ]
  tmp_explainer$W = tmp_explainer$W[, these_samples]
  tmp_explainer$X = tmp_explainer$X[these_samples]

  # Running explain for only a subset of the feature combinations
  tmp_explanation <- explain(
    x_test,
    approach = "empirical",
    explainer = tmp_explainer,
    prediction_zero = p
  )

  dt_mat_list[[i]] = cbind(tmp_explanation$dt_mat, row_id = these_samples) # Keeping the dt_mat with an identifier for which S row this is

  print(i)
}

# Joining all the dt_mats
dt_mat_final = data.table::rbindlist(dt_mat_list)
dt_mat_final = unique(dt_mat_final)

setkey(dt_mat_final, row_id)
dt_mat_final[, row_id := NULL]

# Running kernelSHAP on the merged version of dt_mat
cnms <- colnames(tmp_explanation$x_test)
kshap <- t(explainer$W %*% as.matrix(dt_mat_final))
dt_kshap <- data.table::as.data.table(kshap)
colnames(dt_kshap) <- c("none", cnms)

# Adding proper class etc
explanation_batch <- list(dt = dt_kshap, model = tmp_explanation$model, p = tmp_explanation$p, x_test = tmp_explanation$x_test)
attr(explanation_batch, "class") <- c("shapr", "list")

plot(explanation_batch, index_x_test = 1:4) # This is a proper shapr object, so plotting etc works

### Checking similarity to original version which is more efficient, but may also be more memory hungry

org_explanation <- explain(
  x_test,
  approach = "empirical",
  explainer = explainer,
  prediction_zero = p
)
org_explanation$dt_mat = NULL # Removing the dt_mat

all.equal(org_explanation, explanation_batch)
#[1] TRUE

@AbdollahiAz

Hello,
I tried to run your code and got an error on this line:
dt_mat_final = data.table::rbindlist(dt_mat_list)
I removed that error by adding this line before it in the code:
dt_mat_list <- lapply(dt_mat_list, function(mat) as.data.table(mat)) ## I added
But then I get an error on this line:
kshap <- t(explainer$W %*% as.matrix(dt_mat_final))
This is the error:

Error in explainer$W %*% as.matrix(dt_mat_final) : 
  non-conformable arguments

I checked dt_mat_final:

dt_mat_final
Null data.table (0 rows and 0 cols)

Could you help me figure out how to resolve this error?

Best,

@martinju
Member

This code is outdated. I recommend using the more recent n_batches argument with the GitHub master version of shapr, which achieves essentially the same thing.
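
For reference, a minimal sketch of what that might look like (this assumes the same explain() interface used earlier in this thread and that n_batches is passed directly to explain(); check the master-branch documentation for the exact signature):

explanation <- explain(
  x_test,
  approach = "ctree",
  explainer = explainer,
  prediction_zero = p,
  n_batches = 10  # process the feature combinations in 10 batches to reduce peak memory use
)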
