
Batch mode #244

Open
martinju opened this issue Nov 12, 2020 · 3 comments
martinju (Member) commented Nov 12, 2020

Running cases with many features is currently not easy to do with shapr, mainly due to memory consumption. One way to overcome this is to implement a batch mode. This could be done quite nicely by including a batch parameter in explain, which loops over the calls to prepare_data and prediction (one loop), passing one batch of the feature combinations (rows of S) at a time, and storing the dt_mat output from prediction. The last part of prediction:

  kshap <- t(explainer$W %*% as.matrix(dt_mat))
  dt_kshap <- data.table::as.data.table(kshap)
  colnames(dt_kshap) <- c("none", cnms)

should be moved out of prediction and the loop, and executed on a dt_mat that combines the individual dt_mats from the batches. See 6d43468#diff-639dbfdc05cfa9df4cd9a3a1b798669638837cf999da4e7be64129e0d3996ed8 for a manual ad-hoc script applying this very idea.
The neat thing about this approach is that if the amount of memory is limited, one can use small batches. This saves memory because the loop output, dt_mat, is much smaller than the matrices etc. needed internally in prepare_data and prediction when sampling.
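The batch loop described above can be sketched in base R as follows. Note that compute_batch() is only a stand-in for shapr's prepare_data()/prediction() pair, and the matrices are toy stand-ins for explainer$W and the rows of S; the point is the shape of the loop, not the computation:

```r
# Sketch of the proposed batch loop (stand-ins for shapr internals).
set.seed(1)
n_comb <- 8          # number of feature combinations (rows of S)
n_test <- 5          # number of test observations to explain
W <- matrix(rnorm(3 * n_comb), nrow = 3)   # stand-in for explainer$W

compute_batch <- function(rows) {
  # In shapr this would call prepare_data()/prediction() for these rows of S;
  # here we just return a small (length(rows) x n_test) block.
  matrix(rows, nrow = length(rows), ncol = n_test)
}

batch_size <- 3
batches <- split(seq_len(n_comb), ceiling(seq_len(n_comb) / batch_size))

# Loop over batches, keeping only the small per-batch output:
dt_mat <- do.call(rbind, lapply(batches, compute_batch))

# Final step, moved OUT of prediction() and out of the loop:
kshap <- t(W %*% dt_mat)
```

Only the small per-batch blocks (and the combined dt_mat) are ever kept, while the large intermediate sampling matrices inside compute_batch() are freed after each iteration.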

The aforementioned loop can be parallelized for speedup when memory is not a (big) issue, and this currently stands out as the superior way to implement parallelization in this R package (#38).
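With the same stand-in compute_batch() as above, the loop could be parallelized with base R's parallel package; whether shapr would use parallel, future, or something else is an open design choice, so this is only one possible sketch:

```r
# Parallel version of the batch loop, using base R's parallel package.
library(parallel)

# Stand-in for shapr's per-batch prepare_data()/prediction() work:
compute_batch <- function(rows) matrix(rows, nrow = length(rows), ncol = 5)
batches <- split(1:8, ceiling(1:8 / 3))

cl <- makeCluster(2)                       # two worker processes
res <- parLapply(cl, batches, compute_batch)
stopCluster(cl)

dt_mat <- do.call(rbind, res)              # combine small per-batch outputs
```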

Taking it one step further, one could also write the individual dt_mats to a fixed temporary disk folder (which is not deleted at session termination) and pick them up in the end to compute the Shapley values. This is nice in case of a crash (e.g. due to memory), as one does not have to rerun all combinations. The filename for the common dt_mat.csv file should be created based on the dimensions of the training and test data, the class of the model and n_combinations, plus maybe a sample of the data. At the beginning of the explain call one can then check the temporary disk folder for a previous dt_mat.csv matching the present call, and ask the user whether to continue from there or start over.
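A minimal sketch of the checkpointing idea, again with a stand-in compute_batch(). The run_id shown is purely illustrative (the issue suggests deriving it from the data dimensions, model class, n_combinations and perhaps a data sample), and tempdir() is used only to keep the sketch self-contained; a real implementation would use a folder that survives the session:

```r
# Disk checkpointing sketch: save each batch's block, skip finished batches.
run_id   <- paste0("shapr_", 100, "x", 10, "_nc", 8)  # illustrative run id
ckpt_dir <- file.path(tempdir(), run_id)              # fixed folder per run
dir.create(ckpt_dir, showWarnings = FALSE)

compute_batch <- function(rows) matrix(rows, nrow = length(rows), ncol = 5)
batches <- split(1:8, ceiling(1:8 / 3))

for (b in seq_along(batches)) {
  f <- file.path(ckpt_dir, sprintf("dt_mat_batch%03d.rds", b))
  if (file.exists(f)) next    # resume after a crash: skip completed batches
  saveRDS(compute_batch(batches[[b]]), f)
}

# Pick the pieces up at the end and combine them:
files  <- sort(list.files(ckpt_dir, pattern = "^dt_mat_batch",
                          full.names = TRUE))
dt_mat <- do.call(rbind, lapply(files, readRDS))
```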

Taking it to the maximum (for simulation runs), we could create an Rscript that is called from a loop in a shell script, with a specification of the feature combination rows to be executed in that Rscript call. Within the shell script, after the loop, another Rscript is called where the remaining computations are done and the Shapley value results are saved to disk. The point of this is that within each Rscript call in the loop, R is restarted, so one can be 100% sure that all memory is freed.
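The per-batch Rscript could look roughly like this; a shell loop would call it as `Rscript run_batch.R <first_row> <last_row>` (the script name and argument layout are hypothetical), and a final Rscript would then combine the saved blocks. The fallback arguments are only there so the sketch also runs standalone, and compute_batch() is again a stand-in for shapr's internals:

```r
# Per-batch Rscript sketch: each invocation is a fresh R process, so all
# memory used by a batch is guaranteed to be released when it exits.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 2) args <- c("1", "3")   # fallback so the sketch runs alone
rows <- seq.int(as.integer(args[1]), as.integer(args[2]))

# Stand-in for shapr's prepare_data()/prediction() on these rows of S:
compute_batch <- function(rows) matrix(rows, nrow = length(rows), ncol = 5)
dt_mat_block <- compute_batch(rows)

# Write this batch's block to disk; a second Rscript combines them afterwards.
out <- file.path(tempdir(), sprintf("dt_mat_rows%s-%s.rds", args[1], args[2]))
saveRDS(dt_mat_block, out)
```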

For now I am just writing down the idea, as I don't have the time to do this at the moment. Hopefully I can get to it early next year.

@martinju martinju self-assigned this Nov 12, 2020
@c-bharat

Hi all,

Is there a simple way of disabling the below ERROR thrown by feature_combinations() when the above batching approach is implemented?

"Currently we are not supporting cases where the number of features is greater than 30."

Thanks in advance.

martinju (Member, Author) replied:

> Is there a simple way of disabling the below ERROR thrown by feature_combinations() when the above batching approach is implemented?
>
> "Currently we are not supporting cases where the number of features is greater than 30."

Hi @c-bharat Yes, when the batch mode is implemented, that error will be disabled. Currently, we have set it simply to "help" the user: unless you have a lot of CPU time and memory available, estimates with more than 30 features will NOT be trustworthy, as the Monte Carlo error will be too large.

Unfortunately I can't say for sure when, but it is certainly climbing the TODO list.

@gringle1

gringle1 commented Oct 10, 2023

Hello! I see that batch mode has been implemented in the development version of shapr. Is there now a way to disable this error?

Thank you!

Labels: none · Projects: Towards shapr 1.0.0 (Status: Done) · 3 participants