
fastshap::explain() uses all cores and not those defined by registerDoParallel(cores=X) #75

Open
abussalleuc opened this issue Jan 24, 2024 · 3 comments

@abussalleuc

Hi @bgreenwell

I am using fastshap::explain for a large dataset (4 million rows, 23 columns) on a windows system (512gb ram, 48 cores)

Here is what the code looks like:

t <- fastshap::explain(
  model,                       # ranger::ranger() object
  X = train_set[, vars],       # training set used to fit the model (~1 million rows, but a wider range of predictor values)
  pred_wrapper = pfun,         # prediction function: ranger::predict()$predictions
  newdata = new_data[, vars],  # data set to explain (~4 million rows)
  feature_names = NULL,        # predictor variables of interest (23)
  nsim = 10,
  adjust = TRUE,
  parallel = TRUE,
  .packages = c("ranger")
)

Due to the size of my data set, I don't want to use all cores (to avoid memory issues), so I tried defining the parallel backend as:

cl <- makePSOCKcluster(25)
registerDoParallel(cl)

and as:

registerDoParallel(cores = 25)
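For reference, a minimal sketch of the backend setup as I understand it should work on Windows (the cluster size and the getDoParWorkers() check are illustrative, not from a confirmed working session):

```r
# Hypothetical session: register an explicit 25-worker PSOCK cluster and
# confirm that foreach actually sees it before running the expensive job.
library(doParallel)

cl <- parallel::makePSOCKcluster(25)
doParallel::registerDoParallel(cl)
foreach::getDoParWorkers()  # should report 25 registered workers

# ... fastshap::explain(..., parallel = TRUE, .packages = "ranger") here ...

parallel::stopCluster(cl)   # release the workers when finished
```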

In both cases I have noticed (via Task Manager > Performance) that all of the available cores are being used, not just the ones defined above.

I tried this with data subsets of 4,000 rows, varying the number of simulations and cores, but it still uses more cores than those registered with registerDoParallel (again, based on Task Manager > Performance).
While explain() runs and produces results with the smaller subsets, with the whole data set it uses 100% of my RAM and sometimes the computer crashes.

In total, the model, train_set, and new_data objects weigh ~5 GB, so I don't think it is a good idea to use as many clusters/cores/logical processors as possible.

Am I defining the parallel backend wrong?
Should I instead create a foreach loop over each column, with parallel = FALSE?

Thank you for your time.
best,
Alonso

@abussalleuc
Author

[Screenshot: Capture3]
Here I'm using 500k rows and registerDoParallel(cores = 10).

@brandongreenwell-8451

brandongreenwell-8451 commented Jan 24, 2024

Thanks @abussalleuc. On a Windows system, multicore functionality in R will not work (e.g., specifying cores=25). In your case, I would try setting up the parallel backend as follows:

cl <- makeCluster(25)
registerDoParallel(cl)

Does this seem to fix the issue on your system?

Further, explaining that many rows, even with nsim=1, is going to be terribly slow, I suspect, even with massive parallel processing. I have not tested this on such a large sample, nor do I have access to that many cores, so let me know how it works out!
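One thing that may be worth ruling out on the user side (an assumption on my part, not something confirmed in this thread): ranger's predict() method is itself multithreaded and by default uses all available CPUs, independently of the foreach backend. Pinning it to a single thread inside the prediction wrapper would make total core usage reflect only the registered workers:

```r
# Hypothetical prediction wrapper: force single-threaded ranger predictions
# so that overall core usage is governed by the registered foreach cluster.
pfun <- function(object, newdata) {
  predict(object, data = newdata, num.threads = 1)$predictions
}
```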

@abussalleuc
Author

Hi @brandongreenwell-8451
Thank you for your answer.

I was originally using makePSOCKcluster, which (to my limited knowledge) should work on a Windows machine.
Using cl <- makePSOCKcluster(25) followed by registerDoParallel(cl) would still activate all logical processors.

I tried your suggestion and the issue persists.
If I use the same X (background or training set), cut new_data into smaller subsets, and run each subset separately, would this affect how the SHAP values are calculated?
I understand that for each column and during each simulation, the predictor values are resampled and new predictions are computed, but are they resampled from new_data or from the background/training set?
My training set, although smaller, probably has much more variability in the predictors than new_data.
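If chunking is safe here (my understanding is that each row of newdata is explained independently against the background X, so splitting rows should only affect run time and memory, not the estimates beyond Monte Carlo noise), a sketch of the idea might look like the following (chunk size and object names are illustrative):

```r
# Hedged sketch: explain new_data in row chunks against the same background X,
# then stack the per-chunk SHAP matrices. The 100k chunk size is arbitrary.
idx_chunks <- split(seq_len(nrow(new_data)),
                    ceiling(seq_len(nrow(new_data)) / 1e5))
shap_list <- lapply(idx_chunks, function(idx) {
  fastshap::explain(
    model,
    X            = train_set[, vars],
    newdata      = new_data[idx, vars],
    pred_wrapper = pfun,
    nsim         = 10,
    adjust       = TRUE
  )
})
shap <- do.call(rbind, shap_list)  # one row of SHAP values per explained row
```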

My idea is to use SHAP values to explain correlations between modeled variables that share the same predictors, so my data set is a very small sample from a much larger spatiotemporal extent.

thank you for your time.
best,
Alonso
