I am trying to integrate 40 GB of data via the Seurat package and have started using the future package (as recommended by the Seurat devs for large datasets). I have a few questions:

- I am currently using a cluster that has up to 30 cores/node. I understand that using 30 cores isn't 30x faster (and it is also much more expensive to do that). Given that there may be a diminishing-returns effect, is there any way to find an optimal number of cores to use?

- I am using RStudio (so I can only use multisession, not multicore/forking). Is there a significant disadvantage to doing this? Should I expect faster results with multicore?
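For reference, a minimal sketch of the kind of setup being described here, using the future API that Seurat's parallelization support builds on; the worker count and memory limit below are illustrative assumptions, not recommendations:

```r
library(future)

# Allow larger objects to be exported to the workers; the default
# future.globals.maxSize of 500 MiB is far too small for a large
# Seurat workflow. The 8 GiB value here is illustrative.
options(future.globals.maxSize = 8 * 1024^3)

# Multisession workers (separate background R sessions); this is the
# backend available from within RStudio. The worker count is illustrative.
plan(multisession, workers = 8)
```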
Replies: 1 comment

I don't think there's a go-to answer to this other than trying. For example, use the exact same, small data set, time the processing with 1, 2, 5, 10, 15, and 20 cores on a single machine, and plot the results to see if there's a clear trend. (If other people are running on the same machine, your findings might be confounded by whatever they run at the same time - not an uncommon problem on multi-tenant machines.) Also, I assume you have a cluster with an HPC scheduler, e.g. SGE or Slurm. If that's the case, you should be able to scale out over multiple machines at the same time, e.g. by submitting each benchmark run as a separate job.
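A minimal sketch of such a benchmark, with a hypothetical `run_analysis()` standing in for whatever Seurat step is being timed on the small data set:

```r
library(future)

# Hypothetical stand-in for the workload being benchmarked,
# e.g. an integration step run on a small subset of the data
run_analysis <- function() {
  # ... the actual Seurat workload goes here ...
}

cores <- c(1, 2, 5, 10, 15, 20)
timings <- sapply(cores, function(n) {
  plan(multisession, workers = n)  # set up n background workers
  t <- system.time(run_analysis())["elapsed"]
  plan(sequential)                 # shut down the workers again
  t
})

# Plot elapsed time against core count to spot diminishing returns
plot(cores, timings, type = "b",
     xlab = "Number of cores", ylab = "Elapsed time (s)")
```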
Yes, you will probably see less overhead with multicore. For these types of large tasks, if you have not already done so, I recommend that you familiarize yourself with running R in the terminal. That will allow you to SSH into a compute cluster and run R directly without having to worry about running GUIs remotely, etc. You can save the GUIs for the end of the pipeline, e.g. to load in the final, smaller results and produce plots.
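A sketch of what that could look like, as a hypothetical script `integrate.R` run non-interactively with `Rscript` (where forked multicore workers are available, since no GUI is attached):

```r
#!/usr/bin/env Rscript
# integrate.R - hypothetical non-interactive pipeline script, launched
# from a terminal (or a Slurm/SGE job script) as: Rscript integrate.R

library(future)
library(Seurat)

options(future.globals.maxSize = 8 * 1024^3)  # illustrative limit

# Outside of RStudio, forked processing is supported, so multicore
# (lower overhead than multisession) can be used; count is illustrative
plan(multicore, workers = 8)

# ... run the heavy Seurat integration steps here ...

# Save the final, smaller result so it can be explored later in a GUI,
# e.g.: saveRDS(integrated, "integrated.rds")
```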