I am trying to integrate 40 GB of data via the Seurat package and have started using the future package (as recommended by the Seurat devs for large datasets). I have a few questions:

- I am currently using a cluster that has up to 30 cores/node. I understand that using 30 cores isn't 30x faster (and it is also much more expensive to do that). Given that there may be a diminishing-returns effect, is there any way to find an optimal number of cores to use?

- I am using RStudio (so I can only use multisession, not multicore/forking). Is there a significant disadvantage to doing this? Should I expect faster results with multicore?
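For reference, a minimal sketch of the kind of setup being described here, using the future API that Seurat's parallelization support builds on; the worker count and memory limit below are illustrative assumptions, not recommendations:

```r
library(future)

# Allow larger objects to be exported to the workers; the default
# future.globals.maxSize of 500 MiB is far too small for a large
# Seurat workflow. The 8 GiB value here is illustrative.
options(future.globals.maxSize = 8 * 1024^3)

# Multisession workers (separate background R sessions); this is the
# backend available from within RStudio. The worker count is illustrative.
plan(multisession, workers = 8)
```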
Replies: 1 comment

I don't think there's a go-to answer to this other than trying. For example, use the exact same, small data set, time the processing with 1, 2, 5, 10, 15, and 20 cores on a single machine, and plot the results to see if there's a clear trend. (If other people are running on the same machine, your findings might be confounded by whatever they run at the same time - not an uncommon problem on multi-tenant machines.) Also, I assume you have a cluster with an HPC scheduler, e.g. SGE or Slurm. If that's the case, you should be able to scale out over multiple machines at the same time, e.g. by submitting each benchmark run as a separate job.
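A minimal sketch of such a benchmark, with a hypothetical `run_analysis()` standing in for whatever Seurat step is being timed on the small data set:

```r
library(future)

# Hypothetical stand-in for the workload being benchmarked,
# e.g. an integration step run on a small subset of the data
run_analysis <- function() {
  # ... the actual Seurat workload goes here ...
}

cores <- c(1, 2, 5, 10, 15, 20)
timings <- sapply(cores, function(n) {
  plan(multisession, workers = n)  # set up n background workers
  t <- system.time(run_analysis())["elapsed"]
  plan(sequential)                 # shut down the workers again
  t
})

# Plot elapsed time against core count to spot diminishing returns
plot(cores, timings, type = "b",
     xlab = "Number of cores", ylab = "Elapsed time (s)")
```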
Yes, you will probably see less overhead with multicore. For these types of large tasks, if you have not already done so, I recommend that you familiarize yourself with running R in the terminal. That will allow you to SSH into a compute cluster and run R directly without having to worry about running GUIs remotely, etc. You can save the GUIs for the end of the pipeline, e.g. to load in the final, smaller results and produce plots.
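A sketch of what that could look like, as a hypothetical script `integrate.R` run non-interactively with `Rscript` (where forked multicore workers are available, since no GUI is attached):

```r
#!/usr/bin/env Rscript
# integrate.R - hypothetical non-interactive pipeline script, launched
# from a terminal (or a Slurm/SGE job script) as: Rscript integrate.R

library(future)
library(Seurat)

options(future.globals.maxSize = 8 * 1024^3)  # illustrative limit

# Outside of RStudio, forked processing is supported, so multicore
# (lower overhead than multisession) can be used; count is illustrative
plan(multicore, workers = 8)

# ... run the heavy Seurat integration steps here ...

# Save the final, smaller result so it can be explored later in a GUI,
# e.g.: saveRDS(integrated, "integrated.rds")
```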