bpsharememory() generic and bpmapply() efficiency: Avoid passing a copy of the full data to workers #228
Conversation
This approach is better when the workers and the caller do not share memory. However, for workers that can share memory (e.g. serial or multicore) we may be paying the overhead of transposing the list of lists without reaping any benefit, since passing the whole dataset to the worker may not create any additional copies. I have not checked that; maybe someone else already knows the answer.
In order to transpose the list only when needed, I would need to know whether the workers of the current backend share memory with the current process or not. Is there any method named "sharedMemory" or similar that returns this?
The only way to share memory would be forking, right? (Well, and single-process)
I guess, yes. I'd like to consider backends such as this one (even if it is experimental): https://github.com/HenrikBengtsson/BiocParallel.FutureParam. Do you have an idea in mind?
Hmm, so I think your suggestion of a method might make sense. For most params, this method will return a static boolean value (TRUE for multicore/serial and FALSE for others) but for "meta-backends" like FutureParam and DoparParam, they would need to examine the registered foreach/future backend and return TRUE or FALSE depending on which one is used.
For backward compatibility and future-proofing, the abstract base class (BiocParallelParam) should probably provide a default implementation of this method (which always returns FALSE?), which is then overridden by specific backends to "opt in" to the optimization. |
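A minimal sketch of what such a generic could look like (the name `bpsharememory` comes from the issue title; the exact signature is an assumption, since this was never part of the released API):

```r
## Sketch only: assumes the S4 classes exported by BiocParallel.
library(BiocParallel)

setGeneric(
    "bpsharememory",
    function(x, ...) standardGeneric("bpsharememory")
)

## Conservative default on the abstract base class: assume workers do NOT
## share memory, so specific backends must opt in explicitly.
setMethod("bpsharememory", "BiocParallelParam", function(x, ...) FALSE)

## In-process and forked workers share memory with the caller.
setMethod("bpsharememory", "SerialParam", function(x, ...) TRUE)
setMethod("bpsharememory", "MulticoreParam", function(x, ...) TRUE)
```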
Force-pushed from 0ba52e7 to 67ea521
We now have the best of both worlds: the transposition is applied only when the backend's workers do not share memory, so shared-memory backends keep passing the data without extra copies.
@Jiefei-Wang's SharedObject provides object sharing when everything runs on a single machine. I'm not a fan of
It's actually not correct that
Please also follow the BiocParallel coding convention (such as it is!) with 4-space indentation. Also, I have moved away from aligning arguments with the opening parenthesis of a function name, and instead favor starting arguments (if they don't fit on a single line) on a new line indented 4 spaces, with some flexibility favoring a compact representation.
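For illustration (a hypothetical function name, not from the patch), the preferred layout would be:

```r
## Preferred: arguments start on a new line, indented 4 spaces
result <- some_long_function_name(
    first_argument, second_argument,
    third_argument = TRUE
)

## Discouraged: arguments aligned with the opening parenthesis
result <- some_long_function_name(first_argument,
                                  second_argument,
                                  third_argument = TRUE)
```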
I think it's also important to ask whether this is enough of a problem to warrant a solution. Can you provide a simple reproducible example that illustrates it?
Interesting points about memory sharing, Martin. How do you provide a reproducible example for memory usage? Is there an easy way to get the peak memory usage of the R process and all its children summed?
Measuring memory is challenging of course, but actually I had been under the impression that the discussion was about speed, so an example would certainly help clarify! Maybe Rcollectl would be helpful...
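One crude way to approximate this (a sketch, not from the thread; it assumes the `ps` package and polling while the computation runs):

```r
## Sum the resident set size (RSS) of the current R process and all of
## its child processes (e.g. SNOW or multicore workers).
library(ps)

total_rss_gb <- function() {
    p <- ps_handle()
    procs <- c(list(p), ps_children(p, recursive = TRUE))
    rss <- vapply(procs, function(x) ps_memory_info(x)[["rss"]], numeric(1))
    sum(rss) / 1024^3
}

total_rss_gb()  # poll this in a loop to estimate the peak
```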
Here I just show the problem, which is mostly RAM usage; I'll discuss the solution afterwards.

The problem

Here is some code to demo the issue, which mostly affects RAM usage (and indirectly CPU, because we have to serialize a lot more data). To keep things simple I just use a 1GB list with 1000 elements to iterate on; each element is 1MB long. This was large enough to be noticeable in my system monitor, but small enough not to crash my 16GB laptop. If you have less RAM available than me, please make the dataset smaller to avoid RAM issues. I don't measure the RAM in the script, but I attach two screenshots of my system monitor:

```r
# This example shows the RAM issues of bpmapply()
# 1. Create a list of num_samples elements, where each element will be a numeric
# vector of a given size, such that the whole list takes 1GB of RAM.
# I call this list "samples" because it's like a list of samples to process
#
# On my 16GB laptop, 1GB is easily noticeable, and I won't run out of RAM
# if there is an extra copy or two (but I will notice the bump)
# Dataset size:
dataset_size_GB <- 1 # GB
num_samples <- 1000
sample_size_MB <- dataset_size_GB/num_samples * 1024 # MB
sample_length <- round(sample_size_MB*1024^2/8) # one double is 8 bytes
samples <- lapply(
seq_len(num_samples),
function(i, sample_length) {
runif(sample_length)
},
sample_length = sample_length
)
# 2. Since bpmapply() is designed to take several arguments, we also create
# another list of the same length as the dataset, with one random number.
# I see this as another argument I want to iterate on, but it's not relevant
extra_args <- lapply(
seq_len(num_samples),
function(i) {
runif(1)
}
)
# 3. To show the problem, we can check installing either the master version or
# this pull request:
#
# # Pick one:
# remotes::install_github("Bioconductor/BiocParallel)
# remotes::install_github("Bioconductor/BiocParallel#228")
#
# (The BiocParallel from Bioconductor would work as well)
library(BiocParallel)
# I use three workers
bpparam <- SnowParam(workers = 3, exportglobals = FALSE, exportvariables = FALSE)
process_sample <- local({function(sample, extra_arg) {
force(sample)
force(extra_arg)
# just wait a bit, 1000 samples / 3 workers ~ 333 samples/worker * 0.05 s/sample = 16.6 seconds
Sys.sleep(0.05)
NULL
}}, envir = baseenv())
bpmapply(
process_sample,
sample = samples,
extra_arg = extra_args,
BPPARAM = bpparam,
SIMPLIFY = FALSE
)
```

When running this on the BiocParallel master branch, I see my RAM usage go from 15% to 60%; this is roughly 7GB of RAM. When running this on the branch from this pull request, the bump is much smaller. I may be off in my estimations, since I took them with the naked eye, but it looks pretty clear to me that this is worth it.
There are three points open for discussion:
bpmapply efficient implementation

We have here two implementations of the argument handling: passing the whole list of lists plus an iteration index to every worker, or transposing the list so that each worker receives only its own slice. Using the transposed form avoids serializing the full dataset to every worker.
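A minimal sketch of the transposition under discussion (the helper name is illustrative, not necessarily the one in the patch):

```r
## Turn a per-argument list of lists, ddd[[i]][[j]], into a per-iteration
## list of lists, ddd2[[j]][[i]], so each worker can be sent only its slice.
.transpose_args <- function(ddd) {
    n <- length(ddd[[1L]])
    lapply(seq_len(n), function(j) lapply(ddd, `[[`, j))
}

ddd <- list(
    sample = list("s1", "s2", "s3"),
    extra_arg = list(0.1, 0.2, 0.3)
)
str(.transpose_args(ddd)[[1L]])
#> List of 2
#>  $ sample   : chr "s1"
#>  $ extra_arg: num 0.1
```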
The plan sounds good; thanks for your engagement on this. One thing might be to reconsider the special case for SerialParam -- there's value in having consistency across back ends, especially when one might switch to SerialParam when trying to debug. Also, and perhaps more importantly, the re-organization of data structures might not actually be that expensive -- it is not actually copying the large vector elements, just the S-expressions of the list 'skeleton'. Not sure where you are in your knowledge of R, but inspecting the objects before and after shows what happens: after re-arranging, the list S-expressions have been updated, but the data are being re-used.
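A reconstruction of the kind of demonstration this refers to (the exact objects inspected are not preserved in the thread); `.Internal(inspect())` prints the addresses of the underlying S-expressions:

```r
x <- list(a = list(1:3, 4:6), b = list(7:9, 10:12))
.Internal(inspect(x))

## Re-arrange: new outer lists, same inner vectors.
y <- list(list(x$a[[1]], x$b[[1]]), list(x$a[[2]], x$b[[2]]))
.Internal(inspect(y))

## The INTSXP addresses printed for y match those printed for x: only the
## small outer VECSXP 'skeleton' is newly allocated, never the data.
```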
Oh right! I forgot about R copy-on-write magic! I'll keep it simple then!
I think there are some reasons that I did not touch that code.
@Jiefei-Wang Sorry, I don't have the context of what you were doing.
Your implementation looks good to me; I only have one minor comment. When creating the transposed argument list, you could let mapply() do the work. It will take care of both variable names and unmatched variable lengths.
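Assuming the suggestion is to delegate to `mapply()` (my reading of the comment; the original code snippet is not preserved here), a small illustration:

```r
## mapply(list, ...) builds the per-iteration list directly, recycling
## shorter arguments and keeping argument names on each slice.
ddd2 <- mapply(list, sample = 1:3, extra_arg = c(0.1, 0.2, 0.3),
               SIMPLIFY = FALSE)
str(ddd2[[1L]])
#> List of 2
#>  $ sample   : int 1
#>  $ extra_arg: num 0.1
```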
I didn't do that simplification because when I tried it, some unit tests failed. We could argue whether the unit tests should be changed, but I would rather do that in a different issue. Example of a failing test from the test suite:

```r
library(BiocParallel)
library(RUnit)
X <- list(c(a = 1))
checkIdentical(X, bpmapply(identity, X, SIMPLIFY = FALSE))
```

For your suggestion to work, that test should be changed to:

```r
library(BiocParallel)
library(RUnit)
X <- list(c(a = 1))
checkIdentical(mapply(identity, X, SIMPLIFY = FALSE), bpmapply(identity, X, SIMPLIFY = FALSE))
```

Which makes sense to me, to be honest. There are several other tests that would need to be updated as well. Anyway, this is a subtle difference between bpmapply() and mapply(). In case you want to check that out, here is a branch with the simplification you suggested:

```r
remotes::install_github("zeehio/BiocParallel@fix-simplify-bpmapply-argument-preparation")
```

Or, if you want the code, you can add my remote and check the branch out.
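For reference, the naming subtlety comes from `mapply()`'s `USE.NAMES` handling; a small illustration (not from the thread):

```r
## With USE.NAMES = TRUE (the default), an unnamed character first argument
## becomes the names of the result; lapply() never does this.
mapply(identity, c("a", "b"), SIMPLIFY = FALSE)
#> $a
#> [1] "a"
#>
#> $b
#> [1] "b"

lapply(c("a", "b"), identity)
#> [[1]]
#> [1] "a"
#>
#> [[2]]
#> [1] "b"
```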
If there are no other issues or suggestions, I would suggest merging this pull request and working on that simplification in a new thread. I would rather avoid introducing breaking changes here.
Thanks @zeehio, this looks great! Can you 'squash' this into a single commit (maybe with the magic at https://stackoverflow.com/a/5201642/547331)? That way the history shows only the paths taken rather than the paths not taken... Maybe a commit message summarizing the change.

I guess there have been changes since this started; I can update before merging, but you could also do it if you wanted to.
- merges Bioconductor#228
- bpmapply receives several arguments to iterate on. This ends up being something like:

      ddd <- list(
          arg1 = list(arg1_iter1, arg1_iter2, arg1_iter3),
          arg2 = list(arg2_iter1, arg2_iter2, arg2_iter3)
      )

  The implementation before this commit was passing ddd to all workers as well as the iteration index, and each worker would take the corresponding slice ddd$arg1[[i]] and ddd$arg2[[i]]. For Serial and (sometimes) Multicore backends, where workers share memory, this is very efficient. However, for Snow backends, where workers do not share memory, the whole ddd needs to be serialized, copied to each worker and deserialized, which is very inefficient. In the Snow scenario, it is far more efficient to let the main process "transpose" the `ddd` list so it becomes:

      ddd2 <- list(
          iter1 = list(arg1_iter1, arg2_iter1),
          iter2 = list(arg1_iter2, arg2_iter2)
      )

  Then only pass the corresponding iteration to each worker, reducing the amount of serialization/deserialization needed and the memory footprint significantly. The re-arrangement is not too expensive, and for consistency it is applied to all backends.
- Define helper functions in local() only with the base namespace
- Remove unused .mrename
- Update NEWS
Force-pushed from 6659767 to cf5dd73
I used the squash magic you linked. I updated the news entry and moved the transpose function as suggested. I did this from my phone and I don't have R available here; I can run R CMD check later today to ensure it's all good, or you can beat me to it if you like :)
All my checks pass. Feel free to merge.
Thanks @zeehio, that was really helpful; I added you as 'ctb' to the DESCRIPTION file.
Thanks!
bpmapply() was passing another copy of the full data (besides #227), in this case directly as an explicit extra argument to bplapply().
The arguments to be iterated on are now transposed: instead of having `ddd[[i]][[j]]` be the value of the i-th argument in the j-th iteration, we build a list of the form `ddd[[j]][[i]]`. Then `bplapply()` can directly iterate on `ddd`, passing one element of `ddd` to each worker, instead of passing the whole `ddd` list of lists and the corresponding index.

This approach has a drawback when the arguments to be iterated on are not lists but vectors: in that case the transposition has some overhead, but I would argue that it would make more sense to use something like bpvec or a bpmvec if it existed.
UPDATED: This pull request has been updated so that the drawback above no longer applies. #228 (comment)