Multiple core execution doesn't update environments #107
Comments
Thanks for pointing this out! Yes, they should have the same output. Seems like the multi-core instance is not executing all the universes (or is not executing in the correct environment). I'll take a look.
This seems to be an issue with how we use environments and
Actually, this problem seems to exist for any multicore / multisession library. The problem probably lies somewhere in the use of environments in parallel, but I can't seem to figure out what it is...
I'm going to hazard a guess that mc*apply functions are designed to return a value, not necessarily carry over the side-effects of running code - as how could one tell what those side-effects are? One approach could be requiring the user to be specific about what objects they want to return - if mc*apply functions return one object per function, perhaps that can be the environment? I don't know, I'm spitballing here. I do currently expect to use cluster computing with multiverse in the next month or two, so I have some time to put into this feature if my need arises.
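A minimal sketch of the "return the environment" idea above, using only base R and `parallel`. The helper `eval_and_export` is hypothetical (it is not part of multiverse), and this assumes each universe's results can be exported as a plain list and copied back into the original environment afterwards:

```r
library(parallel)

# Hypothetical helper (not part of multiverse): evaluate `code` in `env`
# and return the environment's contents as a plain list, so the results
# survive the trip back from the worker process.
eval_and_export <- function(code, env) {
  eval(code, envir = env)
  as.list(env, all.names = TRUE)
}

env_list <- list(new.env(), new.env())
code_list <- list(quote({a = 111}), quote({b = 112}))

# mc.cores > 1 relies on forking, which is not available on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

exported <- mcmapply(eval_and_export, code_list, env_list,
                     SIMPLIFY = FALSE, mc.cores = n_cores)

# Copy the exported values back into the original environments
invisible(Map(function(vals, env) list2env(vals, envir = env),
              exported, env_list))

lapply(env_list, ls) # the original environments now hold the new variables
```

The trade-off is that everything in each environment is serialized back to the parent process, which may be costly for large objects.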
tl;dr your approach of rewriting the environments makes sense. I describe below *what I think* is going wrong, but I'll see if @mjskay has any alternative suggestions.

Interesting, so it seems like mc*apply functions do something weird with environments:

```r
library(rlang)
library(purrr)
library(parallel) # provides mcmapply

env_list = list(new.env(), new.env(), new.env(), new.env()) # creates four new environments, with the global env as the parent
code_list = list(expr({a = 111}), expr({b = 112}), expr({c = 113}), expr({d = 114})) # random code
res = mapply(eval, expr = code_list, envir = env_list) # executes the code in each environment
map(env_list, env_names) # returns the names of the variables defined in each environment

env_list_2 = list(new.env(), new.env(), new.env(), new.env())
res = mcmapply(eval, expr = code_list, envir = env_list_2)
map(env_list_2, env_names) # returns `character(0)`
```

On further inspection (based on the approach you described), it seems like mc*apply functions do not return the same environments that were initially used, but rather return entirely new environments:

```r
eval_in_env = function(c, e) {
  eval(expr = c, envir = e)
  e
}
env_mapply = mapply(eval_in_env, code_list, env_list)
map2(env_mapply, env_list, identical) # returns TRUE for all
env_mcmapply = mcmapply(eval_in_env, code_list, env_list)
map2(env_mcmapply, env_list, identical) # returns FALSE for all
```

This second issue is why the output differs: the actual environments in which mc*apply executes the code are not stored anywhere. This makes me wonder if we should just use mc*apply instead of mapply (even for single-core operations) and change how we deal with environments, instead of having two separate pathways...
Yeah, I don't know exactly how R environments work with threads or multiple processes, but I would guess that they can't be shared across them. So I would guess that the parallel versions of apply copy environment contents into a new environment on a separate thread or process and then copy results back upon completion. So they would not be able to directly modify environments in the original thread.
Having a single pathway makes sense. Though, did we end up implementing the crazy tree-of-environments approach or not? Would that need to change for a multithreaded approach? If we are going to change this around, I would suggest moving to {future} at the same time as this should make it easier for users doing this on a cluster with custom setups.
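The guess above about copying can be checked directly. On Unix-alikes the mc* functions in `parallel` fork child processes rather than threads, so a child works on its own copy of every environment. The sketch below uses `mcparallel()` and `mccollect()` (not available on Windows); the variable names are just for illustration:

```r
library(parallel)

env <- new.env()
assign("x", 1, envir = env)

# mcparallel() forks a child process; the assignment below modifies the
# child's copy of `env`, not the parent's.
job <- mcparallel({
  assign("x", 2, envir = env)
  get("x", envir = env)
})
child_value <- mccollect(job)[[1]]

child_value           # the value as seen inside the child process
get("x", envir = env) # the parent's environment is left untouched
```

Only the value of the expression is sent back to the parent, which is consistent with side effects on environments being lost.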
We actually have the tree-of-environments implemented (and I do remember the parallel apply functions working at some point in time), but I don’t think it should be an issue. I'll write some tests for checking parallel execution.
It appears that when code is run across multiple cores, the `.results` object containing the universe environments isn't updated properly. I would expect both of these methods to have equivalent results:
Created on 2022-04-22 by the reprex package (v2.0.1)
I'm running R 4.1.2 with RStudio 2021.09.2 on Ubuntu 20.04 LTS with 11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz × 8. I'm not sure if I'm missing any software libraries to enable this capability.