Cannot find symbols exported to node by parallel::clusterExport #339

Open
renkun-ken opened this issue Sep 19, 2019 · 6 comments
Labels
Backend API (part of the Future API that only backend package developers rely on), feature request, feature/sticky-globals, globals


@renkun-ken

In the following example, I try to export certain variables to each cluster node before any future is created. However, the exported symbols cannot be found when a future is resolved.

I don't want future() to detect globals or export variables. In my use case there are tens of futures that are called periodically (every few minutes), and each run is time-critical, so I don't want the same global variables to be detected and exported to the workers again and again.

library(future)
library(parallel)

test1 <- rnorm(10000)

cl <- makeClusterPSOCK(2)
clusterExport(cl, "test1")
plan(cluster, workers = cl)

f <- future({
  sum(test1)
}, globals = FALSE)
value(f)
stopCluster(cl)
@HenrikBengtsson
Owner

It's because the workers' global environment is intentionally wiped before (and after, if gc=TRUE) each round of evaluation. This is intentional because futures should really not have a memory, e.g. one future should not be able to leave behind side effects that affect later futures.

There's a documented "misfeature" that allows you to disable this (see argument persistent). I'm not particularly fond of it and I don't encourage making use of it; ideally (I think) it'll be removed one day.

Having said that, what you're asking for suggests that there might be room for a way to control the default, initial state of futures, e.g. global variables, options, and env vars that are always set. I'll add it to the list of feature requests.
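
The effect of the wipe can be imitated with the parallel package alone. The following is only a sketch: it assumes the wipe amounts to removing everything from the worker's global environment, and the explicit rm() call stands in for what the future framework does internally (it is not the framework's actual code).

```r
library(parallel)

cl <- makeCluster(1)
test1 <- rnorm(10000)

# Export test1 into the worker's global environment
clusterExport(cl, "test1", envir = environment())
before <- clusterEvalQ(cl, exists("test1"))[[1]]

# Stand-in for the framework's wipe (assumption): clear the worker's global environment
invisible(clusterEvalQ(cl, {
  rm(list = ls(envir = globalenv(), all.names = TRUE), envir = globalenv())
}))
after <- clusterEvalQ(cl, exists("test1"))[[1]]

stopCluster(cl)

print(c(before = before, after = after))
```

After the simulated wipe the exported symbol is gone, which matches the behavior reported in this issue.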

@HenrikBengtsson
Owner

HenrikBengtsson commented Sep 20, 2019

Sticky globals

BTW, you can also do:

library(future)
cl <- makeClusterPSOCK(2)
plan(cluster, workers = cl)

# Export "sticky" globals to all workers
test1 <- rnorm(10000)
my_globals <- list(test1 = test1)
parallel::clusterExport(cl, "my_globals")
dummy <- parallel::clusterEvalQ(cl, { attach(my_globals, name="my_globals"); rm(my_globals); })

With this, you'll see:

> s %<-% search()
> s
 [1] ".GlobalEnv"        "my_globals"        "package:stats"    
 [4] "package:graphics"  "package:grDevices" "package:utils"    
 [7] "package:datasets"  "toolbox:default"   "package:methods"  
[10] "Autoloads"         "package:base"

> y %<-% sum(test1)
> y
[1] -211.5222
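
Why this survives the wipe can be checked with the parallel package alone. This is a sketch under the assumption that the wipe only clears the worker's global environment; the rm() call below is a stand-in for it, not the framework's own code.

```r
library(parallel)

cl <- makeCluster(1)
test1 <- rnorm(10000)
my_globals <- list(test1 = test1)

# Attach the list on the worker, then drop the list itself from its global environment
clusterExport(cl, "my_globals", envir = environment())
invisible(clusterEvalQ(cl, {
  attach(my_globals, name = "my_globals")
  rm(my_globals)
}))

# Stand-in for the wipe (assumption): clear the worker's global environment
invisible(clusterEvalQ(cl, {
  rm(list = ls(envir = globalenv(), all.names = TRUE), envir = globalenv())
}))

# test1 is no longer a global variable, but it is still found via the search path
in_global <- clusterEvalQ(cl, exists("test1", envir = globalenv(), inherits = FALSE))[[1]]
on_path <- clusterEvalQ(cl, "my_globals" %in% search())[[1]]
worker_sum <- clusterEvalQ(cl, sum(test1))[[1]]

stopCluster(cl)
```

Because the attached environment sits on the search path rather than in the global environment, clearing the latter does not touch it.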

HenrikBengtsson added the Backend API label Oct 24, 2019
@HenrikBengtsson
Owner

@renkun-ken, let's revisit sticky globals. There's actually a non-exported rudimentary prototype of this in future 1.17.0. The following illustrates how it can be used right now:

library(future)

## Set up PSOCK workers with sticky globals
cl <- makeClusterPSOCK(2)
test1 <- rnorm(n=10000)
future:::clusterExportSticky(cl, "test1")

plan(cluster, workers=cl)
a <- 42
f <- future({
  sum(a * test1)
}, globals=structure(TRUE, ignore="test1"))
v <- value(f)
print(v)
## [1] 6255.971

Note that globals=structure(TRUE, ignore="test1") tells the future framework to look for global variables but ignore (=don't include) anything named test1. This means that a will be exported but not test1.
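
The "detect, then ignore" idea can be sketched with codetools, which ships with R. This is only an illustration of the concept: the future framework has its own globals detection, and the wrapper function and variable names below are hypothetical.

```r
# Wrap the future expression in a function so that static analysis can inspect it
expr_fun <- function() {
  sum(a * test1)
}

# Free variables of the expression, i.e. candidate globals
candidates <- codetools::findGlobals(expr_fun, merge = FALSE)$variables

# Names to leave out of the export, mirroring ignore = "test1"
ignore <- "test1"
to_export <- setdiff(candidates, ignore)

print(to_export)
```

Only a remains to be shipped with the future; test1 is expected to already exist on the worker.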

To convince ourselves that test1 is indeed not exported each time, we can set:

options(future.globals.maxSize=0.9*object.size(test1))

such that there will be an error if test1 is exported, e.g.

options(future.globals.maxSize=0.9*object.size(test1))
f <- future({
  a*sum(test1)
}, globals = structure(TRUE, ignore="test1"))
v <- value(f)
print(v)
## [1] 6255.971

still works, whereas the following throws an error as expected:

f <- future({
  a*sum(test1)
})
## Error in getGlobalsAndPackages(expr, envir = envir, persistent = persistent,  : 
##   The total size of the 2 globals that need to be exported for the future expression 
## ('{; a * sum(test1); }') is 78.23 KiB. This exceeds the maximum allowed size of
## 70.35 KiB (option 'future.globals.maxSize'). There are two globals: 'test1'
## (78.17 KiB of class 'numeric') and 'a' (56 bytes of class 'numeric').
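
The sizes in this message can be cross-checked by hand. The sketch below assumes the framework measures globals roughly the way object.size() does, which may not hold exactly in every version.

```r
test1 <- rnorm(10000)
a <- 42

# Sizes in bytes, as measured by base R
size_test1 <- as.numeric(object.size(test1))
size_a <- as.numeric(object.size(a))

# The limit set above: 90% of the size of test1
limit <- 0.9 * size_test1

to_kib <- function(bytes) round(bytes / 1024, 2)
print(to_kib(c(test1 = size_test1, a = size_a, total = size_test1 + size_a, limit = limit)))

# Exporting both globals exceeds the limit; exporting only a does not
exceeds_with_test1 <- (size_test1 + size_a) > limit
fits_without_test1 <- size_a <= limit
```

With the limit at 90% of test1, any future that exports test1 must fail, while one that exports only a passes.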

Please see if this provides the minimal basics that you need.

The big challenge will be to avoid having to specify globals = structure(TRUE, ignore="test1"). That would require a mechanism where the main R session computes a checksum of test1, the worker it is to be exported to does the same, and the export is skipped if the two match. Conceptually, something like:

worker <- cl[1]  ## the worker allocated to the current future

## All identified globals
names <- c("a", "test1")
globals <- mget(names)

## The checksums of globals in the main R session
checksums <- vapply(globals, FUN = digest::digest, FUN.VALUE = NA_character_)

## Compare to the checksums of sticky globals on the worker
skip <- parallel::clusterCall(worker, fun = function(checksums) {
  env <- as.environment("future:sticky_globals")
  skip <- logical(length = length(checksums))
  names(skip) <- names(checksums)
  for (name in names(checksums)) {
    if (!exists(name, envir = env, inherits = FALSE)) next
    obj <- get(name, envir = env, inherits = FALSE)
    checksum <- digest::digest(obj)
    skip[name] <- (checksum == checksums[[name]])
  }
  skip
}, checksums = checksums)[[1]]

such that we get:

print(skip)
    a test1 
FALSE  TRUE

@renkun-ken
Author

In my case, there are tens of data.tables (each several gigabytes). I imagine that if checksums of any globals are computed before running futures, performance would suffer.

After all, the very reason I need sticky globals is that there are many big objects in the global environment that should not be touched in any way (e.g. exported or digested), and some objects are exported once and for all (to be sticky) precisely to keep the overhead low before running futures.

@renkun-ken
Author

Therefore, the minimal API that I think would work for me is this: I should be able to export a list of objects to the cluster before running any future, and those exported objects should persist across each run of futures, so that futures can run with minimal overhead (in my case, without detecting any globals) while having direct access to the globals that already exist on the workers. The overall purpose in my use case is to reduce as much overhead as possible before running futures.

@HenrikBengtsson
Owner

I see.

So, then there might be a need for sticky globals that are of class "trust-me-no-need-to-run-checksum". Such sticky globals will only be checked for their existence by name, not by checksum.
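
A name-only check could look roughly like the following. It is a local sketch: sticky_env is a hypothetical stand-in for the environment that would hold the sticky globals on the worker, and nothing here is part of the future package.

```r
# Hypothetical stand-in for the worker-side environment of sticky globals
sticky_env <- new.env(parent = emptyenv())
assign("test1", rnorm(10000), envir = sticky_env)

# All globals identified for the current future
global_names <- c("a", "test1")

# Skip the export of any global that already exists by name; no checksum is computed
skip <- vapply(global_names, FUN = exists, FUN.VALUE = NA,
               envir = sticky_env, inherits = FALSE)
print(skip)
```

The cost is then independent of object size, at the price of trusting that the object on the worker is the intended one.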

BTW, what I didn't show in the above mockup is that one could of course cache the checksums on the worker side, i.e. they only need to be calculated once. Of course, if a new sticky global with the same name is exported, then it'll have to be checksummed again.

Also, with mutable objects such as data.tables, there is a risk that the sticky global is changed on the worker end. This would invalidate any checksums. What is worse, it might no longer be the same object as intended. The point is, there are lots of things that can go wrong here, and my concern is that one might end up with different results when running with plan(sequential) and plan(cluster), say. That is very much against the philosophy of the future framework. This is why these types of features must be introduced with great care.

Finally, ideally there would be a checksum field in the internal SEXP structure of any R object. This way one could calculate its checksum once, after which it would be available in O(1). If the content of the object changes, this checksum could be reset. This would of course require changes to base R itself.
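
Such a cache, including the invalidation on re-export, might be sketched as follows. All names are hypothetical, the environments are local stand-ins for worker-side state, and toy_checksum() is a deliberately simple placeholder for a real hash such as digest::digest().

```r
# Hypothetical worker-side state: sticky globals, cached checksums, and a call counter
sticky_env <- new.env(parent = emptyenv())
checksum_cache <- new.env(parent = emptyenv())
counter <- new.env(parent = emptyenv())
counter$n_computed <- 0L

# Placeholder checksum (assumption): position-weighted sum of the serialized bytes
toy_checksum <- function(obj) {
  bytes <- as.numeric(serialize(obj, connection = NULL))
  sum(bytes * seq_along(bytes)) %% 2147483647
}

# Compute the checksum only if it is not cached yet
cached_checksum <- function(name) {
  if (!exists(name, envir = checksum_cache, inherits = FALSE)) {
    obj <- get(name, envir = sticky_env, inherits = FALSE)
    assign(name, toy_checksum(obj), envir = checksum_cache)
    counter$n_computed <- counter$n_computed + 1L
  }
  get(name, envir = checksum_cache, inherits = FALSE)
}

# (Re-)exporting a sticky global invalidates its cached checksum
export_sticky <- function(name, value) {
  assign(name, value, envir = sticky_env)
  if (exists(name, envir = checksum_cache, inherits = FALSE)) {
    rm(list = name, envir = checksum_cache)
  }
}

export_sticky("test1", rnorm(10000))
first <- cached_checksum("test1")    # computed
second <- cached_checksum("test1")   # served from the cache
export_sticky("test1", rnorm(10000))
third <- cached_checksum("test1")    # computed again after the re-export
```

Note that this cache cannot detect in-place modification of a sticky global on the worker; it is only invalidated by an explicit re-export.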
