Cannot find symbols exported to node by parallel::clusterExport #339

renkun-ken · 2019-09-19T05:14:54Z

In the following example, I try to export certain variables to each cluster node before any future is created. However, the exported symbols cannot be found when a future is resolved.

I don't want future() to detect globals or export variables because, in my use case, there are tens of futures that are called periodically (every several minutes). Each run is time critical, so I don't want the same global variables to be detected and exported to the workers again and again.

library(future)
library(parallel)

test1 <- rnorm(10000)

cl <- makeClusterPSOCK(2)
clusterExport(cl, "test1")
plan(cluster, workers = cl)

f <- future({
  sum(test1)
}, globals = FALSE)
value(f)
stopCluster(cl)


HenrikBengtsson · 2019-09-19T09:24:06Z

It's because the worker's global environment is intentionally wiped before (and after, if gc=TRUE) each round of evaluation. This is intentional because futures should really not have a memory, e.g. one future should not be able to leave behind side effects that affect future futures.

There's a documented "misfeature" that allows you to disable this (see argument persistent). I'm not particularly fond of it and I don't encourage making use of it - ideally (I think) it'll be removed one day.

Having said that, what you're asking for suggests that there might be room for a way to control the default, initial state of futures, e.g. global variables, options, and env vars that are always set. I'll add it to the list of feature requests.

HenrikBengtsson · 2019-09-20T22:27:00Z

Sticky globals

BTW, you can also do:

library(future)
cl <- makeClusterPSOCK(2)
plan(cluster, workers = cl)

# Export "sticky" globals to all workers
test1 <- rnorm(10000)
my_globals <- list(test1 = test1)
parallel::clusterExport(cl, "my_globals")
dummy <- parallel::clusterEvalQ(cl, { attach(my_globals, name="my_globals"); rm(my_globals); })

With this, you'll see:

> s %<-% search()
> s
 [1] ".GlobalEnv"        "my_globals"        "package:stats"    
 [4] "package:graphics"  "package:grDevices" "package:utils"    
 [7] "package:datasets"  "toolbox:default"   "package:methods"  
[10] "Autoloads"         "package:base"

> y %<-% sum(test1)
> y
[1] -211.5222
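
One way to see why this works: attach() puts the list's contents in a separate entry on the search path, not in the global environment that gets wiped. Below is a minimal local sketch without any cluster. The name sticky_list and the toy values are made up for illustration.

```r
# Toy data standing in for the exported globals (hypothetical values)
sticky_list <- list(test1 = c(1, 2, 3))
attach(sticky_list, name = "sticky_list")
rm(sticky_list)

# test1 is not a variable in the global environment itself ...
in_global <- exists("test1", envir = globalenv(), inherits = FALSE)

# ... but it is still found by walking the search path from there
total <- sum(get("test1", envir = globalenv()))

# Clean up the search path entry again
detach("sticky_list")
```

Because symbol lookup continues past the global environment, code on the worker can still resolve test1 after its global environment has been cleared.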

HenrikBengtsson · 2020-06-18T00:03:36Z

@renkun-ken, let's revisit sticky globals. There's actually a non-exported rudimentary prototype of this in future 1.17.0. The following illustrates how it can be used right now:

library(future)

## Set up PSOCK workers with sticky globals
cl <- makeClusterPSOCK(2)
test1 <- rnorm(n=10000)
future:::clusterExportSticky(cl, "test1")

plan(cluster, workers=cl)
a <- 42
f <- future({
  sum(a * test1)
}, globals=structure(TRUE, ignore="test1"))
v <- value(f)
print(v)
## [1] 6255.971

Note that globals=structure(TRUE, ignore="test1") tells the future framework to look for global variables but ignore (=don't include) anything named test1. This means that a will be exported but not test1.
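
As an aside, the value passed to globals here is an ordinary logical that carries an extra attribute. A quick base-R check (nothing below calls the future package):

```r
# A logical TRUE with an "ignore" attribute attached
g <- structure(TRUE, ignore = "test1")

# It is still a logical value ...
still_logical <- is.logical(g)

# ... and the names to leave out are stored in the attribute
ignored <- attr(g, "ignore")
print(ignored)
```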

To convince ourselves that test1 is indeed not exported each time, we can set:

options(future.globals.maxSize=0.9*object.size(test1))

such that there will be an error if test1 is exported, e.g.

options(future.globals.maxSize=0.9*object.size(test1))
f <- future({
  a*sum(test1)
}, globals = structure(TRUE, ignore="test1"))
v <- value(f)
print(v)
## [1] 6255.971

This still works, whereas the following throws an error as expected:

f <- future({
  a*sum(test1)
})
## Error in getGlobalsAndPackages(expr, envir = envir, persistent = persistent,  : 
##   The total size of the 2 globals that need to be exported for the future expression 
## ('{; a * sum(test1); }') is 78.23 KiB. This exceeds the maximum allowed size of
## 70.35 KiB (option 'future.globals.maxSize'). There are two globals: 'test1'
## (78.17 KiB of class 'numeric') and 'a' (56 bytes of class 'numeric').
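
The two sizes in the error message follow from the 0.9 factor: 70.35 KiB is roughly 0.9 times the 78.17 KiB reported for test1. Here is a rough sketch of that arithmetic with a stand-in vector x of the same length. The exact byte count depends on the platform, so no output is shown.

```r
# Stand-in for test1: a numeric vector of the same length
x <- numeric(10000)

# Size in bytes, and the limit set to 90% of it
size  <- as.numeric(utils::object.size(x))
limit <- 0.9 * size

# The same quantities in KiB, as reported by the error message
size_kib  <- size / 1024
limit_kib <- limit / 1024

# Exporting x alone would already exceed the limit
exceeds <- size > limit
```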

Please see if this provides the minimal basics that you need.

The big challenge will be to avoid having to specify globals = structure(TRUE, ignore="test1"). That would require a mechanism where the main R session computes a checksum of test1, compares it with the checksum of the copy on the worker it is to be exported to, and skips the export if they match. Conceptually, something like:

worker <- cl[1]  ## the worker allocated to the current future

## All identified globals
names <- c("a", "test1")
globals <- mget(names)

## The checksums of globals in the main R session
checksums <- vapply(globals, FUN = digest::digest, FUN.VALUE = NA_character_)

## Compare to the checksums of sticky globals on the worker
skip <- parallel::clusterCall(worker, fun = function(checksums) {
  env <- as.environment("future:sticky_globals")
  skip <- logical(length = length(checksums))
  names(skip) <- names(checksums)
  for (name in names(checksums)) {
    if (!exists(name, envir = env, inherits = FALSE)) next
    obj <- get(name, envir = env, inherits = FALSE)
    checksum <- digest::digest(obj)
    skip[name] <- (checksum == checksums[[name]])
  }
  skip
}, checksums = checksums)[[1]]

such that we get:

print(skip)
    a test1 
FALSE  TRUE

renkun-ken · 2020-06-18T00:30:05Z

In my case, there are tens of data.tables (each several gigabytes). I imagine that if any checksum is computed on the globals before running futures, the performance would not be good.

After all, the very reason I need sticky globals is that there are many big objects in the global environment that should not be touched in any form at all (e.g. export, digest), and some objects are exported once and for all (to be sticky) precisely because low overhead is needed before running futures.

renkun-ken · 2020-06-18T00:34:25Z

Therefore, the minimal API I think would work for me could be that I should be able to export a list of objects to the cluster prior to running any future and those exported objects are persistent across each run of futures so that they could run with minimal overhead (in my case, not detect any globals) but have direct access to those globals that already exist. The overall purpose for me in my use case is to reduce as much overhead as possible before running futures.

HenrikBengtsson · 2020-06-18T00:45:22Z

I see.

So, then there might be a need for sticky globals that are of class "trust-me-no-need-to-run-checksum". Such sticky globals will only be checked for their existence by name, not by checksum.
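
A name-only check of this kind could look roughly like the sketch below. It runs locally, with an ordinary environment sticky_env standing in for the worker-side store of sticky globals; that name and the toy values are assumptions for illustration, not part of the future package.

```r
# Hypothetical stand-in for the sticky globals held on a worker
sticky_env <- new.env()
assign("test1", c(1, 2, 3), envir = sticky_env)

# All globals identified for the current future
names <- c("a", "test1")

# Skip a global if a sticky global with that name exists; no checksum is taken
skip <- vapply(names, exists, logical(1), envir = sticky_env, inherits = FALSE)
print(skip)
```

This trades safety for speed: nothing verifies that the object on the worker is the same as the one in the main session.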

BTW, what I didn't show in the above mockup is that one could of course cache the checksums on the worker side, i.e. they only need to be calculated once. Of course, if a new sticky global with the same name is exported, then it'll have to be checksummed again.
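
The caching idea can be sketched in base R as follows. Everything here is hypothetical: checksum_of() is a stand-in for digest::digest() built from saveRDS() and tools::md5sum(), and export_sticky() and cached_checksum() are made-up helper names, not functions in the future package.

```r
# Hypothetical stand-in for digest::digest(): MD5 of the serialized object
checksum_of <- function(obj) {
  tf <- tempfile()
  on.exit(unlink(tf))
  saveRDS(obj, tf, compress = FALSE)
  unname(tools::md5sum(tf))
}

sticky <- new.env()                      # holds the sticky globals
cache  <- new.env(parent = emptyenv())   # holds their cached checksums
n_computed <- 0L                         # counts actual checksum computations

# Store a sticky global; a new object under an existing name drops the
# cached checksum so that it is recomputed on next use
export_sticky <- function(name, value) {
  assign(name, value, envir = sticky)
  if (exists(name, envir = cache, inherits = FALSE)) {
    rm(list = name, envir = cache)
  }
  invisible(NULL)
}

# Return the checksum of a sticky global, computing it only when not cached
cached_checksum <- function(name) {
  if (!exists(name, envir = cache, inherits = FALSE)) {
    n_computed <<- n_computed + 1L
    obj <- get(name, envir = sticky, inherits = FALSE)
    assign(name, checksum_of(obj), envir = cache)
  }
  get(name, envir = cache, inherits = FALSE)
}

export_sticky("test1", c(1, 2, 3))
c1 <- cached_checksum("test1")   # computed
c2 <- cached_checksum("test1")   # served from the cache
export_sticky("test1", c(4, 5, 6))
c3 <- cached_checksum("test1")   # recomputed after the re-export
```

The cache is only valid as long as the stored object is not modified in place.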

Also, with mutable objects such as data.table:s, there is a risk that the sticky global is changed on the worker end. This would invalidate any checksums. What is worse, it might no longer be the same object as intended. Point is, there are lots of things that can go wrong here and my concern is that one might end up with different results when running with plan(sequential) and plan(cluster), say. That is very much against the philosophy of the future framework. This is why these types of features must be introduced with great care.

Finally, ideally there would be a checksum field in the internal SEXP structure of any R object. This way one could calculate its checksum once, and after that it'll be available in O(1). If the content of the object changes, this checksum could be reset. This would of course require changes to base R itself.

HenrikBengtsson added the feature request label Sep 19, 2019

HenrikBengtsson added the Backend API Part of the Future API that only backend package developers rely on label Oct 24, 2019

HenrikBengtsson added the globals label Dec 8, 2019

HenrikBengtsson mentioned this issue Dec 8, 2019

Need API like parallel::clusterCall #273

Open

HenrikBengtsson mentioned this issue May 15, 2020

DEPRECATION: local=FALSE and persistent=TRUE #382

Closed

HenrikBengtsson added this to the Future release (not next) milestone Jun 18, 2020

jeffkeller87 mentioned this issue Oct 24, 2020

Tuning/reducing worker overhead costs #437

Closed

HenrikBengtsson added the feature/sticky-globals label Dec 26, 2020
