DEPRECATION: local=FALSE and persistent=TRUE #382

HenrikBengtsson · 2020-05-15T17:00:32Z

I'm planning to phase out support for local = FALSE and possibly also persistent = TRUE on cluster futures. I'm posting this issue to clarify that this is on the roadmap and to provide a location for discussion this move. If someone out there make use of local = FALSE or persistent = TRUE, I'd like to hear how and why. It might be that one solution is to provide workarounds with future backend solutions that support these but are not fully compliant to the Future API.

The main reason for deprecating local = FALSE is that it may affect the results and it is not supported by all backend. This is not safe because as a developer you don't know what backend the end-user will use and as an end-user you might break the developer's intentions if you set it for a backend. It was added to the Future API in the early days during its "experimental" phase. Also, keeping support for argument local prevents further development of the Future API.

The main reason for persistent = TRUE is it's use in remote futures where you want to control a remote R session as if it was your main one, but control it from your local one. An example of this can be seen in https://cran.r-project.org/web/packages/future/vignettes/future-3-topologies.html. I also think that some people use

cl <- future::makeClusterPSOCK(...)
plan(cluster, workers = cl, persistent =TRUE)
parallel::clusterExport(cl, c("some", "large", "global", "objects"))

to avoid large objects being exported/transferred each time some "heavy" future are launched. If this is a common use case, please note that I've got some ideas on adding support for sticky globals (#339 (comment)). For this to work across the board and with all backends, this feature needs to be optional, i.e. globals are sticky and cached on workers if supported and up-to-date, otherwise they are re-exported. With support for the latter, the need of the above kind of persistent = TRUE is less needed. Support for persistent = TRUE is a bit of a hack and like local it blocks taking the Future API to the next level.

The text was updated successfully, but these errors were encountered:

renkun-ken · 2020-06-05T04:43:04Z

Thanks for the idea of sticky globals.

In my use case in practice, I need both sticky globals exported to all cluster nodes in the beginning. They will not be changed any more.

And in each iteration every couple of minites, there are several less sticky variables that needs to be exported to all cluster nodes, such as the current iteration number, timestamp, etc, all nodes need to sync these variables before they start to process their tasks. To attach a list of objects in the search path looks nice when they are permenent and not subject to change. But what about these variables that actually needs update each time? Is it compatible with the roadmap or design of future to use clusterExport each time before starting tasks?

HenrikBengtsson · 2020-06-08T03:22:07Z

Is it compatible with the roadmap or design of future to use clusterExport each time before starting tasks?

Calling clusterExport() just before creating futures is not really part of the Future API and how futures should be used - simply because you shouldn't make assumptions about future/parallel backends when using futures. For instance, your code would break if plan(sequential) would be used.

Maybe you meant calling clusterExport() when setting up a parallel PSOCK cluster. Then, yes, in some sense it should theoretically be compatible with what I'm planning on doing. But, rather than exporting objects to the workers' global environment, it's probably safer to put those objects in a separate environment on the search() path as was prototyped in #339 (comment).

renkun-ken · 2020-06-08T03:44:31Z

Yes, I'm only talking about PSOCK cluster. Since PSOCK cluster is the only backend that works with my use case and I know exactly what I'm doing with it but some particular operations have to be done to make it more performant as needed. I'm curious that if the compatibility of different backend is top concern, and the real difference of operating on different backends are not accessible by user, it would be an obvious limitation for user to take advantage a certain backend.

HenrikBengtsson · 2020-06-09T03:29:32Z

My goal, without having to give up on the essential design objective that all future code should be able to run on any type of future backends, is to make it possible for these type of needs to be more efficient on certain backends than others. A caching framework for globals which may only apply to certain backends is one example of this.

In the ideal world, such caching could be fully automated. For example, instead of exporting all globals (as done now), the main R process could checksum the global objects and ask the worker whether it already has such an object or not. If it already exists on the worker, then the object does not have to be transferred (and thereby saving time and I/O). If it already exists, it will be transferred as now. What is complicated is how to specify which globals could/should be cached and which shouldn't. This would require some additional syntax. Caching all objects can quickly consume too much memory. Then we some type of syntax for communicating when the cache can be cleared.

I should also say, moving toward persistent = FALSE and making that the only option, does not preventing you from storing data persistently on the worker. persistent = FALSE/TRUE only controls whether the global environment should be wiped on the worker or not. Some rudimentary form of "sticky globals" will be in place before dropping support for persistent = TRUE. Maybe even by introducing helper functions such as clusterExportSticky() and clusterEraseSticky().

dwachsmuth · 2020-07-16T15:19:27Z

I'm currently using persistent = TRUE in order to set separate proxy servers for multithreaded web scraping functions I use. The use case is a little complex, but basically the Selenium headless browser looks for a list of arguments in the calling environment, and when I want to launch n browsers (one per future), I initialize the clusters first, put the proxy server information into each future with parallel::clusterApply, and then use foreach with %dopar% to do the scraping within the futures.

The relevant code looks like:

cl <- parallel::makeCluster(cores, setup_strategy = "sequential")
future::plan(future::cluster, workers = cl, persistent = TRUE)
proxies <- c("A", "B", "C", "D") # Actual proxy servers are specified here
parallel::clusterApply(cl, proxies, function(x) {proxy <<- x})
foreach(i = seq_along(URLS_to_scrape)) %dopar$ {
  # Do the scraping here
}

Would it be possible to supply the proxy servers as globals, but send separate globals to each future? (That's the key behaviour I need in this use case: each future should have a different proxy server specified.)

HenrikBengtsson · 2020-07-16T17:02:49Z

Hi. Thanks for reaching out with your usecase. Yes, you'll be able to achieve what you need also without using persistent=TRUE. Below is how you can, and actually should, do it already now.

I took the freedom to rewrite what you're doing right now, to help clarify what can be done:

## Set up cluster workers
cl <- parallel::makeCluster(4L, setup_strategy = "sequential")

## Customize each worker
proxies <- c("A", "B", "C", "D") # Actual proxy servers are specified here
stopifnot(length(proxies) == length(cl))  ## Implicit assumption
dummy <- parallel::clusterApply(cl, proxies, function(x) {
  config <- list(proxy=x)                                    
  env <- globalenv()                           
  env[["config"]] <- config
})

## Use these cluster workers for futures
future::plan(future::cluster, workers = cl, persistent = TRUE)

## The foreach to parallelize with futures
library(doFuture)
registerDoFuture()

URLS_to_scrape <- 1:8
res <- foreach(i = seq_along(URLS_to_scrape)) %dopar% {
  # Do the scraping here
  data.frame(i=i, proxy=config$proxy)
}
res <- do.call(rbind, res)
print(res)
###   i proxy
### 1 1     A
### 2 2     A
### 3 3     B
### 4 4     B
### 5 5     C
### 6 6     C
### 7 7     D
### 8 8     D

Now, the main reason for you using persistent=TRUE above is actually that futures do not automatically clean up the global environment of the workers. If that wouldn't take place, you would have to use persistent=TRUE.

Now, you can "attach" variables elsewhere on the search() path, where the easiest way to to create a new environment on position 2 immediately after the global environment. This way, if config is neither in the local environment nor the global environment, then it will be found in this custom environment. Here is how you do this:

## Set up cluster workers
cl <- parallel::makeCluster(4L, setup_strategy = "sequential")

## Cusomize each worker
proxies <- c("A", "B", "C", "D") # Actual proxy servers are specified here
stopifnot(length(proxies) == length(cl))  ## Implicit assumption
dummy <- parallel::clusterApply(cl, proxies, function(x) {
  config <- list(proxy=x)                                    
  attach(list(config=config), name="worker-config")
})

## Use these cluster workers for futures
future::plan(future::cluster, workers = cl)

## The foreach to parallelize with futures
library(doFuture)
registerDoFuture()

URLS_to_scrape <- 1:8
res <- foreach(i = seq_along(URLS_to_scrape)) %dopar% {
  # Do the scraping here
  data.frame(i=i, proxy=config$proxy)
}
res <- do.call(rbind, res)

Note how there is no persistent=TRUE above.

... (That's the key behaviour I need in this use case: each future should have a different proxy server specified.)

Just to be 100% clear here; Because of chunking, one future may process multiple foreach iterations, as the above res illustrates. I assume you meant to say: "each ~~future~~ worker should have a different proxy server specified."

dwachsmuth · 2020-07-17T01:38:24Z

Thanks so much for the response. I think you've nailed it here; you're right that I meant that each worker should have a different proxy server specified, and your suggested rewrite works perfectly! I will remove the persistent = TRUE argument in my next update.

HenrikBengtsson · 2020-07-27T23:57:45Z

HEADS UP: The next step on this issue will be to deprecate local = FALSE except when persistent = TRUE. This will help phase out the local argument without breaking the current use of persistent = TRUE of 'cluster' futures. The deprecation of argument local will probably start by:

if (!local) {
  .Deprecated(msg = "Using 'local = FALSE' is deprecated and will soon be defunct and produce an error")
}

for all cases where persistent is not used or it is FALSE.

Deprecation of persistent = TRUE will come at a later stage when I/we know all needs can be met.

HenrikBengtsson · 2020-09-30T21:15:22Z

I've deprecated local = FALSE in the next release (commit 549ac88). The only allowed exception for this is when using persistent=TRUE with cluster futures. (The latter will eventually be deprecated as well, but it'll take a bit longer)

HenrikBengtsson · 2020-10-21T00:14:34Z

I'm closing this one; local = FALSE is now deprecated as far as possible. The remaining part of deprecating persistent = TRUE is tracked by Issue #433.

… only `local = FALSE` was defunct [#382]

… only `local = FALSE` was defunct [#382] [ci skip]

….e. don't break it just yet [#382]

…ng when {civis} runs in batch mode; in interactive mode, it's defunct [#382]

HenrikBengtsson added help wanted Frontend API Part of the Future API that users of futures rely on Backend API Part of the Future API that only backend package developers rely on cluster (PSOCK) labels May 15, 2020

HenrikBengtsson pinned this issue May 15, 2020

HenrikBengtsson added this to the Next release milestone Jul 27, 2020

DavisVaughan mentioned this issue Aug 6, 2020

Version of functions that leave data where it is? DavisVaughan/furrr#46

Closed

HenrikBengtsson modified the milestones: 1.19.0, Next release Sep 18, 2020

HenrikBengtsson added a commit that referenced this issue Sep 30, 2020

'local = FALSE' is deprecated [#382]

549ac88

HenrikBengtsson added a commit that referenced this issue Oct 4, 2020

It's ok for transparent to use local=FALSE [#382]

45b9e5c

HenrikBengtsson unpinned this issue Oct 11, 2020

HenrikBengtsson pinned this issue Oct 11, 2020

HenrikBengtsson mentioned this issue Oct 21, 2020

Deprecate support for persistent=TRUE in cluster futures #433

Open

3 tasks

HenrikBengtsson closed this as completed Oct 21, 2020

HenrikBengtsson unpinned this issue Oct 24, 2020

HenrikBengtsson added a commit that referenced this issue Jan 11, 2023

CLEANUP: Argument local is defunct and has been removed. Previously…

f5b4724

… only `local = FALSE` was defunct [#382]

HenrikBengtsson added a commit that referenced this issue Jan 11, 2023

CLEANUP: Argument local is defunct and has been removed. Previously…

3b39c87

… only `local = FALSE` was defunct [#382] [ci skip]

HenrikBengtsson added a commit that referenced this issue Jan 27, 2023

Allow the 'civis' package to use argument 'local' for a tad longer, i…

f0135d1

….e. don't break it just yet [#382]

HenrikBengtsson added a commit that referenced this issue Feb 9, 2023

Be more conservative and de-escalate the local=FALSE error to a warni…

e8c8a9a

…ng when {civis} runs in batch mode; in interactive mode, it's defunct [#382]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEPRECATION: local=FALSE and persistent=TRUE #382

DEPRECATION: local=FALSE and persistent=TRUE #382

HenrikBengtsson commented May 15, 2020

renkun-ken commented Jun 5, 2020

HenrikBengtsson commented Jun 8, 2020

renkun-ken commented Jun 8, 2020

HenrikBengtsson commented Jun 9, 2020

dwachsmuth commented Jul 16, 2020

HenrikBengtsson commented Jul 16, 2020 •

edited

Loading

dwachsmuth commented Jul 17, 2020

HenrikBengtsson commented Jul 27, 2020

HenrikBengtsson commented Sep 30, 2020

HenrikBengtsson commented Oct 21, 2020

DEPRECATION: local=FALSE and persistent=TRUE #382

DEPRECATION: local=FALSE and persistent=TRUE #382

Comments

HenrikBengtsson commented May 15, 2020

renkun-ken commented Jun 5, 2020

HenrikBengtsson commented Jun 8, 2020

renkun-ken commented Jun 8, 2020

HenrikBengtsson commented Jun 9, 2020

dwachsmuth commented Jul 16, 2020

HenrikBengtsson commented Jul 16, 2020 • edited Loading

dwachsmuth commented Jul 17, 2020

HenrikBengtsson commented Jul 27, 2020

HenrikBengtsson commented Sep 30, 2020

HenrikBengtsson commented Oct 21, 2020

HenrikBengtsson commented Jul 16, 2020 •

edited

Loading