Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using the future package to aid in determining symbols and packages to export #2

Open
richcalaway opened this issue Aug 1, 2019 · 0 comments

Comments

@richcalaway
Copy link
Collaborator

@richcalaway richcalaway commented Aug 1, 2019

Background

Original implementation

In April 2018, in an unreleased version 1.4.6 of foreach, Rich Calaway introduced a new feature that used Henrik Bengtsson's future package (if available) to determine the appropriate symbols and packages to export to the foreach workers. In the initial implementation, future was used if it was installed and its namespace could be loaded via requireNamespace. Henrik responded with the following comments:

COMMENT #1: Surprised users

When future::getGlobalsAndPackages() will be used or not might
surprise users. Your proposal to use it when

requireNamespace("future", quietly=TRUE) == TRUE

risks confusing the end users and whoever helps troubleshooting issues
related to globals. They might get one result one day, and another
result the next day, just because the 'future' package happened to be
installed in between. The same if they run on different systems. If
someone runs into issues related to globals, the troubleshooting
counter questions have to involve: "Do you have the 'future' package
installed?" (unless they share a proper sessionInfo() that is).

One slightly less confusing condition would be to condition it on:

("future" %in% loadedNamespaces())

I've seen that style used in some packages. The user has to enable
the "feature" by making sure a package is loaded. However, that is
still not transparent to the end user since the 'future' package might
be loaded as a side effect by some other package. For instance, it
may be loaded when using registerDoSeq(), but then not when they retry
with registerDoParallel().

It might be better to let the user explicitly control this via an R option, e.g.

useFuture <- getOption("foreach.globalsAs", default = "future")

or default = "foreach". A middle ground, during your migration, is to
use something like:

useFuture <- getOption("foreach.globalsAs", default = NULL)
if (is.null(useFuture)) {
useFuture <- requireNamespace("future", quietly=TRUE)
if (useFuture) {
warning('foreach() will identify globals and packages using the
future package because that package is installed. To suppress this
warning set options(foreach.globalsAs = "future"). To use the
traditional approach of foreach for identifying globals, use
options(foreach.globalsAs = "foreach").')
} else {
warning('foreach() will identify globals and packages using the
foreach package because the alternative based on the future package
requires that future is installed. To suppress this warning set
options(foreach.globalsAs = "foreach"). To use future for identifying
globals, install that package.')
}
}

COMMENT #2: Maximum total size of globals

future::getGlobalsAndPackages() has a built-in protection for
exporting too large amounts of globals. It's currently set to 500
MiB, and currently only controlled via option
'future.globals.maxSize'. To minimize the surprise here, you might
wanna temporarily set this to +Inf to disable this check/assertion,
e.g.

gp <- local({
oopts <- options(future.globals.maxSize = +Inf)
on.exit(options(oopts))
future::getGlobalsAndPackages(ex, envir = env)
})

FYI, in the next release of the future package, you'll be able to do
future::getGlobalsAndPackages(..., maxSize = +Inf).

COMMENT #3: future+globals vs foreach (<= 1.4.5)

I've identified one use case where some people might say that the
'globals' package is too conservative. The case is when a variable is
global or local conditionally on some other variable/state, e.g.

{
if (runif(1) < 1/2) y <- 0
y
}

In future+globals, the above will NOT pick up 'y' as a global variable
(*), whereas in foreach (<= 1.4.5), it is identified as a global
variable (because you are using a more liberal approach).
Unfortunately, I don't this there is an easy solution to handle the
above ambiguous case, where a variable is global or local depending on
the run-time state, without adding significant processing overhead.
OTH, I think this is an oversight by the developer (or at least a
deliberate hack) and I don't mind "forcing" code to break in these
ambiguous cases since it's quite easy to make it non-ambiguous. So
far I've only identified one case of this in the wild:

example("avNNet", package = "caret", run.dontrun = TRUE)

so I think you shouldn't expect much reports on that.

(*) The main reason for future+globals failing to find 'y' here is
because it is a side-effect of its "ordered" code inspection for the
purpose of identifying 'x' as a global in { x <- x + 1 }. I'm
tracking this over at HenrikBengtsson/globals#31

First revised implementation

Rich revised the implementation in foreach 1.5.0/1.5.1, adding a check for a global option foreachGlobals to see if the user preferred the original foreach functionality (which found some global symbols but did not search for additional package exports), and expanding the symbol search to include the union of those found by future's getGlobalsAndPackages function and those found by foreach's original method.

This revision addressed issue #3 and part of #1, but did not address the size of globals problem.

Henrik responded with the following additional comments (and Rich responded with >RBC inline comments):

  • DOCUMENTATION: A user who sets options(foreachGlobals = "foreach")
    may wonder how to undo that. From the code, it looks like one can do
    this by unsetting the option, i.e. options(foreachGlobals = NULL). It
    would help to document that too.

RBC --Agreed

  • DOCUMENTATION: It says "Beginning with foreach 0.5.0, foreach will
    use the future package, if available, to automatically detect
    needed packages." Without seeing the code, the term "available" is a
    bit ambiguous to the user. Maybe write "..., if it is installed and
    can be loaded, ..." instead. And clarify further by "If the future
    package is not installed, then the identification of globals will be
    done as if options(foreachGlobals = "foreach") was set."

RBC I'll work on this.

  • SURPRISE FACTOR: The silent, conditional requireNamespace("future")
    behavior may still be confusing, surprising, ... and makes it hard to
    troubleshoot. It's likely that there'll be comments like "weird, it
    works for me" kind of discussions.

RBC -- Agreed, but that's part of the point--I want to get the benefit of future as seamlessly as possible.
RBC If users need to set an option, most users won't.

  • CLEANUP: To avoid loading 'future' when not needed, you might wanna
    swap the order to if (!identical(getOption("foreachGlobals"),
    "foreach") && requireNamespace("future", quietly=TRUE)){){ ... }

RBC: Good idea!

  • PREDICTABILITY: You could introduce options(foreachGlobals =
    "future+foreach"), which if set, will produce an error if the 'future'
    package is not installed. OTH, not sure what the current default
    should be named - maybe "future+foreach-or-foreach"?
  • doFuture: If you could support/standardize on options(foreachGlobals
    = "future+foreach") and possible also options(foreachGlobals =
    "future"), then I could rely on that in doFuture rather than adding
    another option.

RBC: Yes, I think I can do this.

  • ROBUSTNESS: I've also played around with the idea of something like
    options(foreachGlobals = "manual") which will disable the automatic
    identification of globals and rely solely on arguments .export and
    '.packages`.

RBC: Another good idea.

Henrik responded to Rich's responses with the following additional comments:

I still argue its a bad idea that the behavior is conditioned on
whether you have a package installed or not. What is even worse is
that the dependent package (here 'future') may be installed at a
random time because it is installed together with some other package.
All of a sudden the behavior changes and the user has no idea why
because they did not change anything.

BTW, do you have a better place than an email thread where this can be
discussed and be properly documented? There's a great risk that
things are getting lost in nested email threads. I would love if you
would bring foreach to GitHub where the issue tracker provides an
excellent communication channel. I believe there's valueable feedback
from other developers that would reach you if you'd be on GitHub.
Also, I have two suggestions that I think would improve foreach a lot-
but I'll wait with those for now.

The second of these final comments is addressed with this issue; foreach has indeed been brought to GitHub and this issue is the first to be set into the issue tracker.

Current Development

Rich has handed off maintenance of foreach to Hong Ooi, but is leaving one final pass at the future integration as a branch richcala/foreachFutureTake3. This addresses most of the remaining issues--including the surprise factor. In this implementation, if the user is not using Microsoft R Open, the future option is added only if one of the foreachGlobals options "future+foreach", "foreach+future", or "future" is explicitly set. All of those act the same, however--the union of future and foreach previously implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.