How to reassign a specific task when some remote cluster connection closed? #188

Open
seonghobae opened this issue Jan 14, 2018 · 2 comments


@seonghobae

Hi,

I want to know how to reassign a specific task whose result could not be retrieved because a remote cluster connection closed.

Sometimes, when I work in a for loop with listenv::listenv(), the remote cluster connection closes because of internet connectivity issues, or the remote R session dies unexpectedly.

When that happens, I do not know which tasks failed, were lost, or succeeded, so I end up with an incomplete result or none at all.

Currently I handle it as shown in the code linked below, with my limited knowledge, but it wastes a lot of time in many situations. I am trying to parallelise over one thousand model estimation tasks during Bayesian calibrations.

https://github.com/seonghobae/kaefa/blob/master/R/kaefa.R#L441-L462
https://github.com/seonghobae/kaefa/blob/master/R/newEngine.R#L226-L435

Here are my detailed questions:

  1. How can I reassign a specific task whose result could not be retrieved because a remote cluster connection closed?
  2. How can I tell which jobs failed because of connectivity issues (as opposed to non-convergence)?
  3. Does future() have a failover (robustness) mode that retries automatically when a remote cluster disconnects unexpectedly?

Best,
Seongho

@HenrikBengtsson
Owner

HenrikBengtsson commented Jan 29, 2018

This is related to questions/comments in Issue #154. There's currently no automatic, built-in "failover" mechanism in the future framework.

Here's a minimal reproducible example that emulates an R worker going down:

library("future")
plan(multisession, workers = 2L)
f <- future( quit("no") )
v <- value(f)
# Error in unserialize(node$con) : 
#  Failed to retrieve the value of MultisessionFuture from cluster node #1 (on 'localhost').  The reason reported was 'error reading from connection'

As a first step, the future framework needs a way to distinguish this type of error from regular errors produced by evaluating the future expression itself. This minimal extension is on my todo list.

@HenrikBengtsson
Owner

A quick follow-up: with the release of future 1.8.0, the first two items below should now be possible:

  1. How can I reassign a specific task whose result could not be retrieved because a remote cluster connection closed?
  2. How can I tell which jobs failed because of connectivity issues (as opposed to non-convergence)?
  3. Does future() have a failover (robustness) mode that retries automatically when a remote cluster disconnects unexpectedly?

Errors due to orchestration of futures (e.g. connection errors) are now of class FutureError, which can be caught by tryCatch() and friends. For instance,

> library("future")
> plan(multisession, workers = 2L)
> f <- future( quit("no") )
> res <- tryCatch(v <- value(f), FutureError = identity)
> str(res)
List of 2
 $ message: chr "Failed to retrieve the value of MultisessionFuture from cluster node #1 (on 'localhost').  The reason reported "| __truncated__
 $ call   : language unserialize(node$con)
 - attr(*, "class")= chr [1:5] "FutureError" "simpleError" "error" "FutureCondition" ...
 - attr(*, "future")=Classes 'MultisessionFuture', 'ClusterFuture', 'MultiprocessFuture', 'Future', 'environment' <environment: 0x40122a8> 
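Building on the tryCatch() call above, here is a minimal sketch of how one might record, per task, whether it succeeded, was lost to an orchestration problem, or failed during evaluation. The helper name collect_status() and the status labels are hypothetical and not part of the future package; the sketch assumes a plain list of futures and that value() re-signals evaluation errors as regular errors.

```r
# Hypothetical helper (not part of the future package).
# `futures` is a plain list of futures; `get_value` defaults to future::value
# and is only looked up when no replacement is supplied.
collect_status <- function(futures, get_value = future::value) {
  status <- character(length(futures))
  values <- vector("list", length(futures))
  for (i in seq_along(futures)) {
    out <- tryCatch(
      list(status = "ok", value = get_value(futures[[i]])),
      # Orchestration problems (e.g. a closed connection) are matched first
      FutureError = function(e) list(status = "lost", value = e),
      # Any other error comes from evaluating the expression itself
      error = function(e) list(status = "failed", value = e)
    )
    status[i] <- out$status
    values[i] <- list(out$value)  # list() keeps NULL values in place
  }
  list(status = status, values = values)
}
```

With such a result, which(res$status == "lost") would give the indices of the tasks that need to be reassigned, while "failed" tasks (for example, non-convergence) can be handled separately.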

The actual relaunching of a failed future is discussed in Issue #205; more work is definitely needed there.

Another issue is what happens to the state of the workers and how to recover them. A naive approach is to restart the workers by temporarily switching to another plan and back:

plan(sequential)
plan(multisession, workers = 2L)
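Until something is built in, catching FutureError and restarting the workers could be combined into a manual retry loop. The following is only a rough sketch under stated assumptions: retry_on_future_error() and its arguments launch, max_tries, and reset are hypothetical names, not part of the future package. Regular evaluation errors are deliberately not caught, so they propagate on the first attempt.

```r
# Hypothetical helper (not part of the future package).
# `launch` is a zero-argument function that creates the future and returns
# its value; `reset` is called before each retry, e.g. to restart workers.
retry_on_future_error <- function(launch, max_tries = 3L,
                                  reset = function() NULL) {
  for (attempt in seq_len(max_tries)) {
    res <- tryCatch(
      list(ok = TRUE, value = launch()),
      # Only orchestration errors are caught; other errors pass through
      FutureError = function(e) list(ok = FALSE, error = e)
    )
    if (res$ok) return(res$value)
    message(sprintf("Attempt %d failed: %s",
                    attempt, conditionMessage(res$error)))
    if (attempt < max_tries) reset()
  }
  # All attempts failed: re-signal the last FutureError
  stop(res$error)
}
```

With the future package attached, one might call it as retry_on_future_error(function() value(future(fit(i))), reset = function() { plan(sequential); plan(multisession, workers = 2L) }), where fit(i) stands in for the actual task. Note that resetting the plan this way restarts all workers, so other futures still running on them would be lost as well.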
