
Is it possible to recover when the worker process segfaults? #11

Open
pfernique opened this issue Sep 21, 2020 · 5 comments

Comments

@pfernique

pfernique commented Sep 21, 2020

Hi,

I'm trying to launch some processes that can sometimes segfault (and this can't be predicted or avoided, since I don't have the source code).
This MWE code is behaving as I want with the future::multiprocess plan (i.e., it returns 40 in the last line).

future::plan(future::multiprocess,
             workers = 2)

segfault <- function() {
  system('kill -11 $PPID')
}

processes <- c()

for (i in seq(1, 40)) {
  if (i == 20) {
    processes <- c(processes, withCallingHandlers({ future::future({segfault()}, lazy = TRUE, earlySignal=FALSE) }, error=function(...) {}))
  } else {
    processes <- c(processes, future::future({ Sys.sleep(2); i }, lazy = TRUE, earlySignal=FALSE))
  }
}

future::resolve(processes)
future::value(processes[[40]])

But this MWE code is not behaving as I want with the future.callr::callr plan (i.e., it throws Error in readRDS(res) : error reading from connection).

future::plan(future.callr::callr,
             workers = 2)

segfault <- function() {
  system('kill -11 $PPID')
}

processes <- c()

for (i in seq(1, 40)) {
  if (i == 20) {
    processes <- c(processes, withCallingHandlers({ future::future({segfault()}, lazy = TRUE, earlySignal=FALSE) }, error=function(...) {}))
  } else {
    processes <- c(processes, future::future({ Sys.sleep(2); i }, lazy = TRUE, earlySignal=FALSE))
  }
}

future::resolve(processes)
future::value(processes[[40]])

Is this normal, or do you have any idea why?
Note that I'm using future v1.15.1 and future.callr v0.5.0.

@pfernique
Author

I tried with future::multicore and it's working fine, but with future::multisession I get a similar problem (Error in unserialize(node$con) : error reading from connection).

@HenrikBengtsson
Owner

Is this normal, or do you have any idea why?

Yes. There's lots of exception handling done in the future framework, and some of it is even recoverable, but kicking workers far off the track is not automagically taken care of.

Before anything else, use multicore or multisession explicitly. The multiprocess plan is just an alias for one of them, depending on your operating system. I'm going to phase out multiprocess because it is ambiguous (e.g., I don't know what OS you're running here, but reading between the lines of your error reports, it sounds like you're running on MS Windows).

In the multisession case, we run PSOCK background workers (as defined by the parallel package) that communicate over a socket connection. If you kill a background worker, the communication with the main R session is likely to become corrupted.

With future.callr::callr, which is handled by the callr package, you get similar errors because callr communicates via the file system, and a half-written file is corrupt.

In the multicore case, workers are forked processes. Knocking those offline will confuse the main R process because it can no longer find a way to communicate with its child process. The symptom will be something like a message "An irrecoverable exception occurred. R is aborting now ..." from the forked process. On MS Windows, multicore equals sequential, which means the above example will kill the main R session.

In summary, what you're asking for is not part of the current future backend design. To support it in general would require lots of work. Even if it's on the long-term roadmap, several things need to fall into place before it can be attacked. I also doubt one can cover cases such as sequential. Before that happens, it is more likely that someone develops a future backend that can handle severe corruption like this. Indeed, it might be that the batchtools package supports it, e.g. try with the sequential batchtools backend, plan(future.batchtools::batchtools_local).
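As an aside, when a crashed worker does surface as an ordinary R error from value() (as it did for the reporter with the multiprocess plan above), the error can be captured per future with tryCatch() so the remaining results survive. This is only an illustrative sketch, not part of the issue: it uses a plain stop() as a stand-in for the segfault, since a worker killed mid-communication may corrupt the backend beyond this kind of recovery.

```r
library(future)
future::plan(future::multisession, workers = 2)

# Four futures; the second one fails with an R-level error.
fs <- lapply(1:4, function(i) {
  future::future({
    if (i == 2) stop("worker failed") else i * 10
  })
})

# Collect values defensively: value() re-signals a worker's error,
# and tryCatch() converts it into NA instead of aborting the loop.
vals <- lapply(fs, function(f) {
  tryCatch(future::value(f), error = function(e) NA_real_)
})
```

Here vals holds the three successful results plus an NA for the failed future, so one bad task does not discard the rest of the work.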

@HenrikBengtsson HenrikBengtsson changed the title Segfault within processes Is it possible to recover when the worker process segfault? Sep 21, 2020
@HenrikBengtsson HenrikBengtsson changed the title Is it possible to recover when the worker process segfault? Is it possible to recover when the worker process segfaults? Sep 21, 2020
@pfernique
Author

Thanks for your reply! I'm on Windows Subsystem for Linux (which behaves as Linux). I was glad to find that one backend could recover from segfaults; I was just surprised that it wasn't the future.callr::callr backend. Since callr communicates via the file system, I thought that handling segfaults would be easier: I was using callr before the code was parallelized, because a segfault in the launched session was turned into an error in my current session (Error in readRDS(res) : error reading from connection), affecting only the segfaulting process and not the following ones.

future.batchtools::batchtools_local seems quite interesting, I will give it a try !

@HenrikBengtsson
Owner

saveRDS() is not atomic, so if killed in the middle of a write it will leave behind a half-written file, which results in that readRDS() error.
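The effect of a half-written file is easy to reproduce without killing any process. A minimal sketch (the tempfile is arbitrary): write a valid rds file, truncate it by hand to mimic a saveRDS() call that never finished, and watch readRDS() fail.

```r
f <- tempfile(fileext = ".rds")
saveRDS(1:100, f)  # a complete, valid rds file

# Simulate a worker killed mid-write by keeping only the first
# half of the bytes, as if saveRDS() was interrupted.
bytes <- readBin(f, what = "raw", n = file.size(f))
writeBin(bytes[seq_len(length(bytes) %/% 2)], f)

# Reading the truncated file now raises an error, much like the
# "error reading from connection" reported above.
res <- tryCatch(readRDS(f), error = function(e) conditionMessage(e))
```

After truncation, res holds the error message rather than the original data, which matches the failure mode of a worker segfaulting mid-write.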

@pfernique
Author

Yes, I have no problem understanding that. It's just that this seems to indicate that all processes use the same rds file (otherwise I really don't see why an rds file corrupted by one process would lead to corrupted rds files for all the remaining processes), and I naively believed that a different rds file would be used for each process.
