multisession cluster cleanup is incomplete #261
Comments
<scratch - not relevant> plan(sequential)
plan(multisession) does indeed give you fresh set of PSOCK workers? PS. Calling |
Judging by your edit, it looks like you discovered this too, but the interstitial |
Thanks for reporting. This is related to the non-robust representation of connections that R uses internally. Basically, R connections are (a) indexed by a single integer, (b) these indices can be reused, and (c) there is no protection against corrupt/overridden connection handlers, e.g.
> con_a <- file("a", open = "w")
> as.integer(con_a)
[1] 3
> print(con_a)
A connection with
description "a"
class "file"
mode "w"
text "text"
opened "opened"
can read "no"
can write "yes"
> close(con_a)
> con_b <- file("b", open = "w")
> as.integer(con_b)
[1] 3
> as.integer(con_a) ## <== same as 'con_b'
[1] 3
> print(con_a)
A connection with
description "b" ## <== not what we want
class "file"
mode "w"
text "text"
opened "opened"
can read "no"
can write "yes"
See also my R-devel post 'closeAllConnections() can really mess things up' on 2016-10-30. There are already a few protection mechanisms against "lost" connections in the packages, but not for the example you show. As a very first step, I've added protection against this (in the develop branch). Here is a minimal example of what happens now:
> library(future)
> plan(multisession, workers = 2L)
> f <- future(42)
> plan(multisession, workers = 2L) # <== messes up R's table of connections
> value(f)
Error: Cannot receive result of MultisessionFuture future (<none>), because the connection
to the worker is corrupt: Connection (connection: description="<-localhost:11187",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes",
can write="yes", id=943) is no longer valid. It differ from the currently registered R connection
with the same index 3 (connection: description="<-localhost:11187", class="sockconn",
mode="a+b", text="binary", opened="opened", can read="yes", can write="yes", id=949)
I'll look into having repeated |
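The index-reuse problem described above can be checked from plain R. The following is a minimal sketch, not the check that future itself implements: the helper name is_connection_valid() is hypothetical, and it assumes that the "conn_id" attribute R attaches to connection objects is unique per connection, so a stale handle can be told apart from the connection currently registered under the same index.

```r
# Hypothetical helper (not part of future): TRUE if 'con' still refers to
# the connection that R currently has registered under the same index.
is_connection_valid <- function(con) {
  stopifnot(inherits(con, "connection"))
  index <- as.integer(con)
  # stdin/stdout/stderr (indices 0-2) carry no "conn_id" attribute
  if (index <= 2L) return(TRUE)
  # getConnection() signals an error if no connection uses this index
  current <- tryCatch(getConnection(index), error = function(e) NULL)
  if (is.null(current)) return(FALSE)
  # Same index is not enough; the per-connection id must match as well
  identical(attr(con, "conn_id"), attr(current, "conn_id"))
}

con_a <- file(tempfile(), open = "w")
is_connection_valid(con_a)   # handle and registered connection agree
close(con_a)
con_b <- file(tempfile(), open = "w")   # may reuse the index of 'con_a'
is_connection_valid(con_a)   # stale handle is rejected
is_connection_valid(con_b)
close(con_b)
```

The sketch compares handles rather than printed descriptions because, as the error messages above show, two different connections can share the same description and index.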
Please try the develop version, in which (1) repeated calls of equal
> library(future)
> plan(multisession, workers = 2L)
> f <- future(42)
> plan(multisession, workers = 2L) # <== automatically skipped
> value(f)
[1] 42
> f <- future(42)
> plan(multisession, workers = 3L) # <== not skipped; messes up f's connection
> value(f)
Error: Cannot receive result of MultisessionFuture future (<none>), because the connection
to the worker is corrupt: Connection (connection: description="<-localhost:11478",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes",
can write="yes", id=947) is no longer valid. It differ from the currently registered R
connection with the same index 3 (connection: description="<-localhost:11478",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes",
can write="yes", id=956)
It also seems to work with the 'promises' and Shiny examples in your original post. |
Please let me know when you've confirmed that the develop version solves the use pattern where a Shiny app keeps being reloaded over and over. This will help me plan when to do the next release. |
Sorry for the delay, I think I'm not getting notifications on this thread for some reason. I'll try it first thing tomorrow. |
Hmmm. Sorry, the problem still repros for me with the Shiny app above.
|
Hmm x 2 ... and I can't reproduce it with the develop version. I could reproduce it in the past (running
What happens when you do:
> library(future)
> plan(multisession, workers = 2L)
> f <- future(42)
> plan(multisession, workers = 2L) # <== used to mess up R's table of connections
> value(f)
?
Details
In a fresh R session without a ~/.Rprofile:
library(shiny)
library(promises)
library(future)
plan(multisession)
iterations <- 8
delay <- 3
ui <- fluidPage(
"After ", delay, " seconds, \"Hello\" should appear ", iterations, " times.",
lapply(1:iterations, function(x) {
verbatimTextOutput(paste0("out", x))
})
)
server <- function(input, output, session) {
lapply(1:iterations, function(x) {
output[[paste0("out", x)]] <- renderText({
future(Sys.sleep(delay)) %...>%
{ message("got here ", x) } %...>%
{ paste0(Sys.time(), ": Hello") }
})
})
}
shinyApp(ui, server)
and
> devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.5.1 (2018-07-02)
os Ubuntu 18.04.1 LTS
system x86_64, linux-gnu
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2018-11-15
─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.1)
backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.1)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.1)
callr 3.0.0 2018-08-24 [1] CRAN (R 3.5.1)
cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
codetools 0.2-15 2016-10-05 [4] CRAN (R 3.5.0)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
debugme 1.1.0 2017-10-22 [1] CRAN (R 3.5.1)
desc 1.2.0 2018-08-12 [1] Github (r-lib/desc@4f60833)
devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.1)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.1)
future * 1.10.0-9000 2018-11-14 [1] local
globals 0.12.4 2018-10-11 [1] local
glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.1)
httpuv 1.4.5 2018-07-19 [1] CRAN (R 3.5.1)
jsonlite 1.5 2017-06-01 [1] CRAN (R 3.5.1)
later 0.7.5 2018-09-18 [1] CRAN (R 3.5.1)
listenv 0.7.0 2018-01-21 [1] CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.1)
mime 0.6 2018-10-05 [1] CRAN (R 3.5.1)
pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.1)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1)
processx 3.2.0 2018-08-12 [1] Github (r-lib/processx@c565be4)
promises * 1.0.1 2018-04-13 [1] CRAN (R 3.5.1)
R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1)
Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.1)
remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.1)
rlang 0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1)
rsconnect 0.8.11 2018-11-12 [1] CRAN (R 3.5.1)
rstudioapi 0.8 2018-10-02 [1] CRAN (R 3.5.1)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.1)
shiny * 1.2.0 2018-11-02 [1] CRAN (R 3.5.1)
testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.1)
usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.1)
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1)
[1] /home/hb/R/x86_64-pc-linux-gnu-library/3.5
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library |
Thanks. I'll try again this afternoon on Linux. |
I am able to repro on Ubuntu 16.04.5, R 3.5.1, from the terminal. Note that to repro you have to Ctrl-C the terminal after the page loads but before the "Hello" output appears. I can take a closer look next week. |
Thanks for clarifying the use case. I can now reproduce this (somewhat new) issue too.
The short story
There is a combination of issues at play in your use case. I'll try to cover them below, but here's a fix:
remotes::install_github("HenrikBengtsson/future@develop")
remotes::install_github("rstudio/promises#37")
Then, when rerunning the above Shiny app (on two cores), however you choose to break and interrupt it, you'll get:
> shiny::runApp()
Loading required package: shiny
Listening on http://127.0.0.1:3173
^C
>
> shiny::runApp()
Listening on http://127.0.0.1:3173
Warning: Error in : Cannot resolve MultisessionFuture (<none>), because the connection to
the worker is corrupt: Connection (connection: description="<-localhost:11294",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=290) is no longer valid. It differ from the currently registered R
connection with the same index 4 (connection: description="<-localhost:11294",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=434)
[No stack trace available]
Warning: Error in : Cannot resolve MultisessionFuture (<none>), because the connection to
the worker is corrupt: Connection (connection: description="<-localhost:11294",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=288) is no longer valid. It differ from the currently registered R
connection with the same index 3 (connection: description="<-localhost:11294",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=432)
[No stack trace available]
got here 1
got here 2
got here 3
got here 4
The key thing here is that the modified
The long story
future 1.10.0, parallel and R connections:
First, let's look at the problems that exist with future 1.10.0 and the parallel package. As I see it, there is one major problem: You will corrupt the connection to PSOCK workers used by existing futures if you close the old PSOCK cluster by replacing it with a new one. This is due to a weakness in how R represents connections internally. The main concern I have right now with the current R implementation is that this weakness is not detected and therefore not protected against. See HenrikBengtsson/Wishlist-for-R#81 for details - I've brought this up on R-devel. Now, when calling
future 1.10.0-9000 (develop version):
I've updated the future package (develop version) to detect and produce an error whenever there is an attempt to use a connection that is no longer valid. This protects us from messing up the state of the futures.
Skipping replicated plan():s?
I decided to roll back to use backward compatible |
@jcheng5, did you have a chance to look at this one? |
Sorry, I have not, other than reading your description. I have a card on my Trello board that is constantly reminding me to revisit this though. I don't think I will be able to get to it until after rstudio::conf though (mid-January), I'm really sorry. Am I understanding correctly though that the next step you'd like from me is to make a more-minimal repro? |
That's alright. I'll probably release the next version of future before then.
Nah, the only action needed on your end is to "understand" that the 'promises' package also needs to be robust to future orchestration errors (e.g. broken connections, crashed workers, etc.). Those types of errors are of class FutureError and are different from regular errors that may occur when evaluating (future) expressions. My https://github.com/rstudio/promises/pull/37/files PR addresses/clarifies this. |
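To illustrate the distinction between orchestration errors and ordinary evaluation errors, here is a minimal base-R sketch of class-based dispatch in tryCatch(). The constructor fake_future_error() and the helper classify_failure() are hypothetical stand-ins made up for this example; real FutureError conditions are signalled by the future package itself, and this is not how promises handles them.

```r
# Hypothetical stand-in for an orchestration error. Only the class vector
# matters here; real FutureError objects come from the future package.
fake_future_error <- function(msg) {
  structure(
    list(message = msg, call = NULL),
    class = c("FutureError", "error", "condition")
  )
}

# Evaluate 'expr' and report which kind of failure occurred, if any.
# Handlers are tried in order, so the more specific class comes first.
classify_failure <- function(expr) {
  tryCatch(
    {
      force(expr)
      "ok"
    },
    FutureError = function(e) "orchestration",
    error = function(e) "evaluation"
  )
}

classify_failure(stop(fake_future_error("connection to the worker is corrupt")))
classify_failure(stop("ordinary failure inside the expression"))
classify_failure(1 + 1)
```

Listing the FutureError handler before the generic error handler is what lets a caller treat a broken connection differently from a failure in the user's own code.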
Updated take-home message:
(*) The state can get corrupted when one, for instance, "forces" a new multisession/cluster while an existing, non-resolved one exists. This will corrupt the state due to a weakness in R itself, which future now detects/works around. |
I'll consider this one done on the future side. rstudio/promises#37 still needs to be incorporated. |
Hi Henrik, the Shiny team ran into this issue while testing our latest Shiny release. I don't think it is a new issue.
The below repro may seem contrived, but it's easy to get into a similar situation when using Shiny with future and promises. I'll explain the repro first, then how it relates to Shiny.
Basically, the problem occurs when you plan(multisession), launch some tasks, then plan(multisession) again with .cleanup = TRUE (the default), then launch some more tasks. In this situation, I'd expect that the first set of tasks would either error out, or maybe be allowed to complete. What happens instead is that the first set of tasks actually interferes with the second set of tasks.
The reason this can occur easily in Shiny is because 1) we tell people to put plan(multisession) at the top of their app.R file, and then when they make iterative changes on their app they're running this file again and again in the same R session; and 2) promises work by repeatedly polling against future objects and they have no way of knowing that the cluster the futures belonged to is stopped.
The upshot is that if you're using a Shiny app and it has multisession futures in flight, you stop the app using Esc, and launch the app again, subsequent multisession futures randomly hang. Here's an example of a Shiny app that does this:
After the app UI loads, but before the "Hello" outputs actually appear, stop the app. Then launch it again. This time you'll probably never see the "Hello" outputs appear, and it'll be stuck in this state until you restart the R session.
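One user-side way to avoid replacing a live cluster when app.R is re-run in the same R session is to set the plan only if it is not already multisession. This is a rough sketch under stated assumptions, not a fix endorsed in this thread: the helper name ensure_multisession() is hypothetical, it assumes the strategy returned by plan() inherits from class "multisession", and it deliberately ignores whether the requested arguments (e.g. the number of workers) differ from the existing plan.

```r
# Hypothetical helper: set up a multisession plan only when the current
# plan is not already multisession, so that workers used by in-flight
# futures are not torn down when the script is sourced again.
ensure_multisession <- function(...) {
  current <- future::plan()
  if (inherits(current, "multisession")) {
    # Assumption: reusing the existing workers is acceptable here
    return(invisible(current))
  }
  future::plan(future::multisession, ...)
}
```

In an app.R file, one would call ensure_multisession() in place of plan(multisession). The trade-off is that a changed worker count is silently ignored until the R session is restarted.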
I assume this is a problem in the underlying parallel psock cluster implementation, but I was wondering if it'd be reasonable to have a workaround in future. If you don't have time to look into it, if you could at least give some general pointers we can try to put together a PR. (My first instinct was to have resolved.ClusterFuture and result.ClusterFuture do some additional checking and throw if they detect this situation.)