Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multisession cluster cleanup is incomplete #261

Closed
jcheng5 opened this issue Oct 25, 2018 · 16 comments
Closed

multisession cluster cleanup is incomplete #261

jcheng5 opened this issue Oct 25, 2018 · 16 comments
Labels
bug waiting on Waiting on a follow-up reply
Milestone

Comments

@jcheng5
Copy link

jcheng5 commented Oct 25, 2018

Hi Henrik, the Shiny team ran into this issue while testing our latest Shiny release. I don't think it is a new issue.

The below repro may seem contrived, but it's easy to get into a similar situation when using Shiny with future and promises. I'll explain the repro first, then how it relates to Shiny.

Basically, the problem occurs when you plan(multisession), launch some tasks, then plan(multisession) again with .cleanup = TRUE (the default), then launch some more tasks. In this situation, I'd expect that the first set of tasks would either error out, or maybe be allowed to complete. What happens instead is that the first set of tasks actually interferes with the second set of tasks.

library(future)
library(promises)

# Create a future that sleeps for 3 seconds, then returns the given value
make_future <- function(x) {
  future({Sys.sleep(3); x})
}

# Launch first run (1 through 8)
plan(multisession)
futures1 <- lapply(1:8, make_future)

# Launch second run (9 through 16)
plan(multisession) # Important!
futures2 <- lapply(9:16, make_future)


# 1 through 8 expected, or--more likely--errors.
# Instead, it prints 9 through 16.
print(values(futures1))

# 9 through 16 expected. Instead, it hangs.
print(values(futures2))

The reason this can occur easily in Shiny is because 1) we tell people to put plan(multisession) at the top of their app.R file, and then when they make iterative changes on their app they're running this file again and again in the same R session; and 2) promises work by repeatedly polling against future objects and they have no way of knowing that the cluster the futures belonged to is stopped.

The upshot is that if you're using a Shiny app and it has multisession futures in flight, you stop the app using Esc, and launch the app again, subsequent multisession futures randomly hang. Here's an example of a Shiny app that does this:

library(shiny)
library(promises)
library(future)
plan(multisession)

iterations <- 8
delay <- 3

ui <- fluidPage(
  "After ", delay, " seconds, \"Hello\" should appear ", iterations, " times.",
  lapply(1:iterations, function(x) {
    verbatimTextOutput(paste0("out", x))
  })
)

server <- function(input, output, session) {
  lapply(1:iterations, function(x) {
    output[[paste0("out", x)]] <- renderText({
      future(Sys.sleep(delay)) %...>%
        { message("got here ", x) } %...>%
        { "Hello" }
    })
  })
}

shinyApp(ui, server)

After the app UI loads, but before the "Hello" outputs actually appear, stop the app. Then launch it again. This time you'll probably never see the "Hello" outputs appear, and it'll be stuck in this state until you restart the R session.

I assume this is a problem in the underlying parallel psock cluster implementation, but I was wondering if it'd be reasonable to have a workaround in future. If you don't have time to look into it, if you could at least give some general pointers we can try to put together a PR. (My first instinct was to have resolved.ClusterFuture and result.ClusterFuture do some additional checking and throw if they detect this situation.)

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Oct 26, 2018

<scratch - not relevant>
Have to run, and maybe the problem is completely different, but from very very quickly skimming through this, can you confirm that doing:

plan(sequential)
plan(multisession)

does indeed give you fresh set of PSOCK workers?

PS. Calling plan() repeatedly with the exact same "plan/strategy" as in plan(multisession); plan(multisession) avoids resetting the backend/workers by design (your use case might argue that this shouldn't be the case).
</scratch - not relevant>

@HenrikBengtsson HenrikBengtsson added this to the Next release milestone Oct 26, 2018
@jcheng5
Copy link
Author

jcheng5 commented Oct 26, 2018

Judging by your edit, it looks like you discovered this too, but the interstitial plan(sequential) didn't change the behavior.

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Oct 28, 2018

Thanks for reporting. This is related the non-robustness representation of connections that R uses internally. Basically, R connections are (a) indexed by a single integer, (b) these indices can be reused, and (c) there is no protection against corrupt/overridden connection handlers, e.g.

> con_a <- file("a", open = "w")
> as.integer(con_a)
[1] 3
> print(con_a)
A connection with                    
description "a"     
class       "file"  
mode        "w"     
text        "text"  
opened      "opened"
can read    "no"    
can write   "yes"   
> close(con_a)

> con_b <- file("b", open = "w")
> as.integer(con_b)
[1] 3
> as.integer(con_a)  ## <== same as 'con_b'
[1] 3
> print(con_a)
A connection with
description "b"     ## <== not what we want
class       "file"  
mode        "w"     
text        "text"  
opened      "opened"
can read    "no"    
can write   "yes"   

See also my R-devel post 'closeAllConnections() can really mess things up' on 2016-10-30.

There's already a few protection mechanisms against "lost" connections in the packages, but not for the example you show. As a very first step, I've added protection against this (in the develop branch). Here is a minimal example of what happens now:

> library(future)
> plan(multisession, workers = 2L)
> f <- future(42)
> plan(multisession, workers = 2L)  # <== messes up R's table of connections
> value(f)
Error: Cannot receive result of MultisessionFuture future (<none>), because the connection
to the worker is corrupt: Connection (connection: description="<-localhost:11187",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes",
can write="yes", id=943) is no longer valid. It differ from the currently registered R connection
with the same index 3 (connection: description="<-localhost:11187", class="sockconn",
mode="a+b", text="binary", opened="opened", can read="yes", can write="yes", id=949)

Comment: This assertion is not bullet proof. The only reason it is detected is that two different (random) ports were used. If we use a fix port, then it is not possible to detect the change in connections. This really needs to be fixed in R itself, e.g. adding a unique identifier for each new connection. Found that non-documented attr(con, "conn_id") is set and servers as a unique ID for connections (e.g. https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/connections.c#L3384 and https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/connections.c#L5176-L5177), so this assertion should be robust.

I'll look into having repeated plan(multisession) calls not to replace already existing workers of the same kind.

@HenrikBengtsson
Copy link
Owner

Please try the develop version, in which (1) repeated calls of equal plan(...) setups are skipped, and (2) invalid R connections are detected;

> library(future)
> plan(multisession, workers = 2L)
> f <- future(42)
> plan(multisession, workers = 2L)  # <== automatically skipped
> value(f)
[1] 42
> f <- future(42)
> plan(multisession, workers = 3L)  # <== not skipped; messes up f's connection
> value(f)
Error: Cannot receive result of MultisessionFuture future (<none>), because the connection
to the worker is corrupt: Connection (connection: description="<-localhost:11478",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", 
can write="yes", id=947) is no longer valid. It differ from the currently registered R
connection with the same index 3 (connection: description="<-localhost:11478",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", 
can write="yes", id=956)

Seems to work also with the 'promises' and Shiny examples in your original post.

@HenrikBengtsson
Copy link
Owner

Please, let me know when you've confirmed that the develop version solves that use pattern where a Shiny app keeps being reloaded over and over. This will help me planning when to do the next release.

@HenrikBengtsson HenrikBengtsson added the waiting on Waiting on a follow-up reply label Nov 9, 2018
@jcheng5
Copy link
Author

jcheng5 commented Nov 15, 2018

Sorry for the delay, I think I'm not getting notifications on this thread for some reason. I'll try it first thing tomorrow.

@jcheng5
Copy link
Author

jcheng5 commented Nov 16, 2018

Hmmm. Sorry, the problem still repros for me with the Shiny app above.

Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.0 (2017-04-21)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.2.1114)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2018-11-15                  

Packages -----------------------------------------------------------------------------
 package   * version     date       source                                 
 base      * 3.4.0       2017-04-21 local                                  
 codetools   0.2-15      2016-10-05 CRAN (R 3.4.0)                         
 compiler    3.4.0       2017-04-21 local                                  
 datasets  * 3.4.0       2017-04-21 local                                  
 devtools    1.13.6      2018-06-27 CRAN (R 3.4.4)                         
 digest      0.6.18      2018-10-10 cran (@0.6.18)                         
 future    * 1.10.0-9000 2018-11-16 Github (HenrikBengtsson/future@e254fdb)
 globals     0.12.4      2018-10-11 CRAN (R 3.4.4)                         
 graphics  * 3.4.0       2017-04-21 local                                  
 grDevices * 3.4.0       2017-04-21 local                                  
 htmltools   0.3.6       2017-04-28 CRAN (R 3.4.0)                         
 httpuv      1.4.5       2018-07-19 CRAN (R 3.4.4)                         
 jsonlite    1.5         2017-06-01 CRAN (R 3.4.0)                         
 later       0.7.5       2018-09-18 CRAN (R 3.4.4)                         
 listenv     0.7.0       2018-01-21 cran (@0.7.0)                          
 magrittr    1.5         2014-11-22 CRAN (R 3.4.0)                         
 memoise     1.1.0       2017-04-21 CRAN (R 3.4.0)                         
 methods   * 3.4.0       2017-04-21 local                                  
 mime        0.6         2018-10-05 cran (@0.6)                            
 packrat     0.4.9-1     2018-03-12 cran (@0.4.9-1)                        
 parallel    3.4.0       2017-04-21 local                                  
 promises  * 1.0.1       2018-04-13 CRAN (R 3.4.4)                         
 R6          2.3.0       2018-10-04 cran (@2.3.0)                          
 Rcpp        1.0.0       2018-11-07 cran (@1.0.0)                          
 rlang       0.3.0.1     2018-10-25 cran (@0.3.0.1)                        
 rsconnect   0.8.10      2018-11-13 Github (rstudio/rsconnect@e6a7248)     
 shiny     * 1.2.0       2018-11-02 CRAN (R 3.4.4)                         
 stats     * 3.4.0       2017-04-21 local                                  
 tools       3.4.0       2017-04-21 local                                  
 utils     * 3.4.0       2017-04-21 local                                  
 withr       2.1.2       2018-03-15 cran (@2.1.2)                          
 xtable      1.8-3       2018-08-29 cran (@1.8-3)                          

@HenrikBengtsson
Copy link
Owner

Hmm x 2 ... and I can't reproduce it with the develop version. I could reproduce it in the past (running shiny::runApp() - in a plain terminal as well as doing 'Run App-stop-Run App' or 'Reload App' in RStudio 1.1.453), but not anymore. My details at the end.

What happens when you do:

> library(future)
> plan(multisession, workers = 2L)
> f <- future(42)
> plan(multisession, workers = 2L)  # <== used to mess up R's table of connections
> value(f)

?

Details

In a fresh R session without a ~/.Rprofile:

library(shiny)
library(promises)
library(future)
plan(multisession)

iterations <- 8
delay <- 3

ui <- fluidPage(
  "After ", delay, " seconds, \"Hello\" should appear ", iterations, " times.",
  lapply(1:iterations, function(x) {
    verbatimTextOutput(paste0("out", x))
  })
)

server <- function(input, output, session) {
  lapply(1:iterations, function(x) {
    output[[paste0("out", x)]] <- renderText({
      future(Sys.sleep(delay)) %...>%
        { message("got here ", x) } %...>%
        { paste0(Sys.time(), ": Hello") }
    })
  })
}

shinyApp(ui, server)

and

> devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Ubuntu 18.04.1 LTS          
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2018-11-15Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version     date       lib source                         
 assertthat    0.2.0       2017-04-11 [1] CRAN (R 3.5.1)                 
 backports     1.1.2       2017-12-13 [1] CRAN (R 3.5.1)                 
 base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.5.1)                 
 callr         3.0.0       2018-08-24 [1] CRAN (R 3.5.1)                 
 cli           1.0.1       2018-09-25 [1] CRAN (R 3.5.1)                 
 codetools     0.2-15      2016-10-05 [4] CRAN (R 3.5.0)                 
 crayon        1.3.4       2017-09-16 [1] CRAN (R 3.5.1)                 
 debugme       1.1.0       2017-10-22 [1] CRAN (R 3.5.1)                 
 desc          1.2.0       2018-08-12 [1] Github (r-lib/desc@4f60833)    
 devtools      2.0.1       2018-10-26 [1] CRAN (R 3.5.1)                 
 digest        0.6.18      2018-10-10 [1] CRAN (R 3.5.1)                 
 fs            1.2.6       2018-08-23 [1] CRAN (R 3.5.1)                 
 future      * 1.10.0-9000 2018-11-14 [1] local                          
 globals       0.12.4      2018-10-11 [1] local                          
 glue          1.3.0       2018-07-17 [1] CRAN (R 3.5.1)                 
 htmltools     0.3.6       2017-04-28 [1] CRAN (R 3.5.1)                 
 httpuv        1.4.5       2018-07-19 [1] CRAN (R 3.5.1)                 
 jsonlite      1.5         2017-06-01 [1] CRAN (R 3.5.1)                 
 later         0.7.5       2018-09-18 [1] CRAN (R 3.5.1)                 
 listenv       0.7.0       2018-01-21 [1] CRAN (R 3.5.1)                 
 magrittr      1.5         2014-11-22 [1] CRAN (R 3.5.1)                 
 memoise       1.1.0       2017-04-21 [1] CRAN (R 3.5.1)                 
 mime          0.6         2018-10-05 [1] CRAN (R 3.5.1)                 
 pkgbuild      1.0.2       2018-10-16 [1] CRAN (R 3.5.1)                 
 pkgload       1.0.2       2018-10-29 [1] CRAN (R 3.5.1)                 
 prettyunits   1.0.2       2015-07-13 [1] CRAN (R 3.5.1)                 
 processx      3.2.0       2018-08-12 [1] Github (r-lib/processx@c565be4)
 promises    * 1.0.1       2018-04-13 [1] CRAN (R 3.5.1)                 
 R6            2.3.0       2018-10-04 [1] CRAN (R 3.5.1)                 
 Rcpp          1.0.0       2018-11-07 [1] CRAN (R 3.5.1)                 
 remotes       2.0.2       2018-10-30 [1] CRAN (R 3.5.1)                 
 rlang         0.3.0.1     2018-10-25 [1] CRAN (R 3.5.1)                 
 rprojroot     1.3-2       2018-01-03 [1] CRAN (R 3.5.1)                 
 rsconnect     0.8.11      2018-11-12 [1] CRAN (R 3.5.1)                 
 rstudioapi    0.8         2018-10-02 [1] CRAN (R 3.5.1)                 
 sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 3.5.1)                 
 shiny       * 1.2.0       2018-11-02 [1] CRAN (R 3.5.1)                 
 testthat      2.0.1       2018-10-13 [1] CRAN (R 3.5.1)                 
 usethis       1.4.0       2018-08-14 [1] CRAN (R 3.5.1)                 
 withr         2.1.2       2018-03-15 [1] CRAN (R 3.5.1)                 
 xtable        1.8-3       2018-08-29 [1] CRAN (R 3.5.1)                 
 yaml          2.2.0       2018-07-25 [1] CRAN (R 3.5.1)                 

[1] /home/hb/R/x86_64-pc-linux-gnu-library/3.5
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

@jcheng5
Copy link
Author

jcheng5 commented Nov 16, 2018

Thanks. I'll try again this afternoon on Linux.

@jcheng5
Copy link
Author

jcheng5 commented Nov 17, 2018

I am able to repro on Ubuntu 16.04.5, R 3.5.1, from the terminal. Note that to repro you have to Ctrl-C the terminal after the page loads but before the "Hello" output appears.

I can take a closer look next week.

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Nov 18, 2018

Thanks for clarifying the use case. I can now reproduce this (somewhat new) issue too.

The short story

There is a combination of issues kicking in in your use case. I'll to cover them below, but here's a fix:

remotes::install_github("HenrikBengtsson/future@develop")
remotes::install_github("rstudio/promises#37")

Then, when rerunning the above Shiny app (on two cores), however you want to break and interrupt it, you'll get:

> shiny::runApp()
Loading required package: shiny

Listening on http://127.0.0.1:3173
^C
> 
> shiny::runApp()

Listening on http://127.0.0.1:3173
Warning: Error in : Cannot resolve MultisessionFuture (<none>), because the connection to
the worker is corrupt: Connection (connection: description="<-localhost:11294", 
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=290) is no longer valid. It differ from the currently registered R
connection with the same index 4 (connection: description="<-localhost:11294",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=434)
  [No stack trace available]
Warning: Error in : Cannot resolve MultisessionFuture (<none>), because the connection to
the worker is corrupt: Connection (connection: description="<-localhost:11294",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=288) is no longer valid. It differ from the currently registered R
connection with the same index 3 (connection: description="<-localhost:11294",
class="sockconn", mode="a+b", text="binary", opened="opened", can read="yes", can
write="yes", id=432)
  [No stack trace available]
got here 1
got here 2
got here 3
got here 4

The key thing here is that the modified check() of as.promise.Future() now handles FutureErrors, which is required in order not to break the "state machine" of futures/promises. FutureError represent errors that occur while orchestrating futures. Specifically, in this case it detects that the old futures now carry connections that are corrupt.

The long story

future 1.10.0, parallel and R connections:

First, lets look at the problems that exist with future 1.10.0 and the parallel package. As I see it, there is one major problem:

You will corrupt the connection to PSOCK workers used by existing futures if you close the old PSOCK cluster by replacing it with a new one. This is due to a weakness in how R represents connections internally. The main concern I have right now with the current R implementation is that this weakness is not detected and therefore not protected against. See HenrikBengtsson/Wishlist-for-R#81 for details - I've brought this up on R-devel.

Now, when calling plan(multisession) on an existing plan(multisession), the old cluster will be replaced by a new one, which results in the above problem. In future 1.10.0, the problem with corrupt connections was not detected resulting in a completly broken, mixed-up state of old and new futures.

future 1.10.0-9000 (develop version):

I've updated the future package (develop version) to detect and produce an error whenever there is an attempt to use a connection that is no longer valid. This protects us from messing up the state of the futures.

Skipping replicated plan():s?

I decided to roll back to use backward compatible plan(..., .skip = FALSE), because I'd like to better understand what new side effects that entails. For instance, what you experienced in your most recent comment used .skip = TRUE and it clearly puts the state of the future framework in an endless loop (different reason from the one you first reported on). For some reason, it, incorrectly, thinks the old futures are unresolved and occupy the workers, preventing any new futures from being created. I don't know whether this has to do the future package per se, or the promise and later packages are the problems. It's a bit tricky to troubleshoot with all the different layers involved. As a starter, it would be neat to be able replicate this without Shiny being involved.

@HenrikBengtsson
Copy link
Owner

@jcheng5, did you have a chance to look at this one?

@jcheng5
Copy link
Author

jcheng5 commented Dec 12, 2018

Sorry, I have not, other than reading your description. I have a card on my Trello board that is constantly reminding me to revisit this though. I don't think I will be able to get to it until after rstudio::conf though (mid-January), I'm really sorry. Am I understanding correctly though that the next step you'd like from me is to make a more-minimal repro?

@HenrikBengtsson
Copy link
Owner

That's alright. I'll probably release next version of future before then.

Am I understanding correctly though that the next step you'd like from me is to make a more-minimal repro?

Nah, the only action needed on your end is to "understand" that the 'promises' package needs to be agile also to future orchestration errors (e.g. broken connections, crashed workers, etc). Those type of errors are of class FutureError and different from regular errors that may occur when evaluating (future) expressions. My https://github.com/rstudio/promises/pull/37/files PR addresses/clarifies this.

@HenrikBengtsson
Copy link
Owner

Updated take-home message:

(*) The state can get corrupted when one, for instancem "forces" a new multisession/cluster while an existing, non-resolved one exists. This will corrupt the state due to a weakness in R itself, which future now detects/works around.

@HenrikBengtsson
Copy link
Owner

I'll consider this one done on the future side. rstudio/promises#37 still needs to be incorporated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug waiting on Waiting on a follow-up reply
Projects
None yet
Development

No branches or pull requests

2 participants