deploying jobs to a Docker container -- timeout error #265

Closed
januz opened this issue Nov 29, 2018 · 17 comments

Comments

@januz
commented Nov 29, 2018

When I try to deploy jobs to a Docker container using either of the examples from the future reference manual or the drake Docker example, I get the following error:

Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  : 
  reached elapsed time limit

My environment:

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Los_Angeles         
#>  date     2018-11-29
#> Packages -----------------------------------------------------------------
#>  package   * version date       source        
#>  backports   1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base      * 3.5.0   2018-04-24 local         
#>  compiler    3.5.0   2018-04-24 local         
#>  datasets  * 3.5.0   2018-04-24 local         
#>  devtools    1.13.6  2018-06-27 CRAN (R 3.5.0)
#>  digest      0.6.18  2018-10-10 CRAN (R 3.5.0)
#>  evaluate    0.12    2018-10-09 CRAN (R 3.5.0)
#>  graphics  * 3.5.0   2018-04-24 local         
#>  grDevices * 3.5.0   2018-04-24 local         
#>  htmltools   0.3.6   2017-04-28 CRAN (R 3.5.0)
#>  knitr       1.20    2018-02-20 CRAN (R 3.5.0)
#>  magrittr    1.5     2014-11-22 CRAN (R 3.5.0)
#>  memoise     1.1.0   2017-04-21 CRAN (R 3.5.0)
#>  methods   * 3.5.0   2018-04-24 local         
#>  Rcpp        1.0.0   2018-11-07 CRAN (R 3.5.0)
#>  rmarkdown   1.10    2018-06-11 CRAN (R 3.5.0)
#>  rprojroot   1.3-2   2018-01-03 CRAN (R 3.5.0)
#>  stats     * 3.5.0   2018-04-24 local         
#>  stringi     1.2.4   2018-07-20 CRAN (R 3.5.0)
#>  stringr     1.3.1   2018-05-10 CRAN (R 3.5.0)
#>  tools       3.5.0   2018-04-24 local         
#>  utils     * 3.5.0   2018-04-24 local         
#>  withr       2.1.2   2018-03-15 CRAN (R 3.5.0)
#>  yaml        2.2.0   2018-07-25 CRAN (R 3.5.0)

Created on 2018-11-29 by the reprex package (v0.2.1)

Thanks for your help!

@HenrikBengtsson

Owner

commented Nov 29, 2018

To troubleshoot your problem of getting a working R worker in Docker running, it should be enough to work with:

> cl <- future::makeClusterPSOCK("localhost", rscript = c("docker", "run", "--net=host", "rocker/r-base", "Rscript"))

Also, I've made some updates to future 1.10.0-9000 (develop version) that provides more informative output and error messages. Install it as:

> remotes::install_github("HenrikBengtsson/future@develop")

Then retry (in a fresh R session). Here is what it looks like on R 3.5.1 on Ubuntu 18.04:

> cl <- future::makeClusterPSOCK("localhost", rscript = c("docker", "run", "--net=host", "rocker/r-base", "Rscript"), outfile = NULL, verbose = TRUE)
[local output] Workers: [n = 1] ‘localhost’
[local output] Base port: 11029
[local output] Creating node 1 of 1 ...
[local output] - setting up node
[local output] Starting worker #1 on ‘localhost’: 'docker' 'run' '--net=host' 'rocker/r-base' 'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11029 OUT= TIMEOUT=2592000 XDR=TRUE
[local output] - Exit code of system() call: 0
[local output] Waiting for worker #1 on ‘localhost’ to connect back
[local output] - Detected 'outfile=NULL': this will make the output from the background worker visible
starting worker pid=1 on localhost:11029 at 20:26:58.724
[local output] Connection with worker #1 on ‘localhost’ established
[local output] - collecting session information
Warning: namespace ‘future’ is not available and has been replaced
by .GlobalEnv when processing object ‘’
[local output] Creating node 1 of 1 ... done

> cl
socket cluster with 1 nodes on host ‘localhost’

Comment: The 'Warning: namespace ‘future’ is not available and has been replaced' is ok, because we have not yet installed 'future' in the Docker session.

Also, try to run this in a terminal rather than a GUI, because then you'll be able to see more output. RStudio Terminal works too.

EDIT: My example did not show arguments outfile = NULL, verbose = TRUE in the call.

@januz

Author

commented Nov 29, 2018

@HenrikBengtsson Thanks for getting back to me!

I installed the development version of future and ran the command. The result is the same, i.e., the command times out:

> cl <- future::makeClusterPSOCK("localhost", rscript = c("docker", "run", "--net=host", "rocker/r-base", "Rscript"), verbose = TRUE)
[local output] Workers: [n = 1] ‘localhost’
[local output] Base port: 11751
[local output] Creating node 1 of 1 ...
[local output] - setting up node
[local output] Starting worker #1 on ‘localhost’: 'docker' 'run' '--net=host' 'rocker/r-base' 'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11751 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE
[local output] - Exit code of system() call: 0
[local output] Waiting for worker #1 on ‘localhost’ to connect back
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  : 
  Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘mac023-2.local’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11751 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: 'docker' 'run' '--net=host' 'rocker/r-base' 'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11751 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE.
 * Troubleshooting suggestions:
   - Suggestion #1: Set 'outfile=NULL' to see output from worker.

Do you see anything that could help troubleshoot the issue?

try to run this in a terminal rather than a GUI

Do you mean to a) open R in the terminal and run the command or b) run the docker run ... command in the shell?

a) gives the same output as posted above
b) doesn't give any output

Interestingly, both a) and b) aren't terminated automatically by the timeout, but the connection does not seem to be established.

@HenrikBengtsson

Owner

commented Nov 29, 2018

Sorry, I realize that due to a cut'n'paste mistake I left out the argument outfile=NULL in my example. Retry with:

> cl <- future::makeClusterPSOCK("localhost", rscript = c("docker", "run", "--net=host", "rocker/r-base", "Rscript"), outfile = NULL, verbose = TRUE)

The outfile=NULL will help us see what is done in the worker side.

Try to run this in a terminal rather than a GUI

Do you mean to a) open R in the terminal and run the command or b) run the docker run ... command in the shell?

I meant (a) - run R from the terminal, because then the terminal will relay any output produced by the background R worker (when using outfile=NULL). When R runs in a GUI (e.g. Rgui or RStudio Console), the GUI swallows any output from the background worker.

To further confirm that the worker, i.e. Rscript, actually launched in the Docker container, we can inject some additional R output in the call by doing:

> cl <- future::makeClusterPSOCK("localhost", rscript = c("docker", "run", "--net=host", "rocker/r-base", "Rscript"), rscript_args = c("-e", shQuote("Sys.info()")), verbose = TRUE, outfile = NULL)
[local output] Workers: [n = 1] ‘localhost’
[local output] Base port: 11512
[local output] Creating node 1 of 1 ...
[local output] - setting up node
[local output] Starting worker #1 on ‘localhost’: 'docker' 'run' '--net=host' 'rocker/r-base' 'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'Sys.info()' -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11512 OUT= TIMEOUT=2592000 XDR=TRUE
[local output] - Exit code of system() call: 0
[local output] Waiting for worker #1 on ‘localhost’ to connect back
[local output] - Detected 'outfile=NULL': this will make the output from the background worker visible
                                      sysname 
                                      "Linux" 
                                      release 
                          "4.15.0-39-generic" 
                                      version 
"#42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018" 
                                     nodename 
                                      "hb-x1" 
                                      machine 
                                     "x86_64" 
                                        login 
                                    "unknown" 
                                         user 
                                       "root" 
                               effective_user 
                                       "root" 
starting worker pid=1 on localhost:11512 at 21:59:31.960
[local output] Connection with worker #1 on ‘localhost’ established
[local output] - collecting session information
Warning: namespace ‘future’ is not available and has been replaced
by .GlobalEnv when processing object ‘’
[local output] Creating node 1 of 1 ... done

That Sys.info() tells me that Rscript is running and that it is inside the Docker container (e.g. user=root).

@januz

Author

commented Nov 29, 2018

Here is the output of the second command:

[local output] Workers: [n = 1] ‘localhost’
[local output] Base port: 11359
[local output] Creating node 1 of 1 ...
[local output] - setting up node
[local output] Starting worker #1 on ‘localhost’: 'docker' 'run' '--net=host' 'rocker/r-base' 'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'Sys.info()' -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11359 OUT= TIMEOUT=2592000 XDR=TRUE
[local output] - Exit code of system() call: 0
[local output] Waiting for worker #1 on ‘localhost’ to connect back
[local output] - Detected 'outfile=NULL': this will make the output from the background worker visible
                             sysname                              release 
                             "Linux"                   "4.9.125-linuxkit" 
                             version                             nodename 
"#1 SMP Fri Sep 7 08:20:28 UTC 2018"              "linuxkit-025000000001" 
                             machine                                login 
                            "x86_64"                            "unknown" 
                                user                       effective_user 
                              "root"                               "root" 
starting worker pid=1 on localhost:11359 at 22:08:26.972
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  : 
  Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘mac023.local’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11359 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: 'docker' 'run' '--net=host' 'rocker/r-base' 'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'Sys.info()' -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11359 OUT= TIMEOUT=2592000 XDR=TRUE.

so, Rscript seems to run inside the container, right?

Weirdly, executing the command in terminal R just hangs on the line

starting worker pid=1 on localhost:...

while executing it in RStudio terminates after the timeout.

@HenrikBengtsson

Owner

commented Nov 29, 2018

so, Rscript seems to run inside the container, right?

Yes, we know that Rscript actually runs and that it runs inside the Docker container. We now know that the launched R worker fails to connect back to your main R session, so something prevents the two from being able to "talk". I'm not really sure what causes this, but wild guesses are: (a) a firewall issue, or (b) a Docker configuration blocking --net=host. Does it even work without Docker, i.e.

> cl <- future::makeClusterPSOCK("localhost", outfile = NULL, verbose = TRUE)
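The connect-back step can be illustrated without Docker or future. The following is a minimal sketch using only base R: the main session listens on a port with socketConnection(server = TRUE) and a background Rscript plays the role of the worker that connects back. The port number and the message text are arbitrary choices for this sketch, and it assumes the port is free and that Rscript is available under R.home("bin").

```r
port <- 11234L  # arbitrary port chosen for this sketch; assumed to be free
rscript <- file.path(R.home("bin"), "Rscript")

# "Worker": wait briefly so the listener is up, then connect back to the
# main session on localhost and send one line of text
worker_code <- sprintf(
  'Sys.sleep(1); con <- socketConnection("localhost", port = %d, open = "w", blocking = TRUE); writeLines("hello from worker", con); close(con)',
  port
)
system2(rscript, c("-e", shQuote(worker_code)), wait = FALSE)

# "Main session": listen on the port and block until the worker connects
# (timeout is given in seconds)
con <- socketConnection("localhost", port = port, server = TRUE,
                        blocking = TRUE, open = "r", timeout = 30)
msg <- readLines(con, n = 1L)
close(con)
print(msg)
```

If the worker cannot reach the address it was given (here "localhost"), the listening call never gets its connection, which matches the symptom reported in this thread.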

Weirdly, executing the command in terminal R just hangs on the line
starting worker pid=1 on localhost:...
while executing it in RStudio terminates after the timeout.

Yeah, the success of a timeout event to bubble up and actually interrupt R at the top level varies with OS and environment, e.g. HenrikBengtsson/Wishlist-for-R#47. Luke Tierney, who's the R-core person most likely to fix this, is aware of it and it's on his todo list to improve upon it.

@januz

Author

commented Nov 29, 2018

Does it even work without Docker

Yes, without Docker the connection seems to work:

> cl <- future::makeClusterPSOCK("localhost", outfile = NULL, verbose = TRUE)
[local output] Workers: [n = 1] ‘localhost’
[local output] Base port: 11102
[local output] Creating node 1 of 1 ...
[local output] - setting up node
[local output] Starting worker #1 on ‘localhost’: '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11102 OUT= TIMEOUT=2592000 XDR=TRUE
[local output] - Exit code of system() call: 0
[local output] Waiting for worker #1 on ‘localhost’ to connect back
[local output] - Detected 'outfile=NULL': this will make the output from the background worker visible
starting worker pid=6502 on localhost:11102 at 14:41:23.277
[local output] Connection with worker #1 on ‘localhost’ established
[local output] - collecting session information
[local output] Creating node 1 of 1 ... done

(a) a firewall issue

The system firewall is off by default, and I turned off both the pfctl firewall and an additional firewall application, LittleSnitch, but the problem persists.

Hm, I don't really know how to debug this problem. Do you know whether it might be a general Mac problem? (I read something online about the net=host option being implemented differently on Docker for Mac.)

@HenrikBengtsson

Owner

commented Nov 29, 2018

Hm, I don't really know how to debug this problem. Do you know whether it might be a general Mac problem (I read something online about the net=host function being differently implemented on Docker for Mac)

Don't know, and unfortunately, I don't have access to macOS so I cannot poke around myself. We need help from others on macOS who can either confirm it works for them or that they can reproduce the same problem you've got. It might be specific to a particular macOS version or Docker version (I'm using Docker version 18.09.0, build 4d60db4 on my Linux box).

PS. When troubleshooting, it's useful to understand that nothing done here really involves the Future API per se. The makeClusterPSOCK() function is in the future package because that was the most natural place to put it when I started to "patch" parallel::makePSOCKcluster(). One day the function might deserve to be in a standalone 'parallel.extras' package or even be incorporated into parallel::makePSOCKcluster().

@januz

Author

commented Nov 30, 2018

@HenrikBengtsson I just tried to test it with another Mac OS X account on my computer, but the problem persists. Next thing I'll do is check it on my wife's Mac.

I found this StackOverflow thread. It seems as though the thread starter had similar problems as I do and solved them with a clean install of Mac OS X.

Thanks for all your help so far, I'll let you know if I find a solution!

@januz

Author

commented Nov 30, 2018

Hm, I just tried it on my wife's computer, which -- in comparison to mine -- is a relatively clean install of Mac OS X without any bigger special configurations, etc.

The result is exactly the same. So I'm getting more convinced that it is a Mac-related problem :/

@januz

Author

commented Nov 30, 2018

for reference, this is the general Docker on Mac issue I mentioned above which describes problems with using net=host and accessing the host. This might be at the core of my problems...

@januz

Author

commented Nov 30, 2018

OK, so my (limited) understanding of the issue is that on Mac (as on Windows) the Docker app only appears to run natively; in reality it runs a Linux VM that hosts the containers. Traffic that a container sends to "localhost" therefore cannot reach the Mac, because from the container's point of view "localhost" is the VM.

On the Docker for Mac (and Windows) help page, they suggest using host.docker.internal to communicate with the host system.

@HenrikBengtsson I don't understand your code well enough to know whether and if so how I could use this alias in a call to makeClusterPSOCK/makeNodePSOCK. Is that possible?

@HenrikBengtsson

Owner

commented Nov 30, 2018

Aha... I see. So, if I understand you and those references correctly, we want the call not to be:

'docker' 'run' '--net=host' 'rocker/r-base' 
  'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods 
  -e 'parallel:::.slaveRSOCK()' MASTER=localhost
  PORT=11359 OUT= TIMEOUT=2592000 XDR=TRUE

but instead be:

'docker' 'run' '--net=host' 'rocker/r-base' 
  'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods 
  -e 'parallel:::.slaveRSOCK()' MASTER=host.docker.internal   <<<=== 
  PORT=11359 OUT= TIMEOUT=2592000 XDR=TRUE

Is that correct? If so, then just add argument master="host.docker.internal" in the setup. Here is what I get on my Linux in a "dry run" call:

> cl <- future::makeClusterPSOCK("localhost", rscript = c("docker", "run", "--net=host", "rocker/r-base", "Rscript"), master = "host.docker.internal", outfile = NULL, verbose = TRUE, dryrun = TRUE)
[local output] Workers: [n = 1] ‘localhost’
[local output] Base port: 11446
[local output] Creating node 1 of 1 ...
[local output] - setting up node
----------------------------------------------------------------------
Manually, start worker #1 on local machine ‘localhost’ with:

  '/usr/bin/docker' 'run' '--net=host' 'rocker/r-base' 'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=host.docker.internal PORT=11446 OUT= TIMEOUT=2592000 XDR=TRUE

[local output] - collecting session information
[local output] Creating node 1 of 1 ... done
@januz

Author

commented Nov 30, 2018

Is that correct? If so, then just add argument master="host.docker.internal" in the setup.

YES. That works! Thank you so much and sorry that I didn't find out on my own that one can set the master argument in the function call...

Here is what I get on my Linux in a "dry run" call:

What happens when you do it with dryrun = FALSE? From what I saw on the web, host.docker.internal is not yet implemented for Linux (one doesn't need it there, but people are pushing Docker to implement it to be able to write code that works on all platforms), but only for Windows and Mac. So I would expect this command to fail on your machine.

HenrikBengtsson added a commit that referenced this issue Nov 30, 2018
EXAMPLE: makeClusterPSOCK() now mentions master='host.docker.internal' when running via Docker on macOS & Windows [#265]
@HenrikBengtsson

Owner

commented Nov 30, 2018

Brilliant. Correct, using host.docker.internal on Linux will cause it to hang at

starting worker pid=1 on host.docker.internal:11996 at 16:54:55.501

because the worker fails to connect back to the master R session (which is exactly the same as your original problem). Unfortunately, there's no "one-liner" available in R to detect whether we are running on macOS or Windows.

I've updated the Docker example in example(makeClusterPSOCK) to:

## EXAMPLE: Two workers running in Docker on the local machine
## Setup of 2 Docker workers running rocker/r-base
## (requires installation of future package)
cl <- makeClusterPSOCK(
  rep("localhost", times = 2L),
  ## Launch Rscript inside Docker container
  rscript = c(
    "docker", "run", "--net=host", "rocker/r-base",
    "Rscript"
  ),
  ## Install future package
  rscript_args = c(
    "-e", shQuote("install.packages('future')")
  ),
  ## IMPORTANT: Because Docker runs inside a virtual machine (VM) on macOS
  ## and Windows (not Linux), when the R worker tries to connect back to
  ## the default 'localhost' it will fail, because the main R session is
  ## not running in the VM, but outside on the host.  To reach the host on
  ## macOS and Windows, make sure to use master = "host.docker.internal"
  # master = "host.docker.internal",  # <= macOS & Windows
  dryrun = TRUE
)

Thanks for figuring this one out. I'm sure there are other macOS/Windows users who have been and will be struggling with this R+Docker issue (also outside of the future framework). Having this documented will increase the chances for them to find the solution.
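As a minimal sketch of how the documented example might be used for real, i.e. without dryrun = TRUE, the call can be wrapped in a small function. start_docker_workers() is a hypothetical name that is not part of the future package, and the sketch assumes that Docker and future are installed and that the rocker/r-base image can be pulled. It uses only arguments that appear in this thread.

```r
# Hypothetical wrapper (not part of the future package): launch 'n' R workers
# in rocker/r-base containers that connect back to the host name 'master'
start_docker_workers <- function(n = 2L, master = "localhost") {
  future::makeClusterPSOCK(
    rep("localhost", times = n),
    ## Launch Rscript inside the Docker container
    rscript = c("docker", "run", "--net=host", "rocker/r-base", "Rscript"),
    ## Host name the worker uses to connect back to the main R session
    master = master,
    ## Show worker output and setup details while troubleshooting
    outfile = NULL,
    verbose = TRUE
  )
}

# Usage (not run here, because it requires Docker):
# cl <- start_docker_workers(master = "host.docker.internal")  # macOS & Windows
# parallel::clusterEvalQ(cl, Sys.info()[["sysname"]])  # OS seen by each worker
# parallel::stopCluster(cl)
```

Keeping master as an argument makes it possible to pass "localhost" on Linux and "host.docker.internal" on macOS and Windows without changing the rest of the call.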

@januz

Author

commented Nov 30, 2018

to be able to use the same function independently of the user's OS, I now use

  master = ifelse(
    Sys.info()["sysname"] == "Linux", "localhost", "host.docker.internal"
  )

That should do the trick, right?

Thanks for all your help on this!

@januz januz closed this Nov 30, 2018

@HenrikBengtsson

Owner

commented Nov 30, 2018

 master = ifelse(
   Sys.info()["sysname"] == "Linux", "localhost", "host.docker.internal"
 )

That should do the trick, right?

Not 100% sure, because there are Unix systems that are not Linux, so I'd expect those to return a different uname sysname, e.g. OpenBSD and SunOS(?). Not sure if those support Docker though? Searching around on the interwebs, I see other discussions on how to detect macOS from R. Maybe grepl("(Darwin|Windows)", Sys.info()["sysname"]) works to detect macOS & Windows?

BTW, I also swept Windows (>= 10) under the rug; I think Windows 10 supports Docker containers natively, so one should probably use the default master="localhost" there, cf. https://docs.microsoft.com/en-us/virtualization/windowscontainers/quick-start/quick-start-windows-10.

PS. My rule of thumb is to always avoid ifelse() unless you really need a vectorized if-else, because one day you might not get what you intended. I'd write yours as if (Sys.info()["sysname"] == "Linux") "localhost" else "host.docker.internal".
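The two suggestions above can be combined in a small helper. This is only a sketch: docker_master_host() is a hypothetical name, not part of the future package, and it assumes that Docker runs inside a VM on every macOS ("Darwin") and Windows system, which may not hold for native Windows containers. The last lines show one concrete way ifelse() can differ from a plain if/else: ifelse() keeps the attributes of its test, so the names of Sys.info()["sysname"] carry over into the result.

```r
# Hypothetical helper (not part of the future package): pick the host name a
# Docker-based worker should use to connect back to the main R session
docker_master_host <- function(sysname = Sys.info()[["sysname"]]) {
  # Assumption: Docker runs inside a VM on macOS ("Darwin") and Windows
  if (sysname %in% c("Darwin", "Windows")) {
    "host.docker.internal"
  } else {
    "localhost"
  }
}

# ifelse() keeps the attributes of 'test', so the name "sysname" carries over;
# a plain if/else returns an unnamed string
test <- c(sysname = TRUE)
named <- ifelse(test, "localhost", "host.docker.internal")
plain <- if (test) "localhost" else "host.docker.internal"
print(names(named))
print(names(plain))
```

The helper's result could then be passed as master = docker_master_host() in the makeClusterPSOCK() call.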

@januz

Author

commented Nov 30, 2018

Maybe grepl("(Darwin|Windows)", Sys.info()["sysname"]) works to detect macOS & Windows

My rule of thumb is to always avoid ifelse() unless you really need vectorized if-else

Thanks for the tips!

I think Windows 10 supports Docker containers natively so should probably use the default master="localhost" there,

Oh, I didn't know that! From what I read yesterday, I assumed that what is most likely to happen is that host.docker.internal will be implemented for Linux as well, so that one could switch to setting master = "host.docker.internal" as default then...
