
addprocs() does not work any more with SlurmManager() #18762

Closed: axsk opened this issue Oct 2, 2016 · 5 comments
Labels: domain:parallelism (Parallel or distributed computation)

axsk (Contributor) commented Oct 2, 2016

I already reported this issue at ClusterManagers.jl, but looking at its code I think the problem may well lie within Julia itself.

using ClusterManagers
addprocs_slurm(1)

results in

srun: job 676 queued and waiting for resources
srun: job 676 has been allocated resources
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
 in process_hdr(::TCPSocket, ::Bool) at ./multi.jl:1410
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1299
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1276
 in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at ./event.jl:68
srun: error: htc032: task 0: Exited with exit code 1

after which nothing happens (no REPL prompt).
Pressing Ctrl+C terminates Julia with

^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7f1186016ae2)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)

Here is my Julia version

Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Quad-Core AMD Opteron(tm) Processor 8384
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Barcelona)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, amdfam10)

It worked fine on the previous 0.4.6 version.

omalled commented Nov 17, 2016

I have encountered a similar issue, but with regular addprocs, not addprocs_slurm. My code looks like this:

addprocs(64)
addprocs(fill("remotemachine", 64))

On Julia 0.4.6 this code works fine, but it fails on 0.5.0. I have to hit Ctrl+C twice to get it to stop hanging. The last handful of lines of the output when it fails are below. I have worked around this issue by reversing the order of the addprocs calls. That is,

addprocs(fill("remotemachine", 64))
addprocs(64)

works for me on 0.5.0.

...
ERROR: connect: connection refused (ECONNREFUSED)
in yieldto(::Task, ::ANY) at ./event.jl:136
in wait() at ./event.jl:169
in wait(::Condition) at ./event.jl:27
in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
in wait_connected(::TCPSocket) at ./stream.jl:265
in connect at ./stream.jl:960 [inlined]
in connect_to_worker(::String, ::Int16) at ./managers.jl:483
in connect_w2w(::Int64, ::WorkerConfig) at ./managers.jl:446
in connect(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./managers.jl:380
in connect_to_peer(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./multi.jl:1479
in (::Base.##637#639)() at ./task.jl:360
Error [connect: connection refused (ECONNREFUSED)] on 34 while connecting to peer 2. Exiting.
ERROR: connect: connection refused (ECONNREFUSED)
in yieldto(::Task, ::ANY) at ./event.jl:136
in wait() at ./event.jl:169
in wait(::Condition) at ./event.jl:27
in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
in wait_connected(::TCPSocket) at ./stream.jl:265
in connect at ./stream.jl:960 [inlined]
in connect_to_worker(::String, ::Int16) at ./managers.jl:483
in connect_w2w(::Int64, ::WorkerConfig) at ./managers.jl:446
in connect(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./managers.jl:380
in connect_to_peer(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./multi.jl:1479
in (::Base.##637#639)() at ./task.jl:360

Worker 34 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7faf1c939b62)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)
^CInterruptException
atexit hook threw an error: ErrorException("schedule: Task not runnable")

amitmurthy (Contributor) commented
This is a different issue. On 0.5, a local addprocs binds only to localhost by default. So this may work for you:

addprocs(64; restrict=false)
addprocs(fill("remotemachine", 64))

This allows workers started later to connect to the previously started local workers, provided the network permits it.
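To make the ordering issue concrete, here is a minimal sketch, assuming a multi-node setup where a host named `remotemachine` (hypothetical, as in omalled's report) is reachable over SSH. On 0.5 `addprocs` lives in Base; the `using Distributed` line is only needed on later Julia versions:

```julia
using Distributed  # not needed on 0.5, where addprocs is in Base

# Broken ordering on 0.5: the local workers bind to 127.0.0.1 only
# (restrict=true is the default for the local manager), so the remote
# workers added next cannot open worker-to-worker connections back
# to them, and setup hangs with ECONNREFUSED errors.
#   addprocs(4)
#   addprocs(fill("remotemachine", 4))

# Workaround A (amitmurthy's suggestion): let the local workers
# bind to all interfaces so remote peers can reach them.
addprocs(4; restrict=false)
addprocs(fill("remotemachine", 4))

# Workaround B (omalled's): add the remote workers first, so the
# local workers started afterwards are the ones initiating the
# worker-to-worker connections.
```

This is a sketch of the behavior described in the thread, not a tested reproduction; whether workaround A suffices depends on the cluster's firewall rules.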

andreasnoack (Member) commented

@amitmurthy Should this issue be moved to ClusterManagers?

amitmurthy (Contributor) commented

Let's leave it open here. The issue has already been filed at ClusterManagers, and I am not quite sure where the problem lies.

ViralBShah (Member) commented

This should go to https://github.com/JuliaParallel/ClusterManagers.jl/issues if still relevant.

6 participants