
addprocs() does not work any more with SlurmManager() #18762

Closed: axsk opened this issue Oct 2, 2016 · 5 comments
Labels: domain:parallelism (Parallel or distributed computation)

axsk (Contributor) commented Oct 2, 2016

I already reported this issue at ClusterManagers.jl, but looking at its code I think the problem may well lie within Julia itself.

using ClusterManagers
addprocs_slurm(1)

results in

srun: job 676 queued and waiting for resources
srun: job 676 has been allocated resources
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
 in process_hdr(::TCPSocket, ::Bool) at ./multi.jl:1410
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1299
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1276
 in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at ./event.jl:68
srun: error: htc032: task 0: Exited with exit code 1

after which nothing happens (no REPL prompt).
Pressing Ctrl+C terminates Julia with

^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7f1186016ae2)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)

Here is my Julia version

Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Quad-Core AMD Opteron(tm) Processor 8384
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Barcelona)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, amdfam10)

It worked fine on the previous 0.4.6 version.

omalled commented Nov 17, 2016

I have encountered a similar issue, but with regular addprocs, not addprocs_slurm. My code looks like this:

addprocs(64)
addprocs(fill("remotemachine", 64))

On Julia 0.4.6 this code works fine, but it fails on 0.5.0. I have to hit Ctrl+C twice to get it to stop hanging. The last handful of lines of the output when it fails are below. I have worked around this issue by reversing the order of the addprocs calls. That is,

addprocs(fill("remotemachine", 64))
addprocs(64)

works for me on 0.5.0.

...
ERROR: connect: connection refused (ECONNREFUSED)
in yieldto(::Task, ::ANY) at ./event.jl:136
in wait() at ./event.jl:169
in wait(::Condition) at ./event.jl:27
in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
in wait_connected(::TCPSocket) at ./stream.jl:265
in connect at ./stream.jl:960 [inlined]
in connect_to_worker(::String, ::Int16) at ./managers.jl:483
in connect_w2w(::Int64, ::WorkerConfig) at ./managers.jl:446
in connect(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./managers.jl:380
in connect_to_peer(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./multi.jl:1479
in (::Base.##637#639)() at ./task.jl:360
Error [connect: connection refused (ECONNREFUSED)] on 34 while connecting to peer 2. Exiting.
ERROR: connect: connection refused (ECONNREFUSED)
in yieldto(::Task, ::ANY) at ./event.jl:136
in wait() at ./event.jl:169
in wait(::Condition) at ./event.jl:27
in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
in wait_connected(::TCPSocket) at ./stream.jl:265
in connect at ./stream.jl:960 [inlined]
in connect_to_worker(::String, ::Int16) at ./managers.jl:483
in connect_w2w(::Int64, ::WorkerConfig) at ./managers.jl:446
in connect(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./managers.jl:380
in connect_to_peer(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./multi.jl:1479
in (::Base.##637#639)() at ./task.jl:360

Worker 34 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7faf1c939b62)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)
^CInterruptException
atexit hook threw an error: ErrorException("schedule: Task not runnable")

amitmurthy (Contributor) commented
This is a different issue. On 0.5, a local addprocs binds only to localhost by default. So this may work for you:

addprocs(64; restrict=false)
addprocs(fill("remotemachine", 64))

This allows workers started later to connect to the previously started local workers, provided the network permits it.
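To make the ordering issue concrete, here is a minimal sketch, assuming a multi-node setup where a host named `remotemachine` (hypothetical, as in omalled's report) is reachable over SSH. On 0.5 `addprocs` lives in Base; the `using Distributed` line is only needed on later Julia versions:

```julia
using Distributed  # not needed on 0.5, where addprocs is in Base

# Broken ordering on 0.5: the local workers bind to 127.0.0.1 only
# (restrict=true is the default for the local manager), so the remote
# workers added next cannot open worker-to-worker connections back
# to them, and setup hangs with ECONNREFUSED errors.
#   addprocs(4)
#   addprocs(fill("remotemachine", 4))

# Workaround A (amitmurthy's suggestion): let the local workers
# bind to all interfaces so remote peers can reach them.
addprocs(4; restrict=false)
addprocs(fill("remotemachine", 4))

# Workaround B (omalled's): add the remote workers first, so the
# local workers started afterwards are the ones initiating the
# worker-to-worker connections.
```

This is a sketch of the behavior described in the thread, not a tested reproduction; whether workaround A suffices depends on the cluster's firewall rules.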

andreasnoack (Member) commented

@amitmurthy Should this issue be moved to ClusterManagers?

amitmurthy (Contributor) commented

Let's leave it open here. The issue has already been filed at ClusterManagers, and I am not quite sure where the problem lies.

ViralBShah (Member) commented

This should go to https://github.com/JuliaParallel/ClusterManagers.jl/issues if still relevant.

6 participants