addprocs_slurm() fails since Julia 0.5 #48

axsk · 2016-10-01T22:32:22Z

When trying to start new procs on the Slurm cluster via addprocs(SlurmManager(n)) I get the following error message (this worked with 0.4):

julia> using ClusterManagers; addprocs_slurm(1)
srun: job 900 queued and waiting for resources
srun: job 900 has been allocated resources
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
 in process_hdr(::TCPSocket, ::Bool) at ./multi.jl:1410
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1299
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1276
 in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at ./event.jl:68
srun: error: worker50: task 0: Exited with exit code 1

Nothing happens after the error message, and pressing Ctrl+C at that point crashes Julia :/

InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7f8f0ed650a2)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)


andreasnoack · 2016-10-03T15:28:39Z

Do you have more than one version of Julia installed?

amitmurthy · 2016-10-04T04:36:47Z

This is most probably due to the workers being on 0.4 and the master on 0.5, or the other way around.

axsk · 2016-10-06T14:34:31Z

I have just the official 0.5 precompiled binary of Julia on all systems.
I just rechecked and there is no other version of Julia flying around anywhere.

Julia also runs fine on the workers when started manually.

Edit: I had 0.4.6 and 0.5.0 installed previously, then removed 0.4.6 and tried again with the same result.

axsk · 2016-10-15T03:08:30Z

I just tested using ClusterManagers; addprocs_slurm(1) with 0.4.7, and there it runs flawlessly.

amitmurthy · 2016-10-16T06:36:27Z

Have you done a Pkg.update() on 0.5? You will need a compatible version of ClusterManagers too.

axsk · 2016-10-17T22:33:47Z

I did Pkg.update(). I also tried Pkg.checkout for ClusterManagers, with the same result.

amitmurthy · 2016-10-18T04:22:13Z

Unfortunately I don't have access to a SLURM setup to try this out. Can you post any output from the dead worker? I think it is written as jobN.out files in the CWD on the login node.

Ref: ClusterManagers.jl/src/slurm.jl, line 49 in 00b1139:

fn = "$exehome/job$(lpad(i, 4, "0")).out"
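To collect those files in one go, something like the following could be used. This is only a sketch: `read_job_outputs` is a hypothetical helper (not part of ClusterManagers), it assumes the files sit in `dir` and are numbered from 0 as in the line above, and it is written for Julia 1.x (on 0.5, `readstring(fn)` would replace `read(fn, String)`).

```julia
# Hypothetical helper, not part of ClusterManagers.
# Assumes the workers wrote job0000.out, job0001.out, ... into `dir`.
function read_job_outputs(dir::AbstractString, ntasks::Integer)
    outputs = Dict{String,String}()
    for i in 0:ntasks-1
        # Same naming scheme as the slurm.jl line referenced above.
        fn = joinpath(dir, "job$(lpad(i, 4, "0")).out")
        # Skip tasks that never produced a file.
        isfile(fn) && (outputs[fn] = read(fn, String))
    end
    return outputs
end

# Usage: print everything found in the current working directory for one task.
for (fn, text) in read_job_outputs(pwd(), 1)
    println("==> ", fn)
    println(text)
end
```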

dmbates · 2016-10-18T16:10:12Z

I found that addprocs_slurm(n) failed for me but addprocs(SlurmManager(n),...) worked.

axsk · 2016-10-18T20:08:27Z

Here is the job0000.out:

julia_worker:9009#(theip)
ErrorException("Process(1) - Invalid connection credentials sent by remote.")CapturedException(ErrorException("Process(1) - Invalid connection credentials sent by remote."),Any[( in process_hdr(::TCPSocket, ::Bool) at multi.jl:1400,1),( in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1299,1),( in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1276,1),( in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at event.jl:68,1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

I experience no difference between addprocs_slurm(n) and addprocs(SlurmManager(n)).

amitmurthy · 2016-10-19T03:48:06Z

Can you check out the branch amitm/debug and try? I have added a couple of debug statements which print the local cookie and the slurm command.

axsk · 2016-10-19T10:33:32Z

Here are the results:

cookie: 2UPo2qVQVRPUhqRD
VERSION: 0.5.0
worker_arg: `--worker 2UPo2qVQVRPUhqRD`
srun_cmd: `srun -J julia-29603 -n 1 -o job%4t.out -D /nfs/numerik/bzfsikor /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia --worker 2UPo2qVQVRPUhqRD`

amitmurthy · 2016-10-19T13:51:20Z

I just pushed another update to amitm/debug which uses Base.start_worker in srun_cmd instead of the --worker option. Can you try with that and also post the debug output here?

axsk · 2016-10-19T19:51:16Z

Seems you had the right instinct here, it is working now :)

julia> using ClusterManagers; addprocs_slurm(1)
srun_cmd: `srun -J julia-19892 -n 1 -o job%4t.out -D /nfs/datanumerik/bzfsikor/julia/pkgdir/v0.5/ClusterManagers /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia -e 'Base.start_worker("1E5ahz7GRYCxfGs8")'`
srun: job 1103 queued and waiting for resources
srun: job 1103 has been allocated resources

1-element Array{Int64,1}:
 2

So there is something wrong with the --worker command line option? Strange that nobody else had problems...

amitmurthy · 2016-10-20T04:30:01Z

Maybe it has something to do with the local environment on your cluster. What is the locale on the worker nodes? Is it a non-English language?

@vtjnash: do you have any ideas as to what could be the problem here?

The cookie is being passed as a required arg with --worker and is read here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/client.jl#L224

The comparison which is failing on @axsk's system is here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/multi.jl#L1402-L1406 - basically the cookie read from the command line is compared to the one read from the socket.

@axsk: are you open to building Julia from source for the workers? I can provide a patch with appropriate debug statements to track down this issue.
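For illustration, the check described above amounts to roughly the following. This is a simplified sketch, not the code in base/multi.jl: `cookie_matches` and `COOKIE_LEN` are made-up names, the length of 16 is an assumption taken from the 16-character cookies in the debug output, and an `IOBuffer` stands in for the TCP socket. It is written for Julia 1.x.

```julia
# Sketch only; names and the fixed length are assumptions, not Base internals.
const COOKIE_LEN = 16

# Read COOKIE_LEN bytes from `io` (the socket in the real code) and compare
# them with the cookie this process was started with.
function cookie_matches(io::IO, expected::AbstractString)
    received = String(read(io, COOKIE_LEN))
    return received == expected
end

# The worker was launched with this cookie on its command line ...
local_cookie = "2UPo2qVQVRPUhqRD"

# ... and the master sends its own cookie as the first bytes on the connection.
println(cookie_matches(IOBuffer("2UPo2qVQVRPUhqRD"), local_cookie))  # true
println(cookie_matches(IOBuffer("1E5ahz7GRYCxfGs8"), local_cookie))  # false
```

A mismatch at this point corresponds to the "Invalid connection credentials sent by remote" message in job0000.out, so printing both values on the worker side would show which of the two is not what was expected.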

axsk · 2016-10-20T08:52:29Z

locale returns en_US.UTF-8 everywhere.

I'm open to building Julia with the debug statements, but probably not today, since I now have enough work with actually running the code I needed on the cluster :)

axsk · 2017-08-01T10:05:14Z

Here I am with another update:

I reinstalled all the packages (now on Julia 0.5.2), hence the Base.start_worker patch is gone.
The bug returned, i.e. I get the same version read error, but it only happens about 50% of the time when I run addprocs.

juliohm · 2020-10-06T19:26:16Z

Too old to reproduce. Please check the new release.

axsk mentioned this issue Oct 2, 2016

addprocs() does not work any more with SlurmManager() JuliaLang/julia#18762

Closed

axsk changed the title ~~Can't addprocs on Slurm since 0.5 upgrade.~~ addprocs_slurm() fails since Julia 0.5 Oct 15, 2016

amitmurthy mentioned this issue Oct 20, 2016

"Cookie read failed" with julia5 --machinefile hosts JuliaLang/julia#18424

Closed

juliohm added the SLURM label Oct 6, 2020

juliohm closed this as completed Oct 6, 2020
