addprocs_slurm() fails since Julia 0.5 #48

Closed
axsk opened this issue Oct 1, 2016 · 17 comments
@axsk

axsk commented Oct 1, 2016

When trying to start new procs on the Slurm cluster via addprocs(SlurmManager(n)) I get the following error message (this worked with 0.4):

julia> using ClusterManagers; addprocs_slurm(1)
srun: job 900 queued and waiting for resources
srun: job 900 has been allocated resources
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
 in process_hdr(::TCPSocket, ::Bool) at ./multi.jl:1410
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1299
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1276
 in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at ./event.jl:68
srun: error: worker50: task 0: Exited with exit code 1

Nothing happens after the error message, and pressing Ctrl+C at that point crashes Julia :/

InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7f8f0ed650a2)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)
@andreasnoack
Member

andreasnoack commented Oct 3, 2016

Do you have more than one version of Julia installed?

@amitmurthy
Contributor

This is most probably due to the workers running 0.4 and the master 0.5, or the other way around.
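
One way to test the version-mismatch theory is to compare the master's version with what a compute node reports. The following is a minimal sketch, not part of ClusterManagers: it assumes current Julia syntax (`Sys.BINDIR` was named differently in 0.5), that `srun` is on the PATH, and that the Julia binary sits at the same path on the worker nodes. `worker_version_cmd` is a hypothetical helper name.

```julia
# Hypothetical helper: build an srun command that prints the Julia version
# seen by a single task on a compute node. `exe` defaults to the binary
# running this session (assumption: the same path exists on the workers).
function worker_version_cmd(exe::AbstractString = joinpath(Sys.BINDIR, "julia"))
    return `srun -n 1 $exe -e 'println(VERSION)'`
end

println("master: ", VERSION)
cmd = worker_version_cmd()
println("to check a worker, run: ", cmd)
# On a machine where Slurm is available: run(cmd)
```

If the version printed by the compute node differs from the master's, the two sides are not running the same Julia.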

@axsk
Author

axsk commented Oct 6, 2016

I have just the official 0.5 precompiled binary of Julia on all systems.
I just rechecked and there is no other version of Julia flying around anywhere.

Julia also runs fine on the workers when started manually.

Edit: I had 0.4.6 and 0.5.0 installed previously, then removed 0.4.6 and tried again with the same result.

@axsk changed the title from "Can't addprocs on Slurm since 0.5 upgrade." to "addprocs_slurm() fails since Julia 0.5" on Oct 15, 2016
@axsk
Author

axsk commented Oct 15, 2016

I just tested using ClusterManagers; addprocs_slurm(1) with 0.4.7, and there it runs flawlessly.

@amitmurthy
Contributor

Have you run Pkg.update() on 0.5? You will need a compatible version of ClusterManagers too.

@axsk
Author

axsk commented Oct 17, 2016

I did Pkg.update(). I also tried Pkg.checkout for ClusterManagers, with the same result.

@amitmurthy
Contributor

Unfortunately, I don't have access to a SLURM setup to try this out. Can you post any output from the dead worker? I think it is written to jobN.out files in the CWD on the login node.

Ref:

fn = "$exehome/job$(lpad(i, 4, "0")).out"
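
The worker output files named above can be gathered in one go. This is a minimal sketch assuming current Julia syntax and the `jobNNNN.out` naming shown in the referenced line; `worker_logs` is a hypothetical helper, not a ClusterManagers function.

```julia
# Hypothetical helper: return name => contents pairs for every file in `dir`
# whose name looks like job0000.out, job0001.out, and so on.
function worker_logs(dir::AbstractString = pwd())
    names = filter(n -> occursin(r"^job\d+\.out$", n), readdir(dir))
    return [n => read(joinpath(dir, n), String) for n in sort(names)]
end

# Print each worker log found in the current working directory.
for (name, text) in worker_logs()
    println("== ", name, " ==")
    println(text)
end
```

Run it from the directory where the job was submitted, since that is where the files are expected to appear.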

@dmbates

dmbates commented Oct 18, 2016

I found that addprocs_slurm(n) failed for me but addprocs(SlurmManager(n),...) worked.

@axsk
Author

axsk commented Oct 18, 2016

Here is the job0000.out:

julia_worker:9009#(theip)
ErrorException("Process(1) - Invalid connection credentials sent by remote.")CapturedException(ErrorException("Process(1) - Invalid connection credentials sent by remote."),Any[( in process_hdr(::TCPSocket, ::Bool) at multi.jl:1400,1),( in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1299,1),( in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1276,1),( in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at event.jl:68,1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

I see no difference between addprocs_slurm(n) and addprocs(SlurmManager(n)).

@amitmurthy
Contributor

Can you check out branch amitm/debug and try it? I have added a couple of debug statements which print the local cookie and the slurm command.

@axsk
Author

axsk commented Oct 19, 2016

Here are the results:

cookie: 2UPo2qVQVRPUhqRD
VERSION: 0.5.0
worker_arg: `--worker 2UPo2qVQVRPUhqRD`
srun_cmd: `srun -J julia-29603 -n 1 -o job%4t.out -D /nfs/numerik/bzfsikor /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia --worker 2UPo2qVQVRPUhqRD`

@amitmurthy
Contributor

I just pushed another update to amitm/debug which uses Base.start_worker in srun_cmd instead of the --worker option. Can you try with that and also post the debug output here?

@axsk
Author

axsk commented Oct 19, 2016

Seems you had the right hunch here; it is working now :)

julia> using ClusterManagers; addprocs_slurm(1)
srun_cmd: `srun -J julia-19892 -n 1 -o job%4t.out -D /nfs/datanumerik/bzfsikor/julia/pkgdir/v0.5/ClusterManagers /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia -e 'Base.start_worker("1E5ahz7GRYCxfGs8")'`
srun: job 1103 queued and waiting for resources
srun: job 1103 has been allocated resources

1-element Array{Int64,1}:
 2

So there is something wrong with the --worker command line option? Strange nobody else had problems...

@amitmurthy
Contributor

Maybe it has something to do with the local environment on your cluster. What is the locale on the worker nodes? Is it a non-English language?

@vtjnash : do you have any ideas as to what could be the problem here?

The cookie is being passed as a required arg with --worker and is read here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/client.jl#L224

The comparison which is failing on @axsk's system is here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/multi.jl#L1402-L1406 - basically, the cookie read from the command line is compared to the one read from the socket.

@axsk : are you open to building julia from source for the workers? I can provide a patch with appropriate debug statements to track down this issue.
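
The cookie comparison described above can be sketched as follows. This is an illustration of the idea only, not the code in multi.jl: it assumes current Julia syntax and a 16-character cookie (the length of the cookies in the debug output in this thread), and `COOKIE_LEN` and `cookie_matches` are hypothetical names.

```julia
# Assumption: cookies are 16 characters, as in the debug output in this thread.
const COOKIE_LEN = 16

# Read up to COOKIE_LEN bytes from `io` (standing in for the socket) and
# compare them with the cookie the process was started with.
function cookie_matches(io::IO, expected::AbstractString)
    received = String(read(io, COOKIE_LEN))
    return received == expected
end

expected = "2UPo2qVQVRPUhqRD"   # cookie as given on the command line
println(cookie_matches(IOBuffer(expected), expected))            # same cookie sent
println(cookie_matches(IOBuffer("1E5ahz7GRYCxfGs8"), expected))  # different cookie sent
```

In the failing case reported here, this kind of check comes out false on the worker, which matches the "Invalid connection credentials sent by remote" message in job0000.out.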

@axsk
Author

axsk commented Oct 20, 2016

locale returns en_US.UTF-8 everywhere.

I'm open to building Julia with the debug statements, but probably not today, since I now have enough work actually running the code I needed on the cluster :)

@axsk
Author

axsk commented Aug 1, 2017

Here I am with another update:

I reinstalled all the packages (now on Julia 0.5.2), so the Base.start_worker patch is gone.
The bug returned, i.e. I get the same version read error, but it only happens about 50% of the time when I run addprocs.

@juliohm added the SLURM label on Oct 6, 2020
@juliohm
Collaborator

juliohm commented Oct 6, 2020

Too old to reproduce. Please check the new release.

@juliohm closed this as completed on Oct 6, 2020