addprocs(n); addprocs(m, cman=SGEManager()) fails, while the reverse works fine #6

bjarthur · 2014-02-25T20:01:47Z

if i add local workers before adding remote SGE workers, then the SGE workers terminate with an ECONNREFUSED error. if i reverse the order, and add the SGE workers before the local workers, then all is good. i presume this is not the desired behavior. let me know if there is any way i can help debug. sample output and versioninfo below.

[arthurb@h01u14 ~]$ juliac
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease (2014-02-24 14:04 UTC)
 _/ |\__'_|_|_|\__'_|  |  master/457bca9* (fork: -1 commits, 126 days)
|__/                   |  x86_64-redhat-linux

julia> addprocs(16)
16-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365131, waiting for job to start ..............................
10-element Array{Any,1}:
18
19
20
21
22
23
24
25
26
27

julia> Worker 19 terminated.
Worker 20 terminated.
Worker 21 terminated.
Worker 22 terminated.
Worker 18 terminated.
Worker 25 terminated.
Worker 24 terminated.
Worker 23 terminated.
Worker 27 terminated.
Worker 26 terminated.
From worker 18: fatal error on 18: ERROR: connect: connection refused (ECONNREFUSED)
From worker 18: in wait_connected at stream.jl:265
From worker 18: in connect at stream.jl:871
From worker 18: in Worker at multi.jl:119
From worker 18: in anonymous at task.jl:866
From worker 19: fatal error on 19: ERROR: connect: connection refused (ECONNREFUSED)
From worker 19: in wait_connected at stream.jl:265
From worker 19: in connect at stream.jl:871
From worker 19: in Worker at multi.jl:119
From worker 19: in anonymous at task.jl:866
From worker 20: fatal error on 20: ERROR: connect: connection refused (ECONNREFUSED)
From worker 20: in wait_connected at stream.jl:265
From worker 20: in connect at stream.jl:871
From worker 20: in Worker at multi.jl:119
From worker 20: in anonymous at task.jl:866
From worker 21: fatal error on 21: ERROR: connect: connection refused (ECONNREFUSED)
From worker 21: in wait_connected at stream.jl:265
From worker 21: in connect at stream.jl:871
From worker 21: in Worker at multi.jl:119
From worker 21: in anonymous at task.jl:866
From worker 22: fatal error on 22: ERROR: connect: connection refused (ECONNREFUSED)
From worker 22: in wait_connected at stream.jl:265
From worker 22: in connect at stream.jl:871
From worker 22: in Worker at multi.jl:119
From worker 22: in anonymous at task.jl:866
From worker 23: fatal error on 23: ERROR: connect: connection refused (ECONNREFUSED)
From worker 23: in wait_connected at stream.jl:265
From worker 23: in connect at stream.jl:871
From worker 23: in Worker at multi.jl:119
From worker 23: in anonymous at task.jl:866
From worker 24: fatal error on 24: ERROR: connect: connection refused (ECONNREFUSED)
From worker 24: in wait_connected at stream.jl:265
From worker 24: in connect at stream.jl:871
From worker 24: in Worker at multi.jl:119
From worker 24: in anonymous at task.jl:866
From worker 25: fatal error on 25: ERROR: connect: connection refused (ECONNREFUSED)
From worker 25: in wait_connected at stream.jl:265
From worker 25: in connect at stream.jl:871
From worker 25: in Worker at multi.jl:119
From worker 25: in anonymous at task.jl:866
From worker 26: fatal error on 26: ERROR: connect: connection refused (ECONNREFUSED)
From worker 26: in wait_connected at stream.jl:265
From worker 26: in connect at stream.jl:871
From worker 26: in Worker at multi.jl:119
From worker 26: in anonymous at task.jl:866
From worker 27: fatal error on 27: ERROR: connect: connection refused (ECONNREFUSED)
From worker 27: in wait_connected at stream.jl:265
From worker 27: in connect at stream.jl:871
From worker 27: in Worker at multi.jl:119
From worker 27: in anonymous at task.jl:866

julia>
[arthurb@h01u14 ~]$ juliac
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease (2014-02-24 14:04 UTC)
 _/ |\__'_|_|_|\__'_|  |  master/457bca9* (fork: -1 commits, 126 days)
|__/                   |  x86_64-redhat-linux

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365134, waiting for job to start ..............................
10-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11

julia> addprocs(16)
16-element Array{Any,1}:
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

julia> versioninfo()
Julia Version 0.3.0-prerelease
Commit 457bca9* (2014-02-24 14:04 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
7 required packages:

ClusterManagers 0.0.1
DSP 0.0.1
Debug 0.0.0
Devectorize 0.2.1
Distributions 0.4.0
MAT 0.2.2
WAV 0.2.2
9 additional packages:
ArrayViews 0.4.1
BinDeps 0.2.12
HDF5 0.2.17
NumericExtensions 0.5.4
PDMats 0.1.0
Polynomial 0.0.0
StatsBase 0.3.7
URIParser 0.0.1
Zlib 0.1.5

julia>

amitmurthy · 2014-02-26T03:25:27Z

I am not familiar with SGE, but I can take a guess at what is happening.

In Julia, all workers are connected to each other. The way this works is that after the main process (pid 1) launches a worker, the worker writes the ip:port it is listening on to its stdout. pid 1 connects to this address and then sends the new worker a list of host:port addresses (of existing workers) it in turn should connect to. The later workers always initiate a connection to the previously launched workers.
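
To make the connection order concrete, here is a rough sketch of the handshake described above. It is only a model, not Julia's actual implementation: `WorkerInfo`, `connection_plan`, and the addresses are hypothetical names invented for illustration, and the code uses current Julia syntax rather than 0.3.

```julia
# Hypothetical model of the launch handshake; not the code in multi.jl.
struct WorkerInfo
    pid::Int
    address::String   # "host:port" the worker printed on its stdout (made-up values below)
end

# Return the (from_pid, to_pid) connections made as workers are launched in order.
function connection_plan(workers::Vector{WorkerInfo})
    plan = Tuple{Int,Int}[]
    for (i, w) in enumerate(workers)
        # pid 1 reads the new worker's address and connects to it
        push!(plan, (1, w.pid))
        # the new worker is then told about every earlier worker and dials each one
        for earlier in workers[1:i-1]
            push!(plan, (w.pid, earlier.pid))
        end
    end
    return plan
end

workers = [WorkerInfo(2, "10.0.0.2:9009"),
           WorkerInfo(3, "10.0.0.3:9009"),
           WorkerInfo(4, "10.0.0.4:9009")]

plan = connection_plan(workers)
for (from, to) in plan
    println("pid $from connects to pid $to")
end
```

The point of the sketch is that a later worker is always the one dialing out, so it must be able to reach whatever address was recorded for each earlier worker.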

What seems to be happening is that while workers on localhost can initiate connections to workers on SGE nodes, the reverse is not true, i.e., workers on SGE nodes are not being allowed to connect outside their local network.

Is this a configurable property of SGE?

amitmurthy · 2014-02-26T03:53:18Z

Or, more likely, a firewall in your localhost is not allowing incoming connections from SGE nodes.

bjarthur · 2014-02-28T17:14:05Z

thanks amit. my main julia process (pid 1) is on the cluster (i ssh in and run julia interactively), as are all the workers. the sysadmin tells me that there is no firewall between nodes.

i'm testing the tcp connection between workers. after starting julia and adding remote workers netstat reports one established tcp connection for each, with a port number corresponding to what's in the julia-xxx.oxxx.x files. nc -z succeeds going to the worker, but fails if i ssh into the worker and test the socket in the reverse direction.

so my question: should i expect a second tcp socket for the incoming traffic, and the problem is that it is not there? or should this sole socket be bidirectional?

it might be relevant that each node in this cluster has two NICs, one facing out to the world, the other facing towards the rest of the nodes in the cluster. julia correctly finds the latter ip addr.

here is a transcript of my test session:

julia> using ClusterManagers

julia> addprocs(1, cman=SGEManager())
job id is 6447451, waiting for job to start ......................................................
1-element Array{Any,1}:
2

julia>
[1]+ Stopped /home/arthurb/src/juliac/julia
[arthurb@h01u14 ~]$ cat julia-18256.o6447451.1
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
julia_worker:9009#172.38.104.21
[arthurb@h01u14 ~]$ hostname --ip-address
172.38.101.24
[arthurb@h01u14 ~]$ netstat -an | grep 172.38.104.21
tcp 0 0 172.38.101.24:40886 172.38.104.21:9009 ESTABLISHED
[arthurb@h01u14 ~]$ nc -z 172.38.104.21 9009; echo $?
Connection to 172.38.104.21 9009 port [tcp/pichat] succeeded!
0
[arthurb@h01u14 ~]$ ssh 172.38.104.21
arthurb@172.38.104.21's password:
[arthurb@h04u11 ~]$ netstat -an | grep 172.38.101.24
tcp 0 0 172.38.104.21:22 172.38.101.24:33349 ESTABLISHED
tcp 0 0 172.38.104.21:9009 172.38.101.24:40886 ESTABLISHED
[arthurb@h04u11 ~]$ nc -z 172.38.101.24 40886; echo $?
1
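
On the question of whether one socket is enough: a single established TCP connection is bidirectional, and the port on the connecting side (40886 above) is an ephemeral port with nothing listening on it, so a failing `nc -z` against it would be expected. A minimal sketch on loopback, assuming a current Julia with the `Sockets` standard library (not the 0.3 API); the variable names and messages are made up for illustration:

```julia
using Sockets

# Stand-in for the worker's listening socket; listenany picks a free port at or above the hint.
port, server = listenany(IPv4("127.0.0.1"), 9009)

# Accept in the background while the "master" side connects.
accepted = @async accept(server)
client = connect(IPv4("127.0.0.1"), port)
conn = fetch(accepted)

# The one established connection carries traffic in both directions.
println(client, "ping from master")
from_master = readline(conn)
println(conn, "pong from worker")
from_worker = readline(client)

# The client's local port is ephemeral: it belongs to this connection only,
# and no socket is listening on it, so a fresh connect attempt is refused.
_, ephemeral_port = getsockname(client)
refused = try
    close(connect(IPv4("127.0.0.1"), ephemeral_port))
    false
catch err
    err isa Base.IOError
end

close(client); close(conn); close(server)
```

So the failing probe in the reverse direction does not by itself indicate a missing socket; the existing connection already handles traffic both ways.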

amitmurthy · 2014-03-01T05:42:02Z

Thanks. The main process stores the address of workers started by a localhost addprocs as the loopback address, which is the problem: workers launched later on other nodes cannot reach them at that address. I have opened an issue here: JuliaLang/julia#5995.
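
A simplified model of why the order matters, using the two node addresses from the transcript above. This is only a sketch under stated assumptions: `Peer`, `reachable`, and the "dial succeeds if the address points at the right node" rule are hypothetical simplifications, not Julia's real bookkeeping, and the syntax is current Julia rather than 0.3.

```julia
# Hypothetical model: can a worker on `from_host` reach a peer at its recorded address?
const LOOPBACK = "127.0.0.1"

struct Peer
    pid::Int
    host::String        # node the worker actually runs on
    advertised::String  # address the main process recorded and hands to later workers
end

# Simplification: a dial succeeds only if the advertised address leads to the peer's node.
# A loopback address always resolves to the dialing machine itself.
function reachable(from_host::String, peer::Peer)
    target = peer.advertised == LOOPBACK ? from_host : peer.advertised
    return target == peer.host
end

master_host = "172.38.101.24"   # node running pid 1 and the local workers
sge_host    = "172.38.104.21"   # node running an SGE worker

local_loopback = Peer(2, master_host, LOOPBACK)      # local worker recorded as loopback
sge_peer       = Peer(2, sge_host, sge_host)         # SGE worker recorded by its cluster address
local_external = Peer(2, master_host, master_host)   # local worker recorded by its external ip

# Local workers first: the later SGE worker dials 127.0.0.1 and lands on its own node.
println(reachable(sge_host, local_loopback))
# SGE workers first: the later local worker dials the SGE node's real address.
println(reachable(master_host, sge_peer))
# With the external ip stored for local workers, either order would connect in this model.
println(reachable(sge_host, local_external))
```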

nlhepler · 2014-03-24T06:20:08Z

Thanks for fixing this upstream, Amit! I take it the issue is resolved now?

amitmurthy · 2014-03-24T07:47:59Z

The fix upstream has not yet been merged. But this can be closed here since it is not an issue with ClusterManagers.jl per se.

bjarthur · 2014-04-17T12:58:13Z

fixed here JuliaLang/julia#6030

amitmurthy mentioned this issue Mar 1, 2014

Store external ip in case of localhost addprocs JuliaLang/julia#5995

Closed

bjarthur mentioned this issue Apr 3, 2014

stdout not redirected to repl #10

Closed

bjarthur closed this as completed Apr 17, 2014

bjarthur mentioned this issue Jan 17, 2017

multiple addprocs is order dependent JuliaLang/julia#20011

Closed
