addprocs(n); addprocs(m, cman=SGEManager()) fails, while the reverse works fine #6

Closed
bjarthur opened this issue Feb 25, 2014 · 7 comments

Comments

@bjarthur
Collaborator

if i add local workers before adding remote SGE workers, then the SGE workers terminate with an ECONNREFUSED error. if i reverse the order, and add the SGE workers before the local workers, then all is good. i presume this is not the desired behavior. let me know if there is any way i can help debug. sample output and versioninfo below.

[arthurb@h01u14 ~]$ juliac
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease (2014-02-24 14:04 UTC)
 _/ |\__'_|_|_|\__'_|  |  master/457bca9* (fork: -1 commits, 126 days)
|__/                   |  x86_64-redhat-linux

julia> addprocs(16)
16-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365131, waiting for job to start ..............................
10-element Array{Any,1}:
18
19
20
21
22
23
24
25
26
27

julia> Worker 19 terminated.
Worker 20 terminated.
Worker 21 terminated.
Worker 22 terminated.
Worker 18 terminated.
Worker 25 terminated.
Worker 24 terminated.
Worker 23 terminated.
Worker 27 terminated.
Worker 26 terminated.
From worker 18: fatal error on 18: ERROR: connect: connection refused (ECONNREFUSED)
From worker 18: in wait_connected at stream.jl:265
From worker 18: in connect at stream.jl:871
From worker 18: in Worker at multi.jl:119
From worker 18: in anonymous at task.jl:866
From worker 19: fatal error on 19: ERROR: connect: connection refused (ECONNREFUSED)
From worker 19: in wait_connected at stream.jl:265
From worker 19: in connect at stream.jl:871
From worker 19: in Worker at multi.jl:119
From worker 19: in anonymous at task.jl:866
From worker 20: fatal error on 20: ERROR: connect: connection refused (ECONNREFUSED)
From worker 20: in wait_connected at stream.jl:265
From worker 20: in connect at stream.jl:871
From worker 20: in Worker at multi.jl:119
From worker 20: in anonymous at task.jl:866
From worker 21: fatal error on 21: ERROR: connect: connection refused (ECONNREFUSED)
From worker 21: in wait_connected at stream.jl:265
From worker 21: in connect at stream.jl:871
From worker 21: in Worker at multi.jl:119
From worker 21: in anonymous at task.jl:866
From worker 22: fatal error on 22: ERROR: connect: connection refused (ECONNREFUSED)
From worker 22: in wait_connected at stream.jl:265
From worker 22: in connect at stream.jl:871
From worker 22: in Worker at multi.jl:119
From worker 22: in anonymous at task.jl:866
From worker 23: fatal error on 23: ERROR: connect: connection refused (ECONNREFUSED)
From worker 23: in wait_connected at stream.jl:265
From worker 23: in connect at stream.jl:871
From worker 23: in Worker at multi.jl:119
From worker 23: in anonymous at task.jl:866
From worker 24: fatal error on 24: ERROR: connect: connection refused (ECONNREFUSED)
From worker 24: in wait_connected at stream.jl:265
From worker 24: in connect at stream.jl:871
From worker 24: in Worker at multi.jl:119
From worker 24: in anonymous at task.jl:866
From worker 25: fatal error on 25: ERROR: connect: connection refused (ECONNREFUSED)
From worker 25: in wait_connected at stream.jl:265
From worker 25: in connect at stream.jl:871
From worker 25: in Worker at multi.jl:119
From worker 25: in anonymous at task.jl:866
From worker 26: fatal error on 26: ERROR: connect: connection refused (ECONNREFUSED)
From worker 26: in wait_connected at stream.jl:265
From worker 26: in connect at stream.jl:871
From worker 26: in Worker at multi.jl:119
From worker 26: in anonymous at task.jl:866
From worker 27: fatal error on 27: ERROR: connect: connection refused (ECONNREFUSED)
From worker 27: in wait_connected at stream.jl:265
From worker 27: in connect at stream.jl:871
From worker 27: in Worker at multi.jl:119
From worker 27: in anonymous at task.jl:866

julia>
[arthurb@h01u14 ~]$ juliac
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease (2014-02-24 14:04 UTC)
 _/ |\__'_|_|_|\__'_|  |  master/457bca9* (fork: -1 commits, 126 days)
|__/                   |  x86_64-redhat-linux

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365134, waiting for job to start ..............................
10-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11

julia> addprocs(16)
16-element Array{Any,1}:
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

julia> versioninfo()
Julia Version 0.3.0-prerelease
Commit 457bca9* (2014-02-24 14:04 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
7 required packages:

  • ClusterManagers 0.0.1
  • DSP 0.0.1
  • Debug 0.0.0
  • Devectorize 0.2.1
  • Distributions 0.4.0
  • MAT 0.2.2
  • WAV 0.2.2
9 additional packages:
  • ArrayViews 0.4.1
  • BinDeps 0.2.12
  • HDF5 0.2.17
  • NumericExtensions 0.5.4
  • PDMats 0.1.0
  • Polynomial 0.0.0
  • StatsBase 0.3.7
  • URIParser 0.0.1
  • Zlib 0.1.5

julia>

@amitmurthy
Contributor

I am not familiar with SGE, but I can take a guess at what is happening.

In Julia, all workers are connected to each other. The way this works is that after the main process (pid 1) launches a worker, the worker writes the ip:port it is listening on to its stdout. pid 1 connects to this address and then sends the new worker a list of host:port addresses (of existing workers) it in turn should connect to. The later workers always initiate a connection to the previously launched workers.
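The connection order described above can be sketched as follows. This is only an illustrative model in Python, not Julia's actual implementation: the function name `plan_connections` and the worker labels are made up for the example, and the separate connection that pid 1 opens to each worker is not modeled.

```python
def plan_connections(launch_order):
    """Return (initiator, target) pairs.

    Each worker dials every worker that was launched before it,
    mirroring the rule that later workers initiate the connections.
    """
    connections = []
    for i, newcomer in enumerate(launch_order):
        for earlier in launch_order[:i]:
            connections.append((newcomer, earlier))
    return connections


# Failing order from this issue: local workers first, then an SGE worker.
local_first = ["local-2", "local-3", "sge-4"]
# Working order: SGE worker first, then local workers.
sge_first = ["sge-2", "local-3", "local-4"]

for label, order in [("local first", local_first), ("sge first", sge_first)]:
    print(label, plan_connections(order))
```

In the failing order the SGE worker has to dial the local workers, so it depends on the addresses it was given for them; in the working order only the local workers dial out.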

What seems to be happening is that while workers on localhost can initiate connections to workers on SGE nodes, the reverse is not true, i.e., workers on SGE nodes are not being allowed to connect outside their local network.

Is this a configurable property of SGE?

@amitmurthy
Contributor

Or, more likely, a firewall on your localhost is not allowing incoming connections from SGE nodes.

@bjarthur
Collaborator Author

thanks amit. my main julia process (pid 1) is on the cluster (i ssh in and run julia interactively), as are all the workers. the sysadmin tells me that there is no firewall between nodes.

i'm testing the tcp connection between workers. after starting julia and adding remote workers netstat reports one established tcp connection for each, with a port number corresponding to what's in the julia-xxx.oxxx.x files. nc -z succeeds going to the worker, but fails if i ssh into the worker and test the socket in the reverse direction.

so my question: should i expect a second tcp socket for the incoming traffic, and the problem is that it is not there? or should this sole socket be bidirectional?

it might be relevant that each node in this cluster has two NICs, one facing out to the world, the other facing towards the rest of the nodes in the cluster. julia correctly finds the latter ip addr.

here is a transcript of my test session:

julia> using ClusterManagers

julia> addprocs(1, cman=SGEManager())
job id is 6447451, waiting for job to start ......................................................
1-element Array{Any,1}:
2

julia>
[1]+ Stopped /home/arthurb/src/juliac/julia
[arthurb@h01u14 ~]$ cat julia-18256.o6447451.1
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
julia_worker:9009#172.38.104.21
[arthurb@h01u14 ~]$ hostname --ip-address
172.38.101.24
[arthurb@h01u14 ~]$ netstat -an | grep 172.38.104.21
tcp 0 0 172.38.101.24:40886 172.38.104.21:9009 ESTABLISHED
[arthurb@h01u14 ~]$ nc -z 172.38.104.21 9009; echo $?
Connection to 172.38.104.21 9009 port [tcp/pichat] succeeded!
0
[arthurb@h01u14 ~]$ ssh 172.38.104.21
arthurb@172.38.104.21's password:
[arthurb@h04u11 ~]$ netstat -an | grep 172.38.101.24
tcp 0 0 172.38.104.21:22 172.38.101.24:33349 ESTABLISHED
tcp 0 0 172.38.104.21:9009 172.38.101.24:40886 ESTABLISHED
[arthurb@h04u11 ~]$ nc -z 172.38.101.24 40886; echo $?
1
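On the question above of whether the sole socket should be bidirectional: an established TCP connection carries data in both directions, and port 40886 is just the client-side (ephemeral) end of that connection, so nothing is listening on it and a fresh `nc -z` to it is expected to fail. The sketch below reproduces that on loopback in Python; it is a stand-in for the test session, not Julia code, and all variable names are illustrative.

```python
import socket

# Listener on an OS-chosen loopback port (stands in for the worker's port 9009).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

# Client side of the connection (stands in for the main process).
client = socket.create_connection(server.getsockname())
conn, peer = server.accept()
ephemeral_port = client.getsockname()[1]  # analogous to 40886 in the transcript

# The one established connection carries data in both directions.
client.sendall(b"ping")
request = conn.recv(4)
conn.sendall(b"pong")
reply = client.recv(4)

# A new connection aimed at the client's ephemeral port is refused,
# because no socket is listening there.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))  # give the probe its own distinct source port
try:
    probe.connect(("127.0.0.1", ephemeral_port))
    probe_refused = False
except ConnectionRefusedError:
    probe_refused = True
finally:
    probe.close()

print(request, reply, probe_refused)

for s in (client, conn, server):
    s.close()
```

So the failing reverse `nc -z` in the transcript does not by itself indicate a network problem between the nodes.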

@amitmurthy
Contributor

Thanks. For workers started by a local addprocs, the main process is storing the loopback address as their address, hence the problem. I have opened an issue here - JuliaLang/julia#5995 .
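To illustrate why a stored loopback address breaks things: a worker on another node that is told to connect to 127.0.0.1 dials its own machine, where nothing is listening on that port. One way to avoid this is to substitute a cluster-reachable address before handing the list to new workers. The Python sketch below shows the idea only; it is not the upstream patch, and `advertised_address`, its `cluster_ip` parameter, and the port number used for the local worker are hypothetical.

```python
import ipaddress


def advertised_address(stored_host, port, cluster_ip):
    """Return the (host, port) pair to hand to other workers.

    If the stored host is a loopback address, substitute cluster_ip
    (hypothetical parameter: this node's address on the cluster network).
    """
    if ipaddress.ip_address(stored_host).is_loopback:
        return (cluster_ip, port)
    return (stored_host, port)


# A local worker stored as loopback gets the node's cluster-facing address.
print(advertised_address("127.0.0.1", 9009, "172.38.101.24"))
# A worker that already has a cluster address is passed through unchanged.
print(advertised_address("172.38.104.21", 9009, "172.38.101.24"))
```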

@nlhepler
Contributor

Thanks for fixing this upstream, Amit! I take it the issue is resolved now?

@amitmurthy
Contributor

The fix upstream has not yet been merged. But this can be closed here since it is not an issue with ClusterManagers.jl per se.

@bjarthur
Collaborator Author

fixed here JuliaLang/julia#6030
