Starting parallel Julia and using it remotely on a big machine is slow #3655

Closed
ViralBShah opened this issue Jul 9, 2013 · 25 comments
Labels
domain:parallelism Parallel or distributed computation

Comments

@ViralBShah
Member

I often want to run in a configuration where my local Julia on my laptop serves as a client where I develop, visualize, etc., while remote big iron does the compute. Currently, if I start Julia locally and add 40 workers, all of them on the same big machine, it takes forever to ssh 40 times to that one machine.

It would be nice to have a way of saying how many cores to use on each remote node, and have startup leverage that information (a per-host count is sketched below).
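
For illustration, the count-per-host form that later versions of Julia's Distributed standard library expose addresses exactly this; the hostname and path below are placeholders:

```julia
using Distributed

# One (host, count) tuple per machine: ask for 40 workers on the remote
# big machine in a single addprocs call instead of listing it 40 times.
addprocs([("user@bigiron.example.com", 40)]; tunnel=true, dir="~/julia/usr/bin")

# Workers added this way can then be used with pmap, @distributed, etc.
```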

Currently, id 1 communicates with all the workers, sending work over potentially slow links in this model. Instead, if we could use one of the remote workers to relay messages to all the other workers, it would greatly cut the latencies (see the sketch below).

Currently, parallel Julia assumes that id 1 and all the other workers are effectively on the same network and that the cost of communication between everyone is the same. Moving to a model where id 1 runs locally and connects to compute resources over a relatively slow network is slightly different, but I believe it has better usability when working interactively with large problems, while retaining the ability to plot results locally.
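
A minimal sketch of the relay idea, written with current Distributed primitives; relay_broadcast, relay_id and peer_ids are illustrative names, not an existing API:

```julia
using Distributed

# Send `data` once over the slow client-to-cluster link to `relay_id`,
# and let that worker fan it out to its local peers over the fast
# intra-machine network.
function relay_broadcast(relay_id::Int, peer_ids::Vector{Int}, data)
    remotecall_fetch(relay_id, peer_ids, data) do peers, d
        @sync for p in peers
            # Stash the payload in a global on each peer; any per-worker
            # setup could go here instead.
            @async remotecall_wait(x -> (global payload = x; nothing), p, d)
        end
        nothing
    end
end
```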

@ViralBShah
Member Author

A completely different solution to this problem could be to start ipython-nb on the remote big iron machine and interact through the browser.

@amitmurthy
Contributor

How about:

  • A new startup parameter --port <port> which makes process id 1 start up, detach from the terminal, and listen on the given port. The REPL is not available in this mode.
  • A new API, clusterstart(host, port, args...) => ClusterHandle, which starts julia on the specified host with the above argument; args are other relevant startup arguments. It listens on port for client connections.
  • A cluster started this way is not tied to the client that starts it; it can also be started directly from the command line. id 1 still runs remotely as part of the cluster and is the "head" process of the cluster. The client is also process id 1, but there is no deep coupling with the cluster.
  • A new API, clusterconnect(host, port) => ClusterHandle, which connects to an already running set of Julia processes and returns a ClusterHandle.
  • Both clusterstart and clusterconnect set up a regular ssh connection to process id 1 of the remote set of processes.
  • Variants of remotecall_fetch, pmap and @parallel that take a ClusterHandle instead of process ids. These are blocking calls that send requests over the ssh connection and wait for a response.
  • You can either execute addprocs via remotecall_fetch(ClusterHandle, ...), or specify additional workers on the command line as part of args in clusterstart.
  • A new API, clusterstop(host, port) / clusterstop(ClusterHandle), to cleanly shut down the remote cluster. (A usage sketch follows the list.)
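
A hedged usage sketch of the proposed API from the laptop client; clusterstart, clusterconnect, clusterstop and ClusterHandle are the hypothetical names from the list above and do not exist in Julia:

```julia
# Start a headless cluster head on the big machine, listening on port 9009,
# with 40 workers (all arguments are illustrative).
h = clusterstart("bigiron.example.com", 9009, "-p 40")

# ...or attach to a cluster that is already running:
# h = clusterconnect("bigiron.example.com", 9009)

# Blocking, ClusterHandle-based variants of the usual primitives:
remotecall_fetch(h, nworkers)                   # query the remote head
pmap(h, n -> sum(rand(n, n)), [500, 1000, 2000])

clusterstop(h)                                  # shut the cluster down cleanly
```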

@ViralBShah
Member Author

In this arrangement, does the client have an id? It seems that the cluster will have process IDs 1:np here. Does that mean that myid() will be 1 on both the client and the cluster head?

@amitmurthy
Contributor

Yes - but they are independent instances. The client is only used to send commands over the ssh connection to the remote cluster and get the results back, and only via the ClusterHandle versions of remotecall_fetch, pmap and @parallel.

The usual async requests (using RemoteRefs) will not work. Think of the client as a Julia shell into the remote cluster.

We could also have a mode where, after executing a command, say clustershell(ClusterHandle), all commands typed at the local prompt are shipped and executed remotely in a synchronous manner (sketched below). In this shell mode, however, returned data cannot be bound to local variables.
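
A minimal sketch of what such a clustershell mode could look like; ClusterHandle and the handle-taking remotecall_fetch are assumptions carried over from the proposal above:

```julia
# Hypothetical: read lines at the local prompt, evaluate each one on the
# remote cluster head, and print the result locally. Nothing is bound to
# local variables, matching the restriction described above.
function clustershell(h::ClusterHandle)
    while true
        print("cluster> ")
        line = readline()
        line in ("exit", "quit") && break
        expr = Meta.parse(line)
        println(remotecall_fetch(h, Core.eval, Main, expr))
    end
end
```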

@JeffBezanson
Sponsor Member

A remote REPL seems like a good solution to this. I believe Keno's new REPL supports this.

@ViralBShah
Member Author

@loladiro can you weigh in?

@Keno
Member

Keno commented Jul 9, 2013

Yes, I have the code to do this. I should really just finish that REPL package.

@amitmurthy
Contributor

@loladiro, is it in a state to play around with?

It would be great if it had a mechanism where the local client is used for visualization (plots, graphs, etc.) while the remote is used for computation.

@JeffBezanson
Sponsor Member

If you look carefully, before commit 17b3986 I executed all the readsfrom commands for the workers first, before doing parse_connection_info on any of them. That allowed the processes to start in parallel instead of waiting for one after another (see the sketch below). Today Alan also noticed that addprocs (both local and remote) is much slower than it used to be. Please fix this.
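
The point is an ordering issue: launch every worker process first, and only then block reading each one's connection info, so the ssh/startup phases overlap instead of being paid serially. A generic, runnable sketch of that pattern, with dummy `sleep 1` processes standing in for the workers:

```julia
# Phase 1: spawn all processes without blocking on any of them.
cmds  = [`sh -c "sleep 1; echo worker $i ready"` for i in 1:5]
procs = [open(cmd, "r") for cmd in cmds]

# Phase 2: only now read each process's output (its "connection info").
info = [readline(p) for p in procs]
println(info)   # total wall time is ~1 s, not ~5 s, because startup overlapped
```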


@amitmurthy
Contributor

@JeffBezanson, sorry about that - fixed in f5bfeb5.

@Keno
Member

Keno commented Jul 10, 2013

@amitmurthy I'm not sure I ever pushed that code. The foundations are there, but it will still require some work to complete. Give me a day or two.

What I have working so far is at https://github.com/loladiro/REPL.jl. It's already separated into frontend and backend; I just need to merge the changes that allow the backend to be on another host.

@Keno
Member

Keno commented Jul 11, 2013

@amitmurthy Surprisingly, it works just fine (I had never tested it), but there are probably issues. Using the latest master of REPL.jl, Terminals.jl and Readline.jl:

Kenos-MacBook-Pro:julia keno$ ./julia -p 2 repltest2.jl
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> x,y = RemoteRef(2),RemoteRef(2)
(RemoteRef(2,1,10),RemoteRef(2,1,11))

julia> @spawnat 2 begin
              REPL.start_repl_backend(x,y)
              end
RemoteRef(2,1,12)

julia> REPL.StreamREPL_frontend(StreamREPL(STDIN,julia_green,Base.text_colors[:white],Base.answer_color()),x,y)
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> myid()
2

@Keno
Member

Keno commented Jul 11, 2013

I simplified the API a bit:

Kenos-MacBook-Pro:julia keno$ ./julia -p 2 repltest2.jl
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> x,y = RemoteRef(2),RemoteRef(2)
(RemoteRef(2,1,10),RemoteRef(2,1,11))

julia> @spawnat 2 begin
                     REPL.start_repl_backend(x,y)
                     end
RemoteRef(2,1,12)

julia> t = Terminals.Unix.UnixTerminal("xterm",STDIN,STDOUT,STDERR)
UnixTerminal("xterm",TTY(connected,0 bytes waiting),TTY(connected,0 bytes waiting),TTY(connected,0 bytes waiting))

julia> REPL.run_frontend(ReadlineREPL(t),x,y)
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> myid()
2

julia> run(`whoami`)
    From worker 2:  keno

julia> versioninfo()
    From worker 2:  Julia Version 0.2.0-2523.ra5120711.dirty
    From worker 2:  Commit a5120711e2 2013-07-10 16:33:42*
    From worker 2:  Platform Info:
    From worker 2:    System: Darwin (x86_64-apple-darwin11.4.2)
    From worker 2:    WORD_SIZE: 64
    From worker 2:    BLAS: libopenblas (USE64BITINT NO_AFFINITY)
    From worker 2:    LAPACK: libopenblas
    From worker 2:    LIBM: libopenlibm

julia> #This was a ^D, closing the worker connection

julia> myid()
1

This might not work with the libreadline-based REPL since that messes up the terminal. My repltest2.jl file is:

# repltest2.jl: run the new REPL on a UnixTerminal wrapping the local streams.
using Terminals
using Readline
using REPL
REPL.run_repl(Terminals.Unix.UnixTerminal("xterm",STDIN,STDOUT,STDERR))

@Keno
Member

Keno commented Jul 11, 2013

On an unrelated note, I seem to have problems connecting to a remote machine:

julia> addprocs(["kfischer@julia.mit.edu"];tunnel=true,dir="~/julia/usr/bin")
1-element Any Array:
 2

julia> fetch(@spawnat 2 2)
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Worker 2 terminated.

@amitmurthy
Contributor

I just tried it, both on localhost and on an EC2 instance, and it went through for me:

julia> ec2_addprocs(instances, "/home/amitm/keys/jublr.pem", dir="/usr/bin")
Warning: Permanently added 'ec2-54-227-73-154.compute-1.amazonaws.com,54.227.73.154' (ECDSA) to the list of known hosts.
1-element Any Array:
 2

julia> remotecall_fetch(2, myid)
2

and


julia> addprocs(["amitm@localhost"];tunnel=true)
1-element Any Array:
 2

julia> remotecall_fetch(2, myid)
2

julia> fetch(@spawnat 2 2)
2

Are you on the latest julia?

Also, did the worker terminate immediately? Was it similar to #3663?

@Keno
Member

Keno commented Jul 11, 2013

Yes, this is on the latest master. The client is OS X while the server is Linux, though; that might have something to do with it. The worker did not terminate immediately - it waited the full 60 seconds.

@amitmurthy
Contributor

Both are 64-bit, I presume?

@Keno
Member

Keno commented Jul 11, 2013

Yes

@amitmurthy
Contributor

Give me a couple of hours. Will get access to a Mac and figure this out.

@Keno
Member

Keno commented Jul 11, 2013

Sounds good! I was gonna go to sleep anyway.

@amitmurthy
Contributor

Just tested between a Mac and my Linux laptop and it went through.

After the addprocs, can you check whether a) there is a local connection to the tunnel port and b) the tunnel ssh process is running? (Quick checks sketched below.)
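
For example, from the client's Julia session; the local tunnel port (9000 here) is a placeholder for whichever port the tunnel was set up on:

```julia
# a) Is anything connected to the local tunnel port?
run(`lsof -nP -i TCP:9000`)

# b) Is the forwarding ssh process still alive?
run(`pgrep -fl "ssh -f -o ExitOnForwardFailure=yes"`)
```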

@amitmurthy
Contributor

On julia.mit.edu, the hostname julia.mit.edu does not resolve to its correct address, 18.4.43.32.

I'll also change addprocs to use the hostname printed by the worker in the -L argument of the tunnel command, i.e., change

ssh -f -o ExitOnForwardFailure=yes $sshflags $(user)@$host -L $localp:$host:$(int(port)) -N

to

ssh -f -o ExitOnForwardFailure=yes $sshflags $(user)@$host -L $localp:$private_hostname:$(int(port)) -N
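
A hedged Julia sketch of the changed tunnel-command construction; the variable names follow the snippet above and this is illustrative, not the actual code in Base:

```julia
# Forward a local port to the hostname the worker itself reported
# (private_hostname) rather than the name the caller used (host).
function tunnel_cmd(user, host, private_hostname, port, localp, sshflags)
    `ssh -f -o ExitOnForwardFailure=yes $sshflags $(user)@$(host) -L $(localp):$(private_hostname):$(Int(port)) -N`
end
```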

@ViralBShah
Member Author

The big change here is that all the ssh sessions now start in parallel. We still need a way to specify something like: use n CPUs on machine i, and so on.

@ViralBShah
Member Author

@amitm Does #11665 help with this? I think a RemoteREPL, and ideally a master-free cluster, is what we need, and that is different from what is addressed by the topology work.

@vtjnash closed this as completed Aug 4, 2016
KristofferC pushed a commit that referenced this issue Oct 17, 2023