Starting parallel Julia and using it remotely on a big machine is slow #3655
A completely different solution to this problem could be to start ipython-nb on the remote big iron machine and interact through the browser.
How about:
In this arrangement, does the client have an id? It seems that the cluster will have processor IDs 1:np here. Does that mean that
Yes - but they are independent instances. The client is only used to send commands over the ssh connection to the remote cluster and get the results - only in the context of . The usual async requests (using RemoteRefs) will not work. Think of the client as a julia shell into the remote cluster. We can also have a mode where, after executing a command, say
A remote REPL seems like a good solution to this. I believe Keno's new REPL
@loladiro can you weigh in?
Yes, I have the code to do this. I should really just finish that REPL package.
@loladiro, is it in a state to play around with? It would be great if it had a mechanism where the local client is used for visualization (plots, graphs, etc.) while the remote is used for computation.
If you look carefully, before commit 17b3986 I executed all the
@JeffBezanson, sorry about that - fixed in f5bfeb5.
@amitmurthy I'm not sure I ever pushed that code. The foundations are there, but it will still require some work to complete. Give me a day or two. What I have working so far is at https://github.com/loladiro/REPL.jl. It's already separated into frontend and backend; I just need to merge the changes that allow the backend to be on another host.
@amitmurthy It surprisingly works just fine (I had never tested it), but there are probably issues. Using the latest master of REPL.jl, Terminals.jl, and Readline.jl:

```
Kenos-MacBook-Pro:julia keno$ ./julia -p 2 repltest2.jl
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> x,y = RemoteRef(2),RemoteRef(2)
(RemoteRef(2,1,10),RemoteRef(2,1,11))

julia> @spawnat 2 begin
           REPL.start_repl_backend(x,y)
       end
RemoteRef(2,1,12)

julia> REPL.StreamREPL_frontend(StreamREPL(STDIN,julia_green,Base.text_colors[:white],Base.answer_color()),x,y)
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> myid()
2
```
I simplified the API a bit:

```
Kenos-MacBook-Pro:julia keno$ ./julia -p 2 repltest2.jl
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> x,y = RemoteRef(2),RemoteRef(2)
(RemoteRef(2,1,10),RemoteRef(2,1,11))

julia> @spawnat 2 begin
           REPL.start_repl_backend(x,y)
       end
RemoteRef(2,1,12)

julia> t = Terminals.Unix.UnixTerminal("xterm",STDIN,STDOUT,STDERR)
UnixTerminal("xterm",TTY(connected,0 bytes waiting),TTY(connected,0 bytes waiting),TTY(connected,0 bytes waiting))

julia> REPL.run_frontend(ReadlineREPL(t),x,y)
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2523.ra5120711.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit a5120711e2 2013-07-10 16:33:42*
|__/                   |  x86_64-apple-darwin11.4.2

julia> myid()
2

julia> run(`whoami`)
        From worker 2:  keno

julia> versioninfo()
        From worker 2:  Julia Version 0.2.0-2523.ra5120711.dirty
        From worker 2:  Commit a5120711e2 2013-07-10 16:33:42*
        From worker 2:  Platform Info:
        From worker 2:    System: Darwin (x86_64-apple-darwin11.4.2)
        From worker 2:    WORD_SIZE: 64
        From worker 2:    BLAS: libopenblas (USE64BITINT NO_AFFINITY)
        From worker 2:    LAPACK: libopenblas
        From worker 2:    LIBM: libopenlibm

julia> # This was a ^D, closing the worker connection

julia> myid()
1
```

This might not work with the libreadline-based REPL since that messes up the terminal. My script:

```julia
using Terminals
using Readline
using REPL
REPL.run_repl(Terminals.Unix.UnixTerminal("xterm",STDIN,STDOUT,STDERR))
```
On an unrelated note, I seem to have problems connecting to a remote machine:
I just tried it both on localhost and an EC2 instance and it went through for me. Are you on the latest julia? Also, did the worker terminate immediately? Was it similar to #3663?
Yes, this is on the latest master. The client is OS X while the server is Linux, though; might have something to do with that. The worker did not terminate immediately - it waited the full 60 seconds.
Both are 64-bit, I presume.
Yes.
Give me a couple of hours. I'll get access to a Mac and figure this out.
Sounds good! I was gonna go to sleep anyway.
Just tested between a Mac and my Linux laptop and it went through. After the
On julia.mit.edu, the name julia.mit.edu does not resolve to its correct address 18.4.43.32. I'll also change addprocs to use the hostname printed by the worker in the
The big change here is that we now have all the ssh sessions starting in parallel. We still need a way to specify something like: use
I often want to run in a configuration where my local julia on my laptop serves as a client where I develop, visualize, etc. and I want to use remote big iron for compute. Currently, if I try to start julia locally, and add 40 workers all of which are on the same big machine, it takes forever to ssh 40 times to the same machine.
It would be nice if we could have a concept of saying how many cores to use on each remote node, and startup could leverage that information.
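A sketch of what that could look like, using the `(host, count)` tuple form of `addprocs` that later versions of Julia's Distributed library provide; the hostname and directory here are made-up placeholders:

```julia
using Distributed

# One tuple per node: request 40 workers on the big machine at once,
# instead of the client issuing 40 separate ssh connections.
# "bigiron.example.com" and the dir path are illustrative only.
addprocs([("bigiron.example.com", 40)]; tunnel=true, dir="/home/me/julia")

# A count of :auto asks for one worker per core on that node:
# addprocs([("bigiron.example.com", :auto)])
```

With `tunnel=true`, worker connections are routed back through the ssh session, which also helps when the big machine sits behind a firewall.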
Currently, id 1 communicates with all the workers, sending work over potentially slow links in this model. Instead, if we could use one of the remote workers as a relay that repeats the message to all the other workers, it would greatly cut the latencies.
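One hedged sketch of such a relay: process 1 ships the payload across the slow link once, to a "gateway" worker on the remote machine, and the gateway fans it out over the fast local interconnect. The function name and the global `payload` variable are inventions for illustration, not an existing API:

```julia
using Distributed

# Hypothetical relay: send `data` over the slow link exactly once,
# to `gateway`; the gateway then redistributes it to its local peers.
function broadcast_via_gateway(gateway::Int, peers::Vector{Int}, data)
    remotecall_fetch(gateway, peers, data) do ps, d
        # This closure runs on the gateway worker, so the remotecalls
        # below travel over the remote cluster's fast internal network.
        @sync for p in ps
            @async remotecall_wait(x -> (global payload = x; nothing), p, d)
        end
        nothing
    end
end

# Usage (assuming worker 2 is on the big machine alongside workers 3:41):
# broadcast_via_gateway(2, collect(3:41), big_array)
```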
Currently, parallel julia assumes that id 1 and all other workers are effectively on the same network, and the cost of communication between everyone is the same. Moving to a model, where id 1 can run locally and connect to compute resources over a relatively slow network is a slightly different model, but one that I believe has better usability when working interactively with large problems, while retaining the ability to plot results, etc.
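Later versions of Julia expose part of this model through the `topology` keyword of `addprocs`; a brief sketch, with the hostname again illustrative:

```julia
using Distributed

# With topology=:master_worker only process 1 talks to the workers, so
# the workers never need all-to-all connectivity across the slow link.
addprocs([("bigiron.example.com", 40)]; topology=:master_worker, tunnel=true)
```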