Cross platform issues with Remote Workers / SSH Cluster Manager / Native Dependencies #22

Open
habemus-papadum opened this issue Mar 5, 2015 · 7 comments

Comments

@habemus-papadum

Hi -- I'm running the head node (i.e. procid == 1) on a Mac with v0.3.6, and I am trying to use Linux-based workers via SSHClusterManager.

I experience problems with e.g. using HDF5 --- the basic cause seems to be that

  • workers delegate include to node 1 (include == include_from_node1)
  • HDF5 (and many others) use BinDeps
  • BinDeps creates a custom deps.jl which is platform (and presumably even box) specific

so my linux boxes complain when they cannot locate the mac dylib

I've thought a bit about how to resolve this, but nothing obvious and elegant pops to mind. (For now I've just hacked my deps.jl on my mac to support both OS X and linux)

Have others seen this kind of issue? Is there some simple way to have the workers not pull code from node1 but simply rely on the locally installed packages?

I was thinking of hacking include_from_node1 in the .juliarc.jl on the linux boxes to simply not pull code from node1, but that seems a bit drastic -- any thoughts about whether this would work?

As an aside, while I can understand the motivation for having include, using, etc. work by delegating to node1 (e.g., to simplify code distribution), it does seem a bit difficult to do robustly or in a way that will scale nicely to dozens or hundreds of workers....

thanks

@JeffBezanson
Sponsor Member

I believe it is possible to run all nodes, including node 1, remotely, and connect your local REPL to the remote node, thereby avoiding mixed-platform issues. @Keno are there instructions on how to do this?

@habemus-papadum
Author

Hi,

Thanks, I actually saw the thread where the "repl into a remote" stuff was first done (Very cool!). I believe the link is: JuliaLang/julia#3655

There is also just starting node 1 via ssh/tmux, or running a remote IJulia....

For my particular workflow, I want to use the remote boxes to condense and summarize a large amount of distributed data and then deliver it to my local box for more detailed analysis, keeping the entire flow as interactive as possible and going back and forth many times. Ultimately, the point is to flexibly get interesting subsets of data from one site to another, which makes a purely remote solution not so good.

I managed to hack things enough to get it to work for now. I don't endorse what follows, but just so it's documented in case others come across this issue:

  • To get HDF5 to work correctly I had to hack the deps.jl on my mac to something like:
    # Load dependencies
    @osx_only begin
      @checked_lib libhdf5 "/usr/local/lib/libhdf5.dylib"
    end

    @linux_only begin
      @checked_lib libhdf5 "/usr/lib64/libhdf5.so.7"
    end

(The exact details will depend on what version of hdf5 you have installed and so forth)

  • On the linux boxes, I also had to create a /Users dir and a symlink to my linux home dir (e.g. /Users/lilinjn -> /home/lilinjn) so that absolute paths from my Mac would map to the correct place on the linux boxes (I've forgotten exactly why this was needed, but it had something to do with the BinDeps logic)
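
For readers on a newer Julia, here is a minimal sketch of the same idea as the deps.jl hack in the first bullet, assuming Julia 0.7 or later, where `Sys.isapple` and `Sys.islinux` are available. The function name `hdf5_library_path` is hypothetical, and the paths are simply the ones quoted above; this is not what BinDeps generates, just an illustration of choosing the library per platform at load time:

```julia
# Hypothetical helper: pick a libhdf5 path based on the platform of the
# process that evaluates it. The paths are the ones from the hack above
# and will differ depending on how HDF5 was installed.
function hdf5_library_path(kernel::Symbol = Sys.KERNEL)
    if Sys.isapple(kernel)
        return "/usr/local/lib/libhdf5.dylib"
    elseif Sys.islinux(kernel)
        return "/usr/lib64/libhdf5.so.7"
    else
        error("no libhdf5 path configured for platform $kernel")
    end
end

# With no argument, the check uses the platform of the calling process,
# so a Mac head node and a Linux worker would each get their own path.
println(hdf5_library_path(:Darwin))
println(hdf5_library_path(:Linux))
```

The `kernel` argument is there only so that both branches can be exercised from a single machine; in a deps.jl-style file one would call it with no argument.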

So that is quite horrible and will likely crumble with every little change. On the flip side, I was able to drive 60 workers on 7 boxes from my mac, and everything seemed to work amazingly well in terms of connection times, throughput, and so forth. So, for what it's worth, I'm a happy customer!

In case anyone is interested, it turns out remote workers slurp .juliarc.jl from node1, which has the potential for many odd issues...

thanks!

@rened
Member

rened commented Apr 14, 2015

I tried to work around this in JuliaPackaging/BinDeps.jl#130 but have not (yet) finished revising it in response to the review comments. I'd also be interested in making the cross-platform experience as seamless as possible - I'll try to continue with that PR as soon as I can.

@ViralBShah
Member

Cc @amitmurthy

@habemus-papadum
Author

I've been driving linux boxes from a mac for a few weeks now, and despite my ridiculous hacks it's been extremely useful.

My two cents is that rather than adjusting BinDeps and other packages to work around these issues, it might be better/cleaner to be able to launch workers with a command line switch that has them simply load julia code from their local drive rather than slurping from node1 -- I'm already rsync'ing datasets and non-julia code to various nodes so there is not much convenience gained on my end by the current behavior.

@kshyatt
Contributor

kshyatt commented Jan 25, 2017

Is this still a pain to get working?

@tkelman
Contributor

tkelman commented Jan 26, 2017

yes. comes up on discourse every few weeks.

@vtjnash vtjnash transferred this issue from JuliaLang/julia Feb 10, 2024