getipaddr and getprivipaddr disagree, causing problems with SharedArrays #6171

Closed
carlobaldassi opened this issue Mar 14, 2014 · 7 comments
Labels
domain:parallelism Parallel or distributed computation kind:bug Indicates an unexpected problem or unintended behavior

Comments

@carlobaldassi
Member

While testing out SharedArrays, I stumbled upon the fact that simple things like A * x where A is a SharedArray produce an "undefined reference" error.

The reason is that such expressions call similar, which calls the constructor, which in turn calls assert_same_host. There, the addresses of the pids are first checked via getprivipaddr and then compared to the result of getipaddr. It happens that these two disagree on the same host, so the new SharedArray is not recognized as belonging to the current host and its data becomes inaccessible:

julia> A = SharedArray(Float64, 3)
3-element SharedArray{Float64,1}:
 0.0
 0.0
 0.0

julia> similar(A)
3-element SharedArray{Float64,1}:
 #undef
 #undef
 #undef

julia> sdata(similar(A))
ERROR: access to undefined reference
 in sdata at sharedarray.jl:117

A quick fix is to use getprivipaddr(myid()) instead of getipaddr() inside assert_same_host; however, I'm wondering whether those two functions should actually always produce the same result and need to be fixed.

@carlobaldassi
Member Author

I should have specified: "disagree" → "may disagree". I just restarted Julia and they now agree, but this is what I was getting before:

julia> getipaddr()
ip"192.168.1.135"

julia> Base.getprivipaddr(myid())
ip"192.168.0.105"

@ihnorton
Member

see also #5995 and #5945

@ViralBShah
Member

Cc: @amitmurthy

@amitmurthy
Contributor

In your case, I am guessing that the IP address changed sometime after Julia was started (a laptop sleep/wake-up cycle), and hence the cached IP address returned by getprivipaddr did not match getipaddr().

#6030 does away with getprivipaddr and uses the --bind-to exe argument to implement a consistent value of the ip address bound to on all local processes. That should remove any issues caused by getprivipaddr and getipaddr returning different values on the same host.

However, I still cannot explain the "access to undefined reference", since assert_same_host effectively just throws an error if there is a mismatch. Is it reproducible? Note that both SharedArray(Float64, 3) and similar(A) produce an uninitialized SharedArray. In the former case you can pass an init function to the constructor or initialize the array after creation. In the latter, you can only initialize it after creation.

@carlobaldassi
Member Author

Yes, it's very likely that the issue showed up after I suspended and resumed my laptop. The behavior of assert_same_host is easily explained: it throws an error if the workers are not all on the same host (but they are all checked via getprivipaddr, so they all give the same result, at least if they were created at the same time, and the check passes); it then returns true or false depending on whether that address also matches that of the current process (but this one is fetched via getipaddr, which can give a different result). The problem is partially fixed by performing the last check with getprivipaddr as well, with myid() as an argument, but it would likely still show up if addprocs is used and new workers report a different getprivipaddr despite actually being on the same host. The same goes for another possible, slightly better quick fix, namely just checking whether myid() is in procs to determine the return value.
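
To make the mismatch concrete, here is a toy model of the check as described in this comment. It is only a sketch and not the actual Base code: CACHED_ADDR, cached_addr, current_addr, same_host_buggy and same_host_fixed are hypothetical names, the addresses are the ones reported earlier in this thread, and the assumption is that getprivipaddr returns a value cached at start-up while getipaddr queries the current address.

```julia
# Toy model only: every name below is a hypothetical stand-in, not Base API.

# Address cached per process id at start-up (modelling getprivipaddr(pid)).
const CACHED_ADDR = Dict(1 => "192.168.0.105",
                         2 => "192.168.0.105",
                         3 => "192.168.0.105")
cached_addr(pid) = CACHED_ADDR[pid]

# Address reported right now (modelling getipaddr()), e.g. after the laptop
# was suspended and resumed on a different network.
current_addr() = "192.168.1.135"

# Variant described in this comment: the workers are compared with each other
# through the cache, but the local process is compared via a fresh lookup.
function same_host_buggy(pids)
    addr = cached_addr(first(pids))
    all(p -> cached_addr(p) == addr, pids) ||
        error("all workers must be on the same host")
    return addr == current_addr()
end

# Quick fix: use the cached value for the local process as well.
function same_host_fixed(pids, me)
    addr = cached_addr(first(pids))
    all(p -> cached_addr(p) == addr, pids) ||
        error("all workers must be on the same host")
    return addr == cached_addr(me)
end

pids = [1, 2, 3]
println(same_host_buggy(pids))     # false: array treated as not local
println(same_host_fixed(pids, 1))  # true
```

In this model the worker-to-worker check passes in both variants, and only the final comparison differs, which matches the observation that no error is thrown while the array's data ends up inaccessible.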

@amitmurthy
Contributor

OK. I'll submit a PR that tests localhost workers differently, i.e., without using IP addresses. This would be the typical case for shared arrays. IP address checks will be done only for non-localhost workers, which would not survive a suspend/resume cycle anyway.

@carlobaldassi
Member Author

Wait, I have actually managed to reproduce the situation in which I get different results from getprivipaddr and getipaddr. It turns out that even when using addprocs, the new workers still use the previously cached value for getprivipaddr. So the quick fix I proposed before actually works, and we don't need to wait for #6030. I'll push the fix and close this issue.
