-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
Make Distributed.jl Worker struct thread-safe.
#38405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3b00732 to
95e6be5
Compare
|
I think #38487 may make this feasible now :) |
|
I hope so as well. |
95e6be5 to
810479f
Compare
|
@vchuravy bump 🙂 Could you please rebase this to give CI another go? |
810479f to
20fd68a
Compare
20fd68a to
300ed29
Compare
|
@JeffBezanson if you have the time I would appreciate a review. |
|
Seen failure: |
e287246 to
20a514f
Compare
|
I hope you don't mind Valentin: I took the liberty of rebasing this branch over master. (On 1.7 we consistently hit issues that this branch should address, so we would love to see it over the line. Let me know if there is anything I can do to lend a hand :).) |
|
Interestingly this branch does not fully resolve the issues we hit, but it does cause our test suite to fail much earlier/harder and with a better stacktrace 🎉 (as opposed to, without this branch, throwing a more opaque trace in the background and then stalling till timeout). |
|
Happy to hand this off :) let me know if you want to chat about what needs doing (Happy to have a co-working session on this). I suspect the least will be making the ClusterSerializer threadsafe as well. |
stdlib/Distributed/src/remotecall.jl
Outdated
| notify(any_gc_flag) | ||
| msg = (remoteref_id(rr), myid()) | ||
| # We cannot acquire locks from finalizers | ||
| @async begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
The MacOS tester hit: That's an assertion in |
5a64315 to
583004b
Compare
Makes the worker struct threadsafe as well as flushing the GC messages
583004b to
0c073cc
Compare
|
@JeffBezanson I removed unnecessary changes like actually using |
|
I am gonna land this, but we should revert if we see any increased flakiness in CI. |
|
|
||
| for wl in wlist | ||
| (wl.state === W_CREATED) && wait(wl.c_state) | ||
| if wl.state === W_CREATED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah merde...
(cherry picked from commit 740a33a)
…stributed_ts" (JuliaLang/julia#41722) Also reverts "fixup to pull request JuliaLang/julia#38405 (JuliaLang/julia#41641)" Seems to be causing hanging in CI testing. This reverts commit 740a33a and this reverts commit 5a1680533b461471f537f5f5d4ee3c2866b21e1f, reversing changes made to 02807b279a5e6d5acaeb7095e4c0527e2a5c190e. (cherry picked from commit bbc9429)
@jpsamaroo and I want to use Distributed in the presence of threading for Dagger. So this is the follow-up to #38134
@JeffBezanson you initially suggested that the
Conditionis replaced by anEvent, but the code seems to want to support multiple state transitions.wait_on_connre-usesc_statefor it's time-out and it not clear to me that timing out there implies timing out other waiters.Lastly we looked at the gc messages for remote references and made a first attempt at making them safe under multi-threading. Since we can't take locks in a finalizer this implied spawning a task.
cc: @jonas-schulze