Moar threadsafe moar better #101

Open: wants to merge 7 commits into base: master

JamesWrigley

This is a rebased version of #4; it should be ready to merge. Fixes #73.

(made after discussing with @jpsamaroo)

CC @vchuravy, @vtjnash

@JamesWrigley
Author

I tracked the test failures down to: JuliaLang/julia#53326
The workers load the built-in version of Distributed from the sysimg, while the master uses the development version. Because this PR changes the definition of the Worker struct, the master and the workers end up with different types, so a remotecall to another worker fails with a serialization error.

I think this should be fixed in Base; I made a PR here: JuliaLang/julia#54571
That should be merged before this one.

codecov bot commented May 25, 2024

Codecov Report

Attention: Patch coverage is 86.56716% with 9 lines in your changes missing coverage. Please review.

Project coverage is 79.30%. Comparing base (6a0383b) to head (abafb79).

Files                      Patch %   Lines
src/cluster.jl             89.09%    6 Missing ⚠️
src/managers.jl            50.00%    1 Missing ⚠️
src/messages.jl             0.00%    1 Missing ⚠️
src/process_messages.jl    85.71%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   79.18%   79.30%   +0.11%     
==========================================
  Files          10       10              
  Lines        1898     1918      +20     
==========================================
+ Hits         1503     1521      +18     
- Misses        395      397       +2     

@JamesWrigley
Author

Alrighty, after a force push to trigger CI with the latest nightly, we are back in the green 🥳 I think this is ready to be merged now.

Contributor

@jonas-schulze jonas-schulze left a comment

I can confirm that the original code snippet from #73 (comment) is working now. 🚀

Will this be compatible with the next Julia LTS, or backported to it if necessary?

src/cluster.jl Outdated
Comment on lines 709 to 716
  @async manage(w.manager, w.id, w.config, :register)
  # wait for rr_ntfy_join with timeout
  timedout = false
- @async (sleep($timeout); timedout = true; put!(rr_ntfy_join, 1))
+ @async begin
+     sleep($timeout)
+     timedout = true
+     put!(rr_ntfy_join, 1)
+ end
Contributor

Do these tasks need an errormonitor?

Author

Yeah I think that makes sense, added in 03d7384.
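
For illustration, a minimal sketch of what wrapping such a timeout task in `errormonitor` might look like. It is not the code from 03d7384; `notify_ch` and `timeout` are hypothetical stand-ins for `rr_ntfy_join` and the interpolated `$timeout` in the snippet above.

```julia
# Hypothetical stand-ins for rr_ntfy_join and the worker timeout.
notify_ch = Channel{Int}(1)
timeout = 0.1

# errormonitor returns the task it is given and logs the error if the task
# fails, so an exception inside the block is not silently dropped.
timer_task = errormonitor(@async begin
    sleep(timeout)
    put!(notify_ch, 1)
end)

value = take!(notify_ch)   # blocks until the timer task fires
wait(timer_task)
```

Without `errormonitor`, a failure inside the `@async` block would only surface if something later called `wait` or `fetch` on the task.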

test/threads.jl Outdated
ws = ts = product(1:2, 1:2)
@testset "from worker $w1 to $w2 via 1" for (w1, w2) in ws
    @testset "from thread $w1.$t1 to $w2.$t2" for (t1, t2) in ts
        # We want (the default) lazyness, so that we wait for `Worker.c_state`!
Contributor

Suggested change
- # We want (the default) lazyness, so that we wait for `Worker.c_state`!
+ # We want (the default) laziness, so that we wait for `Worker.c_state`!

Author

Fixed in f4576aa.

test/threads.jl Outdated
end

# Wait on the spawned tasks on the owner
@sync begin
Contributor

IMO, this sync point should fail fast, if necessary:

Suggested change
- @sync begin
+ Base.Experimental.@sync begin

See JuliaLang/julia#42239 (comment)

Author

Sure that makes sense, I refactored the code to use timedwait() in f4576aa.
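
As a rough sketch of why `timedwait()` avoids hanging here (not the code from f4576aa): it polls a predicate and gives up after the timeout instead of blocking forever on a task that never finishes. The channel and task names below are hypothetical.

```julia
# One task that finishes quickly, and one that blocks until released.
release = Channel{Nothing}(0)
fast = Threads.@spawn sleep(0.1)
stuck = Threads.@spawn take!(release)

# timedwait polls the callback and returns :ok once it is true,
# or :timed_out when the timeout (in seconds) elapses first.
r_fast = timedwait(() -> istaskdone(fast), 5)
r_stuck = timedwait(() -> istaskdone(stuck), 1)

# Release the blocked task so nothing is left hanging.
put!(release, nothing)
wait(stuck)
```

A plain `@sync` around both tasks would have waited on `stuck` indefinitely; with `timedwait` the caller regains control and can report the failure.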

@JamesWrigley
Author

> Will this be compatible with the next Julia LTS, or backported to it if necessary?

I'll leave that for someone more qualified to properly answer, but FWIW if 1.11 is chosen as the next LTS then it'll be possible to upgrade Distributed now that it's an excised stdlib 🐙

@JamesWrigley
Author

One other thing I noticed is that this should probably be using Threads.@spawn everywhere instead of @async, but I'll leave that for another PR.

Contributor

@jonas-schulze jonas-schulze left a comment

Are there any updates on this?

@JamesWrigley
Author

I believe @JBlaschke was going to have a look at it. In the meantime I see the branch is out of date, so I'll rebase it.

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch 2 times, most recently from e3205d8 to fa9d645 Compare June 24, 2024 14:11
@JBlaschke

@JamesWrigley @jonas-schulze I'll be working on this this week. Just getting back up to speed after long travel...

@zsz00

zsz00 commented Jul 21, 2024

Are there any updates on this? @JamesWrigley @JBlaschke

@JamesWrigley
Author

JamesWrigley commented Jul 21, 2024

Not from me, still need someone to review it. I believe the hesitation to merge comes from Distributed being used to run the Julia tests, so it's quite critical that this works properly.

But in the meantime you can ] dev https://github.com/JamesWrigley/Distributed.jl.git#jps/threadsafe_workerstate on 1.11 to use this branch.

c_state::Condition # wait for state changes
ct_time::Float64 # creation time
conn_func::Any # used to setup connections lazily
@atomic state::WorkerState
Member

@gbaraldi gbaraldi Jul 23, 2024

If state is always read and written while holding a lock, it doesn't need to be atomic, since the lock already provides the correct memory barriers.

Author

I don't think that's guaranteed? From a cursory grep through cluster.jl I see plenty of reads outside of a lock.
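
To make the distinction concrete, here is a small sketch of a field that is written under a lock but read without one, which is the case where `@atomic` still matters. `MiniWorker` and its fields are hypothetical and much simpler than the real `Worker` struct.

```julia
# Hypothetical miniature of a worker-like struct, for illustration only.
mutable struct MiniWorker
    lock::ReentrantLock
    @atomic state::Symbol
end

w = MiniWorker(ReentrantLock(), :created)

# Writer: updates the state while holding the lock.
writer = Threads.@spawn begin
    sleep(0.05)
    lock(w.lock) do
        @atomic w.state = :connected
    end
end

# Reader: polls the state without taking the lock. The atomic load is
# what makes this unlocked read well-defined across threads.
while (@atomic w.state) !== :connected
    sleep(0.01)
end
wait(writer)
```

If every read also happened inside `lock(w.lock)`, the field could be a plain one, which is the point raised in the review comment.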

@gbaraldi
Member

Since we are making things threadsafe I would look at all @async uses because we want to deprecate it.

@JamesWrigley
Author

Ok, I replaced all uses of @async with Threads.@spawn in 7cac33f. But there are still a couple of places where Base.@async_unwrap is used, and I can't find an easy analogue for it with Threads.@spawn. I'll see if I can write a replacement.

@JamesWrigley
Author

After thinking about it for a bit, I can't come up with a decent replacement short of basically reimplementing Threads.@spawn. We could replace @async_unwrap with Threads.@spawn, but it would change the type of any exceptions thrown from whatever they currently are to a TaskFailedException, which is technically breaking.

I'd suggest keeping it in for now, we can add support for unwrapping exceptions to Threads.@spawn and switch in a future release.
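
A short sketch of the exception-type change described above, assuming a task that throws an `ArgumentError`: fetching a failed `Threads.@spawn` task surfaces a `TaskFailedException`, and the original exception has to be dug out of the wrapped task.

```julia
# A spawned task that fails with a specific exception type.
t = Threads.@spawn throw(ArgumentError("boom"))

# fetch (or wait) on a failed task throws TaskFailedException,
# not the ArgumentError itself.
caught = try
    fetch(t)
catch err
    err
end

# The original exception is still reachable through the wrapped task.
inner = caught.task.exception
```

Callers that currently catch the specific exception type would stop matching once it arrives wrapped like this, which is why the swap counts as breaking.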

test/threads.jl Outdated
Comment on lines 47 to 52
# Wait on the spawned tasks on the owner. Note that we use
# timedwait() instead of @sync to avoid deadlocks.
t1 = Threads.@spawn fetch_from_owner(wait, recv)
t2 = Threads.@spawn fetch_from_owner(wait, send)
@test timedwait(() -> istaskdone(t1), 5) == :ok
@test timedwait(() -> istaskdone(t2), 5) == :ok
Contributor

Sorry to chime in so late after #101 (comment); I noticed because GitHub unfolded all my previous comments.

I like the timedwait, which is what I used in JuliaLang/julia#37905. However, the timedwait has been the main reason (I think) why my first PR was reverted (JuliaLang/julia#38112). The second attempt (https://github.com/JuliaLang/julia/pull/38134/files) didn't use timedwait. I remain in favor of timedwait but wanted to refresh the information, as it has been a while.

Author

@JamesWrigley JamesWrigley Jul 25, 2024

I see, thanks. My view is that if this fails we're going to end up with a timeout somewhere no matter what, either in CI or the tests themselves. And my preference would be to have the timeout in the tests so we can have some control over it. I bumped it to 60s in abafb79 but I'm happy to increase that if people think it's too low. @vchuravy, @vtjnash, does that sound ok?

Development

Successfully merging this pull request may close these issues.

Concurrency violation on interplay between Distributed and Base.Threads
7 participants