partr: Fix deadlock #34807

Merged
merged 3 commits into from
Feb 19, 2020
Conversation

Keno
Member

@Keno Keno commented Feb 18, 2020

When there is no work to do, the first thread to be idle will attempt to
run the event loop once, waiting for any notifications (which will usually
create new work). However, there is an interesting corner case where a
notification arrives, but no work was scheduled. That doesn't usually happen,
but there are a few situations where it does:

  1. Somebody used a libuv primitive outside of julia, so the callback
    doesn't schedule any julia work.
  2. Another thread forcibly interrupted the waiting thread because it wants to
    take over the event loop for various reasons.
  3. On Windows, we occasionally get spurious wake ups of the event loop.

The existing code in partr assumed that we were in situation 2, i.e. that
there was another thread waiting to take over the event loop, so it released
the event loop and simply put the current thread to sleep in the expectation
that another thread will pick it up. However, if we instead are in one of the
other two conditions, there may not be another thread there to pick up the event
loop. Thus, with no thread owning the event loop, julia will stop responding to
events and effectively deadlock. Since both 1 and 3 are rare, and we don't actually
enter the event loop until there has been no work for 4 milliseconds (which is itself fairly rare),
this condition rarely happens, but it is occasionally observable on Windows, where it
caused #34769. To test that this fix works, we manually create situation 1 in the
test by creating an idle callback, which will prevent the event loop from blocking,
but only schedules julia work after it's been called 100 times. This reproduces
the observed failure from the issue and is fixed by this PR.
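The handoff logic described above can be sketched as a toy model. This is hypothetical Python, not Julia's actual partr C code; the function name and return values are illustrative only:

```python
# Toy model of the wake-up policy described above (hypothetical names, not
# the actual partr implementation). A thread owning the event loop wakes up
# with no scheduled work and must decide what to do with loop ownership.

def wakeup_policy(other_thread_waiting, fixed=True):
    """Return the event loop's owner after a wakeup that created no work.

    other_thread_waiting: True in situation 2 (another thread wants the loop).
    Returns "other", "self", or None (None = nobody owns the loop: deadlock).
    """
    if not fixed:
        # Old behavior: always assume situation 2 and release the loop.
        return "other" if other_thread_waiting else None
    # Fixed behavior: hand off only if a thread is actually waiting;
    # otherwise keep ownership and re-enter the event loop.
    return "other" if other_thread_waiting else "self"

# Situations 1 and 3 (spurious wakeup, nobody waiting): the old policy
# leaves the loop ownerless, while the fixed one keeps running it.
assert wakeup_policy(False, fixed=False) is None    # deadlock
assert wakeup_policy(False, fixed=True) == "self"   # loop keeps an owner
assert wakeup_policy(True, fixed=False) == "other"  # situation 2 still works
```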

Fixes #34769
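The test strategy can likewise be sketched with a minimal stand-in for a libuv idle callback. This is Python pseudocode with hypothetical names; the real test registers an actual libuv idle handle:

```python
# Sketch of the regression test's idle callback (hypothetical names).
# An idle callback keeps the event loop from blocking; this one schedules
# Julia-side work only on its 100th invocation, so the first 99 wakeups
# are exactly "situation 1": the loop wakes with no work scheduled.

def make_idle_callback(schedule_work):
    count = 0
    def idle_cb():
        nonlocal count
        count += 1
        if count == 100:
            schedule_work()  # first time any work is created
            return False     # stop the idle handle
        return True          # keep the loop spinning with no work
    return idle_cb

scheduled = []
cb = make_idle_callback(lambda: scheduled.append("work"))
spins = 0
while cb():
    spins += 1
assert spins == 99 and scheduled == ["work"]
```

With the old handoff policy, any of those 99 empty wakeups could leave the loop ownerless before the work ever gets scheduled, which is how the test reproduces the hang.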

Co-authored-by: Jeff Bezanson <jeff@juliacomputing.com>
Co-authored-by: Jameson Nash <jameson@juliacomputing.com>
@Keno Keno added domain:multithreading Base.Threads and related functionality backport 1.4 labels Feb 18, 2020
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
@ViralBShah
Member

Can this be backported to 1.3 and 1.4?

@Keno
Member Author

Keno commented Feb 19, 2020

Will there be another 1.3 release?

@KristofferC
Sponsor Member

Unlikely

@StefanKarpinski
Sponsor Member

Wow, that’s some serious sleuth work. Also, new record for highest test-to-changed-code ratio?

@ViralBShah
Member

ViralBShah commented Feb 19, 2020

@vlandau can you test this once it is merged using a nightly or an rc. Also, @vlandau, thanks for persevering through this and getting us the debug information.

@vlandau

vlandau commented Feb 19, 2020

@vlandau can you test this once it is merged using a nightly or an rc. Also, @vlandau, thanks for persevering through this and getting us the debug information.

Yes, happy to test this out in my use case! I learned a lot in the process, so glad to have been able to help get the debug info! Thanks very much @Keno et al.

Unlikely

@KristofferC Do you mean this will likely only be released in 1.4? (not that it's a big deal for me, as users of my package will likely never encounter this bug in 1.3)

@KristofferC
Sponsor Member

It means that it is unlikely that there will be another 1.3 release.

@vlandau

vlandau commented Feb 19, 2020

It means that it is unlikely that there will be another 1.3 release.

yeah I got that part of it

@Keno Keno merged commit f36edc2 into master Feb 19, 2020
@Keno Keno deleted the kf/34769 branch February 19, 2020 18:20
KristofferC pushed a commit that referenced this pull request Feb 19, 2020
(cherry picked from commit f36edc2)
@KristofferC KristofferC mentioned this pull request Feb 19, 2020
birm pushed a commit to birm/julia that referenced this pull request Feb 22, 2020
KristofferC pushed a commit that referenced this pull request Apr 11, 2020
Successfully merging this pull request may close these issues.

Multi-threading deadlock bug in Windows (Julia 1.3 and 1.4-rc1)
7 participants