-
-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
partr: Fix deadlock #34807
partr: Fix deadlock #34807
Conversation
When there is no work to do, the first thread to be idle will attempt to run the event loop once, waiting for any notifications (which will usually create new work). However, there is an interesting corner case where a notification arrives, but no work was scheduled. That doesn't usually happen, but there are a few situations where it does: 1) Somebody used a libuv primitive outside of julia, so the callback doesn't schedule any julia work. 2) Another thread forbily interrupted the waiting thread because it wants to take over the event loop for various reasons 3) On Windows, we occaisionally get supurious wake ups of the event loop. The existing code in partr assumed that we were in situation 2, i.e. that there was another thread waiting to take over the event loop, so it released the event loop and simply put the current thread to sleep in the expectation that another thread will pick it up. However, if we instead are in one of the other two conditions, there may not be another thread there to pick up the event loop. Thus, with no thread owning the event loop, julia will stop responding to events and effectively deadlock. Since both 1 and 3 are rare, and we don't actually enter the event loop until there was no work for 4 milliseconds (which is fairly rare), this condition rarely happens, but is occaisionally observable on Windows, where it caused #34769. To test that this fix works, we manually create situation 1 in the test by creating an idle callback, which will prevent the event loop from blocking, but only schedules julia work after it's been called 100 times. This reproduces the observed failure from the issue and is fixed by this PR. Fixes #34769 Co-authored-by: Jeff Bezanson <jeff@juliacomputing.com> Co-authored-by: Jameson Nash <jameson@juliacomputing.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Can this be backported to 1.3 and 1.4? |
Will there be another 1.3 release? |
Unlikely |
Wow, that’s some serious sleuth work. Also, new record for highest test to changed code ratio? |
Yes, happy to test this out in my use case! I learned a lot in the process, so glad to have been able to help get the debug info! Thanks very much @Keno et al.
@KristofferC Does you mean this will be likely only be released in 1.4? (not that it's a big deal for me, as users of my package will likely never encounter this bug in 1.3) |
It means that it is unlikely that there will be another 1.3 release. |
yeah I got that part of it |
When there is no work to do, the first thread to be idle will attempt to run the event loop once, waiting for any notifications (which will usually create new work). However, there is an interesting corner case where a notification arrives, but no work was scheduled. That doesn't usually happen, but there are a few situations where it does: 1) Somebody used a libuv primitive outside of julia, so the callback doesn't schedule any julia work. 2) Another thread forbily interrupted the waiting thread because it wants to take over the event loop for various reasons 3) On Windows, we occasionally get spurious wake ups of the event loop. The existing code in partr assumed that we were in situation 2, i.e. that there was another thread waiting to take over the event loop, so it released the event loop and simply put the current thread to sleep in the expectation that another thread will pick it up. However, if we instead are in one of the other two conditions, there may not be another thread there to pick up the event loop. Thus, with no thread owning the event loop, julia will stop responding to events and effectively deadlock. Since both 1 and 3 are rare, and we don't actually enter the event loop until there was no work for 4 milliseconds (which is fairly rare), this condition rarely happens, but is occasionally observable on Windows, where it caused #34769. To test that this fix works, we manually create situation 1 in the test by creating an idle callback, which will prevent the event loop from blocking, but only schedules julia work after it's been called 100 times. This reproduces the observed failure from the issue and is fixed by this PR. Fixes #34769 Co-authored-by: Jeff Bezanson <jeff@juliacomputing.com> Co-authored-by: Jameson Nash <jameson@juliacomputing.com> (cherry picked from commit f36edc2)
When there is no work to do, the first thread to be idle will attempt to run the event loop once, waiting for any notifications (which will usually create new work). However, there is an interesting corner case where a notification arrives, but no work was scheduled. That doesn't usually happen, but there are a few situations where it does: 1) Somebody used a libuv primitive outside of julia, so the callback doesn't schedule any julia work. 2) Another thread forbily interrupted the waiting thread because it wants to take over the event loop for various reasons 3) On Windows, we occasionally get spurious wake ups of the event loop. The existing code in partr assumed that we were in situation 2, i.e. that there was another thread waiting to take over the event loop, so it released the event loop and simply put the current thread to sleep in the expectation that another thread will pick it up. However, if we instead are in one of the other two conditions, there may not be another thread there to pick up the event loop. Thus, with no thread owning the event loop, julia will stop responding to events and effectively deadlock. Since both 1 and 3 are rare, and we don't actually enter the event loop until there was no work for 4 milliseconds (which is fairly rare), this condition rarely happens, but is occasionally observable on Windows, where it caused JuliaLang#34769. To test that this fix works, we manually create situation 1 in the test by creating an idle callback, which will prevent the event loop from blocking, but only schedules julia work after it's been called 100 times. This reproduces the observed failure from the issue and is fixed by this PR. Fixes JuliaLang#34769 Co-authored-by: Jeff Bezanson <jeff@juliacomputing.com> Co-authored-by: Jameson Nash <jameson@juliacomputing.com>
When there is no work to do, the first thread to be idle will attempt to run the event loop once, waiting for any notifications (which will usually create new work). However, there is an interesting corner case where a notification arrives, but no work was scheduled. That doesn't usually happen, but there are a few situations where it does: 1) Somebody used a libuv primitive outside of julia, so the callback doesn't schedule any julia work. 2) Another thread forbily interrupted the waiting thread because it wants to take over the event loop for various reasons 3) On Windows, we occasionally get spurious wake ups of the event loop. The existing code in partr assumed that we were in situation 2, i.e. that there was another thread waiting to take over the event loop, so it released the event loop and simply put the current thread to sleep in the expectation that another thread will pick it up. However, if we instead are in one of the other two conditions, there may not be another thread there to pick up the event loop. Thus, with no thread owning the event loop, julia will stop responding to events and effectively deadlock. Since both 1 and 3 are rare, and we don't actually enter the event loop until there was no work for 4 milliseconds (which is fairly rare), this condition rarely happens, but is occasionally observable on Windows, where it caused #34769. To test that this fix works, we manually create situation 1 in the test by creating an idle callback, which will prevent the event loop from blocking, but only schedules julia work after it's been called 100 times. This reproduces the observed failure from the issue and is fixed by this PR. Fixes #34769 Co-authored-by: Jeff Bezanson <jeff@juliacomputing.com> Co-authored-by: Jameson Nash <jameson@juliacomputing.com>
When there is no work to do, the first thread to be idle will attempt to
run the event loop once, waiting for any notifications (which will usually
create new work). However, there is an interesting corner case where a
notification arrives, but no work was scheduled. That doesn't usually happen,
but there are a few situations where it does:
doesn't schedule any julia work.
take over the event loop for various reasons
The existing code in partr assumed that we were in situation 2, i.e. that
there was another thread waiting to take over the event loop, so it released
the event loop and simply put the current thread to sleep in the expectation
that another thread will pick it up. However, if we instead are in one of the
other two conditions, there may not be another thread there to pick up the event
loop. Thus, with no thread owning the event loop, julia will stop responding to
events and effectively deadlock. Since both 1 and 3 are rare, and we don't actually
enter the event loop until there was no work for 4 milliseconds (which is fairly rare),
this condition rarely happens, but is occasionally observable on Windows, where it
caused #34769. To test that this fix works, we manually create situation 1 in the
test by creating an idle callback, which will prevent the event loop from blocking,
but only schedules julia work after it's been called 100 times. This reproduces
the observed failure from the issue and is fixed by this PR.
Fixes #34769