Try to close race condition in FileWatching tests #38407
Oh my, the FreeBSD queue is a day behind. But when you say 'we know', how did you see that from the logs?
I don't know, but it is something I expect the kernel to guarantee (on Linux, for example, it is guaranteed that a write from the same thread can be dequeued from epoll). I don't know about FreeBSD, but I'd hope it's similar. Let's see if this helps. If not, we can take a closer look.
Can you link to a failing build? Hopefully CI will eventually run on this (a couple of times), so we can gain insight before merging this.
Sure, this shows up on about half the FreeBSD runs, e.g. https://build.julialang.org/#/builders/33/builds/5401/steps/5/logs/stdio
Force-pushed from 2c9160c to 06a369a.
FreeBSD finally ran and the FileWatching test passed (though it failed in the log test that we fixed earlier). I've rebased this, so let's try again.
Isn't any sort of It also seems to me that the situation described is basically a timeout. We're using 0.001, so maybe we should just increase it a lot.
The problem is the case where the event is already asserted when we ask for it, but we don't get the notification that the event was asserted until the next event poll. This is not a timeout, because the event was definitely there. The problem here is the scheduling order of the callback tasks: the timeout callback takes priority over the event callback when both get asserted simultaneously. The iswaiting check checks for exactly this situation. It's not racy.
If it's not a timeout, why does the timer event fire?
Because, on slow machines, the time that passes before we check whether the poll event has fired is longer than the timeout, so both end up firing on the same event check.
I consider that a timeout; with a longer timeout it wouldn't happen.
No, this is very bad. This means there is no way to write code that is guaranteed not to time out, because event processing can always be delayed an arbitrary amount of time, so no matter when the event fires, we can always get a timeout.
I agree with Jeff that this test is mostly nonsense. It's asking if the poll event happened within a millisecond. With careful coordination, that's almost true (we happen to arrange for it to happen on the next tick), so it mostly passes the test. But I agree that's a bit strong to rely upon. tl;dr People should not use timeouts on
I completely disagree. If there is a poll event pending in the kernel by the time the user asks for it, there is absolutely no reason we should be giving a timeout instead.
Ah, I see what you're saying. The event happens, then a long time goes by, then the timer happens, then we look. So you can see both events even if the requested time is plenty long enough.
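The sequence described in that last comment can be sketched with a toy single-threaded loop. This is a language-neutral illustration in Python, not Julia's or libuv's actual code: `ToyLoop` and `wait_with_timeout` are hypothetical names, and the rule that due timers are dispatched before ready I/O within one tick is an assumption that mirrors the "timeout callback takes priority" remark above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoop:
    """Toy event loop (hypothetical): one tick dispatches due timers, then ready I/O."""
    now: float = 0.0
    timers: list = field(default_factory=list)    # (deadline, callback) pairs
    ready_io: list = field(default_factory=list)  # callbacks for events already pending

    def tick(self):
        # Assumption: timers that are due run before I/O callbacks in the same tick.
        due = [cb for (deadline, cb) in self.timers if deadline <= self.now]
        self.timers = [(d, cb) for (d, cb) in self.timers if d > self.now]
        io, self.ready_io = self.ready_io, []
        for cb in due + io:
            cb()


def wait_with_timeout(loop, timeout, delay_before_tick):
    """Wait for an event that is ALREADY pending; the first callback to deliver wins."""
    result = []

    def deliver(value):
        if not result:  # only the first delivery counts
            result.append(value)

    loop.timers.append((loop.now + timeout, lambda: deliver("timeout")))
    loop.ready_io.append(lambda: deliver("event"))  # event was pending when we asked
    loop.now += delay_before_tick  # the loop is delayed before it next checks
    loop.tick()
    return result[0] if result else None


if __name__ == "__main__":
    print(wait_with_timeout(ToyLoop(), timeout=0.001, delay_before_tick=0.0))
    print(wait_with_timeout(ToyLoop(), timeout=0.001, delay_before_tick=0.01))
    print(wait_with_timeout(ToyLoop(), timeout=1.0, delay_before_tick=5.0))
```

Under these assumptions the caller is told "timeout" whenever the loop is delayed past the deadline, however long the timeout is, even though the event was pending the whole time.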
Yes.
Right, we could also unconditionally call
Or even worse (which is the situation in the test):
And then the system tells us we get a timeout! Even though the event had already happened when we asked if it did.
I don't understand what you're proposing.
Your proposal is something like
What's it doing, when it's not in the queue anymore?
Acquiring locks, I guess? See julia/stdlib/FileWatching/src/FileWatching.jl, lines 530 to 532 in 95c023b.
We're seeing frequent test failures in the FileWatching test on FreeBSD. Here's my theory of what happens:

- Both the timer callback and the poll callback execute on the same libuv loop
- They each schedule their respective tasks
- Whichever task gets scheduled first determines the result

However, in the test, we expect that if the poll callback ran (which we know because we know there was an event pending), then that result does actually get delivered to the toplevel task. This PR tries to close this hole by adding the following condition: if the task is no longer waiting on the file watcher (because libuv already scheduled it), then wait for the task to run to completion, independent of any timeout. I believe this should close the above race condition and hopefully fix the test.
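The proposed condition can be modeled with a small run-queue simulation. This is a sketch in Python under stated assumptions, not the FileWatching.jl change itself: `simulate`, the `waiting` flag, and the task names are all hypothetical stand-ins for the real watcher state and for the `iswaiting` check mentioned in the discussion.

```python
from collections import deque


def simulate(callbacks, check_waiting):
    """Run the named loop callbacks in order within one tick, then drain the run queue.

    callbacks: tuple of "timer" and/or "poll", in the order the loop invokes them.
    check_waiting: if True, the timer task defers to an already-scheduled event task.
    """
    run_queue = deque()
    state = {"waiting": True, "outcome": None}

    def deliver(value):
        if state["outcome"] is None:  # first delivery to the toplevel task wins
            state["outcome"] = value

    def event_task():
        deliver("event")

    def timer_task():
        # Modeled fix: if the watcher task is no longer waiting, the poll callback
        # has already scheduled it, so let it run to completion instead of timing out.
        if check_waiting and not state["waiting"]:
            return
        deliver("timeout")

    def poll_callback():
        state["waiting"] = False  # the watcher task has been scheduled
        run_queue.append(event_task)

    def timer_callback():
        run_queue.append(timer_task)

    table = {"poll": poll_callback, "timer": timer_callback}
    for name in callbacks:  # both callbacks run before any scheduled task does
        table[name]()
    while run_queue:
        run_queue.popleft()()
    return state["outcome"]


if __name__ == "__main__":
    print(simulate(("timer", "poll"), check_waiting=False))
    print(simulate(("timer", "poll"), check_waiting=True))
```

In this model, the timer task running first produces a timeout without the check and the event result with it. A tick on which only the timer fires still yields a timeout either way, so genuine timeouts are preserved.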
Force-pushed from 06a369a to 7fa83c9.
I've added another interval duration that reliably triggers the issue for me and demonstrates that this fixes it. I'm happy with alternative suggestions that pass this test, but I'd also like to get this merged to improve CI reliability.
Ah, right. I'm for this, but threading makes me a little nervous that it could become a minor issue again later, since it's possibly a benign MT-race here. I guess we could alternatively have the timer instead send a signal
Can we merge this? We're continuing to see this error on CI.
* Try to close race condition in FreeBSD tests

We're seeing frequent test failures in the FileWatching test on FreeBSD. Here's my theory of what happens:

- Both the timer callback and the poll callback execute on the same libuv loop
- They each schedule their respective tasks
- Whichever task gets scheduled first determines the result

However, in the test, we expect that if the poll callback ran (which we know because we know there was an event pending), then that result does actually get delivered to the toplevel task. This PR tries to close this hole by adding the following condition: if the task is no longer waiting on the file watcher (because libuv already scheduled it), then wait for the task to run to completion, independent of any timeout. I believe this should close the above race condition and hopefully fix the test.

* Add another super-short timeout to try to trigger the same-tick issue

(cherry picked from commit 9a8a675)

Co-authored-by: Keno Fischer <keno@juliacomputing.com>