Try to close race condition in FileWatching tests #38407

Keno · 2020-11-12T05:05:28Z

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:

- Both the timer callback and the poll callback execute on the same libuv loop
- They each schedule their respective tasks
- Whichever task gets scheduled first determines the result

However, in the test we expect that if the poll callback ran (which we know
it did, because an event was already pending), its result actually gets
delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.

vtjnash · 2020-11-12T15:23:03Z

oh my, the FreeBSD queue is a day behind. But when you say 'we know', how did you see that from the logs?

Keno · 2020-11-12T19:00:42Z

> But when you say 'we know'

I don't know for certain, but it is something I expect the kernel to guarantee (on Linux, for example, a write made from the same thread is guaranteed to be dequeueable from epoll). I don't know about FreeBSD, but I'd hope it's similar. Let's see if this helps. If not, we can take a closer look.

vtjnash · 2020-11-12T21:16:44Z

Can you link to a failing build? Hopefully CI will eventually run on this (a couple of times), so we can gain insight before merging this.

Keno · 2020-11-13T03:19:29Z

Sure, this shows up on about half the FreeBSD runs, e.g. https://build.julialang.org/#/builders/33/builds/5401/steps/5/logs/stdio

Keno · 2020-11-18T04:29:09Z

FreeBSD finally ran and the FileWatching test passed (though it failed in the log test that we fixed earlier). I've rebased this, so let's try again.

JeffBezanson · 2020-11-24T15:43:27Z

Isn't any sort of iswaiting check also a race condition (though possibly much less likely to be a problem in this case)?

It also seems to me that the situation described is basically a timeout. We're using 0.001, so maybe we should just increase it a lot.

Keno · 2020-11-24T15:47:07Z

The problem is the case where the event is already asserted when we ask for it, but we don't get the notification that the event was asserted until the next event poll. This is not a timeout, because the event was definitely there. The problem here is the scheduling order of the callback tasks: the timeout callback takes priority over the event callback when both get asserted simultaneously. The iswaiting check checks for exactly this situation. It's not racy.

JeffBezanson · 2020-11-24T15:49:52Z

If it's not a timeout, why does the timer event fire?

Keno · 2020-11-24T15:52:18Z

> If it's not a timeout, why does the timer event fire?

Because the time that passes before we check whether the poll event has fired is longer than the timeout (on slow machines), so both end up firing on the same event check.

JeffBezanson · 2020-11-24T15:57:21Z

I consider that a timeout; with a longer timeout it wouldn't happen.

Keno · 2020-11-24T16:00:35Z

No, this is very bad. This means there is no way to write code that is guaranteed not to time out, because event processing can always be delayed by an arbitrary amount of time, so no matter when the event fires, we can always get a timeout.

vtjnash · 2020-11-24T16:01:10Z

I agree with Jeff that this test is mostly nonsense. It's asking if the poll event happened within a millisecond. With careful coordination, that's almost true (we happen to arrange for it to happen on the next tick), so it mostly passes the test. But I agree that's a bit strong to rely upon.

tl;dr People should not use timeouts on wait

Keno · 2020-11-24T16:03:15Z

I completely disagree. If there is a poll event pending in the kernel by the time the user asks for it, there is absolutely no reason we should be giving a timeout instead.

JeffBezanson · 2020-11-24T16:07:14Z

Ah, I see what you're saying. The event happens, then a long time goes by, then the timer happens, then we look. So you can see both events even if the requested time is plenty long enough.

Keno · 2020-11-24T16:09:28Z

Yes.

vtjnash · 2020-11-24T16:10:36Z

Right, we could also unconditionally call yield here and be guaranteed an event, the iswaiting test is just to make it faster after it gets a timeout

Keno · 2020-11-24T16:10:38Z

Or even worse (which is the situation in the test):

1. The event happens
2. We ask if it happened and start the timer
3. The timer fires
4. We check for events

And then the system tells us we get a timeout! Even though the event had already happened when we asked if it did.
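
The sequence above can be sketched as a small timeline model. This is only an illustration, not the FileWatching implementation: the function `observed`, its keyword names, and all the times are hypothetical. It assumes a single event-loop check at `checked_at` that sees everything that became ready before it, and that a tie between event and timer goes to the timer unless `prefer_event` is set.

```julia
# Hypothetical timeline model; all names and times are illustrative only.
# Times are in seconds. One event-loop check happens at `checked_at`.
function observed(; event_at, asked_at, timeout, checked_at, prefer_event::Bool)
    event_ready = event_at <= checked_at            # event was pending by the check
    timer_ready = asked_at + timeout <= checked_at  # timer expired by the check
    if event_ready && timer_ready
        # Both are seen on the same check: without a priority rule the
        # timer is assumed to win, which is the failure described above.
        return prefer_event ? :event : :timeout
    elseif event_ready
        return :event
    elseif timer_ready
        return :timeout
    else
        return :pending
    end
end

# The event is pending before we even ask, but the check is delayed.
short = observed(event_at=0.0, asked_at=1.0, timeout=0.001, checked_at=1.5, prefer_event=false)
# A much longer timeout does not help if the check is delayed even longer.
long  = observed(event_at=0.0, asked_at=1.0, timeout=10.0, checked_at=20.0, prefer_event=false)
# Giving the pending event priority reports it regardless of the delay.
fixed = observed(event_at=0.0, asked_at=1.0, timeout=10.0, checked_at=20.0, prefer_event=true)
```

In this model no finite `timeout` rules out a `:timeout` result, because `checked_at` can always be later; only the priority rule does.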

Keno · 2020-11-24T16:12:27Z

> Right, we could also unconditionally call yield here and be guaranteed an event

I don't understand what you're proposing

vtjnash · 2020-11-24T16:17:07Z

wait(t) and yield() are guaranteed to give the same results

Keno · 2020-11-24T16:26:39Z

Your proposal is something like istaskdone(t) || yield()? I tried that and I got the timeout failure, because there is a multi-task dance of task scheduling happening, so we come back to the original task before t finishes.
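
A minimal sketch of the difference being described, using only Base task primitives. The task body is a stand-in: the two `yield()` calls merely model a task that needs several scheduling rounds before it finishes (for example while handing locks back and forth), and are not taken from FileWatching.

```julia
# Stand-in for the watcher task: once woken, it still needs a few
# scheduling rounds (modelled by `yield()`) before it can finish.
t = @async begin
    yield()   # e.g. releasing one lock
    yield()   # e.g. acquiring another
    :event
end

istaskdone(t) || yield()           # a single yield hands control over once
done_after_yield = istaskdone(t)   # typically still false at this point

wait(t)                            # blocks until t has actually finished
done_after_wait = istaskdone(t)
result = fetch(t)
```

On a single thread, control usually returns to the original task before `t` completes, so `done_after_yield` is not a reliable signal; `wait(t)` is the call that only returns once the task has run to completion.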

vtjnash · 2020-11-24T16:30:24Z

What's it doing, when it's not in the queue anymore?

Keno · 2020-11-24T16:40:03Z

acquiring locks I guess?

julia/stdlib/FileWatching/src/FileWatching.jl

Lines 530 to 532 in 95c023b

    unlock(fdw.notify)
    iolock_begin()
    lock(fdw.notify)

Keno · 2020-11-27T03:26:30Z

I've added another interval duration that for me triggers the issue reliably and demonstrates that this fixes it. I'm happy with alternative suggestions that pass this test, but I'd also like to get this merged to improve CI reliability.

vtjnash · 2020-11-28T00:23:00Z

Ah, right. I'm for this, but threading makes me a little nervous that it could become a minor issue again later, since it's possibly a benign MT-race here. I guess we could alternatively have the timer instead send a signal notify(fdw.notify.waitq, FDEvent()) and make this an unconditional wait?
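
The alternative suggested here (the timer notifies the same condition with an empty result, and the caller does one unconditional wait) could look roughly like the sketch below. It is an assumption-laden illustration, not the FileWatching code: `wait_event` is a hypothetical helper, a plain `Threads.Condition` stands in for `fdw.notify`, and the symbols `:readable` and `:timeout` stand in for `FDEvent` values.

```julia
# Hypothetical helper: wait on a single condition that both the event
# source and the timer notify. Whoever notifies first supplies the result.
function wait_event(cond::Threads.Condition, timeout_s::Real)
    lock(cond)
    try
        # The timer delivers an "empty" result through the same condition.
        timer = Timer(timeout_s) do _
            lock(cond)
            try
                notify(cond, :timeout)
            finally
                unlock(cond)
            end
        end
        try
            return wait(cond)   # returns the value passed to notify
        finally
            close(timer)
        end
    finally
        unlock(cond)
    end
end

# Usage: an event source notifies the condition once the waiter is blocked.
cond = Threads.Condition()
@async begin
    lock(cond)
    try
        notify(cond, :readable)
    finally
        unlock(cond)
    end
end
got_event = wait_event(cond, 5.0)

# With no event source, the timer's notification is what wakes the waiter.
got_timeout = wait_event(Threads.Condition(), 0.01)
```

Because there is only one wait and one queue of notifications, the caller never has to inspect whether another task is still waiting; the trade-off is that the ordering of the two notifications still decides the result.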

Keno · 2020-12-09T18:45:08Z

Can we merge this? We're continuing to see this error on CI.

Keno requested review from vtjnash and JeffBezanson November 12, 2020 05:05

Keno changed the title ~~Try to close race condition in FreeBSD tests~~ Try to close race condition in FileWatching tests Nov 12, 2020

Keno force-pushed the kf/tryfixfilewatching branch from 2c9160c to 06a369a Compare November 18, 2020 04:28

JeffBezanson added the ci Continuous integration label Nov 24, 2020

Keno added 2 commits November 26, 2020 22:24

Add another super-short timeout to try to trigger the same-tick issue

7fa83c9

Keno force-pushed the kf/tryfixfilewatching branch from 06a369a to 7fa83c9 Compare November 27, 2020 03:25

Keno merged commit 9a8a675 into master Dec 9, 2020

Keno deleted the kf/tryfixfilewatching branch December 9, 2020 22:36
