Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to close race condition in FileWatching tests #38407

Merged
merged 2 commits into from Dec 9, 2020
Merged

Conversation

Keno
Copy link
Member

@Keno Keno commented Nov 12, 2020

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:

  • Both the timer callback and the poll callback execute on the same libuv loop
  • They each schedule their respective tasks
  • Whichever task gets scheduled first first determines the result

However, in the test, we expect that, if the poll callback ran, (which we know
because we know there was an event pending), then that result does
actually get delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.

@Keno Keno changed the title Try to close race condition in FreeBSD tests Try to close race condition in FileWatching tests Nov 12, 2020
@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 12, 2020

oh my, the FreeBSD queue is a day behind. But when you say 'we know', how did you see that from the logs?

@Keno
Copy link
Member Author

Keno commented Nov 12, 2020

But when you say 'we know'

I don't know, but it is something I expect the kernel to guarantee (it is guaranteed on linux for example that same thread write will be able to be dequeued from epoll). I don't know about FreeBSD, but I'd hope it's similar. Let's see if this helps. If not, we can take a closer look.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 12, 2020

Can you link to a failing build? Eventually hopefully CI will run on this (a couple times), so we can gain insight before merging this.

@Keno
Copy link
Member Author

Keno commented Nov 13, 2020

Sure, this shows up on like half the freebsd runs, e.g. https://build.julialang.org/#/builders/33/builds/5401/steps/5/logs/stdio

@Keno
Copy link
Member Author

Keno commented Nov 18, 2020

Freebsd finally ran and the FileWatching test passed (though it failed in the log test that we fixed earlier). I've rebased this and let's try again.

@JeffBezanson
Copy link
Sponsor Member

Isn't any sort of iswaiting check also a race condition (though possibly much less likely to be a problem in this case)?

It also seems to me that the situation described is basically a timeout. We're using 0.001, so maybe we should just increase it a lot.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

The problem is the case where the event is already asserted when we ask for it, but we don't get the notification that the event was asserted until the next event poll. This is not a timeout, because the event was definitely there. The problem here is the schedule order of the callback tasks where the timeout callback takes priority over the event callback when both get asserted simultaneously. The iswaiting check checks for exactly this situation. It's not a racy.

@JeffBezanson
Copy link
Sponsor Member

If it's not a timeout, why does the timer event fire?

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

If it's not a timeout, why does the timer event fire?

Because the time before we check for the poll event has fired is longer than the timeout time (on slow machines), so they end up firing on the same event check.

@JeffBezanson
Copy link
Sponsor Member

I consider that a timeout; with a longer timeout it wouldn't happen.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

No, this is very bad. This means there is no way to write code that is guaranteed not to timeout, because event processing can always be delayed an arbitrary amount of time, so no matter when the event fires, we can always get a timeout.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 24, 2020

I agree with Jeff that this test is mostly nonsense. It's asking if the poll event happened within a millisecond. With careful coordination, that's almost true (we happen to arrange for it to happen on the next tick), so it mostly passes the test. But I agree that's a bit strong to rely upon.

tl;dr People should not use timeouts on wait

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

I completely disagree. If there is a poll event pending in the kernel by the time the user asks for it, there is absolutely no reason we be giving a timeout instead.

@JeffBezanson
Copy link
Sponsor Member

Ah, I see what you're saying. The event happens, then a long time goes by, then the timer happens, then we look. So you can see both events even if the requested time is plenty long enough.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

Yes.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 24, 2020

Right, we could also unconditionally call yield here and be guaranteed an event, the iswaiting test is just to make it faster after it gets a timeout

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

Or even worse (which is the situation in the test):

  1. The event happens
  2. We ask if it happened and start the timer
  3. The timer fires
  4. We check for events

And then the system tells us we get a timeout! Even though the event had already happened when we asked if it did.

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

Right, we could also unconditionally call yield here and be guaranteed an event

I don't understand what you're proposing

@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 24, 2020

wait(t) and yield() are guaranteed to give the same results

@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

Your proposal is something like istaskdone(t) || yield()? I tried that and I got the timeout failure, because there is a multi-task dance of task scheduling happening, so we come back to the original task before t finishes.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 24, 2020

What's it doing, when it's not in the queue anymore?

@JeffBezanson JeffBezanson added the ci Continuous integration label Nov 24, 2020
@Keno
Copy link
Member Author

Keno commented Nov 24, 2020

acquiring locks I guess?

unlock(fdw.notify)
iolock_begin()
lock(fdw.notify)

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:
 - Both the timer callback and the poll callback execute on the same libuv loop
 - They each schedule their respective tasks
 - Whichever task gets scheduled first first determines the result

However, in the test, we expect that, if the poll callback ran, (which we know
because we know there was an event pending), then that result does
actually get delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.
@Keno
Copy link
Member Author

Keno commented Nov 27, 2020

I've added another interval duration that for me triggers the issue reliably and demonstrates that this fixes it. I'm happy with alternative suggestions that pass this test, but I'd also like to get this merged to improve CI reliability.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 28, 2020

Ah, right. I'm for this, but threading makes me a little nervous that it could become a minor issue again later, since it's possibly a benign MT-race here. I guess we could alternatively have the timer instead send a signal notify(fwd.notify.waitq, FDEvent()) and make this an unconditional wait?

@Keno
Copy link
Member Author

Keno commented Dec 9, 2020

Can we merge this? We're continuing to see this error on CI.

@Keno Keno merged commit 9a8a675 into master Dec 9, 2020
@Keno Keno deleted the kf/tryfixfilewatching branch December 9, 2020 22:36
ararslan pushed a commit that referenced this pull request Mar 10, 2021
* Try to close race condition in FreeBSD tests

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:
 - Both the timer callback and the poll callback execute on the same libuv loop
 - They each schedule their respective tasks
 - Whichever task gets scheduled first first determines the result

However, in the test, we expect that, if the poll callback ran, (which we know
because we know there was an event pending), then that result does
actually get delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.

* Add another super-short timeout to try to trigger the same-tick issue

(cherry picked from commit 9a8a675)
KristofferC pushed a commit that referenced this pull request Mar 10, 2021
* Try to close race condition in FreeBSD tests

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:
 - Both the timer callback and the poll callback execute on the same libuv loop
 - They each schedule their respective tasks
 - Whichever task gets scheduled first first determines the result

However, in the test, we expect that, if the poll callback ran, (which we know
because we know there was an event pending), then that result does
actually get delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.

* Add another super-short timeout to try to trigger the same-tick issue

(cherry picked from commit 9a8a675)

Co-authored-by: Keno Fischer <keno@juliacomputing.com>
KristofferC pushed a commit that referenced this pull request Mar 10, 2021
* Try to close race condition in FreeBSD tests

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:
 - Both the timer callback and the poll callback execute on the same libuv loop
 - They each schedule their respective tasks
 - Whichever task gets scheduled first first determines the result

However, in the test, we expect that, if the poll callback ran, (which we know
because we know there was an event pending), then that result does
actually get delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.

* Add another super-short timeout to try to trigger the same-tick issue

(cherry picked from commit 9a8a675)

Co-authored-by: Keno Fischer <keno@juliacomputing.com>
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
* Try to close race condition in FreeBSD tests

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:
 - Both the timer callback and the poll callback execute on the same libuv loop
 - They each schedule their respective tasks
 - Whichever task gets scheduled first first determines the result

However, in the test, we expect that, if the poll callback ran, (which we know
because we know there was an event pending), then that result does
actually get delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.

* Add another super-short timeout to try to trigger the same-tick issue
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
* Try to close race condition in FreeBSD tests

We're seeing frequent test failures in the FileWatching test on
FreeBSD. Here's my theory of what happens:
 - Both the timer callback and the poll callback execute on the same libuv loop
 - They each schedule their respective tasks
 - Whichever task gets scheduled first first determines the result

However, in the test, we expect that, if the poll callback ran, (which we know
because we know there was an event pending), then that result does
actually get delivered to the toplevel task. This PR tries to close
this hole by adding the following condition:

If the task is no longer waiting on the file watcher (because libuv already
scheduled it), then wait for the task to run to completion, independent
of any timeout. I believe this should close the above race condition
and hopefully fix the test.

* Add another super-short timeout to try to trigger the same-tick issue

(cherry picked from commit 9a8a675)

Co-authored-by: Keno Fischer <keno@juliacomputing.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Continuous integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants