Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed mailboxes cannot be handled by client code #64

Closed
Geod24 opened this issue May 6, 2020 · 1 comment
Closed

Closed mailboxes cannot be handled by client code #64

Geod24 opened this issue May 6, 2020 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@Geod24
Copy link
Owner

Geod24 commented May 6, 2020

Found when investigating this issue with Agora: bosagora/agora#796

Context

A few months ago we unintentionally (and carelessly, to be fair) did a subtle change: we transitioned from silently ignoring put on closed message boxes to a hard failure (assert). See 814377a#diff-7e552a35c6e12cbbfb124195a3e45cd2R1126-R1127
This is something we do want, as we're transitioning from message boxes to channels. In the channel model, calling put on a closed channel leads to a hard failure (panic in Go).

Today's bug

As I was trying to upgrade LocalRest, the test-suite started to fail. I quickly tracked it down to the aforementioned commit, then spent a few more hours understanding what was going on.

In Agora we have a few asynchronous tasks that perform recurrent actions on timers (e.g. catching-up).
Those tasks currently live independently, and cannot be stopped. In normal mode, they are just run by Vibe.d, while in LocalRest, they are a fiber that gets scheduled periodically.

This is all nice until one tries to shut down the node. Because they do periodic send, they will hit the assert (no one told them the connection was closed, heh), and that'll lead to weird crashes, because threads. I originally thought we needed to stop those tasks, but it's a client-side solution for a library-side problem, and a very un-elegant one.

Tentative solution

My original approach was simply to add handling code in the sender code to simulate a timeout (which is what one would most likely see in the real world). Spoiler alert: It didn't work. Since it's a race condition. sometimes the task that dies is the one that does the request, so when the node respond, poof, closed message box.

This then led to another realization: doing the following is subject to the same race condition:

if (!tid.mbox.isClosed())
    send(tid, ...);

Because essentially, the message box could have been closed between the check and the send.
This means we need to provide, at the MessageBox level, a way to ignore failures, as we need the check to happen after the lock is taken.

@Geod24 Geod24 added the bug Something isn't working label May 6, 2020
@Geod24 Geod24 self-assigned this May 6, 2020
@Geod24
Copy link
Owner Author

Geod24 commented Jul 21, 2020

Fixed by #65

@Geod24 Geod24 closed this as completed Jul 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant