Closed mailboxes cannot be handled by client code #64

Geod24 · 2020-05-06T17:52:35Z

Found when investigating this issue with Agora: bosagora/agora#796

Context

A few months ago we unintentionally (and carelessly, to be fair) did a subtle change: we transitioned from silently ignoring put on closed message boxes to a hard failure (assert). See 814377a#diff-7e552a35c6e12cbbfb124195a3e45cd2R1126-R1127
This is something we do want, as we're transitioning from message boxes to channels. In the channel model, calling put on a closed channel leads to a hard failure (panic in Go).

Today's bug

As I was trying to upgrade LocalRest, the test-suite started to fail. I quickly tracked it down to the aforementioned commit, then spent a few more hours understanding what was going on.

In Agora we have a few asynchronous tasks that perform recurrent actions on timers (e.g. catching-up).
Those tasks currently live independently, and cannot be stopped. In normal mode, they are just run by Vibe.d, while in LocalRest, they are a fiber that gets scheduled periodically.

This is all nice until one tries to shut down the node. Because they do periodic send, they will hit the assert (no one told them the connection was closed, heh), and that'll lead to weird crashes, because threads. I originally thought we needed to stop those tasks, but it's a client-side solution for a library-side problem, and a very un-elegant one.

Tentative solution

My original approach was simply to add handling code in the sender code to simulate a timeout (which is what one would most likely see in the real world). Spoiler alert: It didn't work. Since it's a race condition. sometimes the task that dies is the one that does the request, so when the node respond, poof, closed message box.

This then led to another realization: doing the following is subject to the same race condition:

if (!tid.mbox.isClosed())
    send(tid, ...);

Because essentially, the message box could have been closed between the check and the send.
This means we need to provide, at the MessageBox level, a way to ignore failures, as we need the check to happen after the lock is taken.

The text was updated successfully, but these errors were encountered:

Geod24 · 2020-07-21T01:30:05Z

Fixed by #65

Geod24 added the bug Something isn't working label May 6, 2020

Geod24 self-assigned this May 6, 2020

Geod24 mentioned this issue May 9, 2020

Properly report internal errors & fix issue 64 #65

Merged

Geod24 closed this as completed Jul 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Closed mailboxes cannot be handled by client code #64

Closed mailboxes cannot be handled by client code #64

Geod24 commented May 6, 2020

Geod24 commented Jul 21, 2020

Closed mailboxes cannot be handled by client code #64

Closed mailboxes cannot be handled by client code #64

Comments

Geod24 commented May 6, 2020

Context

Today's bug

Tentative solution

Geod24 commented Jul 21, 2020