You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A few months ago we unintentionally (and carelessly, to be fair) did a subtle change: we transitioned from silently ignoring put on closed message boxes to a hard failure (assert). See 814377a#diff-7e552a35c6e12cbbfb124195a3e45cd2R1126-R1127
This is something we do want, as we're transitioning from message boxes to channels. In the channel model, calling put on a closed channel leads to a hard failure (panic in Go).
Today's bug
As I was trying to upgrade LocalRest, the test-suite started to fail. I quickly tracked it down to the aforementioned commit, then spent a few more hours understanding what was going on.
In Agora we have a few asynchronous tasks that perform recurrent actions on timers (e.g. catching-up).
Those tasks currently live independently, and cannot be stopped. In normal mode, they are just run by Vibe.d, while in LocalRest, they are a fiber that gets scheduled periodically.
This is all nice until one tries to shut down the node. Because they do periodic send, they will hit the assert (no one told them the connection was closed, heh), and that'll lead to weird crashes, because threads. I originally thought we needed to stop those tasks, but it's a client-side solution for a library-side problem, and a very un-elegant one.
Tentative solution
My original approach was simply to add handling code in the sender code to simulate a timeout (which is what one would most likely see in the real world). Spoiler alert: It didn't work. Since it's a race condition. sometimes the task that dies is the one that does the request, so when the node respond, poof, closed message box.
This then led to another realization: doing the following is subject to the same race condition:
if (!tid.mbox.isClosed())
send(tid, ...);
Because essentially, the message box could have been closed between the check and the send.
This means we need to provide, at the MessageBox level, a way to ignore failures, as we need the check to happen after the lock is taken.
The text was updated successfully, but these errors were encountered:
Found when investigating this issue with Agora: bosagora/agora#796
Context
A few months ago we unintentionally (and carelessly, to be fair) did a subtle change: we transitioned from silently ignoring
put
on closed message boxes to a hard failure (assert
). See 814377a#diff-7e552a35c6e12cbbfb124195a3e45cd2R1126-R1127This is something we do want, as we're transitioning from message boxes to channels. In the channel model, calling
put
on a closed channel leads to a hard failure (panic
in Go).Today's bug
As I was trying to upgrade LocalRest, the test-suite started to fail. I quickly tracked it down to the aforementioned commit, then spent a few more hours understanding what was going on.
In Agora we have a few asynchronous tasks that perform recurrent actions on timers (e.g. catching-up).
Those tasks currently live independently, and cannot be stopped. In normal mode, they are just run by Vibe.d, while in LocalRest, they are a fiber that gets scheduled periodically.
This is all nice until one tries to shut down the node. Because they do periodic send, they will hit the assert (no one told them the connection was closed, heh), and that'll lead to weird crashes, because threads. I originally thought we needed to stop those tasks, but it's a client-side solution for a library-side problem, and a very un-elegant one.
Tentative solution
My original approach was simply to add handling code in the sender code to simulate a timeout (which is what one would most likely see in the real world). Spoiler alert: It didn't work. Since it's a race condition. sometimes the task that dies is the one that does the request, so when the node respond, poof, closed message box.
This then led to another realization: doing the following is subject to the same race condition:
Because essentially, the message box could have been closed between the check and the send.
This means we need to provide, at the
MessageBox
level, a way to ignore failures, as we need the check to happen after the lock is taken.The text was updated successfully, but these errors were encountered: