Clean up connection state on non-StandardError aborts #68
Merged
Conversation
`Async::Stop` (and other scheduler cancellation signals) inherit from `Exception` but not `StandardError`, so they bypass the rescue chain in `Base#request` entirely. Without an `ensure` clause, `close` and `abort_request!` never run — `@request_in_progress` stays true and partial response bytes are left on the socket for a subsequent caller to misread, surfacing as parse errors or wrong values under load.

Add an `ensure` gated on `$ERROR_INFO && request_in_progress?` so the cleanup runs on any abnormal exit without disturbing `pipelined_get`'s happy path (which intentionally leaves `request_in_progress = true` until the caller drains responses).

Finalize the fiber concurrency test to assert the connection is left clean after a non-StandardError abort. The yield-based interleave race is separately covered by `Dalli::Threadsafe`'s fiber-aware `Monitor` in the default config, so no test is needed for that.
Author
prior to the fix the test does fail as we would expect
nherson
approved these changes
Apr 22, 2026
drinkbeer
approved these changes
Apr 23, 2026
Nice! This will greatly reduce the errors caused by an unexpectedly dirty socket.
Do you have some metrics/logs about the errors/exceptions caused by an unexpectedly dirty socket? How often does this happen? I guess it will happen every time an Async::Stop is sent or some other network exception occurs.
NVM. I found the original thread and there are error logs.
This was referenced Apr 23, 2026
Summary
Under heavy async load, we've been seeing parse errors and unexpected values from Dalli that look like stale data on the socket. Root cause: scheduler-driven fiber cancellation signals (`Async::Stop`, `Polyphony::Terminate`, anything that inherits from `Exception` but not `StandardError`) sail past the rescue chain in `Base#request`, so `close` / `abort_request!` never run. The connection is left with `@request_in_progress == true` and any partial response bytes remain on the wire for the next caller to misread.
The bug

`Async::Stop < Exception` (intentionally, so generic rescues don't swallow cancellation). `rescue StandardError` doesn't catch it → no `close`, no `abort_request!`.

The next caller then either:

- misreads the leftover bytes as its own response (parse errors / wrong values), or
- has `confirm_ready!` tear down the dirty socket and reconnect → reconnect storms amplify the damage under load.

Quick REPL confirmation:
The fix

Add an `ensure` that cleans up on any abnormal exit:
Why `ensure` + `$ERROR_INFO` and not `rescue Async::Stop`

`rescue Async::Stop` would couple the cleanup to the `async` gem; `ensure` + `$ERROR_INFO` covers any abnormal exit, including `SignalException` / Ctrl-C mid-request.

Why the `request_in_progress?` guard matters

Two invariants have to hold together:
- `$ERROR_INFO` — only clean up on abnormal exit. Without this, `pipelined_get`'s happy path (which intentionally leaves `request_in_progress == true` for the response drain) gets torn down.
- `request_in_progress?` — only close if the state is actually dirty. Without this, a rare scheduler-injected exception after `finish_request!` but before the implicit return would close a healthy socket, forcing an unnecessary reconnect. Under the heavy-load conditions that triggered this investigation, that's exactly the pattern that amplifies.
Test

`test/integration/test_fiber_concurrency.rb` injects a `FiberCancellation < Exception` (stand-in for `Async::Stop`, so we don't pull the `async` gem in as a test dep) into the response read and asserts:

- `request_in_progress?` is false afterwards (connection is clean).
- `close` was called at least once (cleanup ran).

The yield-based fiber interleave race is intentionally not covered by a test — in the default `threadsafe: true` config, `Dalli::Threadsafe#request` wraps in `Monitor.synchronize`, and MRI's `Monitor` is fiber-aware, raising `ThreadError: deadlock` rather than corrupting state.
Test plan

- `bundle exec ruby -Ilib:test test/integration/test_fiber_concurrency.rb` — new test passes
- `bundle exec rake test` — full suite passes (no regressions from the new `ensure`)
- `bundle exec rubocop` — clean