Skip to content

Clean up connection state on non-StandardError aborts#68

Merged
danmayer merged 2 commits intomainfrom
fiber_concurrency_bug
Apr 23, 2026
Merged

Clean up connection state on non-StandardError aborts#68
danmayer merged 2 commits intomainfrom
fiber_concurrency_bug

Conversation

@danmayer
Copy link
Copy Markdown

@danmayer danmayer commented Apr 22, 2026

Summary

Under heavy async load, we've been seeing parse errors and unexpected values from Dalli that look like stale data on the socket. Root cause: scheduler-driven fiber cancellation signals (Async::Stop, Polyphony::Terminate, anything that inherits from Exception but not StandardError) sail past the rescue chain in Base#request, so close / abort_request! never run. The connection is left with @request_in_progress == true and any partial response bytes remain on the wire for the next caller to misread.

The bug

# lib/dalli/protocol/base.rb (before)
def request(opkey, *args)
  verify_state(opkey)

  begin
    @connection_manager.start_request!
    response = send(opkey, *args)
    @connection_manager.finish_request! unless opkey == :pipelined_get
    response
  rescue Dalli::MarshalError => e
    log_marshal_err(args.first, e); raise
  rescue Dalli::DalliError
    raise
  rescue StandardError => e
    log_unexpected_err(e); close; raise
  end
end

Async::Stop < Exception (intentionally, so generic rescues don't swallow cancellation). rescue StandardError doesn't catch it → no close, no abort_request!. Quick REPL confirmation:

StandardError > Async::Stop  # => nil
Exception > Async::Stop      # => true

Next caller then either:

  • Reads stale bytes off the shared socket → parse error / wrong value, or
  • Has confirm_ready! tear down the dirty socket and reconnect → reconnect storms amplify the damage under load.

The fix

Add an ensure that cleans up on any abnormal exit:

ensure
  # Catches non-StandardError exceptions (e.g. Async::Stop) that sail
  # past the rescues above.  $ERROR_INFO is non-nil only when an
  # exception is unwinding, so pipelined_get's happy path — which
  # intentionally leaves @request_in_progress = true until the
  # caller drains the responses — isn't torn down.
  close if $ERROR_INFO && @connection_manager.request_in_progress?
end

Why ensure + $ERROR_INFO and not rescue Async::Stop

rescue Async::Stop ensure + $ERROR_INFO
Extra dep on async gem yes no
Covers other schedulers (Polyphony, custom Fiber.scheduler) no yes
Covers SignalException / Ctrl-C mid-request no yes
Respects Async's "don't catch me" contract no (catches) yes (just cleans up, re-raises)
Gated on actual dirty state no yes

Why the request_in_progress? guard matters

Two invariants have to hold together:

  1. $ERROR_INFO — only clean up on abnormal exit. Without this, pipelined_get's happy path (which intentionally leaves request_in_progress == true for the response drain) gets torn down.
  2. request_in_progress? — only close if the state is actually dirty. Without this, a rare scheduler-injected exception after finish_request! but before the implicit return would close a healthy socket, forcing an unnecessary reconnect. Under the heavy-load conditions that triggered this investigation, that's exactly the pattern that amplifies.

Test

test/integration/test_fiber_concurrency.rb injects a FiberCancellation < Exception (stand-in for Async::Stop, so we don't pull the async gem as a test dep) into the response read and asserts:

  • The exception propagates (not swallowed).
  • @request_in_progress? is false after (connection is clean).
  • close was called at least once (cleanup ran).

The yield-based fiber interleave race is intentionally not covered by a test — in the default threadsafe: true config, Dalli::Threadsafe#request wraps in Monitor.synchronize and MRI's Monitor is fiber-aware, raising ThreadError: deadlock rather than corrupting state.

Test plan

  • bundle exec ruby -Ilib:test test/integration/test_fiber_concurrency.rb — new test passes
  • bundle exec rake test — full suite passes (no regressions from the new ensure)
  • bundle exec rubocop — clean

Async::Stop (and other scheduler cancellation signals) inherit from
Exception but not StandardError, so they bypass the rescue chain in
Base#request entirely.  Without an ensure clause, close and
abort_request! never run — @request_in_progress stays true and partial
response bytes are left on the socket for a subsequent caller to
misread, surfacing as parse errors or wrong values under load.

Add an ensure gated on $ERROR_INFO && request_in_progress? so the
cleanup runs on any abnormal exit without disturbing pipelined_get's
happy path (which intentionally leaves request_in_progress=true until
the caller drains responses).

Finalize the fiber concurrency test to assert the connection is left
clean after a non-StandardError abort.  The yield-based interleave race
is separately covered by Dalli::Threadsafe's fiber-aware Monitor in the
default config, so no test is needed for that.
@danmayer
Copy link
Copy Markdown
Author

prior to the fix the test does fail as we would expect

  1) Failure:
fiber concurrency#test_0001_does not leave the connection dirty when a non-StandardError aborts a request [test/integration/test_fiber_concurrency.rb:54]:
connection left with @request_in_progress == true after non-StandardError abort.
Expected #<Dalli::Protocol::ConnectionManager:0x00000001256abff0 @hostname="127.0.0.1", @port=21345, @socket_type=:tcp, @options={down_retry_delay: 30, socket_timeout: 1, socket_max_failures: 2, socket_failure_delay: 0.1, keepalive: true}, @request_in_progress=true, @sock=#<Dalli::Socket::TCP:fd 7, AF_INET, 127.0.0.1, 54031>, @pid=39877, @fail_count=0, @down_at=nil, @last_down_at=nil, @msg=nil, @error=nil> to not be request_in_progress?.

1 runs, 2 assertions, 1 failures, 0 errors, 0 skips

@danmayer danmayer requested a review from nherson April 22, 2026 17:49
Copy link
Copy Markdown

@drinkbeer drinkbeer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:nice This will greatly reduce the errors caused by unexpected dirty socket.

Do you have some metrics/logs about the errors/exceptions caused by unexpected dirty socket? How often does this happen? I guess it will happen every time a Async::Stop was sent or some other network exceptions occurred.

NVM. I found the original thread and there are error logs.

Comment thread lib/dalli/protocol/base.rb
@danmayer danmayer merged commit 87c218a into main Apr 23, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants