Clean up connection state on non-StandardError aborts by danmayer · Pull Request #68 · Shopify/dalli

danmayer · 2026-04-22T17:17:23Z

Summary

Under heavy async load, we've been seeing parse errors and unexpected values from Dalli that look like stale data on the socket. Root cause: scheduler-driven fiber cancellation signals (Async::Stop, Polyphony::Terminate, anything that inherits from Exception but not StandardError) sail past the rescue chain in Base#request, so close / abort_request! never run. The connection is left with @request_in_progress == true and any partial response bytes remain on the wire for the next caller to misread.

The bug

# lib/dalli/protocol/base.rb (before)
def request(opkey, *args)
  verify_state(opkey)

  begin
    @connection_manager.start_request!
    response = send(opkey, *args)
    @connection_manager.finish_request! unless opkey == :pipelined_get
    response
  rescue Dalli::MarshalError => e
    log_marshal_err(args.first, e); raise
  rescue Dalli::DalliError
    raise
  rescue StandardError => e
    log_unexpected_err(e); close; raise
  end
end

Async::Stop < Exception (intentionally, so generic rescues don't swallow cancellation). rescue StandardError doesn't catch it → no close, no abort_request!. Quick REPL confirmation:

StandardError > Async::Stop  # => nil
Exception > Async::Stop      # => true

Next caller then either:

Reads stale bytes off the shared socket → parse error / wrong value, or
Has confirm_ready! tear down the dirty socket and reconnect → reconnect storms amplify the damage under load.

The fix

Add an ensure that cleans up on any abnormal exit:

ensure
  # Catches non-StandardError exceptions (e.g. Async::Stop) that sail
  # past the rescues above.  $ERROR_INFO is non-nil only when an
  # exception is unwinding, so pipelined_get's happy path — which
  # intentionally leaves @request_in_progress = true until the
  # caller drains the responses — isn't torn down.
  close if $ERROR_INFO && @connection_manager.request_in_progress?
end

Why `ensure` + `$ERROR_INFO` and not `rescue Async::Stop`

	`rescue Async::Stop`	`ensure` + `$ERROR_INFO`
Extra dep on `async` gem	yes	no
Covers other schedulers (Polyphony, custom Fiber.scheduler)	no	yes
Covers `SignalException` / Ctrl-C mid-request	no	yes
Respects Async's "don't catch me" contract	no (catches)	yes (just cleans up, re-raises)
Gated on actual dirty state	no	yes

Why the `request_in_progress?` guard matters

Two invariants have to hold together:

$ERROR_INFO — only clean up on abnormal exit. Without this, pipelined_get's happy path (which intentionally leaves request_in_progress == true for the response drain) gets torn down.
request_in_progress? — only close if the state is actually dirty. Without this, a rare scheduler-injected exception after finish_request! but before the implicit return would close a healthy socket, forcing an unnecessary reconnect. Under the heavy-load conditions that triggered this investigation, that's exactly the pattern that amplifies.

Test

test/integration/test_fiber_concurrency.rb injects a FiberCancellation < Exception (stand-in for Async::Stop, so we don't pull the async gem as a test dep) into the response read and asserts:

The exception propagates (not swallowed).
@request_in_progress? is false after (connection is clean).
close was called at least once (cleanup ran).

The yield-based fiber interleave race is intentionally not covered by a test — in the default threadsafe: true config, Dalli::Threadsafe#request wraps in Monitor.synchronize and MRI's Monitor is fiber-aware, raising ThreadError: deadlock rather than corrupting state.

Test plan

bundle exec ruby -Ilib:test test/integration/test_fiber_concurrency.rb — new test passes
bundle exec rake test — full suite passes (no regressions from the new ensure)
bundle exec rubocop — clean

Async::Stop (and other scheduler cancellation signals) inherit from Exception but not StandardError, so they bypass the rescue chain in Base#request entirely. Without an ensure clause, close and abort_request! never run — @request_in_progress stays true and partial response bytes are left on the socket for a subsequent caller to misread, surfacing as parse errors or wrong values under load. Add an ensure gated on $ERROR_INFO && request_in_progress? so the cleanup runs on any abnormal exit without disturbing pipelined_get's happy path (which intentionally leaves request_in_progress=true until the caller drains responses). Finalize the fiber concurrency test to assert the connection is left clean after a non-StandardError abort. The yield-based interleave race is separately covered by Dalli::Threadsafe's fiber-aware Monitor in the default config, so no test is needed for that.

danmayer · 2026-04-22T17:31:00Z

prior to the fix the test does fail as we would expect

  1) Failure:
fiber concurrency#test_0001_does not leave the connection dirty when a non-StandardError aborts a request [test/integration/test_fiber_concurrency.rb:54]:
connection left with @request_in_progress == true after non-StandardError abort.
Expected #<Dalli::Protocol::ConnectionManager:0x00000001256abff0 @hostname="127.0.0.1", @port=21345, @socket_type=:tcp, @options={down_retry_delay: 30, socket_timeout: 1, socket_max_failures: 2, socket_failure_delay: 0.1, keepalive: true}, @request_in_progress=true, @sock=#<Dalli::Socket::TCP:fd 7, AF_INET, 127.0.0.1, 54031>, @pid=39877, @fail_count=0, @down_at=nil, @last_down_at=nil, @msg=nil, @error=nil> to not be request_in_progress?.

1 runs, 2 assertions, 1 failures, 0 errors, 0 skips

drinkbeer

:nice This will greatly reduce the errors caused by unexpected dirty socket.

Do you have some metrics/logs about the errors/exceptions caused by unexpected dirty socket? How often does this happen? I guess it will happen every time a Async::Stop was sent or some other network exceptions occurred.

NVM. I found the original thread and there are error logs.

danmayer added 2 commits April 22, 2026 10:42

verify fiber issue on request error

b706515

danmayer requested a review from nherson April 22, 2026 17:49

nherson approved these changes Apr 22, 2026

View reviewed changes

drinkbeer approved these changes Apr 23, 2026

View reviewed changes

Comment thread lib/dalli/protocol/base.rb

danmayer merged commit 87c218a into main Apr 23, 2026
16 checks passed

This was referenced Apr 23, 2026

Harden ConnectionManager#close against non-StandardError during @sock.close #69

Merged

Use local flag instead of $ERROR_INFO to gate ensure-block close #71

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up connection state on non-StandardError aborts#68

Clean up connection state on non-StandardError aborts#68
danmayer merged 2 commits intomainfrom
fiber_concurrency_bug

danmayer commented Apr 22, 2026 •

edited

Loading

Uh oh!

danmayer commented Apr 22, 2026

Uh oh!

drinkbeer left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

danmayer commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The bug

The fix

Why ensure + $ERROR_INFO and not rescue Async::Stop

Why the request_in_progress? guard matters

Test

Test plan

Uh oh!

danmayer commented Apr 22, 2026

Uh oh!

drinkbeer left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danmayer commented Apr 22, 2026 •

edited

Loading

Why `ensure` + `$ERROR_INFO` and not `rescue Async::Stop`

Why the `request_in_progress?` guard matters

drinkbeer left a comment •

edited

Loading