Skip to content

Cap exponential backoff in consume() at 5s#41

Merged
justin-reid merged 4 commits into
mainfrom
justin-reid/cap-consume-backoff
Jun 2, 2026
Merged

Cap exponential backoff in consume() at 5s#41
justin-reid merged 4 commits into
mainfrom
justin-reid/cap-consume-backoff

Conversation

@justin-reid
Copy link
Copy Markdown
Contributor

@justin-reid justin-reid commented May 28, 2026

TL;DR

Problem. Coordinators::RedisCoordinator#consume doubles its XREADGROUP BLOCK <ms> interval on every empty poll with no upper bound. Once doubling has gone on long enough (around 15 empty iterations from the 10 ms start), a single BLOCK call can last 5–10 minutes, during which the worker cannot re-check complete? / abort?. Any transient moment when complete? is momentarily false after the queue has functionally drained — pipelined XACKs not yet visible, a worker that died holding a claim and not yet reaped, a slow ack write on a high-latency Redis path — is enough to push the worker into one of those multi-minute blocks, where it then sits until the BLOCK expires or the platform kills the job. The unbounded BLOCK is the amplifier: without it, the next iteration would be a short poll, the worker would re-check on its own, and the same transient inconsistency would heal in milliseconds.

Fix. Cap the doubled value at MAX_BACKOFF = 5_000 ms. Extract the doubling into a private next_backoff method so it's unit-testable without a live Redis. Bounds the worst-case unresponsiveness to a single MAX_BACKOFF window while keeping the same exponential ramp-up for the common case.

Details

Coordinators::RedisCoordinator#consume doubles exponential_backoff on every empty iteration with exponential_backoff <<= 1 and no upper bound. That value is passed to redis.xreadgroup(... block: <ms>), i.e. Redis XREADGROUP ... BLOCK <ms>, which the client implements as poll() on the socket.

Starting from INITIAL_BACKOFF = 10 ms, after 15 consecutive empty iterations a single BLOCK call lasts ~5m27s; after 16, ~10m. The break if combined_results.complete? and break if combined_results.abort? checks only run after the BLOCK returns, so once a worker has entered a multi-minute BLOCK it cannot escape until the BLOCK expires or the platform kills the job.

Under the right conditions this can surface as a post-test teardown hang: Minitest reports N/N 100%, but bin/rails test then sits in ppoll() on the coordinator's Redis socket for several minutes until the build times out. We've observed this path in production. It doesn't happen on every run — it requires the transient inconsistency to land between two specific iterations — but the worst case it leads to is unbounded.

The fix caps the doubled value at MAX_BACKOFF = 5_000 ms (reached after ~9 iterations, ~10 s of cumulative empty polls). It bounds the worst case where a worker is unresponsive to complete? / abort? after the queue drains to a single MAX_BACKOFF window, while keeping the same exponential ramp-up behaviour for the common case.

The doubling logic is extracted into a private next_backoff method so it can be unit-tested without a live Redis. New tests cover the doubling, the clamp, and the bounded iteration count.

Testing

  • Added test/minitest/distributed/coordinators/redis_coordinator_test.rb covering: doubling below the cap, clamping at the cap, and the bounded-iteration count.

`Coordinators::RedisCoordinator#consume` doubled `exponential_backoff`
on every empty iteration with `exponential_backoff <<= 1` and no upper
bound. That value is passed to `redis.xreadgroup(... block: <ms>)`,
i.e. Redis `XREADGROUP ... BLOCK <ms>`, which the client implements
as `poll()` on the socket.

Starting from `INITIAL_BACKOFF = 10` ms, after 15 consecutive empty
iterations a single BLOCK call lasts ~5m27s; after 16, ~10m. The
`break if combined_results.complete?` / `abort?` checks only run
*after* the BLOCK returns, so once a worker is in a multi-minute
BLOCK it cannot escape until the BLOCK expires or the platform kills
the job.

In real CI runs this surfaces as a post-test "teardown hang":
Minitest reports `N/N 100%`, then `bin/rails test` sits in `ppoll()`
on the coordinator's Redis socket for 5-10 minutes until the build
times out. `complete?` is `acks == size`, and with pipelined XACKs
racing the progress reporter, the first post-100% iteration can find
`complete?` still false even when the queue is in fact drained. That
single missed check is enough to slip into the doubling spiral.

Cap the doubled value at `MAX_BACKOFF = 5_000` ms (reached after ~9
iterations, ~10 s of cumulative empty polls). This bounds the worst
case where a worker is unresponsive to `complete?`/`abort?` after
the queue drains to a single `MAX_BACKOFF` window, while still
keeping the same exponential ramp-up for the common case.

The doubling logic is extracted into a private `next_backoff` method
so it can be unit-tested without a live Redis. Behavior tests cover
the doubling, the clamp, and the bounded iteration count.

Concrete repro captured in shop/world support-core CI build 19541
(2026-05-27): tests reached 7163/7163 in 3m54s, then `bin/rails
test` hung in `do_sys_poll` for ~8 min with the coordinator Redis
socket still ESTABlished, memory flat at 1.5 GiB, leader thread state
`S (sleeping)`, syscall 271 (ppoll). Fingerprint matches this code
path exactly.
@justin-reid justin-reid force-pushed the justin-reid/cap-consume-backoff branch 3 times, most recently from b31c89f to bdd5250 Compare May 28, 2026 21:29
`tests = T.let([], T::Array[Minitest::Runnable])` followed by
`tests = if … elsif … end` triggers `Lint/UselessAssignment` (the
initial assignment is overwritten unconditionally). Rubocop's
autocorrect is to delete it, but the bare assignment was doing type
narrowing for Sorbet — removing it regresses 5 `NilClass`-inference
errors on subsequent `tests.size`/`tests.first` calls.

Wrap the conditional in `T.let(if … end, T::Array[Minitest::Runnable])`
instead, and replace the inner `T.let([], …)` branches with bare `[]`.
This preserves the type narrowing while clearing the lint offence.

The offence is pre-existing on `main` since 2020; surfaced here because
`.github/workflows/ruby.yml` runs `on: push` and main's last non-
dependabot push pre-dates a Rubocop version that flagged it. Fixed in
this PR because the lint job complains about the file this PR touches.

Co-authored-by: Claude <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>
@justin-reid justin-reid force-pushed the justin-reid/cap-consume-backoff branch from a1dc9a7 to 78e37f6 Compare May 28, 2026 22:16
@justin-reid
Copy link
Copy Markdown
Contributor Author

Heads-up on the red typecheck job: that failure is pre-existing on main, not introduced by this PR. The workflow runs on: push and main's last non-dependabot push pre-dates the Rubocop/Sorbet versions bundler now resolves, so the typecheck just hadn't run against current tooling until this branch.

The errors fall into four buckets, all unrelated to the backoff change:

  1. ~4360 sig-less-method errors across sorbet/rbi/gems/*.rbi (autogenerated files defaulting to # typed: strict).
  2. sorbet/rbi/minitest.rbi is # typed: strict but missing sigs for ~17 methods, and two existing sigs are wrong (assert_includes rejects String; assert_raises rejects blocks).
  3. T::Array[Module] on redis_coordinator.rb:289 (custom_middlewares) hits Sorbet 5046 because Module needs type args.
  4. prism-1.9.0's gem-shipped RBI references Prism::LexCompat::Result, a constant the gem removed in the same release.

Happy to clean these up in a separate PR so this one stays scoped to the actual fix.

🤖 AE · ✅ approved by @justin-reid

@justin-reid justin-reid marked this pull request as ready for review May 28, 2026 22:47
@justin-reid justin-reid requested a review from ChrisBr May 28, 2026 22:47
Copy link
Copy Markdown

@jose-shopify jose-shopify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a non-blocking nit, code looks good

Comment thread lib/minitest/distributed/coordinators/redis_coordinator.rb
justin-reid and others added 2 commits June 2, 2026 10:37
Address Jose's review nit on PR #41: the existing tests exercise the
private `next_backoff` helper in isolation, but they don't pin its
call site at line 256 of consume(). A revert to a bare
`exponential_backoff <<= 1` here would still pass the suite.

Add a test that stubs out `claim_stale_runnables`/`claim_fresh_runnables`/
`process_batch`/`cleanup` and a fake `combined_results`, then drives
`consume()` through ~25 idle iterations while capturing the `block:`
argument passed to `claim_fresh_runnables` each time. The assertion is
that the last five captured values are all exactly `MAX_BACKOFF` (5_000)
— if the call site were unbounded, those values would be growing powers
of two well past 5_000_000.

Verified by reverting the call site locally to `<<= 1`: the new test
fails with `[10485760, 20971520, 41943040, 83886080, 167772160]` instead
of `[5000] * 5`.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>
@justin-reid justin-reid merged commit e924c6f into main Jun 2, 2026
6 of 7 checks passed
justin-reid added a commit that referenced this pull request Jun 2, 2026
The `typecheck` workflow job has been red on `main` against current
Sorbet for some time. `.github/workflows/ruby.yml` runs `on: push`
and main's last non-dependabot push pre-dates the versions bundler
now resolves, so the job had not run against current tooling until
PR #41 forced it.

This commit clears the remaining typecheck errors:

- Add `sorbet/typed_overrides.yaml` downgrading the 17 autogenerated
  `sorbet/rbi/gems/*.rbi` files that are shipped at `# typed: strict` /
  `# typed: true` but omit sigs on many methods (~4360 sig-less-method
  errors). Editing the sigil in place would be wiped on the next
  `srb rbi gems` regeneration; the yaml + `--typed-override` in
  `sorbet/config` survives regen.

- Demote `sorbet/rbi/minitest.rbi` from `strict` to `true` (it omits
  sigs on ~17 hand-written methods) and fix two existing sigs that
  this repo's own tests exercise:
  - `assert_includes`'s `collection` is `T.untyped`, not
    `T::Enumerable[T.untyped]`. Minitest dispatches via `.include?`
    and accepts `String` (used by
    `redis_coordinator_integration_test.rb:362-364`).
  - `assert_raises` takes a block. The sig was missing `&blk`, which
    rejected `defined_runnable_test.rb:13`.

- Replace `T::Array[Module]` on the `custom_middlewares` accessor in
  `redis_coordinator.rb` with `T::Array[T.untyped]`. Sorbet 5046
  requires `Module` to carry type arguments; `T::Module[…]` would
  type-check but raises `NameError: uninitialized constant T::Module`
  at load time on `sorbet-runtime` 0.5.12443 (the version CI
  resolves), so the broader `T.untyped` is the runtime-safe choice.

- Add `--suppress-error-code=5002` to `sorbet/config`. The remaining
  error is `prism-1.9.0`'s gem-shipped RBI referencing
  `Prism::LexCompat::Result`, a constant the gem removed in the same
  release. The RBI is auto-loaded via the bundler-gem discovery path
  (not the filesystem walk subject to `--ignore`), so the project
  can't filter it by path. `bin/srb tc --isolate-error-code 5002`
  confirms this is the only such error in the repo, so suppression
  has zero false-negative risk today. When prism's upstream RBI is
  fixed (or `rubocop-ast` stops depending on prism), this line
  should be removed.

After these changes, `bin/srb tc` reports 0 errors and `bin/rubocop`
reports no offences. All 69 existing test runs / 337 assertions still
pass.

The `Lint/UselessAssignment` fix originally bundled with this PR
landed independently in #41 (`78e37f6`) and was dropped on rebase.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>
justin-reid added a commit that referenced this pull request Jun 2, 2026
The `typecheck` workflow job has been red on `main` against current
Sorbet for some time. `.github/workflows/ruby.yml` runs `on: push`
and main's last non-dependabot push pre-dates the versions bundler
now resolves, so the job had not run against current tooling until
PR #41 forced it.

This commit clears the remaining typecheck errors:

- Add `sorbet/typed_overrides.yaml` downgrading the 17 autogenerated
  `sorbet/rbi/gems/*.rbi` files that are shipped at `# typed: strict` /
  `# typed: true` but omit sigs on many methods (~4360 sig-less-method
  errors). Editing the sigil in place would be wiped on the next
  `srb rbi gems` regeneration; the yaml + `--typed-override` in
  `sorbet/config` survives regen.

- Demote `sorbet/rbi/minitest.rbi` from `strict` to `true` (it omits
  sigs on ~17 hand-written methods) and fix two existing sigs that
  this repo's own tests exercise:
  - `assert_includes`'s `collection` is `T.untyped`, not
    `T::Enumerable[T.untyped]`. Minitest dispatches via `.include?`
    and accepts `String` (used by
    `redis_coordinator_integration_test.rb:362-364`).
  - `assert_raises` takes a block. The sig was missing `&blk`, which
    rejected `defined_runnable_test.rb:13`.

- Replace `T::Array[Module]` on the `custom_middlewares` accessor in
  `redis_coordinator.rb` with `T::Array[T.untyped]`. Sorbet 5046
  requires `Module` to carry type arguments; `T::Module[…]` would
  type-check but raises `NameError: uninitialized constant T::Module`
  at load time on `sorbet-runtime` 0.5.12443 (the version CI
  resolves), so the broader `T.untyped` is the runtime-safe choice.

- Add `--suppress-error-code=5002` to `sorbet/config`. The remaining
  error is `prism-1.9.0`'s gem-shipped RBI referencing
  `Prism::LexCompat::Result`, a constant the gem removed in the same
  release. The RBI is auto-loaded via the bundler-gem discovery path
  (not the filesystem walk subject to `--ignore`), so the project
  can't filter it by path. `bin/srb tc --isolate-error-code 5002`
  confirms this is the only such error in the repo, so suppression
  has zero false-negative risk today. When prism's upstream RBI is
  fixed (or `rubocop-ast` stops depending on prism), this line
  should be removed.

After these changes, `bin/srb tc` reports 0 errors and `bin/rubocop`
reports no offences. All 69 existing test runs / 337 assertions still
pass.

The `Lint/UselessAssignment` fix originally bundled with this PR
landed independently in #41 (`78e37f6`) and was dropped on rebase.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>
justin-reid added a commit that referenced this pull request Jun 3, 2026
The `typecheck` workflow job has been red on `main` against current
Sorbet for some time. `.github/workflows/ruby.yml` runs `on: push`
and main's last non-dependabot push pre-dates the versions bundler
now resolves, so the job had not run against current tooling until
PR #41 forced it.

This commit clears the remaining typecheck errors:

- Add `sorbet/typed_overrides.yaml` downgrading the 17 autogenerated
  `sorbet/rbi/gems/*.rbi` files that are shipped at `# typed: strict` /
  `# typed: true` but omit sigs on many methods (~4360 sig-less-method
  errors). Editing the sigil in place would be wiped on the next
  `srb rbi gems` regeneration; the yaml + `--typed-override` in
  `sorbet/config` survives regen.

- Demote `sorbet/rbi/minitest.rbi` from `strict` to `true` (it omits
  sigs on ~17 hand-written methods) and fix two existing sigs that
  this repo's own tests exercise:
  - `assert_includes`'s `collection` is `T.untyped`, not
    `T::Enumerable[T.untyped]`. Minitest dispatches via `.include?`
    and accepts `String` (used by
    `redis_coordinator_integration_test.rb:362-364`).
  - `assert_raises` takes a block. The sig was missing `&blk`, which
    rejected `defined_runnable_test.rb:13`.

- Replace `T::Array[Module]` on the `custom_middlewares` accessor in
  `redis_coordinator.rb` with `T::Array[T.untyped]`. Sorbet 5046
  requires `Module` to carry type arguments; `T::Module[…]` would
  type-check but raises `NameError: uninitialized constant T::Module`
  at load time on `sorbet-runtime` 0.5.12443 (the version CI
  resolves), so the broader `T.untyped` is the runtime-safe choice.

- Add `--suppress-error-code=5002` to `sorbet/config`. The remaining
  error is `prism-1.9.0`'s gem-shipped RBI referencing
  `Prism::LexCompat::Result`, a constant the gem removed in the same
  release. The RBI is auto-loaded via the bundler-gem discovery path
  (not the filesystem walk subject to `--ignore`), so the project
  can't filter it by path. `bin/srb tc --isolate-error-code 5002`
  confirms this is the only such error in the repo, so suppression
  has zero false-negative risk today. When prism's upstream RBI is
  fixed (or `rubocop-ast` stops depending on prism), this line
  should be removed.

After these changes, `bin/srb tc` reports 0 errors and `bin/rubocop`
reports no offences. All 69 existing test runs / 337 assertions still
pass.

The `Lint/UselessAssignment` fix originally bundled with this PR
landed independently in #41 (`78e37f6`) and was dropped on rebase.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants