Cap exponential backoff in consume() at 5s by justin-reid · Pull Request #41 · Shopify/minitest-distributed

justin-reid · 2026-05-28T20:10:36Z

TL;DR

Problem. Coordinators::RedisCoordinator#consume doubles its XREADGROUP BLOCK <ms> interval on every empty poll with no upper bound. Once doubling has gone on long enough (around 15 empty iterations from the 10 ms start), a single BLOCK call can last 5–10 minutes, during which the worker cannot re-check complete? / abort?. Any transient moment when complete? is momentarily false after the queue has functionally drained — pipelined XACKs not yet visible, a worker that died holding a claim and not yet reaped, a slow ack write on a high-latency Redis path — is enough to push the worker into one of those multi-minute blocks, where it then sits until the BLOCK expires or the platform kills the job. The unbounded BLOCK is the amplifier: without it, the next iteration would be a short poll, the worker would re-check on its own, and the same transient inconsistency would heal in milliseconds.

Fix. Cap the doubled value at MAX_BACKOFF = 5_000 ms. Extract the doubling into a private next_backoff method so it's unit-testable without a live Redis. This bounds the worst-case unresponsiveness to a single MAX_BACKOFF window while keeping the same exponential ramp-up for the common case.

Details

Coordinators::RedisCoordinator#consume doubles exponential_backoff on every empty iteration with exponential_backoff <<= 1 and no upper bound. That value is passed to redis.xreadgroup(... block: <ms>), i.e. Redis XREADGROUP ... BLOCK <ms>, which the client implements as poll() on the socket.

Starting from INITIAL_BACKOFF = 10 ms, after 15 consecutive empty iterations a single BLOCK call lasts ~5m27s; after 16, ~10m55s. The break if combined_results.complete? and break if combined_results.abort? checks only run after the BLOCK returns, so once a worker has entered a multi-minute BLOCK it cannot escape until the BLOCK expires or the platform kills the job.
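The figures above can be checked with a few lines of Ruby. This is only a sketch of the arithmetic: the helper names `unbounded_block_ms` and `format_ms` are illustrative and do not exist in the gem; only `INITIAL_BACKOFF = 10` and the `<<` doubling come from the description.

```ruby
INITIAL_BACKOFF = 10 # ms, as stated in the PR description

# BLOCK interval after n consecutive empty iterations of unbounded doubling.
# (Hypothetical helper, used here only to reproduce the arithmetic.)
def unbounded_block_ms(empty_iterations)
  INITIAL_BACKOFF << empty_iterations
end

# Render a millisecond count as "XmYYs", truncating to whole seconds.
def format_ms(ms)
  minutes, remainder = ms.divmod(60_000)
  format("%dm%02ds", minutes, remainder / 1000)
end

[15, 16].each do |n|
  ms = unbounded_block_ms(n)
  puts "after #{n} empty iterations: #{ms} ms (#{format_ms(ms)})"
end
```

Each further empty poll doubles the wait again, which is why the worst case is unbounded rather than merely long.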

Under the right conditions this can surface as a post-test teardown hang: Minitest reports N/N 100%, but bin/rails test then sits in ppoll() on the coordinator's Redis socket for several minutes until the build times out. We've observed this path in production. It doesn't happen on every run — it requires the transient inconsistency to land between two specific iterations — but the worst case it leads to is unbounded.

The fix caps the doubled value at MAX_BACKOFF = 5_000 ms (reached after ~9 iterations, ~10 s of cumulative empty polls). After the queue drains, a worker can now be unresponsive to complete? / abort? for at most a single MAX_BACKOFF window, while the exponential ramp-up for the common case stays the same.

The doubling logic is extracted into a private next_backoff method so it can be unit-tested without a live Redis. New tests cover the doubling, the clamp, and the bounded iteration count.
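A minimal sketch of what such a helper could look like, assuming it takes the current interval and returns the next one. The module name `BackoffSketch`, the `ramp` helper, and the method signature are assumptions for illustration; the real `next_backoff` is a private method on the coordinator and may be shaped differently.

```ruby
# Hypothetical stand-in for the coordinator's private helper.
module BackoffSketch
  INITIAL_BACKOFF = 10 # ms
  MAX_BACKOFF = 5_000  # ms

  # Double the current interval, clamped to MAX_BACKOFF.
  def self.next_backoff(current)
    [current << 1, MAX_BACKOFF].min
  end

  # Successive BLOCK intervals, up to and including the first capped one.
  def self.ramp
    intervals = [INITIAL_BACKOFF]
    intervals << next_backoff(intervals.last) until intervals.last == MAX_BACKOFF
    intervals
  end
end

ramp = BackoffSketch.ramp
puts ramp.inspect
puts "doublings to reach the cap: #{ramp.size - 1}"
puts "cumulative wait including the first capped poll: #{ramp.sum} ms"
```

Under these assumptions the cap is hit on the ninth doubling (2_560 would become 5_120 and is clamped to 5_000), and the polls up to that point add up to roughly ten seconds, in line with the numbers quoted above.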

Testing

Added test/minitest/distributed/coordinators/redis_coordinator_test.rb covering: doubling below the cap, clamping at the cap, and the bounded-iteration count.
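The three cases listed above might look roughly like the following Minitest sketch. It is not the test file from this PR: `BackoffSketch` is a hypothetical stand-in for the private helper, and the test names and the 64-step safety limit are assumptions.

```ruby
require "minitest/autorun"

# Hypothetical stand-in for the private helper; the real method lives on
# Coordinators::RedisCoordinator and its exact signature is an assumption.
module BackoffSketch
  INITIAL_BACKOFF = 10 # ms
  MAX_BACKOFF = 5_000  # ms

  def self.next_backoff(current)
    [current << 1, MAX_BACKOFF].min
  end
end

class NextBackoffSketchTest < Minitest::Test
  def test_doubles_below_the_cap
    assert_equal 20, BackoffSketch.next_backoff(10)
    assert_equal 2_560, BackoffSketch.next_backoff(1_280)
  end

  def test_clamps_at_the_cap
    assert_equal BackoffSketch::MAX_BACKOFF, BackoffSketch.next_backoff(2_560)
    assert_equal BackoffSketch::MAX_BACKOFF, BackoffSketch.next_backoff(5_000)
  end

  def test_reaches_the_cap_in_a_bounded_number_of_steps
    backoff = BackoffSketch::INITIAL_BACKOFF
    steps = 0
    until backoff == BackoffSketch::MAX_BACKOFF
      backoff = BackoffSketch.next_backoff(backoff)
      steps += 1
      flunk "backoff never reached the cap" if steps > 64
    end
    assert_equal 9, steps
  end
end
```

Because the helper is a pure function of the current interval, none of these cases needs a Redis connection.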

`Coordinators::RedisCoordinator#consume` doubled `exponential_backoff` on every empty iteration with `exponential_backoff <<= 1` and no upper bound. That value is passed to `redis.xreadgroup(... block: <ms>)`, i.e. Redis `XREADGROUP ... BLOCK <ms>`, which the client implements as `poll()` on the socket.

Starting from `INITIAL_BACKOFF = 10` ms, after 15 consecutive empty iterations a single BLOCK call lasts ~5m27s; after 16, ~10m55s. The `break if combined_results.complete?` / `abort?` checks only run *after* the BLOCK returns, so once a worker is in a multi-minute BLOCK it cannot escape until the BLOCK expires or the platform kills the job.

In real CI runs this surfaces as a post-test "teardown hang": Minitest reports `N/N 100%`, then `bin/rails test` sits in `ppoll()` on the coordinator's Redis socket for 5-10 minutes until the build times out. `complete?` is `acks == size`, and with pipelined XACKs racing the progress reporter, the first post-100% iteration can find `complete?` still false even when the queue is in fact drained. That single missed check is enough to slip into the doubling spiral.

Cap the doubled value at `MAX_BACKOFF = 5_000` ms (reached after ~9 iterations, ~10 s of cumulative empty polls). After the queue drains, a worker can be unresponsive to `complete?`/`abort?` for at most a single `MAX_BACKOFF` window, while the exponential ramp-up for the common case stays the same.

The doubling logic is extracted into a private `next_backoff` method so it can be unit-tested without a live Redis. Behavior tests cover the doubling, the clamp, and the bounded iteration count.

Concrete repro captured in shop/world support-core CI build 19541 (2026-05-27): tests reached 7163/7163 in 3m54s, then `bin/rails test` hung in `do_sys_poll` for ~8 min with the coordinator Redis socket still ESTABLISHED, memory flat at 1.5 GiB, leader thread state `S (sleeping)`, syscall 271 (ppoll). Fingerprint matches this code path exactly.

`tests = T.let([], T::Array[Minitest::Runnable])` followed by `tests = if … elsif … end` triggers `Lint/UselessAssignment` (the initial assignment is overwritten unconditionally). Rubocop's autocorrect is to delete it, but the bare assignment was doing type narrowing for Sorbet; removing it regresses 5 `NilClass`-inference errors on subsequent `tests.size`/`tests.first` calls.

Wrap the conditional in `T.let(if … end, T::Array[Minitest::Runnable])` instead, and replace the inner `T.let([], …)` branches with bare `[]`. This preserves the type narrowing while clearing the lint offence.

The offence is pre-existing on `main` since 2020; surfaced here because `.github/workflows/ruby.yml` runs `on: push` and main's last non-dependabot push pre-dates a Rubocop version that flagged it. Fixed in this PR because the lint job complains about the file this PR touches.

Co-authored-by: Claude <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>

justin-reid · 2026-05-28T22:36:15Z

Heads-up on the red typecheck job: that failure is pre-existing on main, not introduced by this PR. The workflow runs on: push and main's last non-dependabot push pre-dates the Rubocop/Sorbet versions bundler now resolves, so the typecheck just hadn't run against current tooling until this branch.

The errors fall into four buckets, all unrelated to the backoff change:

- ~4360 sig-less-method errors across sorbet/rbi/gems/*.rbi (autogenerated files defaulting to # typed: strict).
- sorbet/rbi/minitest.rbi is # typed: strict but missing sigs for ~17 methods, and two existing sigs are wrong (assert_includes rejects String; assert_raises rejects blocks).
- T::Array[Module] on redis_coordinator.rb:289 (custom_middlewares) hits Sorbet 5046 because Module needs type args.
- prism-1.9.0's gem-shipped RBI references Prism::LexCompat::Result, a constant the gem removed in the same release.

Happy to clean these up in a separate PR so this one stays scoped to the actual fix.

_{🤖 AE · ✅ approved by @justin-reid}

jose-shopify

left a non-blocking nit, code looks good

Address Jose's review nit on PR #41: the existing tests exercise the private `next_backoff` helper in isolation, but they don't pin its call site at line 256 of consume(). A revert to a bare `exponential_backoff <<= 1` here would still pass the suite.

Add a test that stubs out `claim_stale_runnables`/`claim_fresh_runnables`/`process_batch`/`cleanup` and a fake `combined_results`, then drives `consume()` through ~25 idle iterations while capturing the `block:` argument passed to `claim_fresh_runnables` each time. The assertion is that the last five captured values are all exactly `MAX_BACKOFF` (5_000); if the call site were unbounded, those values would keep doubling well past 5_000_000.

Verified by reverting the call site locally to `<<= 1`: the new test fails with `[10485760, 20971520, 41943040, 83886080, 167772160]` instead of `[5000] * 5`.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>
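The idea behind the call-site test can be sketched without the real coordinator. Everything below is a simplified, hypothetical model: `ConsumeSketch`, its constructor argument, and the stubbed `complete?` are invented for illustration, and the non-empty claim path of the real `consume()` is left out entirely. Only the names `claim_fresh_runnables`, `next_backoff`, `INITIAL_BACKOFF`, and `MAX_BACKOFF` are taken from the commit message.

```ruby
# Hypothetical model of the consume loop; only the backoff wiring is shown.
class ConsumeSketch
  INITIAL_BACKOFF = 10 # ms
  MAX_BACKOFF = 5_000  # ms

  attr_reader :captured_blocks

  def initialize(idle_iterations:)
    @idle_iterations = idle_iterations
    @captured_blocks = []
  end

  # Poll until the fake completion check flips, backing off after every
  # empty claim. The call site under test is the next_backoff line.
  def consume
    backoff = INITIAL_BACKOFF
    until complete?
      claim_fresh_runnables(block: backoff)
      backoff = next_backoff(backoff)
    end
  end

  private

  # Fake completion check: report done after the requested number of polls.
  def complete?
    @captured_blocks.size >= @idle_iterations
  end

  # Stub: record the block: argument and always return an empty batch.
  def claim_fresh_runnables(block:)
    @captured_blocks << block
    []
  end

  def next_backoff(current)
    [current << 1, MAX_BACKOFF].min
  end
end

sketch = ConsumeSketch.new(idle_iterations: 25)
sketch.consume
p sketch.captured_blocks.last(5)
```

Swapping the `next_backoff` call for a bare `backoff <<= 1` in this model makes the last five captured values `10 << 20` through `10 << 24`, which are exactly the numbers quoted in the failing run above.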

The `typecheck` workflow job has been red on `main` against current Sorbet for some time. `.github/workflows/ruby.yml` runs `on: push` and main's last non-dependabot push pre-dates the versions bundler now resolves, so the job had not run against current tooling until PR #41 forced it.

This commit clears the remaining typecheck errors:

- Add `sorbet/typed_overrides.yaml` downgrading the 17 autogenerated `sorbet/rbi/gems/*.rbi` files that are shipped at `# typed: strict` / `# typed: true` but omit sigs on many methods (~4360 sig-less-method errors). Editing the sigil in place would be wiped on the next `srb rbi gems` regeneration; the yaml + `--typed-override` in `sorbet/config` survives regen.
- Demote `sorbet/rbi/minitest.rbi` from `strict` to `true` (it omits sigs on ~17 hand-written methods) and fix two existing sigs that this repo's own tests exercise:
  - `assert_includes`'s `collection` is `T.untyped`, not `T::Enumerable[T.untyped]`. Minitest dispatches via `.include?` and accepts `String` (used by `redis_coordinator_integration_test.rb:362-364`).
  - `assert_raises` takes a block. The sig was missing `&blk`, which rejected `defined_runnable_test.rb:13`.
- Replace `T::Array[Module]` on the `custom_middlewares` accessor in `redis_coordinator.rb` with `T::Array[T.untyped]`. Sorbet 5046 requires `Module` to carry type arguments; `T::Module[…]` would type-check but raises `NameError: uninitialized constant T::Module` at load time on `sorbet-runtime` 0.5.12443 (the version CI resolves), so the broader `T.untyped` is the runtime-safe choice.
- Add `--suppress-error-code=5002` to `sorbet/config`. The remaining error is `prism-1.9.0`'s gem-shipped RBI referencing `Prism::LexCompat::Result`, a constant the gem removed in the same release. The RBI is auto-loaded via the bundler-gem discovery path (not the filesystem walk subject to `--ignore`), so the project can't filter it by path. `bin/srb tc --isolate-error-code 5002` confirms this is the only such error in the repo, so suppression has zero false-negative risk today. When prism's upstream RBI is fixed (or `rubocop-ast` stops depending on prism), this line should be removed.

After these changes, `bin/srb tc` reports 0 errors and `bin/rubocop` reports no offences. All 69 existing test runs / 337 assertions still pass.

The `Lint/UselessAssignment` fix originally bundled with this PR landed independently in #41 (`78e37f6`) and was dropped on rebase.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Orchestrated-by: ae <noreply@shopify.com>

justin-reid force-pushed the justin-reid/cap-consume-backoff branch 3 times, most recently from b31c89f to bdd5250 Compare May 28, 2026 21:29

justin-reid force-pushed the justin-reid/cap-consume-backoff branch from a1dc9a7 to 78e37f6 Compare May 28, 2026 22:16

justin-reid mentioned this pull request May 28, 2026

Clean up CI rot: sorbet and vendored RBI noise #42

Open

justin-reid marked this pull request as ready for review May 28, 2026 22:47

justin-reid requested a review from ChrisBr May 28, 2026 22:47

jose-shopify approved these changes Jun 1, 2026


Comment thread lib/minitest/distributed/coordinators/redis_coordinator.rb

justin-reid and others added 2 commits June 2, 2026 10:37

Bump version

b1fb717

justin-reid merged commit e924c6f into main Jun 2, 2026
6 of 7 checks passed

shopify-shipit Bot temporarily deployed to rubygems June 2, 2026 20:51 Inactive

justin-reid merged 4 commits into main from justin-reid/cap-consume-backoff
