Cap exponential backoff in consume() at 5s#41
Merged
Conversation
`Coordinators::RedisCoordinator#consume` doubled `exponential_backoff` on every empty iteration with `exponential_backoff <<= 1` and no upper bound. That value is passed to `redis.xreadgroup(... block: <ms>)`, i.e. Redis `XREADGROUP ... BLOCK <ms>`, which the client implements as `poll()` on the socket. Starting from `INITIAL_BACKOFF = 10` ms, after 15 consecutive empty iterations a single BLOCK call lasts ~5m27s; after 16, ~10m. The `break if combined_results.complete?` / `abort?` checks only run *after* the BLOCK returns, so once a worker is in a multi-minute BLOCK it cannot escape until the BLOCK expires or the platform kills the job. In real CI runs this surfaces as a post-test "teardown hang": Minitest reports `N/N 100%`, then `bin/rails test` sits in `ppoll()` on the coordinator's Redis socket for 5-10 minutes until the build times out. `complete?` is `acks == size`, and with pipelined XACKs racing the progress reporter, the first post-100% iteration can find `complete?` still false even when the queue is in fact drained. That single missed check is enough to slip into the doubling spiral. Cap the doubled value at `MAX_BACKOFF = 5_000` ms (reached after ~9 iterations, ~10 s of cumulative empty polls). This bounds the worst case where a worker is unresponsive to `complete?`/`abort?` after the queue drains to a single `MAX_BACKOFF` window, while still keeping the same exponential ramp-up for the common case. The doubling logic is extracted into a private `next_backoff` method so it can be unit-tested without a live Redis. Behavior tests cover the doubling, the clamp, and the bounded iteration count. Concrete repro captured in shop/world support-core CI build 19541 (2026-05-27): tests reached 7163/7163 in 3m54s, then `bin/rails test` hung in `do_sys_poll` for ~8 min with the coordinator Redis socket still ESTABlished, memory flat at 1.5 GiB, leader thread state `S (sleeping)`, syscall 271 (ppoll). Fingerprint matches this code path exactly.
b31c89f to
bdd5250
Compare
`tests = T.let([], T::Array[Minitest::Runnable])` followed by `tests = if … elsif … end` triggers `Lint/UselessAssignment` (the initial assignment is overwritten unconditionally). Rubocop's autocorrect is to delete it, but the bare assignment was doing type narrowing for Sorbet — removing it regresses 5 `NilClass`-inference errors on subsequent `tests.size`/`tests.first` calls. Wrap the conditional in `T.let(if … end, T::Array[Minitest::Runnable])` instead, and replace the inner `T.let([], …)` branches with bare `[]`. This preserves the type narrowing while clearing the lint offence. The offence is pre-existing on `main` since 2020; surfaced here because `.github/workflows/ruby.yml` runs `on: push` and main's last non- dependabot push pre-dates a Rubocop version that flagged it. Fixed in this PR because the lint job complains about the file this PR touches. Co-authored-by: Claude <noreply@anthropic.com> Orchestrated-by: ae <noreply@shopify.com>
a1dc9a7 to
78e37f6
Compare
Contributor
Author
|
Heads-up on the red The errors fall into four buckets, all unrelated to the backoff change:
Happy to clean these up in a separate PR so this one stays scoped to the actual fix. 🤖 AE · ✅ approved by @justin-reid |
jose-shopify
approved these changes
Jun 1, 2026
jose-shopify
left a comment
There was a problem hiding this comment.
left a non-blocking nit, code looks good
Address Jose's review nit on PR #41: the existing tests exercise the private `next_backoff` helper in isolation, but they don't pin its call site at line 256 of consume(). A revert to a bare `exponential_backoff <<= 1` here would still pass the suite. Add a test that stubs out `claim_stale_runnables`/`claim_fresh_runnables`/ `process_batch`/`cleanup` and a fake `combined_results`, then drives `consume()` through ~25 idle iterations while capturing the `block:` argument passed to `claim_fresh_runnables` each time. The assertion is that the last five captured values are all exactly `MAX_BACKOFF` (5_000) — if the call site were unbounded, those values would be growing powers of two well past 5_000_000. Verified by reverting the call site locally to `<<= 1`: the new test fails with `[10485760, 20971520, 41943040, 83886080, 167772160]` instead of `[5000] * 5`. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Orchestrated-by: ae <noreply@shopify.com>
justin-reid
added a commit
that referenced
this pull request
Jun 2, 2026
The `typecheck` workflow job has been red on `main` against current Sorbet for some time. `.github/workflows/ruby.yml` runs `on: push` and main's last non-dependabot push pre-dates the versions bundler now resolves, so the job had not run against current tooling until PR #41 forced it. This commit clears the remaining typecheck errors: - Add `sorbet/typed_overrides.yaml` downgrading the 17 autogenerated `sorbet/rbi/gems/*.rbi` files that are shipped at `# typed: strict` / `# typed: true` but omit sigs on many methods (~4360 sig-less-method errors). Editing the sigil in place would be wiped on the next `srb rbi gems` regeneration; the yaml + `--typed-override` in `sorbet/config` survives regen. - Demote `sorbet/rbi/minitest.rbi` from `strict` to `true` (it omits sigs on ~17 hand-written methods) and fix two existing sigs that this repo's own tests exercise: - `assert_includes`'s `collection` is `T.untyped`, not `T::Enumerable[T.untyped]`. Minitest dispatches via `.include?` and accepts `String` (used by `redis_coordinator_integration_test.rb:362-364`). - `assert_raises` takes a block. The sig was missing `&blk`, which rejected `defined_runnable_test.rb:13`. - Replace `T::Array[Module]` on the `custom_middlewares` accessor in `redis_coordinator.rb` with `T::Array[T.untyped]`. Sorbet 5046 requires `Module` to carry type arguments; `T::Module[…]` would type-check but raises `NameError: uninitialized constant T::Module` at load time on `sorbet-runtime` 0.5.12443 (the version CI resolves), so the broader `T.untyped` is the runtime-safe choice. - Add `--suppress-error-code=5002` to `sorbet/config`. The remaining error is `prism-1.9.0`'s gem-shipped RBI referencing `Prism::LexCompat::Result`, a constant the gem removed in the same release. The RBI is auto-loaded via the bundler-gem discovery path (not the filesystem walk subject to `--ignore`), so the project can't filter it by path. `bin/srb tc --isolate-error-code 5002` confirms this is the only such error in the repo, so suppression has zero false-negative risk today. When prism's upstream RBI is fixed (or `rubocop-ast` stops depending on prism), this line should be removed. After these changes, `bin/srb tc` reports 0 errors and `bin/rubocop` reports no offences. All 69 existing test runs / 337 assertions still pass. The `Lint/UselessAssignment` fix originally bundled with this PR landed independently in #41 (`78e37f6`) and was dropped on rebase. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Orchestrated-by: ae <noreply@shopify.com>
justin-reid
added a commit
that referenced
this pull request
Jun 2, 2026
The `typecheck` workflow job has been red on `main` against current Sorbet for some time. `.github/workflows/ruby.yml` runs `on: push` and main's last non-dependabot push pre-dates the versions bundler now resolves, so the job had not run against current tooling until PR #41 forced it. This commit clears the remaining typecheck errors: - Add `sorbet/typed_overrides.yaml` downgrading the 17 autogenerated `sorbet/rbi/gems/*.rbi` files that are shipped at `# typed: strict` / `# typed: true` but omit sigs on many methods (~4360 sig-less-method errors). Editing the sigil in place would be wiped on the next `srb rbi gems` regeneration; the yaml + `--typed-override` in `sorbet/config` survives regen. - Demote `sorbet/rbi/minitest.rbi` from `strict` to `true` (it omits sigs on ~17 hand-written methods) and fix two existing sigs that this repo's own tests exercise: - `assert_includes`'s `collection` is `T.untyped`, not `T::Enumerable[T.untyped]`. Minitest dispatches via `.include?` and accepts `String` (used by `redis_coordinator_integration_test.rb:362-364`). - `assert_raises` takes a block. The sig was missing `&blk`, which rejected `defined_runnable_test.rb:13`. - Replace `T::Array[Module]` on the `custom_middlewares` accessor in `redis_coordinator.rb` with `T::Array[T.untyped]`. Sorbet 5046 requires `Module` to carry type arguments; `T::Module[…]` would type-check but raises `NameError: uninitialized constant T::Module` at load time on `sorbet-runtime` 0.5.12443 (the version CI resolves), so the broader `T.untyped` is the runtime-safe choice. - Add `--suppress-error-code=5002` to `sorbet/config`. The remaining error is `prism-1.9.0`'s gem-shipped RBI referencing `Prism::LexCompat::Result`, a constant the gem removed in the same release. The RBI is auto-loaded via the bundler-gem discovery path (not the filesystem walk subject to `--ignore`), so the project can't filter it by path. `bin/srb tc --isolate-error-code 5002` confirms this is the only such error in the repo, so suppression has zero false-negative risk today. When prism's upstream RBI is fixed (or `rubocop-ast` stops depending on prism), this line should be removed. After these changes, `bin/srb tc` reports 0 errors and `bin/rubocop` reports no offences. All 69 existing test runs / 337 assertions still pass. The `Lint/UselessAssignment` fix originally bundled with this PR landed independently in #41 (`78e37f6`) and was dropped on rebase. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Orchestrated-by: ae <noreply@shopify.com>
justin-reid
added a commit
that referenced
this pull request
Jun 3, 2026
The `typecheck` workflow job has been red on `main` against current Sorbet for some time. `.github/workflows/ruby.yml` runs `on: push` and main's last non-dependabot push pre-dates the versions bundler now resolves, so the job had not run against current tooling until PR #41 forced it. This commit clears the remaining typecheck errors: - Add `sorbet/typed_overrides.yaml` downgrading the 17 autogenerated `sorbet/rbi/gems/*.rbi` files that are shipped at `# typed: strict` / `# typed: true` but omit sigs on many methods (~4360 sig-less-method errors). Editing the sigil in place would be wiped on the next `srb rbi gems` regeneration; the yaml + `--typed-override` in `sorbet/config` survives regen. - Demote `sorbet/rbi/minitest.rbi` from `strict` to `true` (it omits sigs on ~17 hand-written methods) and fix two existing sigs that this repo's own tests exercise: - `assert_includes`'s `collection` is `T.untyped`, not `T::Enumerable[T.untyped]`. Minitest dispatches via `.include?` and accepts `String` (used by `redis_coordinator_integration_test.rb:362-364`). - `assert_raises` takes a block. The sig was missing `&blk`, which rejected `defined_runnable_test.rb:13`. - Replace `T::Array[Module]` on the `custom_middlewares` accessor in `redis_coordinator.rb` with `T::Array[T.untyped]`. Sorbet 5046 requires `Module` to carry type arguments; `T::Module[…]` would type-check but raises `NameError: uninitialized constant T::Module` at load time on `sorbet-runtime` 0.5.12443 (the version CI resolves), so the broader `T.untyped` is the runtime-safe choice. - Add `--suppress-error-code=5002` to `sorbet/config`. The remaining error is `prism-1.9.0`'s gem-shipped RBI referencing `Prism::LexCompat::Result`, a constant the gem removed in the same release. The RBI is auto-loaded via the bundler-gem discovery path (not the filesystem walk subject to `--ignore`), so the project can't filter it by path. `bin/srb tc --isolate-error-code 5002` confirms this is the only such error in the repo, so suppression has zero false-negative risk today. When prism's upstream RBI is fixed (or `rubocop-ast` stops depending on prism), this line should be removed. After these changes, `bin/srb tc` reports 0 errors and `bin/rubocop` reports no offences. All 69 existing test runs / 337 assertions still pass. The `Lint/UselessAssignment` fix originally bundled with this PR landed independently in #41 (`78e37f6`) and was dropped on rebase. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Orchestrated-by: ae <noreply@shopify.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Problem.
Coordinators::RedisCoordinator#consumedoubles itsXREADGROUP BLOCK <ms>interval on every empty poll with no upper bound. Once doubling has gone on long enough (around 15 empty iterations from the 10 ms start), a singleBLOCKcall can last 5–10 minutes, during which the worker cannot re-checkcomplete?/abort?. Any transient moment whencomplete?is momentarily false after the queue has functionally drained — pipelinedXACKs not yet visible, a worker that died holding a claim and not yet reaped, a slow ack write on a high-latency Redis path — is enough to push the worker into one of those multi-minute blocks, where it then sits until theBLOCKexpires or the platform kills the job. The unboundedBLOCKis the amplifier: without it, the next iteration would be a short poll, the worker would re-check on its own, and the same transient inconsistency would heal in milliseconds.Fix. Cap the doubled value at
MAX_BACKOFF = 5_000ms. Extract the doubling into a privatenext_backoffmethod so it's unit-testable without a live Redis. Bounds the worst-case unresponsiveness to a singleMAX_BACKOFFwindow while keeping the same exponential ramp-up for the common case.Details
Coordinators::RedisCoordinator#consumedoublesexponential_backoffon every empty iteration withexponential_backoff <<= 1and no upper bound. That value is passed toredis.xreadgroup(... block: <ms>), i.e. RedisXREADGROUP ... BLOCK <ms>, which the client implements aspoll()on the socket.Starting from
INITIAL_BACKOFF = 10ms, after 15 consecutive empty iterations a singleBLOCKcall lasts ~5m27s; after 16, ~10m. Thebreak if combined_results.complete?andbreak if combined_results.abort?checks only run after theBLOCKreturns, so once a worker has entered a multi-minuteBLOCKit cannot escape until theBLOCKexpires or the platform kills the job.Under the right conditions this can surface as a post-test teardown hang: Minitest reports
N/N 100%, butbin/rails testthen sits inppoll()on the coordinator's Redis socket for several minutes until the build times out. We've observed this path in production. It doesn't happen on every run — it requires the transient inconsistency to land between two specific iterations — but the worst case it leads to is unbounded.The fix caps the doubled value at
MAX_BACKOFF = 5_000ms (reached after ~9 iterations, ~10 s of cumulative empty polls). It bounds the worst case where a worker is unresponsive tocomplete?/abort?after the queue drains to a singleMAX_BACKOFFwindow, while keeping the same exponential ramp-up behaviour for the common case.The doubling logic is extracted into a private
next_backoffmethod so it can be unit-tested without a live Redis. New tests cover the doubling, the clamp, and the bounded iteration count.Testing
test/minitest/distributed/coordinators/redis_coordinator_test.rbcovering: doubling below the cap, clamping at the cap, and the bounded-iteration count.