feat(act): per-reaction retry backoff (ACT-601)#724

Merged
Rotorsoft merged 2 commits into master from feat/act-601-backoff
May 14, 2026
@Rotorsoft
Owner

Summary

Closes #687.

Adds a backoff option on reaction handlers that paces the delay between retry attempts: fixed, linear, or exponential, with optional jitter. This closes the last gap in drain's retry semantics. Today the framework re-claims a failed stream on the next cycle (typically within milliseconds), so a transient outage can exhaust the retry budget before the failing dependency has had time to recover.

.on("OrderPlaced")
  .do(handler, {
    maxRetries: 5,
    backoff: { strategy: "exponential", baseMs: 200, maxMs: 30_000, jitter: true },
  })
  .to(resolver)
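
To make the strategies concrete, here is a minimal sketch of how a delay could be derived from those options. Only the option names (strategy, baseMs, maxMs, jitter) come from the snippet above. The function name, the growth formulas, and the "full jitter" choice (a uniform draw between 0 and the capped delay) are assumptions for illustration, not the PR's implementation.

```typescript
// Hypothetical sketch: option names mirror the PR snippet, everything else is assumed.
type BackoffStrategy = "fixed" | "linear" | "exponential";

interface BackoffOptions {
  strategy: BackoffStrategy;
  baseMs: number;
  maxMs?: number; // optional cap on the computed delay
  jitter?: boolean; // randomize the delay to avoid synchronized retries
}

// Assumed growth curves; `n` is the retry number, starting at 1.
const growth: Record<BackoffStrategy, (baseMs: number, n: number) => number> = {
  fixed: (baseMs) => baseMs,
  linear: (baseMs, n) => baseMs * n,
  exponential: (baseMs, n) => baseMs * 2 ** (n - 1),
};

// Returns the delay in milliseconds before retry number `attempt`.
// `random` is injectable so the jitter path can be tested deterministically.
function computeDelayMs(
  opts: BackoffOptions,
  attempt: number,
  random: () => number = Math.random
): number {
  const n = Math.max(1, Math.floor(attempt));
  const raw = growth[opts.strategy](opts.baseMs, n);
  const capped = Math.min(raw, opts.maxMs ?? Number.POSITIVE_INFINITY);
  // Assumed "full jitter": uniform in [0, capped).
  return opts.jitter ? Math.floor(random() * capped) : capped;
}
```

Under these assumed formulas, the configuration above without jitter would wait 200, 400, 800, 1600, and 3200 ms across its five retries, and the 30 000 ms cap would only come into play from the ninth attempt onward.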

Design notes

  • No DB schema change, no Store port change. The DrainController owns a Map<stream, nextAttemptAt> in process memory. Deferred streams hold their existing lease via a new claim-but-skip path in runDrainCycle — the lease itself is the per-worker pacing primitive. A setTimeout re-arms drain at the earliest pending expiry.
  • Per-worker semantics, by design. With N competing workers, each paces only its own re-attempts; the shared retry_count on the watermark climbs across workers, so blockOnError can fire up to N× sooner than configured. The trade-off is intentional: transient per-worker faults recover faster, and poison messages are quarantined sooner. Documented in concepts/error-handling.md and in a safety one-liner in CLAUDE.md.
  • leaseMillis as floor. Because the controller holds the lease during the backoff window, effective_backoff = max(configured, leaseMillis). The effective delay is therefore never shorter than the configured backoff, and never shorter than the lease.
  • Default omitted = current behavior. Backwards-compatible by construction — existing reactions without backoff continue to retry as soon as the lease expires.
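
The in-memory bookkeeping described in these notes can be sketched as follows. This is an illustration under stated assumptions, not the DrainController code: the class and method names are hypothetical, and only the Map<stream, nextAttemptAt> shape, the claim-but-skip idea, the clear-on-success and clear-on-block behavior, and the max(configured, leaseMillis) floor come from the description.

```typescript
// Hypothetical sketch of the per-worker backoff table; names are illustrative.
class BackoffTable {
  // stream name -> earliest timestamp (ms) at which a retry may run
  private readonly nextAttemptAt = new Map<string, number>();

  // Record that `stream` must not be retried before now + delayMs.
  defer(stream: string, delayMs: number, now: number): void {
    this.nextAttemptAt.set(stream, now + delayMs);
  }

  // Claim-but-skip check: the stream stays claimed (lease held),
  // but the handler is not invoked while the window is open.
  shouldSkip(stream: string, now: number): boolean {
    const at = this.nextAttemptAt.get(stream);
    return at !== undefined && now < at;
  }

  // Success and block both clear the entry.
  clear(stream: string): void {
    this.nextAttemptAt.delete(stream);
  }

  get size(): number {
    return this.nextAttemptAt.size;
  }
}

// Lease floor: the worker holds the lease for the whole window,
// so the observed delay is at least leaseMillis.
function effectiveBackoffMs(configuredMs: number, leaseMillis: number): number {
  return Math.max(configuredMs, leaseMillis);
}
```

For example, with an assumed lease of 10 000 ms, a configured 200 ms backoff would behave like a 10 000 ms one, while a configured 30 000 ms backoff would be unaffected.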

Why this isn't an "outbox" subsystem: drain already provides ordered, at-least-once delivery, retries, dead-lettering, and competing-consumer semantics via SKIP LOCKED. The only missing primitive was inter-attempt timing, which is one knob, not a parallel system.

Test plan

  • 12 new tests in libs/act/test/backoff.spec.ts cover all 4 strategies, jitter bounds, deferral behavior, success-clears-entry, block-clears-entry, and the no-backoff default
  • Full @rotorsoft/act suite passes (544 tests, no regressions)
  • Broader suite passes — libs/act, libs/act-sqlite, libs/act-tck, packages/wolfdesk (1172 tests total)
  • Typecheck clean across the workspace
  • Biome lint clean
  • Wolfdesk MessageAdded reaction wired to exponential+jitter with a flaky-delivery stub — observable in dev logs

Docs

  • docs/docs/concepts/error-handling.md — new Backoff section with strategy table, per-worker semantics, leaseMillis floor
  • CLAUDE.md — safety-critical one-liner for per-worker pacing
  • Memory: project_book_backoff.md book notes for the error-handling and scaling chapters

🤖 Generated with Claude Code

rotorsoft and others added 2 commits May 14, 2026 17:46
Adds a `backoff` option on reaction handlers that paces inter-attempt
timing — fixed, linear, or exponential with optional jitter. Closes the
last gap in drain's retry semantics: today the framework re-claims a
failed stream on the next cycle (typically within ms), turning transient
outages into exhausted retry budgets.

The controller maintains the backoff window in process memory; deferred
streams hold their existing lease via runDrainCycle's claim-but-skip
path, so no Store contract change and no DB schema change. With N
competing workers, retries escalate up to N× faster than configured —
intentional: per-worker pacing speeds recovery on transient per-worker
faults, and poison messages quarantine sooner.

Wires wolfdesk's MessageAdded delivery reaction to exponential backoff
with jitter, with a flaky-delivery stub to make the pacing observable in
dev logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop unreachable `dispose()` method and the now-redundant `size === 0`
  guard inside `scheduleBackoffWake` (caller already enforces it).
- Inline `gcExpiredBackoff` into the timer callback — separate method
  was only called from one place and made coverage harder to read.
- Drop optional chain on `unref()` — Node's setTimeout always returns a
  Timeout with `unref()`; the optional chain registered as an uncovered
  branch.
- Add a multi-stream test that puts two streams in the backoff map at
  different expiries, forcing the callback to iterate both entries and
  exercise the "delete expired / keep pending" branch.

drain-cycle.ts now 100% on statements, lines, functions, and branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
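
The "delete expired / keep pending" branch exercised by the multi-stream test can be pictured with a small sketch. The function name and signature below are hypothetical; the assumption is that the timer callback sweeps the backoff map, drops entries whose expiry has passed, and reports the earliest remaining expiry so the caller can re-arm a timer for it.

```typescript
// Hypothetical sketch of the sweep a wake-up timer callback might perform.
// Mutates `backoff` in place and returns the earliest pending expiry,
// or undefined when nothing is left to wait for.
function sweepExpired(
  backoff: Map<string, number>,
  now: number
): number | undefined {
  let earliest: number | undefined = undefined;
  // Snapshot the entries so deletion during iteration is obviously safe.
  for (const [stream, expiresAt] of Array.from(backoff.entries())) {
    if (expiresAt <= now) {
      backoff.delete(stream); // expired: the stream is eligible for retry again
    } else if (earliest === undefined || expiresAt < earliest) {
      earliest = expiresAt; // still pending: track the soonest expiry
    }
  }
  return earliest;
}
```

With two streams at different expiries, a sweep that lands between them removes the first and returns the second's expiry, which is the case a single-stream test cannot reach.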
@Rotorsoft Rotorsoft merged commit 622e74c into master May 14, 2026
6 checks passed
@github-actions

🎉 This PR is included in version @rotorsoft/act-v0.41.0 🎉

Your semantic-release bot 📦🚀
