feat(act): NonRetryableError + app.unblock recovery primitive#736

Merged
Rotorsoft merged 7 commits into master from feat/act-604-non-retryable-error on May 16, 2026
@Rotorsoft (Owner) commented May 16, 2026

Summary

Closes #735.

Implements ACT-604 plus the operational primitives that came up during implementation and review: the unblock recovery method, the string[] | StreamFilter union for reset/unblock, and the app.blocked_streams() discovery wrapper.

Three coupled additions, one coherent recovery loop:

  1. NonRetryableError — a handler-signaled "this is permanent, block now" class. The drain finalizer recognizes error instanceof NonRetryableError and forces block = blockOnError regardless of lease.retry. Closes the documented gap in ACT-602 between the webhook helper's "this is permanent" knowledge and the drain pipeline's retry-only classifier.

  2. app.unblock(input) / Store.unblock(input) — the operational recovery path. Clears the blocked flag (plus retry, error, lease) without touching the at watermark. The stream resumes from where it stopped, not from event 0. Required because ACT-604 makes streams block on the first permanent failure — falling back to app.reset() would mean re-firing every historical event for a webhook stream blocked on one bad payload.

  3. app.blocked_streams({ after?, limit? }) — convenience wrapper around store().query_streams(cb, { blocked: true }). Closes the discovery half of the loop ("show me what's broken") before recovery.

Both reset and unblock accept either an explicit string[] of stream names or a StreamFilter for bulk operations (the same shape prioritize already used, now canonically named StreamFilter with PrioritizeFilter retained as an alias). One filter shape across three Store methods; one _filterPredicate / _filterClause helper per adapter.
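As a rough illustration of the handler-side signal described in item 1, the sketch below throws a permanent error for a payload that retrying cannot repair. The `NonRetryableError` class here is a local stand-in (the real one is exported from @rotorsoft/act, and its constructor signature is not shown in this PR), and `OrderPlaced` and `notifyCustomer` are hypothetical names.

```typescript
// Local stand-in for the class the PR exports from @rotorsoft/act.
// The real constructor signature is not shown in this PR (assumption).
class NonRetryableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonRetryableError";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical event payload and reaction handler.
type OrderPlaced = { orderId: string; email: string };

function notifyCustomer(event: OrderPlaced): string {
  if (!event.email.includes("@")) {
    // Retrying cannot repair a malformed payload, so signal "permanent".
    throw new NonRetryableError(`invalid email on order ${event.orderId}`);
  }
  if (event.email.endsWith("@down.example")) {
    // A transient failure stays a plain Error and keeps its retry budget.
    throw new Error("mail gateway unavailable");
  }
  return `queued mail to ${event.email}`;
}
```

The class itself carries the signal, so the handler needs no extra flag or return value to tell the drain that the failure is permanent.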

Why unblock and the filter union ended up in this PR

Store.block() and the framework documented blocked streams as something you "manually unblock," but the only path to clear the flag was app.reset() — a rebuild-from-zero primitive suited for projection rebuilds, wrong for poison-message recovery. The gap was hidden by patience while streams only blocked after exhausting maxRetries. Once non-retryable errors block on first attempt, the gap became load-bearing — recovery from a one-off validation failure can't require replaying the entire stream.

The filter union came out of an API audit: three Store-port methods select streams without per-row data (reset, unblock, prioritize), but only prioritize used a filter. After the audit, they share a single shape.

Core changes (libs/act)

  • NonRetryableError exported from @rotorsoft/act and @rotorsoft/act/types. Errors registry gains ERR_NON_RETRYABLE.
  • finalize() in internal/reactions.ts gains one branch: block = blockOnError && (nonRetryable || retry >= maxRetries). The operator's blockOnError: false still wins — non-retryable doesn't override "retry forever."
  • Store.unblock(input: string[] | StreamFilter) added to the port contract. Atomic single-statement UPDATE per adapter, always restricts to blocked = true. retry = -1 (matching the InMemoryStore convention) so claim's post-bump returns retry = 0 for the first post-unblock attempt.
  • Store.reset(input: string[] | StreamFilter) widened.
  • StreamFilter added to the type surface; PrioritizeFilter is now an alias (no breaking change).
  • Act.unblock(input) wraps store().unblock() and arms the drain flag so settled apps pick up the now-free streams on the next cycle. Symmetric with Act.reset().
  • Act.blocked_streams({ after?, limit? }) for discovery.
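The block decision quoted in the finalize() bullet can be read as a small pure function. This is a sketch of that one expression only, not the actual finalize() code; `shouldBlock`, `DrainOptions`, and the local `NonRetryableError` stand-in are names assumed for illustration.

```typescript
// Local stand-in for the exported class (assumption).
class NonRetryableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonRetryableError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical option bag; the field names follow the PR text.
type DrainOptions = { blockOnError: boolean; maxRetries: number };

// block = blockOnError && (nonRetryable || retry >= maxRetries)
function shouldBlock(error: unknown, retry: number, options: DrainOptions): boolean {
  const nonRetryable = error instanceof NonRetryableError;
  return options.blockOnError && (nonRetryable || retry >= options.maxRetries);
}
```

Because `blockOnError` is the outer operand, an operator who sets it to false keeps "retry forever" behavior even for non-retryable errors, which is the asymmetry the PR calls out.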

Adapter changes

  • InMemoryStore — extracted _filterPredicate helper reused across reset, unblock, prioritize.
  • PostgresStore (@rotorsoft/act-pg) — extracted _filterClause helper returning a WHERE fragment + parameter values; reused across the same three methods.
  • SqliteStore (@rotorsoft/act-sqlite) — same pattern, libSQL-positional placeholders.
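To make the "WHERE fragment + parameter values" idea concrete, here is a minimal sketch of such a helper for Postgres-style `$n` placeholders. It is not the adapter's `_filterClause`; the field types of `StreamFilter`, the column names, and the `startAt` parameter are all assumptions.

```typescript
// Assumed field types; the PR only names the five keys.
type StreamFilter = {
  stream?: string;
  stream_exact?: boolean;
  source?: string;
  source_exact?: boolean;
  blocked?: boolean;
};

// Builds a WHERE fragment plus its values. startAt lets the caller reserve
// earlier placeholders for its own SET clause parameters.
function filterClause(
  filter: StreamFilter,
  startAt = 1
): { where: string; values: unknown[] } {
  const conditions: string[] = [];
  const values: unknown[] = [];
  const add = (column: string, operator: string, value: unknown): void => {
    values.push(value);
    conditions.push(`${column} ${operator} $${startAt + values.length - 1}`);
  };
  // "~" is PostgreSQL's regular-expression match operator.
  if (filter.stream !== undefined) add("stream", filter.stream_exact ? "=" : "~", filter.stream);
  if (filter.source !== undefined) add("source", filter.source_exact ? "=" : "~", filter.source);
  if (filter.blocked !== undefined) add("blocked", "=", filter.blocked);
  return {
    where: conditions.length > 0 ? `WHERE ${conditions.join(" AND ")}` : "",
    values,
  };
}
```

Values always travel as parameters rather than being interpolated into the SQL text, so one helper can serve reset, unblock, and prioritize without each method re-implementing the escaping.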

Helper changes (libs/act-http) — breaking at 0.1.0

  • WebhookError split. WebhookError extends Error for retryable cases (5xx, network, timeout); NonRetryableWebhookError extends NonRetryableError for 4xx. The retryable: boolean field is removed — the class itself is the signal.
  • webhook() throws the appropriate subclass based on status. Drain finalizer auto-blocks on 4xx via the inherited NonRetryableError marker.
  • The original ACT-602 acceptance criterion ("4xx blocks on first attempt") now holds for real.

Pre-1.0 package; no external consumers.
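The two-class split can be sketched as a status-to-class mapping. The classes below are local stand-ins for the ones in @rotorsoft/act and @rotorsoft/act-http; their constructor signatures, the `status` field, and the `errorForStatus` helper are assumptions made for the example.

```typescript
// Local stand-ins; real constructor signatures are assumptions.
class NonRetryableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonRetryableError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Retryable failures: 5xx, network, timeout.
class WebhookError extends Error {
  status?: number;
  constructor(message: string, status?: number) {
    super(message);
    this.name = "WebhookError";
    this.status = status;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Permanent failures: 4xx. Inherits the marker the drain finalizer checks.
class NonRetryableWebhookError extends NonRetryableError {
  status: number;
  constructor(message: string, status: number) {
    super(message);
    this.name = "NonRetryableWebhookError";
    this.status = status;
  }
}

// Hypothetical helper: choose the error class from the response status.
function errorForStatus(status: number): WebhookError | NonRetryableWebhookError | undefined {
  if (status >= 500) return new WebhookError(`HTTP ${status}`, status);
  if (status >= 400) return new NonRetryableWebhookError(`HTTP ${status}`, status);
  return undefined;
}
```

Callers that used to read `err.retryable` would instead test `err instanceof NonRetryableWebhookError`, or `err instanceof NonRetryableError` for the framework-general check.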

TCK additions

  • unblock describe block with happy-path, no-op, mixed-input, filter-form (stream pattern, empty filter, explicit blocked: false) cases.
  • reset filter form describe block with pattern-match and blocked-only filter cases.
  • All three in-tree adapters (InMemory, act-pg, act-sqlite) pass the new blocks.

Docs (final pass)

  • CLAUDE.md — two new safety-critical one-liners: unblock vs reset distinction and NonRetryableError semantics with the blockOnError: false asymmetry.
  • docs/concepts/error-handling.md — "Non-retryable errors", "Recovering a blocked stream — app.unblock" (array + filter forms + comparison table), "Discovering blocked streams — app.blocked_streams()" sections. "Blocked Streams" section rewritten to cover both block paths.
  • docs/architecture/extension-points.md — Store interface listing updated to twelve methods; the shared StreamFilter type and reset-vs-unblock split called out.
  • docs/architecture/concurrency-model.md — "block" exit description mentions NonRetryableError and both forms (array + filter) of recovery.
  • docs/concepts/event-sourcing.md — projection rebuild section mentions the filter form and forward-links to unblock for the rebuild-vs-recovery distinction.
  • docs/guides/projections-to-database.md — bulk family-rebuild example via the filter form.
  • docs/guides/production-checklist.md: blocked_streams → unblock(filter) workflow as the recovery prescription.
  • libs/act-http/README.md — Behavior + Retry/block tables rewritten around the two-class split; "Recovering a blocked stream" with the family-unblock filter example and blocked_streams discovery snippet.
  • book/act-602-act-http.md — "4xx limitation" rewritten to point at ACT-604 as the resolution.
  • book/act-604-non-retryable.md — new essay (~200 lines): class-vs-flag design, blockOnError asymmetry, retry = -1 storage convention, the unblock recovery primitive, the names-or-filter API audit, the three-primitive recovery loop.

Test coverage

1583 tests passing total (up from 1513).

  • 8 new integration tests for NonRetryableError in libs/act/test/non-retryable.spec.ts (class shape, drain integration, unblock recovery flow including filter-form bulk recovery and blocked_streams discovery).
  • 7 new TCK cases for unblock and reset filter forms (run against InMemory, PG, SQLite).
  • 3 new act-pg fault-injection tests for defensive rowCount ?? 0 branches.
  • 2 new act-sqlite rollback-path tests for unblock transaction error handling.
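The defensive branch those act-pg tests target is small. A sketch of it follows, assuming only that the driver's result exposes a nullable `rowCount` (node-postgres types it as `number | null`); `QueryResultLike` and `affectedStreams` are hypothetical names.

```typescript
// Only the field this sketch needs from a driver result.
type QueryResultLike = { rowCount: number | null };

// A null rowCount is reported as "no streams affected" instead of leaking null.
function affectedStreams(result: QueryResultLike): number {
  return result.rowCount ?? 0;
}
```

A real UPDATE normally reports a numeric count, so the right-hand side of `??` only runs when a test injects a null, which is why fault injection is needed to reach 100% branch coverage.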

Coverage: 100% statements / 100% branches / 100% functions / 100% lines.

Stability charter impact

All additive to charter-covered surface:

  • NonRetryableError — new exported class on @rotorsoft/act.
  • Store.unblock — new method on the Store interface; capability-gated in the TCK so existing adapters keep passing.
  • StreamFilter — new exported type; PrioritizeFilter retained as alias.
  • Act.unblock, Act.blocked_streams — new public methods.
  • Store.reset / Act.reset / Store.prioritize / Act.prioritize: signature widening (string[] → string[] | StreamFilter, PrioritizeFilter → StreamFilter), backwards-compatible at the call site.

No removals, no renames, no narrowed types. The WebhookError change in act-http is breaking but the package is at 0.1.0 (pre-1.0) and one release old.

Test plan

  • pnpm test — 1583 passed, 100% coverage on every metric
  • pnpm typecheck — clean
  • pnpm lint — clean (only pre-existing warnings)
  • TCK unblock + reset filter form blocks pass against InMemory, act-pg, act-sqlite
  • Integration tests exercise full NonRetryableError → block → unblock(filter) → reprocess flow
  • Fault-injection tests cover rowCount ?? 0 defensive branches and SQLite rollback paths
  • CI green on this PR
  • semantic-release on merge cuts @rotorsoft/act@X.Y.0 (minor, additive) and @rotorsoft/act-http@0.2.0 (a breaking change, but for this pre-1.0 package .releaserc maps conventional-commits breaking changes to a minor bump)

Follow-ups (parked)

  • Retry-After header parsing in webhook (parked in ACT-604 open questions).
  • Per-handler shouldBlock(error): boolean predicate (parked).
  • NonRetryableError from action handlers — different code path, separate design.

🤖 Generated with Claude Code

rotorsoft and others added 7 commits May 16, 2026 13:42
…(ACT-604)

Adds NonRetryableError class to core; drain finalizer recognizes it and
forces immediate block when blockOnError is true, regardless of
lease.retry. Closes the gap documented in ACT-602 between the helper's
"this is permanent" knowledge and the drain pipeline's retry-only
classifier.

Core changes (libs/act):
- new NonRetryableError class exported from @rotorsoft/act and
  @rotorsoft/act/types. Errors registry gains ERR_NON_RETRYABLE.
- finalize() in internal/reactions.ts gains one branch:
  block = blockOnError && (nonRetryable || retry >= maxRetries).
  operator's blockOnError: false still wins — non-retryable does not
  override "retry forever."
- 8 integration tests covering: class shape (name, cause, instanceof),
  first-attempt block with default options, no-block when
  blockOnError:false, plain Error still consumes retry budget,
  immediate block ignores backoff, batch handler path.

Helper changes (libs/act-http) — breaking at 0.1.0:
- WebhookError split into two classes. WebhookError extends Error for
  retryable cases (5xx, network, timeout); NonRetryableWebhookError
  extends NonRetryableError for 4xx. the "retryable: boolean" field is
  removed — the class itself is the signal.
- webhook() throws the appropriate subclass based on status. drain
  finalizer auto-blocks on 4xx via the inherited NonRetryableError
  marker. original ACT-602 acceptance criterion (4xx blocks on first
  attempt) now holds.
- webhook tests updated for the new class shape (instanceof checks
  instead of boolean field reads).

Docs:
- docs/docs/concepts/error-handling.md gains a "Non-retryable errors"
  section after the webhook one. the webhook section now describes the
  two-class split; the new section covers NonRetryableError as the
  general primitive with the validation-error example.
- libs/act-http/README.md "Behavior" and "Retry & block semantics"
  tables updated to reflect the class-based signal.
- book/act-602-act-http.md "limitation" section rewritten to point at
  ACT-604 as the resolution.
- book/act-604-non-retryable.md — new essay covering the design
  decisions: class vs. flag, the blockOnError-respect asymmetry, why
  the pattern generalizes to user handlers and other integration
  helpers.

Total test count: 1521 passed (8 new). 100% statements / 100% branches
/ 100% functions / 100% lines.

BREAKING CHANGE: @rotorsoft/act-http WebhookError no longer carries a
'retryable' field. Callers checking err.retryable should switch to
'err instanceof NonRetryableWebhookError' (or 'instanceof
NonRetryableError' for the framework-general check). The package is at
0.1.0 with no external consumers.

Closes ACT-604.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap where the only way to clear a blocked stream's flag was
app.reset() — a rebuild-from-zero primitive suited for projection
rebuilds, wrong for "I fixed the bug, please retry from where you
stopped." The gap predates ACT-604 but becomes acute with non-retryable
signaling: streams now block on the first permanent failure, so the
recovery path can't require a full event replay.

Adds Store.unblock(streams) to the port contract (additive — no
breaking change for adapters that don't implement it yet; capability-
gated in the TCK). Implemented across all three in-tree adapters:

- InMemoryStore: new InMemoryStream.unblock() that flips _blocked and
  returns whether the stream was actually flipped.
- PostgresStore: single UPDATE with WHERE blocked = true so rowCount
  reflects only streams that flipped.
- SqliteStore: transactional UPDATE per stream, mirrors the PG semantics.

All three set retry = -1 (matching the InMemoryStore convention) so the
first post-unblock claim returns retry = 0 ("first attempt"). Storing 0
would make claim's post-bump return 1, mis-reporting the post-recovery
attempt as a continuation of the failed sequence.
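The retry = -1 convention is easier to see with the bump written out. This is a toy model, assuming claim increments the stored retry before returning it, as the text describes; `StreamState`, `claim`, and `unblock` here are illustrative stand-ins rather than adapter code.

```typescript
// Toy per-stream state: watermark, retry counter, blocked flag.
type StreamState = { at: number; retry: number; blocked: boolean };

// Assumed claim behavior: bump retry, then report the bumped value.
function claim(state: StreamState): number {
  state.retry += 1;
  return state.retry;
}

// Clears the flag and stores -1 so the next claim reports 0 ("first attempt").
// The watermark `at` is left alone, so processing resumes where it stopped.
function unblock(state: StreamState): boolean {
  if (!state.blocked) return false; // no-op on streams that are not blocked
  state.blocked = false;
  state.retry = -1;
  return true;
}
```

Had unblock stored 0 instead, the first claim after recovery would report 1 and look like a continuation of the failed sequence.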

Adds Act.unblock(streams) that wraps store().unblock() and arms the
orchestrator's drain flag so a settled app picks up the now-free streams
on the next cycle. Symmetric with the existing Act.reset() wrapper.

TCK: new "unblock" describe block with four cases — happy path
(blocked → unblock → claim resumes at preserved watermark, retry = 0),
no-op on unblocked stream, no-op on unknown/empty, mixed input counts
only the actually-blocked streams.

Integration test in non-retryable.spec.ts exercises the full
NonRetryableError → block → unblock → reprocess flow: handler throws
permanent error, drain blocks immediately, app.unblock(streams) clears
the flag, next drain succeeds at the SAME event (not replayed from
zero).

Docs:
- docs/concepts/error-handling.md gains an "unblock" subsection
  contrasting it with reset.
- docs/architecture/concurrency-model.md's "block" exit description
  updated to mention NonRetryableError and the unblock/reset choice.
- docs/guides/production-checklist.md changes the recovery instruction
  from "Unblock with app.reset" to "recover with app.unblock; reset is
  for rebuilds."
- libs/act-http/README.md adds a "Recovering a blocked stream"
  subsection — important because 4xx blocks are now the common case
  and reset would re-fire all historical webhooks.
- book/act-604-non-retryable.md gains a section on the recovery
  primitive, including the retry = -1 convention rationale.

Tests: 1556 passed (3 new unblock tests in TCK, 2 new in non-retryable
spec). Coverage 99.95% branches globally — drops from 100% are in
defensive error paths (rowCount ?? 0, rollback) that mirror the
existing untested paths in reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….blocked_streams

API audit on the back of ACT-604: only three Store-port methods select
streams without per-row data (reset, unblock, prioritize), but reset and
unblock were name-array-only while prioritize was filter-only. Aligns
them so all three accept the same filter shape, and surfaces the bulk
recovery use case that came up around poison-message storms (unblock
every blocked stream in a family in one call).

Signature changes:
- Store.reset(input: string[] | StreamFilter)
- Store.unblock(input: string[] | StreamFilter)
- Store.prioritize(filter: StreamFilter) — type rename only
- Act.reset / Act.unblock follow

New type `StreamFilter` is the canonical name; `PrioritizeFilter` stays
as a non-breaking alias. Identical shape — `Pick<QueryStreams, "stream"
| "stream_exact" | "source" | "source_exact" | "blocked">`.

Filter semantics:
- reset(filter): applies the filter as-is. `reset({ blocked: true })`
  rebuilds only blocked streams; `reset({ stream: "^proj-" })` rebuilds
  a projection family. Empty filter matches every registered stream
  (documented footgun, no runtime block — operators use it sparingly).
- unblock(filter): always forces `blocked = true` regardless of what
  the caller passes. There is no use case for "unblock unblocked
  streams," so the framework removes that footgun at the boundary. An
  explicit `blocked: false` matches nothing.
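The unblock rule above can be sketched with an in-memory predicate. This is not the InMemoryStore's `_filterPredicate`; `StreamRow`, the field types of `StreamFilter`, and the regex interpretation of non-exact patterns are assumptions for illustration.

```typescript
type StreamFilter = {
  stream?: string;
  stream_exact?: boolean;
  source?: string;
  source_exact?: boolean;
  blocked?: boolean;
};

// Hypothetical row shape for the sketch.
type StreamRow = { stream: string; source?: string; blocked: boolean };

function filterPredicate(filter: StreamFilter): (row: StreamRow) => boolean {
  const matches = (value: string | undefined, pattern: string | undefined, exact?: boolean): boolean =>
    pattern === undefined ||
    (value !== undefined && (exact ? value === pattern : new RegExp(pattern).test(value)));
  return (row) =>
    matches(row.stream, filter.stream, filter.stream_exact) &&
    matches(row.source, filter.source, filter.source_exact) &&
    (filter.blocked === undefined || row.blocked === filter.blocked);
}

// unblock ANDs the caller's filter with "row is blocked", so it can only
// ever touch blocked rows; an explicit blocked: false selects nothing.
function unblock(rows: StreamRow[], filter: StreamFilter): number {
  const selected = filterPredicate(filter);
  let flipped = 0;
  for (const row of rows) {
    if (row.blocked && selected(row)) {
      row.blocked = false;
      flipped++;
    }
  }
  return flipped;
}
```

With this shape, `unblock(rows, { stream: "^webhook-" })` recovers one family and `unblock(rows, {})` sweeps everything still blocked.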

Adapter implementations:
- InMemoryStore: extracted _filterPredicate helper; reused across
  reset, unblock, and prioritize.
- PostgresStore: extracted _filterClause helper that returns a WHERE
  fragment + parameter values. UPDATE statements compose it with their
  fixed set clauses; reset/unblock/prioritize all reuse it.
- SqliteStore: same shape, libSQL-positional placeholders.

New Act method:
- app.blocked_streams({ after?, limit? }): convenience wrapper around
  store().query_streams(cb, { blocked: true, ... }). Returns an array
  of StreamPosition for the discover → unblock workflow.
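A sketch of the discover → unblock workflow follows, written against a narrow interface so it runs without the framework. The `RecoveryApp` interface, the `StreamPosition` fields, the numeric return of `unblock`, and the `fakeApp` helper are assumptions; only the method names and the `limit` option come from this PR.

```typescript
// Assumed shape; the PR only says blocked_streams returns StreamPosition[].
type StreamPosition = { stream: string; at: number };

// The slice of the app surface this sketch relies on (signatures assumed).
interface RecoveryApp {
  blocked_streams(query?: { limit?: number }): Promise<StreamPosition[]>;
  unblock(input: string[]): Promise<number>;
}

// Discover what is blocked, let the operator pick, then resume only those.
async function recoverBlocked(
  app: RecoveryApp,
  shouldRecover: (position: StreamPosition) => boolean
): Promise<number> {
  const blocked = await app.blocked_streams({ limit: 100 });
  const names = blocked.filter(shouldRecover).map((p) => p.stream);
  return names.length > 0 ? app.unblock(names) : 0;
}

// In-memory fake used only to exercise the workflow.
function fakeApp(initial: StreamPosition[]): RecoveryApp {
  const blocked = new Map(initial.map((p) => [p.stream, p] as const));
  return {
    async blocked_streams(query) {
      const all = Array.from(blocked.values());
      return query?.limit === undefined ? all : all.slice(0, query.limit);
    },
    async unblock(names) {
      let flipped = 0;
      for (const name of names) if (blocked.delete(name)) flipped++;
      return flipped;
    },
  };
}
```

The array form suits this reviewed, pick-and-choose recovery; the filter form covers the case where the whole family should resume without inspection.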

TCK additions:
- "unblock" describe block gains three filter cases (stream-pattern
  match, empty-filter family scope, explicit blocked:false matches
  nothing).
- New "reset filter form" describe block (pattern match preserves
  unmatched watermarks; blocked-only filter restricts the rebuild
  scope).

Integration tests in non-retryable.spec.ts add:
- app.unblock(filter) for bulk recovery across a family.
- app.blocked_streams() discovers blocked streams and confirms the
  list goes empty after recovery.

Docs:
- docs/concepts/error-handling.md gains examples of all three forms
  plus a discovery-first workflow snippet.
- docs/guides/production-checklist.md updated to mention
  blocked_streams() as the discovery primitive.
- libs/act-http/README.md "Recovering a blocked stream" shows the
  filter form for webhook families.
- book/act-604-non-retryable.md gains "Names or filter" and
  "Discovering what's blocked" sections covering the design call.

Tests: 1573 passing (up from 1556). Coverage 99.87% branches globally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Missed in the prior commit due to a stale read; production-checklist.md
still referenced 'Unblock with app.reset' which is now wrong (reset
rebuilds from zero; unblock is the resume primitive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…streams section

Three loose-end doc updates after the API audit:

1. CLAUDE.md gains two safety-critical one-liners:
   - "Blocked-stream recovery — unblock resumes, reset rebuilds":
     covers the load-bearing distinction (don't use reset to clear a
     blocked webhook — it'd re-fire every historical event), the
     string[] | StreamFilter shape, and pointers at app.blocked_streams
     for discovery.
   - "Non-retryable errors signal permanent failure": NonRetryableError
     as the handler-side block signal, the blockOnError: false respect
     asymmetry, and the act-http/webhook NonRetryableWebhookError.

2. docs/architecture/extension-points.md updates the Store interface
   reference to include unblock and reflect the string[] | StreamFilter
   shape for reset/unblock/prioritize. Adds a one-paragraph note on the
   shared filter type and the reset-vs-unblock semantic split.

3. docs/concepts/error-handling.md "Blocked Streams" section rewritten
   to (a) describe both paths streams can block (maxRetries exhausted
   *or* NonRetryableError on first attempt), (b) point at the new
   unblock / reset / blocked_streams recovery surface with anchor
   links, instead of the stale "they need an explicit app.reset() (or
   external unblock)" wording.

No code changes, no test changes. All existing tests still pass
(1573); lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ctions-to-database

Three remaining docs referenced app.reset / app.unblock without
mentioning that they also accept a StreamFilter. Adds the filter
option callout in each:

- architecture/concurrency-model.md "block" exit: shows both forms
  with the bulk-recovery example and the post-incident "unblock
  everything blocked" sweep.
- concepts/event-sourcing.md "Projection Rebuild": new paragraph
  describing the StreamFilter shape (shared with unblock and
  prioritize) and a forward-link to error-handling.md for the
  rebuild-vs-recovery distinction.
- guides/projections-to-database.md "Batched replay": multi-projection
  family-rebuild example via the filter form.

Code examples in each doc keep the array form as the primary
illustration — concrete one-name calls read cleaner than filters in a
quickstart context. The prose around them now documents the broader
shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the coverage gaps surfaced after the filter-form additions:

- libs/act-pg: three new fault-injection tests in store.error.spec.ts
  cover the 'rowCount ?? 0' defensive branch on reset(filter),
  unblock(array), unblock(filter). Mirrors the existing prioritize
  test pattern (vi.spyOn pg.Pool.prototype.query → null rowCount).

- libs/act-sqlite: two new rollback-path tests in store.error.spec.ts
  cover the transaction error handler on unblock (both array and
  filter forms) via the existing mockClientFailOn fixture.

- libs/act-tck: the 'unblock preserves watermark' test was asserting
  that 's' wasn't in a subsequent claim() result. When the fixture
  state left claim() empty (no other claimable streams), the find()
  callback never ran and registered as uncovered. Switched to a
  query_streams() check on the blocked flag — deterministic, doesn't
  depend on what else the fixture has lying around.
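The coverage gap described for the TCK test comes from how `Array.prototype.find` behaves on an empty array: the callback is never invoked, so its body counts as uncovered. The sketch below contrasts that with a direct check on the blocked flag; `Lease`, `StreamRow`, and both helper functions are hypothetical and stand in for the real TCK code.

```typescript
type Lease = { stream: string };

// Old-style assertion input: look for "s" among claimed leases.
// With an empty claim result the callback body never executes.
function checkViaClaim(claimed: Lease[]): { found: boolean; callbackRuns: number } {
  let callbackRuns = 0;
  const hit = claimed.find((lease) => {
    callbackRuns++; // the line a coverage tool would flag as unexecuted
    return lease.stream === "s";
  });
  return { found: hit !== undefined, callbackRuns };
}

// Deterministic alternative: read the blocked flag for the stream itself,
// independent of what else happens to be claimable.
type StreamRow = { stream: string; blocked: boolean };

function isBlocked(rows: StreamRow[], name: string): boolean {
  return rows.some((row) => row.stream === name && row.blocked);
}
```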

Coverage: 100% statements / 100% branches / 100% functions / 100% lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Rotorsoft Rotorsoft merged commit 227c7de into master May 16, 2026
12 checks passed
@github-actions

🎉 This PR is included in version @rotorsoft/act-v0.43.0 🎉

Your semantic-release bot 📦🚀

@github-actions

🎉 This PR is included in version @rotorsoft/act-http-v0.2.0 🎉

Your semantic-release bot 📦🚀

@github-actions

🎉 This PR is included in version @rotorsoft/act-pg-v0.23.0 🎉

Your semantic-release bot 📦🚀

@github-actions

🎉 This PR is included in version @rotorsoft/act-sqlite-v0.7.0 🎉

Your semantic-release bot 📦🚀

@github-actions

🎉 This PR is included in version @rotorsoft/act-tck-v0.2.0 🎉

Your semantic-release bot 📦🚀

Development

Successfully merging this pull request may close these issues.

ACT-604: NonRetryableError for handler-signaled block-on-first-attempt
