Skip to content

fix(rc.12): publish ACK/allowance races + SPARQL parse-error classifier (#887, #888, #889)#896

Merged
Bojan131 merged 5 commits into
mainfrom
fix/rc12-publish-and-query
Jun 2, 2026
Merged

fix(rc.12): publish ACK/allowance races + SPARQL parse-error classifier (#887, #888, #889)#896
Bojan131 merged 5 commits into
mainfrom
fix/rc12-publish-and-query

Conversation

@Bojan131
Copy link
Copy Markdown
Contributor

@Bojan131 Bojan131 commented Jun 1, 2026

Summary

Three independent rc.12 bugs found during live devnet / edge-node testing. Each is a distinct root cause with its own deterministic unit tests; they're bundled here because they were all found in the same publish/query smoke pass. All three issues are open with the rc 12 label and no other PR addresses them (verified against open + recently-merged PRs).

Fixes #887 — fresh public-CG publish fails with NO_DATA_IN_SWM

A freshly-created public CG fails publish with storage_ack_insufficient because the publisher's ACK round races SWM-gossip propagation to its newly-assigned cores. NO_DATA_IN_SWM was already classified as a transient decline and retried, but it shared the transport budget (MAX_RETRIES = 3, ~3s total) — shorter than gossip propagation to fresh cores.

  • packages/publisher/src/ack-collector.ts: transient declines now get their own budget — MAX_TRANSIENT_DECLINE_RETRIES = 6 with capped-exponential backoff (~31s total, still well under the 120s ACK_TIMEOUT_MS). Transport errors keep the unchanged 3-attempt budget (a peer that never answers shouldn't hold the round open).
  • Added an injectable sleep dep so the multi-retry path is unit-testable without real delays.

Fixes #888 — intermittent TooLowAllowance(TRAC,0,1) on consecutive zero-cost publishes

This is the #870 residual. PR #876 closed #870 with tests only, explicitly naming "stale allowance from an RPC-side read cache" as the unaddressed candidate root cause. This adds the actual code fix:

  • packages/chain/src/evm-adapter.ts:
    • isTooLowAllowanceError(err) — robust classifier (decoded revert.name, message/shortMessage/reason, nested cause).
    • On a TooLowAllowance revert during populate/gas-estimation — which happens strictly pre-broadcast, before the onBroadcast WAL checkpoint — force a fresh approve up to the publish floor and confirm it is visible on the same read path (confirmAllowanceVisible, bounded poll) before retrying populate+sign exactly once. Any other error, or a second revert, propagates.
    • The visibility poll is gated on force, so the steady-state publish path issues exactly one allowance read + one approve as before (latency unchanged).
  • packages/chain/src/index.ts: export isTooLowAllowanceError.

Fixes #889POST /api/query returns 500 (not 400) for invalid SPARQL

The parse-error classifier missed oxigraph's native error at <line>:<col>: expected one of … shape, so malformed queries fell through to the 500 path.

  • packages/cli/src/daemon/routes/query.ts: widen the 400 classifier to match that shape, while still surfacing genuine server faults as 500.

Test plan

Notes / coordination

Made with Cursor

…er (#887, #888, #889)

Three independent rc.12 bugs found during live devnet/edge-node testing.

#887 — publisher: a freshly-created public CG fails publish with
storage_ack_insufficient because the ACK round races SWM gossip
propagation to its newly-assigned cores. NO_DATA_IN_SWM was already a
transient decline, but it shared the 3-attempt/~3s transport budget,
shorter than gossip propagation. Give transient declines their own
larger capped-exponential budget (6 retries, ~31s, still under the 120s
ACK timeout); transport errors keep the unchanged 3-attempt budget. Add
an injectable `sleep` so the multi-retry path is unit-testable instantly.

#888 — chain: intermittent TooLowAllowance(TRAC,0,1) on consecutive
zero-cost publishes (#870 residual). #876 closed #870 with tests only,
naming a stale RPC allowance read as the unaddressed root cause. Add the
code fix: `isTooLowAllowanceError` classifier; on that revert (which is
strictly pre-broadcast, before the WAL checkpoint) force a fresh approve
up to the publish floor and confirm it is visible on the read path
before retrying populate+sign exactly once. The visibility poll is gated
on `force`, so the steady-state publish path is unchanged.

#889 — cli: POST /api/query returned 500 instead of 400 for
syntactically-invalid SPARQL. The classifier missed oxigraph's native
`error at <line>:<col>: expected one of ...` shape; widen it to map
those to 400 while still surfacing genuine server faults as 500.

Tests: publisher suite 1097 pass / 6 skipped; chain unit 117 pass +
48 V10 publish-path; cli query 3 pass. All deterministic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/chain/src/evm-adapter.ts Outdated
Comment thread packages/publisher/src/ack-collector.ts
Comment thread packages/publisher/src/ack-collector.ts
Comment thread packages/chain/test/evm-adapter.unit.test.ts
Comment thread packages/chain/src/index.ts Outdated
…ery + bail ACK retries on quorum

🔴 chain: the #888 stale-allowance `TooLowAllowance` recovery (force a fresh
approve + retry populate/sign once, strictly pre-broadcast) was inlined in the
publish path only, leaving `updateV10` — including metadata-only updates,
floored to a 1-wei approve — exposed to the same race on a load-balanced RPC.
Extract it into a shared `populateAndSignV10WithAllowanceRecovery` helper and
route BOTH V10 write paths through it. Behaviour of the publish path is
unchanged. New parametrized unit tests pin the recovery for publish AND update,
plus one-shot give-up and unrelated-revert pass-through.

🟡 publisher: with the widened #887 transient-decline budget (~31s), a losing
peer could keep dialing for ~30s after `collect()` already reached quorum,
creating avoidable ACK traffic + log noise. `requestACK` now checks a
`roundIsOver()` guard (quorum reached, or the round settled via timeout/
impossible-quorum) before every backoff sleep and bails on wake. New
deterministic test: a slow transient-declining peer issues exactly one dial
then abandons once quorum forms elsewhere (vs the full 7-send budget).

Also exempt the two new TS-private chain helpers from mock-adapter parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/publisher/src/ack-collector.ts
Comment thread packages/chain/src/evm-adapter.ts Outdated
…oll + post-sleep ACK guard

Three Codex review findings on commit 2b1a7e1:

🔴 chain: confirmAllowanceVisible() polled `token.allowance()` with a raw
read and no timeout. On a hung / read-stalled RPC that await never rejects,
so the supposedly-bounded #888 recovery poll could block publish/update
indefinitely. Wrap each read in the existing `withTimeout(...,
RPC_READ_STALL_TIMEOUT_MS)` so a stalled read is treated as not-yet-visible
and backs off like any other failed read. New fake-timer test proves the
poll resolves (does not hang) when every read stalls forever.

🔴 publisher: the ACK `roundIsOver()` quorum-bail guard was only checked
BEFORE each backoff sleep. If quorum settled WHILE a peer was mid-backoff,
the next `continue` still fired one more `sendP2P`. Add a second guard
immediately after each sleep (both the transient-decline and transport-retry
branches) so a decided round never leaks another dial. New deterministic
test: a peer that declines pre-quorum, sleeps, and wakes to a settled round
issues exactly one dial.

🟡 chain: `isTooLowAllowanceError` is only used inside the chain package
(evm-adapter + its unit test, which already imports it directly from
./evm-adapter.js). Drop the package-root re-export so it stays internal
rather than becoming public API surface.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review produced 1 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.

@Bojan131
Copy link
Copy Markdown
Contributor Author

Bojan131 commented Jun 2, 2026

Addressed the latest Codex review (commit `a07550341`) and resolved the threads:

  • 🔴 evm-adapter.ts confirmAllowanceVisible() — each token.allowance() poll read is now wrapped in withTimeout(..., RPC_READ_STALL_TIMEOUT_MS), so a hung/read-stalled RPC can no longer block the (supposedly bounded) recovery poll. New fake-timer test proves the poll resolves even when every read stalls forever.
  • 🔴 ack-collector.ts — the roundIsOver() quorum-bail guard is now also checked after each backoff sleep (both the transient-decline and transport-retry branches), so a round that settles during a sleep never leaks one more sendP2P. New deterministic test covers the settle-during-sleep case.
  • 🟡 index.ts — dropped the package-root re-export of isTooLowAllowanceError; it stays internal (the unit test already imports it directly from ./evm-adapter.js).
  • 🟡 earlier-round items (shared publish/update recovery helper + ACK quorum bail + retry-loop test coverage) remain addressed.

chain (144) + publisher (54) suites green locally, incl. the new tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment