fix(rc.12): publish ACK/allowance races + SPARQL parse-error classifier (#887, #888, #889)#896
Merged
Merged
Conversation
…er (#887, #888, #889) Three independent rc.12 bugs found during live devnet/edge-node testing. #887 — publisher: a freshly-created public CG fails publish with storage_ack_insufficient because the ACK round races SWM gossip propagation to its newly-assigned cores. NO_DATA_IN_SWM was already a transient decline, but it shared the 3-attempt/~3s transport budget, shorter than gossip propagation. Give transient declines their own larger capped-exponential budget (6 retries, ~31s, still under the 120s ACK timeout); transport errors keep the unchanged 3-attempt budget. Add an injectable `sleep` so the multi-retry path is unit-testable instantly. #888 — chain: intermittent TooLowAllowance(TRAC,0,1) on consecutive zero-cost publishes (#870 residual). #876 closed #870 with tests only, naming a stale RPC allowance read as the unaddressed root cause. Add the code fix: `isTooLowAllowanceError` classifier; on that revert (which is strictly pre-broadcast, before the WAL checkpoint) force a fresh approve up to the publish floor and confirm it is visible on the read path before retrying populate+sign exactly once. The visibility poll is gated on `force`, so the steady-state publish path is unchanged. #889 — cli: POST /api/query returned 500 instead of 400 for syntactically-invalid SPARQL. The classifier missed oxigraph's native `error at <line>:<col>: expected one of ...` shape; widen it to map those to 400 while still surfacing genuine server faults as 500. Tests: publisher suite 1097 pass / 6 skipped; chain unit 117 pass + 48 V10 publish-path; cli query 3 pass. All deterministic. Co-authored-by: Cursor <cursoragent@cursor.com>
3 tasks
…888) Co-authored-by: Cursor <cursoragent@cursor.com>
Jurij89
approved these changes
Jun 2, 2026
…ery + bail ACK retries on quorum 🔴 chain: the #888 stale-allowance `TooLowAllowance` recovery (force a fresh approve + retry populate/sign once, strictly pre-broadcast) was inlined in the publish path only, leaving `updateV10` — including metadata-only updates, floored to a 1-wei approve — exposed to the same race on a load-balanced RPC. Extract it into a shared `populateAndSignV10WithAllowanceRecovery` helper and route BOTH V10 write paths through it. Behaviour of the publish path is unchanged. New parametrized unit tests pin the recovery for publish AND update, plus one-shot give-up and unrelated-revert pass-through. 🟡 publisher: with the widened #887 transient-decline budget (~31s), a losing peer could keep dialing for ~30s after `collect()` already reached quorum, creating avoidable ACK traffic + log noise. `requestACK` now checks a `roundIsOver()` guard (quorum reached, or the round settled via timeout/ impossible-quorum) before every backoff sleep and bails on wake. New deterministic test: a slow transient-declining peer issues exactly one dial then abandons once quorum forms elsewhere (vs the full 7-send budget). Also exempt the two new TS-private chain helpers from mock-adapter parity. Co-authored-by: Cursor <cursoragent@cursor.com>
…oll + post-sleep ACK guard Three Codex review findings on commit 2b1a7e1: 🔴 chain: confirmAllowanceVisible() polled `token.allowance()` with a raw read and no timeout. On a hung / read-stalled RPC that await never rejects, so the supposedly-bounded #888 recovery poll could block publish/update indefinitely. Wrap each read in the existing `withTimeout(..., RPC_READ_STALL_TIMEOUT_MS)` so a stalled read is treated as not-yet-visible and backs off like any other failed read. New fake-timer test proves the poll resolves (does not hang) when every read stalls forever. 🔴 publisher: the ACK `roundIsOver()` quorum-bail guard was only checked BEFORE each backoff sleep. If quorum settled WHILE a peer was mid-backoff, the next `continue` still fired one more `sendP2P`. Add a second guard immediately after each sleep (both the transient-decline and transport-retry branches) so a decided round never leaks another dial. New deterministic test: a peer that declines pre-quorum, sleeps, and wakes to a settled round issues exactly one dial. 🟡 chain: `isTooLowAllowanceError` is only used inside the chain package (evm-adapter + its unit test, which already imports it directly from ./evm-adapter.js). Drop the package-root re-export so it stays internal rather than becoming public API surface. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Codex review produced 1 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.
Contributor
Author
|
Addressed the latest Codex review (commit `a07550341`) and resolved the threads:
|
This was referenced Jun 2, 2026
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three independent rc.12 bugs found during live devnet / edge-node testing. Each is a distinct root cause with its own deterministic unit tests; they're bundled here because they were all found in the same publish/query smoke pass. All three issues are open with the
rc 12label and no other PR addresses them (verified against open + recently-merged PRs).Fixes #887 — fresh public-CG publish fails with
NO_DATA_IN_SWMA freshly-created public CG fails publish with
storage_ack_insufficientbecause the publisher's ACK round races SWM-gossip propagation to its newly-assigned cores.NO_DATA_IN_SWMwas already classified as a transient decline and retried, but it shared the transport budget (MAX_RETRIES = 3, ~3s total) — shorter than gossip propagation to fresh cores.packages/publisher/src/ack-collector.ts: transient declines now get their own budget —MAX_TRANSIENT_DECLINE_RETRIES = 6with capped-exponential backoff (~31s total, still well under the 120sACK_TIMEOUT_MS). Transport errors keep the unchanged 3-attempt budget (a peer that never answers shouldn't hold the round open).sleepdep so the multi-retry path is unit-testable without real delays.Fixes #888 — intermittent
TooLowAllowance(TRAC,0,1)on consecutive zero-cost publishesThis is the
#870residual. PR #876 closed #870 with tests only, explicitly naming "stale allowance from an RPC-side read cache" as the unaddressed candidate root cause. This adds the actual code fix:packages/chain/src/evm-adapter.ts:isTooLowAllowanceError(err)— robust classifier (decodedrevert.name, message/shortMessage/reason, nested cause).TooLowAllowancerevert during populate/gas-estimation — which happens strictly pre-broadcast, before theonBroadcastWAL checkpoint — force a fresh approve up to the publish floor and confirm it is visible on the same read path (confirmAllowanceVisible, bounded poll) before retrying populate+sign exactly once. Any other error, or a second revert, propagates.force, so the steady-state publish path issues exactly one allowance read + one approve as before (latency unchanged).packages/chain/src/index.ts: exportisTooLowAllowanceError.Fixes #889 —
POST /api/queryreturns 500 (not 400) for invalid SPARQLThe parse-error classifier missed oxigraph's native
error at <line>:<col>: expected one of …shape, so malformed queries fell through to the 500 path.packages/cli/src/daemon/routes/query.ts: widen the 400 classifier to match that shape, while still surfacing genuine server faults as 500.Test plan
packages/publisher— 1097 passed / 6 skipped (incl. 2 newack-collector.test.tscases: retries past the old budget then succeeds; bounded give-up at 7 sends; and the 6 typed-decline edge-case tests updated for the new budget with collapsed backoff).packages/chainunit — 117 passed (9 new:isTooLowAllowanceErrorclassifier +ensureV10ApproveTracforced re-approve / visibility-poll / stale-high-skip / bounded-give-up), plus 48 V10 publish-path / decode tests.packages/cli—query.test.ts3 passed (oxigraph-parse→400 + a guard that real server faults stay 500).Notes / coordination
<contextGraphsServed>#556 (open) edits the same two#887files (ack-collector.ts,v10-ack-edge-cases.test.ts) for an orthogonal hosting-filter fix — whichever lands second needs a small rebase. Different logic, not a conflict in intent.console.warn); my changes are additive (separatedescribeblocks, no test collision).Made with Cursor