chore(rc.11): test-infra fixes captured during integration sweep by branarakic · Pull Request #673 · OriginTrail/dkg

branarakic · 2026-05-26T07:32:31Z

Summary

Three commits pulled forward from the release/v10.0.0-rc.11 integration
branch — pure test-infrastructure fixes that surfaced while running the
RFC-38 sweep + targeted devnet scenarios against the merged rc.11 PR
universe. None of them touch production code; they make the existing
test surface honest about what it's checking.

Commits

ecf2bdd3 fix(rc.11): lu8 sweep script — replace unbound OK_FLAG with EXPLICIT_OK
scripts/devnet-test-rfc38-lu8.sh summary block referenced an
$OK_FLAG variable that was never defined; the script ran with
set -u and crashed in the LAST line after all real assertions had
already passed. Round-1 of the integration sweep counted that as
1 FAIL out of 11. Fix: print $EXPLICIT_OK (the variable the
scenario actually sets) instead. Round-2 sweep then went 11/11
without changing any actual LU-8 behaviour.
c38b04df test(rc.11): two new devnet scenarios for async-promote + shutdown
Two targeted scenarios that the existing RFC-38 sweep didn't cover:
- devnet-test-rc11-promote-crash-recovery.sh —
  enqueues a 200-triple promote on node 5, tight-polls for
  state=running (worker claimed the lease), SIGKILLs the daemon,
  restarts node 5 via devnet.sh restart-node 5, asserts the
  post-restart job state is NOT in a forbidden RFC §6.2 state
  (no running → queued demote; no spurious abandonment of a
  still-valid lease). Validates the integration plumbing of the
  async-promote stack (PRs feat(publisher): TripleStoreAsyncPromoteQueue library (PR 1/4) #657 → feat(agent,daemon): wire async-promote queue + 5 HTTP routes (PR 2/4) #660 → feat(daemon): PR #3 — async-promote worker supervisor + lifecycle wiring #665 → feat(daemon): PR #4 — async-promote queue config knobs + E2E test + SKILL.md docs #667) survives a
  hard process kill. Requires the /api/assertion/<name>/promote-async
  route from PR feat(agent,daemon): wire async-promote queue + 5 HTTP routes (PR 2/4) #660 — will be exercisable as soon as that PR
  lands; sits dormant until then. Pass-class matrix in the script
  header documents every acceptable outcome.
- devnet-test-rc11-shutdown-mid-publish.sh —
  node 5 (edge) launches 5 concurrent publishes against cores 1–4
  to saturate StorageACK reads, then SIGTERMs core node 2
  mid-flight. Measures shutdown wall-clock and greps daemon.log
  for new [shutdown-timeout] lines. Classifies the interaction
  between PR fix(cli/daemon): hard timeout on graceful shutdown to recover from cleanup deadlocks #655 (15s hard shutdown cap) and PR feat(core): plumb AbortSignal through DKGNode.stop() into protocol-router reads (PR-6) #669
  (AbortSignal through DKGNode.stop()) — green when shutdown
  completes <15s with zero new [shutdown-timeout] lines (PR feat(core): plumb AbortSignal through DKGNode.stop() into protocol-router reads (PR-6) #669
  is the active drain path; PR fix(cli/daemon): hard timeout on graceful shutdown to recover from cleanup deadlocks #655 stays passive). Uses
  node-based timing (now_ms()) so it runs identically on
  macOS BSD date and Linux GNU date. Requires PRs fix(cli/daemon): hard timeout on graceful shutdown to recover from cleanup deadlocks #655 + feat(core): plumb AbortSignal through DKGNode.stop() into protocol-router reads (PR-6) #669
  at runtime for the GREEN classification; will exit with a
  SOFT-PASS or FAIL classification on stock rc.10 so it can also
  serve as a pre-merge sanity check.

What's not in this PR

A third fix from the integration branch — 36fe4c91 fix(rc.11): align async-promote e2e test with PR #660's 200 ship contract — cannot
cleanly cherry-pick onto main because the file it touches
(packages/cli/test/async-promote-queue-e2e.test.ts) is introduced by
PR #667, which hasn't landed yet. Once #667 lands I'll open a tiny
follow-up PR (or @branarakic can absorb the 1-line change directly into
#667). The fix is:

-    expect(status).toBe(202);
+    // PR #660's wiring ships HTTP 200 on enqueue success, which the
+    // 24 sibling tests in promote-async-routes.test.ts encode as the
+    // de-facto API contract. Aligning with the ship contract until a
+    // future RC can do the 200 → 202 (HTTP-canonical for async
+    // accepted) breaking change for external clients.
+    expect(status).toBe(200);

Test plan

scripts/devnet-test-rfc38-lu8.sh exits 0 with the same actual
LU-8 assertions as before (no behavioural change).
scripts/devnet-test-rfc38-all.sh was 11/11 PASS on the
integration branch after this fix.
devnet-test-rc11-promote-crash-recovery.sh: PASS on the
integration branch (job state=succeeded post-restart, integration
plumbing intact).
devnet-test-rc11-shutdown-mid-publish.sh: GREEN on the
integration branch (5459ms shutdown, zero new
[shutdown-timeout] lines).

Why now

These fixes were authored during the rc.11 integration sweep
(release/v10.0.0-rc.11 branch — informational, not for merge). The
runbook docs/runbooks/RELEASE_RC11_INTEGRATION.md on that branch
captures the full sweep evidence. This PR pulls forward the
behaviour-neutral test-infra parts so they land on main and end up
in the rc.11 tag — no other rc.11 PR carries them.

Made with Cursor

The LU-8 devnet sweep script set `set -u`, ran all 3 scenarios green, printed the success banner, and then crashed with `OK_FLAG: unbound variable` while emitting the summary because the author renamed the scenario-1 result variable to EXPLICIT_OK but forgot to update the summary block. Renames the summary reference and relabels it as "scenario 1b" to match what's actually being captured (the explicit-quads happy-path). Surfaced by the rc.11 integration round-1 sweep: lu8 itself was the sole "failure" (exit code 1 from the unbound-variable trap) even though every actual assertion passed. Not a regression in any of the merged PRs. Co-authored-by: Cursor <cursoragent@cursor.com>

Two scenarios that didn't exist in the RFC-38 sweep but cover invariants specific to the rc.11 PR universe: devnet-test-rc11-promote-crash-recovery.sh Validates the async-promote queue (#657 → #667) survives a hard SIGKILL/restart cycle without violating RFC §6.2 (no `running → queued` demote; no spurious abandonment of running jobs whose lease hasn't expired). On devnet the 5-minute default `leaseMs` makes the expired-lease path unreachable in a single run — that path is covered by the 27 unit tests in async-promote-queue.test.ts. The devnet scenario validates the integration plumbing (daemon ↔ queue ↔ store ↔ worker supervisor) end-to-end. Classifies the post-restart state per a matrix of acceptable outcomes (succeeded = worker drained pre-kill, running = preserved with valid lease, failed-with-lease-expiry = abandoned by recovery), all of which count as PASS; only queued or failed-with-other-reason counts as FAIL. devnet-test-rc11-shutdown-mid-publish.sh Exercises PR-1 (#655 — 15s hard shutdown timeout) and PR-6 (#669 — AbortSignal through DKGNode.stop() to libp2p reads) together. Edge node 5 launches 5 concurrent publishes against the core nodes, then the script SIGTERMs core node 2 mid-flight. Pass = shutdown completes under 15s; classification distinguishes: GREEN: <15s + zero new [shutdown-timeout] lines → PR-6 is the active path, PR-1 passive defense-in-depth. SOFT-PASS: <15.5s + ≥1 new [shutdown-timeout] line → PR-1 caught a real PR-6 gap; file a follow-up. FAIL: shutdown took >15.5s → both PR-1 and PR-6 broken. First run on the integration branch: GREEN — daemon dead in 5459ms with zero shutdown-timeout lines while 5 concurrent publishes were in flight. Both scripts include macOS-portable timing (node's Date.now() via now_ms()) and grep-counter helpers that survive `set -o pipefail`. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-26T07:35:50Z

+DAEMON_PID=$(cat "$PIDFILE")
+log ""
+log "SIGKILL daemon (pid=$DAEMON_PID) — bypassing graceful-shutdown to reproduce a hard crash..."
+kill -9 "$DAEMON_PID" 2>/dev/null || warn "kill -9 returned non-zero; pid may already be gone"


🔴 Bug: devnet.pid is the foreground supervisor, not the inner daemon worker. Sending SIGKILL only to this PID can leave daemon-foreground-worker alive, so the old process may keep the API port/store open and the restart path no longer reflects a real crash. Kill both devnet.pid and daemon.pid (or the whole process group) before asserting the node is down.

github-actions · 2026-05-26T07:35:50Z

+      warn "job drained before kill could land (state=succeeded). Worker faster than poll loop — rerun for a reliable crash window."
+      log "  raw status: $STATUS"
+      log "RESULT: INCONCLUSIVE (worker finished before kill)"
+      exit 0


🔴 Bug: this exits 0 on the exact path where the crash/restart scenario was never exercised (state=succeeded before the kill). In automated validation that becomes a false green for the recovery logic. Either retry until you actually observe running, or exit non-zero so callers can distinguish an inconclusive run from a pass.

github-actions · 2026-05-26T07:35:50Z

+      > "$TMP_OUT_DIR/write-$i.json" 2>&1 || true
+    api_call "$LOAD_NODE" POST /api/shared-memory/publish "$(cat <<EOF
+{ "contextGraphId": "$CG_ID",
+  "selection": { "kind": "sparql",


🔴 Bug: /api/shared-memory/publish does not accept a SPARQL-shaped selection object here; the route only supports "all" or a root-entity list. At runtime this request never creates publish load, and the surrounding || true hides the failure, so the script ends up timing an idle shutdown instead of shutdown under StorageACK pressure. Use an isolated CG/subgraph plus selection: "all", or resolve the root entities client-side and pass them as an array.

Bump root + 17 workspace packages from 10.0.0-rc.10 to 10.0.0-rc.11. Promote the CHANGELOG "Unreleased" block to the dated rc.11 section. Release contents (PR OriginTrail#680 — release/rc.11 integration branch): Core-stability hardening (rc.10 deadlock workstream): OriginTrail#655 hard shutdown timeout OriginTrail#657 async-promote queue library OriginTrail#659 auto-update install-source override OriginTrail#669 AbortSignal plumbing through DKGNode.stop() OriginTrail#670 chain provider filter log-spam silencer OriginTrail#666 dkg migrate-to-npm CLI subcommand OriginTrail#668 AutoNAT boot self-probe OriginTrail#661 core relay capability sanity check OriginTrail#662 relay metrics in /api/status OriginTrail#664 supervisor positive-liveness probe ERC-721 mint ordering: OriginTrail#681 CEI mint-last at every mint site (supersedes OriginTrail#663, which proposed _safeMint and was rejected as a public-API break for older Gnosis Safes / DAO timelocks / strategy wrappers). Keeps _mint; reorders so _mint is the last state-changing call. relock moves _burn before _mint. Async-promote queue stack: OriginTrail#660 /promote-async route wiring with worker-readiness gate OriginTrail#665 async-promote worker supervisor OriginTrail#667 async-promote queue config + e2e tests Honest ACK + tentative VM cleanup: OriginTrail#671 delete self-signed ACK fallback + tentative-VM concept OriginTrail#672 typed errors + LU-6 runbook + provenance telemetry Test infra: OriginTrail#673 rc.11 test infrastructure fixes Verification on the integration branch (release/rc.11): pnpm -r build clean pnpm --filter @origintrail-official/dkg test:unit 403/403 PASS evm-module 278/278 PASS (NFT + CG contract tests) devnet-test-rc11-promote-crash-recovery.sh GREEN devnet-test-rc11-shutdown-mid-publish.sh GREEN (549ms shutdown, 0 [shutdown-timeout] lines) devnet-test-rfc38-all.sh 10/11 PASS (lj is the pre-existing documented LU-6 cores-only gap) devnet-test.sh 343/347 PASS — 4 fails tracked in OriginTrail#676 as stale test expectations against OriginTrail#671's seal contract + V10 auto-registration. Co-authored-by: Cursor <cursoragent@cursor.com>

PR OriginTrail#673 follow-up — three false-green / wrong-target issues Codex flagged in the final review pass that the rc.11 cut deferred: 1. **promote-crash-recovery.sh** SIGKILL hit only the supervisor pid (OriginTrail#673#discussion_r3302023868). `devnet.pid` is the foreground supervisor process — `daemon-foreground-worker` keeps the API port and SQLite store open after `kill -9 <supervisor>`, so the restart path doesn't reproduce a real crash. Now reads BOTH `devnet.pid` (supervisor) and `daemon.pid` (worker) and kills both, with the alive-after-SIGKILL gate checking both pids. 2. **promote-crash-recovery.sh** exit code on the inconclusive path (OriginTrail#673#discussion_r3302023872). When the worker drains the job before the kill could land, the script used to `exit 0`, hiding the fact that the crash-recovery path was never exercised. Now `exit 2` so CI can distinguish a real pass from a missed window. 3. **shutdown-mid-publish.sh** SPARQL-shaped selection silently rejected (OriginTrail#673#discussion_r3302023873). `/api/shared-memory/publish` accepts `selection: "all"` or a root- entity string array — not `{ kind: "sparql", query: ... }`. The surrounding `|| true` was hiding the failure, so the script ended up timing an idle shutdown instead of shutdown under StorageACK pressure. Now passes the 8 generated root entities directly. Co-authored-by: Cursor <cursoragent@cursor.com>

branarakic and others added 2 commits May 26, 2026 09:31

branarakic requested review from Jurij89 and zsculac as code owners May 26, 2026 07:32

github-actions Bot reviewed May 26, 2026

View reviewed changes

This was referenced May 26, 2026

rc.12 follow-ups from rc.11 review feedback #676

Open

release: v10.0.0-rc.11 integration branch #680

Merged

branarakic merged commit 7a7de6e into main May 26, 2026
36 checks passed

This was referenced May 26, 2026

chore(release): v10.0.0-rc.11 #683

Merged

fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep #685

Closed

release: v10.0.0-rc.12 integration branch #716

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(rc.11): test-infra fixes captured during integration sweep#673

chore(rc.11): test-infra fixes captured during integration sweep#673
branarakic merged 2 commits into
mainfrom
chore/rc11-test-infra-fixes

branarakic commented May 26, 2026

Uh oh!

github-actions Bot May 26, 2026

Uh oh!

github-actions Bot May 26, 2026

Uh oh!

github-actions Bot May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

branarakic commented May 26, 2026

Summary

Commits

What's not in this PR

Test plan

Why now

Uh oh!

github-actions Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant