Skip to content

chore(rc.11): test-infra fixes captured during integration sweep#673

Merged
branarakic merged 2 commits into
mainfrom
chore/rc11-test-infra-fixes
May 26, 2026
Merged

chore(rc.11): test-infra fixes captured during integration sweep#673
branarakic merged 2 commits into
mainfrom
chore/rc11-test-infra-fixes

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

Summary

Three commits pulled forward from the release/v10.0.0-rc.11 integration
branch — pure test-infrastructure fixes that surfaced while running the
RFC-38 sweep + targeted devnet scenarios against the merged rc.11 PR
universe. None of them touch production code; they make the existing
test surface honest about what it's checking.

Commits

  1. ecf2bdd3 fix(rc.11): lu8 sweep script — replace unbound OK_FLAG with EXPLICIT_OK
    scripts/devnet-test-rfc38-lu8.sh summary block referenced an
    $OK_FLAG variable that was never defined; the script ran with
    set -u and crashed in the LAST line after all real assertions had
    already passed. Round-1 of the integration sweep counted that as
    1 FAIL out of 11. Fix: print $EXPLICIT_OK (the variable the
    scenario actually sets) instead. Round-2 sweep then went 11/11
    without changing any actual LU-8 behaviour.

  2. c38b04df test(rc.11): two new devnet scenarios for async-promote + shutdown
    Two targeted scenarios that the existing RFC-38 sweep didn't cover:

What's not in this PR

A third fix from the integration branch — 36fe4c91 fix(rc.11): align async-promote e2e test with PR #660's 200 ship contract — cannot
cleanly cherry-pick onto main because the file it touches
(packages/cli/test/async-promote-queue-e2e.test.ts) is introduced by
PR #667, which hasn't landed yet. Once #667 lands I'll open a tiny
follow-up PR (or @branarakic can absorb the 1-line change directly into
#667). The fix is:

-    expect(status).toBe(202);
+    // PR #660's wiring ships HTTP 200 on enqueue success, which the
+    // 24 sibling tests in promote-async-routes.test.ts encode as the
+    // de-facto API contract. Aligning with the ship contract until a
+    // future RC can do the 200 → 202 (HTTP-canonical for async
+    // accepted) breaking change for external clients.
+    expect(status).toBe(200);

Test plan

  • scripts/devnet-test-rfc38-lu8.sh exits 0 with the same actual
    LU-8 assertions as before (no behavioural change).
  • scripts/devnet-test-rfc38-all.sh was 11/11 PASS on the
    integration branch after this fix.
  • devnet-test-rc11-promote-crash-recovery.sh: PASS on the
    integration branch (job state=succeeded post-restart, integration
    plumbing intact).
  • devnet-test-rc11-shutdown-mid-publish.sh: GREEN on the
    integration branch (5459ms shutdown, zero new
    [shutdown-timeout] lines).

Why now

These fixes were authored during the rc.11 integration sweep
(release/v10.0.0-rc.11 branch — informational, not for merge). The
runbook docs/runbooks/RELEASE_RC11_INTEGRATION.md on that branch
captures the full sweep evidence. This PR pulls forward the
behaviour-neutral test-infra parts so they land on main and end up
in the rc.11 tag — no other rc.11 PR carries them.

Made with Cursor

branarakic and others added 2 commits May 26, 2026 09:31
The LU-8 devnet sweep script set `set -u`, ran all 3 scenarios green,
printed the success banner, and then crashed with `OK_FLAG: unbound
variable` while emitting the summary because the author renamed the
scenario-1 result variable to EXPLICIT_OK but forgot to update the
summary block. Renames the summary reference and relabels it as
"scenario 1b" to match what's actually being captured (the
explicit-quads happy-path).

Surfaced by the rc.11 integration round-1 sweep: lu8 itself was the
sole "failure" (exit code 1 from the unbound-variable trap) even
though every actual assertion passed. Not a regression in any of the
merged PRs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two scenarios that didn't exist in the RFC-38 sweep but cover invariants
specific to the rc.11 PR universe:

devnet-test-rc11-promote-crash-recovery.sh
  Validates the async-promote queue (#657#667) survives a hard
  SIGKILL/restart cycle without violating RFC §6.2 (no `running →
  queued` demote; no spurious abandonment of running jobs whose lease
  hasn't expired). On devnet the 5-minute default `leaseMs` makes the
  expired-lease path unreachable in a single run — that path is covered
  by the 27 unit tests in async-promote-queue.test.ts. The devnet
  scenario validates the integration plumbing (daemon ↔ queue ↔ store ↔
  worker supervisor) end-to-end. Classifies the post-restart state per
  a matrix of acceptable outcomes (succeeded = worker drained pre-kill,
  running = preserved with valid lease, failed-with-lease-expiry =
  abandoned by recovery), all of which count as PASS; only queued or
  failed-with-other-reason counts as FAIL.

devnet-test-rc11-shutdown-mid-publish.sh
  Exercises PR-1 (#655 — 15s hard shutdown timeout) and PR-6 (#669 —
  AbortSignal through DKGNode.stop() to libp2p reads) together. Edge
  node 5 launches 5 concurrent publishes against the core nodes, then
  the script SIGTERMs core node 2 mid-flight. Pass = shutdown completes
  under 15s; classification distinguishes:
    GREEN: <15s + zero new [shutdown-timeout] lines → PR-6 is the
           active path, PR-1 passive defense-in-depth.
    SOFT-PASS: <15.5s + ≥1 new [shutdown-timeout] line → PR-1 caught
           a real PR-6 gap; file a follow-up.
    FAIL: shutdown took >15.5s → both PR-1 and PR-6 broken.

First run on the integration branch: GREEN — daemon dead in 5459ms with
zero shutdown-timeout lines while 5 concurrent publishes were in flight.

Both scripts include macOS-portable timing (node's Date.now() via
now_ms()) and grep-counter helpers that survive `set -o pipefail`.

Co-authored-by: Cursor <cursoragent@cursor.com>
DAEMON_PID=$(cat "$PIDFILE")
log ""
log "SIGKILL daemon (pid=$DAEMON_PID) — bypassing graceful-shutdown to reproduce a hard crash..."
kill -9 "$DAEMON_PID" 2>/dev/null || warn "kill -9 returned non-zero; pid may already be gone"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: devnet.pid is the foreground supervisor, not the inner daemon worker. Sending SIGKILL only to this PID can leave daemon-foreground-worker alive, so the old process may keep the API port/store open and the restart path no longer reflects a real crash. Kill both devnet.pid and daemon.pid (or the whole process group) before asserting the node is down.

warn "job drained before kill could land (state=succeeded). Worker faster than poll loop — rerun for a reliable crash window."
log " raw status: $STATUS"
log "RESULT: INCONCLUSIVE (worker finished before kill)"
exit 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this exits 0 on the exact path where the crash/restart scenario was never exercised (state=succeeded before the kill). In automated validation that becomes a false green for the recovery logic. Either retry until you actually observe running, or exit non-zero so callers can distinguish an inconclusive run from a pass.

> "$TMP_OUT_DIR/write-$i.json" 2>&1 || true
api_call "$LOAD_NODE" POST /api/shared-memory/publish "$(cat <<EOF
{ "contextGraphId": "$CG_ID",
"selection": { "kind": "sparql",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: /api/shared-memory/publish does not accept a SPARQL-shaped selection object here; the route only supports "all" or a root-entity list. At runtime this request never creates publish load, and the surrounding || true hides the failure, so the script ends up timing an idle shutdown instead of shutdown under StorageACK pressure. Use an isolated CG/subgraph plus selection: "all", or resolve the root entities client-side and pass them as an array.

@branarakic branarakic merged commit 7a7de6e into main May 26, 2026
36 checks passed
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
Bump root + 17 workspace packages from 10.0.0-rc.10 to 10.0.0-rc.11.
Promote the CHANGELOG "Unreleased" block to the dated rc.11 section.

Release contents (PR OriginTrail#680 — release/rc.11 integration branch):

  Core-stability hardening (rc.10 deadlock workstream):
    OriginTrail#655 hard shutdown timeout
    OriginTrail#657 async-promote queue library
    OriginTrail#659 auto-update install-source override
    OriginTrail#669 AbortSignal plumbing through DKGNode.stop()
    OriginTrail#670 chain provider filter log-spam silencer
    OriginTrail#666 dkg migrate-to-npm CLI subcommand
    OriginTrail#668 AutoNAT boot self-probe
    OriginTrail#661 core relay capability sanity check
    OriginTrail#662 relay metrics in /api/status
    OriginTrail#664 supervisor positive-liveness probe

  ERC-721 mint ordering:
    OriginTrail#681 CEI mint-last at every mint site (supersedes OriginTrail#663,
         which proposed _safeMint and was rejected as a public-API
         break for older Gnosis Safes / DAO timelocks / strategy
         wrappers). Keeps _mint; reorders so _mint is the last
         state-changing call. relock moves _burn before _mint.

  Async-promote queue stack:
    OriginTrail#660 /promote-async route wiring with worker-readiness gate
    OriginTrail#665 async-promote worker supervisor
    OriginTrail#667 async-promote queue config + e2e tests

  Honest ACK + tentative VM cleanup:
    OriginTrail#671 delete self-signed ACK fallback + tentative-VM concept
    OriginTrail#672 typed errors + LU-6 runbook + provenance telemetry

  Test infra:
    OriginTrail#673 rc.11 test infrastructure fixes

Verification on the integration branch (release/rc.11):
  pnpm -r build                              clean
  pnpm --filter @origintrail-official/dkg test:unit   403/403 PASS
  evm-module 278/278 PASS (NFT + CG contract tests)
  devnet-test-rc11-promote-crash-recovery.sh GREEN
  devnet-test-rc11-shutdown-mid-publish.sh   GREEN (549ms shutdown,
                                             0 [shutdown-timeout] lines)
  devnet-test-rfc38-all.sh                   10/11 PASS (lj is the
                                             pre-existing documented
                                             LU-6 cores-only gap)
  devnet-test.sh                             343/347 PASS — 4 fails
                                             tracked in OriginTrail#676 as stale
                                             test expectations against
                                             OriginTrail#671's seal contract +
                                             V10 auto-registration.

Co-authored-by: Cursor <cursoragent@cursor.com>
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
PR OriginTrail#673 follow-up — three false-green / wrong-target issues Codex flagged
in the final review pass that the rc.11 cut deferred:

1. **promote-crash-recovery.sh** SIGKILL hit only the supervisor pid
   (OriginTrail#673#discussion_r3302023868). `devnet.pid` is the foreground
   supervisor process — `daemon-foreground-worker` keeps the API port
   and SQLite store open after `kill -9 <supervisor>`, so the restart
   path doesn't reproduce a real crash. Now reads BOTH `devnet.pid`
   (supervisor) and `daemon.pid` (worker) and kills both, with the
   alive-after-SIGKILL gate checking both pids.

2. **promote-crash-recovery.sh** exit code on the inconclusive path
   (OriginTrail#673#discussion_r3302023872). When the worker drains the job
   before the kill could land, the script used to `exit 0`, hiding
   the fact that the crash-recovery path was never exercised. Now
   `exit 2` so CI can distinguish a real pass from a missed window.

3. **shutdown-mid-publish.sh** SPARQL-shaped selection silently
   rejected (OriginTrail#673#discussion_r3302023873).
   `/api/shared-memory/publish` accepts `selection: "all"` or a root-
   entity string array — not `{ kind: "sparql", query: ... }`. The
   surrounding `|| true` was hiding the failure, so the script ended
   up timing an idle shutdown instead of shutdown under StorageACK
   pressure. Now passes the 8 generated root entities directly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant