OT-RFC-38 LU-6 C2 — curator-offline mid-batch devnet harness by branarakic · Pull Request #622 · OriginTrail/dkg

branarakic · 2026-05-24T23:45:46Z

Summary

Validates the RFC §1.3 resilience contract for OPEN publishPolicy curated CGs: once the curator has gossiped a batch's SWM envelopes, the publish-to-VM step no longer requires the curator to stay online. Any allowlisted member can finish the path.

Stacked on #610. Pure devnet harness; no production-code changes.

Test phases

Curator (N5) creates an OPEN publishPolicy curated CG with allowlist `[curator, M1=N6, M2=N4]`; all three pre-create.
Curator writes 4 triples; members catch up so ciphertext lands on their disks.
Curator's daemon is `kill -TERM`'d (offline simulation).
Asserts M1 + M2 still hold the 4 triples (no regression after curator goes offline).
M1 (non-curator) calls `/api/shared-memory/publish`; must return `status=confirmed` + a `txHash`.
Outsider (N1) confirms the published KC's `merkleRoot` exists on chain.
Restarts the curator's daemon for devnet hygiene.

Test plan

Bash `-n` syntax check passes
Devnet run — invoked manually post-merge as part of the LU-6 mainnet validation sweep

Not covered (intentionally)

Curator-only `publishPolicy` FAILURE case (existing publisher unit tests cover the 403 path).
Catchup serving while curator offline: SCENARIO B in `devnet-test-rfc38-late-joiner.sh` already covers that.

Made with Cursor

…arness Validates the RFC §1.3 resilience contract for OPEN publishPolicy CGs: once the curator has gossiped a batch's SWM envelopes to the cores, the publish-to-VM step no longer requires the curator's participation. Any allowlisted member can finish the path. Test phases: 1. Curator (N5) creates an OPEN publishPolicy curated CG with [curator, M1=N6, M2=N4] in the allowlist; all three pre-create. 2. Curator writes 4 triples; members catch up so ciphertext is on their disks. 3. Curator's daemon is SIGTERMed (offline simulation). 4. Asserts M1 + M2 still hold the 4 triples (no regression). 5. M1 (non-curator) calls /api/shared-memory/publish; must return status=confirmed + a txHash. 6. Outsider (N1) confirms the published KC's merkleRoot exists on chain. 7. Restarts the curator's daemon for devnet hygiene. Notes: * Re-runnable: timestamp-suffixed CG id. * Curator-only publishPolicy FAILURE case is covered by existing publisher unit tests; not duplicated here. * Falls through cleanly if devnet.sh restart-node is missing (warns operator to restart manually). Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-24T23:47:13Z

+log "Stopping curator pid=$CURATOR_PID …"
+kill -TERM "$CURATOR_PID" 2>/dev/null || warn "SIGTERM returned non-zero (process may already be gone)"
+# Wait until the port stops accepting
+for _ in $(seq 1 20); do


🔴 Bug: If the curator still hasn't shut down after this 20s loop, the script just falls through and continues. That can turn phase 5 into a false positive because the publish is no longer happening with the curator actually offline. Fail here (or wait on kill -0//api/status) before proceeding.

github-actions · 2026-05-24T23:47:13Z

+done
+
+# ===========================================================================
+act "4. Members confirm they still hold the 4 triples (no regression)"


🔴 Bug: This phase only re-reads M1/M2, which were already online when the batch was written. It never exercises the late-joiner/catchup path described in the header, so the test can pass even if recovery for a peer that joins after the curator goes offline is broken. Add a node that joins after phase 3 and assert it catches up successfully before publishing.

github-actions · 2026-05-24T23:47:13Z

+# ===========================================================================
+act "7. Restart curator (cleanup so devnet stays usable)"
+# ===========================================================================
+DEVNET_SH="$REPO_ROOT/scripts/devnet.sh"


🟡 Issue: Cleanup only runs on the happy path. With set -e, any failure after stopping node 5 exits immediately and leaves the curator down, which makes subsequent devnet runs flaky and contradicts the script's own cleanup contract. Register a trap before phase 3 so restart happens on both success and failure.

…trap cleanup Addresses three Codex bugs flagged on PR #622: 1. Curator shutdown loop fell through silently (curator-offline.sh:174) If the daemon refused to terminate within 20s the script just continued to phase 5. That turns the headline assertion into a false positive: the publish ran with the curator STILL online, defeating the offline-resilience contract the test claims to validate. Fix: hard-fail after 30s if the port is still open or the pid is still alive. Also checks `kill -0 $CURATOR_PID` so a zombie/respawn doesn't pass the port-only check. 2. Phase 4 didn't exercise the late-joiner path (curator-offline.sh:183) Re-reading M1/M2 only proves they didn't lose data — both were already online during the write. The script's own header claims to cover the LU-6 host-catchup path; without a node that joins AFTER the curator goes offline, that path was never executed. Fix: new phase 4b adds N1 (otherwise an outsider) as a late joiner. It pre-creates the CG with the same allowlist, then catches up via SWM gossip / member host-catchup while the curator is still offline. The phase is best-effort (chain-anchored allowlist denial is acceptable), but the outcome is now observable instead of silently skipped. 3. Cleanup only ran on the happy path (curator-offline.sh:220) With `set -e`, any failure after stopping the curator exited immediately and left the curator down — breaking subsequent devnet runs without operator intervention. Fix: EXIT trap registered before phase 3 calls restart-node on BOTH success and failure (gated by a `CURATOR_STOPPED` flag so we don't restart a node we never stopped). The explicit restart-on-success phase 7 is now removed (the trap covers it). Bonus: /api/shared-memory/list isn't a daemon route — replaced with /api/query SPARQL COUNT against the _shared_memory graph suffix, same fix shipped for C1 (#621). Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-25T00:10:22Z

+EOF
+)" >/dev/null || true
+done
+sleep 3


🔴 Bug: this fixed sleep 3 makes phase 5 flaky. The non-curator publish later depends on node 6 already having the CG's onChainId, and elsewhere in the repo we explicitly poll for that because gossip lag surfaces as Context graph "..." is not registered on-chain. Please wait until M1 sees the same onChainId before taking the curator offline.

github-actions · 2026-05-25T00:10:22Z

+# meant to enable.
+LATE_NODE="$OUTSIDER_NODE"
+LATE_AGENT=$(api_call "$LATE_NODE" GET /api/agent/identity | node -e 'let d="";process.stdin.on("data",c=>d+=c);process.stdin.on("end",()=>console.log(JSON.parse(d).agentAddress))')
+ALLOWED_LATE='["'"$CURATOR_AGENT"'", "'"$M1_AGENT"'", "'"$M2_AGENT"'", "'"$LATE_AGENT"'"]'


🔴 Bug: phase 4b does not exercise a real "late-joining member". N1 was not part of the original on-chain allowlist, so adding "$LATE_AGENT" only in this local pre-create cannot prove the curated-CG catchup path; if it succeeds, it is testing locally-forged membership, and if it fails the script currently accepts that. Use a node that was in allowedAgents at registration time but stayed offline until this phase, then assert success.

github-actions · 2026-05-25T00:10:22Z

+    local devnet_sh="$REPO_ROOT/scripts/devnet.sh"
+    if [ -x "$devnet_sh" ]; then
+      log "trap: restarting curator (node $CURATOR_NODE) so the devnet stays usable…"
+      "$devnet_sh" restart-node "$CURATOR_NODE" >/dev/null 2>&1 || warn "trap: devnet.sh restart-node returned non-zero — please restart node $CURATOR_NODE manually"


🟡 Issue: the EXIT trap restarts node 5 but never waits for it to become healthy again. devnet.sh restart-node returns after spawning, so callers that run another scenario immediately can inherit a half-started devnet. Reuse the existing /api/status readiness polling pattern before letting the script exit.

branarakic · 2026-05-25T15:29:33Z

Superseded by PR #649 (release: rc.10 testnet-ready cut). All commits from this PR are now on main via #649. Unaddressed Codex review feedback (C2 curator-offline mid-batch harness reliability) is being tracked + fixed in a dedicated post-rc.10 followup PR.

…38 scripts The four LU-6 devnet scripts (#621/#622/#623/#624) shipped with several control-flow gaps that let regressions silently sneak through: errors swallowed with `|| true`, fixed sleeps where bounded retries were needed, and assertions that passed even when the scenario under test never happened. Codex's review on the closed PRs flagged 24 specific items; the rc.10 integration merge fixed most of them. This commit closes the remaining ones — the ones that materially affect whether a PASS is meaningful. devnet-test-rfc38-revocation.sh (#621): - Member pre-creates (M1, M2) no longer swallow EVERY error with `|| true`. The new `member_pre_create` helper captures the response and tolerates ONLY the idempotent "already exists" signal; any other failure (wrong auth, malformed body, daemon down) now fails the script immediately with the actual error visible, instead of surfacing later as an opaque catchup timeout. - Phase 5's single "sleep 3 + one-shot read" replaced with a 30s bounded-retry loop (`wait_for_count_or_steady`) — gossip / catchup latency between the post-revocation write and a member's final triple count is variable, and the one-shot snapshot was reporting M1's partial state as a regression. - Added the forward-only-rotation lower bound: `M2_FINAL >= M2_PRE`. The pre-existing `<= 3` check alone would pass if revocation also wiped M2's previously-decryptable triples — violating the contract in the script header that the kicked member RETAINS what they could already decrypt; they just stop learning anything new. devnet-test-rfc38-curator-offline-midbatch.sh (#622): - Phase 1's `sleep 3` after member pre-create replaced with an explicit `wait_for_m1_onchain_id` poll. Phase 5's non-curator publish requires M1 to have observed the CG's `onChainId`, otherwise it bounces with "Context graph ... is not registered on-chain" — gossip lag for the `ContextGraphCreated` event can easily exceed 3s under devnet load. - EXIT trap now waits up to 60s for `/api/status` to respond before declaring the curator healthy again. `devnet.sh restart-node` returns after spawning the daemon, not after it's actually ready to serve, so CI runners that chain another scenario immediately were inheriting a half-started devnet. devnet-test-rfc38-prereg-bytecap-stress.sh (#623): - Added a precondition that the burst's SUBMITTED_BYTES actually exceeds the configured cap. With the default 80 KiB payload size anything up to 12 writes stayed under the 1 MiB cap, so the downstream clamp assertion was vacuously satisfied if anyone misconfigured WRITES_COUNT/WRITE_PAYLOAD_BYTES — a TEST bug passing as a daemon PASS. - Phase 6's liveness check no longer silently skips when the pidfile is missing/empty. Falls back to a `/api/status` HTTP probe so containerised devnets (where pidfiles aren't written) still get a meaningful liveness check; hard-fails when both pidfile AND status are unreachable. devnet-test-rfc38-unclean-restart.sh (#624): - Captures the killed core's `peerId` from `/api/status` BEFORE the SIGKILL. The post-restart catchup calls in phase 6 now pin to that peerId so we're explicitly exercising recovery from the restarted node — previously the catchup fanned out to any connected peer and could pass by pulling data from the curator (still online), validating nothing about the unclean-restart contract. - Phase 3 now waits for STRICT mid-batch state (`0 < M1_PARTIAL < WRITES_COUNT`) before the SIGKILL. The previous one-shot snapshot accepted any partial value including 0 or already-complete; both meant the kill below never actually exercised the `lastHostCatchupSeqno` resume path this test claims to cover. - Catchup responses are captured (not discarded to `/dev/null`) and asserted free of `error` / `swmError` / `durableError` via the new `assert_catchup_clean` helper. HTTP 500s, auth denials, host-catchup failures, etc. were previously invisible — the final triple-count check would still go green if data arrived via background gossip. Bash-syntax-checked (`bash -n` on all four). No behaviour change when the underlying scenarios are healthy; the changes only convert silent false-positives into loud failures. Co-authored-by: Cursor <cursoragent@cursor.com>

branarakic requested review from Jurij89 and zsculac as code owners May 24, 2026 23:45

github-actions Bot reviewed May 24, 2026

View reviewed changes

github-actions Bot reviewed May 25, 2026

View reviewed changes

branarakic closed this May 25, 2026

branarakic mentioned this pull request May 25, 2026

fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast #653

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OT-RFC-38 LU-6 C2 — curator-offline mid-batch devnet harness#622

OT-RFC-38 LU-6 C2 — curator-offline mid-batch devnet harness#622
branarakic wants to merge 2 commits into
feat/ot-rfc-38-lu6-host-modefrom
feat/lu6-followup-c2-curator-offline-mid-batch

branarakic commented May 24, 2026

Uh oh!

github-actions Bot May 24, 2026

Uh oh!

github-actions Bot May 24, 2026

Uh oh!

github-actions Bot May 24, 2026

Uh oh!

github-actions Bot May 25, 2026

Uh oh!

github-actions Bot May 25, 2026

Uh oh!

github-actions Bot May 25, 2026

Uh oh!

branarakic commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

branarakic commented May 24, 2026

Summary

Test phases

Test plan

Not covered (intentionally)

Uh oh!

github-actions Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

branarakic commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant