OT-RFC-38 LU-6 C2 — curator-offline mid-batch devnet harness#622
OT-RFC-38 LU-6 C2 — curator-offline mid-batch devnet harness#622branarakic wants to merge 2 commits into
Conversation
…arness
Validates the RFC §1.3 resilience contract for OPEN publishPolicy
CGs: once the curator has gossiped a batch's SWM envelopes to the
cores, the publish-to-VM step no longer requires the curator's
participation. Any allowlisted member can finish the path.
Test phases:
1. Curator (N5) creates an OPEN publishPolicy curated CG with
[curator, M1=N6, M2=N4] in the allowlist; all three pre-create.
2. Curator writes 4 triples; members catch up so ciphertext is
on their disks.
3. Curator's daemon is SIGTERMed (offline simulation).
4. Asserts M1 + M2 still hold the 4 triples (no regression).
5. M1 (non-curator) calls /api/shared-memory/publish; must
return status=confirmed + a txHash.
6. Outsider (N1) confirms the published KC's merkleRoot exists
on chain.
7. Restarts the curator's daemon for devnet hygiene.
Notes: * Re-runnable: timestamp-suffixed CG id.
* Curator-only publishPolicy FAILURE case is covered by existing
publisher unit tests; not duplicated here.
* Falls through cleanly if devnet.sh restart-node is missing
(warns operator to restart manually).
Co-authored-by: Cursor <cursoragent@cursor.com>
| log "Stopping curator pid=$CURATOR_PID …" | ||
| kill -TERM "$CURATOR_PID" 2>/dev/null || warn "SIGTERM returned non-zero (process may already be gone)" | ||
| # Wait until the port stops accepting | ||
| for _ in $(seq 1 20); do |
There was a problem hiding this comment.
🔴 Bug: If the curator still hasn't shut down after this 20s loop, the script just falls through and continues. That can turn phase 5 into a false positive because the publish is no longer happening with the curator actually offline. Fail here (or wait on kill -0//api/status) before proceeding.
| done | ||
|
|
||
| # =========================================================================== | ||
| act "4. Members confirm they still hold the 4 triples (no regression)" |
There was a problem hiding this comment.
🔴 Bug: This phase only re-reads M1/M2, which were already online when the batch was written. It never exercises the late-joiner/catchup path described in the header, so the test can pass even if recovery for a peer that joins after the curator goes offline is broken. Add a node that joins after phase 3 and assert it catches up successfully before publishing.
| # =========================================================================== | ||
| act "7. Restart curator (cleanup so devnet stays usable)" | ||
| # =========================================================================== | ||
| DEVNET_SH="$REPO_ROOT/scripts/devnet.sh" |
There was a problem hiding this comment.
🟡 Issue: Cleanup only runs on the happy path. With set -e, any failure after stopping node 5 exits immediately and leaves the curator down, which makes subsequent devnet runs flaky and contradicts the script's own cleanup contract. Register a trap before phase 3 so restart happens on both success and failure.
…trap cleanup Addresses three Codex bugs flagged on PR #622: 1. Curator shutdown loop fell through silently (curator-offline.sh:174) If the daemon refused to terminate within 20s the script just continued to phase 5. That turns the headline assertion into a false positive: the publish ran with the curator STILL online, defeating the offline-resilience contract the test claims to validate. Fix: hard-fail after 30s if the port is still open or the pid is still alive. Also checks `kill -0 $CURATOR_PID` so a zombie/respawn doesn't pass the port-only check. 2. Phase 4 didn't exercise the late-joiner path (curator-offline.sh:183) Re-reading M1/M2 only proves they didn't lose data — both were already online during the write. The script's own header claims to cover the LU-6 host-catchup path; without a node that joins AFTER the curator goes offline, that path was never executed. Fix: new phase 4b adds N1 (otherwise an outsider) as a late joiner. It pre-creates the CG with the same allowlist, then catches up via SWM gossip / member host-catchup while the curator is still offline. The phase is best-effort (chain-anchored allowlist denial is acceptable), but the outcome is now observable instead of silently skipped. 3. Cleanup only ran on the happy path (curator-offline.sh:220) With `set -e`, any failure after stopping the curator exited immediately and left the curator down — breaking subsequent devnet runs without operator intervention. Fix: EXIT trap registered before phase 3 calls restart-node on BOTH success and failure (gated by a `CURATOR_STOPPED` flag so we don't restart a node we never stopped). The explicit restart-on-success phase 7 is now removed (the trap covers it). Bonus: /api/shared-memory/list isn't a daemon route — replaced with /api/query SPARQL COUNT against the _shared_memory graph suffix, same fix shipped for C1 (#621). Co-authored-by: Cursor <cursoragent@cursor.com>
| EOF | ||
| )" >/dev/null || true | ||
| done | ||
| sleep 3 |
There was a problem hiding this comment.
🔴 Bug: this fixed sleep 3 makes phase 5 flaky. The non-curator publish later depends on node 6 already having the CG's onChainId, and elsewhere in the repo we explicitly poll for that because gossip lag surfaces as Context graph "..." is not registered on-chain. Please wait until M1 sees the same onChainId before taking the curator offline.
| # meant to enable. | ||
| LATE_NODE="$OUTSIDER_NODE" | ||
| LATE_AGENT=$(api_call "$LATE_NODE" GET /api/agent/identity | node -e 'let d="";process.stdin.on("data",c=>d+=c);process.stdin.on("end",()=>console.log(JSON.parse(d).agentAddress))') | ||
| ALLOWED_LATE='["'"$CURATOR_AGENT"'", "'"$M1_AGENT"'", "'"$M2_AGENT"'", "'"$LATE_AGENT"'"]' |
There was a problem hiding this comment.
🔴 Bug: phase 4b does not exercise a real "late-joining member". N1 was not part of the original on-chain allowlist, so adding "$LATE_AGENT" only in this local pre-create cannot prove the curated-CG catchup path; if it succeeds, it is testing locally-forged membership, and if it fails the script currently accepts that. Use a node that was in allowedAgents at registration time but stayed offline until this phase, then assert success.
| local devnet_sh="$REPO_ROOT/scripts/devnet.sh" | ||
| if [ -x "$devnet_sh" ]; then | ||
| log "trap: restarting curator (node $CURATOR_NODE) so the devnet stays usable…" | ||
| "$devnet_sh" restart-node "$CURATOR_NODE" >/dev/null 2>&1 || warn "trap: devnet.sh restart-node returned non-zero — please restart node $CURATOR_NODE manually" |
There was a problem hiding this comment.
🟡 Issue: the EXIT trap restarts node 5 but never waits for it to become healthy again. devnet.sh restart-node returns after spawning, so callers that run another scenario immediately can inherit a half-started devnet. Reuse the existing /api/status readiness polling pattern before letting the script exit.
…38 scripts The four LU-6 devnet scripts (#621/#622/#623/#624) shipped with several control-flow gaps that let regressions silently sneak through: errors swallowed with `|| true`, fixed sleeps where bounded retries were needed, and assertions that passed even when the scenario under test never happened. Codex's review on the closed PRs flagged 24 specific items; the rc.10 integration merge fixed most of them. This commit closes the remaining ones — the ones that materially affect whether a PASS is meaningful. devnet-test-rfc38-revocation.sh (#621): - Member pre-creates (M1, M2) no longer swallow EVERY error with `|| true`. The new `member_pre_create` helper captures the response and tolerates ONLY the idempotent "already exists" signal; any other failure (wrong auth, malformed body, daemon down) now fails the script immediately with the actual error visible, instead of surfacing later as an opaque catchup timeout. - Phase 5's single "sleep 3 + one-shot read" replaced with a 30s bounded-retry loop (`wait_for_count_or_steady`) — gossip / catchup latency between the post-revocation write and a member's final triple count is variable, and the one-shot snapshot was reporting M1's partial state as a regression. - Added the forward-only-rotation lower bound: `M2_FINAL >= M2_PRE`. The pre-existing `<= 3` check alone would pass if revocation also wiped M2's previously-decryptable triples — violating the contract in the script header that the kicked member RETAINS what they could already decrypt; they just stop learning anything new. devnet-test-rfc38-curator-offline-midbatch.sh (#622): - Phase 1's `sleep 3` after member pre-create replaced with an explicit `wait_for_m1_onchain_id` poll. Phase 5's non-curator publish requires M1 to have observed the CG's `onChainId`, otherwise it bounces with "Context graph ... is not registered on-chain" — gossip lag for the `ContextGraphCreated` event can easily exceed 3s under devnet load. - EXIT trap now waits up to 60s for `/api/status` to respond before declaring the curator healthy again. `devnet.sh restart-node` returns after spawning the daemon, not after it's actually ready to serve, so CI runners that chain another scenario immediately were inheriting a half-started devnet. devnet-test-rfc38-prereg-bytecap-stress.sh (#623): - Added a precondition that the burst's SUBMITTED_BYTES actually exceeds the configured cap. With the default 80 KiB payload size anything up to 12 writes stayed under the 1 MiB cap, so the downstream clamp assertion was vacuously satisfied if anyone misconfigured WRITES_COUNT/WRITE_PAYLOAD_BYTES — a TEST bug passing as a daemon PASS. - Phase 6's liveness check no longer silently skips when the pidfile is missing/empty. Falls back to a `/api/status` HTTP probe so containerised devnets (where pidfiles aren't written) still get a meaningful liveness check; hard-fails when both pidfile AND status are unreachable. devnet-test-rfc38-unclean-restart.sh (#624): - Captures the killed core's `peerId` from `/api/status` BEFORE the SIGKILL. The post-restart catchup calls in phase 6 now pin to that peerId so we're explicitly exercising recovery from the restarted node — previously the catchup fanned out to any connected peer and could pass by pulling data from the curator (still online), validating nothing about the unclean-restart contract. - Phase 3 now waits for STRICT mid-batch state (`0 < M1_PARTIAL < WRITES_COUNT`) before the SIGKILL. The previous one-shot snapshot accepted any partial value including 0 or already-complete; both meant the kill below never actually exercised the `lastHostCatchupSeqno` resume path this test claims to cover. - Catchup responses are captured (not discarded to `/dev/null`) and asserted free of `error` / `swmError` / `durableError` via the new `assert_catchup_clean` helper. HTTP 500s, auth denials, host-catchup failures, etc. were previously invisible — the final triple-count check would still go green if data arrived via background gossip. Bash-syntax-checked (`bash -n` on all four). No behaviour change when the underlying scenarios are healthy; the changes only convert silent false-positives into loud failures. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Validates the RFC §1.3 resilience contract for OPEN publishPolicy curated CGs: once the curator has gossiped a batch's SWM envelopes, the publish-to-VM step no longer requires the curator to stay online. Any allowlisted member can finish the path.
Stacked on #610. Pure devnet harness; no production-code changes.
Test phases
Test plan
Not covered (intentionally)
Made with Cursor