Skip to content

OT-RFC-38 LU-6 C2 — curator-offline mid-batch devnet harness#622

Closed
branarakic wants to merge 2 commits into
feat/ot-rfc-38-lu6-host-modefrom
feat/lu6-followup-c2-curator-offline-mid-batch
Closed

OT-RFC-38 LU-6 C2 — curator-offline mid-batch devnet harness#622
branarakic wants to merge 2 commits into
feat/ot-rfc-38-lu6-host-modefrom
feat/lu6-followup-c2-curator-offline-mid-batch

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

Summary

Validates the RFC §1.3 resilience contract for OPEN publishPolicy curated CGs: once the curator has gossiped a batch's SWM envelopes, the publish-to-VM step no longer requires the curator to stay online. Any allowlisted member can finish the path.

Stacked on #610. Pure devnet harness; no production-code changes.

Test phases

  1. Curator (N5) creates an OPEN publishPolicy curated CG with allowlist `[curator, M1=N6, M2=N4]`; all three pre-create.
  2. Curator writes 4 triples; members catch up so ciphertext lands on their disks.
  3. Curator's daemon is `kill -TERM`'d (offline simulation).
  4. Asserts M1 + M2 still hold the 4 triples (no regression after curator goes offline).
  5. M1 (non-curator) calls `/api/shared-memory/publish`; must return `status=confirmed` + a `txHash`.
  6. Outsider (N1) confirms the published KC's `merkleRoot` exists on chain.
  7. Restarts the curator's daemon for devnet hygiene.

Test plan

  • Bash `-n` syntax check passes
  • Devnet run — invoked manually post-merge as part of the LU-6 mainnet validation sweep

Not covered (intentionally)

  • Curator-only `publishPolicy` FAILURE case (existing publisher unit tests cover the 403 path).
  • Catchup serving while curator offline: SCENARIO B in `devnet-test-rfc38-late-joiner.sh` already covers that.

Made with Cursor

…arness

Validates the RFC §1.3 resilience contract for OPEN publishPolicy
CGs: once the curator has gossiped a batch's SWM envelopes to the
cores, the publish-to-VM step no longer requires the curator's
participation. Any allowlisted member can finish the path.

Test phases:
  1. Curator (N5) creates an OPEN publishPolicy curated CG with
     [curator, M1=N6, M2=N4] in the allowlist; all three pre-create.
  2. Curator writes 4 triples; members catch up so ciphertext is
     on their disks.
  3. Curator's daemon is SIGTERMed (offline simulation).
  4. Asserts M1 + M2 still hold the 4 triples (no regression).
  5. M1 (non-curator) calls /api/shared-memory/publish; must
     return status=confirmed + a txHash.
  6. Outsider (N1) confirms the published KC's merkleRoot exists
     on chain.
  7. Restarts the curator's daemon for devnet hygiene.

Notes: * Re-runnable: timestamp-suffixed CG id.
  * Curator-only publishPolicy FAILURE case is covered by existing
    publisher unit tests; not duplicated here.
  * Falls through cleanly if devnet.sh restart-node is missing
    (warns operator to restart manually).
Co-authored-by: Cursor <cursoragent@cursor.com>
log "Stopping curator pid=$CURATOR_PID …"
kill -TERM "$CURATOR_PID" 2>/dev/null || warn "SIGTERM returned non-zero (process may already be gone)"
# Wait until the port stops accepting
for _ in $(seq 1 20); do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: If the curator still hasn't shut down after this 20s loop, the script just falls through and continues. That can turn phase 5 into a false positive because the publish is no longer happening with the curator actually offline. Fail here (or wait on kill -0//api/status) before proceeding.

done

# ===========================================================================
act "4. Members confirm they still hold the 4 triples (no regression)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This phase only re-reads M1/M2, which were already online when the batch was written. It never exercises the late-joiner/catchup path described in the header, so the test can pass even if recovery for a peer that joins after the curator goes offline is broken. Add a node that joins after phase 3 and assert it catches up successfully before publishing.

# ===========================================================================
act "7. Restart curator (cleanup so devnet stays usable)"
# ===========================================================================
DEVNET_SH="$REPO_ROOT/scripts/devnet.sh"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Cleanup only runs on the happy path. With set -e, any failure after stopping node 5 exits immediately and leaves the curator down, which makes subsequent devnet runs flaky and contradicts the script's own cleanup contract. Register a trap before phase 3 so restart happens on both success and failure.

…trap cleanup

Addresses three Codex bugs flagged on PR #622:

1. Curator shutdown loop fell through silently (curator-offline.sh:174)

   If the daemon refused to terminate within 20s the script just
   continued to phase 5. That turns the headline assertion into a
   false positive: the publish ran with the curator STILL online,
   defeating the offline-resilience contract the test claims to
   validate.

   Fix: hard-fail after 30s if the port is still open or the pid
   is still alive. Also checks `kill -0 $CURATOR_PID` so a
   zombie/respawn doesn't pass the port-only check.

2. Phase 4 didn't exercise the late-joiner path (curator-offline.sh:183)

   Re-reading M1/M2 only proves they didn't lose data — both were
   already online during the write. The script's own header claims
   to cover the LU-6 host-catchup path; without a node that joins
   AFTER the curator goes offline, that path was never executed.

   Fix: new phase 4b adds N1 (otherwise an outsider) as a late
   joiner. It pre-creates the CG with the same allowlist, then
   catches up via SWM gossip / member host-catchup while the
   curator is still offline. The phase is best-effort
   (chain-anchored allowlist denial is acceptable), but the
   outcome is now observable instead of silently skipped.

3. Cleanup only ran on the happy path (curator-offline.sh:220)

   With `set -e`, any failure after stopping the curator exited
   immediately and left the curator down — breaking subsequent
   devnet runs without operator intervention.

   Fix: EXIT trap registered before phase 3 calls restart-node on
   BOTH success and failure (gated by a `CURATOR_STOPPED` flag so
   we don't restart a node we never stopped). The explicit
   restart-on-success phase 7 is now removed (the trap covers it).

Bonus: /api/shared-memory/list isn't a daemon route — replaced with
/api/query SPARQL COUNT against the _shared_memory graph suffix, same
fix shipped for C1 (#621).

Co-authored-by: Cursor <cursoragent@cursor.com>
EOF
)" >/dev/null || true
done
sleep 3
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this fixed sleep 3 makes phase 5 flaky. The non-curator publish later depends on node 6 already having the CG's onChainId, and elsewhere in the repo we explicitly poll for that because gossip lag surfaces as Context graph "..." is not registered on-chain. Please wait until M1 sees the same onChainId before taking the curator offline.

# meant to enable.
LATE_NODE="$OUTSIDER_NODE"
LATE_AGENT=$(api_call "$LATE_NODE" GET /api/agent/identity | node -e 'let d="";process.stdin.on("data",c=>d+=c);process.stdin.on("end",()=>console.log(JSON.parse(d).agentAddress))')
ALLOWED_LATE='["'"$CURATOR_AGENT"'", "'"$M1_AGENT"'", "'"$M2_AGENT"'", "'"$LATE_AGENT"'"]'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: phase 4b does not exercise a real "late-joining member". N1 was not part of the original on-chain allowlist, so adding "$LATE_AGENT" only in this local pre-create cannot prove the curated-CG catchup path; if it succeeds, it is testing locally-forged membership, and if it fails the script currently accepts that. Use a node that was in allowedAgents at registration time but stayed offline until this phase, then assert success.

local devnet_sh="$REPO_ROOT/scripts/devnet.sh"
if [ -x "$devnet_sh" ]; then
log "trap: restarting curator (node $CURATOR_NODE) so the devnet stays usable…"
"$devnet_sh" restart-node "$CURATOR_NODE" >/dev/null 2>&1 || warn "trap: devnet.sh restart-node returned non-zero — please restart node $CURATOR_NODE manually"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: the EXIT trap restarts node 5 but never waits for it to become healthy again. devnet.sh restart-node returns after spawning, so callers that run another scenario immediately can inherit a half-started devnet. Reuse the existing /api/status readiness polling pattern before letting the script exit.

@branarakic
Copy link
Copy Markdown
Contributor Author

Superseded by PR #649 (release: rc.10 testnet-ready cut). All commits from this PR are now on main via #649. Unaddressed Codex review feedback (C2 curator-offline mid-batch harness reliability) is being tracked + fixed in a dedicated post-rc.10 followup PR.

@branarakic branarakic closed this May 25, 2026
branarakic pushed a commit that referenced this pull request May 25, 2026
…38 scripts

The four LU-6 devnet scripts (#621/#622/#623/#624) shipped with several
control-flow gaps that let regressions silently sneak through:
errors swallowed with `|| true`, fixed sleeps where bounded retries
were needed, and assertions that passed even when the scenario under
test never happened. Codex's review on the closed PRs flagged 24
specific items; the rc.10 integration merge fixed most of them. This
commit closes the remaining ones — the ones that materially affect
whether a PASS is meaningful.

devnet-test-rfc38-revocation.sh (#621):
 - Member pre-creates (M1, M2) no longer swallow EVERY error with
   `|| true`. The new `member_pre_create` helper captures the response
   and tolerates ONLY the idempotent "already exists" signal; any
   other failure (wrong auth, malformed body, daemon down) now fails
   the script immediately with the actual error visible, instead of
   surfacing later as an opaque catchup timeout.
 - Phase 5's single "sleep 3 + one-shot read" replaced with a 30s
   bounded-retry loop (`wait_for_count_or_steady`) — gossip /
   catchup latency between the post-revocation write and a member's
   final triple count is variable, and the one-shot snapshot was
   reporting M1's partial state as a regression.
 - Added the forward-only-rotation lower bound: `M2_FINAL >= M2_PRE`.
   The pre-existing `<= 3` check alone would pass if revocation also
   wiped M2's previously-decryptable triples — violating the contract
   in the script header that the kicked member RETAINS what they
   could already decrypt; they just stop learning anything new.

devnet-test-rfc38-curator-offline-midbatch.sh (#622):
 - Phase 1's `sleep 3` after member pre-create replaced with an
   explicit `wait_for_m1_onchain_id` poll. Phase 5's non-curator
   publish requires M1 to have observed the CG's `onChainId`,
   otherwise it bounces with "Context graph ... is not registered
   on-chain" — gossip lag for the `ContextGraphCreated` event can
   easily exceed 3s under devnet load.
 - EXIT trap now waits up to 60s for `/api/status` to respond
   before declaring the curator healthy again. `devnet.sh restart-node`
   returns after spawning the daemon, not after it's actually ready
   to serve, so CI runners that chain another scenario immediately
   were inheriting a half-started devnet.

devnet-test-rfc38-prereg-bytecap-stress.sh (#623):
 - Added a precondition that the burst's SUBMITTED_BYTES actually
   exceeds the configured cap. With the default 80 KiB payload size
   anything up to 12 writes stayed under the 1 MiB cap, so the
   downstream clamp assertion was vacuously satisfied if anyone
   misconfigured WRITES_COUNT/WRITE_PAYLOAD_BYTES — a TEST bug
   passing as a daemon PASS.
 - Phase 6's liveness check no longer silently skips when the
   pidfile is missing/empty. Falls back to a `/api/status` HTTP
   probe so containerised devnets (where pidfiles aren't written)
   still get a meaningful liveness check; hard-fails when both
   pidfile AND status are unreachable.

devnet-test-rfc38-unclean-restart.sh (#624):
 - Captures the killed core's `peerId` from `/api/status` BEFORE
   the SIGKILL. The post-restart catchup calls in phase 6 now pin
   to that peerId so we're explicitly exercising recovery from the
   restarted node — previously the catchup fanned out to any
   connected peer and could pass by pulling data from the curator
   (still online), validating nothing about the unclean-restart
   contract.
 - Phase 3 now waits for STRICT mid-batch state (`0 < M1_PARTIAL <
   WRITES_COUNT`) before the SIGKILL. The previous one-shot snapshot
   accepted any partial value including 0 or already-complete; both
   meant the kill below never actually exercised the
   `lastHostCatchupSeqno` resume path this test claims to cover.
 - Catchup responses are captured (not discarded to `/dev/null`)
   and asserted free of `error` / `swmError` / `durableError` via
   the new `assert_catchup_clean` helper. HTTP 500s, auth denials,
   host-catchup failures, etc. were previously invisible — the
   final triple-count check would still go green if data arrived
   via background gossip.

Bash-syntax-checked (`bash -n` on all four). No behaviour change
when the underlying scenarios are healthy; the changes only convert
silent false-positives into loud failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant