Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook) by branarakic · Pull Request #700 · OriginTrail/dkg

branarakic · 2026-05-26T16:48:34Z

Summary

Adds three layered pieces so a small / sparse DKG mesh can rediscover direct peer addresses without depending on a flaky DHT lookup:

Layer	Where	What
Schema	`profile.ts`	`dkg:multiaddr` (one per address) + `dkg:lastSeen` (ISO timestamp) on every profile
Heartbeat	`dkg-agent.ts`	`AGENT_PROFILE_HEARTBEAT_MS = 5min`; re-publishes profile so phonebook entries stay fresh
Dial fallback	`peer-resolver.ts`	step 4 prefers richer `findAgentDialAddresses` (multiaddrs + relay + lastSeen) with 24h staleness filter
Stall recovery	`dkg-agent.ts`	`Messenger.resolvePeer` now routes through full `PeerResolver` instead of raw `peerRouting.findPeer`

The trigger: testnet diagnostics for #697 showed that even when DHT lookups fail, the agents CG already holds enough metadata to find a peer — we just weren't asking it during dial. PR A (#698) tunes the libp2p side; this PR teaches the resolver to use the CG as a phonebook.

Backward compat

Schema: dkg:multiaddr and dkg:lastSeen are NOT added to genesis.ts. That would change computeNetworkId (which hashes all genesis quads) and break rolling deployment with any rc.11 peer. RDF doesn't require properties to be declared in the ontology — they're usable as-is. A coordinated genesis bump can land in a future release.
AgentDirectoryLookup: findAgentDialAddresses is optional. The resolver falls back to the legacy findRelayForPeer when the directory doesn't implement it, so any external embedder or test fixture using the old shape keeps working.
Heartbeat: defaults to 5min; 0 disables (one-shot startup publish still fires). Operator-tunable via config.network.agentProfileHeartbeatMs.

Wiring

AgentProfileConfig.multiaddrs + lastSeen; populated in DKGAgent.publishProfile from this.node.multiaddrs filtered through new collectPublishableMultiaddrs (drops loopback / link-local / unspecified bind / dedup).
DiscoveryClient.findAgentByPeerId gains a second SPARQL query for dkg:multiaddr rows and pulls ?lastSeen in the scalar SELECT.
PeerResolver step 4 — primes the peerStore with direct multiaddrs (staleness-gated) then layers the relay form on top.
Messenger resolvePeer hook — routes through peerResolver.resolve() so stall recovery picks up the phonebook path.

Tests

agent.test.ts > Profile Builder (+3): multiaddrs/lastSeen emission, default-timestamp behaviour, quote-injection guard, collectPublishableMultiaddrs filter.
peer-resolver.test.ts (+4): phonebook path preferred over relay-only; stale lastSeen drops multiaddrs but keeps relay; falls back to old findRelayForPeer when new method absent; custom stale threshold honoured.

Full suites: @dkg/core 935 / 935 green (+2 vs PR A), @dkg/agent Profile Builder 9/9 green. Same single pre-existing swm-snapshot-sync failure on agent suite as PR A (verified on main: Unknown file extension ".ts" worker-thread loader issue, unrelated).

Test plan

Build green: pnpm -F @origintrail-official/dkg-core build
Build green: pnpm -F @origintrail-official/dkg-agent build
Build green: pnpm -F @origintrail-official/dkg build (cli)
Unit tests green: peer-resolver.test.ts (23/23), agent Profile Builder (9/9), full @dkg/core (935/935)
Manual: with this branch deployed on Miles, watch the agents CG over time — confirm dkg:multiaddr triples appear, get refreshed every 5min, and that querying findAgentByPeerId for Muy-Sentinel returns multiaddrs. Then on Muy's side, kill direct connection to Miles, wait for peerStore eviction (1h default), and confirm a fresh dial succeeds via the phonebook step.
Manual: trigger an outbox stall (peer offline > 5 attempts) and confirm [Messenger] recovery routes through PeerResolver (visible via debug logs of agents-CG priming).

Conflict note

Both PR A (#698) and this PR add a network block to DkgConfig. Merging one before the other will require a trivial conflict resolution combining the fields into a single network block. Recommended merge order: A first (smaller, fewer moving parts), then this.

Companion work

PR A: #698 — libp2p peerStore/DHT/PeerResolver-timeout operator-tunable knobs.
Tracking separately: ACK-collector resilience to mid-collection stream resets (today's 3-attempts-per-peer is too brittle when a relay flaps during quorum collection) — to file after Publishing to Verifiable Memory fails with QuorumUnmetError: storage_ack_insufficient (0/3 ACKs) #697 reporter confirms the relay-collapse diagnosis.

Made with Cursor

- **Null-fallback in step 4** (peer-resolver.ts): when `findAgentDialAddresses` is implemented but returns `null` for a peer, fall through to `findRelayForPeer` so peers with only a legacy relay entry (pre-phonebook profiles) still resolve. The interface JSDoc documented this contract; the implementation was only honouring it when the richer method was absent entirely. Regression test added. - **Per-address robustness in step 4** (peer-resolver.ts): a single malformed `dkg:multiaddr` from an untrusted profile would make the whole batch `multiaddr()` conversion throw, poisoning every valid sibling. `primeAndAppend` now retries per-address on batch failure so the bad entry is dropped in isolation. Regression test added (mocks `addKnownAddresses` to throw on the bad entry only). - **Heartbeat re-entrancy guard** (dkg-agent.ts): the `agentProfileHeartbeatTimer` is a fixed-cadence `setInterval`, so a slow `publishProfile()` (e.g. chain RPC slow + heartbeat configured short) could let two publishes race on `ProfileManager.currentKcId`. Added `agentProfileHeartbeatInFlight` flag — ticks short-circuit (with a debug log) when a publish is still in flight. Tests: `peer-resolver.test.ts` 25/25 (+2 regression tests). Full `@dkg/core` suite 937/937 green. Co-authored-by: Cursor <cursoragent@cursor.com>

branarakic · 2026-05-26T17:02:45Z

Addressed all three Codex findings in 1c99a88:

🔴 Null-fallback in step 4: when findAgentDialAddresses is implemented but returns null, the resolver now falls through to findRelayForPeer so peers with only a legacy relay entry (pre-phonebook profiles) still resolve. Regression test added.
🔴 Heartbeat re-entrancy: added agentProfileHeartbeatInFlight flag; ticks short-circuit (with a debug log) when a previous publishProfile() is still running, eliminating the currentKcId race when chain RPC is slow or the heartbeat is configured tight. No dedicated unit test — guard is a 5-line well-understood try/finally, and a test would require full DKGAgent setup for marginal value. Existing agent integration tests cover regression.
🟡 Per-address robustness: primeAndAppend now retries per-address when the batch addKnownAddresses throws, so a single malformed dkg:multiaddr from an untrusted profile is dropped in isolation instead of poisoning every valid sibling. Regression test added (mocks addKnownAddresses to throw on the bad entry only).

Tests: peer-resolver.test.ts 25/25 (+2 regression). Full @dkg/core suite 937/937 green.

Three follow-on issues raised on the round-1 fixes: (1) `collectPublishableMultiaddrs` regex filter was looser than the existing repo classifier — RFC1918 / CGNAT / ULA / `/dns*/localhost` multiaddrs would still get advertised via `dkg:multiaddr`, so peers learnt self-referential / private addresses from the phonebook and wasted dial attempts before falling back to the relay. Replaced with `isPublicLikeAddress` from `dkg-core` (the shared classifier that already pins behaviour with `share-project-modal.test.ts` and the daemon's remotely-dialable check). Test fence expanded with RFC1918 / CGNAT / ULA / DNS-localhost cases. (2) The heartbeat-only `agentProfileHeartbeatInFlight` guard left a race window vs the other `publishProfile()` callers — startup, key rotation, and revocation also call this method, and they all mutate `ProfileManager.currentKcId` plus rewrite the registry triples. Added a `publishProfileTail` mutex (tail-promise chain) inside `publishProfile()` so EVERY caller is serialised at the lowest level. Kept the heartbeat inFlight flag as a tick- coalescing optimisation (keeps queue depth at 1) but documented that it's not the correctness gate. (3) Step-4 staleness gate treated `lastSeenMs === undefined` as fresh, which contradicted the `DiscoveredAgent.lastSeen` JSDoc saying unknown freshness should fall back to relay only. Partial or manual agents-CG entries (no heartbeat) bypassed the stale-data guard. Flipped to "fresh iff lastSeenMs present AND within window"; relay address still tried regardless. JSDoc updated and a regression test pins the new behaviour. Drive-by: also escaped a `/dns*/` token in the new JSDoc that TypeScript was parsing as a comment terminator. Co-authored-by: Cursor <cursoragent@cursor.com>

branarakic · 2026-05-26T17:22:09Z

Round 2 Codex feedback addressed in 5c3a77c7:

Finding (4): `collectPublishableMultiaddrs` filter was too loose

Diagnosis confirmed. The round-1 regex only caught loopback / link-local / unspecified bind. RFC1918 (10/8, 172.16/12, 192.168/16), CGNAT (100.64/10), IPv6 ULA (fc00::/7), and /dns4/localhost would have leaked into the agent profile, causing peers to learn self-referential or private multiaddrs and waste dial attempts before falling back to the relay.
Fix: replaced the inline regex with isPublicLikeAddress from dkg-core — the same classifier the daemon already uses for "node is remotely-dialable" and that pairs with the UI's isMultiaddrRemotelyDialable in share-project-modal.test.ts. Now there's exactly one classifier across the codebase. dkg-core re-exports isPublicLikeAddress so profile.ts can consume it without reaching into a deep import path. Test fence in agent.test.ts expanded with RFC1918 / CGNAT / ULA / DNS-localhost / public-DNS cases.

Finding (5): in-flight flag only guarded heartbeat-vs-heartbeat

Diagnosis confirmed. publishProfile() is also called from startup, key rotation, and revocation paths — all of which mutate ProfileManager.currentKcId and rewrite registry triples. The round-1 agentProfileHeartbeatInFlight flag protected nothing about those overlaps.
Fix: added a publishProfileTail tail-promise chain INSIDE publishProfile() itself. Every caller (heartbeat, startup, key rotation, revocation, manual API) now waits for the prior publish to settle (success or failure) before running its own. Kept the heartbeat inFlight flag as a tick-coalescing optimisation (keeps queue depth at 1 even if publish latency exceeds the heartbeat cadence) but documented in the JSDoc that it's NOT the correctness gate — the mutex is.

Finding (6): `lastSeenMs === undefined` was treated as fresh

Diagnosis confirmed. The round-1 staleness check used dial.lastSeenMs !== undefined && Date.now() - dial.lastSeenMs > threshold, which means missing freshness → isStale=false → multiaddrs used anyway. That contradicted the DiscoveredAgent.lastSeen JSDoc ("unknown freshness should fall back to relay only") and let partial / manual agents-CG entries bypass the stale-data guard.
Fix: flipped to isFresh = lastSeenMs !== undefined && Date.now() - lastSeenMs <= threshold and only use multiaddrs when isFresh. Relay address still tried regardless (relays outlive individual peer NAT bindings). Added a regression test (step 4 (phonebook): missing lastSeenMs is treated as stale (multiaddrs dropped, relay still used)) that pins the new behaviour. JSDoc on both AgentDirectoryLookup.findAgentDialAddresses and the docs around the gate updated.

CI green locally (core 938/938, agent 120/120). Drive-by: also escaped a /dns*/ token in the new JSDoc that TypeScript was reading as a comment terminator.

Two real-bug findings from round 2: (1) SPARQL injection vector in `DiscoveryClient.findAgentByPeerId`: the `agentUri` bound from the first query was interpolated unguarded as `<${agentUri}>` in the follow-up multiaddr query. A blank-node subject (`_:b1`) or an IRI containing `>` / whitespace / control chars from a malicious agents-CG entry would break the second query or open an injection path. Two layers of defense added: (a) `FILTER(isIRI(?agent))` in the first query so the engine drops non-IRI bindings before they reach JS, and (b) the existing `assertSafeIri` / `sparqlIri` helpers from `core/sparql-safe.ts` validate the IRI before interpolation. On validation failure the whole entry is dropped (return null) so the bug can't relocate downstream via the returned `agentUri`. (2) The round-2 freshness gate `Date.now() - lastSeenMs <= threshold` silently accepted ANY future `lastSeenMs` because the LHS becomes negative. A clock-skewed or malicious agents-CG profile could therefore keep dead direct multiaddrs eligible indefinitely. Added an upper bound `lastSeenMs <= now + AGENT_DIRECTORY_CLOCK_SKEW_ALLOWANCE_MS` (5 minutes — covers NTP drift between independent hosts without admitting attacks). Regression tests pin both the future-rejection corner and the within-skew-acceptance corner. JSDoc on `AgentDirectoryLookup.findAgentDialAddresses` updated to document the two-sided window. Co-authored-by: Cursor <cursoragent@cursor.com>

branarakic · 2026-05-26T18:36:32Z

Round 3 Codex feedback addressed in 1d9842aa:

Finding (7) 🔴: SPARQL injection vector in `findAgentByPeerId`

Diagnosis confirmed. The agentUri bound from the first SPARQL query was interpolated unguarded as <${agentUri}> in the follow-up multiaddr query. A blank-node subject (_:b1) or any IRI containing > / whitespace / control chars from a malicious agents-CG entry would break the second query or open an injection path.
Fix: two layers of defense:
1. Engine layer: FILTER(isIRI(?agent)) added to the first query so non-IRI bindings never reach JS.
2. JS layer: wrapped the IRI with assertSafeIri() / sparqlIri() from core/sparql-safe.ts (the existing helpers). On validation failure, the entire entry is dropped (return null) so a malformed agentUri can't relocate the bug downstream via the returned DiscoveredAgent.agentUri.

Finding (8) 🔴: future-timestamp bypass in freshness check

Diagnosis confirmed. Date.now() - lastSeenMs <= threshold silently accepted ANY future lastSeenMs because the LHS goes negative. A clock-skewed or malicious profile could keep dead direct multiaddrs eligible indefinitely.
Fix: added a 5-minute upper-bound skew allowance: lastSeenMs !== undefined && lastSeenMs <= now + skew && now - lastSeenMs <= threshold. The 5-minute window covers reasonable NTP drift between independent hosts without admitting attacks. Two regression tests pin the corners:
- 1-hour future timestamp → multiaddrs DROPPED, relay still used.
- 1-minute future timestamp (within skew) → multiaddrs KEPT.

JSDoc on AgentDirectoryLookup.findAgentDialAddresses updated to document the two-sided window explicitly. CI green locally (core 944/944 with new tests, agent 120/120).

Agents Context Graph as distributed phonebook — adds DiscoveryClient multiaddr/lastSeen advertisement so peer resolution can fall back to the CG instead of relying solely on libp2p PeerStore + the public relay's routing table. Includes 3 rounds of Codex review fixes. Notably interacts with PR-2 of the SWM-fanout work already on rc.12: PR-2's drainPendingSenderKeyForPeer calls discovery.findAgentByPeerId to resolve the connecting peer's agent address \u2192 phonebook makes that resolution work for agents we haven't directly synced with yet. Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # packages/agent/src/dkg-agent-types.ts # packages/cli/src/config.ts # packages/cli/src/daemon/lifecycle.ts # packages/core/src/index.ts # packages/core/src/network/peer-resolver.ts # packages/core/test/peer-resolver.test.ts

fix(agent/discovery): populate `agentAddress` on findAgentByPeerId so #700 drain works in production

Round 4 of Codex review cleanup, plus node-UI activity feed work that sits on top of the same branch: agent/core scope trim (review feedback): - Remove agent-profile heartbeat (`AGENT_PROFILE_HEARTBEAT_MS`, `AGENT_PROFILE_STALE_THRESHOLD_MS`, periodic re-publish timer). - Remove `multiaddrs` / `lastSeen` fields from `DiscoveredAgent` and the SPARQL multiaddr-collection query in `findAgentByPeerId`. - Drop `collectPublishableMultiaddrs` helper and the discovery client multiaddr round-trip; profile is now just the scalar phonebook entry. - Trim `peer-resolver.ts` and `agent.test.ts` to match. - Remove `agentProfileHeartbeatMs` config knob. - Net: phonebook is the "agents Context Graph as directory" feature the PR title states; multiaddr / freshness signalling is deferred. node-ui activity feed: - New `useAssertionLifecycleEvents` hook + tests; surfaces created / promoted / published / finalized / discarded events on assertions. - Project activity joiner now merges lifecycle events with the existing SWM gossip / VM publish streams in a single ordered feed. - ActivityFeed + AgentChip render updates; new test fixtures cover empty-state and joiner-order edge cases. evm-module: - Refresh `base_sepolia_v10_contracts.json` against the current devnet deployment. Co-authored-by: Cursor <cursoragent@cursor.com>

…egation test(agent): pin #700 publishProfile mutex serialization + 1-of-N partial-fail aggregation

…lidation Productionising the test-suite findings from a full rc.12 comprehensive devnet sweep so the same harness reliably guards rc.13+. All changes are in scripts/; no production code touched. Fixes uncovered while testing release/rc.12: v10-rc-validation.sh — migrate to rc.12 API shapes * publish path: /api/publish (removed) → /api/shared-memory/write + /api/shared-memory/publish, with selection.rootEntities so re-runs against a CG that already has SWM content don't trip the rc.12 rootEntity-uniqueness rule * private quads: rewritten to /api/update + privateMerkleRoot receipt (rc.12 stores private quads encrypted-at-rest; they are intentionally not served via /api/query, so the old "publisher sees private back" assertion is replaced with the storage-receipt check) * CAS: send a real non-empty conditions array (rc.12 rejects empty) * chat: { to, text } (was recipientPeerId/text) * identity: /api/profile (removed) → /api/identity + /api/status * exit non-zero on any FAIL so the orchestrator picks it up * timestamp-suffixed entity URIs, sub-graph names, assertion names so every re-run is collision-free even on a long-lived devnet * helper rewritten so the JSON-extraction pipeline is heredoc-safe (was: stdin shadowed by heredoc; expressions with semicolons silently SyntaxError'd and every parse returned EMPTY) swm-soak-test.sh — python heredoc bug * Final-summary block used `<<PYTHON` (unquoted), so bash expanded every backtick-quoted code reference in Python comments (`m = re.match(...)`, `s`, `by_cg_final[cg]`, etc.) as command substitutions, producing a flood of "command not found" lines. Switched to `<<'PYTHON'` + env-var pass-through. devnet-test-rfc38-curator-offline-midbatch.sh — wrong endpoint * wait_for_m1_onchain_id polled `GET /api/context-graph/<id>`, which is a 404 (no such route) — the test reliably false-failed on every run. Fixed to use `GET /api/context-graph/list` + client-side lookup, and made the timeout configurable via RFC38_M1_ONCHAIN_WAIT_S=60. devnet-test-rfc38-unclean-restart.sh — too-fast catchup + curl ARG_MAX * Bumped WRITES_COUNT 20 → 200 and WRITE_PAYLOAD_BYTES 4096 → 16384, and dropped the partial-catchup poll from 1 s → 100 ms, so the test reliably observes M1 mid-batch at rc.12 catchup speeds (verified: M1_pre-kill = 159 / 200 on the reference devnet). * api_call streams large bodies through stdin (`-d @-`) instead of argv; pre-fix the 3.2 MiB stress body tripped macOS ARG_MAX with "Argument list too long". devnet.sh — `stop` is now actually idempotent * After stopping by pidfile, sweep every devnet port (HARDHAT_PORT + API_PORT_BASE..+N + LIBP2P_PORT_BASE..+N) with `lsof` and SIGTERM/SIGKILL any process still LISTENing. Catches stale processes from prior rc.X devnets that were killed at the worker layer while the supervisor respawned them. Opt out with DEVNET_STOP_PORT_SWEEP=0 when running multiple devnets on one host. Promoted from .rc12-test/ scratch dir to scripts/ for rc.13+: * devnet-probe-hub-rotation.sh — PR OriginTrail#689 (chain hub rotation) * devnet-probe-multi-rpc-failover.sh — PR OriginTrail#684 (multi-RPC failover) * devnet-probe-libp2p-tunables.sh — PR OriginTrail#698 (libp2p tunables) * devnet-probe-cg-phonebook.sh — PR OriginTrail#700 (agents CG) * devnet-probe-ack-rejection-reasons.sh — PR OriginTrail#711 (ACK gate diagnostics) * devnet-test-node-ui-smoke.sh — Vite dev-server smoke * devnet-comprehensive.sh — orchestrator wiring all of the above + v10-rc-validation + _devnet-full-sweep + rc11 recovery + rfc38-all + soak suite; bash-3.2 safe (no `declare -A`); SKIP_SOAK / SOAK_ONLY / SKIP_PROBES / SKIP_RFC38_EXTRAS / SKIP_UI / FAIL_FAST knobs. Verified on the rc.12 reference devnet: - v10-rc-validation: 34 PASS / 0 FAIL / 2 WARN (exit 0) - rfc38-unclean-restart: PASS (mid-batch window: 159/200) - rfc38-curator-offline-midbatch: PASS - 5 promoted probes (new layout): all PASS - swm-soak (2 cycles, quick): 100.00% on both CGs, no errors - devnet-comprehensive orchestrator: pre-flight + suite registration OK Co-authored-by: Cursor <cursoragent@cursor.com>

…riginTrail#700 drain works in production PR OriginTrail#700 ("Agents Context Graph as distributed phonebook") shipped `DKGAgent.drainPendingSenderKeyForPeer` (`dkg-agent.ts:6089-6138`) as the recovery loop that replays queued SWM sender keys once a previously- unknown agent's peerId resolves on `connection:open`. It's the entire point of the pending-by-agent path landed by OriginTrail#700. The bug ======= `drainPendingSenderKeyForPeer` gates on `profile?.agentAddress`: const profile = await this.discovery.findAgentByPeerId(peerId); if (profile?.agentAddress) { agentAddresses = [profile.agentAddress.toLowerCase()]; } ... if (agentAddresses.length === 0) return 0; But `DiscoveryClient.findAgentByPeerId` (`discovery.ts:147-218`) **never populates that field** — its scalar `SELECT` omits the `?agentAddress` column entirely, and the return object on lines 209-218 doesn't set it. (Compare to the sibling `findAgents()` on the same file, which DOES select and return `agentAddress` at lines 71, 78, 92.) The asymmetry between the two discovery entrypoints meant: - Every `connection:open` → drain attempt early-returned 0. - `pendingSenderKeyByAgent` grew but never replayed. - In production, sender keys queued for not-yet-resolved agents would never be delivered — the whole recovery feature shipped as a permanent no-op. CI was green because `swm-sender-key-pending-by-agent.test.ts` stubbed `discovery.findAgentByPeerId` to return `agentAddress` explicitly (lines 202-209), masking the production gap. The fix ======= Three lines: add `?agentAddress` to the scalar `SELECT`, add the matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause, and populate the field on the returned object. Now both `findAgents()` and `findAgentByPeerId()` resolve the same identity for the same peer. Regression tests ================ 1. `test/agent.test.ts` — Discovery Client suite — new test "returns agentAddress on findAgentByPeerId — keeps both discovery entrypoints in lockstep". Inserts a profile with an explicit `agentAddress`, asserts both `findAgents()` AND `findAgentByPeerId()` return it (lowercased per `canonicalAgentDidSubject`), and pins the legacy-profile-without-agentAddress fallback (returns `undefined`, doesn't throw). 2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" describe block — two integration tests that exercise the drain path **without** stubbing discovery: a. Boot agent → enqueue pending sender key via the no-peerId path → publish the recipient's profile (real `buildAgentProfile` output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer` with the real `DiscoveryClient` → assert messenger.sendReliable was called and the queue is empty. b. Same flow but the recipient profile omits `dkg:agentAddress` (legacy profile) → assert drain returns 0, queue stays in place, messenger is never called. Test (b) would have caught the original bug — every assertion in the all-stubbed path was satisfied even with the broken implementation. Verified ======== - `pnpm --filter @origintrail-official/dkg-agent build` — clean - All 18 SWM sender-key tests pass (+2 new + 16 existing) - All 7 Discovery Client tests pass (+1 new + 6 existing) Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Co-authored-by: Cursor <cursoragent@cursor.com>

…dler + canonical CG keying (OriginTrail#729 Bug 4) Adds `packages/publisher/test/v10-ack-v2-chunked.test.ts` — 8 cases exercising the LU-11 / OT-RFC-39 V2 chunked ACK path landed by PR OriginTrail#715/OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 4. Background ========== OriginTrail#716 review-consolidation audit flagged this as the closest analogue of the gap PR OriginTrail#735 closed for OriginTrail#720: helper-level primitives (chunked AEAD, ciphertext Merkle tree, proto wire format, on-chain commitment fields) are well tested, but the StorageACKHandler that wires them together at the ACK boundary has zero direct coverage. The existing `v10-ack-edge-cases.test.ts` thoroughly covers the V1 (inline staging quads / inline encrypted blob) paths but skips V2 entirely. Worse, PR OriginTrail#729 Bug 4 shipped a fix to the V2 ACK `loadChunk` graph canonicalisation (`normalizeContextGraphIdForChunkStore` returning null → wildcard `GRAPH ?g` fallback) without a regression test — any future refactor of that hook could silently re-introduce the "keccak the decimal string '42'" miss. Scope ===== Eight cases — happy paths first, then the regression and decline shapes: 1. **Happy path** — cleartext CG, canonicalising normalizer, chunks persisted under `ciphertextChunkStoreGraph(canonical(swmGraphId))`, ACK signed. Pins the production canonicalisation path. 2. **OriginTrail#729 Bug 4 regression** — V2 intent omits `swmGraphId`, normalizer returns `null` on every input → handler MUST widen to `GRAPH ?g` and still find the chunks. The pre-fix code keccak'd the decimal `cgId` ('42') and missed every persisted chunk; the test now locks the fix in. 3. **Legacy no-normalizer shim** — `normalizeContextGraphIdForChunkStore` absent → uses raw `swmGraphId` literally, preserving pre-fix behaviour for callers that haven't wired the hook. 4. **Multi-CG isolation (PR OriginTrail#715 / ciphertext-chunk-store.ts:28-44)** — two CGs publish identical V10 KCs (same batchId, since it's plaintext-derived) but persist chunks under different canonical named graphs. ACK for CG-A must only see CG-A's chunks; an intent claiming CG-B's root for the same batchId is correctly declined with `CIPHERTEXT_ROOT_MISMATCH`. This pins the Codex finding that drove the per-CG named-graph design. 5. **`MISSING_CIPHERTEXT_CHUNKS` decline** — claim count=4, persist only indexes 0 and 2 → declines after the 10-second retry window with the missing indexes in the message ("missing 2/4 ... 1,3"). 6. **`CIPHERTEXT_ROOT_MISMATCH` decline** — all chunks present but publisher lies about the root → declines fast. 7. **Curated-only gate** — `isCgCurated` returns false → declines with `SIGNER_NOT_REGISTERED` (the V2-curated-only gate at `storage-ack-handler.ts:296-310`). Closes the bypass concern called out in the inline comment above the V2 branch. 8. **stagingQuads forbidden on V2 wire** — V2 intent with non-empty `stagingQuads` is rejected (the chunked path never carries inline ciphertext). Test harness ============ Real `OxigraphStore` (mirrors `v10-ack-edge-cases.test.ts` pattern), real `encodePublishIntent` / `decodeStorageACK` proto round-trips, real `buildCiphertextChunksRoot` Merkle tree construction. The new `buildV2IntentBytes(opts)` helper derives `ciphertextChunksRoot`, `ciphertextChunkCount`, and `publicByteSize` from the chunks list unless explicitly overridden — production wire shape, not a fake. The `seedChunks(store, opts)` helper inserts chunk literals under the same `(graph, subject, predicate)` layout that `ingestSwmCiphertextChunkEnvelope` in `dkg-agent.ts` writes. Verification ============ - `pnpm --filter @origintrail-official/dkg-publisher build` — clean - `vitest run test/v10-ack-v2-chunked.test.ts` — 8/8 pass - `vitest run test/v10-ack-v2-chunked.test.ts test/v10-ack-edge-cases.test.ts` — 53/53 pass (8 new + 45 existing); no regressions to the V1 suite Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720 adapter coverage) and OriginTrail#737 (OriginTrail#700 drain bug). Co-authored-by: Cursor <cursoragent@cursor.com>

…onder + OriginTrail#729 Bug 5 canonical CG keying Adds `packages/agent/test/lu11-handle-get-ciphertext-chunk.test.ts` — 6 cases exercising the LU-11 / OT-RFC-39 `get-ciphertext-chunk` sync verb responder shipped by PR OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 5. Background ========== OriginTrail#716 review-consolidation audit flagged this as one of the critical adapter-wiring gaps in the same class as OriginTrail#720/OriginTrail#735: helper-level signing primitives (`mintSignedCiphertextChunkCatchupRequest`, `verifySignedCiphertextChunkCatchupRequest`, replay guard) are present, but the responder that wires them together with the 5-layer authority stack and the SPARQL chunk lookup is completely untested. Worse, PR OriginTrail#729 Bug 5 shipped a fix to the responder's chunk-graph lookup without a regression test: - Pre-fix: `gossipWireIdFor(req.contextGraphId)` was called unconditionally. For a numeric on-chain id like "42" this keccak'd the literal decimal string and produced a graph URI nothing was ever persisted under — every late-joining core's backfill request silently returned "chunk not found" even when the bytes were on disk under the curator's nameHash. - Fix: routes through `canonicalChunkStoreCgIdOrNull` which returns `null` for inputs it can't safely canonicalise; the responder then widens to a `GRAPH ?g` wildcard scan. Scope ===== Six cases — happy path, the Bug 5 regression, structured-deny shapes, and the replay-guard boundary: 1. **Happy path** — subscribed cleartext CG → keccak wire hash → scoped lookup → chunk returned. Pins the production canonicalisation path the responder relies on. 2. **OriginTrail#729 Bug 5 regression** — numeric `contextGraphId = "42"` with no local CG mapping → `canonicalChunkStoreCgIdOrNull` returns `null` → handler widens to `GRAPH ?g` → finds the chunk persisted under the curator's real nameHash graph. Locks the fix in so a future refactor can't silently re-introduce the decimal-string keccak miss. 3. **`chunk not found` decline** — authorised requester + canonicalisable CG but no chunk persisted → structured `denied: 'chunk not found'` with echoed `(contextGraphId, batchIdHex, chunkIndex)` for requester correlation. 4. **Unauthorised requester** — full 5-layer auth fall-through: none of `resolveOnChainParticipantAgents`, `resolveBeaconPinnedCuratorEoa`, `getContextGraphAgentGateAddresses`, `getContextGraphAllowedPeers`, `chain.getIdentityIdForAddress` admit the requester → structured `denied: 'requester EOA not in any of: ...'` (or `'no authority source available'` when every probe returns null/undefined on MockChainAdapter). 5. **Malformed request bytes** — decoder throws → handler maps to `denied: 'malformed request: ...'` with defensive defaults on the echo fields (empty `contextGraphId`, empty `batchIdHex`, `-1` chunkIndex) so an attacker-controlled garbage payload can't leak back through the response envelope. 6. **Replay-guard boundary** — same wire bytes twice (same nonce, same issuedAtMs, same signature) → first attempt succeeds, second is rejected with `denied: 'replayed chunk-catchup nonce'`. Pins `ciphertextChunkCatchupReplayGuard.recordIfFresh()` so the defensive boundary against signed-envelope replay holds. Test harness ============ Real `DKGAgent` instance booted on `MockChainAdapter`. The responder method is reached through an `(agent as unknown as ResponderInternals)` cast — same pattern as the existing `swm-sender-key-pending-by-agent.test.ts`. Real ciphertext-catchup proto round-trip (`mintSignedCiphertextChunkCatchupRequest` → `encodeCiphertextChunkCatchupRequest` → `handleGetCiphertextChunk` → `decodeCiphertextChunkCatchupResponse`), real EIP-191 personal-sign via `ethers.Wallet.signMessage`. The OT-RFC-39 fifth authority is exercised by monkey-patching `chain.getIdentityIdForAddress` on the agent's chain adapter — the other four authorities answer null/undefined naturally on MockChainAdapter. Verification ============ - `pnpm --filter @origintrail-official/dkg-agent build` — clean - `vitest run test/lu11-handle-get-ciphertext-chunk.test.ts` — 6/6 pass No production code changes. Test-only PR. Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720), PR OriginTrail#737 (OriginTrail#700 drain bug), PR OriginTrail#738 (OriginTrail#729 Bug 4 V2 ACK loadChunk canonical keying). Co-authored-by: Cursor <cursoragent@cursor.com>

… 1-of-N partial-fail aggregation PR OriginTrail#716 review-consolidation audit flagged two concurrency-critical gates in PR OriginTrail#700 as "Critical — helper code tested, wiring untested": 1. `publishProfileTail` mutex (dkg-agent.ts:4085-4101). The chained- promise tail is what prevents startup, heartbeat, key rotation, and key revocation from racing on `ProfileManager.currentKcId` and the agent registry triples. The audit found zero direct coverage. This adds 3 tests: - N=5 concurrent `publishProfile()` calls: `maxConcurrency===1` and FIFO call order is preserved. - Error isolation: a rejecting `publishProfileImpl` does not poison the tail; subsequent callers still run. - Return-value isolation: each caller receives its OWN implementation return, not a sibling's. 2. SWM sender-key 1-of-N partial-fail aggregation (dkg-agent.ts:5998-6042). Existing tests covered the all-fail throw and the all-soft no-throw branches; the "M-of-N fatal" path — where some agents succeed and others are fatal — was never exercised. This adds 1 test: - 3 recipients, only the middle one's ack rejects: the throw must cite exactly 1 agent, include that agent's address + the per-recipient reason, and MUST NOT contain the successful recipients' addresses. Both gaps were verified by inspection before adding tests — the production code is correct today, but a refactor that collapses the tail-chain or relaxes the per-agent grouping would silently ship a concurrency regression. These pins make either failure mode loud at PR time. Closes audit items D1 (mutex) + D2 (aggregation) from OriginTrail#716. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/core/src/network/peer-resolver.ts Outdated

Comment thread packages/agent/src/dkg-agent.ts Outdated

Comment thread packages/core/src/network/peer-resolver.ts Outdated

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/agent/src/profile.ts Outdated

Comment thread packages/agent/src/dkg-agent.ts Outdated

Comment thread packages/core/src/network/peer-resolver.ts Outdated

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/agent/src/discovery.ts Outdated

Comment thread packages/core/src/network/peer-resolver.ts Outdated

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/agent/src/discovery.ts

Comment thread packages/agent/src/profile.ts Outdated

Comment thread packages/core/src/network/peer-resolver.ts Outdated

branarakic added a commit that referenced this pull request May 27, 2026

Merge pull request #737 from OriginTrail/fix/rc12-700-pending-drain-bug

6b594aa

fix(agent/discovery): populate `agentAddress` on findAgentByPeerId so #700 drain works in production

branarakic added a commit that referenced this pull request May 27, 2026

Merge pull request #740 from OriginTrail/test/rc12-700-mutex-and-aggr…

574e8aa

…egation test(agent): pin #700 publishProfile mutex serialization + 1-of-N partial-fail aggregation

branarakic changed the base branch from main to release/rc.12 May 27, 2026 14:55

branarakic changed the title ~~feat(agent/core): agents Context Graph as distributed phonebook~~ Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook) May 27, 2026

branarakic mentioned this pull request May 27, 2026

rc.12 stabilization integration — 5 findings (2 HIGH) from 4c+2e devnet sweep on integration/rc.12-stabilize-20260527-2147 #774

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook)#700

Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook)#700
branarakic wants to merge 1 commit into
release/rc.12from
feat/chain-agents-cg-phonebook

branarakic commented May 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 26, 2026

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

branarakic commented May 26, 2026

Summary

Backward compat

Wiring

Tests

Test plan

Conflict note

Companion work

Uh oh!

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 26, 2026

Finding (4): collectPublishableMultiaddrs filter was too loose

Finding (5): in-flight flag only guarded heartbeat-vs-heartbeat

Finding (6): lastSeenMs === undefined was treated as fresh

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 26, 2026

Finding (7) 🔴: SPARQL injection vector in findAgentByPeerId

Finding (8) 🔴: future-timestamp bypass in freshness check

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Finding (4): `collectPublishableMultiaddrs` filter was too loose

Finding (6): `lastSeenMs === undefined` was treated as fresh

Finding (7) 🔴: SPARQL injection vector in `findAgentByPeerId`