Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook)#700
Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook)#700branarakic wants to merge 1 commit into
Conversation
- **Null-fallback in step 4** (peer-resolver.ts): when `findAgentDialAddresses` is implemented but returns `null` for a peer, fall through to `findRelayForPeer` so peers with only a legacy relay entry (pre-phonebook profiles) still resolve. The interface JSDoc documented this contract; the implementation was only honouring it when the richer method was absent entirely. Regression test added. - **Per-address robustness in step 4** (peer-resolver.ts): a single malformed `dkg:multiaddr` from an untrusted profile would make the whole batch `multiaddr()` conversion throw, poisoning every valid sibling. `primeAndAppend` now retries per-address on batch failure so the bad entry is dropped in isolation. Regression test added (mocks `addKnownAddresses` to throw on the bad entry only). - **Heartbeat re-entrancy guard** (dkg-agent.ts): the `agentProfileHeartbeatTimer` is a fixed-cadence `setInterval`, so a slow `publishProfile()` (e.g. chain RPC slow + heartbeat configured short) could let two publishes race on `ProfileManager.currentKcId`. Added `agentProfileHeartbeatInFlight` flag — ticks short-circuit (with a debug log) when a publish is still in flight. Tests: `peer-resolver.test.ts` 25/25 (+2 regression tests). Full `@dkg/core` suite 937/937 green. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Addressed all three Codex findings in 1c99a88:
Tests: |
Three follow-on issues raised on the round-1 fixes: (1) `collectPublishableMultiaddrs` regex filter was looser than the existing repo classifier — RFC1918 / CGNAT / ULA / `/dns*/localhost` multiaddrs would still get advertised via `dkg:multiaddr`, so peers learnt self-referential / private addresses from the phonebook and wasted dial attempts before falling back to the relay. Replaced with `isPublicLikeAddress` from `dkg-core` (the shared classifier that already pins behaviour with `share-project-modal.test.ts` and the daemon's remotely-dialable check). Test fence expanded with RFC1918 / CGNAT / ULA / DNS-localhost cases. (2) The heartbeat-only `agentProfileHeartbeatInFlight` guard left a race window vs the other `publishProfile()` callers — startup, key rotation, and revocation also call this method, and they all mutate `ProfileManager.currentKcId` plus rewrite the registry triples. Added a `publishProfileTail` mutex (tail-promise chain) inside `publishProfile()` so EVERY caller is serialised at the lowest level. Kept the heartbeat inFlight flag as a tick- coalescing optimisation (keeps queue depth at 1) but documented that it's not the correctness gate. (3) Step-4 staleness gate treated `lastSeenMs === undefined` as fresh, which contradicted the `DiscoveredAgent.lastSeen` JSDoc saying unknown freshness should fall back to relay only. Partial or manual agents-CG entries (no heartbeat) bypassed the stale-data guard. Flipped to "fresh iff lastSeenMs present AND within window"; relay address still tried regardless. JSDoc updated and a regression test pins the new behaviour. Drive-by: also escaped a `/dns*/` token in the new JSDoc that TypeScript was parsing as a comment terminator. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Round 2 Codex feedback addressed in Finding (4):
|
Two real-bug findings from round 2:
(1) SPARQL injection vector in `DiscoveryClient.findAgentByPeerId`:
the `agentUri` bound from the first query was interpolated
unguarded as `<${agentUri}>` in the follow-up multiaddr query.
A blank-node subject (`_:b1`) or an IRI containing `>` /
whitespace / control chars from a malicious agents-CG entry would
break the second query or open an injection path. Two layers of
defense added: (a) `FILTER(isIRI(?agent))` in the first query so
the engine drops non-IRI bindings before they reach JS, and (b)
the existing `assertSafeIri` / `sparqlIri` helpers from
`core/sparql-safe.ts` validate the IRI before interpolation. On
validation failure the whole entry is dropped (return null) so the
bug can't relocate downstream via the returned `agentUri`.
(2) The round-2 freshness gate `Date.now() - lastSeenMs <=
threshold` silently accepted ANY future `lastSeenMs` because the
LHS becomes negative. A clock-skewed or malicious agents-CG
profile could therefore keep dead direct multiaddrs eligible
indefinitely. Added an upper bound `lastSeenMs <= now +
AGENT_DIRECTORY_CLOCK_SKEW_ALLOWANCE_MS` (5 minutes — covers NTP
drift between independent hosts without admitting attacks).
Regression tests pin both the future-rejection corner and the
within-skew-acceptance corner. JSDoc on
`AgentDirectoryLookup.findAgentDialAddresses` updated to document
the two-sided window.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Round 3 Codex feedback addressed in Finding (7) 🔴: SPARQL injection vector in
|
Agents Context Graph as distributed phonebook — adds DiscoveryClient multiaddr/lastSeen advertisement so peer resolution can fall back to the CG instead of relying solely on libp2p PeerStore + the public relay's routing table. Includes 3 rounds of Codex review fixes. Notably interacts with PR-2 of the SWM-fanout work already on rc.12: PR-2's drainPendingSenderKeyForPeer calls discovery.findAgentByPeerId to resolve the connecting peer's agent address \u2192 phonebook makes that resolution work for agents we haven't directly synced with yet. Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # packages/agent/src/dkg-agent-types.ts # packages/cli/src/config.ts # packages/cli/src/daemon/lifecycle.ts # packages/core/src/index.ts # packages/core/src/network/peer-resolver.ts # packages/core/test/peer-resolver.test.ts
fix(agent/discovery): populate `agentAddress` on findAgentByPeerId so #700 drain works in production
Round 4 of Codex review cleanup, plus node-UI activity feed work that sits on top of the same branch: agent/core scope trim (review feedback): - Remove agent-profile heartbeat (`AGENT_PROFILE_HEARTBEAT_MS`, `AGENT_PROFILE_STALE_THRESHOLD_MS`, periodic re-publish timer). - Remove `multiaddrs` / `lastSeen` fields from `DiscoveredAgent` and the SPARQL multiaddr-collection query in `findAgentByPeerId`. - Drop `collectPublishableMultiaddrs` helper and the discovery client multiaddr round-trip; profile is now just the scalar phonebook entry. - Trim `peer-resolver.ts` and `agent.test.ts` to match. - Remove `agentProfileHeartbeatMs` config knob. - Net: phonebook is the "agents Context Graph as directory" feature the PR title states; multiaddr / freshness signalling is deferred. node-ui activity feed: - New `useAssertionLifecycleEvents` hook + tests; surfaces created / promoted / published / finalized / discarded events on assertions. - Project activity joiner now merges lifecycle events with the existing SWM gossip / VM publish streams in a single ordered feed. - ActivityFeed + AgentChip render updates; new test fixtures cover empty-state and joiner-order edge cases. evm-module: - Refresh `base_sepolia_v10_contracts.json` against the current devnet deployment. Co-authored-by: Cursor <cursoragent@cursor.com>
…egation test(agent): pin #700 publishProfile mutex serialization + 1-of-N partial-fail aggregation
…lidation
Productionising the test-suite findings from a full rc.12 comprehensive
devnet sweep so the same harness reliably guards rc.13+. All changes are
in scripts/; no production code touched.
Fixes uncovered while testing release/rc.12:
v10-rc-validation.sh — migrate to rc.12 API shapes
* publish path: /api/publish (removed) → /api/shared-memory/write +
/api/shared-memory/publish, with selection.rootEntities so re-runs
against a CG that already has SWM content don't trip the rc.12
rootEntity-uniqueness rule
* private quads: rewritten to /api/update + privateMerkleRoot receipt
(rc.12 stores private quads encrypted-at-rest; they are intentionally
not served via /api/query, so the old "publisher sees private back"
assertion is replaced with the storage-receipt check)
* CAS: send a real non-empty conditions array (rc.12 rejects empty)
* chat: { to, text } (was recipientPeerId/text)
* identity: /api/profile (removed) → /api/identity + /api/status
* exit non-zero on any FAIL so the orchestrator picks it up
* timestamp-suffixed entity URIs, sub-graph names, assertion names so
every re-run is collision-free even on a long-lived devnet
* helper rewritten so the JSON-extraction pipeline is heredoc-safe
(was: stdin shadowed by heredoc; expressions with semicolons silently
SyntaxError'd and every parse returned EMPTY)
swm-soak-test.sh — python heredoc bug
* Final-summary block used `<<PYTHON` (unquoted), so bash expanded
every backtick-quoted code reference in Python comments (`m =
re.match(...)`, `s`, `by_cg_final[cg]`, etc.) as command
substitutions, producing a flood of "command not found" lines.
Switched to `<<'PYTHON'` + env-var pass-through.
devnet-test-rfc38-curator-offline-midbatch.sh — wrong endpoint
* wait_for_m1_onchain_id polled `GET /api/context-graph/<id>`, which
is a 404 (no such route) — the test reliably false-failed on every
run. Fixed to use `GET /api/context-graph/list` + client-side lookup,
and made the timeout configurable via RFC38_M1_ONCHAIN_WAIT_S=60.
devnet-test-rfc38-unclean-restart.sh — too-fast catchup + curl ARG_MAX
* Bumped WRITES_COUNT 20 → 200 and WRITE_PAYLOAD_BYTES 4096 → 16384,
and dropped the partial-catchup poll from 1 s → 100 ms, so the test
reliably observes M1 mid-batch at rc.12 catchup speeds (verified:
M1_pre-kill = 159 / 200 on the reference devnet).
* api_call streams large bodies through stdin (`-d @-`) instead of
argv; pre-fix the 3.2 MiB stress body tripped macOS ARG_MAX with
"Argument list too long".
devnet.sh — `stop` is now actually idempotent
* After stopping by pidfile, sweep every devnet port
(HARDHAT_PORT + API_PORT_BASE..+N + LIBP2P_PORT_BASE..+N) with
`lsof` and SIGTERM/SIGKILL any process still LISTENing. Catches
stale processes from prior rc.X devnets that were killed at the
worker layer while the supervisor respawned them. Opt out with
DEVNET_STOP_PORT_SWEEP=0 when running multiple devnets on one host.
Promoted from .rc12-test/ scratch dir to scripts/ for rc.13+:
* devnet-probe-hub-rotation.sh — PR OriginTrail#689 (chain hub rotation)
* devnet-probe-multi-rpc-failover.sh — PR OriginTrail#684 (multi-RPC failover)
* devnet-probe-libp2p-tunables.sh — PR OriginTrail#698 (libp2p tunables)
* devnet-probe-cg-phonebook.sh — PR OriginTrail#700 (agents CG)
* devnet-probe-ack-rejection-reasons.sh — PR OriginTrail#711 (ACK gate diagnostics)
* devnet-test-node-ui-smoke.sh — Vite dev-server smoke
* devnet-comprehensive.sh — orchestrator wiring all of
the above + v10-rc-validation + _devnet-full-sweep + rc11 recovery
+ rfc38-all + soak suite; bash-3.2 safe (no `declare -A`);
SKIP_SOAK / SOAK_ONLY / SKIP_PROBES / SKIP_RFC38_EXTRAS / SKIP_UI
/ FAIL_FAST knobs.
Verified on the rc.12 reference devnet:
- v10-rc-validation: 34 PASS / 0 FAIL / 2 WARN (exit 0)
- rfc38-unclean-restart: PASS (mid-batch window: 159/200)
- rfc38-curator-offline-midbatch: PASS
- 5 promoted probes (new layout): all PASS
- swm-soak (2 cycles, quick): 100.00% on both CGs, no errors
- devnet-comprehensive orchestrator: pre-flight + suite registration OK
Co-authored-by: Cursor <cursoragent@cursor.com>
…riginTrail#700 drain works in production PR OriginTrail#700 ("Agents Context Graph as distributed phonebook") shipped `DKGAgent.drainPendingSenderKeyForPeer` (`dkg-agent.ts:6089-6138`) as the recovery loop that replays queued SWM sender keys once a previously- unknown agent's peerId resolves on `connection:open`. It's the entire point of the pending-by-agent path landed by OriginTrail#700. The bug ======= `drainPendingSenderKeyForPeer` gates on `profile?.agentAddress`: const profile = await this.discovery.findAgentByPeerId(peerId); if (profile?.agentAddress) { agentAddresses = [profile.agentAddress.toLowerCase()]; } ... if (agentAddresses.length === 0) return 0; But `DiscoveryClient.findAgentByPeerId` (`discovery.ts:147-218`) **never populates that field** — its scalar `SELECT` omits the `?agentAddress` column entirely, and the return object on lines 209-218 doesn't set it. (Compare to the sibling `findAgents()` on the same file, which DOES select and return `agentAddress` at lines 71, 78, 92.) The asymmetry between the two discovery entrypoints meant: - Every `connection:open` → drain attempt early-returned 0. - `pendingSenderKeyByAgent` grew but never replayed. - In production, sender keys queued for not-yet-resolved agents would never be delivered — the whole recovery feature shipped as a permanent no-op. CI was green because `swm-sender-key-pending-by-agent.test.ts` stubbed `discovery.findAgentByPeerId` to return `agentAddress` explicitly (lines 202-209), masking the production gap. The fix ======= Three lines: add `?agentAddress` to the scalar `SELECT`, add the matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause, and populate the field on the returned object. Now both `findAgents()` and `findAgentByPeerId()` resolve the same identity for the same peer. Regression tests ================ 1. `test/agent.test.ts` — Discovery Client suite — new test "returns agentAddress on findAgentByPeerId — keeps both discovery entrypoints in lockstep". Inserts a profile with an explicit `agentAddress`, asserts both `findAgents()` AND `findAgentByPeerId()` return it (lowercased per `canonicalAgentDidSubject`), and pins the legacy-profile-without-agentAddress fallback (returns `undefined`, doesn't throw). 2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" describe block — two integration tests that exercise the drain path **without** stubbing discovery: a. Boot agent → enqueue pending sender key via the no-peerId path → publish the recipient's profile (real `buildAgentProfile` output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer` with the real `DiscoveryClient` → assert messenger.sendReliable was called and the queue is empty. b. Same flow but the recipient profile omits `dkg:agentAddress` (legacy profile) → assert drain returns 0, queue stays in place, messenger is never called. Test (b) would have caught the original bug — every assertion in the all-stubbed path was satisfied even with the broken implementation. Verified ======== - `pnpm --filter @origintrail-official/dkg-agent build` — clean - All 18 SWM sender-key tests pass (+2 new + 16 existing) - All 7 Discovery Client tests pass (+1 new + 6 existing) Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Co-authored-by: Cursor <cursoragent@cursor.com>
…dler + canonical CG keying (OriginTrail#729 Bug 4) Adds `packages/publisher/test/v10-ack-v2-chunked.test.ts` — 8 cases exercising the LU-11 / OT-RFC-39 V2 chunked ACK path landed by PR OriginTrail#715/OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 4. Background ========== OriginTrail#716 review-consolidation audit flagged this as the closest analogue of the gap PR OriginTrail#735 closed for OriginTrail#720: helper-level primitives (chunked AEAD, ciphertext Merkle tree, proto wire format, on-chain commitment fields) are well tested, but the StorageACKHandler that wires them together at the ACK boundary has zero direct coverage. The existing `v10-ack-edge-cases.test.ts` thoroughly covers the V1 (inline staging quads / inline encrypted blob) paths but skips V2 entirely. Worse, PR OriginTrail#729 Bug 4 shipped a fix to the V2 ACK `loadChunk` graph canonicalisation (`normalizeContextGraphIdForChunkStore` returning null → wildcard `GRAPH ?g` fallback) without a regression test — any future refactor of that hook could silently re-introduce the "keccak the decimal string '42'" miss. Scope ===== Eight cases — happy paths first, then the regression and decline shapes: 1. **Happy path** — cleartext CG, canonicalising normalizer, chunks persisted under `ciphertextChunkStoreGraph(canonical(swmGraphId))`, ACK signed. Pins the production canonicalisation path. 2. **OriginTrail#729 Bug 4 regression** — V2 intent omits `swmGraphId`, normalizer returns `null` on every input → handler MUST widen to `GRAPH ?g` and still find the chunks. The pre-fix code keccak'd the decimal `cgId` ('42') and missed every persisted chunk; the test now locks the fix in. 3. **Legacy no-normalizer shim** — `normalizeContextGraphIdForChunkStore` absent → uses raw `swmGraphId` literally, preserving pre-fix behaviour for callers that haven't wired the hook. 4. **Multi-CG isolation (PR OriginTrail#715 / ciphertext-chunk-store.ts:28-44)** — two CGs publish identical V10 KCs (same batchId, since it's plaintext-derived) but persist chunks under different canonical named graphs. ACK for CG-A must only see CG-A's chunks; an intent claiming CG-B's root for the same batchId is correctly declined with `CIPHERTEXT_ROOT_MISMATCH`. This pins the Codex finding that drove the per-CG named-graph design. 5. **`MISSING_CIPHERTEXT_CHUNKS` decline** — claim count=4, persist only indexes 0 and 2 → declines after the 10-second retry window with the missing indexes in the message ("missing 2/4 ... 1,3"). 6. **`CIPHERTEXT_ROOT_MISMATCH` decline** — all chunks present but publisher lies about the root → declines fast. 7. **Curated-only gate** — `isCgCurated` returns false → declines with `SIGNER_NOT_REGISTERED` (the V2-curated-only gate at `storage-ack-handler.ts:296-310`). Closes the bypass concern called out in the inline comment above the V2 branch. 8. **stagingQuads forbidden on V2 wire** — V2 intent with non-empty `stagingQuads` is rejected (the chunked path never carries inline ciphertext). Test harness ============ Real `OxigraphStore` (mirrors `v10-ack-edge-cases.test.ts` pattern), real `encodePublishIntent` / `decodeStorageACK` proto round-trips, real `buildCiphertextChunksRoot` Merkle tree construction. The new `buildV2IntentBytes(opts)` helper derives `ciphertextChunksRoot`, `ciphertextChunkCount`, and `publicByteSize` from the chunks list unless explicitly overridden — production wire shape, not a fake. The `seedChunks(store, opts)` helper inserts chunk literals under the same `(graph, subject, predicate)` layout that `ingestSwmCiphertextChunkEnvelope` in `dkg-agent.ts` writes. Verification ============ - `pnpm --filter @origintrail-official/dkg-publisher build` — clean - `vitest run test/v10-ack-v2-chunked.test.ts` — 8/8 pass - `vitest run test/v10-ack-v2-chunked.test.ts test/v10-ack-edge-cases.test.ts` — 53/53 pass (8 new + 45 existing); no regressions to the V1 suite Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720 adapter coverage) and OriginTrail#737 (OriginTrail#700 drain bug). Co-authored-by: Cursor <cursoragent@cursor.com>
…onder + OriginTrail#729 Bug 5 canonical CG keying Adds `packages/agent/test/lu11-handle-get-ciphertext-chunk.test.ts` — 6 cases exercising the LU-11 / OT-RFC-39 `get-ciphertext-chunk` sync verb responder shipped by PR OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 5. Background ========== OriginTrail#716 review-consolidation audit flagged this as one of the critical adapter-wiring gaps in the same class as OriginTrail#720/OriginTrail#735: helper-level signing primitives (`mintSignedCiphertextChunkCatchupRequest`, `verifySignedCiphertextChunkCatchupRequest`, replay guard) are present, but the responder that wires them together with the 5-layer authority stack and the SPARQL chunk lookup is completely untested. Worse, PR OriginTrail#729 Bug 5 shipped a fix to the responder's chunk-graph lookup without a regression test: - Pre-fix: `gossipWireIdFor(req.contextGraphId)` was called unconditionally. For a numeric on-chain id like "42" this keccak'd the literal decimal string and produced a graph URI nothing was ever persisted under — every late-joining core's backfill request silently returned "chunk not found" even when the bytes were on disk under the curator's nameHash. - Fix: routes through `canonicalChunkStoreCgIdOrNull` which returns `null` for inputs it can't safely canonicalise; the responder then widens to a `GRAPH ?g` wildcard scan. Scope ===== Six cases — happy path, the Bug 5 regression, structured-deny shapes, and the replay-guard boundary: 1. **Happy path** — subscribed cleartext CG → keccak wire hash → scoped lookup → chunk returned. Pins the production canonicalisation path the responder relies on. 2. **OriginTrail#729 Bug 5 regression** — numeric `contextGraphId = "42"` with no local CG mapping → `canonicalChunkStoreCgIdOrNull` returns `null` → handler widens to `GRAPH ?g` → finds the chunk persisted under the curator's real nameHash graph. Locks the fix in so a future refactor can't silently re-introduce the decimal-string keccak miss. 3. **`chunk not found` decline** — authorised requester + canonicalisable CG but no chunk persisted → structured `denied: 'chunk not found'` with echoed `(contextGraphId, batchIdHex, chunkIndex)` for requester correlation. 4. **Unauthorised requester** — full 5-layer auth fall-through: none of `resolveOnChainParticipantAgents`, `resolveBeaconPinnedCuratorEoa`, `getContextGraphAgentGateAddresses`, `getContextGraphAllowedPeers`, `chain.getIdentityIdForAddress` admit the requester → structured `denied: 'requester EOA not in any of: ...'` (or `'no authority source available'` when every probe returns null/undefined on MockChainAdapter). 5. **Malformed request bytes** — decoder throws → handler maps to `denied: 'malformed request: ...'` with defensive defaults on the echo fields (empty `contextGraphId`, empty `batchIdHex`, `-1` chunkIndex) so an attacker-controlled garbage payload can't leak back through the response envelope. 6. **Replay-guard boundary** — same wire bytes twice (same nonce, same issuedAtMs, same signature) → first attempt succeeds, second is rejected with `denied: 'replayed chunk-catchup nonce'`. Pins `ciphertextChunkCatchupReplayGuard.recordIfFresh()` so the defensive boundary against signed-envelope replay holds. Test harness ============ Real `DKGAgent` instance booted on `MockChainAdapter`. The responder method is reached through an `(agent as unknown as ResponderInternals)` cast — same pattern as the existing `swm-sender-key-pending-by-agent.test.ts`. Real ciphertext-catchup proto round-trip (`mintSignedCiphertextChunkCatchupRequest` → `encodeCiphertextChunkCatchupRequest` → `handleGetCiphertextChunk` → `decodeCiphertextChunkCatchupResponse`), real EIP-191 personal-sign via `ethers.Wallet.signMessage`. The OT-RFC-39 fifth authority is exercised by monkey-patching `chain.getIdentityIdForAddress` on the agent's chain adapter — the other four authorities answer null/undefined naturally on MockChainAdapter. Verification ============ - `pnpm --filter @origintrail-official/dkg-agent build` — clean - `vitest run test/lu11-handle-get-ciphertext-chunk.test.ts` — 6/6 pass No production code changes. Test-only PR. Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720), PR OriginTrail#737 (OriginTrail#700 drain bug), PR OriginTrail#738 (OriginTrail#729 Bug 4 V2 ACK loadChunk canonical keying). Co-authored-by: Cursor <cursoragent@cursor.com>
… 1-of-N partial-fail aggregation PR OriginTrail#716 review-consolidation audit flagged two concurrency-critical gates in PR OriginTrail#700 as "Critical — helper code tested, wiring untested": 1. `publishProfileTail` mutex (dkg-agent.ts:4085-4101). The chained- promise tail is what prevents startup, heartbeat, key rotation, and key revocation from racing on `ProfileManager.currentKcId` and the agent registry triples. The audit found zero direct coverage. This adds 3 tests: - N=5 concurrent `publishProfile()` calls: `maxConcurrency===1` and FIFO call order is preserved. - Error isolation: a rejecting `publishProfileImpl` does not poison the tail; subsequent callers still run. - Return-value isolation: each caller receives its OWN implementation return, not a sibling's. 2. SWM sender-key 1-of-N partial-fail aggregation (dkg-agent.ts:5998-6042). Existing tests covered the all-fail throw and the all-soft no-throw branches; the "M-of-N fatal" path — where some agents succeed and others are fatal — was never exercised. This adds 1 test: - 3 recipients, only the middle one's ack rejects: the throw must cite exactly 1 agent, include that agent's address + the per-recipient reason, and MUST NOT contain the successful recipients' addresses. Both gaps were verified by inspection before adding tests — the production code is correct today, but a refactor that collapses the tail-chain or relaxes the per-agent grouping would silently ship a concurrency regression. These pins make either failure mode loud at PR time. Closes audit items D1 (mutex) + D2 (aggregation) from OriginTrail#716. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Adds three layered pieces so a small / sparse DKG mesh can rediscover direct peer addresses without depending on a flaky DHT lookup:
profile.tsdkg:multiaddr(one per address) +dkg:lastSeen(ISO timestamp) on every profiledkg-agent.tsAGENT_PROFILE_HEARTBEAT_MS = 5min; re-publishes profile so phonebook entries stay freshpeer-resolver.tsfindAgentDialAddresses(multiaddrs + relay + lastSeen) with 24h staleness filterdkg-agent.tsMessenger.resolvePeernow routes through fullPeerResolverinstead of rawpeerRouting.findPeerThe trigger: testnet diagnostics for #697 showed that even when DHT lookups fail, the agents CG already holds enough metadata to find a peer — we just weren't asking it during dial. PR A (#698) tunes the libp2p side; this PR teaches the resolver to use the CG as a phonebook.
Backward compat
dkg:multiaddranddkg:lastSeenare NOT added togenesis.ts. That would changecomputeNetworkId(which hashes all genesis quads) and break rolling deployment with any rc.11 peer. RDF doesn't require properties to be declared in the ontology — they're usable as-is. A coordinated genesis bump can land in a future release.findAgentDialAddressesis optional. The resolver falls back to the legacyfindRelayForPeerwhen the directory doesn't implement it, so any external embedder or test fixture using the old shape keeps working.0disables (one-shot startup publish still fires). Operator-tunable viaconfig.network.agentProfileHeartbeatMs.Wiring
AgentProfileConfig.multiaddrs+lastSeen; populated inDKGAgent.publishProfilefromthis.node.multiaddrsfiltered through newcollectPublishableMultiaddrs(drops loopback / link-local / unspecified bind / dedup).DiscoveryClient.findAgentByPeerIdgains a second SPARQL query fordkg:multiaddrrows and pulls?lastSeenin the scalar SELECT.PeerResolverstep 4 — primes the peerStore with direct multiaddrs (staleness-gated) then layers the relay form on top.MessengerresolvePeerhook — routes throughpeerResolver.resolve()so stall recovery picks up the phonebook path.Tests
agent.test.ts > Profile Builder(+3): multiaddrs/lastSeen emission, default-timestamp behaviour, quote-injection guard,collectPublishableMultiaddrsfilter.peer-resolver.test.ts(+4): phonebook path preferred over relay-only; stale lastSeen drops multiaddrs but keeps relay; falls back to oldfindRelayForPeerwhen new method absent; custom stale threshold honoured.Full suites:
@dkg/core935 / 935 green (+2 vs PR A),@dkg/agentProfile Builder 9/9 green. Same single pre-existingswm-snapshot-syncfailure on agent suite as PR A (verified onmain:Unknown file extension ".ts"worker-thread loader issue, unrelated).Test plan
pnpm -F @origintrail-official/dkg-core buildpnpm -F @origintrail-official/dkg-agent buildpnpm -F @origintrail-official/dkg build(cli)peer-resolver.test.ts(23/23), agentProfile Builder(9/9), full@dkg/core(935/935)dkg:multiaddrtriples appear, get refreshed every 5min, and that queryingfindAgentByPeerIdfor Muy-Sentinel returns multiaddrs. Then on Muy's side, kill direct connection to Miles, wait for peerStore eviction (1h default), and confirm a fresh dial succeeds via the phonebook step.[Messenger]recovery routes throughPeerResolver(visible via debug logs ofagents-CGpriming).Conflict note
Both PR A (#698) and this PR add a
networkblock toDkgConfig. Merging one before the other will require a trivial conflict resolution combining the fields into a singlenetworkblock. Recommended merge order: A first (smaller, fewer moving parts), then this.Companion work
Made with Cursor