Skip to content

Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook)#700

Open
branarakic wants to merge 1 commit into
release/rc.12from
feat/chain-agents-cg-phonebook
Open

Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook)#700
branarakic wants to merge 1 commit into
release/rc.12from
feat/chain-agents-cg-phonebook

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

Summary

Adds three layered pieces so a small / sparse DKG mesh can rediscover direct peer addresses without depending on a flaky DHT lookup:

Layer Where What
Schema profile.ts dkg:multiaddr (one per address) + dkg:lastSeen (ISO timestamp) on every profile
Heartbeat dkg-agent.ts AGENT_PROFILE_HEARTBEAT_MS = 5min; re-publishes profile so phonebook entries stay fresh
Dial fallback peer-resolver.ts step 4 prefers richer findAgentDialAddresses (multiaddrs + relay + lastSeen) with 24h staleness filter
Stall recovery dkg-agent.ts Messenger.resolvePeer now routes through full PeerResolver instead of raw peerRouting.findPeer

The trigger: testnet diagnostics for #697 showed that even when DHT lookups fail, the agents CG already holds enough metadata to find a peer — we just weren't asking it during dial. PR A (#698) tunes the libp2p side; this PR teaches the resolver to use the CG as a phonebook.

Backward compat

  • Schema: dkg:multiaddr and dkg:lastSeen are NOT added to genesis.ts. That would change computeNetworkId (which hashes all genesis quads) and break rolling deployment with any rc.11 peer. RDF doesn't require properties to be declared in the ontology — they're usable as-is. A coordinated genesis bump can land in a future release.
  • AgentDirectoryLookup: findAgentDialAddresses is optional. The resolver falls back to the legacy findRelayForPeer when the directory doesn't implement it, so any external embedder or test fixture using the old shape keeps working.
  • Heartbeat: defaults to 5min; 0 disables (one-shot startup publish still fires). Operator-tunable via config.network.agentProfileHeartbeatMs.

Wiring

  • AgentProfileConfig.multiaddrs + lastSeen; populated in DKGAgent.publishProfile from this.node.multiaddrs filtered through new collectPublishableMultiaddrs (drops loopback / link-local / unspecified bind / dedup).
  • DiscoveryClient.findAgentByPeerId gains a second SPARQL query for dkg:multiaddr rows and pulls ?lastSeen in the scalar SELECT.
  • PeerResolver step 4 — primes the peerStore with direct multiaddrs (staleness-gated) then layers the relay form on top.
  • Messenger resolvePeer hook — routes through peerResolver.resolve() so stall recovery picks up the phonebook path.

Tests

  • agent.test.ts > Profile Builder (+3): multiaddrs/lastSeen emission, default-timestamp behaviour, quote-injection guard, collectPublishableMultiaddrs filter.
  • peer-resolver.test.ts (+4): phonebook path preferred over relay-only; stale lastSeen drops multiaddrs but keeps relay; falls back to old findRelayForPeer when new method absent; custom stale threshold honoured.

Full suites: @dkg/core 935 / 935 green (+2 vs PR A), @dkg/agent Profile Builder 9/9 green. Same single pre-existing swm-snapshot-sync failure on agent suite as PR A (verified on main: Unknown file extension ".ts" worker-thread loader issue, unrelated).

Test plan

  • Build green: pnpm -F @origintrail-official/dkg-core build
  • Build green: pnpm -F @origintrail-official/dkg-agent build
  • Build green: pnpm -F @origintrail-official/dkg build (cli)
  • Unit tests green: peer-resolver.test.ts (23/23), agent Profile Builder (9/9), full @dkg/core (935/935)
  • Manual: with this branch deployed on Miles, watch the agents CG over time — confirm dkg:multiaddr triples appear, get refreshed every 5min, and that querying findAgentByPeerId for Muy-Sentinel returns multiaddrs. Then on Muy's side, kill direct connection to Miles, wait for peerStore eviction (1h default), and confirm a fresh dial succeeds via the phonebook step.
  • Manual: trigger an outbox stall (peer offline > 5 attempts) and confirm [Messenger] recovery routes through PeerResolver (visible via debug logs of agents-CG priming).

Conflict note

Both PR A (#698) and this PR add a network block to DkgConfig. Merging one before the other will require a trivial conflict resolution combining the fields into a single network block. Recommended merge order: A first (smaller, fewer moving parts), then this.

Companion work

Made with Cursor

Comment thread packages/core/src/network/peer-resolver.ts Outdated
Comment thread packages/agent/src/dkg-agent.ts Outdated
Comment thread packages/core/src/network/peer-resolver.ts Outdated
branarakic pushed a commit that referenced this pull request May 26, 2026
- **Null-fallback in step 4** (peer-resolver.ts): when
  `findAgentDialAddresses` is implemented but returns `null` for a
  peer, fall through to `findRelayForPeer` so peers with only a
  legacy relay entry (pre-phonebook profiles) still resolve. The
  interface JSDoc documented this contract; the implementation was
  only honouring it when the richer method was absent entirely.
  Regression test added.

- **Per-address robustness in step 4** (peer-resolver.ts): a single
  malformed `dkg:multiaddr` from an untrusted profile would make the
  whole batch `multiaddr()` conversion throw, poisoning every valid
  sibling. `primeAndAppend` now retries per-address on batch failure
  so the bad entry is dropped in isolation. Regression test added
  (mocks `addKnownAddresses` to throw on the bad entry only).

- **Heartbeat re-entrancy guard** (dkg-agent.ts): the
  `agentProfileHeartbeatTimer` is a fixed-cadence `setInterval`, so
  a slow `publishProfile()` (e.g. chain RPC slow + heartbeat
  configured short) could let two publishes race on
  `ProfileManager.currentKcId`. Added `agentProfileHeartbeatInFlight`
  flag — ticks short-circuit (with a debug log) when a publish is
  still in flight.

Tests: `peer-resolver.test.ts` 25/25 (+2 regression tests). Full
`@dkg/core` suite 937/937 green.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic
Copy link
Copy Markdown
Contributor Author

Addressed all three Codex findings in 1c99a88:

  • 🔴 Null-fallback in step 4: when findAgentDialAddresses is implemented but returns null, the resolver now falls through to findRelayForPeer so peers with only a legacy relay entry (pre-phonebook profiles) still resolve. Regression test added.

  • 🔴 Heartbeat re-entrancy: added agentProfileHeartbeatInFlight flag; ticks short-circuit (with a debug log) when a previous publishProfile() is still running, eliminating the currentKcId race when chain RPC is slow or the heartbeat is configured tight. No dedicated unit test — guard is a 5-line well-understood try/finally, and a test would require full DKGAgent setup for marginal value. Existing agent integration tests cover regression.

  • 🟡 Per-address robustness: primeAndAppend now retries per-address when the batch addKnownAddresses throws, so a single malformed dkg:multiaddr from an untrusted profile is dropped in isolation instead of poisoning every valid sibling. Regression test added (mocks addKnownAddresses to throw on the bad entry only).

Tests: peer-resolver.test.ts 25/25 (+2 regression). Full @dkg/core suite 937/937 green.

Comment thread packages/agent/src/profile.ts Outdated
Comment thread packages/agent/src/dkg-agent.ts Outdated
Comment thread packages/core/src/network/peer-resolver.ts Outdated
branarakic pushed a commit that referenced this pull request May 26, 2026
Three follow-on issues raised on the round-1 fixes:

(1) `collectPublishableMultiaddrs` regex filter was looser than the
existing repo classifier — RFC1918 / CGNAT / ULA / `/dns*/localhost`
multiaddrs would still get advertised via `dkg:multiaddr`, so peers
learnt self-referential / private addresses from the phonebook and
wasted dial attempts before falling back to the relay. Replaced
with `isPublicLikeAddress` from `dkg-core` (the shared classifier
that already pins behaviour with `share-project-modal.test.ts` and
the daemon's remotely-dialable check). Test fence expanded with
RFC1918 / CGNAT / ULA / DNS-localhost cases.

(2) The heartbeat-only `agentProfileHeartbeatInFlight` guard left a
race window vs the other `publishProfile()` callers — startup,
key rotation, and revocation also call this method, and they all
mutate `ProfileManager.currentKcId` plus rewrite the registry
triples. Added a `publishProfileTail` mutex (tail-promise chain)
inside `publishProfile()` so EVERY caller is serialised at the
lowest level. Kept the heartbeat inFlight flag as a tick-
coalescing optimisation (keeps queue depth at 1) but documented
that it's not the correctness gate.

(3) Step-4 staleness gate treated `lastSeenMs === undefined` as
fresh, which contradicted the `DiscoveredAgent.lastSeen` JSDoc
saying unknown freshness should fall back to relay only. Partial
or manual agents-CG entries (no heartbeat) bypassed the
stale-data guard. Flipped to "fresh iff lastSeenMs present AND
within window"; relay address still tried regardless. JSDoc
updated and a regression test pins the new behaviour.

Drive-by: also escaped a `/dns*/` token in the new JSDoc that
TypeScript was parsing as a comment terminator.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic
Copy link
Copy Markdown
Contributor Author

Round 2 Codex feedback addressed in 5c3a77c7:

Finding (4): collectPublishableMultiaddrs filter was too loose

  • Diagnosis confirmed. The round-1 regex only caught loopback / link-local / unspecified bind. RFC1918 (10/8, 172.16/12, 192.168/16), CGNAT (100.64/10), IPv6 ULA (fc00::/7), and /dns4/localhost would have leaked into the agent profile, causing peers to learn self-referential or private multiaddrs and waste dial attempts before falling back to the relay.
  • Fix: replaced the inline regex with isPublicLikeAddress from dkg-core — the same classifier the daemon already uses for "node is remotely-dialable" and that pairs with the UI's isMultiaddrRemotelyDialable in share-project-modal.test.ts. Now there's exactly one classifier across the codebase. dkg-core re-exports isPublicLikeAddress so profile.ts can consume it without reaching into a deep import path. Test fence in agent.test.ts expanded with RFC1918 / CGNAT / ULA / DNS-localhost / public-DNS cases.

Finding (5): in-flight flag only guarded heartbeat-vs-heartbeat

  • Diagnosis confirmed. publishProfile() is also called from startup, key rotation, and revocation paths — all of which mutate ProfileManager.currentKcId and rewrite registry triples. The round-1 agentProfileHeartbeatInFlight flag protected nothing about those overlaps.
  • Fix: added a publishProfileTail tail-promise chain INSIDE publishProfile() itself. Every caller (heartbeat, startup, key rotation, revocation, manual API) now waits for the prior publish to settle (success or failure) before running its own. Kept the heartbeat inFlight flag as a tick-coalescing optimisation (keeps queue depth at 1 even if publish latency exceeds the heartbeat cadence) but documented in the JSDoc that it's NOT the correctness gate — the mutex is.

Finding (6): lastSeenMs === undefined was treated as fresh

  • Diagnosis confirmed. The round-1 staleness check used dial.lastSeenMs !== undefined && Date.now() - dial.lastSeenMs > threshold, which means missing freshness → isStale=false → multiaddrs used anyway. That contradicted the DiscoveredAgent.lastSeen JSDoc ("unknown freshness should fall back to relay only") and let partial / manual agents-CG entries bypass the stale-data guard.
  • Fix: flipped to isFresh = lastSeenMs !== undefined && Date.now() - lastSeenMs <= threshold and only use multiaddrs when isFresh. Relay address still tried regardless (relays outlive individual peer NAT bindings). Added a regression test (step 4 (phonebook): missing lastSeenMs is treated as stale (multiaddrs dropped, relay still used)) that pins the new behaviour. JSDoc on both AgentDirectoryLookup.findAgentDialAddresses and the docs around the gate updated.

CI green locally (core 938/938, agent 120/120). Drive-by: also escaped a /dns*/ token in the new JSDoc that TypeScript was reading as a comment terminator.

Comment thread packages/agent/src/discovery.ts Outdated
Comment thread packages/core/src/network/peer-resolver.ts Outdated
branarakic pushed a commit that referenced this pull request May 26, 2026
Two real-bug findings from round 2:

(1) SPARQL injection vector in `DiscoveryClient.findAgentByPeerId`:
the `agentUri` bound from the first query was interpolated
unguarded as `<${agentUri}>` in the follow-up multiaddr query.
A blank-node subject (`_:b1`) or an IRI containing `>` /
whitespace / control chars from a malicious agents-CG entry would
break the second query or open an injection path. Two layers of
defense added: (a) `FILTER(isIRI(?agent))` in the first query so
the engine drops non-IRI bindings before they reach JS, and (b)
the existing `assertSafeIri` / `sparqlIri` helpers from
`core/sparql-safe.ts` validate the IRI before interpolation. On
validation failure the whole entry is dropped (return null) so the
bug can't relocate downstream via the returned `agentUri`.

(2) The round-2 freshness gate `Date.now() - lastSeenMs <=
threshold` silently accepted ANY future `lastSeenMs` because the
LHS becomes negative. A clock-skewed or malicious agents-CG
profile could therefore keep dead direct multiaddrs eligible
indefinitely. Added an upper bound `lastSeenMs <= now +
AGENT_DIRECTORY_CLOCK_SKEW_ALLOWANCE_MS` (5 minutes — covers NTP
drift between independent hosts without admitting attacks).
Regression tests pin both the future-rejection corner and the
within-skew-acceptance corner. JSDoc on
`AgentDirectoryLookup.findAgentDialAddresses` updated to document
the two-sided window.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic
Copy link
Copy Markdown
Contributor Author

Round 3 Codex feedback addressed in 1d9842aa:

Finding (7) 🔴: SPARQL injection vector in findAgentByPeerId

  • Diagnosis confirmed. The agentUri bound from the first SPARQL query was interpolated unguarded as <${agentUri}> in the follow-up multiaddr query. A blank-node subject (_:b1) or any IRI containing > / whitespace / control chars from a malicious agents-CG entry would break the second query or open an injection path.
  • Fix: two layers of defense:
    1. Engine layer: FILTER(isIRI(?agent)) added to the first query so non-IRI bindings never reach JS.
    2. JS layer: wrapped the IRI with assertSafeIri() / sparqlIri() from core/sparql-safe.ts (the existing helpers). On validation failure, the entire entry is dropped (return null) so a malformed agentUri can't relocate the bug downstream via the returned DiscoveredAgent.agentUri.

Finding (8) 🔴: future-timestamp bypass in freshness check

  • Diagnosis confirmed. Date.now() - lastSeenMs <= threshold silently accepted ANY future lastSeenMs because the LHS goes negative. A clock-skewed or malicious profile could keep dead direct multiaddrs eligible indefinitely.
  • Fix: added a 5-minute upper-bound skew allowance: lastSeenMs !== undefined && lastSeenMs <= now + skew && now - lastSeenMs <= threshold. The 5-minute window covers reasonable NTP drift between independent hosts without admitting attacks. Two regression tests pin the corners:
    • 1-hour future timestamp → multiaddrs DROPPED, relay still used.
    • 1-minute future timestamp (within skew) → multiaddrs KEPT.

JSDoc on AgentDirectoryLookup.findAgentDialAddresses updated to document the two-sided window explicitly. CI green locally (core 944/944 with new tests, agent 120/120).

Comment thread packages/agent/src/discovery.ts
Comment thread packages/agent/src/profile.ts Outdated
Comment thread packages/core/src/network/peer-resolver.ts Outdated
branarakic pushed a commit that referenced this pull request May 26, 2026
Agents Context Graph as distributed phonebook — adds DiscoveryClient
multiaddr/lastSeen advertisement so peer resolution can fall back to
the CG instead of relying solely on libp2p PeerStore + the public
relay's routing table. Includes 3 rounds of Codex review fixes.

Notably interacts with PR-2 of the SWM-fanout work already on rc.12:
PR-2's drainPendingSenderKeyForPeer calls discovery.findAgentByPeerId
to resolve the connecting peer's agent address \u2192 phonebook makes that
resolution work for agents we haven't directly synced with yet.

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	packages/agent/src/dkg-agent-types.ts
#	packages/cli/src/config.ts
#	packages/cli/src/daemon/lifecycle.ts
#	packages/core/src/index.ts
#	packages/core/src/network/peer-resolver.ts
#	packages/core/test/peer-resolver.test.ts
branarakic added a commit that referenced this pull request May 27, 2026
fix(agent/discovery): populate `agentAddress` on findAgentByPeerId so #700 drain works in production
Round 4 of Codex review cleanup, plus node-UI activity feed work that
sits on top of the same branch:

agent/core scope trim (review feedback):
- Remove agent-profile heartbeat (`AGENT_PROFILE_HEARTBEAT_MS`,
  `AGENT_PROFILE_STALE_THRESHOLD_MS`, periodic re-publish timer).
- Remove `multiaddrs` / `lastSeen` fields from `DiscoveredAgent` and
  the SPARQL multiaddr-collection query in `findAgentByPeerId`.
- Drop `collectPublishableMultiaddrs` helper and the discovery client
  multiaddr round-trip; profile is now just the scalar phonebook entry.
- Trim `peer-resolver.ts` and `agent.test.ts` to match.
- Remove `agentProfileHeartbeatMs` config knob.
- Net: phonebook is the "agents Context Graph as directory" feature
  the PR title states; multiaddr / freshness signalling is deferred.

node-ui activity feed:
- New `useAssertionLifecycleEvents` hook + tests; surfaces created /
  promoted / published / finalized / discarded events on assertions.
- Project activity joiner now merges lifecycle events with the existing
  SWM gossip / VM publish streams in a single ordered feed.
- ActivityFeed + AgentChip render updates; new test fixtures cover
  empty-state and joiner-order edge cases.

evm-module:
- Refresh `base_sepolia_v10_contracts.json` against the current devnet
  deployment.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic added a commit that referenced this pull request May 27, 2026
…egation

test(agent): pin #700 publishProfile mutex serialization + 1-of-N partial-fail aggregation
@branarakic branarakic changed the base branch from main to release/rc.12 May 27, 2026 14:55
@branarakic branarakic changed the title feat(agent/core): agents Context Graph as distributed phonebook Do Not Merge (pending: feat(agent/core): agents Context Graph as distributed phonebook) May 27, 2026
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
…lidation

Productionising the test-suite findings from a full rc.12 comprehensive
devnet sweep so the same harness reliably guards rc.13+. All changes are
in scripts/; no production code touched.

Fixes uncovered while testing release/rc.12:

v10-rc-validation.sh — migrate to rc.12 API shapes
  * publish path: /api/publish (removed) → /api/shared-memory/write +
    /api/shared-memory/publish, with selection.rootEntities so re-runs
    against a CG that already has SWM content don't trip the rc.12
    rootEntity-uniqueness rule
  * private quads: rewritten to /api/update + privateMerkleRoot receipt
    (rc.12 stores private quads encrypted-at-rest; they are intentionally
    not served via /api/query, so the old "publisher sees private back"
    assertion is replaced with the storage-receipt check)
  * CAS: send a real non-empty conditions array (rc.12 rejects empty)
  * chat: { to, text } (was recipientPeerId/text)
  * identity: /api/profile (removed) → /api/identity + /api/status
  * exit non-zero on any FAIL so the orchestrator picks it up
  * timestamp-suffixed entity URIs, sub-graph names, assertion names so
    every re-run is collision-free even on a long-lived devnet
  * helper rewritten so the JSON-extraction pipeline is heredoc-safe
    (was: stdin shadowed by heredoc; expressions with semicolons silently
    SyntaxError'd and every parse returned EMPTY)

swm-soak-test.sh — python heredoc bug
  * Final-summary block used `<<PYTHON` (unquoted), so bash expanded
    every backtick-quoted code reference in Python comments (`m =
    re.match(...)`, `s`, `by_cg_final[cg]`, etc.) as command
    substitutions, producing a flood of "command not found" lines.
    Switched to `<<'PYTHON'` + env-var pass-through.

devnet-test-rfc38-curator-offline-midbatch.sh — wrong endpoint
  * wait_for_m1_onchain_id polled `GET /api/context-graph/<id>`, which
    is a 404 (no such route) — the test reliably false-failed on every
    run. Fixed to use `GET /api/context-graph/list` + client-side lookup,
    and made the timeout configurable via RFC38_M1_ONCHAIN_WAIT_S=60.

devnet-test-rfc38-unclean-restart.sh — too-fast catchup + curl ARG_MAX
  * Bumped WRITES_COUNT 20 → 200 and WRITE_PAYLOAD_BYTES 4096 → 16384,
    and dropped the partial-catchup poll from 1 s → 100 ms, so the test
    reliably observes M1 mid-batch at rc.12 catchup speeds (verified:
    M1_pre-kill = 159 / 200 on the reference devnet).
  * api_call streams large bodies through stdin (`-d @-`) instead of
    argv; pre-fix the 3.2 MiB stress body tripped macOS ARG_MAX with
    "Argument list too long".

devnet.sh — `stop` is now actually idempotent
  * After stopping by pidfile, sweep every devnet port
    (HARDHAT_PORT + API_PORT_BASE..+N + LIBP2P_PORT_BASE..+N) with
    `lsof` and SIGTERM/SIGKILL any process still LISTENing. Catches
    stale processes from prior rc.X devnets that were killed at the
    worker layer while the supervisor respawned them. Opt out with
    DEVNET_STOP_PORT_SWEEP=0 when running multiple devnets on one host.

Promoted from .rc12-test/ scratch dir to scripts/ for rc.13+:
  * devnet-probe-hub-rotation.sh           — PR OriginTrail#689 (chain hub rotation)
  * devnet-probe-multi-rpc-failover.sh     — PR OriginTrail#684 (multi-RPC failover)
  * devnet-probe-libp2p-tunables.sh        — PR OriginTrail#698 (libp2p tunables)
  * devnet-probe-cg-phonebook.sh           — PR OriginTrail#700 (agents CG)
  * devnet-probe-ack-rejection-reasons.sh  — PR OriginTrail#711 (ACK gate diagnostics)
  * devnet-test-node-ui-smoke.sh           — Vite dev-server smoke
  * devnet-comprehensive.sh                — orchestrator wiring all of
    the above + v10-rc-validation + _devnet-full-sweep + rc11 recovery
    + rfc38-all + soak suite; bash-3.2 safe (no `declare -A`);
    SKIP_SOAK / SOAK_ONLY / SKIP_PROBES / SKIP_RFC38_EXTRAS / SKIP_UI
    / FAIL_FAST knobs.

Verified on the rc.12 reference devnet:
  - v10-rc-validation:                34 PASS / 0 FAIL / 2 WARN (exit 0)
  - rfc38-unclean-restart:            PASS (mid-batch window: 159/200)
  - rfc38-curator-offline-midbatch:   PASS
  - 5 promoted probes (new layout):   all PASS
  - swm-soak (2 cycles, quick):       100.00% on both CGs, no errors
  - devnet-comprehensive orchestrator: pre-flight + suite registration OK

Co-authored-by: Cursor <cursoragent@cursor.com>
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
…riginTrail#700 drain works in production

PR OriginTrail#700 ("Agents Context Graph as distributed phonebook") shipped
`DKGAgent.drainPendingSenderKeyForPeer` (`dkg-agent.ts:6089-6138`) as
the recovery loop that replays queued SWM sender keys once a previously-
unknown agent's peerId resolves on `connection:open`. It's the entire
point of the pending-by-agent path landed by OriginTrail#700.

The bug
=======

`drainPendingSenderKeyForPeer` gates on `profile?.agentAddress`:

  const profile = await this.discovery.findAgentByPeerId(peerId);
  if (profile?.agentAddress) {
    agentAddresses = [profile.agentAddress.toLowerCase()];
  }
  ...
  if (agentAddresses.length === 0) return 0;

But `DiscoveryClient.findAgentByPeerId` (`discovery.ts:147-218`)
**never populates that field** — its scalar `SELECT` omits the
`?agentAddress` column entirely, and the return object on lines
209-218 doesn't set it. (Compare to the sibling `findAgents()` on
the same file, which DOES select and return `agentAddress` at lines
71, 78, 92.) The asymmetry between the two discovery entrypoints
meant:

  - Every `connection:open` → drain attempt early-returned 0.
  - `pendingSenderKeyByAgent` grew but never replayed.
  - In production, sender keys queued for not-yet-resolved agents
    would never be delivered — the whole recovery feature shipped as
    a permanent no-op.

CI was green because `swm-sender-key-pending-by-agent.test.ts`
stubbed `discovery.findAgentByPeerId` to return `agentAddress`
explicitly (lines 202-209), masking the production gap.

The fix
=======

Three lines: add `?agentAddress` to the scalar `SELECT`, add the
matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause, and
populate the field on the returned object. Now both
`findAgents()` and `findAgentByPeerId()` resolve the same identity
for the same peer.

Regression tests
================

1. `test/agent.test.ts` — Discovery Client suite — new test
   "returns agentAddress on findAgentByPeerId — keeps both discovery
   entrypoints in lockstep". Inserts a profile with an explicit
   `agentAddress`, asserts both `findAgents()` AND `findAgentByPeerId()`
   return it (lowercased per `canonicalAgentDidSubject`), and pins
   the legacy-profile-without-agentAddress fallback (returns
   `undefined`, doesn't throw).

2. `test/swm-sender-key-pending-by-agent.test.ts` — new
   "real discovery + agent registry CG" describe block — two
   integration tests that exercise the drain path **without**
   stubbing discovery:

   a. Boot agent → enqueue pending sender key via the no-peerId
      path → publish the recipient's profile (real `buildAgentProfile`
      output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer`
      with the real `DiscoveryClient` → assert messenger.sendReliable
      was called and the queue is empty.

   b. Same flow but the recipient profile omits `dkg:agentAddress`
      (legacy profile) → assert drain returns 0, queue stays in
      place, messenger is never called.

Test (b) would have caught the original bug — every assertion in
the all-stubbed path was satisfied even with the broken implementation.

Verified
========

- `pnpm --filter @origintrail-official/dkg-agent build` — clean
- All 18 SWM sender-key tests pass (+2 new + 16 existing)
- All 7 Discovery Client tests pass (+1 new + 6 existing)

Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive.

Co-authored-by: Cursor <cursoragent@cursor.com>
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
…dler + canonical CG keying (OriginTrail#729 Bug 4)

Adds `packages/publisher/test/v10-ack-v2-chunked.test.ts` — 8 cases
exercising the LU-11 / OT-RFC-39 V2 chunked ACK path landed by
PR OriginTrail#715/OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 4.

Background
==========

OriginTrail#716 review-consolidation audit flagged this as the closest analogue
of the gap PR OriginTrail#735 closed for OriginTrail#720: helper-level primitives (chunked
AEAD, ciphertext Merkle tree, proto wire format, on-chain commitment
fields) are well tested, but the StorageACKHandler that wires them
together at the ACK boundary has zero direct coverage. The existing
`v10-ack-edge-cases.test.ts` thoroughly covers the V1 (inline
staging quads / inline encrypted blob) paths but skips V2 entirely.

Worse, PR OriginTrail#729 Bug 4 shipped a fix to the V2 ACK `loadChunk` graph
canonicalisation (`normalizeContextGraphIdForChunkStore` returning
null → wildcard `GRAPH ?g` fallback) without a regression test —
any future refactor of that hook could silently re-introduce the
"keccak the decimal string '42'" miss.

Scope
=====

Eight cases — happy paths first, then the regression and decline
shapes:

1. **Happy path** — cleartext CG, canonicalising normalizer, chunks
   persisted under `ciphertextChunkStoreGraph(canonical(swmGraphId))`,
   ACK signed. Pins the production canonicalisation path.

2. **OriginTrail#729 Bug 4 regression** — V2 intent omits `swmGraphId`,
   normalizer returns `null` on every input → handler MUST widen
   to `GRAPH ?g` and still find the chunks. The pre-fix code
   keccak'd the decimal `cgId` ('42') and missed every persisted
   chunk; the test now locks the fix in.

3. **Legacy no-normalizer shim** — `normalizeContextGraphIdForChunkStore`
   absent → uses raw `swmGraphId` literally, preserving pre-fix
   behaviour for callers that haven't wired the hook.

4. **Multi-CG isolation (PR OriginTrail#715 / ciphertext-chunk-store.ts:28-44)**
   — two CGs publish identical V10 KCs (same batchId, since it's
   plaintext-derived) but persist chunks under different canonical
   named graphs. ACK for CG-A must only see CG-A's chunks; an
   intent claiming CG-B's root for the same batchId is correctly
   declined with `CIPHERTEXT_ROOT_MISMATCH`. This pins the Codex
   finding that drove the per-CG named-graph design.

5. **`MISSING_CIPHERTEXT_CHUNKS` decline** — claim count=4, persist
   only indexes 0 and 2 → declines after the 10-second retry window
   with the missing indexes in the message ("missing 2/4 ... 1,3").

6. **`CIPHERTEXT_ROOT_MISMATCH` decline** — all chunks present but
   publisher lies about the root → declines fast.

7. **Curated-only gate** — `isCgCurated` returns false → declines
   with `SIGNER_NOT_REGISTERED` (the V2-curated-only gate at
   `storage-ack-handler.ts:296-310`). Closes the bypass concern
   called out in the inline comment above the V2 branch.

8. **stagingQuads forbidden on V2 wire** — V2 intent with non-empty
   `stagingQuads` is rejected (the chunked path never carries
   inline ciphertext).

Test harness
============

Real `OxigraphStore` (mirrors `v10-ack-edge-cases.test.ts` pattern),
real `encodePublishIntent` / `decodeStorageACK` proto round-trips,
real `buildCiphertextChunksRoot` Merkle tree construction. The new
`buildV2IntentBytes(opts)` helper derives `ciphertextChunksRoot`,
`ciphertextChunkCount`, and `publicByteSize` from the chunks list
unless explicitly overridden — production wire shape, not a fake.
The `seedChunks(store, opts)` helper inserts chunk literals under
the same `(graph, subject, predicate)` layout that
`ingestSwmCiphertextChunkEnvelope` in `dkg-agent.ts` writes.

Verification
============

- `pnpm --filter @origintrail-official/dkg-publisher build` — clean
- `vitest run test/v10-ack-v2-chunked.test.ts` — 8/8 pass
- `vitest run test/v10-ack-v2-chunked.test.ts test/v10-ack-edge-cases.test.ts`
  — 53/53 pass (8 new + 45 existing); no regressions to the V1 suite

Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive.
Companion to PR OriginTrail#735 (OriginTrail#720 adapter coverage) and OriginTrail#737 (OriginTrail#700 drain bug).

Co-authored-by: Cursor <cursoragent@cursor.com>
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
…onder + OriginTrail#729 Bug 5 canonical CG keying

Adds `packages/agent/test/lu11-handle-get-ciphertext-chunk.test.ts`
— 6 cases exercising the LU-11 / OT-RFC-39 `get-ciphertext-chunk`
sync verb responder shipped by PR OriginTrail#717 and the canonical-CG-keying
fix in PR OriginTrail#729 Bug 5.

Background
==========

OriginTrail#716 review-consolidation audit flagged this as one of the
critical adapter-wiring gaps in the same class as OriginTrail#720/OriginTrail#735:
helper-level signing primitives (`mintSignedCiphertextChunkCatchupRequest`,
`verifySignedCiphertextChunkCatchupRequest`, replay guard) are
present, but the responder that wires them together with the
5-layer authority stack and the SPARQL chunk lookup is completely
untested.

Worse, PR OriginTrail#729 Bug 5 shipped a fix to the responder's chunk-graph
lookup without a regression test:

  - Pre-fix: `gossipWireIdFor(req.contextGraphId)` was called
    unconditionally. For a numeric on-chain id like "42" this
    keccak'd the literal decimal string and produced a graph URI
    nothing was ever persisted under — every late-joining core's
    backfill request silently returned "chunk not found" even
    when the bytes were on disk under the curator's nameHash.

  - Fix: routes through `canonicalChunkStoreCgIdOrNull` which
    returns `null` for inputs it can't safely canonicalise; the
    responder then widens to a `GRAPH ?g` wildcard scan.

Scope
=====

Six cases — happy path, the Bug 5 regression, structured-deny shapes,
and the replay-guard boundary:

1. **Happy path** — subscribed cleartext CG → keccak wire hash →
   scoped lookup → chunk returned. Pins the production
   canonicalisation path the responder relies on.

2. **OriginTrail#729 Bug 5 regression** — numeric `contextGraphId = "42"` with
   no local CG mapping → `canonicalChunkStoreCgIdOrNull` returns
   `null` → handler widens to `GRAPH ?g` → finds the chunk
   persisted under the curator's real nameHash graph. Locks the
   fix in so a future refactor can't silently re-introduce the
   decimal-string keccak miss.

3. **`chunk not found` decline** — authorised requester +
   canonicalisable CG but no chunk persisted → structured
   `denied: 'chunk not found'` with echoed `(contextGraphId,
   batchIdHex, chunkIndex)` for requester correlation.

4. **Unauthorised requester** — full 5-layer auth fall-through:
   none of `resolveOnChainParticipantAgents`, `resolveBeaconPinnedCuratorEoa`,
   `getContextGraphAgentGateAddresses`, `getContextGraphAllowedPeers`,
   `chain.getIdentityIdForAddress` admit the requester →
   structured `denied: 'requester EOA not in any of: ...'`
   (or `'no authority source available'` when every probe
   returns null/undefined on MockChainAdapter).

5. **Malformed request bytes** — decoder throws → handler maps to
   `denied: 'malformed request: ...'` with defensive defaults on
   the echo fields (empty `contextGraphId`, empty `batchIdHex`,
   `-1` chunkIndex) so an attacker-controlled garbage payload
   can't leak back through the response envelope.

6. **Replay-guard boundary** — same wire bytes twice (same nonce,
   same issuedAtMs, same signature) → first attempt succeeds,
   second is rejected with `denied: 'replayed chunk-catchup nonce'`.
   Pins `ciphertextChunkCatchupReplayGuard.recordIfFresh()` so
   the defensive boundary against signed-envelope replay holds.

Test harness
============

Real `DKGAgent` instance booted on `MockChainAdapter`. The
responder method is reached through an `(agent as unknown as
ResponderInternals)` cast — same pattern as the existing
`swm-sender-key-pending-by-agent.test.ts`. Real ciphertext-catchup
proto round-trip (`mintSignedCiphertextChunkCatchupRequest` →
`encodeCiphertextChunkCatchupRequest` →
`handleGetCiphertextChunk` → `decodeCiphertextChunkCatchupResponse`),
real EIP-191 personal-sign via `ethers.Wallet.signMessage`. The
OT-RFC-39 fifth authority is exercised by monkey-patching
`chain.getIdentityIdForAddress` on the agent's chain adapter —
the other four authorities answer null/undefined naturally on
MockChainAdapter.

Verification
============

- `pnpm --filter @origintrail-official/dkg-agent build` — clean
- `vitest run test/lu11-handle-get-ciphertext-chunk.test.ts` —
  6/6 pass

No production code changes. Test-only PR.

Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive.
Companion to PR OriginTrail#735 (OriginTrail#720), PR OriginTrail#737 (OriginTrail#700 drain bug), PR OriginTrail#738
(OriginTrail#729 Bug 4 V2 ACK loadChunk canonical keying).

Co-authored-by: Cursor <cursoragent@cursor.com>
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
… 1-of-N partial-fail aggregation

PR OriginTrail#716 review-consolidation audit flagged two concurrency-critical
gates in PR OriginTrail#700 as "Critical — helper code tested, wiring untested":

1. `publishProfileTail` mutex (dkg-agent.ts:4085-4101). The chained-
   promise tail is what prevents startup, heartbeat, key rotation,
   and key revocation from racing on `ProfileManager.currentKcId`
   and the agent registry triples. The audit found zero direct
   coverage. This adds 3 tests:
     - N=5 concurrent `publishProfile()` calls: `maxConcurrency===1`
       and FIFO call order is preserved.
     - Error isolation: a rejecting `publishProfileImpl` does not
       poison the tail; subsequent callers still run.
     - Return-value isolation: each caller receives its OWN
       implementation return, not a sibling's.

2. SWM sender-key 1-of-N partial-fail aggregation
   (dkg-agent.ts:5998-6042). Existing tests covered the all-fail
   throw and the all-soft no-throw branches; the "M-of-N fatal"
   path — where some agents succeed and others are fatal — was
   never exercised. This adds 1 test:
     - 3 recipients, only the middle one's ack rejects: the throw
       must cite exactly 1 agent, include that agent's address +
       the per-recipient reason, and MUST NOT contain the
       successful recipients' addresses.

Both gaps were verified by inspection before adding tests — the
production code is correct today, but a refactor that collapses
the tail-chain or relaxes the per-agent grouping would silently
ship a concurrency regression. These pins make either failure
mode loud at PR time.

Closes audit items D1 (mutex) + D2 (aggregation) from OriginTrail#716.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant