fix(agent/discovery): populate agentAddress on findAgentByPeerId so #700 drain works in production#737
Merged
Conversation
…700 drain works in production PR #700 ("Agents Context Graph as distributed phonebook") shipped `DKGAgent.drainPendingSenderKeyForPeer` (`dkg-agent.ts:6089-6138`) as the recovery loop that replays queued SWM sender keys once a previously- unknown agent's peerId resolves on `connection:open`. It's the entire point of the pending-by-agent path landed by #700. The bug ======= `drainPendingSenderKeyForPeer` gates on `profile?.agentAddress`: const profile = await this.discovery.findAgentByPeerId(peerId); if (profile?.agentAddress) { agentAddresses = [profile.agentAddress.toLowerCase()]; } ... if (agentAddresses.length === 0) return 0; But `DiscoveryClient.findAgentByPeerId` (`discovery.ts:147-218`) **never populates that field** — its scalar `SELECT` omits the `?agentAddress` column entirely, and the return object on lines 209-218 doesn't set it. (Compare to the sibling `findAgents()` on the same file, which DOES select and return `agentAddress` at lines 71, 78, 92.) The asymmetry between the two discovery entrypoints meant: - Every `connection:open` → drain attempt early-returned 0. - `pendingSenderKeyByAgent` grew but never replayed. - In production, sender keys queued for not-yet-resolved agents would never be delivered — the whole recovery feature shipped as a permanent no-op. CI was green because `swm-sender-key-pending-by-agent.test.ts` stubbed `discovery.findAgentByPeerId` to return `agentAddress` explicitly (lines 202-209), masking the production gap. The fix ======= Three lines: add `?agentAddress` to the scalar `SELECT`, add the matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause, and populate the field on the returned object. Now both `findAgents()` and `findAgentByPeerId()` resolve the same identity for the same peer. Regression tests ================ 1. `test/agent.test.ts` — Discovery Client suite — new test "returns agentAddress on findAgentByPeerId — keeps both discovery entrypoints in lockstep". Inserts a profile with an explicit `agentAddress`, asserts both `findAgents()` AND `findAgentByPeerId()` return it (lowercased per `canonicalAgentDidSubject`), and pins the legacy-profile-without-agentAddress fallback (returns `undefined`, doesn't throw). 2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" describe block — two integration tests that exercise the drain path **without** stubbing discovery: a. Boot agent → enqueue pending sender key via the no-peerId path → publish the recipient's profile (real `buildAgentProfile` output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer` with the real `DiscoveryClient` → assert messenger.sendReliable was called and the queue is empty. b. Same flow but the recipient profile omits `dkg:agentAddress` (legacy profile) → assert drain returns 0, queue stays in place, messenger is never called. Test (b) would have caught the original bug — every assertion in the all-stubbed path was satisfied even with the broken implementation. Verified ======== - `pnpm --filter @origintrail-official/dkg-agent build` — clean - All 18 SWM sender-key tests pass (+2 new + 16 existing) - All 7 Discovery Client tests pass (+1 new + 6 existing) Closes audit finding flagged by the #716 review-consolidation deep dive. Co-authored-by: Cursor <cursoragent@cursor.com>
12 tasks
matic031
pushed a commit
to KilianTrunk/dkg
that referenced
this pull request
Jun 2, 2026
…dler + canonical CG keying (OriginTrail#729 Bug 4) Adds `packages/publisher/test/v10-ack-v2-chunked.test.ts` — 8 cases exercising the LU-11 / OT-RFC-39 V2 chunked ACK path landed by PR OriginTrail#715/OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 4. Background ========== OriginTrail#716 review-consolidation audit flagged this as the closest analogue of the gap PR OriginTrail#735 closed for OriginTrail#720: helper-level primitives (chunked AEAD, ciphertext Merkle tree, proto wire format, on-chain commitment fields) are well tested, but the StorageACKHandler that wires them together at the ACK boundary has zero direct coverage. The existing `v10-ack-edge-cases.test.ts` thoroughly covers the V1 (inline staging quads / inline encrypted blob) paths but skips V2 entirely. Worse, PR OriginTrail#729 Bug 4 shipped a fix to the V2 ACK `loadChunk` graph canonicalisation (`normalizeContextGraphIdForChunkStore` returning null → wildcard `GRAPH ?g` fallback) without a regression test — any future refactor of that hook could silently re-introduce the "keccak the decimal string '42'" miss. Scope ===== Eight cases — happy paths first, then the regression and decline shapes: 1. **Happy path** — cleartext CG, canonicalising normalizer, chunks persisted under `ciphertextChunkStoreGraph(canonical(swmGraphId))`, ACK signed. Pins the production canonicalisation path. 2. **OriginTrail#729 Bug 4 regression** — V2 intent omits `swmGraphId`, normalizer returns `null` on every input → handler MUST widen to `GRAPH ?g` and still find the chunks. The pre-fix code keccak'd the decimal `cgId` ('42') and missed every persisted chunk; the test now locks the fix in. 3. **Legacy no-normalizer shim** — `normalizeContextGraphIdForChunkStore` absent → uses raw `swmGraphId` literally, preserving pre-fix behaviour for callers that haven't wired the hook. 4. **Multi-CG isolation (PR OriginTrail#715 / ciphertext-chunk-store.ts:28-44)** — two CGs publish identical V10 KCs (same batchId, since it's plaintext-derived) but persist chunks under different canonical named graphs. ACK for CG-A must only see CG-A's chunks; an intent claiming CG-B's root for the same batchId is correctly declined with `CIPHERTEXT_ROOT_MISMATCH`. This pins the Codex finding that drove the per-CG named-graph design. 5. **`MISSING_CIPHERTEXT_CHUNKS` decline** — claim count=4, persist only indexes 0 and 2 → declines after the 10-second retry window with the missing indexes in the message ("missing 2/4 ... 1,3"). 6. **`CIPHERTEXT_ROOT_MISMATCH` decline** — all chunks present but publisher lies about the root → declines fast. 7. **Curated-only gate** — `isCgCurated` returns false → declines with `SIGNER_NOT_REGISTERED` (the V2-curated-only gate at `storage-ack-handler.ts:296-310`). Closes the bypass concern called out in the inline comment above the V2 branch. 8. **stagingQuads forbidden on V2 wire** — V2 intent with non-empty `stagingQuads` is rejected (the chunked path never carries inline ciphertext). Test harness ============ Real `OxigraphStore` (mirrors `v10-ack-edge-cases.test.ts` pattern), real `encodePublishIntent` / `decodeStorageACK` proto round-trips, real `buildCiphertextChunksRoot` Merkle tree construction. The new `buildV2IntentBytes(opts)` helper derives `ciphertextChunksRoot`, `ciphertextChunkCount`, and `publicByteSize` from the chunks list unless explicitly overridden — production wire shape, not a fake. The `seedChunks(store, opts)` helper inserts chunk literals under the same `(graph, subject, predicate)` layout that `ingestSwmCiphertextChunkEnvelope` in `dkg-agent.ts` writes. Verification ============ - `pnpm --filter @origintrail-official/dkg-publisher build` — clean - `vitest run test/v10-ack-v2-chunked.test.ts` — 8/8 pass - `vitest run test/v10-ack-v2-chunked.test.ts test/v10-ack-edge-cases.test.ts` — 53/53 pass (8 new + 45 existing); no regressions to the V1 suite Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720 adapter coverage) and OriginTrail#737 (OriginTrail#700 drain bug). Co-authored-by: Cursor <cursoragent@cursor.com>
matic031
pushed a commit
to KilianTrunk/dkg
that referenced
this pull request
Jun 2, 2026
…onder + OriginTrail#729 Bug 5 canonical CG keying Adds `packages/agent/test/lu11-handle-get-ciphertext-chunk.test.ts` — 6 cases exercising the LU-11 / OT-RFC-39 `get-ciphertext-chunk` sync verb responder shipped by PR OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 5. Background ========== OriginTrail#716 review-consolidation audit flagged this as one of the critical adapter-wiring gaps in the same class as OriginTrail#720/OriginTrail#735: helper-level signing primitives (`mintSignedCiphertextChunkCatchupRequest`, `verifySignedCiphertextChunkCatchupRequest`, replay guard) are present, but the responder that wires them together with the 5-layer authority stack and the SPARQL chunk lookup is completely untested. Worse, PR OriginTrail#729 Bug 5 shipped a fix to the responder's chunk-graph lookup without a regression test: - Pre-fix: `gossipWireIdFor(req.contextGraphId)` was called unconditionally. For a numeric on-chain id like "42" this keccak'd the literal decimal string and produced a graph URI nothing was ever persisted under — every late-joining core's backfill request silently returned "chunk not found" even when the bytes were on disk under the curator's nameHash. - Fix: routes through `canonicalChunkStoreCgIdOrNull` which returns `null` for inputs it can't safely canonicalise; the responder then widens to a `GRAPH ?g` wildcard scan. Scope ===== Six cases — happy path, the Bug 5 regression, structured-deny shapes, and the replay-guard boundary: 1. **Happy path** — subscribed cleartext CG → keccak wire hash → scoped lookup → chunk returned. Pins the production canonicalisation path the responder relies on. 2. **OriginTrail#729 Bug 5 regression** — numeric `contextGraphId = "42"` with no local CG mapping → `canonicalChunkStoreCgIdOrNull` returns `null` → handler widens to `GRAPH ?g` → finds the chunk persisted under the curator's real nameHash graph. Locks the fix in so a future refactor can't silently re-introduce the decimal-string keccak miss. 3. **`chunk not found` decline** — authorised requester + canonicalisable CG but no chunk persisted → structured `denied: 'chunk not found'` with echoed `(contextGraphId, batchIdHex, chunkIndex)` for requester correlation. 4. **Unauthorised requester** — full 5-layer auth fall-through: none of `resolveOnChainParticipantAgents`, `resolveBeaconPinnedCuratorEoa`, `getContextGraphAgentGateAddresses`, `getContextGraphAllowedPeers`, `chain.getIdentityIdForAddress` admit the requester → structured `denied: 'requester EOA not in any of: ...'` (or `'no authority source available'` when every probe returns null/undefined on MockChainAdapter). 5. **Malformed request bytes** — decoder throws → handler maps to `denied: 'malformed request: ...'` with defensive defaults on the echo fields (empty `contextGraphId`, empty `batchIdHex`, `-1` chunkIndex) so an attacker-controlled garbage payload can't leak back through the response envelope. 6. **Replay-guard boundary** — same wire bytes twice (same nonce, same issuedAtMs, same signature) → first attempt succeeds, second is rejected with `denied: 'replayed chunk-catchup nonce'`. Pins `ciphertextChunkCatchupReplayGuard.recordIfFresh()` so the defensive boundary against signed-envelope replay holds. Test harness ============ Real `DKGAgent` instance booted on `MockChainAdapter`. The responder method is reached through an `(agent as unknown as ResponderInternals)` cast — same pattern as the existing `swm-sender-key-pending-by-agent.test.ts`. Real ciphertext-catchup proto round-trip (`mintSignedCiphertextChunkCatchupRequest` → `encodeCiphertextChunkCatchupRequest` → `handleGetCiphertextChunk` → `decodeCiphertextChunkCatchupResponse`), real EIP-191 personal-sign via `ethers.Wallet.signMessage`. The OT-RFC-39 fifth authority is exercised by monkey-patching `chain.getIdentityIdForAddress` on the agent's chain adapter — the other four authorities answer null/undefined naturally on MockChainAdapter. Verification ============ - `pnpm --filter @origintrail-official/dkg-agent build` — clean - `vitest run test/lu11-handle-get-ciphertext-chunk.test.ts` — 6/6 pass No production code changes. Test-only PR. Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720), PR OriginTrail#737 (OriginTrail#700 drain bug), PR OriginTrail#738 (OriginTrail#729 Bug 4 V2 ACK loadChunk canonical keying). Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
This is a live production bug, not just a test gap. Found during the #716 review-consolidation deep-dive audit. PR #700's
drainPendingSenderKeyForPeeris a permanent no-op onrelease/rc.12because the discovery method it relies on never populates the field it gates on.3-line fix in
discovery.ts+ 3 regression tests that would have caught the original bug.The bug
PR #700 ("Agents Context Graph as distributed phonebook") shipped
DKGAgent.drainPendingSenderKeyForPeer(packages/agent/src/dkg-agent.ts:6089-6138) as the recovery loop that replays queued SWM sender keys once a previously-unknown agent's peerId resolves onconnection:open. This is the entire point of the pending-by-agent path landed by #700 — without it, sender keys queued for not-yet-resolved agents would never be delivered.The drain gates on
profile?.agentAddress:```ts
const profile = await this.discovery.findAgentByPeerId(peerId);
if (profile?.agentAddress) {
agentAddresses = [profile.agentAddress.toLowerCase()];
}
...
if (agentAddresses.length === 0) return 0;
```
But
DiscoveryClient.findAgentByPeerId(packages/agent/src/discovery.ts:147-218) never populates that field:SELECTclause omits the `?agentAddress` column (line 163).Compare to the sibling `findAgents()` on the same file — that one DOES select and return `agentAddress` (lines 71, 78, 92). The asymmetry between the two discovery entrypoints meant in production:
How CI missed this
`swm-sender-key-pending-by-agent.test.ts` stubbed `discovery.findAgentByPeerId` to return `agentAddress` explicitly (lines 202-209 of the existing file), so the stubbed-discovery tests were green even though the real discovery layer never returns that field.
The fix
3 lines in `discovery.ts`:
Now both `findAgents()` and `findAgentByPeerId()` resolve the same identity for the same peer.
Regression tests
1. `test/agent.test.ts` — Discovery Client suite
New test "returns agentAddress on findAgentByPeerId — keeps both discovery entrypoints in lockstep" that:
2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" `describe` block
Two integration tests that exercise the drain path without stubbing discovery:
a. Boot agent → enqueue pending sender key via the no-peerId path → publish the recipient's profile (real `buildAgentProfile` output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer` with the real `DiscoveryClient` → assert messenger.sendReliable was called and the queue is empty.
b. Same flow but the recipient profile omits `dkg:agentAddress` (legacy profile) → assert drain returns 0, queue stays in place, messenger is never called.
Test (b) would have caught the original bug — every assertion in the all-stubbed path was satisfied even with the broken implementation.
Verification
```
pnpm --filter @origintrail-official/dkg-agent build # clean (tsc no errors)
pnpm --filter @origintrail-official/dkg-agent test # 18/18 SWM sender-key tests pass
# 7/7 Discovery Client tests pass
```
Risk
Low. The discovery query change is additive (one new OPTIONAL clause + one new return field). The drain code was already reading `profile?.agentAddress` — populating that field is what the code already expects.
Source of finding
#716 review-consolidation deep-dive audit (the same audit that surfaced #720's missing call-site coverage that became #735). This bug had no symptom in CI; production would have silently lost sender-key recovery for any agent that wasn't already connected at publish time.
Made with Cursor