fix(agent/discovery): populate `agentAddress` on findAgentByPeerId so #700 drain works in production by branarakic · Pull Request #737 · OriginTrail/dkg

branarakic · 2026-05-27T11:19:48Z

TL;DR

This is a live production bug, not just a test gap. Found during the #716 review-consolidation deep-dive audit. PR #700's drainPendingSenderKeyForPeer is a permanent no-op on release/rc.12 because the discovery method it relies on never populates the field it gates on.

3-line fix in discovery.ts + 3 regression tests that would have caught the original bug.

The bug

PR #700 ("Agents Context Graph as distributed phonebook") shipped DKGAgent.drainPendingSenderKeyForPeer (packages/agent/src/dkg-agent.ts:6089-6138) as the recovery loop that replays queued SWM sender keys once a previously-unknown agent's peerId resolves on connection:open. This is the entire point of the pending-by-agent path landed by #700 — without it, sender keys queued for not-yet-resolved agents would never be delivered.

The drain gates on profile?.agentAddress:

```ts
const profile = await this.discovery.findAgentByPeerId(peerId);
if (profile?.agentAddress) {
agentAddresses = [profile.agentAddress.toLowerCase()];
}
...
if (agentAddresses.length === 0) return 0;
```

But DiscoveryClient.findAgentByPeerId (packages/agent/src/discovery.ts:147-218) never populates that field:

Its scalar SELECT clause omits the `?agentAddress` column (line 163).
Its return object (lines 209-218) doesn't set the field.

Compare to the sibling `findAgents()` on the same file — that one DOES select and return `agentAddress` (lines 71, 78, 92). The asymmetry between the two discovery entrypoints meant in production:

Every `connection:open` → drain attempt early-returned `0`.
`pendingSenderKeyByAgent` grew but never replayed.
The recovery feature shipped as a permanent no-op.

How CI missed this

`swm-sender-key-pending-by-agent.test.ts` stubbed `discovery.findAgentByPeerId` to return `agentAddress` explicitly (lines 202-209 of the existing file), so the stubbed-discovery tests were green even though the real discovery layer never returns that field.

The fix

3 lines in `discovery.ts`:

Add `?agentAddress` to the scalar `SELECT`.
Add the matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause.
Populate the field on the returned object.

Now both `findAgents()` and `findAgentByPeerId()` resolve the same identity for the same peer.

Regression tests

1. `test/agent.test.ts` — Discovery Client suite

New test "returns agentAddress on findAgentByPeerId — keeps both discovery entrypoints in lockstep" that:

Inserts an agent profile with an explicit `agentAddress`.
Asserts both `findAgents()` AND `findAgentByPeerId()` return the (lowercased) address.
Pins the legacy-profile-without-agentAddress fallback path (`undefined`, not a throw).

2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" `describe` block

Two integration tests that exercise the drain path without stubbing discovery:

a. Boot agent → enqueue pending sender key via the no-peerId path → publish the recipient's profile (real `buildAgentProfile` output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer` with the real `DiscoveryClient` → assert messenger.sendReliable was called and the queue is empty.

b. Same flow but the recipient profile omits `dkg:agentAddress` (legacy profile) → assert drain returns 0, queue stays in place, messenger is never called.

Test (b) would have caught the original bug — every assertion in the all-stubbed path was satisfied even with the broken implementation.

Verification

```
pnpm --filter @origintrail-official/dkg-agent build # clean (tsc no errors)
pnpm --filter @origintrail-official/dkg-agent test # 18/18 SWM sender-key tests pass
# 7/7 Discovery Client tests pass
```

Risk

Low. The discovery query change is additive (one new OPTIONAL clause + one new return field). The drain code was already reading `profile?.agentAddress` — populating that field is what the code already expects.

Source of finding

#716 review-consolidation deep-dive audit (the same audit that surfaced #720's missing call-site coverage that became #735). This bug had no symptom in CI; production would have silently lost sender-key recovery for any agent that wasn't already connected at publish time.

Made with Cursor

…700 drain works in production PR #700 ("Agents Context Graph as distributed phonebook") shipped `DKGAgent.drainPendingSenderKeyForPeer` (`dkg-agent.ts:6089-6138`) as the recovery loop that replays queued SWM sender keys once a previously- unknown agent's peerId resolves on `connection:open`. It's the entire point of the pending-by-agent path landed by #700. The bug ======= `drainPendingSenderKeyForPeer` gates on `profile?.agentAddress`: const profile = await this.discovery.findAgentByPeerId(peerId); if (profile?.agentAddress) { agentAddresses = [profile.agentAddress.toLowerCase()]; } ... if (agentAddresses.length === 0) return 0; But `DiscoveryClient.findAgentByPeerId` (`discovery.ts:147-218`) **never populates that field** — its scalar `SELECT` omits the `?agentAddress` column entirely, and the return object on lines 209-218 doesn't set it. (Compare to the sibling `findAgents()` on the same file, which DOES select and return `agentAddress` at lines 71, 78, 92.) The asymmetry between the two discovery entrypoints meant: - Every `connection:open` → drain attempt early-returned 0. - `pendingSenderKeyByAgent` grew but never replayed. - In production, sender keys queued for not-yet-resolved agents would never be delivered — the whole recovery feature shipped as a permanent no-op. CI was green because `swm-sender-key-pending-by-agent.test.ts` stubbed `discovery.findAgentByPeerId` to return `agentAddress` explicitly (lines 202-209), masking the production gap. The fix ======= Three lines: add `?agentAddress` to the scalar `SELECT`, add the matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause, and populate the field on the returned object. Now both `findAgents()` and `findAgentByPeerId()` resolve the same identity for the same peer. Regression tests ================ 1. `test/agent.test.ts` — Discovery Client suite — new test "returns agentAddress on findAgentByPeerId — keeps both discovery entrypoints in lockstep". Inserts a profile with an explicit `agentAddress`, asserts both `findAgents()` AND `findAgentByPeerId()` return it (lowercased per `canonicalAgentDidSubject`), and pins the legacy-profile-without-agentAddress fallback (returns `undefined`, doesn't throw). 2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" describe block — two integration tests that exercise the drain path **without** stubbing discovery: a. Boot agent → enqueue pending sender key via the no-peerId path → publish the recipient's profile (real `buildAgentProfile` output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer` with the real `DiscoveryClient` → assert messenger.sendReliable was called and the queue is empty. b. Same flow but the recipient profile omits `dkg:agentAddress` (legacy profile) → assert drain returns 0, queue stays in place, messenger is never called. Test (b) would have caught the original bug — every assertion in the all-stubbed path was satisfied even with the broken implementation. Verified ======== - `pnpm --filter @origintrail-official/dkg-agent build` — clean - All 18 SWM sender-key tests pass (+2 new + 16 existing) - All 7 Discovery Client tests pass (+1 new + 6 existing) Closes audit finding flagged by the #716 review-consolidation deep dive. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions

Codex review completed — no issues found.

…dler + canonical CG keying (OriginTrail#729 Bug 4) Adds `packages/publisher/test/v10-ack-v2-chunked.test.ts` — 8 cases exercising the LU-11 / OT-RFC-39 V2 chunked ACK path landed by PR OriginTrail#715/OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 4. Background ========== OriginTrail#716 review-consolidation audit flagged this as the closest analogue of the gap PR OriginTrail#735 closed for OriginTrail#720: helper-level primitives (chunked AEAD, ciphertext Merkle tree, proto wire format, on-chain commitment fields) are well tested, but the StorageACKHandler that wires them together at the ACK boundary has zero direct coverage. The existing `v10-ack-edge-cases.test.ts` thoroughly covers the V1 (inline staging quads / inline encrypted blob) paths but skips V2 entirely. Worse, PR OriginTrail#729 Bug 4 shipped a fix to the V2 ACK `loadChunk` graph canonicalisation (`normalizeContextGraphIdForChunkStore` returning null → wildcard `GRAPH ?g` fallback) without a regression test — any future refactor of that hook could silently re-introduce the "keccak the decimal string '42'" miss. Scope ===== Eight cases — happy paths first, then the regression and decline shapes: 1. **Happy path** — cleartext CG, canonicalising normalizer, chunks persisted under `ciphertextChunkStoreGraph(canonical(swmGraphId))`, ACK signed. Pins the production canonicalisation path. 2. **OriginTrail#729 Bug 4 regression** — V2 intent omits `swmGraphId`, normalizer returns `null` on every input → handler MUST widen to `GRAPH ?g` and still find the chunks. The pre-fix code keccak'd the decimal `cgId` ('42') and missed every persisted chunk; the test now locks the fix in. 3. **Legacy no-normalizer shim** — `normalizeContextGraphIdForChunkStore` absent → uses raw `swmGraphId` literally, preserving pre-fix behaviour for callers that haven't wired the hook. 4. **Multi-CG isolation (PR OriginTrail#715 / ciphertext-chunk-store.ts:28-44)** — two CGs publish identical V10 KCs (same batchId, since it's plaintext-derived) but persist chunks under different canonical named graphs. ACK for CG-A must only see CG-A's chunks; an intent claiming CG-B's root for the same batchId is correctly declined with `CIPHERTEXT_ROOT_MISMATCH`. This pins the Codex finding that drove the per-CG named-graph design. 5. **`MISSING_CIPHERTEXT_CHUNKS` decline** — claim count=4, persist only indexes 0 and 2 → declines after the 10-second retry window with the missing indexes in the message ("missing 2/4 ... 1,3"). 6. **`CIPHERTEXT_ROOT_MISMATCH` decline** — all chunks present but publisher lies about the root → declines fast. 7. **Curated-only gate** — `isCgCurated` returns false → declines with `SIGNER_NOT_REGISTERED` (the V2-curated-only gate at `storage-ack-handler.ts:296-310`). Closes the bypass concern called out in the inline comment above the V2 branch. 8. **stagingQuads forbidden on V2 wire** — V2 intent with non-empty `stagingQuads` is rejected (the chunked path never carries inline ciphertext). Test harness ============ Real `OxigraphStore` (mirrors `v10-ack-edge-cases.test.ts` pattern), real `encodePublishIntent` / `decodeStorageACK` proto round-trips, real `buildCiphertextChunksRoot` Merkle tree construction. The new `buildV2IntentBytes(opts)` helper derives `ciphertextChunksRoot`, `ciphertextChunkCount`, and `publicByteSize` from the chunks list unless explicitly overridden — production wire shape, not a fake. The `seedChunks(store, opts)` helper inserts chunk literals under the same `(graph, subject, predicate)` layout that `ingestSwmCiphertextChunkEnvelope` in `dkg-agent.ts` writes. Verification ============ - `pnpm --filter @origintrail-official/dkg-publisher build` — clean - `vitest run test/v10-ack-v2-chunked.test.ts` — 8/8 pass - `vitest run test/v10-ack-v2-chunked.test.ts test/v10-ack-edge-cases.test.ts` — 53/53 pass (8 new + 45 existing); no regressions to the V1 suite Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720 adapter coverage) and OriginTrail#737 (OriginTrail#700 drain bug). Co-authored-by: Cursor <cursoragent@cursor.com>

…onder + OriginTrail#729 Bug 5 canonical CG keying Adds `packages/agent/test/lu11-handle-get-ciphertext-chunk.test.ts` — 6 cases exercising the LU-11 / OT-RFC-39 `get-ciphertext-chunk` sync verb responder shipped by PR OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 5. Background ========== OriginTrail#716 review-consolidation audit flagged this as one of the critical adapter-wiring gaps in the same class as OriginTrail#720/OriginTrail#735: helper-level signing primitives (`mintSignedCiphertextChunkCatchupRequest`, `verifySignedCiphertextChunkCatchupRequest`, replay guard) are present, but the responder that wires them together with the 5-layer authority stack and the SPARQL chunk lookup is completely untested. Worse, PR OriginTrail#729 Bug 5 shipped a fix to the responder's chunk-graph lookup without a regression test: - Pre-fix: `gossipWireIdFor(req.contextGraphId)` was called unconditionally. For a numeric on-chain id like "42" this keccak'd the literal decimal string and produced a graph URI nothing was ever persisted under — every late-joining core's backfill request silently returned "chunk not found" even when the bytes were on disk under the curator's nameHash. - Fix: routes through `canonicalChunkStoreCgIdOrNull` which returns `null` for inputs it can't safely canonicalise; the responder then widens to a `GRAPH ?g` wildcard scan. Scope ===== Six cases — happy path, the Bug 5 regression, structured-deny shapes, and the replay-guard boundary: 1. **Happy path** — subscribed cleartext CG → keccak wire hash → scoped lookup → chunk returned. Pins the production canonicalisation path the responder relies on. 2. **OriginTrail#729 Bug 5 regression** — numeric `contextGraphId = "42"` with no local CG mapping → `canonicalChunkStoreCgIdOrNull` returns `null` → handler widens to `GRAPH ?g` → finds the chunk persisted under the curator's real nameHash graph. Locks the fix in so a future refactor can't silently re-introduce the decimal-string keccak miss. 3. **`chunk not found` decline** — authorised requester + canonicalisable CG but no chunk persisted → structured `denied: 'chunk not found'` with echoed `(contextGraphId, batchIdHex, chunkIndex)` for requester correlation. 4. **Unauthorised requester** — full 5-layer auth fall-through: none of `resolveOnChainParticipantAgents`, `resolveBeaconPinnedCuratorEoa`, `getContextGraphAgentGateAddresses`, `getContextGraphAllowedPeers`, `chain.getIdentityIdForAddress` admit the requester → structured `denied: 'requester EOA not in any of: ...'` (or `'no authority source available'` when every probe returns null/undefined on MockChainAdapter). 5. **Malformed request bytes** — decoder throws → handler maps to `denied: 'malformed request: ...'` with defensive defaults on the echo fields (empty `contextGraphId`, empty `batchIdHex`, `-1` chunkIndex) so an attacker-controlled garbage payload can't leak back through the response envelope. 6. **Replay-guard boundary** — same wire bytes twice (same nonce, same issuedAtMs, same signature) → first attempt succeeds, second is rejected with `denied: 'replayed chunk-catchup nonce'`. Pins `ciphertextChunkCatchupReplayGuard.recordIfFresh()` so the defensive boundary against signed-envelope replay holds. Test harness ============ Real `DKGAgent` instance booted on `MockChainAdapter`. The responder method is reached through an `(agent as unknown as ResponderInternals)` cast — same pattern as the existing `swm-sender-key-pending-by-agent.test.ts`. Real ciphertext-catchup proto round-trip (`mintSignedCiphertextChunkCatchupRequest` → `encodeCiphertextChunkCatchupRequest` → `handleGetCiphertextChunk` → `decodeCiphertextChunkCatchupResponse`), real EIP-191 personal-sign via `ethers.Wallet.signMessage`. The OT-RFC-39 fifth authority is exercised by monkey-patching `chain.getIdentityIdForAddress` on the agent's chain adapter — the other four authorities answer null/undefined naturally on MockChainAdapter. Verification ============ - `pnpm --filter @origintrail-official/dkg-agent build` — clean - `vitest run test/lu11-handle-get-ciphertext-chunk.test.ts` — 6/6 pass No production code changes. Test-only PR. Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive. Companion to PR OriginTrail#735 (OriginTrail#720), PR OriginTrail#737 (OriginTrail#700 drain bug), PR OriginTrail#738 (OriginTrail#729 Bug 4 V2 ACK loadChunk canonical keying). Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot reviewed May 27, 2026

View reviewed changes

This was referenced May 27, 2026

test(publisher): adapter-level coverage for V2 chunked StorageACK handler + canonical CG keying (#729 Bug 4) #738

Merged

test(agent): adapter coverage for handleGetCiphertextChunk responder + #729 Bug 5 canonical CG keying #739

Merged

branarakic merged commit 6b594aa into release/rc.12 May 27, 2026
3 checks passed

branarakic deleted the fix/rc12-700-pending-drain-bug branch May 27, 2026 12:11

branarakic mentioned this pull request May 27, 2026

release: v10.0.0-rc.12 integration branch #716

Merged

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent/discovery): populate `agentAddress` on findAgentByPeerId so #700 drain works in production#737

fix(agent/discovery): populate `agentAddress` on findAgentByPeerId so #700 drain works in production#737
branarakic merged 1 commit into
release/rc.12from
fix/rc12-700-pending-drain-bug

branarakic commented May 27, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

branarakic commented May 27, 2026

TL;DR

The bug

How CI missed this

The fix

Regression tests

1. `test/agent.test.ts` — Discovery Client suite

2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" `describe` block

Verification

Risk

Source of finding

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant