Skip to content

fix(agent/discovery): populate agentAddress on findAgentByPeerId so #700 drain works in production#737

Merged
branarakic merged 1 commit into
release/rc.12from
fix/rc12-700-pending-drain-bug
May 27, 2026
Merged

fix(agent/discovery): populate agentAddress on findAgentByPeerId so #700 drain works in production#737
branarakic merged 1 commit into
release/rc.12from
fix/rc12-700-pending-drain-bug

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

TL;DR

This is a live production bug, not just a test gap. Found during the #716 review-consolidation deep-dive audit. PR #700's drainPendingSenderKeyForPeer is a permanent no-op on release/rc.12 because the discovery method it relies on never populates the field it gates on.

3-line fix in discovery.ts + 3 regression tests that would have caught the original bug.

The bug

PR #700 ("Agents Context Graph as distributed phonebook") shipped DKGAgent.drainPendingSenderKeyForPeer (packages/agent/src/dkg-agent.ts:6089-6138) as the recovery loop that replays queued SWM sender keys once a previously-unknown agent's peerId resolves on connection:open. This is the entire point of the pending-by-agent path landed by #700 — without it, sender keys queued for not-yet-resolved agents would never be delivered.

The drain gates on profile?.agentAddress:

```ts
const profile = await this.discovery.findAgentByPeerId(peerId);
if (profile?.agentAddress) {
agentAddresses = [profile.agentAddress.toLowerCase()];
}
...
if (agentAddresses.length === 0) return 0;
```

But DiscoveryClient.findAgentByPeerId (packages/agent/src/discovery.ts:147-218) never populates that field:

  • Its scalar SELECT clause omits the `?agentAddress` column (line 163).
  • Its return object (lines 209-218) doesn't set the field.

Compare to the sibling `findAgents()` on the same file — that one DOES select and return `agentAddress` (lines 71, 78, 92). The asymmetry between the two discovery entrypoints meant in production:

  1. Every `connection:open` → drain attempt early-returned `0`.
  2. `pendingSenderKeyByAgent` grew but never replayed.
  3. The recovery feature shipped as a permanent no-op.

How CI missed this

`swm-sender-key-pending-by-agent.test.ts` stubbed `discovery.findAgentByPeerId` to return `agentAddress` explicitly (lines 202-209 of the existing file), so the stubbed-discovery tests were green even though the real discovery layer never returns that field.

The fix

3 lines in `discovery.ts`:

  • Add `?agentAddress` to the scalar `SELECT`.
  • Add the matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause.
  • Populate the field on the returned object.

Now both `findAgents()` and `findAgentByPeerId()` resolve the same identity for the same peer.

Regression tests

1. `test/agent.test.ts` — Discovery Client suite

New test "returns agentAddress on findAgentByPeerId — keeps both discovery entrypoints in lockstep" that:

  • Inserts an agent profile with an explicit `agentAddress`.
  • Asserts both `findAgents()` AND `findAgentByPeerId()` return the (lowercased) address.
  • Pins the legacy-profile-without-agentAddress fallback path (`undefined`, not a throw).

2. `test/swm-sender-key-pending-by-agent.test.ts` — new "real discovery + agent registry CG" `describe` block

Two integration tests that exercise the drain path without stubbing discovery:

a. Boot agent → enqueue pending sender key via the no-peerId path → publish the recipient's profile (real `buildAgentProfile` output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer` with the real `DiscoveryClient` → assert messenger.sendReliable was called and the queue is empty.

b. Same flow but the recipient profile omits `dkg:agentAddress` (legacy profile) → assert drain returns 0, queue stays in place, messenger is never called.

Test (b) would have caught the original bug — every assertion in the all-stubbed path was satisfied even with the broken implementation.

Verification

```
pnpm --filter @origintrail-official/dkg-agent build # clean (tsc no errors)
pnpm --filter @origintrail-official/dkg-agent test # 18/18 SWM sender-key tests pass
# 7/7 Discovery Client tests pass
```

Risk

Low. The discovery query change is additive (one new OPTIONAL clause + one new return field). The drain code was already reading `profile?.agentAddress` — populating that field is what the code already expects.

Source of finding

#716 review-consolidation deep-dive audit (the same audit that surfaced #720's missing call-site coverage that became #735). This bug had no symptom in CI; production would have silently lost sender-key recovery for any agent that wasn't already connected at publish time.

Made with Cursor

…700 drain works in production

PR #700 ("Agents Context Graph as distributed phonebook") shipped
`DKGAgent.drainPendingSenderKeyForPeer` (`dkg-agent.ts:6089-6138`) as
the recovery loop that replays queued SWM sender keys once a previously-
unknown agent's peerId resolves on `connection:open`. It's the entire
point of the pending-by-agent path landed by #700.

The bug
=======

`drainPendingSenderKeyForPeer` gates on `profile?.agentAddress`:

  const profile = await this.discovery.findAgentByPeerId(peerId);
  if (profile?.agentAddress) {
    agentAddresses = [profile.agentAddress.toLowerCase()];
  }
  ...
  if (agentAddresses.length === 0) return 0;

But `DiscoveryClient.findAgentByPeerId` (`discovery.ts:147-218`)
**never populates that field** — its scalar `SELECT` omits the
`?agentAddress` column entirely, and the return object on lines
209-218 doesn't set it. (Compare to the sibling `findAgents()` on
the same file, which DOES select and return `agentAddress` at lines
71, 78, 92.) The asymmetry between the two discovery entrypoints
meant:

  - Every `connection:open` → drain attempt early-returned 0.
  - `pendingSenderKeyByAgent` grew but never replayed.
  - In production, sender keys queued for not-yet-resolved agents
    would never be delivered — the whole recovery feature shipped as
    a permanent no-op.

CI was green because `swm-sender-key-pending-by-agent.test.ts`
stubbed `discovery.findAgentByPeerId` to return `agentAddress`
explicitly (lines 202-209), masking the production gap.

The fix
=======

Three lines: add `?agentAddress` to the scalar `SELECT`, add the
matching `OPTIONAL { ... <${DKG}agentAddress> ... }` clause, and
populate the field on the returned object. Now both
`findAgents()` and `findAgentByPeerId()` resolve the same identity
for the same peer.

Regression tests
================

1. `test/agent.test.ts` — Discovery Client suite — new test
   "returns agentAddress on findAgentByPeerId — keeps both discovery
   entrypoints in lockstep". Inserts a profile with an explicit
   `agentAddress`, asserts both `findAgents()` AND `findAgentByPeerId()`
   return it (lowercased per `canonicalAgentDidSubject`), and pins
   the legacy-profile-without-agentAddress fallback (returns
   `undefined`, doesn't throw).

2. `test/swm-sender-key-pending-by-agent.test.ts` — new
   "real discovery + agent registry CG" describe block — two
   integration tests that exercise the drain path **without**
   stubbing discovery:

   a. Boot agent → enqueue pending sender key via the no-peerId
      path → publish the recipient's profile (real `buildAgentProfile`
      output → `agent.store.insert`) → call `drainPendingSenderKeyForPeer`
      with the real `DiscoveryClient` → assert messenger.sendReliable
      was called and the queue is empty.

   b. Same flow but the recipient profile omits `dkg:agentAddress`
      (legacy profile) → assert drain returns 0, queue stays in
      place, messenger is never called.

Test (b) would have caught the original bug — every assertion in
the all-stubbed path was satisfied even with the broken implementation.

Verified
========

- `pnpm --filter @origintrail-official/dkg-agent build` — clean
- All 18 SWM sender-key tests pass (+2 new + 16 existing)
- All 7 Discovery Client tests pass (+1 new + 6 existing)

Closes audit finding flagged by the #716 review-consolidation deep dive.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review completed — no issues found.

@branarakic branarakic merged commit 6b594aa into release/rc.12 May 27, 2026
3 checks passed
@branarakic branarakic deleted the fix/rc12-700-pending-drain-bug branch May 27, 2026 12:11
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
…dler + canonical CG keying (OriginTrail#729 Bug 4)

Adds `packages/publisher/test/v10-ack-v2-chunked.test.ts` — 8 cases
exercising the LU-11 / OT-RFC-39 V2 chunked ACK path landed by
PR OriginTrail#715/OriginTrail#717 and the canonical-CG-keying fix in PR OriginTrail#729 Bug 4.

Background
==========

OriginTrail#716 review-consolidation audit flagged this as the closest analogue
of the gap PR OriginTrail#735 closed for OriginTrail#720: helper-level primitives (chunked
AEAD, ciphertext Merkle tree, proto wire format, on-chain commitment
fields) are well tested, but the StorageACKHandler that wires them
together at the ACK boundary has zero direct coverage. The existing
`v10-ack-edge-cases.test.ts` thoroughly covers the V1 (inline
staging quads / inline encrypted blob) paths but skips V2 entirely.

Worse, PR OriginTrail#729 Bug 4 shipped a fix to the V2 ACK `loadChunk` graph
canonicalisation (`normalizeContextGraphIdForChunkStore` returning
null → wildcard `GRAPH ?g` fallback) without a regression test —
any future refactor of that hook could silently re-introduce the
"keccak the decimal string '42'" miss.

Scope
=====

Eight cases — happy paths first, then the regression and decline
shapes:

1. **Happy path** — cleartext CG, canonicalising normalizer, chunks
   persisted under `ciphertextChunkStoreGraph(canonical(swmGraphId))`,
   ACK signed. Pins the production canonicalisation path.

2. **OriginTrail#729 Bug 4 regression** — V2 intent omits `swmGraphId`,
   normalizer returns `null` on every input → handler MUST widen
   to `GRAPH ?g` and still find the chunks. The pre-fix code
   keccak'd the decimal `cgId` ('42') and missed every persisted
   chunk; the test now locks the fix in.

3. **Legacy no-normalizer shim** — `normalizeContextGraphIdForChunkStore`
   absent → uses raw `swmGraphId` literally, preserving pre-fix
   behaviour for callers that haven't wired the hook.

4. **Multi-CG isolation (PR OriginTrail#715 / ciphertext-chunk-store.ts:28-44)**
   — two CGs publish identical V10 KCs (same batchId, since it's
   plaintext-derived) but persist chunks under different canonical
   named graphs. ACK for CG-A must only see CG-A's chunks; an
   intent claiming CG-B's root for the same batchId is correctly
   declined with `CIPHERTEXT_ROOT_MISMATCH`. This pins the Codex
   finding that drove the per-CG named-graph design.

5. **`MISSING_CIPHERTEXT_CHUNKS` decline** — claim count=4, persist
   only indexes 0 and 2 → declines after the 10-second retry window
   with the missing indexes in the message ("missing 2/4 ... 1,3").

6. **`CIPHERTEXT_ROOT_MISMATCH` decline** — all chunks present but
   publisher lies about the root → declines fast.

7. **Curated-only gate** — `isCgCurated` returns false → declines
   with `SIGNER_NOT_REGISTERED` (the V2-curated-only gate at
   `storage-ack-handler.ts:296-310`). Closes the bypass concern
   called out in the inline comment above the V2 branch.

8. **stagingQuads forbidden on V2 wire** — V2 intent with non-empty
   `stagingQuads` is rejected (the chunked path never carries
   inline ciphertext).

Test harness
============

Real `OxigraphStore` (mirrors `v10-ack-edge-cases.test.ts` pattern),
real `encodePublishIntent` / `decodeStorageACK` proto round-trips,
real `buildCiphertextChunksRoot` Merkle tree construction. The new
`buildV2IntentBytes(opts)` helper derives `ciphertextChunksRoot`,
`ciphertextChunkCount`, and `publicByteSize` from the chunks list
unless explicitly overridden — production wire shape, not a fake.
The `seedChunks(store, opts)` helper inserts chunk literals under
the same `(graph, subject, predicate)` layout that
`ingestSwmCiphertextChunkEnvelope` in `dkg-agent.ts` writes.

Verification
============

- `pnpm --filter @origintrail-official/dkg-publisher build` — clean
- `vitest run test/v10-ack-v2-chunked.test.ts` — 8/8 pass
- `vitest run test/v10-ack-v2-chunked.test.ts test/v10-ack-edge-cases.test.ts`
  — 53/53 pass (8 new + 45 existing); no regressions to the V1 suite

Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive.
Companion to PR OriginTrail#735 (OriginTrail#720 adapter coverage) and OriginTrail#737 (OriginTrail#700 drain bug).

Co-authored-by: Cursor <cursoragent@cursor.com>
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
…onder + OriginTrail#729 Bug 5 canonical CG keying

Adds `packages/agent/test/lu11-handle-get-ciphertext-chunk.test.ts`
— 6 cases exercising the LU-11 / OT-RFC-39 `get-ciphertext-chunk`
sync verb responder shipped by PR OriginTrail#717 and the canonical-CG-keying
fix in PR OriginTrail#729 Bug 5.

Background
==========

OriginTrail#716 review-consolidation audit flagged this as one of the
critical adapter-wiring gaps in the same class as OriginTrail#720/OriginTrail#735:
helper-level signing primitives (`mintSignedCiphertextChunkCatchupRequest`,
`verifySignedCiphertextChunkCatchupRequest`, replay guard) are
present, but the responder that wires them together with the
5-layer authority stack and the SPARQL chunk lookup is completely
untested.

Worse, PR OriginTrail#729 Bug 5 shipped a fix to the responder's chunk-graph
lookup without a regression test:

  - Pre-fix: `gossipWireIdFor(req.contextGraphId)` was called
    unconditionally. For a numeric on-chain id like "42" this
    keccak'd the literal decimal string and produced a graph URI
    nothing was ever persisted under — every late-joining core's
    backfill request silently returned "chunk not found" even
    when the bytes were on disk under the curator's nameHash.

  - Fix: routes through `canonicalChunkStoreCgIdOrNull` which
    returns `null` for inputs it can't safely canonicalise; the
    responder then widens to a `GRAPH ?g` wildcard scan.

Scope
=====

Six cases — happy path, the Bug 5 regression, structured-deny shapes,
and the replay-guard boundary:

1. **Happy path** — subscribed cleartext CG → keccak wire hash →
   scoped lookup → chunk returned. Pins the production
   canonicalisation path the responder relies on.

2. **OriginTrail#729 Bug 5 regression** — numeric `contextGraphId = "42"` with
   no local CG mapping → `canonicalChunkStoreCgIdOrNull` returns
   `null` → handler widens to `GRAPH ?g` → finds the chunk
   persisted under the curator's real nameHash graph. Locks the
   fix in so a future refactor can't silently re-introduce the
   decimal-string keccak miss.

3. **`chunk not found` decline** — authorised requester +
   canonicalisable CG but no chunk persisted → structured
   `denied: 'chunk not found'` with echoed `(contextGraphId,
   batchIdHex, chunkIndex)` for requester correlation.

4. **Unauthorised requester** — full 5-layer auth fall-through:
   none of `resolveOnChainParticipantAgents`, `resolveBeaconPinnedCuratorEoa`,
   `getContextGraphAgentGateAddresses`, `getContextGraphAllowedPeers`,
   `chain.getIdentityIdForAddress` admit the requester →
   structured `denied: 'requester EOA not in any of: ...'`
   (or `'no authority source available'` when every probe
   returns null/undefined on MockChainAdapter).

5. **Malformed request bytes** — decoder throws → handler maps to
   `denied: 'malformed request: ...'` with defensive defaults on
   the echo fields (empty `contextGraphId`, empty `batchIdHex`,
   `-1` chunkIndex) so an attacker-controlled garbage payload
   can't leak back through the response envelope.

6. **Replay-guard boundary** — same wire bytes twice (same nonce,
   same issuedAtMs, same signature) → first attempt succeeds,
   second is rejected with `denied: 'replayed chunk-catchup nonce'`.
   Pins `ciphertextChunkCatchupReplayGuard.recordIfFresh()` so
   the defensive boundary against signed-envelope replay holds.

Test harness
============

Real `DKGAgent` instance booted on `MockChainAdapter`. The
responder method is reached through an `(agent as unknown as
ResponderInternals)` cast — same pattern as the existing
`swm-sender-key-pending-by-agent.test.ts`. Real ciphertext-catchup
proto round-trip (`mintSignedCiphertextChunkCatchupRequest` →
`encodeCiphertextChunkCatchupRequest` →
`handleGetCiphertextChunk` → `decodeCiphertextChunkCatchupResponse`),
real EIP-191 personal-sign via `ethers.Wallet.signMessage`. The
OT-RFC-39 fifth authority is exercised by monkey-patching
`chain.getIdentityIdForAddress` on the agent's chain adapter —
the other four authorities answer null/undefined naturally on
MockChainAdapter.

Verification
============

- `pnpm --filter @origintrail-official/dkg-agent build` — clean
- `vitest run test/lu11-handle-get-ciphertext-chunk.test.ts` —
  6/6 pass

No production code changes. Test-only PR.

Closes audit finding flagged by the OriginTrail#716 review-consolidation deep dive.
Companion to PR OriginTrail#735 (OriginTrail#720), PR OriginTrail#737 (OriginTrail#700 drain bug), PR OriginTrail#738
(OriginTrail#729 Bug 4 V2 ACK loadChunk canonical keying).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant