feat(retrieval): parallel IPNI + transport with retrievalTransportStatus counter #538
…s counters
In pgBoss mode, performAllRetrievals previously gated /ipfs transport on IPNI verification: any IPNI lag or failure marked the deal as retrievalStatus=failure.* without ever exercising the transport path. This conflated discoverability and transport failures on the dashboard, and when paired with the silent-timeout bug (#524) made the failure mode invisible in logs.
Run both stages concurrently via Promise.allSettled and compose the terminal retrievalStatus as AND(discoverabilityStatus, retrievalTransportStatus). A new retrievalTransportStatus counter records the /ipfs-only outcome so operators can attribute a retrievalStatus failure to IPNI vs transport from a single dashboard.
Tracks dealbot#524 option A.
Tests: 9/9 retrieval.service.spec pass, 354/354 backend pass. Typecheck + biome lint clean.
Pull request overview
This PR updates the backend Retrieval check to run IPNI verification and /ipfs transport in parallel, and introduces a new Prometheus counter (retrievalTransportStatus) to separately attribute failures to transport vs discoverability while keeping retrievalStatus as the composite (AND) outcome.
Changes:
- Parallelize IPNI + transport execution in `RetrievalService` using `Promise.allSettled`, then classify and compose terminal status from both sub-statuses.
- Add and register the `retrievalTransportStatus` Prometheus counter and expose a `recordTransportStatus` helper on `RetrievalCheckMetrics`.
- Update retrieval/check docs and add tests covering combined outcomes plus a concurrency-ordering test.
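To illustrate how the two sub-statuses could be read together when triaging a failed retrieval, here is a minimal sketch. The `attributeFailure` function and the `FailureSource` labels are hypothetical names introduced for illustration; they are not part of this PR, and only the `"success"` status value is taken from the text above.

```typescript
// Hypothetical triage helper: not part of the PR, shown only to illustrate
// how discoverabilityStatus and retrievalTransportStatus separate the causes.
type FailureSource = "none" | "discoverability" | "transport" | "both";

function attributeFailure(
  discoverabilityStatus: string,
  retrievalTransportStatus: string,
): FailureSource {
  const ipniOk = discoverabilityStatus === "success";
  const transportOk = retrievalTransportStatus === "success";
  if (ipniOk && transportOk) return "none";
  if (!ipniOk && !transportOk) return "both";
  // Exactly one side failed: name the side that is not "success".
  return ipniOk ? "transport" : "discoverability";
}
```

With both counters available, an IPNI timeout alongside a working `/ipfs` fetch would show up as a discoverability-only failure rather than an undifferentiated `retrievalStatus` failure.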
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/checks/retrievals.md | Documents parallel IPNI + transport flow and clarifies composite retrievalStatus semantics. |
| docs/checks/events-and-metrics.md | Defines the new retrievalTransportStatus metric and documents the composite rule for retrievalStatus. |
| apps/backend/src/retrieval/retrieval.service.ts | Runs IPNI + transport concurrently; records retrievalTransportStatus; composes terminal retrievalStatus. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Adds pgBoss-mode tests for combined IPNI/transport outcomes and a concurrency assertion. |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers the new retrievalTransportStatus counter provider. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Injects the new counter and adds recordTransportStatus. |
- Split retrievals_completed message by allSuccess so partial-failure runs are obvious from the log line (Copilot #538 review).
- Replace setTimeout-based concurrency test with deferred-promise barrier; both sides resolve their started signal before awaiting the shared release. Removes wall-clock dependency.
- Drop "pgBoss mode" from describe block; the conditional now reads as IPNI orchestration, not a separate run mode.
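The deferred-promise barrier mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not the test from the PR: `deferred`, `demoConcurrencyBarrier`, and the `ipni` and `transport` stand-ins are hypothetical names, and the real spec uses mocked services instead.

```typescript
// A promise plus its resolver, so the test controls when it settles.
function deferred() {
  let resolve!: () => void;
  const promise = new Promise<void>((res) => {
    resolve = () => res();
  });
  return { promise, resolve };
}

async function demoConcurrencyBarrier(): Promise<string[]> {
  const ipniStarted = deferred();
  const transportStarted = deferred();
  const release = deferred();
  const events: string[] = [];

  // Hypothetical stand-ins for the mocked IPNI and transport calls.
  const ipni = async () => {
    events.push("ipni:start");
    ipniStarted.resolve();
    await release.promise; // block until the test releases both sides
    events.push("ipni:end");
  };
  const transport = async () => {
    events.push("transport:start");
    transportStarted.resolve();
    await release.promise;
    events.push("transport:end");
  };

  const run = Promise.allSettled([ipni(), transport()]);

  // Settles only if both stages started while neither has finished.
  // A sequential implementation would never reach this point.
  await Promise.all([ipniStarted.promise, transportStarted.promise]);
  events.push("both-started");

  release.resolve();
  await run;
  return events;
}
```

Because nothing here depends on timers, the ordering assertion holds regardless of machine speed; a regression to sequential execution shows up as a hang that the test runner's own timeout reports.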
Two adjacent observability bugs left IPNI failures invisible whenever the outer pg-boss retrieval-job timeout fired:
1. IpniVerificationService.verify checked signal?.aborted before verificationSignal.aborted. On a race where both signals were asserted by the time the catch handler ran, signal.throwIfAborted() re-threw before the ipni_verification_timed_out log could fire. Inner check now runs first so the inner timeout log fires whenever the inner timeout signal aborted, regardless of outer state.
2. RetrievalService.verifyIpniForRetrieval's catch returned a silent failure.timedout when signal?.aborted, with no log. Added retrieval_ipni_verification_timed_out so the retrieval-side caller records the abort.
Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
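The check-order fix in item 1 can be sketched as below. This is a simplified illustration, not the code in `IpniVerificationService`: `handleVerifyError` and its `log` callback are hypothetical, while the signal names and the `ipni_verification_timed_out` event come from the commit message. `AbortSignal.throwIfAborted()` is the standard method available in current Node.js and browsers.

```typescript
// Hypothetical catch-handler body illustrating the corrected order.
function handleVerifyError(
  error: unknown,
  verificationSignal: AbortSignal, // inner timeout signal
  signal: AbortSignal | undefined, // outer job signal
  log: (event: string) => void,
): never {
  // Inner check first: the timeout is logged even if the outer
  // signal has also aborted by the time this handler runs.
  if (verificationSignal.aborted) {
    log("ipni_verification_timed_out");
  }
  // Only then propagate the outer abort, if there is one.
  signal?.throwIfAborted();
  throw error;
}
```

With the checks in the opposite order, `signal?.throwIfAborted()` exits the handler before the log call whenever both signals are aborted, which is the race described above.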
BigLep
left a comment
Requesting changes to get aligned on what counters are emitted and when.
silent-cipher
left a comment
Code portion looks good overall. Just one comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
apps/backend/src/retrieval/retrieval.service.spec.ts:506
- Blocker: Similar to the previous case, this test expects `retrievalStatus=success` when IPNI times out. If `retrievalStatus` is the composite retrieval outcome (as documented), an IPNI timeout should make `retrievalStatus` a failure (with a separate transport-only metric tracking `/ipfs` success).
it("keeps retrievalStatus=success when IPNI times out but transport succeeds (timeout recorded on discoverabilityStatus)", async () => {
  service = await createService();
  setupCommonMocks();
  mockIpniVerificationService.verify.mockResolvedValue({
    verified: 0,
    unverified: 1,
    total: 1,
    rootCIDVerified: false,
    durationMs: 10_000,
    failedCIDs: [{ cid: "x", reason: "timeout" }],
    verifiedAt: new Date().toISOString(),
  });
  mockRetrievalAddonsService.testAllRetrievalMethods.mockResolvedValue(successfulTransport);
  await service.performAllRetrievals(buildDealWithIpni());
  expect(mockRetrievalMetrics.recordStatus).toHaveBeenCalledWith(labels, "success");
  expect(mockDiscoverabilityMetrics.recordStatus).toHaveBeenCalledWith(labels, "failure.timedout");
});
BigLep
left a comment
One comment to review, but looks good to me from a docs perspective. I'll leave it to @silent-cipher to give the code approval.
What changed
In pgBoss mode, `performAllRetrievals` gated the `/ipfs` transport on IPNI verification: any IPNI lag bypassed the transport path entirely and pinned `retrievalStatus=failure.*`. Combined with the silent-timeout bug (#524), the failure mode was invisible in logs.
This PR runs IPNI and transport concurrently and composes the terminal status as `AND(discoverabilityStatus, retrievalTransportStatus)`. A new `retrievalTransportStatus` counter records the `/ipfs`-only outcome so a single dashboard panel can attribute a `retrievalStatus` failure to IPNI vs transport.
Implements option A from #524. Matches the parallel flow already documented in `docs/checks/retrievals.md`.
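As a rough sketch of the flow described above, assuming each stage is an async function that resolves to a status label: `runRetrievalCheck`, `classify`, `composeRetrievalStatus`, and the `"failure.other"` label are hypothetical names for illustration, and the choice of which failure label wins when both stages fail is an assumption, not something this description specifies.

```typescript
// Hypothetical status labels; "failure.other" is invented for this sketch.
type CheckStatus = "success" | "failure.timedout" | "failure.other";

// Map a settled promise to a status; a rejection counts as a failure.
function classify(result: PromiseSettledResult<CheckStatus>): CheckStatus {
  return result.status === "fulfilled" ? result.value : "failure.other";
}

// AND composition: success only when both sub-statuses are success.
// Assumption: when both fail, the discoverability label is reported.
function composeRetrievalStatus(
  discoverabilityStatus: CheckStatus,
  retrievalTransportStatus: CheckStatus,
): CheckStatus {
  if (discoverabilityStatus !== "success") return discoverabilityStatus;
  return retrievalTransportStatus;
}

async function runRetrievalCheck(
  verifyIpni: () => Promise<CheckStatus>,
  fetchOverIpfs: () => Promise<CheckStatus>,
) {
  // Both stages are started before either is awaited, so IPNI lag
  // cannot prevent the transport path from being exercised.
  const [ipniResult, transportResult] = await Promise.allSettled([
    verifyIpni(),
    fetchOverIpfs(),
  ]);
  const discoverabilityStatus = classify(ipniResult);
  const retrievalTransportStatus = classify(transportResult);
  return {
    discoverabilityStatus,
    retrievalTransportStatus,
    retrievalStatus: composeRetrievalStatus(
      discoverabilityStatus,
      retrievalTransportStatus,
    ),
  };
}
```

`Promise.allSettled` is used rather than `Promise.all` so that a rejection in one stage still leaves the other stage's outcome available to record.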
Files
How to verify
```
pnpm --filter dealbot-backend typecheck
pnpm --filter dealbot-backend lint
pnpm --filter dealbot-backend test
```
After deploy, watch:
Notes
`retrievalStatus` semantics unchanged (AND of discoverability and transport). No dashboard migration needed; the new `retrievalTransportStatus` chart is opt-in.
Observability follow-ups for the silent-timeout cases (the catch-order bug in `ipni-verification.service.ts` and the missing `retrieval_skipped_ipni_failure` log) ship as a separate PR (option C from #524) on top of this one.