feat(retrieval): parallel IPNI + transport with retrievalTransportStatus counter#538
feat(retrieval): parallel IPNI + transport with retrievalTransportStatus counter#538SgtPooki wants to merge 2 commits into
Conversation
…s counters In pgBoss mode, performAllRetrievals previously gated /ipfs transport on IPNI verification: any IPNI lag or failure marked the deal as retrievalStatus=failure.* without ever exercising the transport path. This conflated discoverability and transport failures on the dashboard, and when paired with the silent-timeout bug (#524) made the failure mode invisible in logs. Run both stages concurrently via Promise.allSettled and compose the terminal retrievalStatus as AND(discoverabilityStatus, retrievalTransportStatus). A new retrievalTransportStatus counter records the /ipfs-only outcome so operators can attribute a retrievalStatus failure to IPNI vs transport from a single dashboard. Tracks dealbot#524 option A. Tests: 9/9 retrieval.service.spec pass, 354/354 backend pass. Typecheck + biome lint clean.
There was a problem hiding this comment.
Pull request overview
This PR updates the backend Retrieval check to run IPNI verification and /ipfs transport in parallel, and introduces a new Prometheus counter (retrievalTransportStatus) to separately attribute failures to transport vs discoverability while keeping retrievalStatus as the composite (AND) outcome.
Changes:
- Parallelize IPNI + transport execution in
RetrievalServiceusingPromise.allSettled, then classify and compose terminal status from both sub-statuses. - Add and register the
retrievalTransportStatusPrometheus counter and expose arecordTransportStatushelper onRetrievalCheckMetrics. - Update retrieval/check docs and add tests covering combined outcomes plus a concurrency-ordering test.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/checks/retrievals.md | Documents parallel IPNI + transport flow and clarifies composite retrievalStatus semantics. |
| docs/checks/events-and-metrics.md | Defines the new retrievalTransportStatus metric and documents the composite rule for retrievalStatus. |
| apps/backend/src/retrieval/retrieval.service.ts | Runs IPNI + transport concurrently; records retrievalTransportStatus; composes terminal retrievalStatus. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Adds pgBoss-mode tests for combined IPNI/transport outcomes and a concurrency assertion. |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers the new retrievalTransportStatus counter provider. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Injects the new counter and adds recordTransportStatus. |
| const successCount = retrievals.filter((r) => r.status === RetrievalStatus.SUCCESS).length; | ||
| this.logger.log({ | ||
| ...retrievalLogContext, | ||
| event: "retrievals_completed", | ||
| message: "Retrievals successful", | ||
| successCount, | ||
| totalCount: retrievals.length, | ||
| }); |
| mockIpniVerificationService.verify.mockImplementation(async () => { | ||
| order.push("ipni-start"); | ||
| await new Promise((r) => setTimeout(r, 20)); | ||
| order.push("ipni-end"); | ||
| return { | ||
| verified: 1, | ||
| unverified: 0, | ||
| total: 1, | ||
| rootCIDVerified: true, | ||
| durationMs: 20, | ||
| failedCIDs: [], | ||
| verifiedAt: new Date().toISOString(), | ||
| }; | ||
| }); | ||
| mockRetrievalAddonsService.testAllRetrievalMethods.mockImplementation(async () => { | ||
| order.push("transport-start"); | ||
| await new Promise((r) => setTimeout(r, 10)); | ||
| order.push("transport-end"); | ||
| return successfulTransport; | ||
| }); |
- Split retrievals_completed message by allSuccess so partial-failure runs are obvious from the log line (Copilot #538 review). - Replace setTimeout-based concurrency test with deferred-promise barrier; both sides resolve their started signal before awaiting the shared release. Removes wall-clock dependency. - Drop "pgBoss mode" from describe block; the conditional now reads as IPNI orchestration, not a separate run mode.
Two adjacent observability bugs left IPNI failures invisible whenever the outer pg-boss retrieval-job timeout fired: 1. IpniVerificationService.verify checked signal?.aborted before verificationSignal.aborted. On a race where both signals were asserted by the time the catch handler ran, signal.throwIfAborted() re-threw before the ipni_verification_timed_out log could fire. Inner check now runs first so the inner timeout log fires whenever the inner timeout signal aborted, regardless of outer state. 2. RetrievalService.verifyIpniForRetrieval's catch returned a silent failure.timedout when signal?.aborted, with no log. Added retrieval_ipni_verification_timed_out so the retrieval-side caller records the abort. Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
BigLep
left a comment
There was a problem hiding this comment.
Requesting changes to get aligned on what counters are emitted and when.
There was a problem hiding this comment.
Per https://github.com/FilOzone/dealbot/blob/main/docs/checks/data-storage.md#deal-status-progression and https://github.com/FilOzone/dealbot/blob/main/docs/checks/events-and-metrics.md#retrievalStatus, this retrievalStatus counter is supposed to increase when the ipfsRetrievalIntegrityChecked event fires. It wasn't intended to be when ipfsRetrievalIntegrityChecked and ipniVerificationComplete.
Assuming we implement it that way, do we even need a counter for when ipfsRetrievalIntegrityChecked and ipniVerificationComplete occur? Back when specing this out, I wasn't seeing what that got us.
My understanding of the situation:
- Today:
retrievalStatusis successful whenipfsRetrievalIntegrityCheckedandipniVerificationCompleteoccur with success, and those happen sequentially. - Your proposal:
retrievalStatusis successful whenipfsRetrievalIntegrityCheckedandipniVerificationCompleteoccur with success, and those happen in parallel.- Add
retrievalTransportStatusfor whenipfsRetrievalIntegrityCheckedfires.
- My proposal (as originally speced, or at least what was meant but I can accept it wasn't clear)
retrievalStatuswhenipfsRetrievalIntegrityCheckedfires.- Discoverability and Retrieval portions happen in parallel.
- No additional counters.
| | <a id="retrievalStatus"></a>`retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). Composite: `success` requires both `discoverabilityStatus=success` and `retrievalTransportStatus=success`. | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) | | ||
| | <a id="retrievalTransportStatus"></a>`retrievalTransportStatus` | Retrieval | When the `/ipfs` transport stage completes (success, all-blocks fetched and validated; or failure on HTTP, content, or abort). Scoped to the transport stage; does **not** include IPNI discoverability. Pair with [`discoverabilityStatus`](#discoverabilityStatus) to attribute a `retrievalStatus` failure to IPNI vs transport. | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) | |
There was a problem hiding this comment.
Per file comment, I'm questioning whether we need to change anything here. I think we can just implement retrievalStatus as originally specced aand just have one counter here.
| ### Retrieval Checks | ||
|
|
||
| For each selected piece, dealbot performs the following in parallel: | ||
| For each selected piece, dealbot performs the following in parallel. Each stage emits its own sub-status counter ([`discoverabilityStatus`](./events-and-metrics.md#discoverabilityStatus) for IPNI, [`retrievalTransportStatus`](./events-and-metrics.md#retrievalTransportStatus) for `/ipfs`). The composite [`retrievalStatus`](./events-and-metrics.md#retrievalStatus) records `success` only when both sub-statuses succeed. |
There was a problem hiding this comment.
I'm not sure more documentation is needed here, but if we are wanting to add extra documentation around metrics, lets put it in https://github.com/FilOzone/dealbot/blob/main/docs/checks/retrievals.md#metrics-recorded ?
silent-cipher
left a comment
There was a problem hiding this comment.
code portion looks good overall. Just one comment.
| ) | ||
| : Promise.resolve({ ok: true as const }); | ||
|
|
||
| const [transportSettled, ipniSettled] = await Promise.allSettled([transportPromise, ipniPromise]); |
There was a problem hiding this comment.
This may extend runtime to the slower of the two timeouts. For example, we could end up waiting for the IPNI timeout even though the transport has already implied failure.timedout since there’s no early exit.
What changed
In pgBoss mode, `performAllRetrievals` gated the `/ipfs` transport on IPNI verification: any IPNI lag bypassed the transport path entirely and pinned `retrievalStatus=failure.*`. Combined with the silent-timeout bug (#524), the failure mode was invisible in logs.
This PR runs IPNI and transport concurrently and composes the terminal status as `AND(discoverabilityStatus, retrievalTransportStatus)`. A new `retrievalTransportStatus` counter records the `/ipfs`-only outcome so a single dashboard panel can attribute a `retrievalStatus` failure to IPNI vs transport.
Implements option A from #524. Matches the parallel flow already documented in `docs/checks/retrievals.md`.
Files
How to verify
```
pnpm --filter dealbot-backend typecheck
pnpm --filter dealbot-backend lint
pnpm --filter dealbot-backend test
```
After deploy, watch:
Notes
`retrievalStatus` semantics unchanged (AND of discoverability and transport). No dashboard migration needed; the new `retrievalTransportStatus` chart is opt-in.
Observability follow-ups for the silent-timeout cases (the catch-order bug in `ipni-verification.service.ts` and the missing `retrieval_skipped_ipni_failure` log) ship as a separate PR (option C from #524) on top of this one.