Feat: add ipfs retrieval checks #52
Conversation
Pull Request Overview
This PR adds comprehensive IPNI (InterPlanetary Network Indexer) and IPFS metrics tracking to monitor deal indexing performance and IPFS retrieval capabilities. The changes introduce background monitoring of IPNI deal lifecycle stages, new database fields for tracking metrics, and UI components to display these metrics.
- Adds IPNI tracking with status progression (PENDING → INDEXED → ADVERTISED → RETRIEVED)
- Introduces IPFS-specific retrieval metrics separate from direct provider retrievals
- Updates the Synapse SDK configuration to use metadata-based approach instead of boolean flags
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/src/types/providers.ts | Adds IPNI and IPFS metrics fields to provider performance DTO |
| web/src/components/ProviderCard.tsx | Implements collapsible UI section for IPNI & IPFS metrics display |
| src/retrieval-addons/strategies/ipni.strategy.ts | Updates IPNI retrieval URL path from /piece/ to /ipfs/ |
| src/metrics/services/providers.service.ts | Adds IPNI and IPFS metrics aggregation logic for performance calculations |
| src/metrics/services/metrics-scheduler.service.ts | Adds IPNI metrics to daily aggregation queries |
| src/metrics/dto/provider-performance.dto.ts | Defines API schema for IPNI and IPFS metrics |
| src/deal/deal.service.ts | Refactors to trigger upload complete handlers and load provider earlier |
| src/deal-addons/types.ts | Updates types to use ServiceType enum and adds SynapseConfig interface |
| src/deal-addons/strategies/ipni.strategy.ts | Implements IPNI monitoring workflow with background tracking and metrics storage |
| src/deal-addons/strategies/{direct,cdn}.strategy.ts | Updates to use new SynapseConfig interface |
| src/deal-addons/interfaces/deal-addon.interface.ts | Adds optional onUploadComplete handler to addon interface |
| src/deal-addons/deal-addons.service.ts | Implements upload complete handler orchestration |
| src/deal-addons/deal-addons.module.ts | Adds Deal entity TypeORM import for repository injection |
| src/database/types.ts | Defines IpniStatus enum |
| src/database/migrations/*.ts | Database migrations for IPNI tracking fields and performance views |
| src/database/entities/*.ts | Adds IPNI and IPFS metric columns to entities and materialized views |
Comments suppressed due to low confidence (1)
src/deal-addons/strategies/ipni.strategy.ts:1
- Duplicate addition to `totalDataRetrievedBytes` on line 714 for IPFS retrievals. This value is already added on line 707 for all successful retrievals. This will cause double-counting of data retrieved via IPFS, inflating the total data retrieved metric.
```typescript
import { request } from "node:https";
```
I do not think it is necessary for IPNI itself. For example, provider ID 6: beck-calib in dealbot uses http, but has working IPNI.

I am working on reviewing the logic and output. I'll post comments as soon as I'm done.
Thanks for putting this together. Lots to pull together and very important. I'll respond separately on the screenshot itself. Happy to talk verbally on any of this if it's useful.
Concern 1: terminal state should be "verified" not "retrieved".
We're putting a lot of reliance on the PDPServer status of "retrieved".
Per the docs:
* Whether the piece has been retrieved
* This does not necessarily mean it was retrieved by a particular indexer,
* only that the PDP server witnessed a retrieval event. Care should be
* taken when interpreting this field.
I think it's fine to include measurements (counts/latencies) around this "retrieved" stat, but I don't think we should treat it as our terminal state for IPNI checking. IPNI checking is done when we've verified CIDs are showing up in filecoinpin.contact.
Concern 2: state names
I would like to make the state names more self-documenting:
- indexed → sp_indexed ?: the SP claimed they indexed the piece
- advertised → sp_announced_advertisement : the SP announced a new advertisement chain to the indexer.
- That said, I realize PDPServer calls this "advertised", so I can see it if we want to be consistent there. In that case, "sp_advertised".
- retrieved → sp_received_retrieve_request OR indexer_retrieved_from_sp
- I think I prefer the first from an accuracy regard. I don't think we actually know if the Indexer retrieved the CIDs from the SP.
- (add this) verified: meaning that we confirmed CIDs are actually being returned by the IPNI indexer (filecoinpin.contact).
- A key decision point is when we can consider a "deal" verified. Ideas:
- When the ipfsRootCid shows up properly in the IPNI response
- When more than a certain percentage of CIDs show up properly
- When all the CIDs have been verified
- I think we definitely need to check the ipfsRootCID at the minimum. Beyond that is bonus in my opinion.
(do whatever you want with the casing - I'm just trying to convey an idea)
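To make the decision point concrete, the root-CID-mandatory, coverage-threshold-as-bonus idea could be sketched like this (a sketch only — `decideVerified`, the field names, and the 0.8 threshold are all hypothetical, not from the PR):

```typescript
// Hypothetical verification decision: the root CID must be found in the
// indexer response; beyond that, a configurable fraction of block CIDs
// is treated as sufficient coverage. The 0.8 threshold is illustrative.
interface VerificationInput {
  rootCidVerified: boolean; // ipfsRootCid found in the IPNI response
  verifiedBlockCids: number; // block CIDs confirmed so far
  totalBlockCids: number;
}

function decideVerified(
  input: VerificationInput,
  coverageThreshold = 0.8,
): boolean {
  if (!input.rootCidVerified) return false; // root CID is the minimum bar
  if (input.totalBlockCids === 0) return true;
  return input.verifiedBlockCids / input.totalBlockCids >= coverageThreshold;
}
```

The "verify everything" option is then just `coverageThreshold = 1`, so the three ideas above collapse into one tunable parameter.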
Concern 3: transitioning between "sp_received_retrieve_request" and verifying with IPNI
Right now, as soon as we get to the "sp_received_retrieve_request" state we immediately start trying to verify IPNI results (see monitorAndVerifyIPNI). That makes sense, but we should intentionally allow there to be some time between these two states. Right now in verifyIPNIAdvertisement we just start iterating over each CID and marking them failed if we don't get a valid response. In this asynchronous system, there may be some time between when CIDs are retrieved from the SP by the IPNI indexer and before they start showing up in IPNI responses.
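One way to allow for that propagation delay is to retry each CID lookup with a delay before declaring it failed, rather than failing on the first empty response. A minimal sketch (assumed names and retry values — `checkCid` stands in for the real indexer lookup):

```typescript
// Sketch: poll the indexer for a CID with retries and a fixed delay,
// so CIDs are not marked failed before they have had time to propagate
// from the SP into IPNI responses. Retry/delay values are illustrative.
type CidChecker = (cid: string) => Promise<boolean>;

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function verifyCidWithRetries(
  cid: string,
  checkCid: CidChecker,
  maxAttempts = 5,
  delayMs = 30_000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await checkCid(cid)) return true; // found in the indexer
    if (attempt < maxAttempts) await sleep(delayMs); // wait before retrying
  }
  return false; // only now do we consider the CID failed
}
```

The same idea works at the deal level: an initial grace period after `sp_received_retrieve_request` before the first verification pass, then bounded retries up to `maxDurationMs`.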
```sql
ROUND(AVG(d.ingest_throughput_bps) FILTER (WHERE d.ingest_throughput_bps IS NOT NULL))::bigint as avg_ingest_throughput_bps,

-- Retrieval metrics (all time - DIRECT_SP only)
```

Suggested change:

```sql
-- Direct SP /piece Retrieval metrics (all time)
```
```sql
COUNT(DISTINCT r.id) FILTER (WHERE r.status = 'success' AND r.service_type = '${ServiceType.DIRECT_SP}') as successful_retrievals,
COUNT(DISTINCT r.id) FILTER (WHERE r.status = 'failed' AND r.service_type = '${ServiceType.DIRECT_SP}') as failed_retrievals,

-- Retrieval success rate (all time)
```

Suggested change:

```sql
-- Direct SP /piece Retrieval success rate (all time)
```
```sql
-- Retrieval throughput (all time)
ROUND(AVG(r.throughput_bps) FILTER (WHERE r.throughput_bps IS NOT NULL AND r.service_type = '${ServiceType.DIRECT_SP}'))::bigint as avg_throughput_bps,

-- IPFS retrieval metrics (all time - IPFS_PIN only)
```

To be really clear, I would call this "IPFS Mainnet retrieval metrics" here and below.
```sql
COUNT(DISTINCT r_ipfs.id) FILTER (WHERE r_ipfs.status = 'success' AND r_ipfs.service_type = '${ServiceType.IPFS_PIN}') as successful_ipfs_retrievals,
COUNT(DISTINCT r_ipfs.id) FILTER (WHERE r_ipfs.status = 'failed' AND r_ipfs.service_type = '${ServiceType.IPFS_PIN}') as failed_ipfs_retrievals,

-- IPFS retrieval success rate
```

Nit (not that you need to change it), but my understanding is that "success rates" are between 0 and 1 while "success percentages" are between 0 and 100. In this case we are calling a "success percentage" a "success rate".
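The distinction, spelled out with hypothetical helper names:

```typescript
// A "rate" is in [0, 1]; a "percentage" is in [0, 100].
// Helper names are illustrative, not from the PR.
function successRate(successes: number, total: number): number {
  return total === 0 ? 0 : successes / total;
}

function successPercentage(successes: number, total: number): number {
  return successRate(successes, total) * 100;
}
```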
```sql
  ELSE 0
END as ipfs_retrieval_success_rate,

-- IPFS retrieval performance
```

Average is usually not the best metric when it comes to aggregating latencies. p50 and p90 are ideal, but I understand that may not be easily possible with a db query.
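For what it's worth, PostgreSQL can compute these in SQL with `percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms)`. If it had to happen in application code instead, a minimal linear-interpolation percentile might look like this (a sketch; the function name is mine):

```typescript
// Minimal percentile with linear interpolation between closest ranks.
// p is in [0, 1], e.g. 0.5 for p50 and 0.9 for p90.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = p * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  const frac = rank - lo;
  return sorted[lo] + (sorted[hi] - sorted[lo]) * frac;
}
```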
It would be ideal not to have so much copy and paste between this and the all-time file. Ideally one shared file, where you pass a date range to each? (Your call on this.)
Nit: I think it would be easier to read this code from top to bottom if the functions were laid out in their natural call order (i.e., public function → helper function 1 → helper function 2). For example: queryIPNI → httpsGet → extractProviderAddrs.
```typescript
  };
}

async verifyIPNIAdvertisement(blockCIDs: string[], expectedMultiaddr: string, maxDurationMs: number) {
```

I think some comments would be useful here. Basically, this is the function that checks that the IPNI indexer (filecoinpin.contact) has the advertised CIDs.
```diff
 if (failedCIDs.length > 0) {
-  throw new Error(
-    `IPNI verification failed: ${successCount}/${blockCIDs.length} CIDs verified. ` +
-      `Failed CIDs: ${JSON.stringify(failedCIDs, null, 2)}`,
-  );
+  this.logger.warn(
+    `IPNI verification: ${successCount}/${blockCIDs.length} CIDs verified, ` + `${failedCIDs.length} failed`,
+  );
+  throw new Error(`IPNI verification failed: ${successCount}/${blockCIDs.length} CIDs verified`);
 }
```
We're not doing anything with failedCid.warning. Is that intentional?
I think these could be renamed to be more self-documenting:

```typescript
numVerifiedCids: successCount, // "verified" by itself seems like it could just be a boolean
numCidsToIdeallyVerify: blockCIDs.length, // otherwise, "total" makes me wonder "total what"? Do we really need this, though? Doesn't the caller already know it?
verificationDurationMs: durationMs,
```

They aren't the greatest names, but I think they remove some ambiguity.
**Item 1: terminology**

"Retrieval Success" is ambiguous to me. Whenever we talk about "retrieval", I think we need to qualify it. In my mind, we either have:
**Item 2: different success rates**

I'm confused why we have one success rate above shown as a bar and the 7-day success rate below shown just as a number. I assume the top "Success Rate" is for all time. Here is an idea for combining this: use just numbers rather than some numbers and some bars.
You don't have to go with this, but let's at least label clearly and be intentional in our organizing. More on that below.

**Mixing Success Rates and Latency Metrics**

The screenshot is odd to me right now with the flow of:

I think we should either group by metric type (e.g., success rates then latency metrics) or group by system (Direct SP interactions, IPNI, IPFS Mainnet). Right now you're mixing them. My recommendation would be to move the IPNI and IPFS latency metrics down to the Latency section where you cover other latencies.

**Define the metric**

The only stat that is self-describing is "success rate". I know how to combine multiple data points into a success rate. But how are you combining:

Are you averaging, summing, counting, taking p50, etc.? Let's label to make it clear. Again, feel free to pick through the feedback, and I'm happy to talk verbally if that's helpful. Thanks for your work here.
I realized I gave a bunch of feedback without any sense of priority. The minimum amount of things I would want to see before merging (rest could be backlogged):
@silent-cipher : we've also learned that it's important that the https://sp.domain/ipfs endpoint support HTTP/2. When doing the URL check for
I was using datacenter proxies for retrievals. I need to confirm whether they support HTTP/2 or not. If not, I'll bypass the proxy and make direct requests using an HTTP/2 client for
I am making comments without knowing the full architecture here. Maybe there are other good reasons to be using datacenter proxies. So maybe the thing to do is:
Thanks for clarifying! I’ve verified that the datacenter proxies I’m using do support HTTP/2, so that shouldn’t be an issue.
Yes, HTTP/2 support is a strict requirement. This will be getting added to the IPFS HTTP Gateway specs per ipfs/specs#519. Given that, it would be ideal if all of our HTTP retrieval checks used HTTP/2 only (didn't allow HTTP/1.1).
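One way such a check might be enforced: offer only `h2` during the TLS handshake via ALPN and fail the check if the server doesn't negotiate it. A sketch (hypothetical helper names; not from this PR):

```typescript
import tls from "node:tls";

// Decide pass/fail from the negotiated ALPN protocol. Offering only "h2"
// means a server that cannot speak HTTP/2 will negotiate nothing.
function supportsHttp2(alpnProtocol: unknown): boolean {
  return alpnProtocol === "h2";
}

// Sketch: connect with ALPN restricted to h2 and inspect the result.
function checkHttp2(host: string): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const socket = tls.connect(
      { host, port: 443, servername: host, ALPNProtocols: ["h2"] },
      () => {
        const ok = supportsHttp2(socket.alpnProtocol);
        socket.end();
        resolve(ok);
      },
    );
    socket.on("error", reject);
  });
}
```

Because the check happens at the TLS layer, it works even when the eventual retrieval goes through an HTTP/2-capable client library.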
fix: bump synapse-sdk to v0.35.3
Merging this for now after addressing the priority reviews. I’ll follow up on the remaining feedback in a later update.

@silent-cipher : what is the list of what is remaining? Can we move that into an issue?
Giving feedback on https://dealbot.fwss.io/ on 2025-11-14T19:08:41Z
**Item 1: IPNI Latency Metric Names**

I think these terms could be more self-documenting (similar to concern 2 in #52 (review)).
**Item 2: how is IPFS Mainnet retrieval working if IPNI is not?**

I'm confused how IPNI Indexer requests are failing but IPFS Mainnet retrievals are succeeding, given that IPFS Mainnet retrieval depends on IPNI indexing. Are we maybe not using unique CIDs for each run? Feel free to move this into an issue if that is preferred.


Failed IPFS retrieval details can be seen in the failed retrievals section.

Preview -
closes #51