Feat: add ipfs retrieval checks #52
Conversation
Pull Request Overview
This PR adds comprehensive IPNI (InterPlanetary Network Indexer) and IPFS metrics tracking to monitor deal indexing performance and IPFS retrieval capabilities. The changes introduce background monitoring of IPNI deal lifecycle stages, new database fields for tracking metrics, and UI components to display these metrics.
- Adds IPNI tracking with status progression (PENDING → INDEXED → ADVERTISED → RETRIEVED)
- Introduces IPFS-specific retrieval metrics separate from direct provider retrievals
- Updates the Synapse SDK configuration to use metadata-based approach instead of boolean flags
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/src/types/providers.ts | Adds IPNI and IPFS metrics fields to provider performance DTO |
| web/src/components/ProviderCard.tsx | Implements collapsible UI section for IPNI & IPFS metrics display |
| src/retrieval-addons/strategies/ipni.strategy.ts | Updates IPNI retrieval URL path from /piece/ to /ipfs/ |
| src/metrics/services/providers.service.ts | Adds IPNI and IPFS metrics aggregation logic for performance calculations |
| src/metrics/services/metrics-scheduler.service.ts | Adds IPNI metrics to daily aggregation queries |
| src/metrics/dto/provider-performance.dto.ts | Defines API schema for IPNI and IPFS metrics |
| src/deal/deal.service.ts | Refactors to trigger upload complete handlers and load provider earlier |
| src/deal-addons/types.ts | Updates types to use ServiceType enum and adds SynapseConfig interface |
| src/deal-addons/strategies/ipni.strategy.ts | Implements IPNI monitoring workflow with background tracking and metrics storage |
| src/deal-addons/strategies/{direct,cdn}.strategy.ts | Updates to use new SynapseConfig interface |
| src/deal-addons/interfaces/deal-addon.interface.ts | Adds optional onUploadComplete handler to addon interface |
| src/deal-addons/deal-addons.service.ts | Implements upload complete handler orchestration |
| src/deal-addons/deal-addons.module.ts | Adds Deal entity TypeORM import for repository injection |
| src/database/types.ts | Defines IpniStatus enum |
| src/database/migrations/*.ts | Database migrations for IPNI tracking fields and performance views |
| src/database/entities/*.ts | Adds IPNI and IPFS metric columns to entities and materialized views |
Comments suppressed due to low confidence (1)
src/deal-addons/strategies/ipni.strategy.ts:1
- Duplicate addition to `totalDataRetrievedBytes` on line 714 for IPFS retrievals. This value is already added on line 707 for all successful retrievals. This will cause double-counting of data retrieved via IPFS, inflating the total data retrieved metric.
```typescript
import { request } from "node:https";
```
I do not think it is necessary for IPNI itself. For example, provider ID 6: beck-calib in dealbot uses http, but has working IPNI.

I am working on reviewing the logic and output. I'll post comments as soon as I'm done.
Thanks for putting this together. Lots to pull together and very important. I'll respond separately on the screenshot itself. Happy to talk verbally on any of this if it's useful.
Concern 1: terminal state should be "verified" not "retrieved".
We're putting a lot of reliance on the PDPServer status of "retrieved".
Per the docs:
* Whether the piece has been retrieved
* This does not necessarily mean it was retrieved by a particular indexer,
* only that the PDP server witnessed a retrieval event. Care should be
* taken when interpreting this field.
I think it's fine to include measurements (counts/latencies) around this "retrieved" stat, but I don't think we should treat it as our terminal state for IPNI checking. IPNI checking is done when we've verified CIDs are showing up in filecoinpin.contact.
Concern 2: state names
I would like to make the state names more self-documenting:
- indexed → sp_indexed ?: the SP claimed they indexed the piece
- advertised → sp_announced_advertisement : the SP announced a new advertisement chain to the indexer.
- That said, I realize PDPServer calls this "advertised", so I can see it if we want to be consistent there. In that case, "sp_advertised".
- retrieved → sp_received_retrieve_request OR indexer_retrieved_from_sp
- I think I prefer the first from an accuracy regard. I don't think we actually know if the Indexer retrieved the CIDs from the SP.
- (add this) verified: meaning that we confirmed CIDs are actually being returned by the IPNI indexer (filecoinpin.contact).
- A key decision point is when we can consider a "deal" verified. Ideas:
- When the ipfsRootCid shows up properly in the IPNI response
- When more than a certain percentage of CIDs show up properly
- When all the CIDs have been verified
- I think we definitely need to check the ipfsRootCID at the minimum. Beyond that is bonus in my opinion.
(do whatever you want with the casing - I'm just trying to convey an idea)
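To make the decision point concrete, the root-CID-mandatory, coverage-threshold-as-bonus idea could be sketched like this (a sketch only — `decideVerified`, the field names, and the 0.8 threshold are all hypothetical, not from the PR):

```typescript
// Hypothetical verification decision: the root CID must be found in the
// indexer response; beyond that, a configurable fraction of block CIDs
// is treated as sufficient coverage. The 0.8 threshold is illustrative.
interface VerificationInput {
  rootCidVerified: boolean; // ipfsRootCid found in the IPNI response
  verifiedBlockCids: number; // block CIDs confirmed so far
  totalBlockCids: number;
}

function decideVerified(
  input: VerificationInput,
  coverageThreshold = 0.8,
): boolean {
  if (!input.rootCidVerified) return false; // root CID is the minimum bar
  if (input.totalBlockCids === 0) return true;
  return input.verifiedBlockCids / input.totalBlockCids >= coverageThreshold;
}
```

The "verify everything" option is then just `coverageThreshold = 1`, so the three ideas above collapse into one tunable parameter.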
Concern 3: transitioning between "sp_received_retrieve_request" and verifying with IPNI
Right now, as soon as we get to the "sp_received_retrieve_request" state we immediately start trying to verify IPNI results (see monitorAndVerifyIPNI). That makes sense, but we should intentionally allow there to be some time between these two states. Right now in verifyIPNIAdvertisement we just start iterating over each CID and marking them failed if we don't get a valid response. In this asynchronous system, there may be some time between when CIDs are retrieved from the SP by the IPNI indexer and before they start showing up in IPNI responses.
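One way to allow for that propagation delay is to retry each CID lookup with a delay before declaring it failed, rather than failing on the first empty response. A minimal sketch (assumed names and retry values — `checkCid` stands in for the real indexer lookup):

```typescript
// Sketch: poll the indexer for a CID with retries and a fixed delay,
// so CIDs are not marked failed before they have had time to propagate
// from the SP into IPNI responses. Retry/delay values are illustrative.
type CidChecker = (cid: string) => Promise<boolean>;

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function verifyCidWithRetries(
  cid: string,
  checkCid: CidChecker,
  maxAttempts = 5,
  delayMs = 30_000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await checkCid(cid)) return true; // found in the indexer
    if (attempt < maxAttempts) await sleep(delayMs); // wait before retrying
  }
  return false; // only now do we consider the CID failed
}
```

The same idea works at the deal level: an initial grace period after `sp_received_retrieve_request` before the first verification pass, then bounded retries up to `maxDurationMs`.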
```sql
ROUND(AVG(d.ingest_throughput_bps) FILTER (WHERE d.ingest_throughput_bps IS NOT NULL))::bigint as avg_ingest_throughput_bps,

-- Retrieval metrics (all time - DIRECT_SP only)
```

Suggested change:

```sql
-- Direct SP /piece Retrieval metrics (all time)
```
```sql
COUNT(DISTINCT r.id) FILTER (WHERE r.status = 'success' AND r.service_type = '${ServiceType.DIRECT_SP}') as successful_retrievals,
COUNT(DISTINCT r.id) FILTER (WHERE r.status = 'failed' AND r.service_type = '${ServiceType.DIRECT_SP}') as failed_retrievals,

-- Retrieval success rate (all time)
```

Suggested change:

```sql
-- Direct SP /piece Retrieval success rate (all time)
```
```sql
-- Retrieval throughput (all time)
ROUND(AVG(r.throughput_bps) FILTER (WHERE r.throughput_bps IS NOT NULL AND r.service_type = '${ServiceType.DIRECT_SP}'))::bigint as avg_throughput_bps,

-- IPFS retrieval metrics (all time - IPFS_PIN only)
```

To be really clear, I would call this "IPFS Mainnet retrieval metrics" here and below.
```sql
COUNT(DISTINCT r_ipfs.id) FILTER (WHERE r_ipfs.status = 'success' AND r_ipfs.service_type = '${ServiceType.IPFS_PIN}') as successful_ipfs_retrievals,
COUNT(DISTINCT r_ipfs.id) FILTER (WHERE r_ipfs.status = 'failed' AND r_ipfs.service_type = '${ServiceType.IPFS_PIN}') as failed_ipfs_retrievals,

-- IPFS retrieval success rate
```

Nit (not that you need to change it), but my understanding is that "success rates" are between 0 and 1 while "success percentages" are between 0 and 100. In this case we are calling a "success percentage" a "success rate".
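The distinction, spelled out with hypothetical helper names:

```typescript
// A "rate" is in [0, 1]; a "percentage" is in [0, 100].
// Helper names are illustrative, not from the PR.
function successRate(successes: number, total: number): number {
  return total === 0 ? 0 : successes / total;
}

function successPercentage(successes: number, total: number): number {
  return successRate(successes, total) * 100;
}
```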
```sql
  ELSE 0
END as ipfs_retrieval_success_rate,

-- IPFS retrieval performance
```

Average is usually not the best metric when it comes to aggregating latencies. p50 and p90 are ideal, but I understand that may not be easily possible with a db query.
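For what it's worth, PostgreSQL can compute these in SQL with `percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms)`. If it had to happen in application code instead, a minimal linear-interpolation percentile might look like this (a sketch; the function name is mine):

```typescript
// Minimal percentile with linear interpolation between closest ranks.
// p is in [0, 1], e.g. 0.5 for p50 and 0.9 for p90.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = p * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  const frac = rank - lo;
  return sorted[lo] + (sorted[hi] - sorted[lo]) * frac;
}
```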
It would be ideal not to have so much copy and paste between this and the all-time file. Ideally one shared file, where you pass a date range to each? (Your call on this.)
Nit: I think it would be easier to read this code from top to bottom if the functions were laid out in their natural call order (i.e., public function → helper function 1 → helper function 2). For example: queryIPNI → httpsGet → extractProviderAddrs.
```typescript
  };
}

async verifyIPNIAdvertisement(blockCIDs: string[], expectedMultiaddr: string, maxDurationMs: number) {
```

I think some comments would be useful here. Basically, this is the function that checks that the IPNI indexer (filecoinpin.contact) has the advertised CIDs.
```diff
 if (failedCIDs.length > 0) {
-  throw new Error(
-    `IPNI verification failed: ${successCount}/${blockCIDs.length} CIDs verified. ` +
-      `Failed CIDs: ${JSON.stringify(failedCIDs, null, 2)}`,
-  );
+  this.logger.warn(
+    `IPNI verification: ${successCount}/${blockCIDs.length} CIDs verified, ` + `${failedCIDs.length} failed`,
+  );
+  throw new Error(`IPNI verification failed: ${successCount}/${blockCIDs.length} CIDs verified`);
 }
```
We're not doing anything with failedCid.warning. Is that intentional?
I think these could be renamed to be more self-documenting:

```typescript
numVerifiedCids: successCount, // "verified" by itself seems like it could just be a boolean
numCidsToIdeallyVerify: blockCIDs.length, // otherwise, "total" makes me wonder "total what"? Do we really need this, though? Doesn't the caller already know it?
verificationDurationMs: durationMs,
```

They aren't the greatest names, but I think they remove some ambiguity.
**Item 1: terminology**

"Retrieval Success" is ambiguous to me. Whenever we talk about "retrieval", I think we need to qualify it. In my mind, we either have:
**Item 2: different success rates**

I'm confused why we have one success rate above shown as a bar and the 7-day success rate below shown just as a number. I assume the top "Success Rate" is for all time. Here is an idea for combining this: use just numbers rather than some numbers and some bars.
You don't have to go with this, but let's at least label clearly and be intentional in our organizing. More on that below.

**Mixing Success Rates and Latency Metrics**

The screenshot is odd to me right now with the flow of:

I think we should either group by metric type (e.g., success rates then latency metrics) or group by system (Direct SP interactions, IPNI, IPFS Mainnet). Right now you're mixing them. My recommendation would be to move the IPNI and IPFS latency metrics down to the Latency section where you cover other latencies.

**Define the metric**

The only stat that is self-describing is "success rate". I know how to combine multiple data points into a success rate. But how are you combining:

Are you averaging, summing, counting, taking p50, etc.? Let's label to make it clear. Again, feel free to pick through the feedback, and I'm happy to talk verbally if that's helpful. Thanks for your work here.
I realized I gave a bunch of feedback without any sense of priority. The minimum amount of things I would want to see before merging (rest could be backlogged):
@silent-cipher : we've also learned that it's important that the https://sp.domain/ipfs endpoint support HTTP/2. When doing the URL check for
I was using datacenter proxies for retrievals. I need to confirm whether they support HTTP/2 or not. If not, I'll bypass the proxy and make direct requests using an HTTP/2 client for
I am making comments without knowing the full architecture here. Maybe there are other good reasons to be using datacenter proxies. So maybe the thing to do is:
Thanks for clarifying! I’ve verified that the datacenter proxies I’m using do support HTTP/2, so that shouldn’t be an issue.
Yes, HTTP/2 support is a strict requirement. This will be getting added to the IPFS HTTP Gateway specs per ipfs/specs#519. Given that, it would be ideal if all of our HTTP retrieval checks used HTTP/2 only (didn't allow HTTP/1.1).
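One way such a check might be enforced: offer only `h2` during the TLS handshake via ALPN and fail the check if the server doesn't negotiate it. A sketch (hypothetical helper names; not from this PR):

```typescript
import tls from "node:tls";

// Decide pass/fail from the negotiated ALPN protocol. Offering only "h2"
// means a server that cannot speak HTTP/2 will negotiate nothing.
function supportsHttp2(alpnProtocol: unknown): boolean {
  return alpnProtocol === "h2";
}

// Sketch: connect with ALPN restricted to h2 and inspect the result.
function checkHttp2(host: string): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const socket = tls.connect(
      { host, port: 443, servername: host, ALPNProtocols: ["h2"] },
      () => {
        const ok = supportsHttp2(socket.alpnProtocol);
        socket.end();
        resolve(ok);
      },
    );
    socket.on("error", reject);
  });
}
```

Because the check happens at the TLS layer, it works even when the eventual retrieval goes through an HTTP/2-capable client library.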
fix: bump synapse-sdk to v0.35.3
Merging this for now after addressing the priority reviews. I’ll follow up on the remaining feedback in a later update.

@silent-cipher : what is the list of what is remaining? Can we move that into an issue?
Giving feedback on https://dealbot.fwss.io/ on 2025-11-14T19:08:41Z
**Item 1: IPNI Latency Metric Names**

I think these terms could be more self-documenting (similar to concern 2 in #52 (review)).
**Item 2: how is IPFS Mainnet retrieval working if IPNI is not?**

I'm confused how IPNI Indexer requests are failing but IPFS Mainnet retrievals are succeeding, given that IPFS Mainnet retrieval depends on IPNI indexing. Are we maybe not using unique CIDs for each run? Feel free to move this into an issue if that is preferred.


Failed IPFS retrieval details can be seen in the failed retrievals section.

Preview -
closes #51