
Feat: add ipfs retrieval checks#52

Merged
silent-cipher merged 14 commits into main from fix/ipfs-indexing
Nov 12, 2025
Conversation

@silent-cipher
Collaborator

Failed IPFS retrieval details can be seen in the Failed Retrievals section.
Preview -
Screenshot 2025-11-06 at 1 31 48 PM

closes #51

@FilOzzy FilOzzy added this to FOC Nov 6, 2025
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC Nov 6, 2025
@silent-cipher silent-cipher requested a review from Copilot November 6, 2025 08:05
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds comprehensive IPNI (InterPlanetary Network Indexer) and IPFS metrics tracking to monitor deal indexing performance and IPFS retrieval capabilities. The changes introduce background monitoring of IPNI deal lifecycle stages, new database fields for tracking metrics, and UI components to display these metrics.

  • Adds IPNI tracking with status progression (PENDING → INDEXED → ADVERTISED → RETRIEVED)
  • Introduces IPFS-specific retrieval metrics separate from direct provider retrievals
  • Updates the Synapse SDK configuration to use metadata-based approach instead of boolean flags
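The status progression above can be sketched as a small enum with a linear transition check. This is illustrative only; the PR's actual `IpniStatus` definition in `src/database/types.ts` may use different values.

```typescript
// Hypothetical sketch of the IPNI status progression described above;
// the real enum in src/database/types.ts may differ.
export enum IpniStatus {
  Pending = "PENDING",
  Indexed = "INDEXED",
  Advertised = "ADVERTISED",
  Retrieved = "RETRIEVED",
}

// Linear progression: each status may only advance to the next one.
const ORDER: IpniStatus[] = [
  IpniStatus.Pending,
  IpniStatus.Indexed,
  IpniStatus.Advertised,
  IpniStatus.Retrieved,
];

export function canAdvance(from: IpniStatus, to: IpniStatus): boolean {
  return ORDER.indexOf(to) === ORDER.indexOf(from) + 1;
}
```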

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| web/src/types/providers.ts | Adds IPNI and IPFS metrics fields to the provider performance DTO |
| web/src/components/ProviderCard.tsx | Implements a collapsible UI section for IPNI & IPFS metrics display |
| src/retrieval-addons/strategies/ipni.strategy.ts | Updates the IPNI retrieval URL path from /piece/ to /ipfs/ |
| src/metrics/services/providers.service.ts | Adds IPNI and IPFS metrics aggregation logic for performance calculations |
| src/metrics/services/metrics-scheduler.service.ts | Adds IPNI metrics to daily aggregation queries |
| src/metrics/dto/provider-performance.dto.ts | Defines the API schema for IPNI and IPFS metrics |
| src/deal/deal.service.ts | Refactors to trigger upload-complete handlers and load the provider earlier |
| src/deal-addons/types.ts | Updates types to use the ServiceType enum and adds a SynapseConfig interface |
| src/deal-addons/strategies/ipni.strategy.ts | Implements the IPNI monitoring workflow with background tracking and metrics storage |
| src/deal-addons/strategies/{direct,cdn}.strategy.ts | Updates to use the new SynapseConfig interface |
| src/deal-addons/interfaces/deal-addon.interface.ts | Adds an optional onUploadComplete handler to the addon interface |
| src/deal-addons/deal-addons.service.ts | Implements upload-complete handler orchestration |
| src/deal-addons/deal-addons.module.ts | Adds a Deal entity TypeORM import for repository injection |
| src/database/types.ts | Defines the IpniStatus enum |
| src/database/migrations/*.ts | Database migrations for IPNI tracking fields and performance views |
| src/database/entities/*.ts | Adds IPNI and IPFS metric columns to entities and materialized views |
Comments suppressed due to low confidence (1)

src/deal-addons/strategies/ipni.strategy.ts:1

  • Duplicate addition to totalDataRetrievedBytes on line 714 for IPFS retrievals. This value is already added on line 707 for all successful retrievals, so data retrieved via IPFS will be double-counted, inflating the total data retrieved metric.

```typescript
import { request } from "node:https";
```

@silent-cipher
Collaborator Author

@rjan90, can you please confirm whether this check is necessary for IPNI? It will mark the IPNI operation as failed if the provider's serviceUrl uses http instead of https.

@rjan90
Contributor

rjan90 commented Nov 6, 2025

> @rjan90, can you please confirm whether this check is necessary for IPNI? It will mark the IPNI operation as failed if the provider's serviceUrl uses http instead of https.

I do not think it is necessary for IPNI itself. For example, provider ID 6: beck-calib in dealbot uses http, but has working IPNI

@BigLep
Contributor

BigLep commented Nov 6, 2025

I am working on reviewing the logic and output. I'll post comments as soon as I'm done.

Contributor

@BigLep BigLep left a comment


Thanks for putting this together. Lots to pull together and very important. I'll respond separately on the screenshot itself. Happy to talk verbally on any of this if it's useful.

Concern 1: terminal state should be "verified" not "retrieved".

We're putting a lot of reliance on the PDPServer status of "retrieved".

Per the docs:

   * Whether the piece has been retrieved
   * This does not necessarily mean it was retrieved by a particular indexer,
   * only that the PDP server witnessed a retrieval event. Care should be
   * taken when interpreting this field.

I think it's fine to include measurements (counts/latencies) around this "retrieved" stat, but I don't think we should treat it as our terminal state for IPNI checking. IPNI checking is done when we've verified CIDs are showing up in filecoinpin.contact.

Concern 2: state names

I would like to make the state names more self-documenting:

  • indexed → sp_indexed ?: the SP claimed they indexed the piece
  • advertised → sp_announced_advertisement : the SP announced a new advertisement chain to the indexer.
    • That said, I realize PDPServer calls this "advertised", so I can see wanting to be consistent there. In that case, "sp_advertised".
  • retrieved → sp_received_retrieve_request OR indexer_retrieved_from_sp
    • I think I prefer the first from an accuracy regard. I don't think we actually know if the Indexer retrieved the CIDs from the SP.
  • (add this) verified: meaning that we confirmed CIDs are actually being returned by the IPNI indexer (filecoinpin.contact).
    • A key decision point is when we can consider a "deal" verified. Ideas:
      1. When the ipfsRootCid shows up properly in the IPNI response
      2. When more than a certain percentage of CIDs show up properly
      3. When all the CIDs have been verified
      • I think we definitely need to check the ipfsRootCID at the minimum. Beyond that is bonus in my opinion.

(do whatever you want with the casing - I'm just trying to convey an idea)
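The "verified" decision points above could be encoded as a small policy check. A sketch; the names and the block-share threshold are hypothetical, not anything in this PR.

```typescript
// Hypothetical policy for marking a deal "verified", following the ideas
// above: the ipfsRootCid must resolve in the indexer (hard requirement),
// and beyond that a configurable share of block CIDs is a bonus signal.
export interface VerificationResult {
  rootCidFound: boolean;
  verifiedBlockCids: number;
  totalBlockCids: number;
}

export function isDealVerified(r: VerificationResult, minBlockShare = 0.5): boolean {
  if (!r.rootCidFound) return false; // root CID check is the minimum bar
  if (r.totalBlockCids === 0) return true;
  return r.verifiedBlockCids / r.totalBlockCids >= minBlockShare;
}
```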

Concern 3: transitioning between "sp_received_retrieve_request" and verifying with IPNI

Right now, as soon as we get to the "sp_received_retrieve_request" state we immediately start trying to verify IPNI results (see monitorAndVerifyIPNI). That makes sense, but we should intentionally allow there to be some time between these two states. Right now in verifyIPNIAdvertisement we just start iterating over each CID and marking them failed if we don't get a valid response. In this asynchronous system, there may be some time between when CIDs are retrieved from the SP by the IPNI indexer and before they start showing up in IPNI responses.
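The grace period Concern 3 asks for could be implemented by polling each CID with backoff rather than failing on the first empty indexer response. A sketch; `verifyCid` is a hypothetical stand-in for the real per-CID indexer lookup.

```typescript
// Retry a per-CID indexer lookup with exponential backoff, so CIDs that
// the indexer has retrieved but not yet made queryable are not
// immediately marked failed.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

export async function verifyWithBackoff(
  verifyCid: () => Promise<boolean>, // hypothetical per-CID lookup
  attempts = 5,
  initialDelayMs = 1000,
): Promise<boolean> {
  let delay = initialDelayMs;
  for (let i = 0; i < attempts; i++) {
    if (await verifyCid()) return true;
    if (i < attempts - 1) await sleep(delay);
    delay *= 2; // exponential backoff between polls
  }
  return false; // only fail after the full window has elapsed
}
```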

```sql
ROUND(AVG(d.ingest_throughput_bps) FILTER (WHERE d.ingest_throughput_bps IS NOT NULL))::bigint as avg_ingest_throughput_bps,

-- Retrieval metrics (all time)
-- Retrieval metrics (all time - DIRECT_SP only)
```


Suggested change:

```diff
- -- Retrieval metrics (all time - DIRECT_SP only)
+ -- Direct SP /piece Retrieval metrics (all time)
```

```sql
COUNT(DISTINCT r.id) FILTER (WHERE r.status = 'success' AND r.service_type = '${ServiceType.DIRECT_SP}') as successful_retrievals,
COUNT(DISTINCT r.id) FILTER (WHERE r.status = 'failed' AND r.service_type = '${ServiceType.DIRECT_SP}') as failed_retrievals,

-- Retrieval success rate (all time)
```


Suggested change:

```diff
- -- Retrieval success rate (all time)
+ -- Direct SP /piece Retrieval success rate (all time)
```

```sql
-- Retrieval throughput (all time)
ROUND(AVG(r.throughput_bps) FILTER (WHERE r.throughput_bps IS NOT NULL AND r.service_type = '${ServiceType.DIRECT_SP}'))::bigint as avg_throughput_bps,
```

```sql
-- IPFS retrieval metrics (all time - IPFS_PIN only)
```


To be really clear I would call this "IPFS Mainnet retrieval metrics" here and below

```sql
COUNT(DISTINCT r_ipfs.id) FILTER (WHERE r_ipfs.status = 'success' AND r_ipfs.service_type = '${ServiceType.IPFS_PIN}') as successful_ipfs_retrievals,
COUNT(DISTINCT r_ipfs.id) FILTER (WHERE r_ipfs.status = 'failed' AND r_ipfs.service_type = '${ServiceType.IPFS_PIN}') as failed_ipfs_retrievals,

-- IPFS retrieval success rate
```


Nit (no need to change): my understanding is that "success rates" are between 0 and 1 while "success percentages" are between 0 and 100. Here we are calling a "success percentage" a "success rate".

```sql
ELSE 0
END as ipfs_retrieval_success_rate,

-- IPFS retrieval performance
```


Average is usually not the best metric when it comes to aggregating latencies. p50 and p90 are ideal, but I understand that may not be easily possible with a db query.
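If raw latency samples are available in the application layer, p50/p90 can be computed there even when the DB query cannot easily provide them. A sketch using nearest-rank percentiles; PostgreSQL's `percentile_cont`/`percentile_disc` ordered-set aggregates would be the in-database route.

```typescript
// Nearest-rank percentile over raw latency samples: an alternative to
// AVG(...) aggregation when p50/p90 are wanted.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b); // ascending copy
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(0, rank - 1)];
}
```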



It would be ideal not to have so much copy-paste between this and the all-time file; ideally one shared file, with a date range passed to each query? (Your call on this.)
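One way to realize the shared-file idea: a single query builder that takes an optional date range. A sketch only; the table and column names here are illustrative, not the PR's actual schema.

```typescript
// Build one retrieval-metrics query, optionally windowed, instead of
// maintaining near-duplicate all-time and 7-day files. Names illustrative.
export interface DateRange {
  from: Date;
  to: Date;
}

export function buildRetrievalMetricsQuery(range?: DateRange): { sql: string; params: Date[] } {
  // When a range is given, add a parameterized window predicate.
  const window = range ? "AND r.created_at BETWEEN $1 AND $2" : "";
  const sql = `
    SELECT COUNT(*) FILTER (WHERE r.status = 'success') AS successful_retrievals,
           COUNT(*) FILTER (WHERE r.status = 'failed')  AS failed_retrievals
    FROM retrievals r
    WHERE 1 = 1 ${window}`;
  return { sql, params: range ? [range.from, range.to] : [] };
}
```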



nit: I think it would be easier to read this code from top to bottom if the functions were laid out in their natural call order (i.e., public function → helper function 1 → helper function 2). For example, put queryIPNI > httpsGet > extractProviderAddrs.

```typescript
  };
}

async verifyIPNIAdvertisement(blockCIDs: string[], expectedMultiaddr: string, maxDurationMs: number) {
```


I think some comments would be useful here. Basically this is the function that checks that the IPNI indexer (filecoinpin.contact) has the advertised CIDs.
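For context, the check described here amounts to a lookup against the indexer's find API, roughly as below. This is a sketch following the IPNI HTTP API's `GET /cid/{cid}` shape; the PR's actual request code and indexer host may differ.

```typescript
// Query an IPNI indexer's find API for a CID and report whether any
// provider results came back. Endpoint shape follows the IPNI HTTP API
// (GET /cid/{cid}); 404 conventionally means "not (yet) indexed".
export async function cidIsIndexed(indexerBase: string, cid: string): Promise<boolean> {
  const res = await fetch(`${indexerBase}/cid/${cid}`);
  if (res.status === 404) return false; // not advertised/ingested yet
  if (!res.ok) throw new Error(`indexer returned ${res.status}`);
  const body = (await res.json()) as {
    MultihashResults?: { ProviderResults?: unknown[] }[];
  };
  // Indexed only if at least one multihash result lists a provider.
  return (body.MultihashResults ?? []).some(
    (m) => (m.ProviderResults ?? []).length > 0,
  );
}
```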

Comment on lines 361 to 366

```diff
 if (failedCIDs.length > 0) {
-  throw new Error(
-    `IPNI verification failed: ${successCount}/${blockCIDs.length} CIDs verified. ` +
-      `Failed CIDs: ${JSON.stringify(failedCIDs, null, 2)}`,
-  );
+  this.logger.warn(
+    `IPNI verification: ${successCount}/${blockCIDs.length} CIDs verified, ${failedCIDs.length} failed`,
+  );
+  throw new Error(`IPNI verification failed: ${successCount}/${blockCIDs.length} CIDs verified`);
 }
```


We're not doing anything with failedCid.warning. Is that intentional?

Comment on lines 371 to 373


I think these could be renamed to be more self-documenting

Suggested change:

```typescript
numVerifiedCids: successCount, // "verified" by itself seems like it could just be a boolean
numCidsToIdeallyVerify: blockCIDs.length, // otherwise, "total" makes me wonder "total what?"; do we really need this? doesn't the caller already know it?
verificationDurationMs: durationMs,
```

They aren't the greatest names, but I think they remove some ambiguity.

@BigLep
Contributor

BigLep commented Nov 6, 2025

Concerning this screenshot:
Screenshot 2025-11-06 at 1 31 48 PM

Item 1: terminology

"Retrieval Success" is ambiguous to me. Whenever we talk about "retrieval", I think we need to qualify it.

In my mind, we either have:

  • "SP /piece Retrieval"
  • "IPFS Mainnet Retrieval"

Item 2: different success rates

I'm confused why we have one success rate above that is a bar and the 7-day success rate below just as a number.

I assume the top "Success Rate" is for all time.

Here is an idea for combining these into just numbers, rather than a mix of numbers and bars.

| | All Time Attempts | All Time Success Rate | 7 Day Attempts | 7 Day Success Rate |
| --- | --- | --- | --- | --- |
| Uploads | 50 | 100% | 10 | 100% |
| SP /piece Retrieval | 50 | 100% | 10 | 100% |
| IPNI Indexing | 50 | 80% | 10 | 90% |
| IPFS Mainnet Retrieval | 50 | 70% | 10 | 90% |

You don't have to go with this, but lets at least label clearly and be intentional on our organizing. More on that below.

Mixing Success Rates and Latency Metrics

The screenshot flow reads oddly to me right now:

  • Success Rate (I assume all-time)
  • IPNI & IPFS Metrics: here we cover success rates but also latencies
  • 7 Day Retrieval Success Rate

I think either group by metric type (e.g., success rates, then latency metrics) or group by system (direct SP interactions, IPNI, IPFS Mainnet). Right now you're mixing them.

My recommendation would be to move the IPNI and IPFS latency metrics down to the Latency section where you cover other latencies.

Define the metric

The only stat that is self-describing is "success rate"; I know how to combine multiple data points into a success rate. But how are you combining:

  • latencies
  • TTFBs
  • Time to Index
  • Time to Advertise
  • etc.

Are you averaging, summing, counting, taking p50, etc.? Let's label to make it clear.


Again, feel free to pick through the feedback and I'm happy to talk verbally if that's helpful. Thanks for your work here.

@BigLep
Contributor

BigLep commented Nov 6, 2025

I realized I gave a bunch of feedback without any sense of priority. The minimum amount of things I would want to see before merging (rest could be backlogged):

  • Concern 1: terminal state should be "verified" not "retrieved"
  • Concern 3: transitioning between "sp_received_retrieve_request" and verifying with IPNI
  • Item 1: terminology

@BigLep
Contributor

BigLep commented Nov 8, 2025

@silent-cipher: we've also learned that it's important that the https://sp.domain/ipfs endpoint supports HTTP/2. When doing the URL check for /ipfs, can you make sure to use a client that doesn't have HTTP/1.1 fallback? Boxo (a Go IPFS client) requires HTTP/2 per https://github.com/ipfs/boxo/blob/fc4c0e6a90715d788462e715e8543a41abb17491/bitswap/network/httpnet/httpnet.go#L613-L621

@silent-cipher
Collaborator Author

silent-cipher commented Nov 8, 2025

I was using datacenter proxies for retrievals. I need to confirm whether they support HTTP/2. If not, I'll bypass the proxy and make a direct request using an HTTP/2 client for the /ipfs URL check.

@BigLep
Contributor

BigLep commented Nov 8, 2025

> I was using datacenter proxies for retrievals. I need to confirm whether they support HTTP/2. If not, I'll bypass the proxy and make a direct request using an HTTP/2 client for the /ipfs URL check.

I am making comments without knowing the full architecture here. Maybe there are other good reasons to be using data center proxies.

So maybe the thing to do is:

  1. do both, OR
  2. separately check whether an SP supports HTTP/2 (rather than checking on each request)?

@silent-cipher
Collaborator Author

Thanks for clarifying!
I had the same question: since the Go IPFS (Boxo) client requires HTTP/2, doesn’t that mean an SP that doesn’t support HTTP/2 wouldn’t work for /ipfs retrievals anyway?
If that’s the case, HTTP/2 support seems like a strict requirement rather than something we’d need to check dynamically for each request.

I’ve verified that the datacenter proxies I’m using do support HTTP/2, so that shouldn’t be an issue.
That said, I agree it might still make sense to do a one-time or periodic check for HTTP/2 support per SP instead of checking on every retrieval.

@BigLep
Contributor

BigLep commented Nov 10, 2025

> HTTP/2 support seems like a strict requirement rather than something we'd need to check dynamically for each request.

Yes, HTTP/2 support is a strict requirement. This will be getting added to the IPFS HTTP Gateway specs per ipfs/specs#519

Given that, it would be ideal if all of our HTTP retrieval checks used HTTP/2 only (didn't allow HTTP/1.1).
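A minimal way to probe an SP for HTTP/2 support, assuming a Node environment like this repo's. This is a sketch, not the dealbot's actual code: it attempts an HTTP/2 session (ALPN "h2" for https, h2c prior knowledge for http) and reports failure rather than falling back to HTTP/1.1.

```typescript
import { connect } from "node:http2";

// Probe whether an origin speaks HTTP/2 (hypothetical helper). There is
// deliberately no HTTP/1.1 fallback: if no HTTP/2 session comes up
// within the deadline, the probe fails.
export function supportsHttp2(origin: string, timeoutMs = 5000): Promise<boolean> {
  return new Promise((resolve) => {
    const session = connect(origin);
    const timer = setTimeout(() => {
      session.destroy();
      resolve(false); // no session within the deadline
    }, timeoutMs);
    session.once("connect", () => {
      clearTimeout(timer);
      session.close();
      resolve(true); // HTTP/2 session established
    });
    session.once("error", () => {
      clearTimeout(timer);
      resolve(false); // refused, TLS/ALPN mismatch, etc.
    });
  });
}
```

Per the discussion above, this could run as a one-time or periodic per-SP check instead of on every retrieval.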

@rjan90 rjan90 moved this from 📌 Triage to ⌨️ In Progress in FOC Nov 11, 2025
@silent-cipher
Collaborator Author

Merging this for now after addressing the priority reviews. I’ll follow up on the remaining feedback in a later update.

@silent-cipher silent-cipher merged commit 5ee3e74 into main Nov 12, 2025
1 check passed
@github-project-automation github-project-automation Bot moved this from ⌨️ In Progress to 🎉 Done in FOC Nov 12, 2025
@BigLep
Contributor

BigLep commented Nov 12, 2025

@silent-cipher : what is the list of what is remaining? Can we move that into an issue?

@BigLep
Contributor

BigLep commented Nov 14, 2025

Giving feedback on https://dealbot.fwss.io/ on 2025-11-14T19:08:41Z


Item 1: IPNI Latency Metric Names

I think these terms could be more self documenting (similar to concern 2 in #52 (review))

| Current | Idea 1 | Idea 2 |
| --- | --- | --- |
| Avg Time to Index | Avg SP Index Latency | |
| Avg Time to Advertise | Avg SP to Announce Advertisement Chain Latency | Avg SP to Announce Latency |
| Avg Time to Retrieve Request | Avg Indexer Retrieve from SP Latency | Avg SP Retrieve Received Request Latency |
| Avg Time to Verify | Avg Verify IPNI Indexer Latency | Avg Verify IPNI Latency |

Item 2: how is IPFS Mainnet retrieval working if IPNI is not?

I'm confused how IPNI Indexer requests are failing, but IPFS Mainnet retrievals are succeeding, given IPFS Mainnet retrieval depends on IPNI Indexing?

Are we maybe not using unique CIDs for each run?
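If identical payloads are being reused across runs, salting each upload would guarantee fresh CIDs. A sketch; the dealbot's actual payload generation may differ.

```typescript
import { randomBytes } from "node:crypto";

// Append a per-run random nonce to the payload so every run produces
// distinct bytes, and therefore a distinct CID, avoiding stale IPNI
// state masking the current run's indexing results.
export function saltPayload(payload: Uint8Array, nonceLen = 16): Uint8Array {
  const nonce = randomBytes(nonceLen);
  const out = new Uint8Array(payload.length + nonce.length);
  out.set(payload, 0);
  out.set(nonce, payload.length);
  return out;
}
```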


Feel free to move this into an issue if that is preferred.


Labels

None yet

Projects

Status: 🎉 Done

Development

Successfully merging this pull request may close these issues.

Add IPFS tooling-based retrieval checks alongside direct-to-SP retrieval

5 participants