
fix: data-retention metrics use exact subgraph values#365

Merged
SgtPooki merged 4 commits into main from fix/data-retention-fault-count
Mar 17, 2026

Conversation

@SgtPooki
Collaborator

@SgtPooki SgtPooki commented Mar 17, 2026

fix: data-retention metrics systematically over-counting faults

Problem

Data-retention success percentages in Grafana were significantly lower than what pdp-explorer shows for the same providers. For example, a provider showing ~99% success on pdp-explorer might show 70-80% in our dashboards.

Root cause

The processProvider method in data-retention.service.ts estimated "overdue" proving periods between subgraph updates and assumed all of them were faults:

// OLD: speculative estimation added to both faulted and total
const estimatedOverduePeriods = proofSets.reduce((acc, proofSet) => {
  return acc + (blockNumberBigInt - (proofSet.nextDeadline + 1n)) / proofSet.maxProvingPeriod;
}, 0n);

const estimatedTotalFaulted = totalFaultedPeriods + estimatedOverduePeriods;
const estimatedTotalPeriods = totalProvingPeriods + estimatedOverduePeriods;

This created a systematic bias through the following cycle:

  1. Between NextProvingPeriod events (overdue grows): faultedDelta > 0, successDelta = 0. Faults are emitted to Prometheus, successes are not.
  2. When NextProvingPeriod fires (subgraph records actuals): the overdue estimate drops because nextDeadline moves forward. If the provider actually proved successfully, the confirmed faulted count is lower than our estimate, so faultedDelta goes negative.
  3. Negative delta guard fires: both the faulted correction and the success increment are silently discarded.

Net effect: overdue periods are pessimistically counted as faults, and when the subgraph later confirms they were successes, the correction is thrown away. Over time this inflates fault rates and suppresses success rates.
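The faulty interaction can be sketched in a few lines. This is a hypothetical reduction, not the actual service code; the type and function names are invented for illustration (the real logic lives in `DataRetentionService.processProvider`):

```typescript
// Hypothetical reduction of the old bias (names invented for illustration).
type Baseline = { faulted: bigint; total: bigint };

// Computes deltas against the stored baseline. Negative deltas trip the
// guard: the baseline still resets, but nothing is emitted.
function pollDeltas(baseline: Baseline, observed: Baseline) {
  const faultedDelta = observed.faulted - baseline.faulted;
  const successDelta =
    (observed.total - observed.faulted) - (baseline.total - baseline.faulted);
  baseline.faulted = observed.faulted;
  baseline.total = observed.total;
  if (faultedDelta < 0n || successDelta < 0n) {
    return { faultedDelta: 0n, successDelta: 0n }; // guard: discard both
  }
  return { faultedDelta, successDelta };
}

const baseline: Baseline = { faulted: 0n, total: 0n };
const counters = { faulted: 0n, success: 0n };

// Poll 1: two overdue periods speculatively counted as faults.
let d = pollDeltas(baseline, { faulted: 2n, total: 2n });
counters.faulted += d.faultedDelta; // +2 faults emitted
counters.success += d.successDelta; // +0

// Poll 2: subgraph confirms the provider actually proved both periods.
d = pollDeltas(baseline, { faulted: 0n, total: 2n });
counters.faulted += d.faultedDelta; // guard fired: correction discarded
counters.success += d.successDelta; // the 2 successes are lost too

// counters: { faulted: 2n, success: 0n } -- permanently inflated
```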

Fix

Remove overdue estimation entirely. Use only subgraph-confirmed totals:

// NEW: use confirmed values only
const confirmedTotalSuccess = totalProvingPeriods - totalFaultedPeriods;

This aligns with how pdp-explorer computes its own fault rates -- it uses the subgraph's totalFaultedPeriods and totalProvingPeriods directly from NextProvingPeriod event handlers, without speculative estimates.
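A minimal sketch of the confirmed-totals delta computation (field and function names are assumed for illustration; the actual `processProvider` signature differs):

```typescript
// Deltas derived purely from subgraph-confirmed cumulative totals
// (illustrative names; not the actual service API).
function confirmedDeltas(
  prev: { faulted: bigint; total: bigint },
  curr: { totalFaultedPeriods: bigint; totalProvingPeriods: bigint },
) {
  const confirmedTotalSuccess =
    curr.totalProvingPeriods - curr.totalFaultedPeriods;
  return {
    faultedDelta: curr.totalFaultedPeriods - prev.faulted,
    successDelta: confirmedTotalSuccess - (prev.total - prev.faulted),
  };
}

// A provider at 99% success: 100 confirmed periods, 1 confirmed fault.
const d = confirmedDeltas(
  { faulted: 0n, total: 0n },
  { totalFaultedPeriods: 1n, totalProvingPeriods: 100n },
);
// d.faultedDelta === 1n, d.successDelta === 99n
```

Because both inputs are monotonic subgraph aggregates, the deltas only go negative on a subgraph reorg or baseline mismatch, which is exactly what the guard is for.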

Why not fix the estimation instead of removing it?

The overdue estimation was intended to provide faster signal (detect faults before the subgraph records them). But:

  • The subgraph's NextProvingPeriod handler already accounts for skipped periods when it fires -- any periods that were actually missed get recorded as faults at that point.
  • There's no way to know during the overdue window whether the provider will prove or not. Assuming fault is always wrong for providers that are proving on time but whose events haven't been indexed yet.
  • The delta-based counter model (monotonic counters + negative delta guard) is fundamentally incompatible with volatile estimates that go up and down.

Removing the estimation trades a small delay in fault detection (we wait for the subgraph to confirm) for correct metrics. Given that we poll the subgraph regularly, this delay is minimal.

What about existing persisted baselines?

No migration needed. On the first poll after deploy:

  1. The new code computes totalFaultedPeriods from the subgraph (lower than the old inflated baseline).
  2. faultedDelta goes negative, triggering the negative delta guard.
  3. The baseline resets to the correct values without emitting counters.
  4. From the next poll onward, deltas are computed correctly.

The negative delta guard that caused the original problem acts as a graceful migration mechanism here.

What about the Prometheus counters that already have inflated values?

Prometheus counters are monotonic (only go up). The existing inflated values will remain, but rate() and increase() queries in Grafana only look at the change over a window, not absolute values. After deploy, the rate of fault increments will drop to match reality, and the success rate percentage will correct itself within one query window (e.g., [1d]).
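For example, a Grafana success-rate panel built on these counters would self-correct because it only considers increases within the query window. The metric names below are illustrative, not the service's actual metric names:

```promql
# Success percentage over a 1-day window; metric names are assumed.
100 * sum(increase(data_retention_success_periods_total[1d]))
  / (sum(increase(data_retention_success_periods_total[1d]))
   + sum(increase(data_retention_faulted_periods_total[1d])))
```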

Changes

  • queries.ts: Removed proofSets sub-query and $blockNumber variable from GET_PROVIDERS_WITH_DATASETS. The query now fetches only provider-level aggregates (totalFaultedPeriods, totalProvingPeriods), reducing payload size.
  • types.ts: Removed DataSet type, dataSetSchema Joi validator, proofSets from ProviderDataSetResponse, and blockNumber from ProvidersWithDataSetsOptions.
  • pdp-subgraph.service.ts: Removed blockNumber parameter from fetchProvidersWithDatasets, fetchWithRetry, and fetchMultipleBatchesWithRateLimit.
  • data-retention.service.ts: Removed overdue period estimation from processProvider. Method no longer needs blockNumberBigInt parameter. Uses subgraph-confirmed totals directly.
  • *.spec.ts: Updated all tests to reflect the simplified query/types. Removed overdue-specific and proofSet-specific tests.
  • docs/checks/data-retention.md: Updated documentation to reflect the removal and explain why.

Copilot AI review requested due to automatic review settings March 17, 2026 15:06
@SgtPooki SgtPooki self-assigned this Mar 17, 2026
@FilOzzy FilOzzy added this to FOC Mar 17, 2026
@github-project-automation github-project-automation bot moved this to 📌 Triage in FOC Mar 17, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR removes the speculative “overdue period” estimation from the data-retention check and switches metric computation to use only subgraph-confirmed cumulative totals, preventing systematic fault inflation and lost corrections due to the negative-delta guard.

Changes:

  • Update DataRetentionService.processProvider() to compute success totals as totalProvingPeriods - totalFaultedPeriods and compute deltas from these confirmed totals.
  • Update unit tests to reflect the new confirmed-totals behavior and remove overdue-estimation-specific expectations.
  • Update the data retention documentation to describe the confirmed-totals approach.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docs/checks/data-retention.md | Updates the documented computation flow to remove overdue estimation and describe confirmed totals/deltas. |
| apps/backend/src/data-retention/data-retention.service.ts | Removes overdue-period estimation logic and computes deltas directly from subgraph-confirmed totals. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Adjusts tests and expectations to match confirmed-totals behavior and removes overdue-estimation test cases. |


@rjan90 rjan90 moved this from 📌 Triage to 🔎 Awaiting review in FOC Mar 17, 2026
@rjan90 rjan90 added this to the M4.1: mainnet ready milestone Mar 17, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the data-retention check to stop using speculative “overdue period” estimation and instead rely solely on PDP subgraph provider-level cumulative totals, reducing systematic fault-rate inflation and simplifying the query/validation surface.

Changes:

  • Remove proofSets/deadline-based overdue estimation from the PDP subgraph query, types, and validation.
  • Compute success periods directly from subgraph-confirmed totals (totalProvingPeriods - totalFaultedPeriods) and use these for baseline/delta logic.
  • Update backend unit tests and the data-retention documentation to match the new provider-level totals approach.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/checks/data-retention.md | Updates the explanation/flow to match the new “confirmed totals only” logic. |
| apps/backend/src/pdp-subgraph/types.ts | Removes dataset/proofSet fields from the validated response/types. |
| apps/backend/src/pdp-subgraph/types.spec.ts | Adjusts validation tests to reflect the simplified provider totals response. |
| apps/backend/src/pdp-subgraph/queries.ts | Simplifies the GraphQL query to fetch only provider-level totals (no proofSets). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.service.ts | Removes blockNumber from provider fetch options/variables and batching calls. |
| apps/backend/src/pdp-subgraph/pdp-subgraph.service.spec.ts | Updates service tests to reflect the new variables/query shape. |
| apps/backend/src/data-retention/data-retention.service.ts | Removes overdue estimation and computes deltas from subgraph-confirmed totals only. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates/renames tests to validate the new confirmed-totals behavior. |


@SgtPooki SgtPooki merged commit 3682ecb into main Mar 17, 2026
6 checks passed
@SgtPooki SgtPooki deleted the fix/data-retention-fault-count branch March 17, 2026 19:42
@github-project-automation github-project-automation bot moved this from 🔎 Awaiting review to 🎉 Done in FOC Mar 17, 2026
@SgtPooki
Collaborator Author

I'm merging this so I can see if this fixes the data-retention fault rate on staging, so we can get more accurate metrics.

Contributor

@BigLep BigLep left a comment

@silent-cipher: I would still like it if you look at this code change to make sure your mental model is updated and that there aren't other side effects we're missing.

Source: [`data-retention.service.ts` (`processProvider`)](../../apps/backend/src/data-retention/data-retention.service.ts#L209)

### 3. Compute Challenge Totals
> **Note:** An earlier implementation estimated overdue periods (periods elapsed since the last recorded deadline) and pessimistically counted them as faults. This was removed because the speculative estimation systematically inflated fault rates — overdue periods were counted as faults immediately, but when the subgraph later confirmed them as successes, the correction was discarded by the negative-delta guard.
Contributor

Do we think this comment will still be useful 1 month from now? If not, then I think we should remove it.

@silent-cipher
Collaborator

The removal of overdue estimation makes sense if the goal is to strictly rely on subgraph-confirmed data, but I don’t think it fully covers an important class of scenarios that the previous logic was handling.

The original motivation behind estimating overdue periods was to handle cases where NextProvingPeriod hasn’t been emitted for a long time. In such cases, the subgraph remains stale because it only updates when the handler fires. So relying purely on subgraph-confirmed totals introduces a blind spot.

> The subgraph's NextProvingPeriod handler already accounts for skipped periods when it fires

The key issue here is when it fires. If it doesn’t fire for an extended duration, we effectively stop accounting for faults during that window.

Also, the estimation isn't entirely speculative. We already filter for datasets where nextDeadline < indexed block number, and compute overdue periods as:

(blockNumber - (nextDeadline + 1)) / maxProvingPeriod

This indicates periods that have definitively passed without a proving event, meaning they are effectively skipped/faulted periods.

For example:

  • Dataset 256
  • nextDeadline = 3356327
  • assumed indexed block ≈ 3549371

This gives:

(3549371 - (3356327 + 1)) / 240 = 804 overdue periods

804 missed periods is significant and not something we can treat as negligible or “noise”.
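The arithmetic in the example checks out under bigint integer division, matching the removed estimation code (the indexed block height is the commenter's assumption):

```typescript
// Overdue-period formula from the comment above, using bigint division.
// Values come from the Dataset 256 example.
const nextDeadline = 3356327n;
const blockNumber = 3549371n; // assumed indexed block height
const maxProvingPeriod = 240n;

const overdue = (blockNumber - (nextDeadline + 1n)) / maxProvingPeriod;
console.log(overdue); // 804n
```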

@SgtPooki
Collaborator Author

@silent-cipher, good callout. You're right that there's a blind spot here. I opened #374 for us to discuss further.

@BigLep we could probably use your eyes on that as well
