
fix: data-retention metrics use exact subgraph values#365

Merged
SgtPooki merged 4 commits into main from fix/data-retention-fault-count
Mar 17, 2026

Conversation

@SgtPooki
Collaborator

@SgtPooki SgtPooki commented Mar 17, 2026

fix: data-retention metrics systematically over-counting faults

Problem

Data-retention success percentages in Grafana were significantly lower than what pdp-explorer shows for the same providers. For example, a provider showing ~99% success on pdp-explorer might show 70-80% in our dashboards.

Root cause

The processProvider method in data-retention.service.ts estimated "overdue" proving periods between subgraph updates and assumed all of them were faults:

// OLD: speculative estimation added to both faulted and total
const estimatedOverduePeriods = proofSets.reduce((acc, proofSet) => {
  return acc + (blockNumberBigInt - (proofSet.nextDeadline + 1n)) / proofSet.maxProvingPeriod;
}, 0n);

const estimatedTotalFaulted = totalFaultedPeriods + estimatedOverduePeriods;
const estimatedTotalPeriods = totalProvingPeriods + estimatedOverduePeriods;

This created a systematic bias through the following cycle:

  1. Between NextProvingPeriod events (overdue grows): faultedDelta > 0, successDelta = 0. Faults are emitted to Prometheus, successes are not.
  2. When NextProvingPeriod fires (subgraph records actuals): the overdue estimate drops because nextDeadline moves forward. If the provider actually proved successfully, the confirmed faulted count is lower than our estimate, so faultedDelta goes negative.
  3. Negative delta guard fires: both the faulted correction and the success increment are silently discarded.

Net effect: overdue periods are pessimistically counted as faults, and when the subgraph later confirms they were successes, the correction is thrown away. Over time this inflates fault rates and suppresses success rates.
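The faulty interaction can be sketched in a few lines. This is a hypothetical reduction, not the actual service code; the type and function names are invented for illustration (the real logic lives in `DataRetentionService.processProvider`):

```typescript
// Hypothetical reduction of the old bias (names invented for illustration).
type Baseline = { faulted: bigint; total: bigint };

// Computes deltas against the stored baseline. Negative deltas trip the
// guard: the baseline still resets, but nothing is emitted.
function pollDeltas(baseline: Baseline, observed: Baseline) {
  const faultedDelta = observed.faulted - baseline.faulted;
  const successDelta =
    (observed.total - observed.faulted) - (baseline.total - baseline.faulted);
  baseline.faulted = observed.faulted;
  baseline.total = observed.total;
  if (faultedDelta < 0n || successDelta < 0n) {
    return { faultedDelta: 0n, successDelta: 0n }; // guard: discard both
  }
  return { faultedDelta, successDelta };
}

const baseline: Baseline = { faulted: 0n, total: 0n };
const counters = { faulted: 0n, success: 0n };

// Poll 1: two overdue periods speculatively counted as faults.
let d = pollDeltas(baseline, { faulted: 2n, total: 2n });
counters.faulted += d.faultedDelta; // +2 faults emitted
counters.success += d.successDelta; // +0

// Poll 2: subgraph confirms the provider actually proved both periods.
d = pollDeltas(baseline, { faulted: 0n, total: 2n });
counters.faulted += d.faultedDelta; // guard fired: correction discarded
counters.success += d.successDelta; // the 2 successes are lost too

// counters: { faulted: 2n, success: 0n } -- permanently inflated
```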

Fix

Remove overdue estimation entirely. Use only subgraph-confirmed totals:

// NEW: use confirmed values only
const confirmedTotalSuccess = totalProvingPeriods - totalFaultedPeriods;

This aligns with how pdp-explorer computes its own fault rates -- it uses the subgraph's totalFaultedPeriods and totalProvingPeriods directly from NextProvingPeriod event handlers, without speculative estimates.
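A minimal sketch of the confirmed-totals delta computation (field and function names are assumed for illustration; the actual `processProvider` signature differs):

```typescript
// Deltas derived purely from subgraph-confirmed cumulative totals
// (illustrative names; not the actual service API).
function confirmedDeltas(
  prev: { faulted: bigint; total: bigint },
  curr: { totalFaultedPeriods: bigint; totalProvingPeriods: bigint },
) {
  const confirmedTotalSuccess =
    curr.totalProvingPeriods - curr.totalFaultedPeriods;
  return {
    faultedDelta: curr.totalFaultedPeriods - prev.faulted,
    successDelta: confirmedTotalSuccess - (prev.total - prev.faulted),
  };
}

// A provider at 99% success: 100 confirmed periods, 1 confirmed fault.
const d = confirmedDeltas(
  { faulted: 0n, total: 0n },
  { totalFaultedPeriods: 1n, totalProvingPeriods: 100n },
);
// d.faultedDelta === 1n, d.successDelta === 99n
```

Because both inputs are monotonic subgraph aggregates, the deltas only go negative on a subgraph reorg or baseline mismatch, which is exactly what the guard is for.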

Why not fix the estimation instead of removing it?

The overdue estimation was intended to provide faster signal (detect faults before the subgraph records them). But:

  • The subgraph's NextProvingPeriod handler already accounts for skipped periods when it fires -- any periods that were actually missed get recorded as faults at that point.
  • There's no way to know during the overdue window whether the provider will prove or not. Assuming fault is always wrong for providers that are proving on time but whose events haven't been indexed yet.
  • The delta-based counter model (monotonic counters + negative delta guard) is fundamentally incompatible with volatile estimates that go up and down.

Removing the estimation trades a small delay in fault detection (we wait for the subgraph to confirm) for correct metrics. Given that we poll the subgraph regularly, this delay is minimal.

What about existing persisted baselines?

No migration needed. On the first poll after deploy:

  1. The new code computes totalFaultedPeriods from the subgraph (lower than the old inflated baseline).
  2. faultedDelta goes negative, triggering the negative delta guard.
  3. The baseline resets to the correct values without emitting counters.
  4. From the next poll onward, deltas are computed correctly.

The negative delta guard that caused the original problem acts as a graceful migration mechanism here.

What about the Prometheus counters that already have inflated values?

Prometheus counters are monotonic (only go up). The existing inflated values will remain, but rate() and increase() queries in Grafana only look at the change over a window, not absolute values. After deploy, the rate of fault increments will drop to match reality, and the success rate percentage will correct itself within one query window (e.g., [1d]).
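For example, a Grafana success-rate panel built on these counters would self-correct because it only considers increases within the query window. The metric names below are illustrative, not the service's actual metric names:

```promql
# Success percentage over a 1-day window; metric names are assumed.
100 * sum(increase(data_retention_success_periods_total[1d]))
  / (sum(increase(data_retention_success_periods_total[1d]))
   + sum(increase(data_retention_faulted_periods_total[1d])))
```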

Changes

  • queries.ts: Removed proofSets sub-query and $blockNumber variable from GET_PROVIDERS_WITH_DATASETS. The query now fetches only provider-level aggregates (totalFaultedPeriods, totalProvingPeriods), reducing payload size.
  • types.ts: Removed DataSet type, dataSetSchema Joi validator, proofSets from ProviderDataSetResponse, and blockNumber from ProvidersWithDataSetsOptions.
  • pdp-subgraph.service.ts: Removed blockNumber parameter from fetchProvidersWithDatasets, fetchWithRetry, and fetchMultipleBatchesWithRateLimit.
  • data-retention.service.ts: Removed overdue period estimation from processProvider. Method no longer needs blockNumberBigInt parameter. Uses subgraph-confirmed totals directly.
  • *.spec.ts: Updated all tests to reflect the simplified query/types. Removed overdue-specific and proofSet-specific tests.
  • docs/checks/data-retention.md: Updated documentation to reflect the removal and explain why.

Copilot AI review requested due to automatic review settings March 17, 2026 15:06
@SgtPooki SgtPooki self-assigned this Mar 17, 2026
@FilOzzy FilOzzy added this to FOC Mar 17, 2026
@github-project-automation github-project-automation bot moved this to 📌 Triage in FOC Mar 17, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR removes the speculative “overdue period” estimation from the data-retention check and switches metric computation to use only subgraph-confirmed cumulative totals, preventing systematic fault inflation and lost corrections due to the negative-delta guard.

Changes:

  • Update DataRetentionService.processProvider() to compute success totals as totalProvingPeriods - totalFaultedPeriods and compute deltas from these confirmed totals.
  • Update unit tests to reflect the new confirmed-totals behavior and remove overdue-estimation-specific expectations.
  • Update the data retention documentation to describe the confirmed-totals approach.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docs/checks/data-retention.md | Updates the documented computation flow to remove overdue estimation and describe confirmed totals/deltas. |
| apps/backend/src/data-retention/data-retention.service.ts | Removes overdue-period estimation logic and computes deltas directly from subgraph-confirmed totals. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Adjusts tests and expectations to match confirmed-totals behavior and removes overdue-estimation test cases. |


@rjan90 rjan90 moved this from 📌 Triage to 🔎 Awaiting review in FOC Mar 17, 2026
@rjan90 rjan90 added this to the M4.1: mainnet ready milestone Mar 17, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the data-retention check to stop using speculative “overdue period” estimation and instead rely solely on PDP subgraph provider-level cumulative totals, reducing systematic fault-rate inflation and simplifying the query/validation surface.

Changes:

  • Remove proofSets/deadline-based overdue estimation from the PDP subgraph query, types, and validation.
  • Compute success periods directly from subgraph-confirmed totals (totalProvingPeriods - totalFaultedPeriods) and use these for baseline/delta logic.
  • Update backend unit tests and the data-retention documentation to match the new provider-level totals approach.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/checks/data-retention.md | Updates the explanation/flow to match the new “confirmed totals only” logic. |
| apps/backend/src/pdp-subgraph/types.ts | Removes dataset/proofSet fields from the validated response/types. |
| apps/backend/src/pdp-subgraph/types.spec.ts | Adjusts validation tests to reflect the simplified provider totals response. |
| apps/backend/src/pdp-subgraph/queries.ts | Simplifies the GraphQL query to fetch only provider-level totals (no proofSets). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.service.ts | Removes blockNumber from provider fetch options/variables and batching calls. |
| apps/backend/src/pdp-subgraph/pdp-subgraph.service.spec.ts | Updates service tests to reflect the new variables/query shape. |
| apps/backend/src/data-retention/data-retention.service.ts | Removes overdue estimation and computes deltas from subgraph-confirmed totals only. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates/renames tests to validate the new confirmed-totals behavior. |


@SgtPooki SgtPooki merged commit 3682ecb into main Mar 17, 2026
6 checks passed
@SgtPooki SgtPooki deleted the fix/data-retention-fault-count branch March 17, 2026 19:42
@github-project-automation github-project-automation bot moved this from 🔎 Awaiting review to 🎉 Done in FOC Mar 17, 2026
@SgtPooki
Collaborator Author

I'm merging this so I can see if this fixes the data-retention fault rate on staging, so we can get more accurate metrics.

Contributor

@BigLep BigLep left a comment

@silent-cipher: I would still like it if you look at this code change to make sure your mental model is updated and that there aren't other side effects we're missing.

Source: [`data-retention.service.ts` (`processProvider`)](../../apps/backend/src/data-retention/data-retention.service.ts#L209)

### 3. Compute Challenge Totals
> **Note:** An earlier implementation estimated overdue periods (periods elapsed since the last recorded deadline) and pessimistically counted them as faults. This was removed because the speculative estimation systematically inflated fault rates — overdue periods were counted as faults immediately, but when the subgraph later confirmed them as successes, the correction was discarded by the negative-delta guard.
Contributor

Do we think this comment will still be useful 1 month from now? If not, then I think we should remove it.

@silent-cipher
Collaborator

The removal of overdue estimation makes sense if the goal is to strictly rely on subgraph-confirmed data, but I don’t think it fully covers an important class of scenarios that the previous logic was handling.

The original motivation behind estimating overdue periods was to handle cases where NextProvingPeriod hasn’t been emitted for a long time. In such cases, the subgraph remains stale because it only updates when the handler fires. So relying purely on subgraph-confirmed totals introduces a blind spot.

> The subgraph's NextProvingPeriod handler already accounts for skipped periods when it fires

The key issue here is when it fires. If it doesn’t fire for an extended duration, we effectively stop accounting for faults during that window.

Also, the estimation isn't entirely speculative. We already filter for datasets where nextDeadline < indexed block number, and compute overdue periods as:

(blockNumber - (nextDeadline + 1)) / maxProvingPeriod

This indicates periods that have definitively passed without a proving event, meaning they are effectively skipped/faulted periods.

For example:

  • Dataset 256
  • nextDeadline = 3356327
  • assumed indexed block ≈ 3549371

This gives:

(3549371 - (3356327 + 1)) / 240 = 804 overdue periods

804 missed periods is significant and not something we can treat as negligible or “noise”.
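The arithmetic in the example checks out under bigint integer division, matching the removed estimation code (the indexed block height is the commenter's assumption):

```typescript
// Overdue-period formula from the comment above, using bigint division.
// Values come from the Dataset 256 example.
const nextDeadline = 3356327n;
const blockNumber = 3549371n; // assumed indexed block height
const maxProvingPeriod = 240n;

const overdue = (blockNumber - (nextDeadline + 1n)) / maxProvingPeriod;
console.log(overdue); // 804n
```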

@SgtPooki
Collaborator Author

@silent-cipher, good callout. You're right that there's a blind spot here. I opened #374 for us to discuss further.

@BigLep we could probably use your eyes on that as well
