fix: data-retention metrics use exact subgraph values #365
Conversation
Pull request overview
This PR removes the speculative “overdue period” estimation from the data-retention check and switches metric computation to use only subgraph-confirmed cumulative totals, preventing systematic fault inflation and lost corrections due to the negative-delta guard.
Changes:
- Update `DataRetentionService.processProvider()` to compute success totals as `totalProvingPeriods - totalFaultedPeriods` and compute deltas from these confirmed totals.
- Update unit tests to reflect the new confirmed-totals behavior and remove overdue-estimation-specific expectations.
- Update the data retention documentation to describe the confirmed-totals approach.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docs/checks/data-retention.md | Updates the documented computation flow to remove overdue estimation and describe confirmed totals/deltas. |
| apps/backend/src/data-retention/data-retention.service.ts | Removes overdue-period estimation logic and computes deltas directly from subgraph-confirmed totals. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Adjusts tests and expectations to match confirmed-totals behavior and removes overdue-estimation test cases. |
Pull request overview
This PR updates the data-retention check to stop using speculative “overdue period” estimation and instead rely solely on PDP subgraph provider-level cumulative totals, reducing systematic fault-rate inflation and simplifying the query/validation surface.
Changes:
- Remove `proofSets`/deadline-based overdue estimation from the PDP subgraph query, types, and validation.
- Compute success periods directly from subgraph-confirmed totals (`totalProvingPeriods - totalFaultedPeriods`) and use these for baseline/delta logic.
- Update backend unit tests and the data-retention documentation to match the new provider-level totals approach.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/checks/data-retention.md | Updates the explanation/flow to match the new “confirmed totals only” logic. |
| apps/backend/src/pdp-subgraph/types.ts | Removes dataset/proofSet fields from the validated response/types. |
| apps/backend/src/pdp-subgraph/types.spec.ts | Adjusts validation tests to reflect the simplified provider totals response. |
| apps/backend/src/pdp-subgraph/queries.ts | Simplifies the GraphQL query to fetch only provider-level totals (no proofSets). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.service.ts | Removes blockNumber from provider fetch options/variables and batching calls. |
| apps/backend/src/pdp-subgraph/pdp-subgraph.service.spec.ts | Updates service tests to reflect the new variables/query shape. |
| apps/backend/src/data-retention/data-retention.service.ts | Removes overdue estimation and computes deltas from subgraph-confirmed totals only. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates/renames tests to validate the new confirmed-totals behavior. |
I'm merging this so I can see if this fixes the data-retention fault rate on staging, so we can get more accurate metrics.
BigLep
left a comment
@silent-cipher: I would still like it if you look at this code change to make sure your mental model is updated and that there aren't other side effects we're missing.
Source: [`data-retention.service.ts` (`processProvider`)](../../apps/backend/src/data-retention/data-retention.service.ts#L209)

### 3. Compute Challenge Totals
> **Note:** An earlier implementation estimated overdue periods (periods elapsed since the last recorded deadline) and pessimistically counted them as faults. This was removed because the speculative estimation systematically inflated fault rates — overdue periods were counted as faults immediately, but when the subgraph later confirmed them as successes, the correction was discarded by the negative-delta guard.
1 month from now, do we think this comment will still be useful? If not, then I think we should remove it.
The removal of overdue estimation makes sense if the goal is to strictly rely on subgraph-confirmed data, but I don’t think it fully covers an important class of scenarios that the previous logic was handling. The original motivation behind estimating overdue periods was to handle cases where

The key issue here is when it fires. If it doesn’t fire for an extended duration, we effectively stop accounting for faults during that window.

Also, the estimation isn’t entirely speculative. We already filter datasets where:

This indicates periods that have definitively passed without a proving event, meaning they are effectively skipped/faulted periods. For example:

This gives:

804 missed periods is significant and not something we can treat as negligible or “noise”.
@silent-cipher, good callout. You're right that there's a blind spot here. I opened #374 for us to discuss further. @BigLep, we could probably use your eyes on that as well.
fix: data-retention metrics systematically over-counting faults
Problem
Data-retention success percentages in Grafana were significantly lower than what pdp-explorer shows for the same providers. For example, a provider showing ~99% success on pdp-explorer might show 70-80% in our dashboards.
Root cause
The `processProvider` method in `data-retention.service.ts` estimated "overdue" proving periods between subgraph updates and assumed all of them were faults. This created a systematic bias through the following cycle:
1. Between `NextProvingPeriod` events (overdue grows): `faultedDelta > 0`, `successDelta = 0`. Faults are emitted to Prometheus; successes are not.
2. When `NextProvingPeriod` fires (subgraph records actuals): the overdue estimate drops because `nextDeadline` moves forward. If the provider actually proved successfully, the confirmed faulted count is lower than our estimate, so `faultedDelta` goes negative.

Net effect: overdue periods are pessimistically counted as faults, and when the subgraph later confirms they were successes, the correction is thrown away. Over time this inflates fault rates and suppresses success rates.
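The cycle above can be sketched numerically. This is an illustrative simulation only: the numbers are made up, and the clamp stands in for the negative-delta guard described in this PR.

```typescript
// Poll 1: subgraph confirms 10 faulted periods; the old logic estimates 3
// overdue periods and pessimistically counts them as faults.
const baselineFaulted = 10;
const poll1Estimated = baselineFaulted + 3;             // confirmed + overdue estimate
const faultedDelta1 = poll1Estimated - baselineFaulted; // 3 speculative "faults" emitted

// Poll 2: NextProvingPeriod fires and the subgraph shows the provider actually
// proved those periods, so the confirmed faulted total is still 10.
const poll2Confirmed = 10;
const rawDelta2 = poll2Confirmed - poll1Estimated;      // -3
// The negative-delta guard clamps the correction to zero, so the 3 speculative
// faults from poll 1 are never taken back.
const faultedDelta2 = Math.max(0, rawDelta2);           // 0
```

Each cycle thus adds speculative faults that are never corrected, which is the systematic inflation described above.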
Fix
Remove overdue estimation entirely. Use only subgraph-confirmed totals:
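A minimal sketch of what the confirmed-totals computation could look like. The field names `totalProvingPeriods` and `totalFaultedPeriods` come from the subgraph as described in this PR; the `computeDeltas` helper, the `number`-typed fields, and the baseline shape are illustrative assumptions (the real logic lives in `processProvider` and uses persisted baselines).

```typescript
// Hypothetical helper mirroring the confirmed-totals approach.
interface ProviderTotals {
  totalProvingPeriods: number;
  totalFaultedPeriods: number;
}

function computeDeltas(current: ProviderTotals, baseline: ProviderTotals) {
  // Success periods are derived only from subgraph-confirmed totals.
  const successTotal = current.totalProvingPeriods - current.totalFaultedPeriods;
  const baselineSuccess = baseline.totalProvingPeriods - baseline.totalFaultedPeriods;

  // Negative-delta guard: never emit negative increments to monotonic counters.
  const faultedDelta = Math.max(0, current.totalFaultedPeriods - baseline.totalFaultedPeriods);
  const successDelta = Math.max(0, successTotal - baselineSuccess);
  return { faultedDelta, successDelta };
}

// Example: baseline 100 periods / 2 faults, current 110 periods / 3 faults.
const d = computeDeltas(
  { totalProvingPeriods: 110, totalFaultedPeriods: 3 },
  { totalProvingPeriods: 100, totalFaultedPeriods: 2 },
); // { faultedDelta: 1, successDelta: 9 }
```

Because nothing speculative ever enters the totals, the guard only fires in genuinely anomalous cases (e.g., a subgraph rollback) rather than on every correction.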
This aligns with how pdp-explorer computes its own fault rates -- it uses the subgraph's `totalFaultedPeriods` and `totalProvingPeriods` directly from `NextProvingPeriod` event handlers, without speculative estimates.

Why not fix the estimation instead of removing it?
The overdue estimation was intended to provide faster signal (detect faults before the subgraph records them). But:
- The `NextProvingPeriod` handler already accounts for skipped periods when it fires -- any periods that were actually missed get recorded as faults at that point.

Removing the estimation trades a small delay in fault detection (we wait for the subgraph to confirm) for correct metrics. Given that we poll the subgraph regularly, this delay is minimal.
What about existing persisted baselines?
No migration needed. On the first poll after deploy:
- `totalFaultedPeriods` is read from the subgraph (lower than the old inflated baseline).
- `faultedDelta` goes negative, triggering the negative delta guard.

The negative delta guard that caused the original problem acts as a graceful migration mechanism here.
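The first poll after deploy can be sketched numerically. The numbers are illustrative, and the baseline-reset step is an assumption about how the persisted baseline is updated after the guard fires.

```typescript
// Old persisted baseline was inflated by speculative overdue faults.
const persistedBaselineFaulted = 120;
// Subgraph-confirmed total is lower, since it never included the speculation.
const confirmedFaulted = 100;

const rawDelta = confirmedFaulted - persistedBaselineFaulted; // -20
// Negative-delta guard clamps to zero: no bogus fault increments are emitted.
const faultedDelta = Math.max(0, rawDelta); // 0
// Assumed follow-up: the baseline is reset to the confirmed total, so later
// polls compute deltas against accurate, non-inflated numbers.
const newBaseline = confirmedFaulted; // 100
```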
What about the Prometheus counters that already have inflated values?
Prometheus counters are monotonic (only go up). The existing inflated values will remain, but `rate()` and `increase()` queries in Grafana only look at the change over a window, not absolute values. After deploy, the rate of fault increments will drop to match reality, and the success rate percentage will correct itself within one query window (e.g., `[1d]`).

Changes
- `queries.ts`: Removed `proofSets` sub-query and `$blockNumber` variable from `GET_PROVIDERS_WITH_DATASETS`. The query now fetches only provider-level aggregates (`totalFaultedPeriods`, `totalProvingPeriods`), reducing payload size.
- `types.ts`: Removed `DataSet` type, `dataSetSchema` Joi validator, `proofSets` from `ProviderDataSetResponse`, and `blockNumber` from `ProvidersWithDataSetsOptions`.
- `pdp-subgraph.service.ts`: Removed `blockNumber` parameter from `fetchProvidersWithDatasets`, `fetchWithRetry`, and `fetchMultipleBatchesWithRateLimit`.
- `data-retention.service.ts`: Removed overdue period estimation from `processProvider`. Method no longer needs `blockNumberBigInt` parameter. Uses subgraph-confirmed totals directly.
- `*.spec.ts`: Updated all tests to reflect the simplified query/types. Removed overdue-specific and proofSet-specific tests.
- `docs/checks/data-retention.md`: Updated documentation to reflect the removal and explain why.