*: insufficient round change metric#4512
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds observability for QBFT instances that time out in rounds >1 due to receiving fewer than quorum round-change messages (the scenario described in #4478, often caused by late leader payload propagation and peer set segregation). It introduces a new Prometheus counter with an outcome label to distinguish between nodes that later recover via MsgDecided vs instances that ultimately time out.
Changes:
- Added
core_consensus_insufficient_round_changes_total{protocol,duty,timer,outcome}and incremented it from the QBFT instance loop on the “recovered via decided” and “full timeout” paths. - Added a health check that warns when this counter increases over the scrape window.
- Added unit tests covering the detection helper, metric outcome behavior, and the new health check.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| p2p/sender.go | Minor formatting-only change (blank line) near deferred metric observation. |
| docs/metrics.md | Documents the new core_consensus_insufficient_round_changes_total metric and labels. |
| core/consensus/qbft/qbft.go | Tracks “insufficient round changes” during round timeouts and emits the new metric with `outcome=decided |
| core/consensus/qbft/qbft_internal_test.go | Adds tests for isInsufficientRoundChanges and for the two metric outcomes (decided vs timeout). |
| core/consensus/metrics/metrics.go | Registers the new counter vec and exposes IncInsufficientRoundChanges on the metrics interface/implementation. |
| core/consensus/metrics/metrics_test.go | Adds test for the new metric counter emission. |
| app/health/select.go | Simplifies sumLabels to always sum all series in a metric family (no label filtering). |
| app/health/checks.go | Adds insufficient_round_changes health check based on counter increase. |
| app/health/checks_internal_test.go | Adds coverage for the new health check behavior across decided/timeout outcomes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4512 +/- ##
==========================================
- Coverage 57.11% 57.09% -0.02%
==========================================
Files 245 245
Lines 32920 32965 +45
==========================================
+ Hits 18801 18822 +21
- Misses 11752 11770 +18
- Partials 2367 2373 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|



If a proposed value from leader is too close to the threshold of round expiration, we can see some peers receiving it on time while others don't. It'd be nice to track such discrepancies.
Here I'm adding a metric that tracks "insufficient round change messages" which in the majority of the scenarios means other peers did not agree on a round change. The case when they don't agree is if they are already doing another round. I'm splitting the metric in 2 - received "consensus decided" message or not. If a node receives "consensus decided", even if it wasn't part of the consensus (i.e.: stuck on another round), it still counts the consensus as successful and propagates the data decided upon to the subscribers (= the VC). If it doesn't receive it though, then the QBFT fails after some timeouts (= the VC never receives the data)
category: feature
ticket: #4478