fix(spartan): chain-halt alert + Block height panels — switch to aztec_status="proposed" #22978

Merged: alexghr merged 3 commits into merge-train/spartan from claudebox/3eebfbfc0754c503-4 on May 6, 2026
AztecBot (Collaborator) commented May 6, 2026

Summary

After PR #21285 (Apr 9, 2026) split ArchiverInstrumentation.processNewBlocks into per-proposed-block and per-checkpoint methods, anything reading aztec_archiver_block_height{aztec_status=""} stopped reflecting the pending chain on healthy networks. Two consumers in this repo were affected:

  1. The Chain - no new blocks alert (spartan/metrics/grafana/alerts/rules.yaml) — fires for healthy networks.
  2. The Block height time-series panels in three dashboards (aztec_network.json, network-tps.json, fisherman.json) — go flat per service.

The alert now reads aztec_status="proposed" and the panels read aztec_status=~"proposed|proven"; both match the post-split semantics.

Root cause

processNewProposedBlock (called per proposed block) writes aztec.archiver.block_height with aztec_status="proposed". processNewCheckpointedBlocks (called when L1-confirmed checkpoints get added) writes the same gauge with no status attribute. The fast-path checkpoint promotion added in #22716 means validators that already have the proposed checkpoint locally never enter addCheckpoints, so the empty-status series can sit flat for ≥ 10 min on healthy networks. That's enough to flip the alert and to flatten the dashboard panels.
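
To make the failure mode concrete, here is a minimal sketch of the two series as the alert sees them. It is a simplified model, not the real instrumentation: the sample values are invented, simple_increase is a rough stand-in for PromQL's increase() (no extrapolation, no reset handling), and the "no increase over the window" firing condition is an assumption, since the rule's threshold is not shown here.

```python
# Simplified model: two series of the same gauge, sampled once per minute.
# All sample values are made up for illustration.

def simple_increase(samples):
    """Rough stand-in for PromQL increase(): last sample minus first.

    Real increase() also extrapolates to the window edges and handles
    resets; neither matters for this illustration.
    """
    return samples[-1] - samples[0]


def would_fire(samples):
    # Assumed alert condition: no block-height increase over the window.
    return simple_increase(samples) == 0


# Eleven samples covering a 10-minute window.
proposed = [1000 + 10 * minute for minute in range(11)]  # ticks per proposed block
empty_status = [990] * 11  # flat: the checkpoint path was never entered

old_alert_fires = would_fire(empty_status)  # selector aztec_status=""
new_alert_fires = would_fire(proposed)      # selector aztec_status="proposed"
print(old_alert_fires, new_alert_fires)
```

In this model the empty-status selector reports a halt while the proposed series advances normally, which is the mismatch described above.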

The change reached production in early May 2026 — exactly when "Grafana shows the chain halted" reports started.

Changes

Alert (1 line):

- expr: max by (k8s_namespace_name) (increase(aztec_archiver_block_height{k8s_namespace_name!="",aztec_status=""}[10m]))
+ expr: max by (k8s_namespace_name) (increase(aztec_archiver_block_height{k8s_namespace_name!="",aztec_status="proposed"}[10m]))

Three dashboards (1 line each, identical change):

- "expr": "max by(service_name, aztec_status) (label_replace(aztec_archiver_block_height{service_namespace=\"$namespace\"}, \"aztec_status\", \"pending\", \"aztec_status\", \"^$\"))"
+ "expr": "max by(service_name, aztec_status) (aztec_archiver_block_height{service_namespace=\"$namespace\", aztec_status=~\"proposed|proven\"})"

The panels now show two series per service — the proposed chain (advances per proposed block) and the proven chain (advances per proven epoch) — instead of a single relabelled "pending" series that no longer ticks.
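
The difference between the two dashboard expressions can be sketched by applying each to a handful of label sets. The label sets and the service name below are hypothetical; the only behaviour modelled is that label_replace rewrites a series when its regex matches the source label, and that PromQL regex matchers are fully anchored.

```python
import re

# Hypothetical label sets for one service after the instrumentation split.
series = [
    {"service_name": "validator-0", "aztec_status": "proposed"},
    {"service_name": "validator-0", "aztec_status": "proven"},
    {"service_name": "validator-0"},  # written with no status attribute
]


def old_expression(all_series):
    # label_replace(..., "aztec_status", "pending", "aztec_status", "^$"):
    # series with an empty or missing aztec_status are relabelled "pending";
    # every other series passes through unchanged.
    result = []
    for labels in all_series:
        relabelled = dict(labels)
        if re.fullmatch(r"^$", labels.get("aztec_status", "")):
            relabelled["aztec_status"] = "pending"
        result.append(relabelled)
    return result


def new_expression(all_series):
    # aztec_status=~"proposed|proven": keep only series whose status
    # matches the regex in full; the empty-status series is dropped.
    return [
        labels
        for labels in all_series
        if re.fullmatch(r"proposed|proven", labels.get("aztec_status", ""))
    ]


old_statuses = sorted(s["aztec_status"] for s in old_expression(series))
new_statuses = sorted(s["aztec_status"] for s in new_expression(series))
print(old_statuses)
print(new_statuses)
```

Under these assumptions the stale empty-status series surfaces as "pending" in the old expression and is filtered out by the new one.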

Notes

  • The Current Block Heights stat panel in aztec_network.json (lines ~255 / ~295) and the Sync efficiency panel (line ~375) still query aztec_status="" directly. Those are out of scope for this PR — left for a follow-up so this PR stays a one-symbol-per-line change. They're less visually misleading than the Block height time-series anyway.
  • The IRM exporter dashboard / alert metric-name drift in spartan/metrics/irm-monitor/ (closed PR #22972, "fix(spartan): point IRM dashboard + alert at renamed checkpoint metrics") is unrelated: it was a red herring caused by repo-vs-deployed-state drift, not by the May regression.
  • Once this lands, the in-cluster Chain - no new blocks alert should clear within ~10 min on its next evaluation. If it sticks (Grafana state caching), pause/unpause the rule.
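
To check by hand whether the new alert expression is advancing before pausing the rule, one option is to run it as an instant query. The sketch below only builds the request URL; it assumes the metrics are reachable through a Prometheus-compatible HTTP API, and the base URL is a placeholder, not a real endpoint from this repo.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    # /api/v1/query is the Prometheus HTTP API instant-query endpoint.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# The alert expression from this PR, split across lines for readability.
EXPR = (
    "max by (k8s_namespace_name) "
    "(increase(aztec_archiver_block_height"
    '{k8s_namespace_name!="",aztec_status="proposed"}[10m]))'
)

# Placeholder host: replace with the actual Prometheus-compatible endpoint.
url = instant_query_url("http://prometheus.example:9090", EXPR)
print(url)
```

A non-zero value per namespace in the response would suggest the proposed chain is advancing and that a still-firing alert is stale state rather than a real halt.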

AztecBot added the ci-draft and claudebox labels May 6, 2026
Dashboard panels still queried aztec_archiver_block_height with empty
aztec_status (the pre-PR-21285 'pending' series). After the per-block
instrumentation split, that label only advances on L1 checkpoint sync,
so the panel goes flat on healthy networks. Show proposed and proven
series directly — they match the post-split semantics and what the
panel title implies.
AztecBot changed the title May 6, 2026
  from: fix(spartan): chain-halt alert query — switch to aztec_status="proposed"
  to: fix(spartan): chain-halt alert + Block height panels — switch to aztec_status="proposed"
alexghr changed the base branch from next to merge-train/spartan May 6, 2026 09:07
alexghr marked this pull request as ready for review May 6, 2026 09:07
alexghr added the ci-skip label and removed the ci-draft label May 6, 2026
alexghr merged commit 4a99638 into merge-train/spartan May 6, 2026
43 of 51 checks passed
alexghr deleted the claudebox/3eebfbfc0754c503-4 branch May 6, 2026 09:08