
feat(observability): Prometheus alerts, Grafana dashboards, Slack routing #83

Merged
0xfandom merged 3 commits into main from feat/observability-pr1 on Apr 21, 2026

@Pablosinyores
Owner

Summary

PR1 of 3 for issue #69 — delivers the core observability loop: Prometheus metrics → Grafana dashboards → Alertmanager → Slack. Everything self-hosted; no paid services.

  • 4 Grafana dashboards (file-provisioned): Overview, Latency (p50/p95/p99), Builder Performance, Risk & State
  • 7 Prometheus alert rules: AetherHalted, AetherInclusionRateLow, AetherE2ELatencyHigh, AetherNoOpportunities, AetherETHBalanceLow, AetherGasHigh, AetherBuilderDown
  • Alertmanager with a single Slack receiver; webhook URL is substituted from SLACK_WEBHOOK_URL at container startup — never committed
  • Metric-gap fixes in the Go executor: per-builder submission counters (builder, result labels) + latency histogram, aether_system_state gauge, aether_circuit_breaker_trips_total{reason} counter
  • MetricsObserver hook on RiskManager so internal/risk stays Prometheus-free — cmd/executor adapts to the gauge/counter at startup

Follow-up PRs for the remaining WS-7 scope are outlined in this comment on #69:

  • PR2 — Loki + Promtail log aggregation (WS-7.5)
  • PR3 — OpenTelemetry + Tempo tracing + synthetic canary (WS-7.6, WS-7.7)

Files

| Area | Files |
|---|---|
| Metrics | `cmd/executor/metrics.go`, `cmd/executor/submitter.go`, `cmd/executor/main.go`, `cmd/executor/metrics_test.go`, `internal/risk/manager.go`, `internal/risk/manager_test.go` |
| Prometheus | `deploy/docker/prometheus.yml`, `deploy/docker/prometheus/alerts.yml` |
| Alertmanager | `deploy/docker/alertmanager.yml` |
| Grafana | `deploy/docker/grafana/dashboards/{overview,latency,builders,risk}.json`, `deploy/docker/grafana/provisioning/dashboards/default.yml` |
| Compose | `deploy/docker/docker-compose.yml` (adds alertmanager service, mounts alerts + dashboards) |
| Env | `.env.example` (documents `SLACK_WEBHOOK_URL`) |

Test plan

  • `go test ./cmd/executor/... ./internal/risk/... -count=1 -race` — pass
  • `go vet ./cmd/executor/... ./internal/risk/...` — clean
  • `jq empty deploy/docker/grafana/dashboards/*.json` — valid JSON
  • YAML parse of all new/modified `.yml` files — valid
  • `docker compose up -d prometheus grafana alertmanager aether-rust aether-go` on a host with Docker — dashboards appear in Grafana, 7 rules load in `http://:9091/alerts`, Alertmanager reachable at `:9093`
  • Trigger each alert (tweak thresholds or push synthetic values), confirm Slack message carries the severity label and alert description
  • Walk through a simulated incident using only Grafana + Slack alerts (E2 gate criterion from the issue)

Notes for reviewers

  • The Slack receiver uses an entrypoint `sed` substitution to inject the webhook from env. This avoids feature flags for env expansion across Alertmanager versions and keeps the committed config free of secrets. The substituted file is written to `/tmp` inside the container, not to a mounted path.
  • `AetherBuilderDown` uses an `and` clause so it only fires for builders that actually received submissions in the window (avoids paging on an entirely idle system).
  • Dashboards use `schemaVersion: 38` and explicit `uid` fields so they are reimport-safe.
  • Observer callbacks run under `rm.mu`. Prometheus primitives are lock-free, so this is fine today. Any future observer doing blocking I/O would stall risk processing — documented on the interface.
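
The decoupling described in these notes can be pictured with a minimal sketch. It is an illustration under assumptions, not the PR's code: the method names `OnStateChange` and `OnCircuitBreakerTrip` appear in the review, but the `State` type, the struct fields, and the `recordingObserver` double are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical stand-in for the risk package's state type.
type State int

const (
	StateRunning State = iota
	StateHalted
)

// MetricsObserver is the only metrics-related type the risk package sees.
// Callbacks run while the manager's mutex is held, so they must not block.
type MetricsObserver interface {
	OnStateChange(s State)
	OnCircuitBreakerTrip(reason string)
}

// RiskManager holds no Prometheus types; it only talks to the interface.
type RiskManager struct {
	mu    sync.Mutex
	state State
	obs   MetricsObserver
}

// SetMetricsObserver registers obs and immediately publishes the current
// state, so the gauge is populated before any transition happens.
func (rm *RiskManager) SetMetricsObserver(obs MetricsObserver) {
	rm.mu.Lock()
	defer rm.mu.Unlock()
	rm.obs = obs
	if obs != nil {
		obs.OnStateChange(rm.state)
	}
}

// recordingObserver is a test double. A real adapter in cmd/executor would
// call gauge.Set and counter.Inc here instead of appending to slices.
type recordingObserver struct {
	states []State
	trips  []string
}

func (r *recordingObserver) OnStateChange(s State) { r.states = append(r.states, s) }

func (r *recordingObserver) OnCircuitBreakerTrip(reason string) {
	r.trips = append(r.trips, reason)
}

func main() {
	rm := &RiskManager{state: StateRunning}
	obs := &recordingObserver{}
	rm.SetMetricsObserver(obs)
	fmt.Println("initial states published:", len(obs.states))
}
```

Because the interface lives next to its caller, the adapter package can import both the risk package and the metrics library without the dependency ever pointing the other way.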

Closes part of #69 (see linked comment for PR split).

Collaborator

@0xfandom 0xfandom left a comment


Review — PR1 of 3 for issue #69 (E2 Observability Floor)

Reviewed against the reduced PR1 scope (WS-7.2, WS-7.3 Slack-only, WS-7.4, metric gap fixes). PR2 (Loki) and PR3 (OTEL/canary) scope is correctly excluded from evaluation.

Summary

Clean, well-scoped delivery. Metric plumbing is correct, dashboard PromQL references all resolve to real metrics in the codebase, alert rules align with the issue table, and the Slack secret handling avoids committing the webhook. Observer pattern on RiskManager is implemented correctly and keeps internal/risk Prometheus-free. Approving with a handful of non-blocking observations below.

Acceptance Criteria (PR1 scope)

| # | Criterion | Status | Evidence |
|---|---|---|---|
| 1 | WS-7.2: 4 Grafana dashboards, file-provisioned | Met | deploy/docker/grafana/dashboards/{overview,latency,builders,risk}.json: all carry "schemaVersion": 38 and explicit "uid" (overview.json:2-5, latency.json:2-5, builders.json:2-5, risk.json:2-5); provider at deploy/docker/grafana/provisioning/dashboards/default.yml |
| 2 | WS-7.3 (reduced): Alertmanager + single Slack receiver | Met | deploy/docker/alertmanager.yml:11-23 single slack-default receiver; Slack-only per standing policy (PagerDuty/Discord explicitly dropped) |
| 3 | WS-7.4: 7 alert rules with correct thresholds | Met | deploy/docker/prometheus/alerts.yml:6-73: AetherHalted (for 1m), AetherInclusionRateLow (for 10m), AetherE2ELatencyHigh (p99>100ms for 5m), AetherNoOpportunities (for 10m), AetherETHBalanceLow (for 1m), AetherGasHigh (for 30s), AetherBuilderDown (for 2m with `and` idle-gate) |
| 4 | Per-builder metrics (builder, result labels) + latency histogram | Met | cmd/executor/metrics.go:63-71 builderSubmissionsTotal and builderLatencyMs; wired at submitter.go:334 via recordBuilderResult; label cardinality bounded by configured builder list (not user input) |
| 5 | aether_system_state gauge (0/1/2/3) | Met | cmd/executor/metrics.go:72-75; mapping at main.go:396-409 with -1 sentinel for unmapped states |
| 6 | aether_circuit_breaker_trips_total{reason} counter | Met | cmd/executor/metrics.go:76-79; labels emitted from manager.go:147-156 with consecutive_bug_reverts and daily_loss_exceeded |
| 7 | MetricsObserver hook keeps internal/risk Prometheus-free | Met | internal/risk/manager.go:99-105 interface; cmd/executor/main.go:381-409 adapter; no Prometheus import in internal/risk/* |
| 8 | Initial state published at startup | Met | manager.go:137-144 SetMetricsObserver immediately emits OnStateChange(rm.state.Current()); asserted in manager_test.go:607-621 |
| 9 | Docker compose wiring | Met | deploy/docker/docker-compose.yml:50-86: prometheus depends_on alertmanager, volumes mounted at paths matching prometheus.yml:5,10 |
| 10 | Slack webhook not committed | Met | alertmanager.yml:14 uses __SLACK_WEBHOOK_URL__ placeholder; substitution in docker-compose.yml:75-83 writes to /tmp/alertmanager.yml, leaving the read-only mount untouched; documented in .env.example:24-28 |
| 11 | No AI attribution in PR body | Met | PR body contains no Claude/AI references |
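
Criterion 5's state-to-gauge mapping with a -1 sentinel can be sketched as follows. This is a guess at the shape only: the state names other than Halted, and every numeric value except Halted = 3 (which the AetherHalted rule compares against), are assumptions rather than the contents of `main.go`.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the risk package's state type.
type State string

const (
	StateRunning  State = "running"  // name and value 0 are assumptions
	StateDegraded State = "degraded" // name and value 1 are assumptions
	StatePaused   State = "paused"   // name and value 2 are assumptions
	StateHalted   State = "halted"   // value 3 matches the AetherHalted rule
)

// stateToInt converts a state to the gauge value. Unknown states map to -1
// so that a newly added state is visible on a dashboard instead of being
// silently reported as one of the known values.
func stateToInt(s State) float64 {
	switch s {
	case StateRunning:
		return 0
	case StateDegraded:
		return 1
	case StatePaused:
		return 2
	case StateHalted:
		return 3
	default:
		return -1
	}
}

func main() {
	for _, s := range []State{StateRunning, StateHalted, State("mystery")} {
		fmt.Printf("%s -> %v\n", s, stateToInt(s))
	}
}
```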

Items out of scope (PR2/PR3) — not evaluated here: WS-7.5 Loki/Promtail, WS-7.6 OTEL/Tempo, WS-7.7 synthetic canary.

Findings

Observations (non-blocking)

deploy/docker/prometheus/alerts.yml:15-25: AetherInclusionRateLow denominator clamp can false-fire on low-traffic periods.
The clamp_min(sum(rate(bundles_submitted[1h])), 1) floor forces the denominator to 1 req/s even when actual throughput is a few per hour. During a slow period (say 0.0005/s submitted, 0.0001/s included) the ratio becomes ~0.0001, which trips the <0.20 threshold and will page via Slack even though nothing is wrong. AetherNoOpportunities partially guards this (also fires), but the inclusion alert itself is noisy. Consider gating with and sum(rate(bundles_submitted[1h])) > 0.005 (≈18/hr) so the alert only fires when you have enough denominator samples to be meaningful. Not a blocker — Slack-only routing with 4h repeat_interval (alertmanager.yml:9) limits noise.
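
The arithmetic behind this observation can be checked with a small model. The function names and the `minSubmittedPerSec` parameter are hypothetical; the sketch mirrors only the ratio, the floor, and the suggested traffic gate, not the real rule's `for` duration or label handling.

```go
package main

import (
	"fmt"
	"math"
)

// inclusionRatio mirrors included / clamp_min(submitted, floor).
// With floor = 1, any submission rate below 1/s is divided by 1 instead of
// by the real rate, which shrinks the ratio.
func inclusionRatio(includedPerSec, submittedPerSec, floor float64) float64 {
	return includedPerSec / math.Max(submittedPerSec, floor)
}

// shouldFire applies the "< 0.20" threshold, optionally gated on a minimum
// submission rate (the review suggests 0.005/s, roughly 18 per hour).
func shouldFire(includedPerSec, submittedPerSec, floor, minSubmittedPerSec float64) bool {
	if submittedPerSec <= minSubmittedPerSec {
		return false // not enough traffic for the ratio to mean anything
	}
	return inclusionRatio(includedPerSec, submittedPerSec, floor) < 0.20
}

func main() {
	included, submitted := 0.0001, 0.0005 // the slow period from the review

	fmt.Printf("ratio with floor 1: %.4f\n", inclusionRatio(included, submitted, 1))
	fmt.Println("fires without gate:", shouldFire(included, submitted, 1, 0))
	fmt.Println("fires with 0.005/s gate:", shouldFire(included, submitted, 1, 0.005))
}
```

The gate only addresses the quiet-period case. Whenever real traffic sits below the floor, the ratio is still distorted, so the floor value itself deserves a second look.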

deploy/docker/alertmanager.yml:14 — sed substitution will break if the webhook URL ever contains |.
Slack incoming webhooks today are https://hooks.slack.com/services/<team>/<hook>/<secret> — only / and alphanumerics — so the current delimiter is safe. Flagging only so that if you ever repurpose the substitution for a different receiver (PagerDuty URL, OpsGenie API, etc.), remember to pick a delimiter the payload cannot contain or switch to envsubst.
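
For comparison, a literal placeholder replacement has no delimiter or escape characters to worry about. The sketch below is not what the PR does (the PR uses `sed` in the container entrypoint); `renderConfig` is a hypothetical helper, and rejecting an empty value is an added assumption about what an operator would want.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// placeholder is the token the committed alertmanager.yml carries.
const placeholder = "__SLACK_WEBHOOK_URL__"

// renderConfig substitutes the webhook literally. Characters such as '|',
// '&' or '\' in the URL are copied through unchanged because no pattern
// language is involved.
func renderConfig(tmpl, webhook string) (string, error) {
	if strings.TrimSpace(webhook) == "" {
		return "", errors.New("webhook URL is empty")
	}
	if !strings.Contains(tmpl, placeholder) {
		return "", errors.New("placeholder not found in template")
	}
	return strings.ReplaceAll(tmpl, placeholder, webhook), nil
}

func main() {
	tmpl := "api_url: " + placeholder + "\n"

	out, err := renderConfig(tmpl, "https://example.invalid/hook?a=1&b=2|c")
	fmt.Printf("%q %v\n", out, err)

	_, err = renderConfig(tmpl, "")
	fmt.Println("empty webhook:", err)
}
```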

cmd/executor/main.go:165: SetMetricsObserver is called before startMetricsServer() (main.go:195).
Ordering is fine (the Prometheus registry is process-global and initialized in metrics.go:82-98 init()), but worth noting that the initial OnStateChange(Running) sets the gauge before /metrics is reachable. No scrape can miss it, because a gauge holds its latest value: any later scrape sees the current state. Non-issue.

internal/risk/manager.go:99-105 — observer doc comment covers the lock concern.
The interface godoc explicitly says "Called under rm.mu — keep callbacks non-blocking (Prometheus primitives are fine)." Current executorMetricsObserver (main.go:384-392) only calls Set / WithLabelValues(...).Inc(), both lock-free. A future contributor adding a blocking implementation (e.g. an HTTP webhook) would stall all risk processing — the doc comment is the only guardrail. Consider adding a TestMetricsObserver_NonBlocking assertion in a follow-up (e.g. a benchmark that fails if the observer callback takes >1µs) if you want enforcement rather than documentation.
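
One way to turn the doc comment into an enforced check is a timeout-based assertion, sketched below. It is coarser than the >1µs benchmark mentioned above and is only an illustration: `returnsWithin`, the two callbacks, and the budgets are hypothetical names and values.

```go
package main

import (
	"fmt"
	"time"
)

const (
	generousBudget = time.Second           // ample for a lock-free callback
	tightBudget    = 50 * time.Millisecond // enough to expose a blocked one
)

// returnsWithin reports whether fn returns before budget elapses.
// fn runs in its own goroutine so a blocking callback cannot hang the check.
func returnsWithin(fn func(), budget time.Duration) bool {
	done := make(chan struct{})
	go func() {
		fn()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(budget):
		return false
	}
}

// never is intentionally left open to simulate an observer stuck on I/O.
var never = make(chan struct{})

func fastCallback()    {}
func blockedCallback() { <-never }

func main() {
	fmt.Println("fast callback ok:", returnsWithin(fastCallback, generousBudget))
	fmt.Println("blocked callback ok:", returnsWithin(blockedCallback, tightBudget))
}
```

A test built on this would wrap the observer's `OnStateChange` call in `fn`. The blocked goroutine is leaked on purpose, which is acceptable in a short-lived test process.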

cmd/executor/metrics.go:146-149: result label values.
Only two values ever emitted ("success", "failure") — recordBuilderResult hardcodes them. Label cardinality is |builders| × 2, fully bounded. Good.

cmd/executor/metrics.go:143-149 vs submitter.go:334 — metrics recorded even when submission never left the process.
When submitter.SubmitToAll returns the synthetic {Builder: "all", Success: false, ...} result because bundle.RawTxs is empty (submitter.go:101-107), that path skips the per-builder goroutine so recordMetrics is not called — good, no builder="all" cardinality leak. Verified.

Observer notification coverage — every state mutation is covered. notifyTrip is invoked from exactly the two state-changing paths: RecordRevert (manager.go:315) and RecordTrade (manager.go:345). SystemStateMachine.Transition is not called from anywhere else under this PR's diff. If a future PR adds a degraded-state transition (node latency breaker, bundle miss rate alert) it MUST route through notifyTrip or the gauge will stop reflecting reality — worth a comment on SystemStateMachine itself. Not a PR1 issue.

Nits

deploy/docker/prometheus/alerts.yml:57: AetherGasHigh severity=info.
The issue lists the gas breaker as halting at >300 gwei. The alert here is informational only (no page, just a Slack notice). Because the circuit-breaker side of this (in RiskManager.PreflightCheck, manager.go:235) already rejects arbs and AetherHalted would fire if a halt state is ever wired to high-gas, the info-only severity seems intentional and correct. Noting for visibility.

deploy/docker/docker-compose.yml:58: prometheus depends_on: [alertmanager].
Reverse of the usual pattern (prometheus can come up without alertmanager; it will just buffer alerts). Functionally fine — just means if alertmanager fails to start, prometheus stays down too. Might prefer independent startup to keep metrics scraping alive during an alertmanager outage.

deploy/docker/alertmanager.yml:6-9: group_wait: 30s, group_interval: 5m, repeat_interval: 4h.
Sane defaults for a single-channel Slack setup. No alert-storm risk.

What's Good

  • Observer-pattern decoupling (internal/risk/manager.go:99-156) is the right abstraction and keeps the risk package dependency-free.
  • TestMetricsObserver_* tests (manager_test.go:607-685) actually exercise the initial-emit, bug-revert, and daily-loss paths — not smoke-only.
  • TestRecordBuilderResult_ScrapeLabels (metrics_test.go:131-161) asserts the full labeled exposition text, which would catch accidental label-name drift.
  • Slack secret handling via sed into /tmp avoids mutating the read-only mount and keeps the webhook out of git.
  • Dashboard PromQL is internally consistent: every referenced metric (aether_detection_latency_ms_bucket, aether_simulation_latency_ms_bucket, aether_arbs_published_total, aether_system_state, aether_circuit_breaker_trips_total, per-builder vec) exists in either crates/grpc-server/src/metrics.rs or cmd/executor/metrics.go. No dead panels.
  • AetherBuilderDown idle-gate using and on (builder) sum by (builder) (rate(...[2m])) > 0 correctly prevents firing on builders that aren't receiving any submissions, which addresses the stated concern.

Verdict

APPROVE — PR1 scope is delivered end-to-end with correct metric plumbing, live dashboard queries, and functioning Slack routing. The two items worth revisiting (inclusion-rate denominator clamp, observer blocking contract) are tunables / future-contributor guardrails, not correctness bugs. Ship it; fold the inclusion-rate gate into PR2 if it becomes noisy in staging.

Pablosinyores and others added 3 commits April 21, 2026 11:14
Add Prometheus metrics required by PR1 observability dashboards and alert
rules: per-builder submission counters, submission latency histogram,
system-state gauge, and circuit-breaker-trip counter.

Introduces a MetricsObserver hook on RiskManager so cmd/executor can wire
the gauge/counter without internal/risk depending on Prometheus.
…ger+Slack

Wires Prometheus metrics through four provisioned Grafana dashboards and
seven alert rules routed to a single Slack channel via Alertmanager.

- 4 Grafana dashboards with file-based provisioning: Overview, Latency
  (p50/p95/p99), Builder Performance, Risk & State
- 7 Prometheus alert rules covering halted state, inclusion rate,
  E2E p99 latency, opportunity drought, low ETH balance, high gas, and
  per-builder outage
- Alertmanager service in docker-compose with a single Slack receiver;
  webhook URL is substituted from SLACK_WEBHOOK_URL at container startup
  so the secret never lands in the committed config
- prometheus.yml now loads rule_files and points at alertmanager:9093

Part of issue #69. Loki/Promtail log aggregation and OTel/Tempo tracing
are deferred to follow-up PRs.
- grafana datasource uid pinned so dashboards resolve on fresh deploy
- clamp_min floor to 1e-9 (was 1) so inclusion-rate alert stops firing permanently
- pre-register {success,failure} builder labels so AetherBuilderDown can fire on first failure
- hoist circuit-breaker observer above state.Transition so trips are counted even when halted
@0xfandom 0xfandom force-pushed the feat/observability-pr1 branch from bc19ae0 to 33de991 on April 21, 2026 06:05
@0xfandom
Collaborator

Re-review round 2 — all blockers addressed

Rebased onto current main (picks up the chore(clippy): fix 12 workspace lint errors + PR #85 work). Fix commit: 33de991.

Must-fix (blockers)

| # | Item | Status |
|---|---|---|
| 1 | Grafana datasource UID mismatch: dashboards would be DOA on fresh deploy (every panel would show "Data source prometheus not found") | Fixed in 33de991. deploy/docker/grafana/provisioning/datasources/prometheus.yml now pins uid: prometheus to match the hardcoded reference in all 4 dashboard JSONs. |
| 2 | AetherInclusionRateLow fires permanently at <60 bundles/min because clamp_min(..., 1) floors the per-second denominator at 1 = 60/min. Real cadence is ~10/min = 0.17/s, so the ratio is forced artificially tiny | Fixed in 33de991. Changed to clamp_min(..., 1e-9) in both alerts.yml:19 and overview.json:49. Intent is division-by-zero guard only. (builders.json already used 0.0001 correctly and was not touched.) |
| 3 | AetherBuilderDown can never fire for a builder whose first submission fails: CounterVec only emits a series after the first WithLabelValues, so rate({result="success"}) is empty, empty == 0 is empty, and on (builder) finds no LHS match | Fixed in 33de991. Added PreRegisterBuilderLabels(names []string) helper in metrics.go + wired into NewSubmitter. Every configured builder gets both {success} and {failure} series pre-registered at 0 on startup. New test TestPreRegisterBuilderLabels_BothSeriesExistAtZero asserts both series are present on /metrics scrape. |
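
Blocker 3 comes down to series existence, which a toy model makes visible. The sketch below is not the Prometheus client or a PromQL engine: `counterVec` is a hypothetical map-backed stand-in, counter values are treated as the increase over the alert window, and the builder names are invented.

```go
package main

import (
	"fmt"
	"sort"
)

type seriesKey struct{ builder, result string }

// counterVec is a toy stand-in for a labeled counter: a series exists only
// after it has been touched at least once.
type counterVec struct{ series map[seriesKey]float64 }

func newCounterVec() *counterVec {
	return &counterVec{series: map[seriesKey]float64{}}
}

func (c *counterVec) add(builder, result string, v float64) {
	c.series[seriesKey{builder, result}] += v
}

// preRegister creates both result series at zero for every builder.
func (c *counterVec) preRegister(builders []string) {
	for _, b := range builders {
		c.add(b, "success", 0)
		c.add(b, "failure", 0)
	}
}

// buildersDown approximates the alert expression: a success series that
// exists with zero increase, AND a positive total for the same builder.
// A missing success series yields no match, just as an empty vector would.
func buildersDown(c *counterVec) []string {
	total := map[string]float64{}
	for k, v := range c.series {
		total[k.builder] += v
	}
	var down []string
	for k, v := range c.series {
		if k.result == "success" && v == 0 && total[k.builder] > 0 {
			down = append(down, k.builder)
		}
	}
	sort.Strings(down)
	return down
}

func main() {
	lazy := newCounterVec()
	lazy.add("alpha", "failure", 1) // first submission ever fails
	fmt.Println("without pre-registration:", buildersDown(lazy))

	eager := newCounterVec()
	eager.preRegister([]string{"alpha", "beta"})
	eager.add("alpha", "failure", 1)
	fmt.Println("with pre-registration:", buildersDown(eager))
}
```

In the second case `beta` stays out of the result because its total is zero, which is the idle gate doing its job.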

Should-fix (non-blocking but in this PR)

| # | Item | Status |
|---|---|---|
| NB1 | notifyTrip silently drops the circuit-breaker-trip metric when the state transition is rejected (e.g. already Halted), which hides the exact "we're still seeing reverts in halted state" signal | Fixed in 33de991. Hoisted rm.metricsObs.OnCircuitBreakerTrip(reason) above rm.state.Transition(...) so every trip is counted unconditionally. OnStateChange stays below (only fires on real transitions, by design). New test TestMetricsObserver_TripCountedEvenWhenTransitionRejected forces Halted state and asserts the trip is still counted once. |
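
The ordering fix in NB1 can be illustrated with a stripped-down state machine. Everything here is a hypothetical reduction: the real `Transition` rules are not shown in this thread, so the sketch assumes only that a transition into the current state is rejected.

```go
package main

import "fmt"

type State int

const (
	StateRunning State = iota
	StateHalted
)

type stateMachine struct{ current State }

// Transition reports whether the state actually changed. Rejecting a move
// into the current state is an assumption made for this sketch.
func (m *stateMachine) Transition(to State) bool {
	if m.current == to {
		return false
	}
	m.current = to
	return true
}

type counters struct {
	trips        map[string]int
	stateChanges int
}

// notifyTrip counts the trip first, then attempts the transition, so a
// rejected transition can no longer swallow the trip metric.
func notifyTrip(m *stateMachine, c *counters, reason string, to State) {
	c.trips[reason]++
	if m.Transition(to) {
		c.stateChanges++
	}
}

func main() {
	m := &stateMachine{current: StateHalted}
	c := &counters{trips: map[string]int{}}
	notifyTrip(m, c, "consecutive_bug_reverts", StateHalted)
	fmt.Println("trips:", c.trips["consecutive_bug_reverts"], "state changes:", c.stateChanges)
}
```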

Deliberately skipped (noise, not worth in-PR churn)

  • .env.example:13 comment wording about empty SLACK_WEBHOOK_URL behaviour.
  • ${SLACK_WEBHOOK_URL} sed substitution happening at compose-time vs container-time.
  • aether_system_state gauge encoding comment / shared constants.
  • Alertmanager inhibit_rule being a no-op (harmless dead config).
  • api_url: __SLACK_WEBHOOK_URL__ unquoted.

Validation

  • go test ./... -race -count=1 — pass (4 packages)
  • go vet ./... + go build ./... — clean
  • cargo clippy --workspace --release --bins --tests -- -D warnings — clean
  • All 4 dashboard JSONs valid; all alertmanager/prometheus/datasource YAMLs valid

Branch is now rebased on top of current main, so the stacked PRs #84 / #86 will rebase through this cleanly. Ready for re-review.

Collaborator

@0xfandom 0xfandom left a comment


Summary

This PR delivers 4 Grafana dashboards, an Alertmanager routing to Slack, and 7 Prometheus alert rules, wired into deploy/docker/docker-compose.yml. On the Go side it adds a MetricsObserver in internal/risk so risk-manager state changes and breaker trips become Prometheus series, plus a PreRegisterBuilderLabels helper so AetherBuilderDown can fire against silent-from-birth builders. Round-2 fixes (datasource UID, clamp_min(..., 1e-9), pre-registered builder labels, trip-before-transition ordering) are all correctly landed with backing tests. No merge-blockers; a handful of medium/low observations worth a follow-up.

Issue #69 AC table

| WS | Item | Status | Evidence |
|---|---|---|---|
| WS-7.2 | overview dashboard | Met | deploy/docker/grafana/dashboards/overview.json:1-175: 7 panels referencing aether_arbs_published_total, aether_executor_bundles_submitted_total/included_total, aether_eth_balance, aether_gas_price_gwei, aether_daily_pnl_eth, aether_executor_profit_wei_total, aether_executor_gas_spent_wei_total |
| WS-7.2 | latency dashboard | Met | latency.json:1-68: detection/sim/e2e p50/p95/p99 histograms |
| WS-7.2 | builders dashboard | Met | builders.json:1-86: submission rate, success rate, p95 latency, totals table |
| WS-7.2 | risk/state dashboard | Met | risk.json:1-159: state timeline + stat, breaker trips, risk rejections, PnL, balance |
| WS-7.2 | provisioning | Met | grafana/provisioning/dashboards/default.yml:1-14, datasources/prometheus.yml:1-11 |
| WS-7.3 | Alertmanager + receiver | Met (Slack-only per team directive) | docker-compose.yml:63-86, alertmanager.yml:1-28; PagerDuty/Discord deliberately deferred |
| WS-7.3 | Webhook-via-env, not committed | Met | alertmanager.yml:14 uses __SLACK_WEBHOOK_URL__ placeholder; sed substitution at container start; .env.example:24-28 documents it |
| WS-7.4 | 7 alerts | Met | prometheus/alerts.yml:1-74 defines all 7: AetherHalted, AetherInclusionRateLow, AetherE2ELatencyHigh, AetherNoOpportunities, AetherETHBalanceLow, AetherGasHigh, AetherBuilderDown |
| WS-7.4 | Prometheus loads rules | Met | deploy/docker/prometheus.yml:5-6 declares rule_files: [/etc/prometheus/alerts.yml]; mounted via docker-compose.yml:57 |
| WS-7.4 | Prometheus talks to Alertmanager | Met | prometheus.yml:8-11 static_configs target alertmanager:9093; both on aether-net |

Round-2 verification

  1. Datasource UID — Verified. grafana/provisioning/datasources/prometheus.yml:6 sets uid: prometheus, matching datasource.uid: "prometheus" across every panel of all four dashboards (overview.json:16,45,74,100,126,141,165, latency.json:16,29,44, builders.json:16,32,56,72, risk.json:16,48,64,80,106,132).

  2. clamp_min(..., 1e-9) — Verified. prometheus/alerts.yml:19 and overview.json:49 both use 1e-9. The only remaining clamp_min is builders.json:36 using 0.0001 — intentional for the panel's percentage-ratio guard and is a dashboard, not an alert. Correct.

  3. PreRegisterBuilderLabels — Verified.

    • cmd/executor/metrics.go:159-164: iterates names and calls WithLabelValues(name, "success").Add(0) and WithLabelValues(name, "failure").Add(0).
    • Wired at cmd/executor/submitter.go:78-87: NewSubmitter builds the names slice and calls PreRegisterBuilderLabels(names) before fan-out starts.
    • Race-safe: CounterVec.WithLabelValues is documented goroutine-safe; NewSubmitter runs once at startup.
    • TestPreRegisterBuilderLabels_BothSeriesExistAtZero (metrics_test.go:256-302) asserts both series on in-process counter and on the scraped /metrics text — exactly the property AetherBuilderDown depends on.
  4. notifyTrip hoist — Verified. internal/risk/manager.go:154-165: OnCircuitBreakerTrip(reason) runs BEFORE state.Transition(newState); OnStateChange(newState) stays AFTER and only fires on successful transition. The function runs with rm.mu held (call sites RecordRevert:324 and RecordTrade:354 both take rm.mu.Lock()). Observer contract requires non-blocking callbacks; executorMetricsObserver (cmd/executor/main.go:438-446) only touches lock-free Prometheus primitives. TestMetricsObserver_TripCountedEvenWhenTransitionRejected (manager_test.go:693-742) forces StateHalted, drains the initial observer call, triggers 3 bug reverts, and asserts exactly one consecutive_bug_reverts trip AND zero new state changes.

Must-fix blockers

None. This PR is mergeable.

Should-fix nits (follow-up PR territory)

  • MEDIUM — Slack webhook leaks into docker inspect. docker-compose.yml:79 uses ${SLACK_WEBHOOK_URL} inside the entrypoint shell block. Compose interpolates ${VAR} in block scalars at parse time, so the literal URL ends up baked into the container's entrypoint ARGV and is visible via docker inspect aether-alertmanager. The environment: block already propagates the variable into the container env; the entrypoint can use $$SLACK_WEBHOOK_URL (escaped $) so expansion happens inside the container shell at runtime, keeping the secret out of docker inspect / compose logs. Also worth replacing the | sed delimiter, and possibly the sed approach itself, with something that survives sed-special characters (&, \) in the URL.

  • MEDIUM — Empty SLACK_WEBHOOK_URL → silent alerting loss. If the env var is unset, sed substitutes an empty string and alertmanager.yml:14 becomes api_url: . Alertmanager fails config load and the container crashloops under restart: unless-stopped. There is no watchdog alert for "Alertmanager is down" — Prometheus doesn't currently scrape Alertmanager (prometheus.yml:13-19). Add either (a) an Alertmanager scrape job + up{job="alertmanager"} == 0 alert rule, or (b) a startup guard in the entrypoint that logs loudly and refuses to start on empty webhook.

  • MEDIUM — AetherNoOpportunities fires on cold start / low-activity windows. alerts.yml:36-43 fires when rate(aether_arbs_published_total[10m]) * 60 < 5 for 10m. A freshly-started bot produces zero published arbs for the first window; this pages ~20 min after every restart. Consider unless on() (time() - process_start_time_seconds < 1800) to suppress warm-up, or gate on aether_blocks_processed_total > 0.

  • MEDIUM — AetherBuilderDown behavior for Enabled: false builder. Pre-registration handles a configured-but-never-attempted builder. If config/builders.yaml lists a builder with Enabled: false, pre-registration still runs (good), and rate(...) is zero for both success and total, so the and on (builder) guard sees a zero total and correctly does NOT fire. Worth documenting: an operator may later wonder why a disabled builder is "silent" per the alert.
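
The cold-start suppression suggested for AetherNoOpportunities amounts to a simple predicate. The sketch below models only the threshold and the 1800-second warm-up from the suggested expression. It ignores the rule's `for: 10m` clause, and the function and constant names are hypothetical.

```go
package main

import "fmt"

const (
	warmupSeconds    = 1800.0 // from time() - process_start_time_seconds < 1800
	minArbsPerMinute = 5.0    // threshold in the existing rule
)

// noOpportunitiesFiring reports whether the alert condition holds once the
// warm-up window is excluded. uptimeSeconds stands in for
// time() - process_start_time_seconds.
func noOpportunitiesFiring(arbsPerMinute, uptimeSeconds float64) bool {
	if uptimeSeconds < warmupSeconds {
		return false // still warming up: suppress
	}
	return arbsPerMinute < minArbsPerMinute
}

func main() {
	fmt.Println("10 min after restart, 0 arbs/min:", noOpportunitiesFiring(0, 600))
	fmt.Println("60 min after restart, 0 arbs/min:", noOpportunitiesFiring(0, 3600))
	fmt.Println("60 min after restart, 12 arbs/min:", noOpportunitiesFiring(12, 3600))
}
```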

Low / nits

  • LOW — AetherHalted hard-codes aether_system_state == 3 (alerts.yml:7). The mapping lives in cmd/executor/main.go:450-463 (stateToInt), the metric help text (metrics.go:75), and risk.json:25-28,139-143. A future renumber will miss one. Either publish as aether_system_state{state="halted"} 1/0 or at minimum add a cross-reference comment.

  • LOW — Compose uses :latest tags for grafana/prometheus/alertmanager. Silent breaking upgrades on docker compose pull. Pin to known-good majors and bump explicitly.

  • LOW — inhibit_rules (alertmanager.yml:25-28) is nearly a no-op since each alert has a unique alertname. Not a bug, but check that the design intent matches operator expectation ("same alertname at critical suppresses its warning sibling" is not true for these alerts).

  • NIT — TestRecordBuilderResult_ScrapeLabels (metrics_test.go:224-254) uses real builder names (flashbots, titan) which pollute the global registry. TestPreRegisterBuilderLabels_BothSeriesExistAtZero already avoids this by using unique prereg_alpha/prereg_beta. Apply the same pattern to the older test when you touch it.

  • NIT — metrics.go:180 logs a "precision loss" warning on every profit over 2^53 wei. Since 2^53 wei is only about 0.009 ETH, cumulative profitTotalWei.Add(f) will cross that mark early in the bot's lifetime, producing ongoing log spam. Downgrade to debug or sample.
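
The 2^53 boundary behind the precision-loss warning is a property of float64 and is easy to demonstrate. The helper names below are hypothetical and the sketch is independent of how `metrics.go` actually performs its check.

```go
package main

import "fmt"

// exactAsFloat64 reports whether wei survives a round trip through float64.
// Intended for values well below 2^63, where the conversion back to int64
// is well defined.
func exactAsFloat64(wei int64) bool {
	return int64(float64(wei)) == wei
}

// weiToEth converts wei to ETH for display only.
func weiToEth(wei int64) float64 {
	return float64(wei) / 1e18
}

func main() {
	const limit int64 = 1 << 53 // 9007199254740992

	fmt.Printf("2^53 wei = %.6f ETH\n", weiToEth(limit))
	fmt.Println("2^53 exact:", exactAsFloat64(limit))
	fmt.Println("2^53+1 exact:", exactAsFloat64(limit+1))

	// Adding 1 wei to a float64 total already at 2^53 changes nothing.
	total := float64(limit)
	fmt.Println("total+1 == total:", total+1 == total)
}
```

Past this point a float64 counter can only move in steps of 2 wei or more, which is harmless for dashboards but explains why a warning tied to the boundary triggers so readily.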

Can-defer (follow-up tickets)

  • Alertmanager scrape + up{job="alertmanager"} watchdog alert (addresses the silent-outage gap above).
  • PagerDuty + Discord receivers (explicitly deferred per team directive; file a separate ticket to track).
  • State-gauge label migration.
  • Docker image pinning across observability stack.

Verdict

APPROVE WITH RESERVATIONS — all four round-2 fixes landed correctly with tests that actually prove the property. Issue #69 WS-7.2/7.3/7.4 scope is fully delivered (Slack-only routing is an acknowledged deviation). No production-risk blockers. The medium items are worth a follow-up PR but do not need to block this merge. Ship it.

@0xfandom 0xfandom merged commit 9ee202a into main Apr 21, 2026
4 checks passed
0xfandom added a commit that referenced this pull request Apr 21, 2026
Non-duplicate bits from PR #98 landed on top of main after PR #83/#84
covered the core stack. Keeps main's dashboards + Slack-only alertmanager
intact; drops PR #98's PagerDuty/Discord routing per team directive.
Pablosinyores pushed a commit that referenced this pull request Apr 21, 2026
Non-duplicate bits from PR #98 landed on top of main after PR #83/#84
covered the core stack. Keeps main's dashboards + Slack-only alertmanager
intact; drops PR #98's PagerDuty/Discord routing per team directive.
0xfandom added a commit that referenced this pull request May 5, 2026
Webhook hygiene, alertmanager watchdog, cold-start gating, image
pinning, inhibit rule fix, system_state cross-refs, test isolation,
quieter precision log. See PR body for per-item detail.
Pablosinyores pushed a commit that referenced this pull request May 15, 2026
Non-duplicate bits from PR #98 landed on top of main after PR #83/#84
covered the core stack. Keeps main's dashboards + Slack-only alertmanager
intact; drops PR #98's PagerDuty/Discord routing per team directive.