feat(observability): Prometheus alerts, Grafana dashboards, Slack routing#83
Conversation
0xfandom
left a comment
There was a problem hiding this comment.
Review — PR1 of 3 for issue #69 (E2 Observability Floor)
Reviewed against the reduced PR1 scope (WS-7.2, WS-7.3 Slack-only, WS-7.4, metric gap fixes). PR2 (Loki) and PR3 (OTEL/canary) scope is correctly excluded from evaluation.
Summary
Clean, well-scoped delivery. Metric plumbing is correct, dashboard PromQL references all resolve to real metrics in the codebase, alert rules align with the issue table, and the Slack secret handling avoids committing the webhook. Observer pattern on RiskManager is implemented correctly and keeps internal/risk Prometheus-free. Approving with a handful of non-blocking observations below.
Acceptance Criteria (PR1 scope)
| # | Criterion | Status | Evidence |
|---|---|---|---|
| 1 | WS-7.2: 4 Grafana dashboards, file-provisioned | Met | deploy/docker/grafana/dashboards/{overview,latency,builders,risk}.json — all carry "schemaVersion": 38 and explicit "uid" (overview.json:2-5, latency.json:2-5, builders.json:2-5, risk.json:2-5); provider at deploy/docker/grafana/provisioning/dashboards/default.yml |
| 2 | WS-7.3 (reduced): Alertmanager + single Slack receiver | Met | deploy/docker/alertmanager.yml:11-23 single slack-default receiver; Slack-only per standing policy (PagerDuty/Discord explicitly dropped) |
| 3 | WS-7.4: 7 alert rules with correct thresholds | Met | deploy/docker/prometheus/alerts.yml:6-73 — AetherHalted (for 1m), AetherInclusionRateLow (for 10m), AetherE2ELatencyHigh (p99>100ms for 5m), AetherNoOpportunities (for 10m), AetherETHBalanceLow (for 1m), AetherGasHigh (for 30s), AetherBuilderDown (for 2m with and idle-gate) |
| 4 | Per-builder metrics (builder, result labels) + latency histogram | Met | cmd/executor/metrics.go:63-71 builderSubmissionsTotal and builderLatencyMs; wired at submitter.go:334 via recordBuilderResult; label cardinality bounded by configured builder list (not user input) |
| 5 | aether_system_state gauge (0/1/2/3) |
Met | cmd/executor/metrics.go:72-75; mapping at main.go:396-409 with -1 sentinel for unmapped states |
| 6 | aether_circuit_breaker_trips_total{reason} counter |
Met | cmd/executor/metrics.go:76-79; labels emitted from manager.go:147-156 with consecutive_bug_reverts and daily_loss_exceeded |
| 7 | MetricsObserver hook keeps internal/risk Prometheus-free |
Met | internal/risk/manager.go:99-105 interface; cmd/executor/main.go:381-409 adapter; no Prometheus import in internal/risk/* |
| 8 | Initial state published at startup | Met | manager.go:137-144 SetMetricsObserver immediately emits OnStateChange(rm.state.Current()); asserted in manager_test.go:607-621 |
| 9 | Docker compose wiring | Met | deploy/docker/docker-compose.yml:50-86 — prometheus depends_on alertmanager, volumes mounted at paths matching prometheus.yml:5,10 |
| 10 | Slack webhook not committed | Met | alertmanager.yml:14 uses __SLACK_WEBHOOK_URL__ placeholder; substitution in docker-compose.yml:75-83 writes to /tmp/alertmanager.yml, leaving the read-only mount untouched; documented in .env.example:24-28 |
| 11 | No AI attribution in PR body | Met | PR body contains no Claude/AI references |
Items out of scope (PR2/PR3) — not evaluated here: WS-7.5 Loki/Promtail, WS-7.6 OTEL/Tempo, WS-7.7 synthetic canary.
Findings
Observations (non-blocking)
deploy/docker/prometheus/alerts.yml:15-25 — AetherInclusionRateLow denominator clamp can false-fire on low-traffic periods.
The clamp_min(sum(rate(bundles_submitted[1h])), 1) floor forces the denominator to 1 req/s even when actual throughput is a few per hour. During a slow period (say 0.0005/s submitted, 0.0001/s included) the ratio becomes ~0.0001, which trips the <0.20 threshold and will page via Slack even though nothing is wrong. AetherNoOpportunities partially guards this (also fires), but the inclusion alert itself is noisy. Consider gating with and sum(rate(bundles_submitted[1h])) > 0.005 (≈18/hr) so the alert only fires when you have enough denominator samples to be meaningful. Not a blocker — Slack-only routing with 4h repeat_interval (alertmanager.yml:9) limits noise.
deploy/docker/alertmanager.yml:14 — sed substitution will break if the webhook URL ever contains |.
Slack incoming webhooks today are https://hooks.slack.com/services/<team>/<hook>/<secret> — only / and alphanumerics — so the current delimiter is safe. Flagging only so that if you ever repurpose the substitution for a different receiver (PagerDuty URL, OpsGenie API, etc.), remember to pick a delimiter the payload cannot contain or switch to envsubst.
cmd/executor/main.go:165 — SetMetricsObserver is called before startMetricsServer() (main.go:195).
Ordering is fine (the Prometheus registry is process-global and initialized in metrics.go:82-98 init()), but worth noting that the initial OnStateChange(Running) increments the gauge before /metrics is reachable. No scrape can miss it because the value is idempotent — a later scrape will see the current gauge value. Non-issue.
internal/risk/manager.go:99-105 — observer doc comment covers the lock concern.
The interface godoc explicitly says "Called under rm.mu — keep callbacks non-blocking (Prometheus primitives are fine)." Current executorMetricsObserver (main.go:384-392) only calls Set / WithLabelValues(...).Inc(), both lock-free. A future contributor adding a blocking implementation (e.g. an HTTP webhook) would stall all risk processing — the doc comment is the only guardrail. Consider adding a TestMetricsObserver_NonBlocking assertion in a follow-up (e.g. a benchmark that fails if the observer callback takes >1µs) if you want enforcement rather than documentation.
cmd/executor/metrics.go:146-149 — result label values.
Only two values ever emitted ("success", "failure") — recordBuilderResult hardcodes them. Label cardinality is |builders| × 2, fully bounded. Good.
cmd/executor/metrics.go:143-149 vs submitter.go:334 — metrics recorded even when submission never left the process.
When submitter.SubmitToAll returns the synthetic {Builder: "all", Success: false, ...} result because bundle.RawTxs is empty (submitter.go:101-107), that path skips the per-builder goroutine so recordMetrics is not called — good, no builder="all" cardinality leak. Verified.
Observer notification coverage — every state mutation is covered. notifyTrip is invoked from exactly the two state-changing paths: RecordRevert (manager.go:315) and RecordTrade (manager.go:345). SystemStateMachine.Transition is not called from anywhere else under this PR's diff. If a future PR adds a degraded-state transition (node latency breaker, bundle miss rate alert) it MUST route through notifyTrip or the gauge will stop reflecting reality — worth a comment on SystemStateMachine itself. Not a PR1 issue.
Nits
deploy/docker/prometheus/alerts.yml:57 — AetherGasHigh severity=info.
The issue lists the gas breaker as halting at >300 gwei. The alert here is informational only (no page, just a Slack notice). Because the circuit-breaker side of this (in RiskManager.PreflightCheck, manager.go:235) already rejects arbs and AetherHalted would fire if a halt state is ever wired to high-gas, the info-only severity seems intentional and correct. Noting for visibility.
deploy/docker/docker-compose.yml:58 — prometheus depends_on: [alertmanager].
Reverse of the usual pattern (prometheus can come up without alertmanager; it will just buffer alerts). Functionally fine — just means if alertmanager fails to start, prometheus stays down too. Might prefer independent startup to keep metrics scraping alive during an alertmanager outage.
deploy/docker/alertmanager.yml:6-9 — group_wait: 30s, group_interval: 5m, repeat_interval: 4h.
Sane defaults for a single-channel Slack setup. No alert-storm risk.
What's Good
- Observer-pattern decoupling (
internal/risk/manager.go:99-156) is the right abstraction and keeps the risk package dependency-free. TestMetricsObserver_*tests (manager_test.go:607-685) actually exercise the initial-emit, bug-revert, and daily-loss paths — not smoke-only.TestRecordBuilderResult_ScrapeLabels(metrics_test.go:131-161) asserts the full labeled exposition text, which would catch accidental label-name drift.- Slack secret handling via
sedinto/tmpavoids mutating the read-only mount and keeps the webhook out of git. - Dashboard PromQL is internally consistent: every referenced metric (
aether_detection_latency_ms_bucket,aether_simulation_latency_ms_bucket,aether_arbs_published_total,aether_system_state,aether_circuit_breaker_trips_total, per-builder vec) exists in eithercrates/grpc-server/src/metrics.rsorcmd/executor/metrics.go. No dead panels. AetherBuilderDownidle-gate usingand on (builder) sum by (builder) (rate(...)[2m]) > 0correctly prevents firing on builders that aren't receiving any submissions — addresses the stated concern.
Verdict
APPROVE — PR1 scope is delivered end-to-end with correct metric plumbing, live dashboard queries, and functioning Slack routing. The two items worth revisiting (inclusion-rate denominator clamp, observer blocking contract) are tunables / future-contributor guardrails, not correctness bugs. Ship it; fold the inclusion-rate gate into PR2 if it becomes noisy in staging.
Add Prometheus metrics required by PR1 observability dashboards and alert rules: per-builder submission counters, submission latency histogram, system-state gauge, and circuit-breaker-trip counter. Introduces a MetricsObserver hook on RiskManager so cmd/executor can wire the gauge/counter without internal/risk depending on Prometheus.
…ger+Slack Wires Prometheus metrics through four provisioned Grafana dashboards and seven alert rules routed to a single Slack channel via Alertmanager. - 4 Grafana dashboards with file-based provisioning: Overview, Latency (p50/p95/p99), Builder Performance, Risk & State - 7 Prometheus alert rules covering halted state, inclusion rate, E2E p99 latency, opportunity drought, low ETH balance, high gas, and per-builder outage - Alertmanager service in docker-compose with a single Slack receiver; webhook URL is substituted from SLACK_WEBHOOK_URL at container startup so the secret never lands in the committed config - prometheus.yml now loads rule_files and points at alertmanager:9093 Part of issue #69. Loki/Promtail log aggregation and OTel/Tempo tracing are deferred to follow-up PRs.
- grafana datasource uid pinned so dashboards resolve on fresh deploy
- clamp_min floor to 1e-9 (was 1) so inclusion-rate alert stops firing permanently
- pre-register {success,failure} builder labels so AetherBuilderDown can fire on first failure
- hoist circuit-breaker observer above state.Transition so trips are counted even when halted
bc19ae0 to
33de991
Compare
Re-review round 2 — all blockers addressedRebased onto current Must-fix (blockers)
Should-fix (non-blocking but in this PR)
Deliberately skipped (noise, not worth in-PR churn)
Validation
Branch is now rebased on top of current |
0xfandom
left a comment
There was a problem hiding this comment.
Summary
This PR delivers 4 Grafana dashboards, an Alertmanager routing to Slack, and 7 Prometheus alert rules, wired into deploy/docker/docker-compose.yml. On the Go side it adds a MetricsObserver in internal/risk so risk-manager state changes and breaker trips become Prometheus series, plus a PreRegisterBuilderLabels helper so AetherBuilderDown can fire against silent-from-birth builders. Round-2 fixes (datasource UID, clamp_min(..., 1e-9), pre-registered builder labels, trip-before-transition ordering) are all correctly landed with backing tests. No merge-blockers; a handful of medium/low observations worth a follow-up.
Issue #69 AC table
| WS | Item | Status | Evidence |
|---|---|---|---|
| WS-7.2 | overview dashboard | Met | deploy/docker/grafana/dashboards/overview.json:1-175 — 7 panels referencing aether_arbs_published_total, aether_executor_bundles_submitted_total/included_total, aether_eth_balance, aether_gas_price_gwei, aether_daily_pnl_eth, aether_executor_profit_wei_total, aether_executor_gas_spent_wei_total |
| WS-7.2 | latency dashboard | Met | latency.json:1-68 — detection/sim/e2e p50/p95/p99 histograms |
| WS-7.2 | builders dashboard | Met | builders.json:1-86 — submission rate, success rate, p95 latency, totals table |
| WS-7.2 | risk/state dashboard | Met | risk.json:1-159 — state timeline + stat, breaker trips, risk rejections, PnL, balance |
| WS-7.2 | provisioning | Met | grafana/provisioning/dashboards/default.yml:1-14, datasources/prometheus.yml:1-11 |
| WS-7.3 | Alertmanager + receiver | Met (Slack-only per team directive) | docker-compose.yml:63-86, alertmanager.yml:1-28 — PagerDuty/Discord deliberately deferred |
| WS-7.3 | Webhook-via-env, not committed | Met | alertmanager.yml:14 uses __SLACK_WEBHOOK_URL__ placeholder; sed substitution at container start; .env.example:24-28 documents it |
| WS-7.4 | 7 alerts | Met | prometheus/alerts.yml:1-74 defines all 7: AetherHalted, AetherInclusionRateLow, AetherE2ELatencyHigh, AetherNoOpportunities, AetherETHBalanceLow, AetherGasHigh, AetherBuilderDown |
| WS-7.4 | Prometheus loads rules | Met | deploy/docker/prometheus.yml:5-6 declares rule_files: [/etc/prometheus/alerts.yml]; mounted via docker-compose.yml:57 |
| WS-7.4 | Prometheus talks to Alertmanager | Met | prometheus.yml:8-11 static_configs target alertmanager:9093; both on aether-net |
Round-2 verification
-
Datasource UID — Verified.
grafana/provisioning/datasources/prometheus.yml:6setsuid: prometheus, matchingdatasource.uid: "prometheus"across every panel of all four dashboards (overview.json:16,45,74,100,126,141,165,latency.json:16,29,44,builders.json:16,32,56,72,risk.json:16,48,64,80,106,132). -
clamp_min(..., 1e-9)— Verified.prometheus/alerts.yml:19andoverview.json:49both use1e-9. The only remainingclamp_minisbuilders.json:36using0.0001— intentional for the panel's percentage-ratio guard and is a dashboard, not an alert. Correct. -
PreRegisterBuilderLabels— Verified.cmd/executor/metrics.go:159-164: iterates names, callsWithLabelValues(name, "success").Add(0)and.Add(0)on failure.- Wired at
cmd/executor/submitter.go:78-87—NewSubmitterbuildsnamesslice and callsPreRegisterBuilderLabels(names)before fan-out starts. - Race-safe:
CounterVec.WithLabelValuesis documented goroutine-safe;NewSubmitterruns once at startup. TestPreRegisterBuilderLabels_BothSeriesExistAtZero(metrics_test.go:256-302) asserts both series on in-process counter and on the scraped/metricstext — exactly the propertyAetherBuilderDowndepends on.
-
notifyTriphoist — Verified.internal/risk/manager.go:154-165:OnCircuitBreakerTrip(reason)runs BEFOREstate.Transition(newState);OnStateChange(newState)stays AFTER and only fires on successful transition. Function is held underrm.mu(call sitesRecordRevert:324,RecordTrade:354both takerm.mu.Lock()). Observer contract requires non-blocking callbacks;executorMetricsObserver(cmd/executor/main.go:438-446) only touches lock-free Prometheus primitives.TestMetricsObserver_TripCountedEvenWhenTransitionRejected(manager_test.go:693-742) forcesStateHalted, drains initial observer, triggers 3 bug reverts, asserts exactly oneconsecutive_bug_revertstrip AND zero new state changes.
Must-fix blockers
None. This PR is mergeable.
Should-fix nits (follow-up PR territory)
-
MEDIUM — Slack webhook leaks into
docker inspect.docker-compose.yml:79uses${SLACK_WEBHOOK_URL}inside the entrypoint shell block. Compose interpolates${VAR}in block scalars at parse time, so the literal URL ends up baked into the container's entrypoint ARGV and is visible viadocker inspect aether-alertmanager. Theenvironment:block already propagates the variable into the container env; the entrypoint can use$$SLACK_WEBHOOK_URL(escaped$) so expansion happens inside the container shell at runtime, keeping the secret out ofdocker inspect/ compose logs. Also worth switchingseddelimiter to|→ something that survives sed-special chars (&,\) in the URL. -
MEDIUM — Empty
SLACK_WEBHOOK_URL→ silent alerting loss. If the env var is unset, sed substitutes an empty string andalertmanager.yml:14becomesapi_url:. Alertmanager fails config load and the container crashloops underrestart: unless-stopped. There is no watchdog alert for "Alertmanager is down" — Prometheus doesn't currently scrape Alertmanager (prometheus.yml:13-19). Add either (a) an Alertmanager scrape job +up{job="alertmanager"} == 0alert rule, or (b) a startup guard in the entrypoint that logs loudly and refuses to start on empty webhook. -
MEDIUM —
AetherNoOpportunitiesfires on cold start / low-activity windows.alerts.yml:36-43fires whenrate(aether_arbs_published_total[10m]) * 60 < 5for 10m. A freshly-started bot produces zero published arbs for the first window; this pages ~20 min after every restart. Considerunless on() (time() - process_start_time_seconds < 1800)to suppress warm-up, or gate onaether_blocks_processed_total > 0. -
MEDIUM —
AetherBuilderDownbehavior forEnabled: falsebuilder. Pre-registration handles a configured-but-never-attempted builder. But ifconfig/builders.yamllists a builder withEnabled: false, pre-registration still fires (good), andrate(...)is zero on bothsuccessand total → theand on (builder)guard sees zero-total and correctly does NOT fire. Worth documenting this — an operator may later wonder why a disabled builder is "silent" per the alert.
Low / nits
-
LOW —
AetherHaltedhard-codesaether_system_state == 3(alerts.yml:7). The mapping lives incmd/executor/main.go:450-463(stateToInt), the metric help text (metrics.go:75), andrisk.json:25-28,139-143. A future renumber will miss one. Either publish asaether_system_state{state="halted"} 1/0or at minimum add a cross-reference comment. -
LOW — Compose uses
:latesttags for grafana/prometheus/alertmanager. Silent breaking upgrades ondocker compose pull. Pin to known-good majors and bump explicitly. -
LOW —
inhibit_rules(alertmanager.yml:25-28) is nearly a no-op since each alert has a unique alertname. Not a bug, just note the design intent matches operator expectation ("same alertname at critical suppresses its warning sibling" — not true for these alerts). -
NIT —
TestRecordBuilderResult_ScrapeLabels(metrics_test.go:224-254) uses real builder names (flashbots,titan) which pollute the global registry.TestPreRegisterBuilderLabels_BothSeriesExistAtZeroalready avoids this by using uniqueprereg_alpha/prereg_beta. Apply the same pattern to the older test when you touch it. -
NIT —
metrics.go:180logs "precision loss" warning on every profit over 2^53 wei. CumulativeprofitTotalWei.Add(f)will cross2^53after a few ETH of lifetime profit, producing ongoing log spam. Downgrade to debug or sample.
Can-defer (follow-up tickets)
- Alertmanager scrape +
up{job="alertmanager"}watchdog alert (addresses the silent-outage gap above). - PagerDuty + Discord receivers (explicitly deferred per team directive; file a separate ticket to track).
- State-gauge label migration.
- Docker image pinning across observability stack.
Verdict
APPROVE WITH RESERVATIONS — all four round-2 fixes landed correctly with tests that actually prove the property. Issue #69 WS-7.2/7.3/7.4 scope is fully delivered (Slack-only routing is an acknowledged deviation). No production-risk blockers. The medium items are worth a follow-up PR but do not need to block this merge. Ship it.
Webhook hygiene, alertmanager watchdog, cold-start gating, image pinning, inhibit rule fix, system_state cross-refs, test isolation, quieter precision log. See PR body for per-item detail.
Summary
PR1 of 3 for issue #69 — delivers the core observability loop: Prometheus metrics → Grafana dashboards → Alertmanager → Slack. Everything self-hosted; no paid services.
AetherHalted,AetherInclusionRateLow,AetherE2ELatencyHigh,AetherNoOpportunities,AetherETHBalanceLow,AetherGasHigh,AetherBuilderDownSLACK_WEBHOOK_URLat container startup — never committedbuilder,resultlabels) + latency histogram,aether_system_stategauge,aether_circuit_breaker_trips_total{reason}counterMetricsObserverhook onRiskManagersointernal/riskstays Prometheus-free —cmd/executoradapts to the gauge/counter at startupFollow-up PRs for the remaining WS-7 scope are outlined in this comment on #69:
Files
Test plan
Notes for reviewers
Closes part of #69 (see linked comment for PR split).