feat(observability): Prometheus alerts, Grafana dashboards, Slack routing by Pablosinyores · Pull Request #83 · Pablosinyores/aether

Pablosinyores · 2026-04-14T10:01:10Z

Summary

PR1 of 3 for issue #69 — delivers the core observability loop: Prometheus metrics → Grafana dashboards → Alertmanager → Slack. Everything self-hosted; no paid services.

4 Grafana dashboards (file-provisioned): Overview, Latency (p50/p95/p99), Builder Performance, Risk & State
7 Prometheus alert rules: AetherHalted, AetherInclusionRateLow, AetherE2ELatencyHigh, AetherNoOpportunities, AetherETHBalanceLow, AetherGasHigh, AetherBuilderDown
Alertmanager with a single Slack receiver; webhook URL is substituted from SLACK_WEBHOOK_URL at container startup — never committed
Metric-gap fixes in the Go executor: per-builder submission counters (builder, result labels) + latency histogram, aether_system_state gauge, aether_circuit_breaker_trips_total{reason} counter
MetricsObserver hook on RiskManager so internal/risk stays Prometheus-free — cmd/executor adapts to the gauge/counter at startup

Follow-up PRs for the remaining WS-7 scope are outlined in this comment on #69:

PR2 — Loki + Promtail log aggregation (WS-7.5)
PR3 — OpenTelemetry + Tempo tracing + synthetic canary (WS-7.6, WS-7.7)

Files

Area	Files
Metrics	`cmd/executor/metrics.go`, `cmd/executor/submitter.go`, `cmd/executor/main.go`, `cmd/executor/metrics_test.go`, `internal/risk/manager.go`, `internal/risk/manager_test.go`
Prometheus	`deploy/docker/prometheus.yml`, `deploy/docker/prometheus/alerts.yml`
Alertmanager	`deploy/docker/alertmanager.yml`
Grafana	`deploy/docker/grafana/dashboards/{overview,latency,builders,risk}.json`, `deploy/docker/grafana/provisioning/dashboards/default.yml`
Compose	`deploy/docker/docker-compose.yml` (adds alertmanager service, mounts alerts + dashboards)
Env	`.env.example` (documents `SLACK_WEBHOOK_URL`)

Test plan

`go test ./cmd/executor/... ./internal/risk/... -count=1 -race` — pass
`go vet ./cmd/executor/... ./internal/risk/...` — clean
`jq empty deploy/docker/grafana/dashboards/*.json` — valid JSON
YAML parse of all new/modified `.yml` files — valid
`docker compose up -d prometheus grafana alertmanager aether-rust aether-go` on a host with Docker — dashboards appear in Grafana, 7 rules load in `http://:9091/alerts`, Alertmanager reachable at `:9093`
Trigger each alert (tweak thresholds or push synthetic values), confirm Slack message carries the severity label and alert description
Walk through a simulated incident using only Grafana + Slack alerts (E2 gate criterion from the issue)

Notes for reviewers

The Slack receiver uses an entrypoint `sed` substitution to inject the webhook from env. This avoids feature flags for env expansion across Alertmanager versions and keeps the committed config free of secrets. The substituted file is written to `/tmp` inside the container, not to a mounted path.
`AetherBuilderDown` uses an `and` clause so it only fires for builders that actually received submissions in the window (avoids paging on an entirely idle system).
Dashboards use `schemaVersion: 38` and explicit `uid` fields so they are reimport-safe.
Observer callbacks run under `rm.mu`. Prometheus primitives are lock-free, so this is fine today. Any future observer doing blocking I/O would stall risk processing — documented on the interface.
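
The observer decoupling described in these notes can be pictured with a minimal, stdlib-only sketch. It is not the PR's code: `State`, `Halt`, and `recordingObserver` are hypothetical stand-ins, and the real adapter in `cmd/executor` would update Prometheus primitives rather than a slice and a map.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical stand-in for the risk package's state type.
type State string

const (
	StateRunning State = "running"
	StateHalted  State = "halted"
)

// MetricsObserver is the only thing the risk side knows about metrics.
// Callbacks run under the manager's mutex, so they must not block.
type MetricsObserver interface {
	OnStateChange(s State)
	OnCircuitBreakerTrip(reason string)
}

// RiskManager is a reduced illustration, not the real internal/risk type.
type RiskManager struct {
	mu    sync.Mutex
	state State
	obs   MetricsObserver
}

func NewRiskManager() *RiskManager { return &RiskManager{state: StateRunning} }

// SetMetricsObserver registers the observer and publishes the current state
// right away, so the gauge is populated before any transition happens.
func (rm *RiskManager) SetMetricsObserver(o MetricsObserver) {
	rm.mu.Lock()
	defer rm.mu.Unlock()
	rm.obs = o
	if o != nil {
		o.OnStateChange(rm.state)
	}
}

// Halt is a hypothetical trigger used only to drive the observer.
func (rm *RiskManager) Halt(reason string) {
	rm.mu.Lock()
	defer rm.mu.Unlock()
	rm.state = StateHalted
	if rm.obs != nil {
		rm.obs.OnCircuitBreakerTrip(reason)
		rm.obs.OnStateChange(StateHalted)
	}
}

// recordingObserver plays the role of the executor-side adapter.
type recordingObserver struct {
	states []State
	trips  map[string]int
}

func (r *recordingObserver) OnStateChange(s State)              { r.states = append(r.states, s) }
func (r *recordingObserver) OnCircuitBreakerTrip(reason string) { r.trips[reason]++ }

func main() {
	obs := &recordingObserver{trips: map[string]int{}}
	rm := NewRiskManager()
	rm.SetMetricsObserver(obs)
	rm.Halt("daily_loss_exceeded")
	fmt.Println(obs.states, obs.trips)
}
```

Because the callbacks in this sketch only append to a slice and bump a map entry, they return immediately; an observer doing network I/O in the same position would hold the mutex for the duration of the call.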

Closes part of #69 (see linked comment for PR split).

0xfandom

Review — PR1 of 3 for issue #69 (E2 Observability Floor)

Reviewed against the reduced PR1 scope (WS-7.2, WS-7.3 Slack-only, WS-7.4, metric gap fixes). PR2 (Loki) and PR3 (OTEL/canary) scope is correctly excluded from evaluation.

Summary

Clean, well-scoped delivery. Metric plumbing is correct, dashboard PromQL references all resolve to real metrics in the codebase, alert rules align with the issue table, and the Slack secret handling avoids committing the webhook. Observer pattern on RiskManager is implemented correctly and keeps internal/risk Prometheus-free. Approving with a handful of non-blocking observations below.

Acceptance Criteria (PR1 scope)

#	Criterion	Status	Evidence
1	WS-7.2: 4 Grafana dashboards, file-provisioned	Met	`deploy/docker/grafana/dashboards/{overview,latency,builders,risk}.json` — all carry `"schemaVersion": 38` and explicit `"uid"` (overview.json:2-5, latency.json:2-5, builders.json:2-5, risk.json:2-5); provider at `deploy/docker/grafana/provisioning/dashboards/default.yml`
2	WS-7.3 (reduced): Alertmanager + single Slack receiver	Met	`deploy/docker/alertmanager.yml:11-23` single `slack-default` receiver; Slack-only per standing policy (PagerDuty/Discord explicitly dropped)
3	WS-7.4: 7 alert rules with correct thresholds	Met	`deploy/docker/prometheus/alerts.yml:6-73` — `AetherHalted` (for 1m), `AetherInclusionRateLow` (for 10m), `AetherE2ELatencyHigh` (p99>100ms for 5m), `AetherNoOpportunities` (for 10m), `AetherETHBalanceLow` (for 1m), `AetherGasHigh` (for 30s), `AetherBuilderDown` (for 2m with `and` idle-gate)
4	Per-builder metrics (builder, result labels) + latency histogram	Met	`cmd/executor/metrics.go:63-71` `builderSubmissionsTotal` and `builderLatencyMs`; wired at `submitter.go:334` via `recordBuilderResult`; label cardinality bounded by configured builder list (not user input)
5	`aether_system_state` gauge (0/1/2/3)	Met	`cmd/executor/metrics.go:72-75`; mapping at `main.go:396-409` with `-1` sentinel for unmapped states
6	`aether_circuit_breaker_trips_total{reason}` counter	Met	`cmd/executor/metrics.go:76-79`; labels emitted from `manager.go:147-156` with `consecutive_bug_reverts` and `daily_loss_exceeded`
7	`MetricsObserver` hook keeps `internal/risk` Prometheus-free	Met	`internal/risk/manager.go:99-105` interface; `cmd/executor/main.go:381-409` adapter; no Prometheus import in `internal/risk/*`
8	Initial state published at startup	Met	`manager.go:137-144` `SetMetricsObserver` immediately emits `OnStateChange(rm.state.Current())`; asserted in `manager_test.go:607-621`
9	Docker compose wiring	Met	`deploy/docker/docker-compose.yml:50-86` — prometheus depends_on alertmanager, volumes mounted at paths matching `prometheus.yml:5,10`
10	Slack webhook not committed	Met	`alertmanager.yml:14` uses `__SLACK_WEBHOOK_URL__` placeholder; substitution in `docker-compose.yml:75-83` writes to `/tmp/alertmanager.yml`, leaving the read-only mount untouched; documented in `.env.example:24-28`
11	No AI attribution in PR body	Met	PR body contains no Claude/AI references

Items out of scope (PR2/PR3) — not evaluated here: WS-7.5 Loki/Promtail, WS-7.6 OTEL/Tempo, WS-7.7 synthetic canary.

Findings

Observations (non-blocking)

deploy/docker/prometheus/alerts.yml:15-25 — AetherInclusionRateLow denominator clamp can false-fire on low-traffic periods.
The clamp_min(sum(rate(bundles_submitted[1h])), 1) floor forces the denominator to 1 req/s even when actual throughput is a few per hour. During a slow period (say 0.0005/s submitted, 0.0001/s included) the ratio becomes ~0.0001, which trips the <0.20 threshold and will page via Slack even though nothing is wrong. AetherNoOpportunities partially guards this (also fires), but the inclusion alert itself is noisy. Consider gating with and sum(rate(bundles_submitted[1h])) > 0.005 (≈18/hr) so the alert only fires when you have enough denominator samples to be meaningful. Not a blocker — Slack-only routing with 4h repeat_interval (alertmanager.yml:9) limits noise.
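
The arithmetic behind this observation can be checked with a small Go sketch. The rates below are illustrative values, not measurements from the system; `inclusionRatio` and `shouldFire` are hypothetical helpers that mirror the PromQL shape `included / clamp_min(submitted, floor)` and the suggested `> 0.005` gate.

```go
package main

import (
	"fmt"
	"math"
)

const threshold = 0.20 // alert fires when the ratio drops below this

// inclusionRatio mirrors included / clamp_min(submitted, floor).
func inclusionRatio(included, submitted, floor float64) float64 {
	return included / math.Max(submitted, floor)
}

// shouldFire adds the suggested gate: only evaluate the ratio when the
// submission rate is high enough to be meaningful.
func shouldFire(included, submitted, floor, minRate float64) bool {
	return submitted > minRate && inclusionRatio(included, submitted, floor) < threshold
}

func main() {
	// Illustrative slow period: 0.0005 submitted/s, 0.0003 included/s (a true 60% rate).
	submitted, included := 0.0005, 0.0003

	ratio := inclusionRatio(included, submitted, 1)
	fmt.Printf("ratio with floor=1: %g (below threshold: %v)\n", ratio, ratio < threshold)

	gated := shouldFire(included, submitted, 1, 0.005)
	fmt.Printf("fires with the 0.005/s gate: %v\n", gated)
}
```

With a floor of 1 the computed ratio is the raw included rate, so any period below one submission per second looks like a near-zero inclusion rate regardless of the true ratio.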

deploy/docker/alertmanager.yml:14 — sed substitution will break if the webhook URL ever contains |.
Slack incoming webhooks today are https://hooks.slack.com/services/<team>/<hook>/<secret> — only / and alphanumerics — so the current delimiter is safe. Flagging only so that if you ever repurpose the substitution for a different receiver (PagerDuty URL, OpsGenie API, etc.), remember to pick a delimiter the payload cannot contain or switch to envsubst.

cmd/executor/main.go:165 — SetMetricsObserver is called before startMetricsServer() (main.go:195).
Ordering is fine (the Prometheus registry is process-global and initialized in metrics.go:82-98 init()), but worth noting that the initial OnStateChange(Running) sets the gauge before /metrics is reachable. No scrape can miss it, because a gauge holds its latest value: a later scrape will see the current state. Non-issue.

internal/risk/manager.go:99-105 — observer doc comment covers the lock concern.
The interface godoc explicitly says "Called under rm.mu — keep callbacks non-blocking (Prometheus primitives are fine)." Current executorMetricsObserver (main.go:384-392) only calls Set / WithLabelValues(...).Inc(), both lock-free. A future contributor adding a blocking implementation (e.g. an HTTP webhook) would stall all risk processing — the doc comment is the only guardrail. Consider adding a TestMetricsObserver_NonBlocking assertion in a follow-up (e.g. a benchmark that fails if the observer callback takes >1µs) if you want enforcement rather than documentation.

cmd/executor/metrics.go:146-149 — result label values.
Only two values ever emitted ("success", "failure") — recordBuilderResult hardcodes them. Label cardinality is |builders| × 2, fully bounded. Good.

cmd/executor/metrics.go:143-149 vs submitter.go:334 — metrics recorded even when submission never left the process.
When submitter.SubmitToAll returns the synthetic {Builder: "all", Success: false, ...} result because bundle.RawTxs is empty (submitter.go:101-107), that path skips the per-builder goroutine so recordMetrics is not called — good, no builder="all" cardinality leak. Verified.

Observer notification coverage — every state mutation is covered. notifyTrip is invoked from exactly the two state-changing paths: RecordRevert (manager.go:315) and RecordTrade (manager.go:345). SystemStateMachine.Transition is not called from anywhere else under this PR's diff. If a future PR adds a degraded-state transition (node latency breaker, bundle miss rate alert) it MUST route through notifyTrip or the gauge will stop reflecting reality — worth a comment on SystemStateMachine itself. Not a PR1 issue.

Nits

deploy/docker/prometheus/alerts.yml:57 — AetherGasHigh severity=info.
The issue lists the gas breaker as halting at >300 gwei. The alert here is informational only (no page, just a Slack notice). Because the circuit-breaker side of this (in RiskManager.PreflightCheck, manager.go:235) already rejects arbs and AetherHalted would fire if a halt state is ever wired to high-gas, the info-only severity seems intentional and correct. Noting for visibility.

deploy/docker/docker-compose.yml:58 — prometheus depends_on: [alertmanager].
Reverse of the usual pattern (prometheus can come up without alertmanager; it will just buffer alerts). Functionally fine — just means if alertmanager fails to start, prometheus stays down too. Might prefer independent startup to keep metrics scraping alive during an alertmanager outage.

deploy/docker/alertmanager.yml:6-9 — group_wait: 30s, group_interval: 5m, repeat_interval: 4h.
Sane defaults for a single-channel Slack setup. No alert-storm risk.

What's Good

Observer-pattern decoupling (internal/risk/manager.go:99-156) is the right abstraction and keeps the risk package dependency-free.
TestMetricsObserver_* tests (manager_test.go:607-685) actually exercise the initial-emit, bug-revert, and daily-loss paths — not smoke-only.
TestRecordBuilderResult_ScrapeLabels (metrics_test.go:131-161) asserts the full labeled exposition text, which would catch accidental label-name drift.
Slack secret handling via sed into /tmp avoids mutating the read-only mount and keeps the webhook out of git.
Dashboard PromQL is internally consistent: every referenced metric (aether_detection_latency_ms_bucket, aether_simulation_latency_ms_bucket, aether_arbs_published_total, aether_system_state, aether_circuit_breaker_trips_total, per-builder vec) exists in either crates/grpc-server/src/metrics.rs or cmd/executor/metrics.go. No dead panels.
AetherBuilderDown idle-gate using and on (builder) sum by (builder) (rate(...)[2m]) > 0 correctly prevents firing on builders that aren't receiving any submissions — addresses the stated concern.

Verdict

APPROVE — PR1 scope is delivered end-to-end with correct metric plumbing, live dashboard queries, and functioning Slack routing. The two items worth revisiting (inclusion-rate denominator clamp, observer blocking contract) are tunables / future-contributor guardrails, not correctness bugs. Ship it; fold the inclusion-rate gate into PR2 if it becomes noisy in staging.

Add Prometheus metrics required by PR1 observability dashboards and alert rules: per-builder submission counters, submission latency histogram, system-state gauge, and circuit-breaker-trip counter. Introduces a MetricsObserver hook on RiskManager so cmd/executor can wire the gauge/counter without internal/risk depending on Prometheus.

…ger+Slack

Wires Prometheus metrics through four provisioned Grafana dashboards and seven alert rules routed to a single Slack channel via Alertmanager.

- 4 Grafana dashboards with file-based provisioning: Overview, Latency (p50/p95/p99), Builder Performance, Risk & State
- 7 Prometheus alert rules covering halted state, inclusion rate, E2E p99 latency, opportunity drought, low ETH balance, high gas, and per-builder outage
- Alertmanager service in docker-compose with a single Slack receiver; webhook URL is substituted from SLACK_WEBHOOK_URL at container startup so the secret never lands in the committed config
- prometheus.yml now loads rule_files and points at alertmanager:9093

Part of issue #69. Loki/Promtail log aggregation and OTel/Tempo tracing are deferred to follow-up PRs.

- grafana datasource uid pinned so dashboards resolve on fresh deploy
- clamp_min floor to 1e-9 (was 1) so inclusion-rate alert stops firing permanently
- pre-register {success,failure} builder labels so AetherBuilderDown can fire on first failure
- hoist circuit-breaker observer above state.Transition so trips are counted even when halted

0xfandom · 2026-04-21T06:07:23Z

Re-review round 2 — all blockers addressed

Rebased onto current main (picks up the chore(clippy): fix 12 workspace lint errors + PR #85 work). Fix commit: 33de991.

Must-fix (blockers)

#	Item	Status
1	Grafana datasource UID mismatch — dashboards would be DOA on fresh deploy (every panel would show "Data source prometheus not found")	Fixed in `33de991`. `deploy/docker/grafana/provisioning/datasources/prometheus.yml` now pins `uid: prometheus` to match the hardcoded reference in all 4 dashboard JSONs.
2	`AetherInclusionRateLow` fires permanently at <60 bundles/min because `clamp_min(..., 1)` floors the per-second denominator at 1 = 60/min. Real cadence is ~10/min = 0.17/s → ratio forced artificially tiny	Fixed in `33de991`. Changed to `clamp_min(..., 1e-9)` in both `alerts.yml:19` and `overview.json:49`. Intent is division-by-zero guard only. (`builders.json` already used `0.0001` correctly — not touched.)
3	`AetherBuilderDown` can never fire for a builder whose first submission fails — `CounterVec` only emits a series after first `WithLabelValues`, so `rate({result="success"})` is empty, `empty == 0` is empty, `and on (builder)` finds no LHS match	Fixed in `33de991`. Added `PreRegisterBuilderLabels(names []string)` helper in `metrics.go` + wired into `NewSubmitter`. Every configured builder gets both `{success}` and `{failure}` series pre-registered at 0 on startup. New test `TestPreRegisterBuilderLabels_BothSeriesExistAtZero` asserts both series are present on `/metrics` scrape.
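
Blocker 3 depends on labeled series existing only after first use. The sketch below imitates that behaviour with a map-backed stand-in rather than the real Prometheus `CounterVec`; `counterVec`, `Lookup`, and `preRegisterBuilderLabels` are hypothetical names chosen for illustration, and the builder names are just sample values.

```go
package main

import (
	"fmt"
	"sync"
)

type seriesKey struct{ builder, result string }

// counterVec is a tiny stand-in for a labeled counter family: a series is
// only visible once its label pair has been touched at least once.
type counterVec struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newCounterVec() *counterVec {
	return &counterVec{series: make(map[seriesKey]float64)}
}

// Add creates the series if needed and increases it by delta.
func (c *counterVec) Add(builder, result string, delta float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.series[seriesKey{builder, result}] += delta
}

// Lookup reports the value and whether the series exists at all.
func (c *counterVec) Lookup(builder, result string) (float64, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.series[seriesKey{builder, result}]
	return v, ok
}

// preRegisterBuilderLabels touches both result series for every builder so
// they are exposed at 0 before any submission happens.
func preRegisterBuilderLabels(c *counterVec, names []string) {
	for _, name := range names {
		c.Add(name, "success", 0)
		c.Add(name, "failure", 0)
	}
}

func main() {
	cold := newCounterVec()
	cold.Add("titan", "failure", 1) // first-ever submission fails
	_, ok := cold.Lookup("titan", "success")
	fmt.Println("success series without pre-registration:", ok)

	warm := newCounterVec()
	preRegisterBuilderLabels(warm, []string{"flashbots", "titan"})
	warm.Add("titan", "failure", 1)
	v, ok := warm.Lookup("titan", "success")
	fmt.Println("success series with pre-registration:", ok, v)
}
```

In the cold case there is no success series for a rate expression to evaluate, which is the condition under which the alert could never match; in the warm case the series exists with value 0.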

Should-fix (non-blocking but in this PR)

#	Item	Status
NB1	`notifyTrip` silently drops the circuit-breaker-trip metric when the state transition is rejected (e.g. already Halted) — hides the exact "we're still seeing reverts in halted state" signal	Fixed in `33de991`. Hoisted `rm.metricsObs.OnCircuitBreakerTrip(reason)` above `rm.state.Transition(...)` so every trip is counted unconditionally. `OnStateChange` stays below (only fires on real transitions, by design). New test `TestMetricsObserver_TripCountedEvenWhenTransitionRejected` forces Halted state and asserts the trip is still counted once.
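
The ordering change in NB1 can be sketched as follows. This is an illustration under assumptions, not the code in `manager.go`: the `machine` type and its rule that a transition to the current state is rejected are hypothetical, chosen only to reproduce a rejected transition.

```go
package main

import "fmt"

// machine is a hypothetical state machine that rejects a transition
// into the state it is already in.
type machine struct{ current string }

func (m *machine) Transition(to string) bool {
	if m.current == to {
		return false
	}
	m.current = to
	return true
}

// observer records what a metrics adapter would see.
type observer struct {
	trips        map[string]int
	stateChanges int
}

// notifyTrip counts the trip first, then attempts the transition.
// The state-change callback only runs when the transition succeeds.
func notifyTrip(m *machine, o *observer, reason, newState string) {
	o.trips[reason]++ // counted unconditionally
	if m.Transition(newState) {
		o.stateChanges++
	}
}

func main() {
	m := &machine{current: "halted"}
	o := &observer{trips: map[string]int{}}

	// Already halted: the transition is rejected, the trip still counts.
	notifyTrip(m, o, "consecutive_bug_reverts", "halted")
	fmt.Println(o.trips["consecutive_bug_reverts"], o.stateChanges)
}
```

If the increment sat inside the `if` branch instead, a trip arriving while already halted would leave no trace in the counter.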

Deliberately skipped (noise, not worth in-PR churn)

.env.example:13 comment wording about empty SLACK_WEBHOOK_URL behaviour.
${SLACK_WEBHOOK_URL} sed substitution happening at compose-time vs container-time.
aether_system_state gauge encoding comment / shared constants.
Alertmanager inhibit_rule being a no-op (harmless dead config).
api_url: __SLACK_WEBHOOK_URL__ unquoted.

Validation

go test ./... -race -count=1 — pass (4 packages)
go vet ./... + go build ./... — clean
cargo clippy --workspace --release --bins --tests -- -D warnings — clean
All 4 dashboard JSONs valid; all alertmanager/prometheus/datasource YAMLs valid

Branch is now rebased on top of current main, so the stacked PRs #84 / #86 will rebase through this cleanly. Ready for re-review.

0xfandom

Summary

This PR delivers 4 Grafana dashboards, an Alertmanager routing to Slack, and 7 Prometheus alert rules, wired into deploy/docker/docker-compose.yml. On the Go side it adds a MetricsObserver in internal/risk so risk-manager state changes and breaker trips become Prometheus series, plus a PreRegisterBuilderLabels helper so AetherBuilderDown can fire against silent-from-birth builders. Round-2 fixes (datasource UID, clamp_min(..., 1e-9), pre-registered builder labels, trip-before-transition ordering) are all correctly landed with backing tests. No merge-blockers; a handful of medium/low observations worth a follow-up.

Issue #69 AC table

WS	Item	Status	Evidence
WS-7.2	overview dashboard	Met	`deploy/docker/grafana/dashboards/overview.json:1-175` — 7 panels referencing `aether_arbs_published_total`, `aether_executor_bundles_submitted_total/included_total`, `aether_eth_balance`, `aether_gas_price_gwei`, `aether_daily_pnl_eth`, `aether_executor_profit_wei_total`, `aether_executor_gas_spent_wei_total`
WS-7.2	latency dashboard	Met	`latency.json:1-68` — detection/sim/e2e p50/p95/p99 histograms
WS-7.2	builders dashboard	Met	`builders.json:1-86` — submission rate, success rate, p95 latency, totals table
WS-7.2	risk/state dashboard	Met	`risk.json:1-159` — state timeline + stat, breaker trips, risk rejections, PnL, balance
WS-7.2	provisioning	Met	`grafana/provisioning/dashboards/default.yml:1-14`, `datasources/prometheus.yml:1-11`
WS-7.3	Alertmanager + receiver	Met (Slack-only per team directive)	`docker-compose.yml:63-86`, `alertmanager.yml:1-28` — PagerDuty/Discord deliberately deferred
WS-7.3	Webhook-via-env, not committed	Met	`alertmanager.yml:14` uses `__SLACK_WEBHOOK_URL__` placeholder; sed substitution at container start; `.env.example:24-28` documents it
WS-7.4	7 alerts	Met	`prometheus/alerts.yml:1-74` defines all 7: `AetherHalted`, `AetherInclusionRateLow`, `AetherE2ELatencyHigh`, `AetherNoOpportunities`, `AetherETHBalanceLow`, `AetherGasHigh`, `AetherBuilderDown`
WS-7.4	Prometheus loads rules	Met	`deploy/docker/prometheus.yml:5-6` declares `rule_files: [/etc/prometheus/alerts.yml]`; mounted via `docker-compose.yml:57`
WS-7.4	Prometheus talks to Alertmanager	Met	`prometheus.yml:8-11` static_configs target `alertmanager:9093`; both on `aether-net`

Round-2 verification

Datasource UID — Verified. grafana/provisioning/datasources/prometheus.yml:6 sets uid: prometheus, matching datasource.uid: "prometheus" across every panel of all four dashboards (overview.json:16,45,74,100,126,141,165, latency.json:16,29,44, builders.json:16,32,56,72, risk.json:16,48,64,80,106,132).
clamp_min(..., 1e-9) — Verified. prometheus/alerts.yml:19 and overview.json:49 both use 1e-9. The only remaining clamp_min is builders.json:36 using 0.0001 — intentional for the panel's percentage-ratio guard and is a dashboard, not an alert. Correct.
PreRegisterBuilderLabels — Verified.
- cmd/executor/metrics.go:159-164: iterates names and, for each one, calls WithLabelValues(name, "success").Add(0) and WithLabelValues(name, "failure").Add(0).
- Wired at cmd/executor/submitter.go:78-87 — NewSubmitter builds names slice and calls PreRegisterBuilderLabels(names) before fan-out starts.
- Race-safe: CounterVec.WithLabelValues is documented goroutine-safe; NewSubmitter runs once at startup.
- TestPreRegisterBuilderLabels_BothSeriesExistAtZero (metrics_test.go:256-302) asserts both series on in-process counter and on the scraped /metrics text — exactly the property AetherBuilderDown depends on.
notifyTrip hoist — Verified. internal/risk/manager.go:154-165: OnCircuitBreakerTrip(reason) runs BEFORE state.Transition(newState); OnStateChange(newState) stays AFTER and only fires on successful transition. Function is held under rm.mu (call sites RecordRevert:324, RecordTrade:354 both take rm.mu.Lock()). Observer contract requires non-blocking callbacks; executorMetricsObserver (cmd/executor/main.go:438-446) only touches lock-free Prometheus primitives. TestMetricsObserver_TripCountedEvenWhenTransitionRejected (manager_test.go:693-742) forces StateHalted, drains initial observer, triggers 3 bug reverts, asserts exactly one consecutive_bug_reverts trip AND zero new state changes.

Must-fix blockers

None. This PR is mergeable.

Should-fix nits (follow-up PR territory)

MEDIUM — Slack webhook leaks into docker inspect. docker-compose.yml:79 uses ${SLACK_WEBHOOK_URL} inside the entrypoint shell block. Compose interpolates ${VAR} in block scalars at parse time, so the literal URL ends up baked into the container's entrypoint ARGV and is visible via docker inspect aether-alertmanager. The environment: block already propagates the variable into the container env; the entrypoint can use $$SLACK_WEBHOOK_URL (escaped $) so expansion happens inside the container shell at runtime, keeping the secret out of docker inspect / compose logs. It is also worth moving away from the | sed delimiter to a substitution that survives sed-special characters (&, \) in the URL.
MEDIUM — Empty SLACK_WEBHOOK_URL → silent alerting loss. If the env var is unset, sed substitutes an empty string and alertmanager.yml:14 becomes api_url: . Alertmanager fails config load and the container crashloops under restart: unless-stopped. There is no watchdog alert for "Alertmanager is down" — Prometheus doesn't currently scrape Alertmanager (prometheus.yml:13-19). Add either (a) an Alertmanager scrape job + up{job="alertmanager"} == 0 alert rule, or (b) a startup guard in the entrypoint that logs loudly and refuses to start on empty webhook.
MEDIUM — AetherNoOpportunities fires on cold start / low-activity windows. alerts.yml:36-43 fires when rate(aether_arbs_published_total[10m]) * 60 < 5 for 10m. A freshly-started bot produces zero published arbs for the first window; this pages ~20 min after every restart. Consider unless on() (time() - process_start_time_seconds < 1800) to suppress warm-up, or gate on aether_blocks_processed_total > 0.
MEDIUM — AetherBuilderDown behavior for an Enabled: false builder. Pre-registration covers a configured-but-never-attempted builder. If config/builders.yaml lists a builder with Enabled: false, pre-registration still runs for it (good), rate(...) is zero for both success and total, and the and on (builder) guard sees the zero total and correctly does NOT fire. Worth documenting, since an operator may later wonder why the alert stays silent for a disabled builder.
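
Option (b) from the empty-webhook item, a guard that refuses to produce a config when the webhook is missing, could look roughly like the sketch below. It is written in Go for illustration only; the PR performs the substitution with `sed` in the container entrypoint, and `renderConfig` and `errEmptyWebhook` are hypothetical names. The URLs are made-up placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// placeholder is the token used in the committed alertmanager.yml.
const placeholder = "__SLACK_WEBHOOK_URL__"

var errEmptyWebhook = errors.New("SLACK_WEBHOOK_URL is empty; refusing to render Alertmanager config")

// renderConfig replaces the placeholder with the webhook URL.
// It fails loudly instead of writing a config with an empty api_url.
func renderConfig(tmpl, webhook string) (string, error) {
	if strings.TrimSpace(webhook) == "" {
		return "", errEmptyWebhook
	}
	if !strings.Contains(tmpl, placeholder) {
		return "", fmt.Errorf("placeholder %s not found in template", placeholder)
	}
	// ReplaceAll treats both arguments literally, so characters such as
	// '|', '&' or '\' in the URL need no escaping.
	return strings.ReplaceAll(tmpl, placeholder, webhook), nil
}

func main() {
	tmpl := "api_url: " + placeholder + "\n"

	if _, err := renderConfig(tmpl, ""); err != nil {
		fmt.Println("startup guard:", err)
	}

	out, err := renderConfig(tmpl, "https://hooks.example.invalid/services/T000/B000/XXXX")
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Print(out)
}
```

A literal string replacement also sidesteps the delimiter and special-character concerns raised for the `sed` approach, at the cost of needing a small helper binary or script in the image.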

Low / nits

LOW — AetherHalted hard-codes aether_system_state == 3 (alerts.yml:7). The mapping lives in cmd/executor/main.go:450-463 (stateToInt), the metric help text (metrics.go:75), and risk.json:25-28,139-143. A future renumber will miss one. Either publish as aether_system_state{state="halted"} 1/0 or at minimum add a cross-reference comment.
LOW — Compose uses :latest tags for grafana/prometheus/alertmanager. Silent breaking upgrades on docker compose pull. Pin to known-good majors and bump explicitly.
LOW — inhibit_rules (alertmanager.yml:25-28) is nearly a no-op since each alert has a unique alertname. Not a bug, just note the design intent matches operator expectation ("same alertname at critical suppresses its warning sibling" — not true for these alerts).
NIT — TestRecordBuilderResult_ScrapeLabels (metrics_test.go:224-254) uses real builder names (flashbots, titan) which pollute the global registry. TestPreRegisterBuilderLabels_BothSeriesExistAtZero already avoids this by using unique prereg_alpha/prereg_beta. Apply the same pattern to the older test when you touch it.
NIT — metrics.go:180 logs a "precision loss" warning on every profit over 2^53 wei. Since 2^53 wei is only about 0.009 ETH, the cumulative profitTotalWei.Add(f) will cross that threshold early in the bot's lifetime, producing ongoing log spam. Downgrade to debug or sample.
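
For the precision-loss nit, the 2^53 boundary is easy to check in isolation. The sketch below is a standalone illustration, not code from `metrics.go`; `exactInFloat64` is a hypothetical helper, and the ETH conversion assumes the usual 10^18 wei per ETH.

```go
package main

import "fmt"

// exactInFloat64 reports whether a wei amount survives a round trip
// through float64 without changing value.
func exactInFloat64(wei uint64) bool {
	return uint64(float64(wei)) == wei
}

func main() {
	const limit uint64 = 1 << 53 // largest power of two below which every integer is exact

	fmt.Println("2^53 exact:    ", exactInFloat64(limit))
	fmt.Println("2^53 + 1 exact:", exactInFloat64(limit+1))

	// Express the boundary in ETH to see how quickly it is reached.
	fmt.Printf("2^53 wei is about %.6f ETH\n", float64(limit)/1e18)
}
```

Above the boundary only some integers are representable, so individual wei can be lost when a value is converted to float64 for a Prometheus counter; at these magnitudes the relative error is tiny, which is why a quieter log level is a reasonable response.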

Can-defer (follow-up tickets)

Alertmanager scrape + up{job="alertmanager"} watchdog alert (addresses the silent-outage gap above).
PagerDuty + Discord receivers (explicitly deferred per team directive; file a separate ticket to track).
State-gauge label migration.
Docker image pinning across observability stack.

Verdict

APPROVE WITH RESERVATIONS — all four round-2 fixes landed correctly with tests that actually prove the property. Issue #69 WS-7.2/7.3/7.4 scope is fully delivered (Slack-only routing is an acknowledged deviation). No production-risk blockers. The medium items are worth a follow-up PR but do not need to block this merge. Ship it.

Non-duplicate bits from PR #98 landed on top of main after PR #83/#84 covered the core stack. Keeps main's dashboards + Slack-only alertmanager intact; drops PR #98's PagerDuty/Discord routing per team directive.

Webhook hygiene, alertmanager watchdog, cold-start gating, image pinning, inhibit rule fix, system_state cross-refs, test isolation, quieter precision log. See PR body for per-item detail.

0xfandom approved these changes Apr 14, 2026

Pablosinyores mentioned this pull request Apr 14, 2026

feat(observability): self-hosted log aggregation (loki + structured logs) #84

Merged

8 tasks

0xfandom mentioned this pull request Apr 14, 2026

feat(grpc-server): surface decoder drops via warn log + counter #85

Merged

9 tasks

Pablosinyores and others added 3 commits April 21, 2026 11:14

0xfandom force-pushed the feat/observability-pr1 branch from bc19ae0 to 33de991 Compare April 21, 2026 06:05

0xfandom mentioned this pull request Apr 21, 2026

feat(observability): distributed tracing to tempo + canary probe #86

Merged

8 tasks

0xfandom reviewed Apr 21, 2026

0xfandom merged commit 9ee202a into main Apr 21, 2026
4 checks passed

This was referenced Apr 21, 2026

chore(observability): land PR #84 content on main (post-stack recovery) #102

Closed

chore(observability): land PR #84 content on main (cherry-pick recovery) #103

Merged

0xfandom mentioned this pull request Apr 21, 2026

feat(observability): grafana dashboards, prometheus alerts, alertmanager routing #98

Merged

6 tasks

0xfandom mentioned this pull request Apr 21, 2026

fix(replay): deploy AetherExecutor to Anvil fork in e2e script #104

Merged

8 tasks

0xfandom merged 3 commits into main from feat/observability-pr1
