The observability pipeline is globally tagged (Alloy adds `app`, `instance`, and `environment` to every Prometheus series at scrape; the OTEL resource sets `deployment.environment` on every span), but the dashboard and one code path aren't using that tagging well.
1. `hover-overview.json` has no `app` variable
`grafana/dashboards/hover-overview.json` has no template variable named `app`, and every panel's `expr` queries unfiltered, e.g.:
- `bee_db_pool_in_use`
- `bee_db_pressure_ema_ms_milliseconds`
- `avg(go_memstats_heap_inuse_bytes{ })`
- `histogram_quantile(0.95, sum by (le) (rate(bee_db_semaphore_wait_ms_milliseconds_bucket[$__rate_interval])))`
So every panel aggregates across API prod + worker prod + `hover-pr-N` + `hover-worker-pr-N` simultaneously. We can't isolate a PR preview, can't compare API vs worker, and can't diff prod vs staging.
Fix:
- Add a multi-value template variable `app` with query `label_values(bee_db_pool_in_use, app)` (or a similar base metric that every app emits) and `All` enabled.
- Add a template variable `environment` with query `label_values(bee_db_pool_in_use, environment)` so prod/staging can be split.
- Update every panel's `expr` to include `{app=~"$app", environment=~"$environment"}` (preserving existing label filters).
- To keep the v2 schema valid, round-trip via the Grafana UI ("Export as code → Save to Git") rather than hand-editing the JSON.
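Once the panels are updated, a small check could catch any `expr` that was missed. The sketch below is only an illustration: `requiredMatchers` and `missingMatchers` are hypothetical names, and it does a plain substring check on each expression string rather than parsing PromQL, so a query with several selectors passes as soon as one of them carries the matchers. It also assumes the expressions have already been pulled out of the dashboard JSON by some other means.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredMatchers are the label matchers every panel expression should carry.
var requiredMatchers = []string{`app=~"$app"`, `environment=~"$environment"`}

// missingMatchers reports which required matchers do not appear in expr.
// It is a substring check after stripping spaces, not a PromQL parser.
func missingMatchers(expr string) []string {
	compact := strings.ReplaceAll(expr, " ", "")
	var missing []string
	for _, m := range requiredMatchers {
		if !strings.Contains(compact, m) {
			missing = append(missing, m)
		}
	}
	return missing
}

func main() {
	exprs := []string{
		`bee_db_pool_in_use`,
		`avg(go_memstats_heap_inuse_bytes{app=~"$app", environment=~"$environment"})`,
	}
	for _, e := range exprs {
		fmt.Printf("%q is missing %d matcher(s)\n", e, len(missingMatchers(e)))
	}
}
```

Run against the current dashboard, every expression quoted above would be reported as missing both matchers.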
2. Dead panels: `node_*` and `pg_stat_*`
Several panels query metrics we never emit:
- `node_cpu_seconds_total`, `node_load1`, `node_network_receive_bytes_total`: these would come from `node_exporter`, which we don't run on Fly VMs.
- `pg_stat_bgwriter_checkpoints_req_total` and similar: these would come from `postgres_exporter` scraping Supabase, which we don't run either.
They permanently show "No data" and add noise when scanning the dashboard. Pick one:
- Delete them, or
- Stand up `node_exporter` as a Fly sidecar next to Alloy (same pattern as the metrics agent) and a Supabase query exporter for Postgres stats.
Deletion is probably the right call unless anyone actively wants VM-level CPU/load.
3. OTEL `service.name` is a static string
`cmd/worker/main.go:54` hard-codes `ServiceName: "hover-worker"`, and the API equivalent uses `"hover"`. Every trace from every worker review app and prod gets `service.name=hover-worker` — no way to tell PR #342's traces from prod or from PR #340.
Fix: derive from `FLY_APP_NAME` with a fallback:
```go
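// Prefer the Fly app name so review apps and prod report distinct service names.
serviceName := strings.TrimSpace(os.Getenv("FLY_APP_NAME"))
if serviceName == "" {
	serviceName = "hover-worker" // fallback when FLY_APP_NAME is unset
}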
```
Apply in both `cmd/worker/main.go` and the API's `observability.Init` call site. `environment` is already being set from `APP_ENV`.
Scope / sequencing
Item 1 is the highest-leverage: it unlocks the entire dashboard. Item 3 is a tiny one-liner worth bundling. Item 2 is a housekeeping sweep that can go either way.
Probably two PRs:
- PR A: items 1 + 2 (dashboard refactor)
- PR B: item 3 (service.name derivation — one-liner in each main.go)
Context: this came out of validating PR #342's observability. The Prometheus pipeline is correct and tagging `app=hover-worker-pr-342, environment=staging` — we just can't see it in the default dashboard because the queries don't use the label.