Per-app filtering + cleanup in Grafana dashboards + traces #343

@simonsmallchua

The observability pipeline is tagged globally (Alloy adds app, instance, and environment to every Prometheus series at scrape time; the OTEL resource sets deployment.environment on every span), but the dashboard and one code path don't make use of that tagging.

1. hover-overview.json has no app variable

grafana/dashboards/hover-overview.json has no template variable named app, and every panel's expr queries unfiltered — e.g.:

  • bee_db_pool_in_use
  • bee_db_pressure_ema_ms_milliseconds
  • avg(go_memstats_heap_inuse_bytes{ })
  • histogram_quantile(0.95, sum by (le) (rate(bee_db_semaphore_wait_ms_milliseconds_bucket[$__rate_interval])))

So every panel aggregates across API prod + worker prod + hover-pr-N + hover-worker-pr-N simultaneously. Can't isolate a PR preview, can't compare API vs worker, can't diff prod vs staging.

Fix:

  • Add a multi-value template variable app with query label_values(bee_db_pool_in_use, app) (or similar base metric that every app emits) and `All` enabled.
  • Add a template variable environment with query label_values(bee_db_pool_in_use, environment) so prod/staging can be split.
  • Update every panel's expr to include {app=~"$app", environment=~"$environment"} (preserving existing label filters).
  • Round-trip via Grafana UI "Export as code → Save to Git" rather than hand-editing JSON to keep the v2 schema valid.

2. Dead panels: node_* and pg_stat_*

Several panels query metrics we never emit:

  • node_cpu_seconds_total, node_load1, node_network_receive_bytes_total — would come from node_exporter; we don't run it on Fly VMs.
  • pg_stat_bgwriter_checkpoints_req_total and similar — would come from postgres_exporter scraping Supabase; we don't run that either.

They show No data permanently and add noise when scanning the dashboard. Pick one:

  • Delete them, or
  • Stand up node_exporter as a Fly sidecar next to Alloy (same pattern as the metrics agent) and a Postgres exporter pointed at Supabase for Postgres stats.

Deletion is probably the right call unless anyone actively wants VM-level CPU/load.

3. OTEL `service.name` is a static string

`cmd/worker/main.go:54` hard-codes `ServiceName: "hover-worker"`, and the API equivalent uses `"hover"`. Every trace from every worker review app and prod gets `service.name=hover-worker` — no way to tell PR #342's traces from prod or from PR #340.

Fix: derive from `FLY_APP_NAME` with a fallback:

```go
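// Prefer the Fly app name so each review app gets its own service.name.
serviceName := strings.TrimSpace(os.Getenv("FLY_APP_NAME"))
if serviceName == "" {
	serviceName = "hover-worker" // fallback when FLY_APP_NAME is unset or blank
}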
```

Apply in both `cmd/worker/main.go` and the API's `observability.Init` call site. `environment` is already being set from `APP_ENV`.

Scope / sequencing

Item 1 is the highest-leverage: it unlocks the entire dashboard. Item 3 is a tiny change worth bundling. Item 2 is a housekeeping sweep that can go either way.

Probably two PRs:

  • PR A: items 1 + 2 (dashboard refactor)
  • PR B: item 3 (service.name derivation — one-liner in each main.go)

Context: this came out of validating PR #342's observability. The Prometheus pipeline is correct and tagging `app=hover-worker-pr-342, environment=staging` — we just can't see it in the default dashboard because the queries don't use the label.

Labels: enhancement
