
Convert bee_broker_* gauges to async Int64ObservableGauge #389


Context

Follow-up to #388. That PR silences fly-autoscaler's "empty Prometheus result" error by wrapping the PromQL in or on() vector(0). The underlying cause is that the broker gauges go sparse in Fly's managed Prometheus during idle periods; the PR only papers over that at the autoscaler layer.

Root cause

The OTel Go SDK's synchronous Int64Gauge / Float64Gauge use LastValue aggregation, which only emits a sample at collect time if Record() was called at least once during the collect interval. During long idle periods the broker probe stops landing Record() calls inside every collect window (whether because of the anySuccess gate at internal/broker/probe.go:150, an empty ActiveJobIDs path that doesn't fall through as cleanly as expected, or some other early return), so the Prometheus series goes stale.
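
For reference, the synchronous pattern that goes stale looks roughly like this (a sketch; the instrument name and attribute key are illustrative, not copied from observability.go):

```go
package observability

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordStreamLength sketches the current synchronous wiring. LastValue
// aggregation only produces a sample for a collect interval in which
// Record() actually ran: no Record() call, no sample, sparse series.
func recordStreamLength(ctx context.Context, gauge metric.Int64Gauge, streamType string, length int64) {
	// Only reached while the probe is active and anySuccess passes;
	// during long idle windows this never runs, so the series goes stale.
	gauge.Record(ctx, length,
		metric.WithAttributes(attribute.String("stream_type", streamType)))
}
```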

Grafana Cloud confirms the pattern across all three queried metrics: visible activity while work is running, 20+ hours of empty/sparse data during idle, and a fresh instant value once the worker restarts.

Affected instruments (all in internal/observability/observability.go:959-1006)

  • brokerStreamLengthGauge (Int64Gauge)
  • brokerScheduledDepthGauge (Int64Gauge)
  • brokerConsumerPendingGauge (Int64Gauge)
  • brokerUnclampedScaleTargetGauge (Float64Gauge)
  • brokerOutboxBacklogGauge (Int64Gauge)
  • brokerOutboxAgeGauge (Float64Gauge)

Proposed fix

  1. Replace the six synchronous gauges with Int64ObservableGauge / Float64ObservableGauge.
  2. Maintain the latest values in atomic state (e.g. one atomic.Int64 per (metric, stream_type) key).
  3. Register a single batch callback with meter.RegisterCallback(...) that reads the atomic state and emits to the observer on every collect (see the sketch after this list).
  4. Update RecordBrokerStreamStats, RecordBrokerOutbox, etc. to write to the atomic state instead of calling Gauge.Record().
  5. Once the metrics are continuous, the or on() vector(0) wrapper in fly.autoscaler-*.toml can be reverted, restoring the Redis-outage series-gap behaviour described in internal/broker/probe.go:130.
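
A minimal sketch of the observable-gauge shape, showing two of the six instruments. The brokerGaugeState type, its method names, and the "default" stream_type value are assumptions for illustration, and bee_broker_stream_length is a guess at the exported name:

```go
package observability

import (
	"context"
	"sync/atomic"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// brokerGaugeState holds the latest probe readings. The probe writes,
// the collect-time callback reads; atomics keep that race-free.
type brokerGaugeState struct {
	streamLength   atomic.Int64
	scheduledDepth atomic.Int64
}

// RecordStreamStats replaces the old Gauge.Record() calls: it just
// stores the latest values for the callback to pick up.
func (s *brokerGaugeState) RecordStreamStats(length, depth int64) {
	s.streamLength.Store(length)
	s.scheduledDepth.Store(depth)
}

func registerBrokerGauges(meter metric.Meter, state *brokerGaugeState) error {
	streamLen, err := meter.Int64ObservableGauge("bee_broker_stream_length")
	if err != nil {
		return err
	}
	schedDepth, err := meter.Int64ObservableGauge("bee_broker_scheduled_zset_depth")
	if err != nil {
		return err
	}

	// One batch callback covers all instruments. It runs on every
	// collect, so each scrape gets a fresh sample even when the probe
	// has not recorded anything since the last interval.
	_, err = meter.RegisterCallback(
		func(_ context.Context, o metric.Observer) error {
			attrs := metric.WithAttributes(attribute.String("stream_type", "default"))
			o.ObserveInt64(streamLen, state.streamLength.Load(), attrs)
			o.ObserveInt64(schedDepth, state.scheduledDepth.Load(), attrs)
			return nil
		},
		streamLen, schedDepth,
	)
	return err
}
```

Because the callback fires on every collect, the exported series stays continuous through idle periods, which is exactly the property the synchronous gauges lack.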

Verification

  • flyctl ssh console -a hover-worker, then curl -s localhost:9464/metrics | grep bee_broker_scheduled_zset_depth: even during an idle window this should always return a fresh value.
  • Grafana Cloud time-series for the six metrics should show continuous lines (flat at 0 during idle, no gaps).

Estimated effort

~40–60 lines plus tests. The trickier piece is preserving the anySuccess-gated behaviour: probe success/failure should still influence whether the autoscaler sees a gap during a Redis outage with active jobs. Likely solved by exposing a lastSuccessAt time.Time and short-circuiting the callback if the last successful probe is too stale.
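
One possible shape for that gate, assuming the probe loop can call a hypothetical MarkProbeSuccess() and that a couple of minutes is an acceptable staleness threshold:

```go
package observability

import (
	"context"
	"sync/atomic"
	"time"

	"go.opentelemetry.io/otel/metric"
)

// lastSuccessUnixNano is written by the probe after each anySuccess
// pass and read by the collect callback. (Hypothetical name.)
var lastSuccessUnixNano atomic.Int64

// MarkProbeSuccess would be called from the probe loop on success.
func MarkProbeSuccess() {
	lastSuccessUnixNano.Store(time.Now().UnixNano())
}

// maxProbeStaleness is an assumed threshold; it should comfortably
// exceed the probe interval so healthy operation never trips it.
const maxProbeStaleness = 2 * time.Minute

// staleGuardedCallback wraps the batch callback: if the last successful
// probe is too old, observe nothing, so Prometheus sees a series gap
// and the Redis-outage behaviour from probe.go:130 is preserved.
func staleGuardedCallback(inner metric.Callback) metric.Callback {
	return func(ctx context.Context, o metric.Observer) error {
		last := time.Unix(0, lastSuccessUnixNano.Load())
		if time.Since(last) > maxProbeStaleness {
			return nil // skip observing; the series gaps
		}
		return inner(ctx, o)
	}
}
```

Wrapping the inner callback keeps the gap semantics in one place: during a Redis outage with active jobs the probe stops succeeding, the gate trips, and the autoscaler sees the same missing-series signal as before.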
