Context
Follow-up to #388. That PR silences fly-autoscaler's empty prometheus result error by wrapping the PromQL in or on() vector(0). The underlying cause is that the broker gauges go sparse in Fly's managed Prometheus during idle, and the PR papers over it at the autoscaler layer.
Root cause
OTel Go SDK synchronous Int64Gauge / Float64Gauge use LastValue aggregation. The aggregator only emits a sample at collect time if Record() was called at least once during the collect interval. During long idle periods the broker probe stops landing Record() calls inside every collect window (whether because of the anySuccess gate at internal/broker/probe.go:150, an empty ActiveJobIDs path that doesn't fall through as cleanly as expected, or other early returns), so the Prom series goes stale.
Grafana Cloud confirms the pattern across all three queried metrics — visible activity during work, ~20+ hours of empty/sparse data during idle, fresh instant value once the worker restarts.
Affected instruments (all in internal/observability/observability.go:959-1006)
brokerStreamLengthGauge — Int64Gauge
brokerScheduledDepthGauge — Int64Gauge
brokerConsumerPendingGauge — Int64Gauge
brokerUnclampedScaleTargetGauge — Float64Gauge
brokerOutboxBacklogGauge — Int64Gauge
brokerOutboxAgeGauge — Float64Gauge
Proposed fix
- Replace the six synchronous gauges with
Int64ObservableGauge / Float64ObservableGauge.
- Maintain the latest values in atomic state (e.g.
atomic.Int64 per (metric, stream_type) key).
- Register a single batch callback with
meter.RegisterCallback(...) that reads the atomic state and emits to the observer at every collect.
- Update
RecordBrokerStreamStats, RecordBrokerOutbox, etc. to write to atomic state instead of calling Gauge.Record().
- Once the metric is continuous, the
or on() vector(0) wrapper in fly.autoscaler-*.toml can be reverted to restore the Redis-outage series-gap behaviour described in internal/broker/probe.go:130.
Verification
flyctl ssh console -a hover-worker → curl -s localhost:9464/metrics | grep bee_broker_scheduled_zset_depth during an idle window should always return a fresh value.
- Grafana Cloud time-series for the six metrics should show continuous lines (flat at 0 during idle, no gaps).
Estimated effort
~40–60 lines plus tests. The trickier piece is preserving the anySuccess-gated behaviour: probe success/failure should still influence whether the autoscaler sees a gap during a Redis outage with active jobs. Likely solved by exposing a lastSuccessAt time.Time and short-circuiting the callback if the last successful probe is too stale.
Context
Follow-up to #388. That PR silences
fly-autoscaler'sempty prometheus resulterror by wrapping the PromQL inor on() vector(0). The underlying cause is that the broker gauges go sparse in Fly's managed Prometheus during idle, and the PR papers over it at the autoscaler layer.Root cause
OTel Go SDK synchronous
Int64Gauge/Float64GaugeuseLastValueaggregation. The aggregator only emits a sample at collect time ifRecord()was called at least once during the collect interval. During long idle periods the broker probe stops landingRecord()calls inside every collect window (whether because of theanySuccessgate atinternal/broker/probe.go:150, an emptyActiveJobIDspath that doesn't fall through as cleanly as expected, or other early returns), so the Prom series goes stale.Grafana Cloud confirms the pattern across all three queried metrics — visible activity during work, ~20+ hours of empty/sparse data during idle, fresh instant value once the worker restarts.
Affected instruments (all in
internal/observability/observability.go:959-1006)brokerStreamLengthGauge—Int64GaugebrokerScheduledDepthGauge—Int64GaugebrokerConsumerPendingGauge—Int64GaugebrokerUnclampedScaleTargetGauge—Float64GaugebrokerOutboxBacklogGauge—Int64GaugebrokerOutboxAgeGauge—Float64GaugeProposed fix
Int64ObservableGauge/Float64ObservableGauge.atomic.Int64per(metric, stream_type)key).meter.RegisterCallback(...)that reads the atomic state and emits to the observer at every collect.RecordBrokerStreamStats,RecordBrokerOutbox, etc. to write to atomic state instead of callingGauge.Record().or on() vector(0)wrapper infly.autoscaler-*.tomlcan be reverted to restore the Redis-outage series-gap behaviour described ininternal/broker/probe.go:130.Verification
flyctl ssh console -a hover-worker→curl -s localhost:9464/metrics | grep bee_broker_scheduled_zset_depthduring an idle window should always return a fresh value.Estimated effort
~40–60 lines plus tests. The trickier piece is preserving the
anySuccess-gated behaviour: probe success/failure should still influence whether the autoscaler sees a gap during a Redis outage with active jobs. Likely solved by exposing alastSuccessAt time.Timeand short-circuiting the callback if the last successful probe is too stale.