Silence autoscaler empty-prom error #388
Bump pgx/v5 for memory-safety CVE
Updates to Preview Branch (work/autoscaler-promql-vector0)
Tasks are run on every commit, but only new migration files are pushed.
View logs for this Workflow Run.
Release Versions: App patch
Changelog: Fixed, Security
🧹 Nitpick comments (1)
fly.autoscaler-worker.toml (1)
35-38: Add a dedicated outage signal now that gaps are masked.
Because lines 35-38 intentionally convert outage-like gaps into 0, consider alerting on a direct Redis health metric (or broker probe heartbeat) so outages remain immediately visible, independent of backlog maths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fly.autoscaler-worker.toml` around lines 35-38: the backlog masking currently converts outage-like gaps to 0, which hides Redis outages. Add a dedicated outage signal by instrumenting a direct Redis/broker health metric (e.g., export a redis_up gauge or a broker_probe_heartbeat TTL counter) alongside the existing autoscaler backlog logic, then add an alerting rule that fires when redis_up == 0 or broker_probe_heartbeat has not been updated for X seconds. Keep the existing gap-to-0 behavior in the autoscaler (the "gap masking" logic), but ensure monitoring/alerting uses the new redis_up/broker_probe_heartbeat metric so outages remain immediately visible and actionable.
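The suggested alerting could look roughly like the following Prometheus rule file. This is a sketch only: the metric names `redis_up` and `broker_probe_heartbeat_timestamp_seconds` are assumptions from the review comment, not metrics that exist in this repo.

```yaml
groups:
  - name: broker-outage
    rules:
      # Fires when the (hypothetical) Redis health gauge reports down.
      - alert: RedisDown
        expr: redis_up == 0
        for: 2m
        annotations:
          summary: "Redis is unreachable; backlog gauges are masked to 0."
      # Fires when the (hypothetical) broker probe heartbeat goes stale.
      - alert: BrokerProbeStale
        expr: time() - broker_probe_heartbeat_timestamp_seconds > 120
        for: 2m
        annotations:
          summary: "Broker probe heartbeat not updated for over 2 minutes."
```

With rules like these, the autoscaler can keep its gap-to-0 masking while outage visibility comes from a signal that does not depend on backlog values.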
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 223c7f38-d49c-4908-aa13-663fd2f24d2d
⛔ Files ignored due to path filters (2)
- go.sum is excluded by !**/*.sum
- webflow-designer-extension-cli/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (5)
- CHANGELOG.md
- fly.autoscaler-analysis.toml
- fly.autoscaler-worker.toml
- go.mod
- webflow-designer-extension-cli/package.json
Codecov Report: ✅ All modified and coverable lines are covered by tests.
🐝 Review App Deployed. Homepage: https://hover-pr-388.fly.dev
Summary
Wraps the fly-autoscaler PromQL in both `fly.autoscaler-worker.toml` and `fly.autoscaler-analysis.toml` with `or on() vector(0)`, so an empty result collapses to zero rather than logging `metrics collection failed: empty prometheus result` once per minute.
The queried broker gauges (`bee_broker_stream_length`, `bee_broker_scheduled_zset_depth`) are synchronous OTel `Int64Gauge`s — `LastValue` aggregation only emits a sample when `Record()` is called inside a collect interval, so the series goes stale in Fly's managed Prometheus during idle. Grafana Cloud confirms the same gappy pattern across all three queried metrics.
Trade-off (documented inline + in CHANGELOG)
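This trade-off follows from the `or on() vector(0)` wrapper itself. A minimal sketch of the pattern (the surrounding expression is assumed, since the diff is not shown here; only the metric name comes from this PR):

```promql
# Assumed shape of the wrapped backlog query, not verbatim from the PR.
# Without the wrapper, a stale gauge yields an empty instant vector and
# fly-autoscaler logs "metrics collection failed: empty prometheus result".
sum(bee_broker_stream_length) or on() vector(0)
```

`vector(0)` produces a single sample with value 0 and no labels; with `on()`, the `or` keeps that sample only when the left-hand side returns nothing, so an empty result becomes a literal 0 instead of a gap.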
A genuine Redis outage previously produced a series gap (the autoscaler holds the machine count). With this change the gap collapses to 0, so the autoscaler scales to MIN=1. This is acceptable because idle workers can't crawl during an outage anyway and restart cleanly once Redis recovers.
The proper fix — converting the broker gauges to async `Int64ObservableGauge`s so they always emit at every collect — will be tracked in a follow-up issue.
Test plan
- Watch `flyctl logs -a hover-autoscaler-worker` and `-a hover-autoscaler-analysis` for ~30 min during idle — `empty prometheus result` log lines should drop to zero.
- `flyctl status -a hover-worker` still shows 1 started machine; `-a hover-analysis` still shows 1 started.
Need help on this PR? Tag @codesmith with what you need.
Summary by CodeRabbit
Bug Fixes
Security
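As context for the follow-up fix mentioned in the description (converting the synchronous gauges to async observable gauges), here is a small self-contained model of the staleness mechanism. The class names are hypothetical stand-ins, not the OpenTelemetry SDK API.

```python
# Minimal model (hypothetical classes, not the OTel SDK) of why a
# synchronous last-value gauge goes stale between collects while an
# observable (callback-based) gauge emits on every collect.

class SyncGauge:
    """Emits a sample only if record() was called since the last collect."""
    def __init__(self):
        self._pending = None

    def record(self, value):
        self._pending = value

    def collect(self):
        sample, self._pending = self._pending, None
        return sample  # None -> no sample -> empty Prometheus result


class ObservableGauge:
    """Pulls the current value via a callback on every collect."""
    def __init__(self, callback):
        self._callback = callback

    def collect(self):
        return self._callback()


backlog = [1, 2, 3]
sync = SyncGauge()
obs = ObservableGauge(lambda: len(backlog))

sync.record(len(backlog))
print(sync.collect())  # 3 -- a sample was recorded this interval
print(sync.collect())  # None -- idle interval, the series goes stale
print(obs.collect())   # 3
print(obs.collect())   # 3 -- always emits, no empty intervals
```

An async observable gauge sidesteps the problem because the exporter pulls a fresh value on every collect, so there are never empty intervals for the PromQL wrapper to mask.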