feat(observability): scale-to-zero alert + tile + catalog (#54) #55

Merged
mastermanas805 merged 1 commit into master from feat/scale-to-zero-observability on Jun 5, 2026

@mastermanas805

Rule 25 observability artifacts for scale-to-zero (Task #54). Covers instant_deploy_scaled_to_zero_total{outcome} + instant_deploy_idle_apps emitted by the worker idle-scaler.

  • Prom rules (instant-worker-deploy-scale-to-zero group): DeployScaleToZeroWakeFailed (wake_failed > 0 / 15m → P1: app stuck asleep), DeployScaleToZeroScaleDownFailures (scale_failed ≥ 5 / 30m → P2: savings not landing, no customer impact).
  • NR alert: deploy-scale-to-zero-fail.json (wake_failed → CRITICAL).
  • Dashboard: two tiles on instanode-reliability.json — asleep-apps billboard + scale-outcome stacked-bar.
  • METRICS-CATALOG.md: counter (lazy, 4 outcomes primed) + gauge (eager) rows.

All artifacts are INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED: the series stay at 0 and the alerts stay quiet until the feature is canaried on.
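
For reviewers who want the two Prom rule thresholds in executable form, here is a rough Go sketch of the decision logic only. It is not the rule file from this PR. It assumes that "wake_failed > 0 / 15m" means the counter's increase over a 15-minute window and that "scale_failed ≥ 5 / 30m" means its increase over a 30-minute window. The `Alert` and `WindowIncreases` types and the `Evaluate` function are hypothetical names invented for illustration.

```go
package main

import "fmt"

// Alert is a hypothetical type for this sketch; it is not part of the PR.
type Alert struct {
	Name     string
	Severity string
}

// WindowIncreases holds how much instant_deploy_scaled_to_zero_total grew
// per outcome over each rule's window (assumed reading of the thresholds).
type WindowIncreases struct {
	WakeFailed15m  float64 // outcome="wake_failed", last 15 minutes
	ScaleFailed30m float64 // outcome="scale_failed", last 30 minutes
}

// Evaluate returns the alerts that would fire for the given increases.
func Evaluate(w WindowIncreases) []Alert {
	var firing []Alert
	// Any wake failure means an app is stuck asleep: P1.
	if w.WakeFailed15m > 0 {
		firing = append(firing, Alert{Name: "DeployScaleToZeroWakeFailed", Severity: "P1"})
	}
	// Repeated scale-down failures cost savings but not customers: P2.
	if w.ScaleFailed30m >= 5 {
		firing = append(firing, Alert{Name: "DeployScaleToZeroScaleDownFailures", Severity: "P2"})
	}
	return firing
}

func main() {
	for _, a := range Evaluate(WindowIncreases{WakeFailed15m: 1, ScaleFailed30m: 4}) {
		fmt.Printf("%s (%s)\n", a.Name, a.Severity)
	}
}
```

With all increases at zero, which is the state while the feature flag is unset, `Evaluate` returns no alerts.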

Companion PRs

🤖 Generated with Claude Code

…#54)

Rule 25 artifacts for instant_deploy_scaled_to_zero_total{outcome} +
instant_deploy_idle_apps (emitted by worker/internal/jobs/deploy_idle_scaler.go).

- Prom rules (instant-worker-deploy-scale-to-zero group):
  DeployScaleToZeroWakeFailed (wake_failed > 0 / 15m → P1: app stuck asleep),
  DeployScaleToZeroScaleDownFailures (scale_failed >= 5 / 30m → P2: savings not
  landing, no customer impact).
- NR alert: deploy-scale-to-zero-fail.json (wake_failed → CRITICAL).
- Dashboard: two tiles on instanode-reliability.json — asleep-apps billboard +
  scale-outcome stacked-bar.
- METRICS-CATALOG.md: rows for the counter (lazy, 4 outcomes primed) + the
  gauge (eager).

All INERT until an operator sets DEPLOY_SCALE_TO_ZERO_ENABLED — the series stay
0 and the alerts stay quiet until the feature is canaried on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mastermanas805 merged commit 1b6c43e into master on Jun 5, 2026
3 checks passed
mastermanas805 added a commit that referenced this pull request Jun 5, 2026
…tile + catalog (Task #55) (#56)

Wires monitoring for instant_resource_count_limit_blocked_total{service,team_tier}
(api), the metric emitted when the per-service resource-COUNT cap rejects a
provision with 402. Closes the rule-25 gap for Task #55's metric:

- newrelic/alerts/resource-count-limit-blocked.json — P2 (abuse/observability),
  WARN on > 20 blocks/h per service+tier (derivative over 1h).
- k8s/prometheus-rules.yaml — ResourceCountCapBlocked rule (instant-api group).
- newrelic/dashboards/instanode-reliability.json — stacked-bar tile by
  service+tier.
- observability/METRICS-CATALOG.md — catalog row (lazy CounterVec; INERT until
  RESOURCE_COUNT_CAPS_ENABLED).

All artifacts are inert until the operator enables the api flag — the counter
has zero series until the first over-cap rejection.
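
The WARN condition above (more than 20 blocks per hour for a service+tier pair) can be sketched in Go as a per-series threshold check. This is only an illustration, not the New Relic or Prometheus definition from the commit. The `SeriesKey` type, the `OverThreshold` function, and the label values in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// SeriesKey mirrors the {service,team_tier} labels of
// instant_resource_count_limit_blocked_total. Hypothetical type.
type SeriesKey struct {
	Service  string
	TeamTier string
}

// warnBlocksPerHour is the threshold quoted in the commit message.
const warnBlocksPerHour = 20

// OverThreshold returns the series whose block count over the last hour
// is strictly greater than the threshold, sorted for stable output.
func OverThreshold(blocksLastHour map[SeriesKey]float64) []SeriesKey {
	var out []SeriesKey
	for key, n := range blocksLastHour {
		if n > warnBlocksPerHour {
			out = append(out, key)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Service != out[j].Service {
			return out[i].Service < out[j].Service
		}
		return out[i].TeamTier < out[j].TeamTier
	})
	return out
}

func main() {
	// Label values below are made up for the example.
	counts := map[SeriesKey]float64{
		{Service: "postgres", TeamTier: "free"}: 21,
		{Service: "redis", TeamTier: "free"}:    20,
	}
	for _, k := range OverThreshold(counts) {
		fmt.Printf("WARN %s/%s\n", k.Service, k.TeamTier)
	}
}
```

An empty map, which corresponds to the counter having zero series before the first over-cap rejection, yields no warnings.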

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>