Skip to content

obs(deploy): alert + dashboard + catalog for runtime start-failure metric#67

Merged
mastermanas805 merged 1 commit into
masterfrom
obs/deploy-runtime-failed-detected
Jun 8, 2026
Merged

obs(deploy): alert + dashboard + catalog for runtime start-failure metric#67
mastermanas805 merged 1 commit into
masterfrom
obs/deploy-runtime-failed-detected

Conversation

@mastermanas805

Copy link
Copy Markdown
Member

Rule-25 observability for the new instant_deploy_runtime_failed_detected_total{reason} counter introduced in worker #101 (api twin: #280).

That counter fires when deploy_status_reconcile flips a rollout to failed because it exceeded its progress deadline with no available replica — the broken-image runtime silent-failure class (CreateContainerError "no command specified" from an empty image, ImagePullBackOff, CrashLoopBackOff). It's the runtime twin of instant_deploy_job_failed_detected_total (build side).

What's here

  • k8s/prometheus-rules.yamlinstant-worker-deploy-runtime-failed group; sum(increase(...[30m])) >= 3 → critical (same posture as DeployJobFailedDetected: tolerate one or two one-off bad images, page on a platform-wide image defect).
  • newrelic/alerts/deploy-runtime-failed-detected.json — NRQL alert, derivative(...,30m) faceted by reason, ABOVE_OR_EQUALS 3.
  • newrelic/dashboards/instanode-reliability.json — billboard (1h, must be 0) + by-reason stacked-bar (6h).
  • observability/METRICS-CATALOG.md — catalog row.

Apply

infra has no auto-apply (rule 15). Operator applies prometheus-rules.yaml and imports the NR alert + dashboard JSON. validate.yml (yamllint + kubeconform) is the only gate; the PrometheusRule CRD is skipped via -ignore-missing-schemas.

Context: this whole fix came from finding 6 live deploys stuck in CreateContainerError (474-byte empty images) reported as deploying forever.

🤖 Generated with Claude Code

…tric

Rule-25 observability for instant_deploy_runtime_failed_detected_total (worker PR #101): the runtime twin of the build-Job-failed detector. Pages when deploy_status_reconcile flips rollouts to failed on ProgressDeadlineExceeded with no available replica (broken image can't start — CreateContainerError 'no command specified', ImagePullBackOff, CrashLoopBackOff).

- k8s/prometheus-rules.yaml: instant-worker-deploy-runtime-failed group, >=3 in 30m -> critical (mirrors DeployJobFailedDetected).

- newrelic/alerts/deploy-runtime-failed-detected.json: NRQL alert, derivative per reason, ABOVE_OR_EQUALS 3.

- newrelic/dashboards/instanode-reliability.json: billboard (1h) + by-reason stacked-bar (6h) tiles.

- observability/METRICS-CATALOG.md: catalog row.

infra has no auto-apply (rule 15) — operator applies prometheus-rules + imports the NR alert/dashboard JSON.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit cbbcb12 into master Jun 8, 2026
3 checks passed
@mastermanas805 mastermanas805 deleted the obs/deploy-runtime-failed-detected branch June 8, 2026 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant