obs(deploy): alert + dashboard + catalog for runtime start-failure metric by mastermanas805 · Pull Request #67 · InstaNode-dev/infra

mastermanas805 · 2026-06-08T17:02:03Z

Rule-25 observability for the new instant_deploy_runtime_failed_detected_total{reason} counter introduced in worker #101 (api twin: #280).

That counter fires when deploy_status_reconcile flips a rollout to failed because it exceeded its progress deadline with no available replica — the broken-image runtime silent-failure class (CreateContainerError "no command specified" from an empty image, ImagePullBackOff, CrashLoopBackOff). It's the runtime twin of instant_deploy_job_failed_detected_total (build side).

What's here

- k8s/prometheus-rules.yaml: instant-worker-deploy-runtime-failed group; sum(increase(...[30m])) >= 3 → critical (same posture as DeployJobFailedDetected: tolerate one or two one-off bad images, page on a platform-wide image defect).
- newrelic/alerts/deploy-runtime-failed-detected.json: NRQL alert, derivative(...,30m) faceted by reason, ABOVE_OR_EQUALS 3.
- newrelic/dashboards/instanode-reliability.json: billboard (1h, must be 0) + by-reason stacked-bar (6h).
- observability/METRICS-CATALOG.md: catalog row.
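
For readers who want to picture the Prometheus side, a rule of this shape might look roughly like the sketch below. This is not the contents of k8s/prometheus-rules.yaml: the resource name, the alert name (chosen by analogy with DeployJobFailedDetected), and the annotation text are assumptions, and the description elides the selector inside increase(), so the bare metric name here is also an assumption.

```yaml
# Hypothetical sketch only; names marked "assumed" do not come from the PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: instant-worker-deploy-runtime-failed   # assumed resource name
spec:
  groups:
    - name: instant-worker-deploy-runtime-failed
      rules:
        - alert: DeployRuntimeFailedDetected   # assumed alert name
          # Sum across all reason labels so that three failures with mixed
          # reasons still cross the threshold. The selector is assumed.
          expr: sum(increase(instant_deploy_runtime_failed_detected_total[30m])) >= 3
          labels:
            severity: critical
          annotations:
            summary: "3 or more deploys failed to start in the last 30 minutes"   # assumed wording
```

Summing over reason before comparing matches the stated posture: one or two bad images stay below the threshold regardless of which failure reason they report.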

Apply

The infra repo has no auto-apply (rule 15), so an operator must apply prometheus-rules.yaml and import the NR alert and dashboard JSON. validate.yml (yamllint + kubeconform) is the only gate, and the PrometheusRule CRD is skipped via -ignore-missing-schemas, so the rule itself is not schema-checked.
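
As a rough illustration of what a yamllint + kubeconform gate can look like, here is a minimal GitHub Actions sketch. It is not the repo's actual validate.yml: the trigger, job layout, install steps, and the k8s/ path argument are all assumptions. Only the two tools and the -ignore-missing-schemas flag come from the description above.

```yaml
# Hypothetical sketch of a validate.yml-style workflow; not the repo's file.
name: validate
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: stable
      - name: Lint YAML
        run: |
          pip install yamllint
          yamllint k8s/
      - name: Validate Kubernetes manifests
        run: |
          go install github.com/yannh/kubeconform/cmd/kubeconform@latest
          # -ignore-missing-schemas skips kinds with no available schema,
          # such as the PrometheusRule CRD, instead of failing on them.
          "$(go env GOPATH)/bin/kubeconform" -ignore-missing-schemas -summary k8s/
```

With a gate of this kind, a malformed PromQL expression would pass CI as long as the YAML is well-formed, which is worth keeping in mind when applying the rules.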

Context: this fix came from finding 6 live deploys stuck in CreateContainerError (474-byte empty images) that were reported as deploying indefinitely.

🤖 Generated with Claude Code

…tric

Rule-25 observability for instant_deploy_runtime_failed_detected_total (worker PR #101): the runtime twin of the build-Job-failed detector. Pages when deploy_status_reconcile flips rollouts to failed on ProgressDeadlineExceeded with no available replica (broken image can't start: CreateContainerError 'no command specified', ImagePullBackOff, CrashLoopBackOff).

- k8s/prometheus-rules.yaml: instant-worker-deploy-runtime-failed group, >=3 in 30m -> critical (mirrors DeployJobFailedDetected).
- newrelic/alerts/deploy-runtime-failed-detected.json: NRQL alert, derivative per reason, ABOVE_OR_EQUALS 3.
- newrelic/dashboards/instanode-reliability.json: billboard (1h) + by-reason stacked-bar (6h) tiles.
- observability/METRICS-CATALOG.md: catalog row.

infra has no auto-apply (rule 15); operator applies prometheus-rules + imports the NR alert/dashboard JSON.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mastermanas805 merged commit cbbcb12 into master Jun 8, 2026
3 checks passed

mastermanas805 deleted the obs/deploy-runtime-failed-detected branch June 8, 2026 17:06

mastermanas805 merged 1 commit into master from obs/deploy-runtime-failed-detected
