Skip to content

feat(jobs): deploy_idle_scaler — scale-to-zero idle descheduler (#54)#94

Merged
mastermanas805 merged 3 commits into
masterfrom
feat/deploy-scale-to-zero
Jun 5, 2026
Merged

feat(jobs): deploy_idle_scaler — scale-to-zero idle descheduler (#54)#94
mastermanas805 merged 3 commits into
masterfrom
feat/deploy-scale-to-zero

Conversation

@mastermanas805
Copy link
Copy Markdown
Member

Worker half of scale-to-zero / idle descheduling (Task #54). New periodic job (every 2 min) that patches idle, healthy, not-pinned deployments to replicas=0 (~$0 compute) — reversible via the api wake endpoint. Sibling to deployment_expirer (idle ≠ expired).

Safety

  • Flag-gated behind DEPLOY_SCALE_TO_ZERO_ENABLED (default OFF): Work() short-circuits at DEBUG when off — no k8s patch, no DB write (TestDeployIdleScaler_FlagOffNoOp asserts zero SQL + zero scale calls).
  • Fail-open when k8s is unreachable (nil client → WARN per tick; other jobs unaffected).
  • CAS double-guard: candidate SELECT and the scaled_to_zero UPDATE share the healthy + not-zeroed + not-always-on predicate, so a row that raced into a woken/pinned/redeployed/expired state is skipped, never wrongly slept. NotFound Deployment = skip (torn down), not fail.

The idle SIGNAL (stated honestly)

deployments.last_activity_at (api migration 068) — stamped at create, bumped on deploy/redeploy/wake. v1 idle = "no deploy/redeploy/wake for N min" (default 30, floored at 5), NOT per-HTTP traffic, because the api/worker are not in the request path and no nginx-ingress request scrape is wired yet. Follow-up to lift to traffic-based idle is noted in the job header.

Metric (rule 25)

instant_deploy_scaled_to_zero_total{outcome} (scaled_down | woke_up | wake_failed | scale_failed) + instant_deploy_idle_apps gauge, primed in metrics_test.go. Alert+tile+catalog ship in the infra PR.

Tests

flag-off no-op, nil-k8s no-op, scale-down happy path (+counter+gauge), CAS-race skip, NotFound skip, scale-error → scale_failed, idle-minutes floor, provider_id→namespace derivation, real k8sDeployScaleClient vs fake clientset. make gate GREEN.

Companion PRs

Awaiting operator enable of DEPLOY_SCALE_TO_ZERO_ENABLED to verify real scale-down in prod.

🤖 Generated with Claude Code

…54)

Worker half of scale-to-zero. New periodic job (every 2 min) that patches
idle, healthy, not-pinned deployments to replicas=0 (~$0 compute) — reversible
via the api wake endpoint. Sibling to deployment_expirer (idle ≠ expired).

Flag-gated behind DEPLOY_SCALE_TO_ZERO_ENABLED (default OFF): Work()
short-circuits at DEBUG when off — no k8s patch, no DB write (proven by
TestDeployIdleScaler_FlagOffNoOp: zero SQL issued, zero scale calls). Fail-open
when k8s is unreachable (nil client → WARN per tick, other jobs unaffected).

Idle SIGNAL (stated honestly): deployments.last_activity_at (api migration 068)
— stamped at create, bumped on deploy/redeploy/wake. v1 idle = "no
deploy/redeploy/wake for N min" (default 30, floored at 5), NOT per-HTTP
traffic, because the api/worker are not in the request path and no
nginx-ingress request scrape is wired yet. Follow-up noted in the job header to
lift this to traffic-based idle.

CAS double-guard: candidate SELECT and the scaled_to_zero UPDATE share the
healthy + not-zeroed + not-always-on predicate, so a row that raced into a
woken/pinned/redeployed/expired state between SELECT and UPDATE is skipped
(0 rows), never wrongly slept. NotFound Deployment = skip (torn down), not fail.

Metric (rule 25): instant_deploy_scaled_to_zero_total{outcome} (scaled_down |
woke_up | wake_failed | scale_failed) + instant_deploy_idle_apps gauge, primed
in metrics_test.go. Alert+tile+catalog ship in the infra PR.

Tests: flag-off no-op, nil-k8s no-op, scale-down happy path (+counter+gauge),
CAS-race skip, NotFound skip, scale-error → scale_failed, idle-minutes floor,
provider_id→namespace derivation, real k8sDeployScaleClient vs fake clientset.
make gate GREEN.

Awaiting operator enable of DEPLOY_SCALE_TO_ZERO_ENABLED to verify real
scale-down in prod.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 5, 2026 14:16
mastermanas805 and others added 2 commits June 5, 2026 19:51
Closes the 100%-patch gaps on deploy_idle_scaler.go: Kind(), the no-config
cluster-constructor error path (CI-only, gated like the status client),
list-query error, scan error, foreign provider_id skip, db-flip error,
gauge-sample error. Work() and the SQL helpers now 100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e + AddWorker wiring

Closes the 6 uncovered changed lines flagged by the patch-coverage gate:

- config: DEPLOY_SCALE_TO_ZERO_IDLE_MINUTES parse branch (valid override +
  sub-5/non-numeric floor-to-30) now exercised directly.
- deploy_idle_scaler: introduce newDeployScaleClientset package-var seam so
  NewK8sDeployScaleClientFromCluster's success return is testable without a
  reachable cluster; add a success test alongside the NoConfig error test.
- workers: extract the idle-scaler k8s-client wiring (the AddWorker else-branch)
  into a unit-testable buildIdleScaleK8s helper; cover both the success and the
  fail-open (nil) branches via the seam.

Flag remains default-OFF; behaviour unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 91ef958 into master Jun 5, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant