Skip to content

perf(cluster): speed up local deploy loop and harden fast path#367

Draft
johntmyers wants to merge 5 commits intomainfrom
dx/local-loop-reliability-v1
Draft

perf(cluster): speed up local deploy loop and harden fast path#367
johntmyers wants to merge 5 commits intomainfrom
dx/local-loop-reliability-v1

Conversation

@johntmyers
Copy link
Collaborator

@johntmyers johntmyers commented Mar 16, 2026

Summary

This WIP PR improves local cluster DX by reducing redeploy latency and adding observability so local changes are faster and more predictable to deploy.

Related Issue

N/A (WIP tracking PR).

Changes

  • Added deploy transaction IDs and markdown reports under .cache/deploy-reports/.
  • Added baseline harness script and tasks for cold/warm/full timing runs.
  • Added a local Cargo build profile (local-fast) and wired profile-aware Docker builds for gateway/cluster/supervisor paths.
  • Added profile-aware cache scoping to improve cache hit rates and avoid cross-profile churn.
  • Added macOS Bash 3 compatibility fixes (mapfile replacements) in local scripts.
  • Hardened local registry readiness/recovery in bootstrap flow.
  • Added minimal supervisor-export Docker stage and switched supervisor extraction to avoid heavy filesystem export overhead.
  • Made non-interactive gateway reuse explicit with --reuse-ok and updated docs.
  • Added digest-aware gateway no-op skip in fast deploy and deterministic state updates for explicit-target runs.
  • Updated architecture docs with baseline/deploy-report workflow details.

Phase 1 Perf Results

Warm baseline (mise run cluster:baseline:warm:full)

Step Before Phase 1 After Phase 1
rust/cli_debug 94s 4s
docker/gateway_image 337s 2s
docker/supervisor_stage 193s 0s
docker/cluster_image 46s 2s
deploy/cluster_task 62s 1s
  • Before report: .cache/perf/cluster-baseline-20260316-100450.md
  • After report: .cache/perf/cluster-baseline-20260316-105244.md

Real source-change simulations (local-fast)

Scenario Transaction Key timings (s)
Gateway source edit (auto path) tx-sim-gateway-auto-20260316-105844-6443 builds=31, helm=2, rollout=10, total=43
Supervisor source edit (auto path) tx-sim-supervisor-auto2-20260316-110345-69 supervisor_build_deploy=17, total=17

Key Phase 1 outcome: no-change redeploys are now near-instant, and real source-change loops are bounded to ~43s (gateway path) and ~17s (supervisor path) on current local hardware/cache state.

Phase 2 Results (digest skip + state determinism)

Digest-aware gateway no-op skip

Scenario Transaction Key timings (s)
Explicit gateway rerun with unchanged digest tx-phase2-gateway-skipcheck-20260316-115158-12390 builds=2, helm=0, rollout=0, total=3
  • Deploy report shows: gateway_reconcile_skipped=1, gateway_reconcile_skip_reason=digest_already_deployed.

Explicit-target state sync validation

Scenario Transaction Key timings (s)
Explicit supervisor deploy updates only supervisor state tx-phase2-state-explicit-20260316-115340-12548 supervisor_build_deploy=95, total=95
Follow-up auto deploy after explicit run tx-phase2-state-auto-20260316-115626-32596 builds=0, total=1
  • This confirms explicit-target runs no longer cause follow-up auto drift/rebuild churn.

Phase 3 Results (cache-friendly invalidation)

Source-change simulations (local-fast)

Scenario Transaction Key timings (s)
Gateway source edit (first run after Phase 3) tx-20260316-120903-15216 builds=15, helm=1, rollout=12, total=28
Gateway source edit (warm rerun) tx-20260316-121226-32566 builds=20, helm=0, rollout=0, total=20
Supervisor source edit (first run after Phase 3) tx-20260316-120949-29654 supervisor_build_deploy=104, total=104
Supervisor source edit (warm rerun) tx-20260316-121257-28650 supervisor_build_deploy=13, total=14
  • Warm gateway source-change path improves from the earlier Phase 1 simulation (43s) to 20s with digest-aware rollout skip.
  • Warm supervisor source-change path improves from the earlier Phase 1 simulation (17s) to 14s.
  • Warm baseline full report (second pass after warming baseline-warm scope): .cache/perf/cluster-baseline-20260316-122121.md with cli=11s, gateway=26s, supervisor=10s, cluster=11s, deploy=32s.

Phase 4 Results (bounded readiness + supervisor reconcile)

Reliability validation runs (local-fast)

Scenario Transaction Key timings (s)
Explicit supervisor deploy with reconcile path enabled tx-20260316-133137-30850 supervisor_build_deploy=125, supervisor_reconcile=0, readiness_gate=1, total=126
Auto deploy through gateway rebuild/helm/rollout path tx-20260316-133957-30862 builds=150, helm=1, rollout=15, readiness_gate=1, total=168
Clean no-change steady state (post-push) tx-20260316-134945-29515 builds=0, helm=0, rollout=0, readiness_gate=1, total=1
  • Added bounded deploy controls: DEPLOY_FAST_READINESS_TIMEOUT_SECONDS, DEPLOY_FAST_READINESS_POLL_SECONDS, DEPLOY_FAST_SUPERVISOR_RECONCILE, and DEPLOY_FAST_SUPERVISOR_RECONCILE_TIMEOUT_SECONDS.
  • Deploy reports now capture readiness status/failure reasons and supervisor reconcile stats/durations.
  • Net effect: reliability improves (bounded, explicit readiness contract) while no-change loop remains at ~1s.

Phase 5 Results (iter-04 classifier expansion)

Cluster-infra classification validation

Scenario Transaction Key timings (s)
Infra-path change (cluster-bootstrap.sh) auto-escalates to bootstrap tx-20260316-142747-16072 mode=fast, recreated_cluster=1, built_cluster_image=1, total=202
Follow-up no-change steady-state deploy after bootstrap tx-20260316-145203-17082 builds=0, helm=0, rollout=0, readiness_gate=1, total=2
  • cluster-deploy-fast.sh now tracks a dedicated cluster_infra fingerprint and writes it to deploy state.
  • Changes in cluster infra paths now deterministically route to cluster-bootstrap.sh fast instead of attempting incremental fast deploy.
  • Covered paths include cluster Dockerfile/entrypoint/healthcheck, kube manifests, bootstrap/cluster build scripts, and crates/openshell-bootstrap.

Testing

  • mise run pre-commit passes
  • cargo check -p openshell-sandbox
  • mise run cluster:baseline:warm:full
  • Real-change local deploy simulations (gateway + supervisor)
  • Phase 2 digest-skip + state-sync validation runs
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Improve local DX by adding profile-aware fast Docker builds, deploy/baseline reporting, and resilient cluster bootstrap/deploy behavior so iterative redeploys are faster and more predictable on macOS/Linux.

Made-with: Cursor
@johntmyers johntmyers self-assigned this Mar 16, 2026
Avoid unnecessary helm upgrades and gateway rollouts when an explicit gateway deploy produces the same digest, and keep explicit-target state writes deterministic so subsequent auto deploys do not drift.

Made-with: Cursor
Stop touching gateway and supervisor main sources in Docker builds so unrelated edits keep incremental cache hits while retaining proto/build.rs invalidation safety.

Made-with: Cursor
Make fast deploy paths use bounded readiness checks and rollout timeouts, and add deterministic supervisor pod reconcile behavior so local deploy outcomes are reliable without regressing steady-state no-op speed.

Made-with: Cursor
Extend fast deploy classification to track cluster infrastructure paths and route those changes through full cluster bootstrap so local redeploys remain deterministic while preserving fast no-op runs.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant