dev-harness: self-healing LocalStack + Azurite bucket bootstrap#179

Closed
plombardi89 wants to merge 1 commit into main from phlombar/orca-localstack-init-hook

@plombardi89
Collaborator

Why

LocalStack and Azurite both run with ephemeral state (emptyDir + no persistence). When their pods restart (OOM, eviction, manual delete, kind node restart) their in-memory buckets/containers are wiped. The existing orca-buckets-init and orca-azurite-container-init Kubernetes Jobs ran once at first deploy and could not re-run after restarts.

End state of that design: orca pods CrashLoopBackOff forever with NoSuchBucket on the cachestore versioning probe; manual recovery required.

What

Replace the one-shot Jobs with mechanisms that re-fire on every emulator container start.

| Emulator | Mechanism |
| --- | --- |
| LocalStack | localstack-init-buckets ConfigMap mounted at /etc/localstack/init/ready.d/init-buckets.sh (defaultMode: 0755). LocalStack 3.x rescans this directory on every container start and runs every executable script (LocalStack docs). |
| Azurite | container-ensurer sidecar in the same Pod, running a 30-second forever-loop that calls az storage container create idempotently. Talks to Azurite over loopback (127.0.0.1:10000) so it doesn't depend on cluster DNS. |
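
For illustration, here is a minimal sketch of what an idempotent ready.d hook could look like. It is not the script shipped in this PR: the function name ensure_bucket and the AWS_CMD override are hypothetical, and it assumes the awslocal wrapper is on the PATH inside the LocalStack container. The bucket names are the ones given in the commit message. The versioning re-check is not shown.

```shell
#!/bin/sh
# Sketch of an idempotent LocalStack ready.d hook (assumed shape, not the PR's script).
set -eu

# AWS_CMD is a hypothetical override so the function can be exercised without LocalStack.
AWS_CMD="${AWS_CMD:-awslocal}"

# ensure_bucket NAME: create the bucket only if head-bucket says it is missing.
ensure_bucket() {
  if "$AWS_CMD" s3api head-bucket --bucket "$1" >/dev/null 2>&1; then
    echo "bucket $1 already exists"
  else
    "$AWS_CMD" s3api create-bucket --bucket "$1" >/dev/null
    echo "bucket $1 created"
  fi
}

main() {
  for bucket in orca-cache orca-origin; do
    ensure_bucket "$bucket"
  done
}

# Run only when the CLI is present, so the file can be sourced or linted elsewhere.
if command -v "$AWS_CMD" >/dev/null 2>&1; then
  main
fi
```

Because the check comes before the create, running the hook twice against the same emulator instance is harmless, which is the property the ready.d rescan relies on.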

Both 02-init-job.yaml.tmpl and 04-azurite-init.yaml.tmpl are deleted. make deploy-localstack / make deploy-azurite lose the Job apply + wait steps and gain explicit bucket/container readiness polls so operators see a clear success signal.
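
The readiness polls mentioned above could be built on a small retry helper along these lines. This is a sketch under assumptions, not the Makefile logic from this PR: wait_for is a hypothetical name, and the Deployment name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch of a bounded readiness poll (assumed shape, not the PR's Makefile logic).
set -eu

# wait_for ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or ATTEMPTS tries have been made.
wait_for() {
  wf_attempts="$1"
  wf_delay="$2"
  shift 2
  wf_try=1
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $wf_try attempt(s)"
      return 0
    fi
    if [ "$wf_try" -ge "$wf_attempts" ]; then
      echo "not ready after $wf_attempts attempt(s)" >&2
      return 1
    fi
    wf_try=$((wf_try + 1))
    sleep "$wf_delay"
  done
}

# Hypothetical usage (the Deployment name "localstack" is a placeholder):
#   wait_for 30 2 kubectl exec deploy/localstack -- \
#     awslocal s3api head-bucket --bucket orca-cache
```

A bounded poll that prints a single success line gives operators the explicit signal the Job wait step used to provide, while still failing loudly if the init mechanism never runs.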

After this change, make -C hack/orca deploy-localstack and deploy-azurite are idempotent clean-recovery targets: re-running either against a stale cluster heals the buckets without a full make orca-down && make orca-up.

Verification (live)

Tested against a working kind cluster:

  • Applied the new LocalStack manifest; init-hook ran and created both buckets within 5s of container Ready.
  • Deleted the LocalStack pod; new pod's init-hook recreated both buckets within 5s of restart.
  • Applied the new Azurite manifest with the container-ensurer sidecar; sidecar created orca-test within 30s.
  • Deleted the Azurite pod; new sidecar recreated the container within 30s.
  • Three orca pods that had been CrashLoopBackOff for 9 days reached 3/3 Ready after a rollout-restart.

Templates render and pass kubectl --dry-run validation. No Go code changed; CI surface unaffected.

Note on sidecar memory limit

Initial sidecar memory limit of 64Mi proved too low; the Azure CLI is Python-based and loads ~150MB of modules at startup. Bumped to 256Mi limit / 128Mi request, sized to be comfortably above measured RSS.

Out of scope

Bucket contents still vanish on emulator restart (because the volume is still emptyDir / PERSISTENCE=0). This matches the dev mental model ("fresh state on each restart is fine for dev") and is unchanged by this PR. If contents-persistence is wanted later it's a separate change (PVC + PERSISTENCE=1).

Size

About +217 / -158 lines across 6 files (4 manifest templates, Makefile, dev-harness.md). The deletions come mostly from removing two Job templates totaling ~135 lines.

Replaces the one-shot Kubernetes Jobs that previously created the
LocalStack S3 buckets ('orca-cache', 'orca-origin') and the Azurite
container ('orca-test') with mechanisms that re-fire on every
emulator restart.

Failure mode being fixed
------------------------

LocalStack and Azurite both run with ephemeral state (emptyDir +
PERSISTENCE=0 / no persistence mode). When their pods restart (OOM,
eviction, manual delete, kind node restart) state is wiped. The
existing 'orca-buckets-init' and 'orca-azurite-container-init' Jobs
are Kubernetes Jobs - they ran once at first deploy and could not
re-run after emulator restarts. Result: orca pods CrashLoopBackOff
forever with 'NoSuchBucket' on the cachestore versioning probe;
manual recovery was required.

Mechanism
---------

LocalStack: 'localstack-init-buckets' ConfigMap mounted at
/etc/localstack/init/ready.d/init-buckets.sh (defaultMode 0755).
LocalStack 3.x's native init-hooks pattern rescans this directory on
every container start. The script idempotently creates both buckets
and re-checks cachestore-versioning is unset (orca's versioningGate
requirement).

Azurite: 'container-ensurer' sidecar in the same Pod, running a
30-second forever-loop that calls 'az storage container create'
idempotently. Talks to Azurite over loopback (127.0.0.1:10000) so
it doesn't depend on cluster DNS.
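
A rough sketch of such an ensure loop follows. It is not the sidecar command
from this commit: ensure_container, ensure_loop, AZ_CMD, CONTAINER and
INTERVAL are hypothetical names, and it assumes the connection string for
the loopback endpoint is supplied through AZURE_STORAGE_CONNECTION_STRING,
which the Azure CLI reads from the environment.

```shell
#!/bin/sh
# Sketch of a container-ensurer loop (assumed shape, not the commit's sidecar).

# Hypothetical knobs; the defaults mirror the values named in the text.
AZ_CMD="${AZ_CMD:-az}"
CONTAINER="${CONTAINER:-orca-test}"
INTERVAL="${INTERVAL:-30}"

# ensure_container NAME: the create call succeeds whether or not the
# container already exists, so repeating it is safe.
ensure_container() {
  if "$AZ_CMD" storage container create --name "$1" >/dev/null 2>&1; then
    echo "container $1 ensured"
  else
    echo "container $1 not ensured yet; will retry" >&2
    return 1
  fi
}

# ensure_loop MAX_PASSES: MAX_PASSES=0 loops forever, as a sidecar would.
ensure_loop() {
  el_max="$1"
  el_done=0
  while :; do
    ensure_container "$CONTAINER" || true
    el_done=$((el_done + 1))
    if [ "$el_max" -gt 0 ] && [ "$el_done" -ge "$el_max" ]; then
      return 0
    fi
    sleep "$INTERVAL"
  done
}

# Sidecar entrypoint would be: ensure_loop 0
```

Swallowing a failed pass and retrying on the next interval keeps the loop
alive while Azurite is still starting.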

Both emulators lose their separate init-Job templates
(02-init-job.yaml.tmpl, 04-azurite-init.yaml.tmpl); 'make
deploy-localstack' and 'make deploy-azurite' lose the Job
apply + wait steps and gain bucket/container readiness polls so
operators see explicit success when the init mechanisms have run.

After this change, 'make -C hack/orca deploy-localstack' and
'deploy-azurite' are clean idempotent recovery targets: re-running
either against a stale cluster heals the buckets without a full
'orca-down && orca-up'.

Verification
------------

Live-tested against an existing kind cluster:

  - Applied the new LocalStack manifest; init-hook ran and created
    both buckets within 5s of container Ready.
  - Deleted the LocalStack pod; new pod's init-hook recreated both
    buckets within 5s of restart.
  - Applied the new Azurite manifest with the container-ensurer
    sidecar; sidecar created 'orca-test' within 30s.
  - Deleted the Azurite pod; new sidecar recreated the container
    within 30s.
  - Three orca pods that had been CrashLoopBackOff for 9 days with
    NoSuchBucket reached 3/3 Ready after a rollout-restart.

Initial sidecar memory limit of 64Mi was too low (Azure CLI is
Python-based and loads ~150MB of modules); bumped to 256Mi limit /
128Mi request, sized to be comfortably above measured RSS.

Templates render and pass kubectl --dry-run validation; no Go code
changed; CI surface unaffected.
plombardi89 requested a review from a team on May 21, 2026, 19:15
@plombardi89
Collaborator Author

Closing in favor of the omnibus PR #176. The resilience commit (171fbe3) has been folded into #176 as 886607c (cherry-picked cleanly; no conflicts). The new third commit on phlombar/orcadev-tool ships the LocalStack init-hook ConfigMap + Azurite container-ensurer sidecar alongside the orcadev tool itself. Reviewers should look at #176 for the consolidated change.

plombardi89 deleted the phlombar/orca-localstack-init-hook branch on May 21, 2026, 20:53