dev-harness: self-healing LocalStack + Azurite bucket bootstrap #179
Closed
plombardi89 wants to merge 1 commit into
Replaces the one-shot Kubernetes Jobs that previously created the
LocalStack S3 buckets ('orca-cache', 'orca-origin') and the Azurite
container ('orca-test') with mechanisms that re-fire on every
emulator restart.
Failure mode being fixed
------------------------
LocalStack and Azurite both run with ephemeral state (emptyDir +
PERSISTENCE=0 / no persistence mode). When their pods restart (OOM,
eviction, manual delete, kind node restart) state is wiped. The
existing 'orca-buckets-init' and 'orca-azurite-container-init' Jobs
are one-shot Kubernetes Jobs: they ran once at first deploy and could
not re-run after emulator restarts. Result: orca pods CrashLoopBackOff
forever with 'NoSuchBucket' on the cachestore versioning probe;
manual recovery was required.
Mechanism
---------
LocalStack: 'localstack-init-buckets' ConfigMap mounted at
/etc/localstack/init/ready.d/init-buckets.sh (defaultMode 0755).
LocalStack 3.x's native init-hook mechanism runs the scripts in this
directory on every container start. The script idempotently creates
both buckets and re-checks that cachestore versioning is unset (orca's
versioningGate requirement).
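A minimal sketch of what such a ready.d hook could look like, not the script from this PR. It assumes `awslocal` is on PATH inside the LocalStack container and that 'orca-cache' is the cachestore bucket whose versioning must stay unset; the function names and the `AWS_CLI` override are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumptions: awslocal is available in the LocalStack image,
# and orca-cache is the bucket the versioning check applies to.
AWS_CLI="${AWS_CLI:-awslocal}"

# Create the bucket only when head-bucket reports it missing, so re-runs are no-ops.
ensure_bucket() {
  if "$AWS_CLI" s3api head-bucket --bucket "$1" >/dev/null 2>&1; then
    echo "bucket $1 already exists"
  else
    "$AWS_CLI" s3api create-bucket --bucket "$1" >/dev/null || return 1
    echo "created bucket $1"
  fi
}

# get-bucket-versioning has no Status field when versioning was never
# configured; with --query Status --output text the CLI then prints "None".
versioning_unset() {
  status=$("$AWS_CLI" s3api get-bucket-versioning --bucket "$1" \
    --query Status --output text) || return 1
  [ "$status" = "None" ]
}

main() {
  for b in orca-cache orca-origin; do
    ensure_bucket "$b" || exit 1
  done
  if ! versioning_unset orca-cache; then
    echo "orca-cache has versioning configured" >&2
    exit 1
  fi
}

# Run only where the CLI exists, so the functions can be exercised with a stub elsewhere.
if command -v "$AWS_CLI" >/dev/null 2>&1; then
  main
else
  echo "skipping bucket init: $AWS_CLI not found" >&2
fi
```

Checking with head-bucket before create-bucket keeps the hook quiet on re-runs regardless of how the emulator answers a duplicate create.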
Azurite: 'container-ensurer' sidecar in the same Pod, running a
30-second forever-loop that calls 'az storage container create'
idempotently. Talks to Azurite over loopback (127.0.0.1:10000) so
it doesn't depend on cluster DNS.
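The ensure loop can be sketched roughly as follows. This is an illustration, not the sidecar from this PR: it assumes the Azure CLI picks up an Azurite connection string (pointing at 127.0.0.1:10000) from the `AZURE_STORAGE_CONNECTION_STRING` environment variable, and the `loop` argument, `AZ_CLI`, `CONTAINER`, and `INTERVAL` names are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: AZURE_STORAGE_CONNECTION_STRING is set in the
# sidecar's environment and points at Azurite on loopback.
CONTAINER="${CONTAINER:-orca-test}"
INTERVAL="${INTERVAL:-30}"
AZ_CLI="${AZ_CLI:-az}"

# One ensure pass. Without --fail-on-exist, creating an existing container
# succeeds, so repeating the call is harmless.
ensure_container() {
  if "$AZ_CLI" storage container create --name "$CONTAINER" --output none; then
    echo "container $CONTAINER ensured"
  else
    echo "ensure of $CONTAINER failed; will retry" >&2
    return 1
  fi
}

# Keep ensuring forever; a failed pass (Azurite not up yet) is retried next tick.
ensure_forever() {
  while true; do
    ensure_container || true
    sleep "$INTERVAL"
  done
}

# The sidecar would start this as: sh ensure.sh loop
if [ "${1:-}" = "loop" ]; then
  ensure_forever
fi
```

Because the loop swallows individual failures, the sidecar stays up while Azurite is still starting and converges within one interval of the emulator becoming reachable.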
Both separate init-Job templates (02-init-job.yaml.tmpl,
04-azurite-init.yaml.tmpl) are deleted; 'make deploy-localstack' and
'make deploy-azurite' lose the Job apply + wait steps and gain
bucket/container readiness polls so operators see explicit success
when the init mechanisms have run.
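A readiness poll of this kind might look like the sketch below. The `wait_for` helper is hypothetical, and the deployment name and namespace in the usage comment are assumptions rather than values taken from the Makefile.

```shell
#!/bin/sh
# wait_for TRIES DELAY CMD...: re-run CMD until it succeeds or TRIES are used up.
wait_for() {
  tries="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "timed out after $tries attempt(s): $*" >&2
  return 1
}

# Example usage (hypothetical namespace and deployment name):
# wait_for 30 2 kubectl -n orca exec deploy/localstack -- \
#   awslocal s3api head-bucket --bucket orca-cache
```

Bounding the number of attempts means a deploy target fails with a clear message instead of hanging when the init mechanism never runs.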
After this change, 'make -C hack/orca deploy-localstack' and
'deploy-azurite' are clean idempotent recovery targets: re-running
either against a stale cluster heals the buckets without a full
'orca-down && orca-up'.
Verification
------------
Live-tested against an existing kind cluster:
- Applied the new LocalStack manifest; init-hook ran and created
both buckets within 5s of container Ready.
- Deleted the LocalStack pod; new pod's init-hook recreated both
buckets within 5s of restart.
- Applied the new Azurite manifest with the container-ensurer
sidecar; sidecar created 'orca-test' within 30s.
- Deleted the Azurite pod; new sidecar recreated the container
within 30s.
- Three orca pods that had been CrashLoopBackOff for 9 days with
NoSuchBucket reached 3/3 Ready after a rollout-restart.
Initial sidecar memory limit of 64Mi was too low (Azure CLI is
Python-based and loads ~150MB of modules); bumped to 256Mi limit /
128Mi request, sized to be comfortably above measured RSS.
Templates render and pass kubectl --dry-run validation; no Go code
changed; CI surface unaffected.
Closing in favor of the omnibus PR #176. The resilience commit (171fbe3) has been folded into #176 as 886607c (cherry-picked cleanly; no conflicts). The new third commit on |
Why
LocalStack and Azurite both run with ephemeral state (`emptyDir` + no persistence). When their pods restart (OOM, eviction, manual delete, kind node restart) their in-memory buckets/containers are wiped. The existing `orca-buckets-init` and `orca-azurite-container-init` Kubernetes Jobs ran once at first deploy and could not re-run after restarts.
End state of that design: orca pods CrashLoopBackOff forever with `NoSuchBucket` on the cachestore versioning probe; manual recovery required.
What
Replace the one-shot Jobs with mechanisms that re-fire on every emulator container start.
- LocalStack: `localstack-init-buckets` ConfigMap mounted at `/etc/localstack/init/ready.d/init-buckets.sh` (`defaultMode: 0755`). LocalStack 3.x rescans this directory on every container start and runs every executable script (LocalStack docs).
- Azurite: `container-ensurer` sidecar in the same Pod, running a 30-second forever-loop that calls `az storage container create` idempotently. Talks to Azurite over loopback (`127.0.0.1:10000`) so it doesn't depend on cluster DNS.
Both `02-init-job.yaml.tmpl` and `04-azurite-init.yaml.tmpl` are deleted. `make deploy-localstack` / `make deploy-azurite` lose the Job apply + wait steps and gain explicit bucket/container readiness polls so operators see a clear success signal.
After this change, `make -C hack/orca deploy-localstack` and `deploy-azurite` are idempotent clean-recovery targets: re-running either against a stale cluster heals the buckets without a full `make orca-down && make orca-up`.
Verification (live)
Tested against a working kind cluster:
- Applied the new Azurite manifest with the `container-ensurer` sidecar; sidecar created `orca-test` within 30s.
Templates render and pass `kubectl --dry-run` validation. No Go code changed; CI surface unaffected.
Note on sidecar memory limit
Initial sidecar memory limit of 64Mi proved too low; the Azure CLI is Python-based and loads ~150MB of modules at startup. Bumped to 256Mi limit / 128Mi request, sized to be comfortably above measured RSS.
Out of scope
Bucket contents still vanish on emulator restart (because the volume is still `emptyDir` / `PERSISTENCE=0`). This matches the dev mental model ("fresh state on each restart is fine for dev") and is unchanged by this PR. If contents-persistence is wanted later it's a separate change (PVC + `PERSISTENCE=1`).
Size
~+217 / -158 lines across 6 files (4 manifest templates, Makefile, dev-harness.md). Deletions are driven mostly by removing two Job templates totaling ~135 lines.