feat: AKS read-only deploy hardening (sub-project 2) #103
Merged
Enables `codeiq serve` inside an AKS pod with
`securityContext.readOnlyRootFilesystem=true` and a writable `/tmp`,
without source-code changes to the serve profile or Neo4j wiring. The
deploy contract is solved at the deployment layer plus a JVM-flag-preset
launch wrapper.
Deploy shape:
  Build CI: index → enrich → bundle → upload to Nexus
  AKS pod:
    init-container: pull bundle.zip from Nexus → unzip into /tmp/codeiq-data
    main container: scripts/aks-launch.sh /tmp/codeiq-data
      → java [JVM flag preset] -jar code-iq.jar serve /tmp/codeiq-data
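The init-container step in the deploy shape above could look roughly like the following. This is a minimal sketch, not the runbook's actual command: the function name `stage_bundle`, the `DRY_RUN` switch, the bundle URL, and the choice to keep `bundle.zip` inside the data directory are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the init-container step: fetch the bundle, then unpack it under /tmp.
# stage_bundle, DRY_RUN and the URL argument are hypothetical, not from the PR.

stage_bundle() {
  local bundle_url="$1"                    # assumed: a Nexus download URL
  local data_dir="${2:-/tmp/codeiq-data}"  # target directory from the deploy shape
  local run="${DRY_RUN:+echo}"             # DRY_RUN=1 prints commands instead of running them

  $run mkdir -p "$data_dir"
  $run curl --fail --silent --show-error --location \
    --output "$data_dir/bundle.zip" "$bundle_url"
  $run unzip -q -o "$data_dir/bundle.zip" -d "$data_dir"
}
```

Running `DRY_RUN=1 stage_bundle <url>` prints the three commands without touching the network, which is handy when reviewing the manifest's init-container `command`.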
Why init-container copy + flag preset over alternatives:
- vs. Neo4j read-only mode + tmp redirects: embedded Neo4j 2026.04.0
still acquires a `store_lock` file at open; per-version fragility
isn't worth fighting when /tmp is writable.
- vs. baking the bundle into the container image: the container's writable
upper layer is also read-only when run with --read-only, so Neo4j
still fails. It also means a large image and couples bundle cadence
to image cadence.
- vs. swapping Neo4j for a static snapshot at serve time: throws away
the entire read API surface (Cypher, indexes, full-text search).
Reserved as the fallback if init-container copy proves operationally
insufficient — out of scope here.
JVM flag preset (encoded in scripts/aks-launch.sh):
  -Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader
    Spring Boot fat JAR extracts nested JARs to ~/.m2/spring-boot-loader-tmp
    by default, which is outside /tmp and fails under a read-only HOME.
  -Djava.io.tmpdir=/tmp
    Explicit even though /tmp is the Linux default: multipart upload
    temps and JNA / Netty native lib extraction all use this; making it
    explicit means base-image-default drift can't break us.
  -XX:ErrorFile=/tmp/hs_err_pid%p.log
  -XX:HeapDumpPath=/tmp
  -XX:+HeapDumpOnOutOfMemoryError
    The JVM writes crash logs and heap dumps to cwd by default, and cwd
    under a read-only root is unwritable. These flags redirect both to
    /tmp so dumps survive for kubectl cp.
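To make the preset concrete, here is a small sketch of how a wrapper might assemble the final command line from the five flags listed above. The function names `jvm_flag_preset` and `build_java_command` are hypothetical; the real scripts/aks-launch.sh may structure this differently, and it calls `exec java` rather than printing.

```shell
#!/usr/bin/env bash
# Sketch: keep the flag preset in one place and derive the java command from it.
# Function names are illustrative assumptions, not taken from the PR.

jvm_flag_preset() {
  # One flag per line; the values are the ones listed in the commit message.
  printf '%s\n' \
    '-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader' \
    '-Djava.io.tmpdir=/tmp' \
    '-XX:ErrorFile=/tmp/hs_err_pid%p.log' \
    '-XX:HeapDumpPath=/tmp' \
    '-XX:+HeapDumpOnOutOfMemoryError'
}

build_java_command() {
  local jar="$1" data_dir="$2"
  # The flags contain no whitespace or glob characters, so word-splitting
  # the preset output is safe here.
  # shellcheck disable=SC2046
  echo java $(jvm_flag_preset) -jar "$jar" serve "$data_dir"
}
```

Calling `build_java_command /app/code-iq.jar /tmp/codeiq-data` prints the full command, which makes it easy to diff the preset against what a pod actually runs.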
Files in this commit:
- docs/specs/2026-04-28-aks-read-only-deploy-design.md — architecture
spec: problem, approach, audit table (Neo4j store_lock,
spring-boot-loader, JVM crash files, logback, H2 cache, SPA static),
test approach (sentinel + docker smoke), risks, acceptance criteria.
Logback was verified console-only via
src/main/resources/logback-spring.xml, so no file appender redirect is needed.
- docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md — task
list (5 tasks, single PR), file map, acceptance gates, deliberate
out-of-scope items.
- shared/runbooks/aks-read-only-deploy.md — canonical operational
runbook: deploy shape, full Kubernetes Pod manifest snippet (init
container + main container with SecurityContext, volume mounts,
probes, resource limits), reference Dockerfile, JVM flag preset
table, three verification gates (local docker smoke, sentinel test,
in-cluster smoke), rollback, troubleshooting matrix.
- scripts/aks-launch.sh — the launch wrapper. set -euo pipefail, arg
validation, JAR location resolution (/app/code-iq.jar default,
$CODEIQ_JAR override), 1 GB /tmp pre-flight, mkdir
/tmp/spring-boot-loader, exec java to PID 1.
- src/test/java/.../deploy/AksLaunchScriptSentinelTest.java — 11
sentinel tests asserting every required flag, the strict-bash mode,
arg-count validation, exec-to-pid-1 contract, and the /tmp pre-flight
floor are present in scripts/aks-launch.sh. Catches drift on refactor.
- CHANGELOG.md — [Unreleased] / Added entry.
- shared/runbooks/engineering-standards.md §7.1.1 — new subsection
cross-linking the runbook + script for downstream consumers running
on hardened container runtimes.
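The wrapper's checks described in the file list (arg validation, JAR location resolution, /tmp pre-flight) might be sketched as below. The helper names are hypothetical, and the sketch assumes the "1 GB /tmp pre-flight" means a free-space floor measured with `df`; the actual script may check something else.

```shell
#!/usr/bin/env bash
# Sketch of the pre-launch checks; helper names and the df-based free-space
# interpretation of the pre-flight are assumptions, not taken from the script.

validate_args() {
  # The wrapper takes exactly one argument: the data directory.
  if [ "$#" -ne 1 ]; then
    echo "usage: aks-launch.sh <data-dir>" >&2
    return 2
  fi
}

resolve_jar() {
  # $CODEIQ_JAR overrides the /app/code-iq.jar default.
  printf '%s\n' "${CODEIQ_JAR:-/app/code-iq.jar}"
}

tmp_free_kb() {
  # POSIX `df -P -k`: column 4 of the data row is available space in KiB.
  df -P -k "${1:-/tmp}" | awk 'NR == 2 { print $4 }'
}

require_tmp_space() {
  local dir="${1:-/tmp}" floor_kb="${2:-1048576}"  # 1 GB = 1024 * 1024 KiB
  local free_kb
  free_kb="$(tmp_free_kb "$dir")"
  if [ "$free_kb" -lt "$floor_kb" ]; then
    echo "aks-launch: $dir has ${free_kb} KiB free, need ${floor_kb}" >&2
    return 1
  fi
}
```

In the real wrapper these checks would run under `set -euo pipefail`, followed by `mkdir /tmp/spring-boot-loader` and `exec java` so that the JVM becomes PID 1.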
Tests: mvn test → 3427 / 0 failures / 31 skipped (full suite). The
delta from a sub-project-1 branch run (3618) is the ~190 sub-project-1
tests that haven't merged yet — independent and expected.
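The sentinel idea (assert that required strings are still present in the launch script) can also be expressed in shell. The PR's actual tests are JUnit tests in AksLaunchScriptSentinelTest.java; the following is only a shell analogue, with a hypothetical function name and a needle list assumed from the flags and contracts named in the commit message.

```shell
#!/usr/bin/env bash
# Shell analogue of a sentinel check: fail if any required string is missing.
# check_launch_script and its needle list are illustrative assumptions.

check_launch_script() {
  local script="$1" missing=0 needle
  for needle in \
    'set -euo pipefail' \
    '-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader' \
    '-Djava.io.tmpdir=/tmp' \
    '-XX:ErrorFile=/tmp/hs_err_pid%p.log' \
    '-XX:HeapDumpPath=/tmp' \
    '-XX:+HeapDumpOnOutOfMemoryError' \
    'exec java'
  do
    # -F: fixed string; -q: quiet; -e: required because needles start with '-'.
    if ! grep -Fq -e "$needle" "$script"; then
      echo "missing: $needle" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

`check_launch_script scripts/aks-launch.sh` returns non-zero and lists every missing string, so a refactor that drops a flag is caught before deploy.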
Out of scope for this PR (deliberate, listed in spec §3 + plan):
- Heavyweight JVM-level filesystem-write detector (Java has no clean
chroot/unshare API; environment-fragile in CI). The runbook docker
smoke is the single source of truth for "did this actually work in a RO root."
- /api/diagnostics endpoint surfacing JVM flag preset values.
- Static-snapshot storage layer rewrite (Approach D in the spec).
- Helm / OCI artifact packaging — runbook ships vanilla Kubernetes
manifest; productionizing into Helm is the deployer's call.
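For the "did this actually work in a RO root" question, a lightweight shell probe run inside the container is one possible complement to the docker smoke. This is a sketch under assumptions: the function names and the probe-file approach are hypothetical and are not part of the PR or the runbook.

```shell
#!/usr/bin/env bash
# Sketch: probe which directories are writable from inside a running container.
# can_write and smoke_ro_root are illustrative names, not from the PR.

can_write() {
  local dir="$1" probe
  probe="$dir/.write-probe.$$"
  # Subshell: in POSIX shells a failed redirection on the ':' builtin can
  # terminate the shell, so the attempt is isolated.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}

smoke_ro_root() {
  # Usage: smoke_ro_root WRITABLE_DIR RO_DIR...
  local writable="$1" dir
  shift
  can_write "$writable" || { echo "FAIL: $writable is not writable" >&2; return 1; }
  for dir in "$@"; do
    if can_write "$dir"; then
      echo "FAIL: $dir is writable" >&2
      return 1
    fi
  done
  echo "OK: $writable writable, other paths read-only"
}
```

Inside a pod one might call `smoke_ro_root /tmp / /app`; the paths checked are a deployment-specific choice.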
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6e95c5e to 5c9ca2b
Summary
Enables `codeiq serve` inside an AKS pod with `securityContext.readOnlyRootFilesystem=true` and a writable `/tmp` mount, without source-code changes to the serve profile or Neo4j wiring. The deploy contract is solved at the deployment layer plus a JVM-flag-preset launch wrapper.

Deploy shape
  Build CI: index → enrich → bundle → upload to Nexus
  AKS pod:
    init-container: pull bundle.zip from Nexus → unzip into /tmp/codeiq-data
    main container: scripts/aks-launch.sh /tmp/codeiq-data
      → java [JVM flag preset] -jar code-iq.jar serve /tmp/codeiq-data

Why init-container copy + flag preset (Approach A)
- vs. Neo4j read-only mode + tmp redirects: embedded Neo4j 2026.04.0 still acquires a `store_lock` file at open. Per-version fragility isn't worth fighting when `/tmp` is writable.
- vs. baking the bundle into the container image: the container's writable upper layer is also read-only when run with `--read-only`. Plus large image, cadence coupling.

JVM flag preset (`scripts/aks-launch.sh`)
- `-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader`: Spring Boot fat JAR extracts nested JARs to `~/.m2/spring-boot-loader-tmp` by default, outside `/tmp`.
- `-Djava.io.tmpdir=/tmp`
- `-XX:ErrorFile=/tmp/hs_err_pid%p.log`
- `-XX:HeapDumpPath=/tmp` + `-XX:+HeapDumpOnOutOfMemoryError`

Files
- `docs/specs/2026-04-28-aks-read-only-deploy-design.md`: architecture spec (problem, approach, audit table, test approach, risks, acceptance).
- `docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md`: implementation plan (5 tasks, file map, acceptance gates).
- `shared/runbooks/aks-read-only-deploy.md`: canonical operational runbook with full Kubernetes Pod manifest snippet (init-container + main container + SecurityContext + volumes + probes + resource limits), reference Dockerfile, three verification gates, rollback, troubleshooting matrix.
- `scripts/aks-launch.sh`: launch wrapper: `set -euo pipefail`, arg validation, JAR location resolution, 1 GB `/tmp` pre-flight, `mkdir /tmp/spring-boot-loader`, `exec java` to PID 1.
- `src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java`: 11 sentinel tests asserting every required flag + structural contract is present in the launch script. Catches drift on refactor.
- `CHANGELOG.md`: `[Unreleased] / Added` entry.
- `shared/runbooks/engineering-standards.md` §7.1.1: new subsection cross-linking the runbook + script for downstream consumers on hardened container runtimes.

Audit findings (full table in the spec)
- Neo4j `store_lock` + tx logs: `<dataDir>/.codeiq/graph/graph.db/`; `/tmp/codeiq-data`. No code change.
- spring-boot-loader: `~/.m2/spring-boot-loader-tmp/`
- logback: `src/main/resources/logback-spring.xml`

Test plan
- `mvn test -Dtest=AksLaunchScriptSentinelTest`: 11/11 green
- `mvn test` (full suite): 3427 / 0 failures / 31 skipped
- Local docker smoke (`docker run --read-only --tmpfs /tmp ...`): operator-run; runbook has the copy-pasteable command

Out of scope (deliberate)
- `/api/diagnostics` endpoint surfacing JVM flag preset values.

Independence
This PR is independent of sub-project 1 (PRs #101 / #102). Branched directly off `main`. No file overlap with the resolver work except `CHANGELOG.md` (where the merge is a trivial append).

🤖 Generated with Claude Code