feat: AKS read-only deploy hardening (sub-project 2) #103

Merged
aksOps merged 2 commits into main from feat/sub-project-2-aks-read-only-deploy
Apr 28, 2026


@aksOps aksOps commented Apr 28, 2026

Summary

Enables codeiq serve inside an AKS pod with securityContext.readOnlyRootFilesystem=true and a writable /tmp mount, without source-code changes to the serve profile or Neo4j wiring. The deploy contract is solved at the deployment layer plus a JVM-flag-preset launch wrapper.

Deploy shape

Build CI:  index → enrich → bundle → upload to Nexus
AKS pod:   init-container pulls bundle.zip → unzip into /tmp/codeiq-data
           main container runs scripts/aks-launch.sh /tmp/codeiq-data
                  → java [JVM flag preset] -jar code-iq.jar serve /tmp/codeiq-data
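
The init-container step in the shape above could be sketched roughly as follows. This is a minimal sketch, not the runbook's manifest: the `BUNDLE_URL` variable, the `fetch_and_unpack` function name, and the presence of `curl` and `unzip` in the init image are assumptions for illustration. Only the `/tmp/codeiq-data` target and the `bundle.zip` artifact come from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Target directory from the deploy shape above; it lives on the writable /tmp mount.
DATA_DIR="/tmp/codeiq-data"

# fetch_and_unpack URL [DEST]
# Downloads the bundle archive from URL and unzips it into DEST (default: $DATA_DIR).
# URL is supplied by the deployer (hypothetical, e.g. a Nexus raw-repository link).
fetch_and_unpack() {
  local url="$1"
  local dest="${2:-$DATA_DIR}"
  local zip="${dest%/}.zip"      # stage the archive next to the destination

  mkdir -p "$dest"
  curl -fsSL -o "$zip" "$url"    # -f: fail on HTTP errors instead of saving an error page
  unzip -q -o "$zip" -d "$dest"  # -o: overwrite files without prompting
  rm -f "$zip"                   # free /tmp space before the main container starts
}

# In the init-container entrypoint (BUNDLE_URL is an assumed environment variable):
#   fetch_and_unpack "$BUNDLE_URL"
```

Staging the archive under the same writable mount keeps every write inside `/tmp`, which is the only writable path the pod is assumed to have.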

Why init-container copy + flag preset (Approach A)

  • vs. Neo4j read-only mode + tmp redirects: embedded Neo4j 2026.04.0 still acquires a store_lock file at open. Per-version fragility isn't worth fighting when /tmp is writable.
  • vs. baking bundle into image: the container's writable upper layer is also read-only when the container runs with --read-only, so Neo4j still fails. A baked-in bundle also means a large image and couples image releases to the bundle cadence.
  • vs. swapping Neo4j for static snapshot at serve: throws away the entire read API surface (Cypher, indexes, full-text). Reserved as the fallback if Approach A proves operationally insufficient — out of scope here.

JVM flag preset (scripts/aks-launch.sh)

| Flag | Why |
| --- | --- |
| `-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader` | Spring Boot fat JAR extracts to `~/.m2/spring-boot-loader-tmp` by default, which is outside /tmp. |
| `-Djava.io.tmpdir=/tmp` | Explicit so multipart uploads and JNA / Netty native lib extraction land where we expect across base images. |
| `-XX:ErrorFile=/tmp/hs_err_pid%p.log` | JVM crash dump default is cwd. cwd is read-only. |
| `-XX:HeapDumpPath=/tmp` + `-XX:+HeapDumpOnOutOfMemoryError` | OOM heap dump default is cwd. |
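
How the preset feeds the `java` invocation can be illustrated with a short sketch. This is not the contents of scripts/aks-launch.sh: the function names `jvm_flag_preset` and `launch` are hypothetical, while the flags, the `/app/code-iq.jar` default, the `$CODEIQ_JAR` override, and the `serve <data-dir>` invocation are taken from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Prints the JVM flag preset from the table above, one flag per line.
jvm_flag_preset() {
  printf '%s\n' \
    "-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader" \
    "-Djava.io.tmpdir=/tmp" \
    "-XX:ErrorFile=/tmp/hs_err_pid%p.log" \
    "-XX:HeapDumpPath=/tmp" \
    "-XX:+HeapDumpOnOutOfMemoryError"
}

# launch DATA_DIR
# Replaces the shell with the JVM so that java runs as PID 1 and receives
# container signals (e.g. SIGTERM on pod shutdown) directly.
launch() {
  local data_dir="$1"
  local jar="${CODEIQ_JAR:-/app/code-iq.jar}"
  local -a flags
  mapfile -t flags < <(jvm_flag_preset)   # read the preset into an array

  mkdir -p /tmp/spring-boot-loader         # must exist before the JVM starts
  exec java "${flags[@]}" -jar "$jar" serve "$data_dir"
}

# Entry point of such a wrapper would be roughly:
#   launch "$@"
```

Keeping the flags in one function gives a single place to audit, and a single place for a sentinel test to check for drift.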

Files

  • docs/specs/2026-04-28-aks-read-only-deploy-design.md — architecture spec (problem, approach, audit table, test approach, risks, acceptance).
  • docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md — implementation plan (5 tasks, file map, acceptance gates).
  • shared/runbooks/aks-read-only-deploy.md — canonical operational runbook with full Kubernetes Pod manifest snippet (init-container + main container + SecurityContext + volumes + probes + resource limits), reference Dockerfile, three verification gates, rollback, troubleshooting matrix.
  • scripts/aks-launch.sh — launch wrapper: set -euo pipefail, arg validation, JAR location resolution, 1 GB /tmp pre-flight, mkdir /tmp/spring-boot-loader, exec java to PID 1.
  • src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java — 11 sentinel tests asserting every required flag + structural contract is present in the launch script. Catches drift on refactor.
  • CHANGELOG.md: [Unreleased] / Added entry.
  • shared/runbooks/engineering-standards.md §7.1.1 — new subsection cross-linking the runbook + script for downstream consumers on hardened container runtimes.
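
The pre-flight pieces listed for scripts/aks-launch.sh (arg validation, JAR location resolution, the 1 GB /tmp check) might look something like the sketch below. All function names and error messages are hypothetical. It also assumes the wrapper takes exactly one argument (the data directory) and treats "1 GB" as 1024 × 1024 KiB, neither of which is spelled out in this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

MIN_TMP_KB=$((1024 * 1024))   # assumed 1 GB floor, expressed in 1 KiB blocks

# resolve_jar: honour the $CODEIQ_JAR override, else use the /app/code-iq.jar default.
resolve_jar() {
  printf '%s\n' "${CODEIQ_JAR:-/app/code-iq.jar}"
}

# free_kb DIR: available space on DIR's filesystem, in 1 KiB blocks.
# df -P guarantees one line per filesystem; column 4 is "Available".
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# preflight_tmp [DIR] [MIN_KB]: fail fast if DIR has less than MIN_KB free.
preflight_tmp() {
  local dir="${1:-/tmp}"
  local min_kb="${2:-$MIN_TMP_KB}"
  local avail
  avail="$(free_kb "$dir")"
  if (( avail < min_kb )); then
    echo "aks-launch: only ${avail} KB free in ${dir}, need ${min_kb} KB" >&2
    return 1
  fi
}

# validate_args ARGS...: assumed contract is exactly one argument, the data directory.
validate_args() {
  if (( $# != 1 )); then
    echo "usage: aks-launch.sh <data-dir>" >&2
    return 2
  fi
}
```

Failing before `exec java` turns an undersized `/tmp` volume into an immediate, readable error rather than a Neo4j or JVM failure later in startup.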

Audit findings (full table in the spec)

| Surface | Default location | Conflict | Fix |
| --- | --- | --- | --- |
| Neo4j store_lock + tx logs | `<dataDir>/.codeiq/graph/graph.db/` | 🚩 | Init-container copies bundle to /tmp/codeiq-data. No code change. |
| Spring Boot fat JAR extraction | `~/.m2/spring-boot-loader-tmp/` | 🚩 | JVM flag |
| JVM crash + heap dumps | cwd | 🚩 | JVM flags |
| Logback file appenders | none (verified console-only at `src/main/resources/logback-spring.xml`) | | No change |
| H2 analysis cache | index-time only | | No change |

Test plan

  • mvn test -Dtest=AksLaunchScriptSentinelTest: 11/11 green
  • mvn test (full suite): 3427 tests run, 0 failures, 31 skipped
  • Local docker smoke per runbook §5.1 (docker run --read-only --tmpfs /tmp ...): operator-run; the runbook has the copy-pasteable command
  • In-cluster smoke per runbook §5.3: operator-run after deploy
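
The sentinel in this PR is a JUnit test (AksLaunchScriptSentinelTest). The same idea, checking that the required strings are still present in the launch script, can be sketched in shell for a quick local check. The function name `check_launch_script` and the exact marker strings `set -euo pipefail` and `exec java` are assumptions about how the contract appears in the script; the flags come from the preset table.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Strings the launch script is expected to contain: the flag preset plus
# two structural markers (strict mode and exec-to-PID-1).
REQUIRED_SNIPPETS=(
  "set -euo pipefail"
  "-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader"
  "-Djava.io.tmpdir=/tmp"
  "-XX:ErrorFile=/tmp/hs_err_pid%p.log"
  "-XX:HeapDumpPath=/tmp"
  "-XX:+HeapDumpOnOutOfMemoryError"
  "exec java"
)

# check_launch_script FILE: returns non-zero and lists every missing snippet.
check_launch_script() {
  local script="$1" missing=0 snippet
  for snippet in "${REQUIRED_SNIPPETS[@]}"; do
    # -F: match as a fixed string; -e: stop grep parsing a leading '-' as an option
    if ! grep -qF -e "$snippet" "$script"; then
      echo "missing in ${script}: ${snippet}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_launch_script scripts/aks-launch.sh
```

A substring check like this only proves the strings exist somewhere in the file; it does not prove they reach the `java` command line, which is why the docker smoke remains the real verification.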

Out of scope (deliberate)

  • Heavyweight JVM-level filesystem-write detector (Java has no clean chroot/unshare API; environment-fragile in CI).
  • /api/diagnostics endpoint surfacing JVM flag preset values.
  • Static-snapshot storage layer rewrite (Approach D in the spec).
  • Helm / OCI artifact packaging — runbook ships a vanilla Kubernetes manifest; productionizing into Helm is the deployer's call.

Independence

This PR is independent of sub-project 1 (PRs #101 / #102). Branched directly off main. No file overlap with the resolver work except CHANGELOG.md (where the merge is a trivial append).

🤖 Generated with Claude Code

aksOps and others added 2 commits April 28, 2026 05:21
Enables `codeiq serve` inside an AKS pod with
`securityContext.readOnlyRootFilesystem=true` and a writable `/tmp`,
without source-code changes to the serve profile or Neo4j wiring. The
deploy contract is solved at the deployment layer plus a JVM-flag-preset
launch wrapper.

Deploy shape:
  Build CI: index → enrich → bundle → upload to Nexus
  AKS pod:
    init-container: pull bundle.zip from Nexus → unzip into /tmp/codeiq-data
    main container: scripts/aks-launch.sh /tmp/codeiq-data
                    → java [JVM flag preset] -jar code-iq.jar serve /tmp/codeiq-data

Why init-container copy + flag preset over alternatives:
  - vs. Neo4j read-only mode + tmp redirects: embedded Neo4j 2026.04.0
    still acquires a `store_lock` file at open; per-version fragility
    isn't worth fighting when /tmp is writable.
  - vs. baking the bundle into the container image: container's writable
    upper layer is also read-only when mounted --read-only, so Neo4j
    still fails. Plus large image, cadence coupling.
  - vs. swapping Neo4j for a static snapshot at serve time: throws away
    the entire read API surface (Cypher, indexes, full-text search).
    Reserved as the fallback if init-container copy proves operationally
    insufficient — out of scope here.

JVM flag preset (encoded in scripts/aks-launch.sh):
  -Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader
    Spring Boot fat JAR extracts nested JARs to ~/.m2/spring-boot-loader-tmp
    by default — outside /tmp, fails under read-only HOME.
  -Djava.io.tmpdir=/tmp
    Explicit even though /tmp is the Linux default — multipart upload
    temps, JNA / Netty native lib extraction all use this; making it
    explicit means base-image-default drift can't break us.
  -XX:ErrorFile=/tmp/hs_err_pid%p.log
  -XX:HeapDumpPath=/tmp
  -XX:+HeapDumpOnOutOfMemoryError
    JVM crash + heap-dump default is cwd. cwd under read-only root =
    unwritable. These redirect to /tmp so dumps survive for kubectl cp.

Files in this commit:
  - docs/specs/2026-04-28-aks-read-only-deploy-design.md — architecture
    spec: problem, approach, audit table (Neo4j store_lock,
    spring-boot-loader, JVM crash files, logback, H2 cache, SPA static),
    test approach (sentinel + docker smoke), risks, acceptance criteria.
    Logback was verified console-only via
    src/main/resources/logback-spring.xml, so no file appender redirect
    is needed.
  - docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md — task
    list (5 tasks, single PR), file map, acceptance gates, deliberate
    out-of-scope items.
  - shared/runbooks/aks-read-only-deploy.md — canonical operational
    runbook: deploy shape, full Kubernetes Pod manifest snippet (init
    container + main container with SecurityContext, volume mounts,
    probes, resource limits), reference Dockerfile, JVM flag preset
    table, three verification gates (local docker smoke, sentinel test,
    in-cluster smoke), rollback, troubleshooting matrix.
  - scripts/aks-launch.sh — the launch wrapper. set -euo pipefail, arg
    validation, JAR location resolution (/app/code-iq.jar default,
    $CODEIQ_JAR override), 1 GB /tmp pre-flight, mkdir
    /tmp/spring-boot-loader, exec java to PID 1.
  - src/test/java/.../deploy/AksLaunchScriptSentinelTest.java — 11
    sentinel tests asserting every required flag, the strict-bash mode,
    arg-count validation, exec-to-pid-1 contract, and the /tmp pre-flight
    floor are present in scripts/aks-launch.sh. Catches drift on refactor.
  - CHANGELOG.md — [Unreleased] / Added entry.
  - shared/runbooks/engineering-standards.md §7.1.1 — new subsection
    cross-linking the runbook + script for downstream consumers running
    on hardened container runtimes.

Tests: mvn test → 3427 tests run / 0 failures / 31 skipped (full suite).
The delta from a sub-project-1 branch run (3618) is the ~190 sub-project-1
tests that haven't merged yet; this is independent and expected.

Out of scope for this PR (deliberate, listed in spec §3 + plan):
  - Heavyweight JVM-level filesystem-write detector (Java has no clean
    chroot/unshare API; environment-fragile in CI). The runbook docker
    smoke is the SSoT for "did this actually work in a RO root."
  - /api/diagnostics endpoint surfacing JVM flag preset values.
  - Static-snapshot storage layer rewrite (Approach D in the spec).
  - Helm / OCI artifact packaging — runbook ships vanilla Kubernetes
    manifest; productionizing into Helm is the deployer's call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aksOps force-pushed the feat/sub-project-2-aks-read-only-deploy branch from 6e95c5e to 5c9ca2b April 28, 2026 05:21
aksOps enabled auto-merge (squash) April 28, 2026 05:22
aksOps merged commit 47e6404 into main Apr 28, 2026
13 checks passed
aksOps deleted the feat/sub-project-2-aks-read-only-deploy branch April 28, 2026 05:25