
Use native fsGroup for Hermes PVC ownership #514

Merged
bussyjd merged 2 commits into main from fix/native-fsgroup-pvc-ownership
May 23, 2026

@bussyjd bussyjd commented May 23, 2026

Summary

  • Replace host-side Hermes PVC ownership repair with Kubernetes PodSecurityContext fsGroup and fsGroupChangePolicy: OnRootMismatch.
  • Apply the same fsGroup contract to CRD child agent Hermes pod rendering.
  • Remove the repair command, docs, path helpers, and call sites added in PR #511 (fix(hermes): host-side chown for CRD child agent PVCs on Linux k3d), keeping only a small k3d fallback that runs after the Hermes init container is already visibly stuck.
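
For reference, the pod-level contract in the first bullet can be sketched in Go roughly as follows. This is a minimal illustration, not the repository's rendering code: podSecurityContext is a hypothetical local stand-in that mirrors only four fields of the Kubernetes PodSecurityContext (the real type lives in k8s.io/api/core/v1), and the 10000 values are the Hermes UID/GID mentioned elsewhere in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podSecurityContext is a hypothetical stand-in that mirrors four fields of
// the Kubernetes PodSecurityContext. It exists only for this illustration.
type podSecurityContext struct {
	FSGroup             int64  `json:"fsGroup"`
	FSGroupChangePolicy string `json:"fsGroupChangePolicy"`
	RunAsGroup          int64  `json:"runAsGroup"`
	RunAsUser           int64  `json:"runAsUser"`
}

// hermesSecurityContext returns the contract described in the summary:
// run as 10000:10000 and let kubelet fix volume group ownership on mismatch.
func hermesSecurityContext() podSecurityContext {
	return podSecurityContext{
		FSGroup:             10000,
		FSGroupChangePolicy: "OnRootMismatch",
		RunAsGroup:          10000,
		RunAsUser:           10000,
	}
}

// renderJSON serializes the security context the way it would appear under
// spec.template.spec.securityContext in a Deployment.
func renderJSON(sc podSecurityContext) (string, error) {
	b, err := json.Marshal(sc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := renderJSON(hermesSecurityContext())
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the same value is used for both master and child pods, there is a single place where the UID/GID and policy can drift.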

Why

The ownership fix belongs in the Kubernetes pod security context so kubelet applies it consistently to master and child Hermes PVC mounts. The previous host-side chown path added CLI surface area, host path plumbing, and child-agent-specific call sites.

Validation

  • /usr/local/bin/go test ./internal/hermes ./internal/serviceoffercontroller ./internal/agentcrd ./cmd/obol -count=1
  • git diff --check

HananINouman and others added 2 commits May 22, 2026 22:53
PR #481 only repaired hermes-<id> volumes after hermes.Sync (master agent).
Child agents live under agent-<name> and are provisioned by the controller or
agent-factory without that path, so hermes-data stayed 1000:1000 while Hermes
runs as 10000:10000 and crash-looped on Permission denied under /data/.hermes.

Extend EnsureHermesDataPVCOwnership to agent-<name>/hermes-data, call it from
obol agent new and obol sell demo quant, and add obol agent repair-perms for
factory-only creates that cannot docker-exec the k3d node from in-cluster.

Co-authored-by: Cursor <cursoragent@cursor.com>

bussyjd commented May 23, 2026

Linux coverage validation

Validated PR head cc1325af (fix/native-fsgroup-pvc-ownership) in an isolated worktree on local Linux.

Environment:

  • OS: Ubuntu 24.04 / Linux 6.17.0-29-generic, x86_64
  • Go: go1.25.8 linux/amd64
  • Docker: 27.3.1
  • k3d available: v5.8.3

Commands run:

  • git diff --check origin/main...HEAD
  • go test ./internal/hermes ./internal/serviceoffercontroller ./internal/agentcrd ./cmd/obol -count=1
  • go test ./internal/hermes ./internal/serviceoffercontroller -count=1 -run 'TestGenerateValues_UsesHermesNativeNames|TestAgentManifests_DeploymentUsesFSGroup'
  • go build ./cmd/obol
  • go test $(go list ./... | grep -v '/internal/stack$') -count=1

Full-suite note:

  • go test ./... -count=1 has one red package: internal/stack, specifically TestWarnIfNoChatModel_EmitsWarnWhenNoModels.
  • I reproduced the same failure on unchanged origin/main (8eed58e), so it is pre-existing and unrelated to this PR. The test expects "No chat-capable LLM detected" while the current code emits "No chat-capable model detected".

PR-specific Linux coverage checked:

  • Master Hermes values render the pod-level runAsUser, runAsGroup, fsGroup, and fsGroupChangePolicy: OnRootMismatch contract.
  • Generated Hermes values no longer include the old init-hermes-perms / chown -R 10000:10000 /data path.
  • CRD child-agent Deployment rendering includes the same fsGroup + fsGroupChangePolicy: OnRootMismatch contract.
  • Diff review confirms the remaining k3d host-side repair is constrained to the fallback path after the Hermes init container is visibly stuck.

Verdict: ✅ PR-specific Linux validation passes. The only observed full-suite failure is already present on main and is not from the fsGroup/PVC ownership changes.
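
The render-level checks above boil down to "these strings must be present, those must be gone". A minimal sketch of that kind of assertion, assuming the rendered values are available as YAML text, might look like this. It is not the code of TestGenerateValues_UsesHermesNativeNames or TestAgentManifests_DeploymentUsesFSGroup; checkRenderedValues and sampleValues are hypothetical, and the sample fragment's key layout is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Strings the rendered output is expected to contain after this PR.
var required = []string{
	"fsGroup: 10000",
	"fsGroupChangePolicy: OnRootMismatch",
}

// Strings from the old init-container repair path that should be gone.
var forbidden = []string{
	"init-hermes-perms",
	"chown -R 10000:10000 /data",
}

// checkRenderedValues returns one message per violated expectation.
// An empty result means the rendered text satisfies the contract.
func checkRenderedValues(rendered string) []string {
	var problems []string
	for _, s := range required {
		if !strings.Contains(rendered, s) {
			problems = append(problems, "missing: "+s)
		}
	}
	for _, s := range forbidden {
		if strings.Contains(rendered, s) {
			problems = append(problems, "unexpected: "+s)
		}
	}
	return problems
}

// sampleValues is a hypothetical fragment; the real chart layout may differ.
const sampleValues = `podSecurityContext:
  runAsUser: 10000
  runAsGroup: 10000
  fsGroup: 10000
  fsGroupChangePolicy: OnRootMismatch
`

func main() {
	problems := checkRenderedValues(sampleValues)
	if len(problems) == 0 {
		fmt.Println("rendered values match the fsGroup contract")
		return
	}
	for _, p := range problems {
		fmt.Println(p)
	}
}
```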

bussyjd commented May 23, 2026

macOS/k3d validation report

Validated this PR on a local macOS arm64 host against the running k3d stack.

Go test and coverage

Command:

/usr/local/bin/go test -coverprofile=/tmp/obol-pr514/coverage.out ./internal/hermes ./internal/serviceoffercontroller ./internal/agentcrd ./cmd/obol -count=1
/usr/local/bin/go tool cover -func=/tmp/obol-pr514/coverage.out

Result: pass.

Package coverage from the scoped run:

  • internal/hermes: 33.3%
  • internal/serviceoffercontroller: 49.4%
  • internal/agentcrd: 75.3%
  • cmd/obol: 15.3%
  • total scoped statement coverage: 30.3%

Live master Hermes validation

Built the PR binary locally and synced the existing stack-managed Hermes agent:

go build -o /tmp/obol-pr514/obol ./cmd/obol
/tmp/obol-pr514/obol agent sync --runtime hermes obol-agent

Result: pass.

Live Deployment assertions:

{"fsGroup":10000,"fsGroupChangePolicy":"OnRootMismatch","runAsGroup":10000,"runAsUser":10000}
  • only init-hermes-data remains; init-hermes-perms is absent
  • rollout completed successfully
  • Hermes pod was 2/2 Running with 0 restarts after rollout
  • in-pod smoke as the Hermes container user succeeded:
id
# uid=10000(hermes) gid=10000(hermes) groups=10000(hermes)
touch /data/.hermes/.pr514-fsgroup-smoke && rm /data/.hermes/.pr514-fsgroup-smoke

Live CRD child-agent validation

The running cluster initially had the older pinned serviceoffer-controller image and did not have the Agent CRD installed. I ran stack sync in dev mode against the existing local stack config and force-rebuilt only serviceoffer-controller from this branch:

OBOL_DEVELOPMENT=true OBOL_FORCE_REBUILD_LOCAL_DEV_IMAGES=serviceoffer-controller /tmp/obol-pr514/obol stack up

Result: pass.

Post-sync assertions:

  • agents.obol.org, agentidentities.obol.org, purchaserequests.obol.org, and serviceoffers.obol.org API resources are present
  • serviceoffer-controller is running ghcr.io/obolnetwork/serviceoffer-controller:latest
  • serviceoffer-controller rollout completed with 1/1 ready replicas

Then created a temporary child Agent CR:

/tmp/obol-pr514/obol agent new pr514-smoke --model qwen3.5:9b --skills addresses --objective "PR 514 fsGroup smoke"

Live child Deployment assertions:

{"fsGroup":10000,"fsGroupChangePolicy":"OnRootMismatch","runAsGroup":10000,"runAsUser":10000}
  • child render path uses the Kubernetes fsGroup policy automatically
  • child init container list was profile-seed; no host-side repair/chown path was involved
  • child Hermes Deployment became Available
  • child Hermes pod was 1/1 Running
  • in-pod write smoke succeeded as UID/GID 10000:
id
# uid=10000(hermes) gid=10000(hermes) groups=10000(hermes)
touch /data/.hermes/.pr514-child-fsgroup-smoke && rm /data/.hermes/.pr514-child-fsgroup-smoke
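
The live Deployment assertions could also be checked mechanically against the Deployment's JSON (for example the output of kubectl get deployment -o json). The sketch below is one way to do that under stated assumptions: checkDeployment and sampleDeployment are hypothetical, only the fields needed here are decoded, and the sample document is hand-written to match the values reported in this comment rather than captured from the cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deployment decodes only the parts of a Deployment this check needs.
type deployment struct {
	Spec struct {
		Template struct {
			Spec struct {
				SecurityContext struct {
					FSGroup             int64  `json:"fsGroup"`
					FSGroupChangePolicy string `json:"fsGroupChangePolicy"`
					RunAsGroup          int64  `json:"runAsGroup"`
					RunAsUser           int64  `json:"runAsUser"`
				} `json:"securityContext"`
				InitContainers []struct {
					Name string `json:"name"`
				} `json:"initContainers"`
			} `json:"spec"`
		} `json:"template"`
	} `json:"spec"`
}

// checkDeployment verifies the fsGroup contract and that the old
// init-hermes-perms init container is not present.
func checkDeployment(raw []byte) error {
	var d deployment
	if err := json.Unmarshal(raw, &d); err != nil {
		return fmt.Errorf("decode deployment: %w", err)
	}
	pod := d.Spec.Template.Spec
	sc := pod.SecurityContext
	if sc.FSGroup != 10000 || sc.RunAsGroup != 10000 || sc.RunAsUser != 10000 {
		return fmt.Errorf("unexpected IDs: fsGroup=%d runAsGroup=%d runAsUser=%d",
			sc.FSGroup, sc.RunAsGroup, sc.RunAsUser)
	}
	if sc.FSGroupChangePolicy != "OnRootMismatch" {
		return fmt.Errorf("unexpected fsGroupChangePolicy %q", sc.FSGroupChangePolicy)
	}
	for _, c := range pod.InitContainers {
		if c.Name == "init-hermes-perms" {
			return fmt.Errorf("old init container %q is still present", c.Name)
		}
	}
	return nil
}

// sampleDeployment is a hand-written fragment, not captured cluster output.
const sampleDeployment = `{"spec":{"template":{"spec":{
  "securityContext":{"fsGroup":10000,"fsGroupChangePolicy":"OnRootMismatch","runAsGroup":10000,"runAsUser":10000},
  "initContainers":[{"name":"profile-seed"}]
}}}}`

func main() {
	if err := checkDeployment([]byte(sampleDeployment)); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("deployment matches the fsGroup contract")
}
```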

Cleanup completed:

  • obol agent delete pr514-smoke --force
  • temporary namespace deleted
  • temporary host data removed

Notes

  • Non-interactive sudo prevented /etc/hosts updates during sync; this only affects host DNS convenience and did not affect the Kubernetes assertions.
  • Dev-mode k3d image import printed transient ctr: content digest ... not found lines while preloading large/cached images, but stack up exited 0 and the relevant controller/Hermes rollouts succeeded.
  • The temporary child Agent CR status stayed Provisioning during a ~60s poll even though its Hermes Deployment was Available and writable. Delete required --force to strip the Agent finalizer. That looks like a separate Agent status/finalizer requeue issue, not a PVC ownership failure, but it is worth a follow-up if we expect the Agent CR itself to flip Ready immediately after Deployment availability.

@bussyjd bussyjd marked this pull request as ready for review May 23, 2026 19:08
@bussyjd bussyjd merged commit 74f7f14 into main May 23, 2026
7 checks passed
bussyjd added a commit that referenced this pull request May 24, 2026
PR #511's host-side chown workaround was superseded by PR #514. This merge records the conflict resolution while keeping main's native Kubernetes fsGroup implementation.