Skip to content

Run Kubernetes agent pods as non-root#77

Merged
ptone merged 1 commit intoGoogleCloudPlatform:mainfrom
mfreeman451:fix/k8s-nonroot-agent-runtime
Apr 11, 2026
Merged

Run Kubernetes agent pods as non-root#77
ptone merged 1 commit intoGoogleCloudPlatform:mainfrom
mfreeman451:fix/k8s-nonroot-agent-runtime

Conversation

@mfreeman451
Copy link
Copy Markdown
Contributor

Summary

  • run Kubernetes agent pods as the image's non-root scion user
  • keep fsGroup aligned with the broker host gid so synchronized files remain writable
  • add focused runtime tests for the pod security context and env setup

Validation

  • go test ./pkg/runtime
  • golangci-lint run --config /Users/mfreeman/src/scion/.golangci.yml --new-from-rev=main ./pkg/runtime

Copy link
Copy Markdown
Contributor Author

@mfreeman451 mfreeman451 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mfreeman451 mfreeman451 marked this pull request as ready for review April 8, 2026 20:49
@mfreeman451
Copy link
Copy Markdown
Contributor Author

Tracking issue: #88

@ptone
Copy link
Copy Markdown
Member

ptone commented Apr 10, 2026

Doesn't this re-introduce or re-use an 'ended' state cleaned up in #68?

PR #77: Run Kubernetes agent pods as non-root

PR URL: #77
Branch: fix/k8s-nonroot-agent-runtimemain
Files changed (PR-relevant): pkg/runtime/k8s_runtime.go, pkg/runtime/k8s_runtime_test.go, pkg/runtime/k8s_hardening_test.go, pkg/runtime/interface.go
Stats: +72 / -23 (PR-relevant scope)


Executive Summary

This PR hardens Kubernetes agent pods by running them as a non-root user (UID/GID 1000) with RunAsNonRoot enforcement, and sets explicit HOME/USER/LOGNAME environment variables. Risk level: Low-Medium — the security context changes are sound, but there is a semantic regression in the List method's phase reporting that loses the distinction between successful and failed pod exits.


Critical Issues

1. Loss of terminal phase granularity in List() — potential behavioral regression

File: pkg/runtime/k8s_runtime.go, lines 1581–1597

The previous code mapped PodSucceeded"stopped" and PodFailed"error", and also distinguished container exit code 0 vs non-zero. The new code collapses both to "ended":

if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
    agentStatus = "ended"
}

Impact: The "ended" string is not a recognized state.Phase value (valid phases are: created, provisioning, cloning, starting, running, stopping, stopped, error). Consumers in pkg/agent/list.go (lines 133–151) reconcile phases against these known values — "ended" falls through all switch cases without being normalized, resulting in agents showing a raw "ended" phase in the UI/API instead of "stopped" or "error".

Additionally, the LegacyAgentPhaseEnded constant was removed from pkg/runtime/interface.go in this same PR, yet the code now re-introduces the "ended" literal it represented. This is contradictory.

Suggested Fix:

// Restore semantic distinction:
if p.Status.Phase == corev1.PodSucceeded {
    agentStatus = string(state.PhaseStopped)
} else if p.Status.Phase == corev1.PodFailed {
    agentStatus = string(state.PhaseError)
}
, and similarly for the container exit code path:
if cs.State.Terminated.ExitCode == 0 {
    agentStatus = string(state.PhaseStopped)
} else {
    agentStatus = string(state.PhaseError)
}

Or, if "ended" is intentionally a new catch-all, it needs to be handled in pkg/agent/list.go's reconciliation switch.


Observations

2. Hardcoded UID 1000 — consider making configurable

File: pkg/runtime/k8s_runtime.go, line 980

const containerUID int64 = 1000

The UID is hardcoded to 1000, matching the scion user in the standard image. This is correct for the default case, but if a custom image uses a different non-root UID (e.g., for enterprise base images), pods will fail with permission errors. Consider sourcing this from RunConfig or image metadata in a future iteration.

Severity: Low (acceptable for now, worth a TODO comment)

3. ensureAnnotations — minor readability improvement

The helper is clean and correct. One nit: it could be simplified to always return a new map when nil, which it already does. No action needed.

4. Deleted test TestKubernetesRuntime_List_TerminalPhases removes coverage for the exact behavior changed

File: pkg/runtime/k8s_runtime_test.go

The deleted test validated that PodSucceeded mapped to "stopped" and PodFailed mapped to "error". With the behavioral change in List(), this test was correctly removed to avoid failures — but no replacement test validates the new "ended" behavior or ensures downstream consumers handle it.

Suggested Fix: Add a test asserting that terminal pods return Phase: "ended" and that pkg/agent/list.go correctly normalizes this value.

5. Empty UnixUsername guard

File: pkg/runtime/k8s_runtime.go, line 967

containerHome := fmt.Sprintf("/home/%s", config.UnixUsername)

If UnixUsername is empty (defensive case), this produces HOME=/home/ which is invalid. Based on code inspection, UnixUsername always defaults to "scion", so this is low risk — but a guard would be defensive:

if config.UnixUsername == "" {
    config.UnixUsername = "scion"
}

Positive Feedback

  • Security improvement is well-scoped: Setting RunAsUser, RunAsGroup, and RunAsNonRoot at the pod level is the correct Kubernetes pattern. Keeping FSGroup aligned with the host GID for volume permissions is a thoughtful detail.
  • Explicit HOME/USER/LOGNAME env vars prevent issues with tools that rely on these being set (common in CI/agent contexts where /etc/passwd may not have the user entry).
  • ensureAnnotations helper is a clean DRY refactor that eliminates repetitive nil-checks across four call sites.
  • Test coverage for security context is thorough — both the focused TestBuildPod_SecurityContext_FSGroup and the integration-style TestBuildPod_FullConfig_Stage2 are updated.
  • int64Ptr helper mirrors the existing boolPtr pattern idiomatically.

Final Verdict

Needs minor rework. The security hardening changes (non-root, env vars, annotation cleanup) are solid and ready to ship. However, Critical Issue #1 (collapsing stopped/error into "ended") is a behavioral regression that will surface as incorrect agent phase reporting in the UI/CLI. This should either be reverted to preserve the semantic distinction, or "ended" must be handled as a recognized phase in the downstream consumer (pkg/agent/list.go).

@mfreeman451 mfreeman451 force-pushed the fix/k8s-nonroot-agent-runtime branch from 693ca6b to e8eb01b Compare April 11, 2026 05:15
@ptone ptone merged commit 7216f0e into GoogleCloudPlatform:main Apr 11, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants