Skip to content

Reconcile terminal Kubernetes agent state#68

Merged
ptone merged 3 commits intoGoogleCloudPlatform:mainfrom
mfreeman451:fix/k8s-terminal-state-reconciliation
Apr 10, 2026
Merged

Reconcile terminal Kubernetes agent state#68
ptone merged 3 commits intoGoogleCloudPlatform:mainfrom
mfreeman451:fix/k8s-terminal-state-reconciliation

Conversation

@mfreeman451
Copy link
Copy Markdown
Contributor

@mfreeman451 mfreeman451 commented Apr 8, 2026

Summary

  • map terminal Kubernetes pod states to structured Scion phases
  • prevent stale agent-info state from overriding completed Kubernetes runtime state
  • persist the reconciled terminal phase back into agent-info.json so broker and hub views converge

Testing

  • go test ./pkg/agent -run 'TestList(ReconcilesPhaseWithContainerStatus|PreservesRuntimeTerminalStateForKubernetes|EnrichesTemplateAndHarnessFromAgentInfo|DoesNotOverrideRuntimeTemplate|SetsLastSeenFromAgentInfoMtime|NonRunningAgentIncludesHarnessConfig)$'
  • go test ./pkg/runtime -run 'TestKubernetesRuntime_(List|List_TerminalPhases)$'
  • make build

Context

  • This is intentionally limited to terminal-state reconciliation. It does not attempt to solve durable workspace sync-back for completed Kubernetes pods.

Copy link
Copy Markdown
Contributor Author

@mfreeman451 mfreeman451 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mfreeman451 mfreeman451 marked this pull request as ready for review April 8, 2026 09:46
@mfreeman451
Copy link
Copy Markdown
Contributor Author

Tracking issue: #85

@ptone
Copy link
Copy Markdown
Member

ptone commented Apr 10, 2026

PR #68 Review: Reconcile terminal Kubernetes agent state

Executive Summary

This PR maps terminal Kubernetes pod states (PodSucceeded, PodFailed) to structured Scion phases (stopped, error) and prevents stale agent-info.json data from overriding authoritative runtime state. The change is well-scoped and low-to-moderate risk; there is one race-condition concern in the persist helper and a minor architectural observation about the new cross-package import, but no blocking issues.


Critical Issues

1. TOCTOU race in persistAgentInfoState (moderate severity)

File: pkg/agent/list.go — new function persistAgentInfoState

The function reads, unmarshals, modifies, marshals, and writes back agent-info.json without any file locking or atomic-write pattern. If the in-container harness writes to the same file concurrently (which it does — the harness updates agent-info.json on every heartbeat), the persist call can:

  • Overwrite a harness update that landed between ReadFile and WriteFile.
  • Produce a truncated file if the new JSON is shorter than the old content and WriteFile is not atomic on the filesystem.

Suggested Fix:
Use an atomic write (write to a temp file + rename) and/or only persist when the pod is truly terminal (container exited), which largely mitigates the race since the harness is no longer writing. The current code does gate on terminalPhase != "", which means the pod is terminated — so the harness is likely dead. However, there's a window where the harness could still be flushing a final heartbeat. Consider adding a comment documenting this assumption, or use os.Rename for atomicity:

func persistAgentInfoState(path, phase, activity string) error {
    data, err := os.ReadFile(path)
    if err != nil {
        return err
    }
    var info api.AgentInfo
    if err := json.Unmarshal(data, &info); err != nil {
        return err
    }
    if info.Phase == phase && info.Activity == activity {
        return nil
    }
    info.Phase = phase
    info.Activity = activity
    updated, err := json.MarshalIndent(info, "", "  ")
    if err != nil {
        return err
    }
    tmp := path + ".tmp"
    if err := os.WriteFile(tmp, updated, 0644); err != nil {
        return err
    }
    return os.Rename(tmp, path)
}

Severity: Moderate — in practice the harness is dead when a pod reaches a terminal phase, so the window is narrow. But an atomic write is cheap insurance.


Observations

2. persistAgentInfoState error is silently discarded

File: pkg/agent/list.go, line in the diff:

_ = persistAgentInfoState(agentInfoJSON, terminalPhase, "")

The error from persistAgentInfoState is explicitly discarded. This is acceptable for a best-effort convergence mechanism during List(), but consider emitting a debug-level log so that filesystem issues (permissions, disk full) are diagnosable in production:

if err := persistAgentInfoState(agentInfoJSON, terminalPhase, ""); err != nil {
    slog.Debug("failed to persist terminal agent state", "path", agentInfoJSON, "err", err)
}

3. terminalRuntimePhase default-case behavior

The function returns "" for any unrecognized phase string that is not LegacyAgentPhaseEnded. This is correct and safe — unrecognized phases will fall through to the existing agent-info.json / container-status reconciliation logic below. No action needed, just noting the design is sound.

4. Hardcoded 0644 file permissions

persistAgentInfoState writes with 0644. The original file may have been created with different permissions. Consider preserving the original file's mode:

fi, err := os.Stat(path)
if err != nil {
    return err
}
return os.WriteFile(path, updated, fi.Mode())

This is minor — 0644 is the standard for JSON config files in this codebase.

5. NewSimpleClientsetNewClientset rename in tests

The test file changes k8sfake.NewSimpleClientset() to k8sfake.NewClientset(). This appears to track an upstream client-go API change. Fine as-is, just noting it's a mechanical update unrelated to the core logic.

6. pkg/runtime now imports pkg/agent/state

The k8s_runtime.go change introduces pkg/agent/state as a dependency of pkg/runtime. The dependency graph is:

pkg/agent → pkg/runtime → pkg/agent/state

This is not a cycle (Go checks at the package level, and pkg/agent/state has no reverse dependency), but it does introduce a conceptual coupling where the runtime layer depends on agent-domain constants. If this concern grows, the phase constants could be lifted to pkg/api or a shared pkg/phase package. For now, with only two constants used (PhaseStopped, PhaseError), this is pragmatic and acceptable.


Positive Feedback

  • Well-scoped change. The PR explicitly states it does not attempt to solve durable workspace sync-back, which is the right boundary.
  • Good test coverage. Both the runtime-level mapping (TestKubernetesRuntime_List_TerminalPhases) and the agent-level reconciliation (TestListPreservesRuntimeTerminalStateForKubernetes) are tested with multiple sub-cases covering succeeded, failed, and structured-phase scenarios.
  • Early-return optimization in persistAgentInfoState. The no-op check (info.Phase == phase && info.Activity == activity) avoids unnecessary disk writes on repeated List() calls — nice touch.
  • Legacy sentinel constant. Naming LegacyAgentPhaseEnded and legacyFailedContainerStatusPrefix makes the migration path explicit and grep-friendly for future cleanup.
  • Container exit code distinction. Using ExitCode == 0 vs non-zero in the terminated-container path is more precise than the previous blanket "ended" assignment.

Copy link
Copy Markdown
Contributor Author

@mfreeman451 mfreeman451 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented the follow-up hardening for terminal-state persistence in pkg/agent/list.go.

Changes:

  • switched persistAgentInfoState to temp-file-plus-rename so the write is atomic
  • preserved the original agent-info.json mode instead of hardcoding 0644
  • stopped silently discarding persist failures and now emit a debug log on best-effort persistence errors
  • added a comment documenting the assumption behind this path

I also added focused coverage in pkg/agent/list_test.go to verify the helper rewrites phase/activity, preserves file mode, and leaves no temp file behind.

On the race concern: this path only runs when the runtime has already reported a terminal Kubernetes state, so in the normal case the harness process is already gone or in its final
shutdown window and no longer continuously heartbeating agent-info.json. That means the remaining read-modify-write TOCTOU window is real in theory, but narrow in practice and limited to a
last-flush edge case during pod termination. The atomic rename removes the partial-write/truncation risk, which was the more actionable filesystem issue here, without broadening this PR
into a larger locking or cross-process coordination change.

@ptone ptone merged commit 4a97f82 into GoogleCloudPlatform:main Apr 10, 2026
1 check passed
@mfreeman451 mfreeman451 deleted the fix/k8s-terminal-state-reconciliation branch April 10, 2026 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants