nvrc: supervise kata-agent in a child PID namespace #157
Closed
fidencio wants to merge 4 commits into
NVRC currently forks and exec()s kata-agent in the parent, so kata-agent inherits PID 1 and becomes the guest init. When kata-agent processes destroy_sandbox it then calls reboot(RB_POWER_OFF) from inside the guest, which races the host shim: qemu halts before the shim has finished its post-StopVM cleanup (stopping monitor / Cleanup agent / TaskExit), the shim catches SIGTERM from systemd ending the per-container scope, and /run/vc/sbs/<id> is left behind. The follow-up cleanup-shim then dials a dead vsock and surfaces a fatal "ttrpc: closed" to the runtime.

Hand the actual VM power-off to NVRC without changing kata-agent:

* unshare(CLONE_NEWPID) before forking so the child enters a fresh PID namespace, where it is pid 1. kata-agent in that namespace still has init_mode = true, so its init_agent_as_init setup (cgroups mount, /dev/ptmx symlink, setsid, sethostname) is performed exactly as today.
* In a non-initial PID namespace the kernel reinterprets reboot(RB_POWER_OFF) as SIGINT to the namespace's init process, kata-agent itself, so it terminates instead of halting the VM.
* NVRC remains pid 1 in the initial namespace, polls waitpid in a 500ms loop, opportunistically drains /dev/log via syslog::try_poll, replacing the previous syslog_loop child, and after kata-agent exits issues the real reboot(RB_POWER_OFF) from the initial namespace, where it actually halts the guest.

The handover is purely kernel-mediated: kata-agent code is unchanged and still believes it owns shutdown.

Signed-off-by: Fabiano Fidencio <ffidencio@nvidia.com>
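To make the kernel-mediated handover concrete, here is a minimal sketch of how a supervising parent could interpret the wait status of the namespace init. It is not taken from the NVRC patch: `ChildEnd` and `classify_wait_status` are hypothetical names, and the bit layout is the conventional Linux wait-status encoding behind the `WIFEXITED`/`WTERMSIG` macros. Per reboot(2) and pid_namespaces(7), the parent sees the namespace init as killed by SIGINT for power-off/halt and by SIGHUP for restart.

```rust
/// How the supervised namespace init ended, as seen by its parent.
/// Hypothetical type for illustration; not part of NVRC.
#[derive(Debug, PartialEq, Eq)]
enum ChildEnd {
    /// Normal exit with the given exit code.
    Exited(i32),
    /// Killed by SIGINT: what the parent observes after reboot(RB_POWER_OFF)
    /// (or halt) in a non-initial PID namespace.
    PowerOffRequested,
    /// Killed by SIGHUP: what the parent observes after a restart request.
    RestartRequested,
    /// Killed by some other signal.
    Signaled(i32),
    /// Stopped/continued or otherwise not a termination status.
    Other,
}

const SIGHUP: i32 = 1;
const SIGINT: i32 = 2;

/// Decode a raw status as returned by waitpid(2) on Linux.
/// Assumption: low 7 bits hold the terminating signal (0 = normal exit,
/// 0x7f = stopped), bits 8..15 hold the exit code.
fn classify_wait_status(status: i32) -> ChildEnd {
    let termsig = status & 0x7f;
    if termsig == 0 {
        ChildEnd::Exited((status >> 8) & 0xff)
    } else if termsig != 0x7f {
        match termsig {
            SIGINT => ChildEnd::PowerOffRequested,
            SIGHUP => ChildEnd::RestartRequested,
            other => ChildEnd::Signaled(other),
        }
    } else {
        ChildEnd::Other
    }
}
```

Note that a SIGINT delivered from any other source produces the same status, so the parent cannot distinguish it from a namespace power-off by the wait status alone.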
Emit info/warn lifecycle logs and NSpid snapshots around unshare/fork/wait so CI can conclusively confirm namespace handoff and child-exit behavior.
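As an illustration of what an NSpid snapshot can show, the sketch below parses the `NSpid:` line from the text of `/proc/<pid>/status`. The function name `parse_nspid` is hypothetical and the actual logging code in this PR may look different. The kernel lists the process ID in each PID namespace it belongs to, outermost first, so a child that went through the handoff would be expected to report two entries ending in `1`.

```rust
/// Extract the PIDs listed on the `NSpid:` line of a /proc/<pid>/status dump.
/// Entries run from the outermost namespace to the innermost one.
/// Returns None if the line is missing or a field is not a number.
/// Hypothetical helper for illustration; not part of NVRC.
fn parse_nspid(status: &str) -> Option<Vec<u32>> {
    let fields = status
        .lines()
        .find_map(|line| line.strip_prefix("NSpid:"))?;
    fields
        .split_whitespace()
        .map(|field| field.parse::<u32>().ok())
        .collect()
}
```

For a status text containing `NSpid:\t42\t1`, this returns `Some(vec![42, 1])`: pid 42 in the initial namespace and pid 1 in the child namespace. A single entry means the process never left the namespace procfs was mounted from.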
Delay the final VM power-off briefly after kata-agent exits so host-side shim/ttrpc teardown can complete without racing into "ttrpc: closed".

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
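The supervision sequence described in these commits (poll for the child, drain logs, wait briefly, then power off) can be sketched as below. This is an assumption-laden outline, not the patch itself: `supervise` and its closure parameters are hypothetical stand-ins for non-blocking waitpid, `syslog::try_poll`, a sleep, and the real reboot(RB_POWER_OFF), and the length of the grace delay is left as a parameter because the commit only says "briefly". Only the 500ms poll interval comes from the commit message.

```rust
use std::time::Duration;

/// Run the supervision loop until the child is reaped, then power off.
/// Returns the raw wait status reported for the child.
/// All closures are hypothetical injection points so the control flow
/// can be exercised without real syscalls.
fn supervise<P, D, S, O>(
    mut poll_child: P, // Some(status) once the child has exited
    mut drain_logs: D, // opportunistic, non-blocking log drain
    mut sleep: S,      // blocks for the given duration
    grace: Duration,   // delay before power-off; value is an assumption
    power_off: O,      // issues the real power-off
) -> i32
where
    P: FnMut() -> Option<i32>,
    D: FnMut(),
    S: FnMut(Duration),
    O: FnOnce(),
{
    const POLL_INTERVAL: Duration = Duration::from_millis(500);

    let status = loop {
        drain_logs();
        if let Some(status) = poll_child() {
            break status;
        }
        sleep(POLL_INTERVAL);
    };

    // Pick up anything logged right before the child exited.
    drain_logs();
    // Give host-side teardown time to finish before halting the guest.
    sleep(grace);
    power_off();
    status
}
```

Passing the side effects in as closures keeps the ordering (drain, poll, sleep, final drain, grace delay, power-off) checkable in a unit test with mocks; production code would pass wrappers around the real calls.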
Collaborator
Author
This just adds extra complication for something that must be solved on the Kata Containers side, so I'm closing it and focusing on fixing it properly in Kata.