fix(kata-agent): use syslog::try_poll in drain loop to avoid kata 255#154
Merged
Merged
Conversation
d119a38 to
765a80b
Compare
Collaborator
Author
|
@zvonkok, could you allow the other tests to run? I'm specially interested in the |
765a80b to
4ba2010
Compare
NVRC's syslog_loop runs in the syslog poller child of fork_agent for ~136 years, draining /dev/log every 500ms. NVRC's power-off panic hook (installed in main() before the fork) is inherited by the child — by design: a real NVRC failure in there must still tear down the VM, per the ephemeral confidential-VM fail-hard policy. Guest VMs have no syslog daemon: main() binds /dev/log via syslog::poll() before fork_agent, and the syslog child inherits that socket. A transient drain I/O error during the long-running loop — EBADF on recv, ENOSPC on append to /run/syslog.log, etc. — is NOT a configuration failure and NOT VM-fatal. Yet syslog::poll() (or_panic on any error) made every such hiccup panic the syslog child, fire the inherited power-off hook, and reboot the VM while kata-agent (now PID 1) was still serving the workload. kata-runtime would then observe a dead agent and return exit code 255 to docker/nerdctl, masking the workload's real exit code. This is the bug behind the '|| true' masks that had been added to .github/workflows/ci.yaml around the nerdctl invocations. Fix: syslog_loop now calls syslog::try_poll() (best-effort, silently swallows transient I/O errors — already provided for exactly this case) instead of syslog::poll(). The fail-hard panic hook stays in place for every other code path; only the auxiliary syslog drain is demoted from VM-fatal to best-effort. Add test_syslog_loop_does_not_trigger_power_off_hook as a regression guard: fork to isolate the production-shape panic hook from parallel tests, run a short syslog_loop in the child, and assert the hook never fired (exit 0). The test child does a fresh bind (unlike production, where main() already bound /dev/log); on dev hosts where a host syslog daemon owns /dev/log, reverting to syslog::poll() reproduces the bug via EADDRINUSE — a local reproducer, not the in-VM failure mode. Drop the '|| true' workarounds in .github/workflows/ci.yaml that were masking the bug; replace them with explicit exit-code checks now that exit codes propagate honestly. Bump lockdown::set_panic_hook_with to pub(crate) so the regression test can install the production-shape hook (its doc comment already advertised it as "for unit tests"). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
4ba2010 to
04b8be5
Compare
fidencio
added a commit
to fidencio/nvrc
that referenced
this pull request
May 25, 2026
After fork_agent handoff, kata-agent keeps /dev/log across exec; the fork child only duplicated that socket fd. Running syslog_loop forever left a stray child under PID 1 and blocked VM teardown (kata exit 255 / nerdctl ttrpc: closed), even with try_poll (NVIDIA#154) avoiding spurious power-off. Exit the fork child right after a best-effort try_poll drain. syslog_loop and the power-off-hook regression test remain for unit coverage. CI: wrap nerdctl with set +e so non-zero exits are captured under bash -e, assert exit code 0, and fail on kata-runtime fatal output (ttrpc: closed).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
NVRC's syslog_loop runs in the syslog poller child of fork_agent for
~136 years, draining /dev/log every 500ms. NVRC's power-off panic hook
(installed in main() before the fork) is inherited by the child — by
design: a real NVRC failure in there must still tear down the VM, per
the ephemeral confidential-VM fail-hard policy.
Guest VMs have no syslog daemon: main() binds /dev/log via
syslog::poll() before fork_agent, and the syslog child inherits that
socket. A transient drain I/O error during the long-running loop —
EBADF on recv, ENOSPC on append to /run/syslog.log, etc. — is NOT a
configuration failure and NOT VM-fatal. Yet syslog::poll() (or_panic on
any error) made every such hiccup panic the syslog child, fire the
inherited power-off hook, and reboot the VM while kata-agent (now PID 1)
was still serving the workload. kata-runtime would then observe a dead
agent and return exit code 255 to docker/nerdctl, masking the workload's
real exit code. This is the bug behind the '|| true' masks that had been
added to .github/workflows/ci.yaml around the nerdctl invocations.
Fix: syslog_loop now calls syslog::try_poll() (best-effort, silently
swallows transient I/O errors — already provided for exactly this case)
instead of syslog::poll(). The fail-hard panic hook stays in place for
every other code path; only the auxiliary syslog drain is demoted from
VM-fatal to best-effort.
Add test_syslog_loop_does_not_trigger_power_off_hook as a regression
guard: fork to isolate the production-shape panic hook from parallel
tests, run a short syslog_loop in the child, and assert the hook never
fired (exit 0). The test child does a fresh bind (unlike production,
where main() already bound /dev/log); on dev hosts where a host syslog
daemon owns /dev/log, reverting to syslog::poll() reproduces the bug
via EADDRINUSE — a local reproducer, not the in-VM failure mode.
Drop the '|| true' workarounds in .github/workflows/ci.yaml that were
masking the bug; replace them with explicit exit-code checks now that
exit codes propagate honestly. Bump lockdown::set_panic_hook_with to
pub(crate) so the regression test can install the production-shape hook
(its doc comment already advertised it as "for unit tests").