Skip to content

fix(kata-agent): use syslog::try_poll in drain loop to avoid kata 255#154

Merged
zvonkok merged 1 commit into
NVIDIA:mainfrom
fidencio:topic/fix-kata-agent-exit-code
May 25, 2026
Merged

fix(kata-agent): use syslog::try_poll in drain loop to avoid kata 255#154
zvonkok merged 1 commit into
NVIDIA:mainfrom
fidencio:topic/fix-kata-agent-exit-code

Conversation

@fidencio
Copy link
Copy Markdown
Collaborator

@fidencio fidencio commented May 24, 2026

NVRC's syslog_loop runs in the syslog poller child of fork_agent for
~136 years, draining /dev/log every 500ms. NVRC's power-off panic hook
(installed in main() before the fork) is inherited by the child — by
design: a real NVRC failure in there must still tear down the VM, per
the ephemeral confidential-VM fail-hard policy.

Guest VMs have no syslog daemon: main() binds /dev/log via
syslog::poll() before fork_agent, and the syslog child inherits that
socket. A transient drain I/O error during the long-running loop —
EBADF on recv, ENOSPC on append to /run/syslog.log, etc. — is NOT a
configuration failure and NOT VM-fatal. Yet syslog::poll() (or_panic on
any error) made every such hiccup panic the syslog child, fire the
inherited power-off hook, and reboot the VM while kata-agent (now PID 1)
was still serving the workload. kata-runtime would then observe a dead
agent and return exit code 255 to docker/nerdctl, masking the workload's
real exit code. This is the bug behind the '|| true' masks that had been
added to .github/workflows/ci.yaml around the nerdctl invocations.

Fix: syslog_loop now calls syslog::try_poll() (best-effort, silently
swallows transient I/O errors — already provided for exactly this case)
instead of syslog::poll(). The fail-hard panic hook stays in place for
every other code path; only the auxiliary syslog drain is demoted from
VM-fatal to best-effort.

Add test_syslog_loop_does_not_trigger_power_off_hook as a regression
guard: fork to isolate the production-shape panic hook from parallel
tests, run a short syslog_loop in the child, and assert the hook never
fired (exit 0). The test child does a fresh bind (unlike production,
where main() already bound /dev/log); on dev hosts where a host syslog
daemon owns /dev/log, reverting to syslog::poll() reproduces the bug
via EADDRINUSE — a local reproducer, not the in-VM failure mode.

Drop the '|| true' workarounds in .github/workflows/ci.yaml that were
masking the bug; replace them with explicit exit-code checks now that
exit codes propagate honestly. Bump lockdown::set_panic_hook_with to
pub(crate) so the regression test can install the production-shape hook
(its doc comment already advertised it as "for unit tests").

@fidencio fidencio force-pushed the topic/fix-kata-agent-exit-code branch from d119a38 to 765a80b Compare May 24, 2026 19:51
@fidencio fidencio marked this pull request as ready for review May 24, 2026 19:54
@fidencio
Copy link
Copy Markdown
Collaborator Author

@zvonkok, could you allow the other tests to run? I'm specially interested in the nerdctl tests as those reproduce exactly an issue I've faced with docker + kata (using a NVRC based rootfs).

@zvonkok zvonkok added the ok-to-test Ok to test label May 25, 2026
@fidencio fidencio force-pushed the topic/fix-kata-agent-exit-code branch from 765a80b to 4ba2010 Compare May 25, 2026 13:24
NVRC's syslog_loop runs in the syslog poller child of fork_agent for
~136 years, draining /dev/log every 500ms. NVRC's power-off panic hook
(installed in main() before the fork) is inherited by the child — by
design: a real NVRC failure in there must still tear down the VM, per
the ephemeral confidential-VM fail-hard policy.

Guest VMs have no syslog daemon: main() binds /dev/log via
syslog::poll() before fork_agent, and the syslog child inherits that
socket. A transient drain I/O error during the long-running loop —
EBADF on recv, ENOSPC on append to /run/syslog.log, etc. — is NOT a
configuration failure and NOT VM-fatal. Yet syslog::poll() (or_panic on
any error) made every such hiccup panic the syslog child, fire the
inherited power-off hook, and reboot the VM while kata-agent (now PID 1)
was still serving the workload. kata-runtime would then observe a dead
agent and return exit code 255 to docker/nerdctl, masking the workload's
real exit code. This is the bug behind the '|| true' masks that had been
added to .github/workflows/ci.yaml around the nerdctl invocations.

Fix: syslog_loop now calls syslog::try_poll() (best-effort, silently
swallows transient I/O errors — already provided for exactly this case)
instead of syslog::poll(). The fail-hard panic hook stays in place for
every other code path; only the auxiliary syslog drain is demoted from
VM-fatal to best-effort.

Add test_syslog_loop_does_not_trigger_power_off_hook as a regression
guard: fork to isolate the production-shape panic hook from parallel
tests, run a short syslog_loop in the child, and assert the hook never
fired (exit 0). The test child does a fresh bind (unlike production,
where main() already bound /dev/log); on dev hosts where a host syslog
daemon owns /dev/log, reverting to syslog::poll() reproduces the bug
via EADDRINUSE — a local reproducer, not the in-VM failure mode.

Drop the '|| true' workarounds in .github/workflows/ci.yaml that were
masking the bug; replace them with explicit exit-code checks now that
exit codes propagate honestly. Bump lockdown::set_panic_hook_with to
pub(crate) so the regression test can install the production-shape hook
(its doc comment already advertised it as "for unit tests").

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
@fidencio fidencio force-pushed the topic/fix-kata-agent-exit-code branch from 4ba2010 to 04b8be5 Compare May 25, 2026 13:29
@zvonkok zvonkok merged commit b9b26fd into NVIDIA:main May 25, 2026
20 checks passed
fidencio added a commit to fidencio/nvrc that referenced this pull request May 25, 2026
After fork_agent handoff, kata-agent keeps /dev/log across exec; the fork
child only duplicated that socket fd. Running syslog_loop forever left a
stray child under PID 1 and blocked VM teardown (kata exit 255 / nerdctl
ttrpc: closed), even with try_poll (NVIDIA#154) avoiding spurious power-off.

Exit the fork child right after a best-effort try_poll drain. syslog_loop
and the power-off-hook regression test remain for unit coverage.

CI: wrap nerdctl with set +e so non-zero exits are captured under bash -e,
assert exit code 0, and fail on kata-runtime fatal output (ttrpc: closed).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ok-to-test Ok to test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants