fix(kata-agent): use syslog::try_poll in drain loop to avoid kata 255 by fidencio · Pull Request #154 · NVIDIA/nvrc

fidencio · 2026-05-24T19:41:36Z

NVRC's syslog_loop runs in the syslog poller child of fork_agent for
~136 years, draining /dev/log every 500ms. NVRC's power-off panic hook
(installed in main() before the fork) is inherited by the child — by
design: a real NVRC failure in there must still tear down the VM, per
the ephemeral confidential-VM fail-hard policy.

Guest VMs have no syslog daemon: main() binds /dev/log via
syslog::poll() before fork_agent, and the syslog child inherits that
socket. A transient drain I/O error during the long-running loop —
EBADF on recv, ENOSPC on append to /run/syslog.log, etc. — is NOT a
configuration failure and NOT VM-fatal. Yet syslog::poll() (or_panic on
any error) made every such hiccup panic the syslog child, fire the
inherited power-off hook, and reboot the VM while kata-agent (now PID 1)
was still serving the workload. kata-runtime would then observe a dead
agent and return exit code 255 to docker/nerdctl, masking the workload's
real exit code. This is the bug behind the '|| true' masks that had been
added to .github/workflows/ci.yaml around the nerdctl invocations.

Fix: syslog_loop now calls syslog::try_poll() (best-effort, silently
swallows transient I/O errors — already provided for exactly this case)
instead of syslog::poll(). The fail-hard panic hook stays in place for
every other code path; only the auxiliary syslog drain is demoted from
VM-fatal to best-effort.

Add test_syslog_loop_does_not_trigger_power_off_hook as a regression
guard: fork to isolate the production-shape panic hook from parallel
tests, run a short syslog_loop in the child, and assert the hook never
fired (exit 0). The test child does a fresh bind (unlike production,
where main() already bound /dev/log); on dev hosts where a host syslog
daemon owns /dev/log, reverting to syslog::poll() reproduces the bug
via EADDRINUSE — a local reproducer, not the in-VM failure mode.

Drop the '|| true' workarounds in .github/workflows/ci.yaml that were
masking the bug; replace them with explicit exit-code checks now that
exit codes propagate honestly. Bump lockdown::set_panic_hook_with to
pub(crate) so the regression test can install the production-shape hook
(its doc comment already advertised it as "for unit tests").

fidencio · 2026-05-24T19:57:02Z

@zvonkok, could you allow the other tests to run? I'm specially interested in the nerdctl tests as those reproduce exactly an issue I've faced with docker + kata (using a NVRC based rootfs).

NVRC's syslog_loop runs in the syslog poller child of fork_agent for ~136 years, draining /dev/log every 500ms. NVRC's power-off panic hook (installed in main() before the fork) is inherited by the child — by design: a real NVRC failure in there must still tear down the VM, per the ephemeral confidential-VM fail-hard policy. Guest VMs have no syslog daemon: main() binds /dev/log via syslog::poll() before fork_agent, and the syslog child inherits that socket. A transient drain I/O error during the long-running loop — EBADF on recv, ENOSPC on append to /run/syslog.log, etc. — is NOT a configuration failure and NOT VM-fatal. Yet syslog::poll() (or_panic on any error) made every such hiccup panic the syslog child, fire the inherited power-off hook, and reboot the VM while kata-agent (now PID 1) was still serving the workload. kata-runtime would then observe a dead agent and return exit code 255 to docker/nerdctl, masking the workload's real exit code. This is the bug behind the '|| true' masks that had been added to .github/workflows/ci.yaml around the nerdctl invocations. Fix: syslog_loop now calls syslog::try_poll() (best-effort, silently swallows transient I/O errors — already provided for exactly this case) instead of syslog::poll(). The fail-hard panic hook stays in place for every other code path; only the auxiliary syslog drain is demoted from VM-fatal to best-effort. Add test_syslog_loop_does_not_trigger_power_off_hook as a regression guard: fork to isolate the production-shape panic hook from parallel tests, run a short syslog_loop in the child, and assert the hook never fired (exit 0). The test child does a fresh bind (unlike production, where main() already bound /dev/log); on dev hosts where a host syslog daemon owns /dev/log, reverting to syslog::poll() reproduces the bug via EADDRINUSE — a local reproducer, not the in-VM failure mode. Drop the '|| true' workarounds in .github/workflows/ci.yaml that were masking the bug; replace them with explicit exit-code checks now that exit codes propagate honestly. Bump lockdown::set_panic_hook_with to pub(crate) so the regression test can install the production-shape hook (its doc comment already advertised it as "for unit tests"). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>

After fork_agent handoff, kata-agent keeps /dev/log across exec; the fork child only duplicated that socket fd. Running syslog_loop forever left a stray child under PID 1 and blocked VM teardown (kata exit 255 / nerdctl ttrpc: closed), even with try_poll (NVIDIA#154) avoiding spurious power-off. Exit the fork child right after a best-effort try_poll drain. syslog_loop and the power-off-hook regression test remain for unit coverage. CI: wrap nerdctl with set +e so non-zero exits are captured under bash -e, assert exit code 0, and fail on kata-runtime fatal output (ttrpc: closed).

fidencio force-pushed the topic/fix-kata-agent-exit-code branch from d119a38 to 765a80b Compare May 24, 2026 19:51

fidencio marked this pull request as ready for review May 24, 2026 19:54

zvonkok added the ok-to-test Ok to test label May 25, 2026

fidencio force-pushed the topic/fix-kata-agent-exit-code branch from 765a80b to 4ba2010 Compare May 25, 2026 13:24

fidencio force-pushed the topic/fix-kata-agent-exit-code branch from 4ba2010 to 04b8be5 Compare May 25, 2026 13:29

zvonkok merged commit b9b26fd into NVIDIA:main May 25, 2026
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kata-agent): use syslog::try_poll in drain loop to avoid kata 255#154

fix(kata-agent): use syslog::try_poll in drain loop to avoid kata 255#154
zvonkok merged 1 commit into
NVIDIA:mainfrom
fidencio:topic/fix-kata-agent-exit-code

fidencio commented May 24, 2026 •

edited

Loading

Uh oh!

fidencio commented May 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

fidencio commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fidencio commented May 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fidencio commented May 24, 2026 •

edited

Loading