Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why does nix-daemon.service set KillMode=process #10964

Open
RuRo opened this issue Jun 26, 2024 · 5 comments
Open

Why does nix-daemon.service set KillMode=process #10964

RuRo opened this issue Jun 26, 2024 · 5 comments
Labels
protocol Things involving the daemon protocol & compatibility issues question

Comments

@RuRo
Copy link

RuRo commented Jun 26, 2024

Describe the bug

The nix-daemon.service.in file (which is used in NixOS to generate /etc/systemd/system/nix-daemon.service) sets KillMode=process.

Quoting from the systemd.kill man page (emphasis mine):

KillMode=
Specifies how processes of this unit shall be killed. One of control-group, mixed, process, none.

If set to control-group, all remaining processes in the control group of this unit will be killed on unit stop (for services: after the stop command is executed, as configured with ExecStop=). If set to mixed, the SIGTERM signal (see below) is sent to the main process while the subsequent SIGKILL signal (see below) is sent to all remaining processes of the unit's control group. If set to process, only the main process itself is killed (not recommended!). If set to none, no process is killed (strongly recommended against!). In this case, only the stop command will be executed on unit stop, but no process will be killed otherwise. Processes remaining alive after stop are left in their control group and the control group continues to exist after stop unless empty.

Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.

...

Defaults to control-group.

Is there a good reason, why KillMode=process is used here? I think that using control-group or mixed seems more appropriate for nix-daemon.

I suspect, that this misconfiguration might be the cause of some issues regarding nix-daemon not behaving well under OOM pressure on NixOS. Also, see systemd/systemd#32687.

Priorities

Add 👍 to issues you find important.

@RuRo RuRo added the bug label Jun 26, 2024
@roberth
Copy link
Member

roberth commented Jun 26, 2024

nix-daemon uses a forking model where the lifetime of the connection is coupled with long-running operations, both from the client as well as server's perspective of "operations"

  • A build can take a long time. By killing existing connection workers, all current builds will be terminated; usually for no good reason.
  • A client may remain connected for a long time, and is not equipped to recover from a connection failure (in fact it will refuse to connect ever again, which seems a little excessive, but might prevent accidental fork bombs perhaps; not sure). Examples
    • Cachix watch-store daemon
    • Hercules CI Agent main process
    • Hydra
    • Any other processes that connect for a long time, and/or perform many operations

So KillMode=process is intentional, and while it's not the only possible design for a system Nix service, changing it would require a significant architectural change.

@roberth roberth added question protocol Things involving the daemon protocol & compatibility issues and removed bug labels Jun 26, 2024
@RuRo
Copy link
Author

RuRo commented Jun 26, 2024

I don't quite understand your explanation. KillMode is supposed to control what happens when somebody (or something) requests for the nix-daemon service to be stopped. For example, this can happen if the user runs

sudo systemctl stop nix-daemon.service

or if the system is currently running out of memory and the user has configured systemd-oomd in such a way that it decided that nix-daemon.service should be killed to relieve memory pressure.

It seems to me that in these circumstances, terminating all current builds is exactly what should happen. nix-daemon runs its local build processes as direct children, in the same slice and (by default) in the same cgroup.

I don't think that it makes sense for systemctl stop nix-daemon.service to keep the builds going as orphan processes. Are the builders even capable of storing their results to /nix/store after they finish if no nix-daemon is running at that point?

@roberth
Copy link
Member

roberth commented Jun 26, 2024

Isn't the stop behavior also used when updating the unit, at least on NixOS?
In those cases it's usually not required to stop everything; that would be disruptive, as I described, although it'd be wise to gracefully restart any clients.
It's certainly not without flaws.

Are the builders even capable of storing their results to /nix/store after they finish if no nix-daemon is running at that point?

I haven't encountered any problems that would suggest that this breaks. As far as I can tell, the kernel and/or systemd don't release resources until the last worker is gone.


Some small changes we could make to improve matters:

  • Reopen daemon connections instead of outright refusing after the first disconnect or failure to connect
  • Add channel multiplexing to the daemon protocol so we can send heartbeat messages and time out stalled connections and clients, as well as signal to the client that the connection should be reestablished because of an update
  • Make the newly started nix-daemon terminate "foreign" processes in its cgroup after some timeout?

@roberth
Copy link
Member

roberth commented Jun 26, 2024

@RuRo
Copy link
Author

RuRo commented Jun 26, 2024

Isn't the stop behavior also used when updating the unit, at least on NixOS?

Hm, I was under the impression that there was some way to support reloading services without killing them by using ExecReload and SIGHUP, but that might only be relevant when the configuration of the program itself (ie nix.conf) is changed (systemctl reload), not when the service itself (ie nix-daemon.service) is changed (systemctl restart/daemon-reload).

Killing all builds due to a restart might indeed be undesirable, I haven't thought of that. I wonder if there is some way to distinguish between these two cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
protocol Things involving the daemon protocol & compatibility issues question
Projects
None yet
Development

No branches or pull requests

2 participants