Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd: Interactions of restartIfChanged, reloadIfChanged, stopIfChanged is fragile #49528

Open
arianvp opened this issue Oct 31, 2018 · 21 comments

Comments

@arianvp
Copy link
Member

arianvp commented Oct 31, 2018

Issue description

The following three options describe how a unit must be reloaded on switch-to-configuration:

       systemd.user.services.<name>.reloadIfChanged
           Whether the service should be reloaded during a NixOS configuration switch if its definition has changed. If enabled, the value of
           restartIfChanged is ignored.

           Type: boolean

           Default: false

           Declared by:
               <nixpkgs/nixos/modules/system/boot/systemd.nix>


       systemd.user.services.<name>.restartIfChanged
           Whether the service should be restarted during a NixOS configuration switch if its definition has changed.

           Type: boolean

           Default: true

           Declared by:
               <nixpkgs/nixos/modules/system/boot/systemd.nix>


       systemd.services.<name>.stopIfChanged
           If set, a changed unit is restarted by calling systemctl stop in the old configuration, then systemctl start in the new one. Otherwise,
           it is restarted in a single step using systemctl restart in the new configuration. The latter is less correct because it runs the
           ExecStop commands from the new configuration.

           Type: boolean

           Default: true

           Declared by:
               <nixpkgs/nixos/modules/system/boot/systemd.nix>

Their behaviour tightly depends on eachother. E.g. restartIfChanged behaves differently than documented if stopIfChanged is true, but it's true by default, so restartIfChanged never behaves as documented.
But they are not documented in the same place (the options are not next to eachother in man configuration.nix). Also, stopIfChanged has a bit of a deceptive name. This makes these options hard to discover, and they can act a bit surprisingly.

Solution:

Lets instead replace the three options with one option:

options.onChanged.type   = types.enum [ "reload" "restart"  "stop-and-start" ]
options.onChanged.default = "stop-and-start"
@Gerschtli
Copy link
Contributor

Any comments of the maintainer? I would love to have this fixed

@arianvp
Copy link
Member Author

arianvp commented Apr 23, 2019

Cc @edolstra as he has written most of the activation logic. I still very much would like to see it fixed. Also happy to help with implementing. Just need feedback whether I am on the right track and not missing any edge cases.

@matthewbauer matthewbauer modified the milestones: 20.03, 19.09 May 27, 2019
@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/design-choice-of-switch-to-configuration/6016/2

@flokli
Copy link
Contributor

flokli commented May 1, 2020

I had some discussion with @aanderse about reload behaviour and also digged into the switch-to-configuration script some time, so I can provide a bit more insight on the behaviour:

As for the 3 booleans, they might still be required, but maybe the names are misleading:

  • restartIfChanged will control whether the unit will be systemctl restarted, or separately stopped, and later started during activation
  • stopIfChanged controls whether the unit should be stopped, if it isn't part of the next generation we're switching to, or should be left running
  • reloadIfChanged currently controls whether a changed unit should just be reloaded on all changes, or restarted.

Even when a unit provides a way to be reloaded, there's still cases where a restart is necessary - like when the underlying binary is exchanged due to a (security) update, or some of the other directives in the systemd unit changed (like the user a service should run as, sandboxing options etc.).

This is currently a big bug IMHO, and we should only reload when its safe, and in all other cases restart.

I'll tinker around with the semantics a bit, but my current idea could entirely deprecate the need for a reloadIfChanged option, while still allowing safe restarts, if only configuration has been updated.

@flokli
Copy link
Contributor

flokli commented May 23, 2020

So, to be a bit more specific here, the idea is to do the following:

  • If any parameter in the systemd service file changed, restart the service (via systemctl restart or systemctl start|stop, depending on stopIfChanged). (ignore the X-Re*IfChanged parameters while comparing old and new unit file)
    • Changes of the underlying binary, change of sandboxing or other systemd options should basically always restart the service, as there's no way to apply these to an already running process. We currently just never restart services with reloadIfChanged, which is a serious bug IMHO.
    • Provide some circuit breaker (neverRestartIfChanged) for critical services, that shouldn't be automatically restarted (think of display-manager.service). We should still complain during activation to tell the user what services were skipped, so they can restart them at their convenience.

Then, we could sketch a safer reloading mechanism:

Services that provide a way to reload their configuration while running, and that want to provide the reload possibility need to indirect their configuration file via environment.etc (so the ExecStart*= parameters don't change if only the configuration file is updated)

During activation of the new configuration, the activation script will notice only the monitored config files changed, and trigger a reload.

I think these semantics make reloading services much safer and more predictable.

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/design-choice-of-switch-to-configuration/6016/4

@flokli
Copy link
Contributor

flokli commented May 23, 2020

Part of this effort would also be to fix our restart behaviour of systemd's .socket units, which also helps keeping connections open during a service restart:

If a service is upgraded we can restart the service while keeping around its sockets, thus ensuring the service is continously responsive. Not a single connection is lost during the upgrade.

This requires some coorperation of the managed service, but is also something we should make use of, as it allows us to systemctl restart services without the user really noticing ;-)

@stale
Copy link

stale bot commented Nov 20, 2020

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 20, 2020
@aanderse
Copy link
Member

Definitely still relevant. I guess @flokli and I should chat about this again at some point...

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 20, 2020
@flokli flokli modified the milestones: 19.09, 21.03 Dec 8, 2020
@symphorien
Copy link
Member

I came here from #49415.
It looks like I had an instance where:

  • kresd has a new configuration file
  • nixos-rebuild stops kresd
  • an acme timer starts kresd again, with the old configuration
  • systemd is restarted
  • kresd@1.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  • kresd is not restarted with the new configuration

This is a case where stopIfChanged should be false:

#49415 (comment)

systemctl restart has the problem that it is less correct and can lead to subtle bugs. Hence we don't default to it. See
systemd.services..stopIfChanged
If set, a changed unit is restarted by calling systemctl stop in the old
configuration, then systemctl start in the new one. Otherwise, it is restarted in a
single step using systemctl restart in the new configuration. The latter is less
correct because it runs the ExecStop commands from the new configuration.
Moving the stop to after the daemon-reload has similar problems. and would be a major breaking change IMO

but kresd has no ExecStop stanza, so it's not a problem here. (cc @vcunat, maintainer)

My suggestion: what if we make all services without ExecStop stopIfChanged=false ?

@symphorien
Copy link
Member

For the race condition where a service stopped before systemctl deamon-reload can still be started by a timer, a possibility is to use systemd-run:
Instead of

# systemtl stop earlyoom.service
# systemctl daemon-reloead

we run

#  systemd-run --unit=nixos-upgrade-reload-systemd --property=After=earlyoom.service --property=Conflicts=earlyoom.service --pipe --wait --collect --service-type=exec systemctl daemon-reload
Running as unit: nixos-upgrade-reload-systemd.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 2.014s
CPU time consumed: 78ms
IP traffic received: 0B
IP traffic sent: 0B

This tells systemd to ensure earlyoom is still down when running daemon-reload.

This also extends to daemon-reexec, but we have to check ourselves that the command was successful, because systemd fails to keep track of this across reexec:

#  systemd-run --unit=nixos-upgrade-reload-systemd --property=After=earlyoom.service --property=Conflicts=earlyoom.service --pipe --wait --collect --service-type=exec systemctl daemon-reexec
Running as unit: nixos-upgrade-reload-systemd.service
Failed to query unit state: Message recipient disconnected from message bus without replying
Finished with result: success
CPU time consumed: 25ms
IP traffic received: 0B
IP traffic sent: 0B

vcunat added a commit that referenced this issue Dec 25, 2020
Since version 5.2.0 there's non-empty stop phase:
    ExecStopPost=/usr/bin/env rm -f "/run/knot-resolver/control/%i"
but it's perfectly OK to run that from a different version
(and typically it's no-op anyway).  Real-life example where this helps:
#49528 (comment)
@stale
Copy link

stale bot commented Jun 18, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 18, 2021
@mohe2015
Copy link
Contributor

definitely not stale

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 18, 2021
@aanderse
Copy link
Member

I don't like the unsavory things I've had to do to work around this. Is anyone interested in an online and informal hackathon to tackle this?

@symphorien
Copy link
Member

I'm quite busy at the moment but I can report on my failed experiment using systemd-run like in #49528 (comment) (from what I can recall, I tried this several months ago)

when a service foo.service must be stopped, systemd-run --property=Conflicts=foo.service will fail to start. Apparently there is an internal property of jobs that tells what happens when there is a conflict with a running service, and normal systemd services behave differently from systemd-run jobs. So I gave up.

But I learned recently that one can create impure services by writing somewhere in /run instead of /etc/systemd. nixos-rebuild could write a new service with a unique name to /run. it would conflict with services which must be stopped. If the name is unique, there is no need to systemctl daemon-reload to start it if I understand correctly.

@stale
Copy link

stale bot commented Jan 9, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 9, 2022
@dasJ
Copy link
Member

dasJ commented Feb 6, 2022

Is this solved by #157329?

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Feb 6, 2022
@flokli
Copy link
Contributor

flokli commented Feb 7, 2022

#157329 introduces more stuff, but doesn't deprecate using reloadIfChanged, no?

@dasJ
Copy link
Member

dasJ commented Feb 7, 2022

Agreed. Would people here agree the problem is somewhat mitigated when reloadIfChanged is gone and the options are somewhat documented in the manual?

@Artturin Artturin modified the milestones: 21.05, 23.05 Dec 31, 2022
@RaitoBezarius
Copy link
Member

This seems to require a serious focus group to fix this issue.

@nh2
Copy link
Contributor

nh2 commented Apr 28, 2024

Would people here agree the problem is somewhat mitigated when reloadIfChanged is gone and the options are somewhat documented in the manual?

No, see #49415 (comment)

NixOS configuration switching is fundamentally broken because forget about reload: systemctl stop + start itself is broken, independent of the *IfChanged configs.

That violates the fundamental promise of NixOS, congruent configuration management: "The system runs the declared services, not some old leftover ones".

I've re-opened #49415 (comment) for that, so then this issue can be about the *IfChanged options again.

And I've created a PR that will hopefully help: #307562

Please help test!

nh2 added a commit to nh2/nixpkgs that referenced this issue Apr 28, 2024
…ixes NixOS#49415

See the added comment for a detail description.

See also the related NixOS#49528.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: To Do
Development

No branches or pull requests