systemd: Interactions of `restartIfChanged, reloadIfChanged, stopIfChanged` is fragile #49528

arianvp · 2018-10-31T14:14:02Z

Issue description

The following three options describe how a unit must be reloaded on switch-to-configuration:

       systemd.user.services.<name>.reloadIfChanged
           Whether the service should be reloaded during a NixOS configuration switch if its definition has changed. If enabled, the value of
           restartIfChanged is ignored.

           Type: boolean

           Default: false

           Declared by:
               <nixpkgs/nixos/modules/system/boot/systemd.nix>


       systemd.user.services.<name>.restartIfChanged
           Whether the service should be restarted during a NixOS configuration switch if its definition has changed.

           Type: boolean

           Default: true

           Declared by:
               <nixpkgs/nixos/modules/system/boot/systemd.nix>


       systemd.services.<name>.stopIfChanged
           If set, a changed unit is restarted by calling systemctl stop in the old configuration, then systemctl start in the new one. Otherwise,
           it is restarted in a single step using systemctl restart in the new configuration. The latter is less correct because it runs the
           ExecStop commands from the new configuration.

           Type: boolean

           Default: true

           Declared by:
               <nixpkgs/nixos/modules/system/boot/systemd.nix>

Their behaviour tightly depends on eachother. E.g. restartIfChanged behaves differently than documented if stopIfChanged is true, but it's true by default, so restartIfChanged never behaves as documented.
But they are not documented in the same place (the options are not next to eachother in man configuration.nix). Also, stopIfChanged has a bit of a deceptive name. This makes these options hard to discover, and they can act a bit surprisingly.

Solution:

Lets instead replace the three options with one option:

options.onChanged.type   = types.enum [ "reload" "restart"  "stop-and-start" ]
options.onChanged.default = "stop-and-start"

The text was updated successfully, but these errors were encountered:

Gerschtli · 2018-12-27T18:17:00Z

Any comments of the maintainer? I would love to have this fixed

arianvp · 2019-04-23T08:33:20Z

Cc @edolstra as he has written most of the activation logic. I still very much would like to see it fixed. Also happy to help with implementing. Just need feedback whether I am on the right track and not missing any edge cases.

nixos-discourse · 2020-02-24T15:01:52Z

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/design-choice-of-switch-to-configuration/6016/2

flokli · 2020-05-01T00:57:45Z

I had some discussion with @aanderse about reload behaviour and also digged into the switch-to-configuration script some time, so I can provide a bit more insight on the behaviour:

As for the 3 booleans, they might still be required, but maybe the names are misleading:

restartIfChanged will control whether the unit will be systemctl restarted, or separately stopped, and later started during activation
stopIfChanged controls whether the unit should be stopped, if it isn't part of the next generation we're switching to, or should be left running
reloadIfChanged currently controls whether a changed unit should just be reloaded on all changes, or restarted.

Even when a unit provides a way to be reloaded, there's still cases where a restart is necessary - like when the underlying binary is exchanged due to a (security) update, or some of the other directives in the systemd unit changed (like the user a service should run as, sandboxing options etc.).

This is currently a big bug IMHO, and we should only reload when its safe, and in all other cases restart.

I'll tinker around with the semantics a bit, but my current idea could entirely deprecate the need for a reloadIfChanged option, while still allowing safe restarts, if only configuration has been updated.

flokli · 2020-05-23T17:10:01Z

So, to be a bit more specific here, the idea is to do the following:

If any parameter in the systemd service file changed, restart the service (via systemctl restart or systemctl start|stop, depending on stopIfChanged). (ignore the X-Re*IfChanged parameters while comparing old and new unit file)
- Changes of the underlying binary, change of sandboxing or other systemd options should basically always restart the service, as there's no way to apply these to an already running process. We currently just never restart services with reloadIfChanged, which is a serious bug IMHO.
- Provide some circuit breaker (neverRestartIfChanged) for critical services, that shouldn't be automatically restarted (think of display-manager.service). We should still complain during activation to tell the user what services were skipped, so they can restart them at their convenience.

Then, we could sketch a safer reloading mechanism:

Services that provide a way to reload their configuration while running, and that want to provide the reload possibility need to indirect their configuration file via environment.etc (so the ExecStart*= parameters don't change if only the configuration file is updated)

During activation of the new configuration, the activation script will notice only the monitored config files changed, and trigger a reload.

I think these semantics make reloading services much safer and more predictable.

nixos-discourse · 2020-05-23T17:13:48Z

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/design-choice-of-switch-to-configuration/6016/4

flokli · 2020-05-23T17:19:25Z

Part of this effort would also be to fix our restart behaviour of systemd's .socket units, which also helps keeping connections open during a service restart:

If a service is upgraded we can restart the service while keeping around its sockets, thus ensuring the service is continously responsive. Not a single connection is lost during the upgrade.

This requires some coorperation of the managed service, but is also something we should make use of, as it allows us to systemctl restart services without the user really noticing ;-)

stale · 2020-11-20T01:32:08Z

I marked this as stale due to inactivity. → More info

aanderse · 2020-11-20T09:47:31Z

Definitely still relevant. I guess @flokli and I should chat about this again at some point...

symphorien · 2020-12-17T21:50:16Z

I came here from #49415.
It looks like I had an instance where:

kresd has a new configuration file
nixos-rebuild stops kresd
an acme timer starts kresd again, with the old configuration
systemd is restarted
kresd@1.service: Current command vanished from the unit file, execution of the command list won't be resumed.
kresd is not restarted with the new configuration

This is a case where stopIfChanged should be false:

#49415 (comment)

systemctl restart has the problem that it is less correct and can lead to subtle bugs. Hence we don't default to it. See
systemd.services..stopIfChanged
If set, a changed unit is restarted by calling systemctl stop in the old
configuration, then systemctl start in the new one. Otherwise, it is restarted in a
single step using systemctl restart in the new configuration. The latter is less
correct because it runs the ExecStop commands from the new configuration.
Moving the stop to after the daemon-reload has similar problems. and would be a major breaking change IMO

but kresd has no ExecStop stanza, so it's not a problem here. (cc @vcunat, maintainer)

My suggestion: what if we make all services without ExecStop stopIfChanged=false ?

symphorien · 2020-12-20T20:04:25Z

For the race condition where a service stopped before systemctl deamon-reload can still be started by a timer, a possibility is to use systemd-run:
Instead of

# systemtl stop earlyoom.service
# systemctl daemon-reloead

we run

#  systemd-run --unit=nixos-upgrade-reload-systemd --property=After=earlyoom.service --property=Conflicts=earlyoom.service --pipe --wait --collect --service-type=exec systemctl daemon-reload
Running as unit: nixos-upgrade-reload-systemd.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 2.014s
CPU time consumed: 78ms
IP traffic received: 0B
IP traffic sent: 0B

This tells systemd to ensure earlyoom is still down when running daemon-reload.

This also extends to daemon-reexec, but we have to check ourselves that the command was successful, because systemd fails to keep track of this across reexec:

#  systemd-run --unit=nixos-upgrade-reload-systemd --property=After=earlyoom.service --property=Conflicts=earlyoom.service --pipe --wait --collect --service-type=exec systemctl daemon-reexec
Running as unit: nixos-upgrade-reload-systemd.service
Failed to query unit state: Message recipient disconnected from message bus without replying
Finished with result: success
CPU time consumed: 25ms
IP traffic received: 0B
IP traffic sent: 0B

Since version 5.2.0 there's non-empty stop phase: ExecStopPost=/usr/bin/env rm -f "/run/knot-resolver/control/%i" but it's perfectly OK to run that from a different version (and typically it's no-op anyway). Real-life example where this helps: #49528 (comment)

stale · 2021-06-18T21:44:19Z

I marked this as stale due to inactivity. → More info

mohe2015 · 2021-06-18T22:27:50Z

definitely not stale

aanderse · 2021-06-23T00:44:49Z

I don't like the unsavory things I've had to do to work around this. Is anyone interested in an online and informal hackathon to tackle this?

symphorien · 2021-06-26T14:47:28Z

I'm quite busy at the moment but I can report on my failed experiment using systemd-run like in #49528 (comment) (from what I can recall, I tried this several months ago)

when a service foo.service must be stopped, systemd-run --property=Conflicts=foo.service will fail to start. Apparently there is an internal property of jobs that tells what happens when there is a conflict with a running service, and normal systemd services behave differently from systemd-run jobs. So I gave up.

But I learned recently that one can create impure services by writing somewhere in /run instead of /etc/systemd. nixos-rebuild could write a new service with a unique name to /run. it would conflict with services which must be stopped. If the name is unique, there is no need to systemctl daemon-reload to start it if I understand correctly.

stale · 2022-01-09T17:07:15Z

I marked this as stale due to inactivity. → More info

dasJ · 2022-02-06T21:27:31Z

Is this solved by #157329?

flokli · 2022-02-07T08:23:21Z

#157329 introduces more stuff, but doesn't deprecate using reloadIfChanged, no?

dasJ · 2022-02-07T13:11:24Z

Agreed. Would people here agree the problem is somewhat mitigated when reloadIfChanged is gone and the options are somewhat documented in the manual?

RaitoBezarius · 2023-04-03T12:15:02Z

This seems to require a serious focus group to fix this issue.

nh2 · 2024-04-28T21:09:50Z

Would people here agree the problem is somewhat mitigated when reloadIfChanged is gone and the options are somewhat documented in the manual?

No, see #49415 (comment)

NixOS configuration switching is fundamentally broken because forget about reload: systemctl stop + start itself is broken, independent of the *IfChanged configs.

That violates the fundamental promise of NixOS, congruent configuration management: "The system runs the declared services, not some old leftover ones".

I've re-opened #49415 (comment) for that, so then this issue can be about the *IfChanged options again.

And I've created a PR that will hopefully help: #307562

Please help test!

…ixes NixOS#49415 See the added comment for a detail description. See also the related NixOS#49528.

fpletz added 0.kind: enhancement 6.topic: nixos labels Feb 12, 2019

arianvp mentioned this issue Apr 23, 2019

nixos/system: socket unit restart logic #33661

Closed

8 tasks

matthewbauer modified the milestones: 20.03, 19.09 May 27, 2019

arianvp mentioned this issue Mar 21, 2020

systemd timers can result in changed services not being restarted reliably #49415

Open

flokli mentioned this issue May 1, 2020

nixos/nginx: run nginx-config-reload as the nginx user and group #85820

Closed

10 tasks

This was referenced May 23, 2020

nixos/haproxy: add reloading support, use upstream service hardening #88434

Merged

[RFE] Reloadable PostgreSQL #65296

Open

codygman mentioned this issue Jul 12, 2020

nixpkgs services.lorri requires reboot for the user lorri.socket to appear target/lorri#374

Open

flokli mentioned this issue Sep 26, 2020

The switch-to-configuration.pl logic is hard to follow and not really documented #24409

Open

stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 20, 2020

stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 20, 2020

flokli mentioned this issue Nov 24, 2020

nixos/unbound: add settings option, deprecate extraConfig #89572

Merged

10 tasks

flokli modified the milestones: 19.09, 21.03 Dec 8, 2020

stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 18, 2021

stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 18, 2021

This was referenced Dec 15, 2021

nixos/switch-to-configuration: Rework activation script restarts #150859

Merged

nixos/home-assistant: reload the daemon when configuration changed #151241

Merged

stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 9, 2022

dasJ mentioned this issue Jan 12, 2022

nixos/switch-to-configuration: Proper unit file parser and clean/fix lower part of the script #154809

Merged

13 tasks

stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Feb 6, 2022

Artturin modified the milestones: 21.05, 23.05 Dec 31, 2022

RaitoBezarius modified the milestones: 23.05, 23.11 May 31, 2023

nh2 mentioned this issue Apr 28, 2024

switch-to-configuration.pl: Fix timers resulting in old unit running #307562

Open

13 tasks

nh2 added a commit to nh2/nixpkgs that referenced this issue Apr 28, 2024

switch-to-configuration.pl: Fix timers resulting in old unit running. F…

3f896ed

…ixes NixOS#49415 See the added comment for a detail description. See also the related NixOS#49528.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

systemd: Interactions of `restartIfChanged, reloadIfChanged, stopIfChanged` is fragile #49528

systemd: Interactions of `restartIfChanged, reloadIfChanged, stopIfChanged` is fragile #49528

arianvp commented Oct 31, 2018 •

edited

Loading

Gerschtli commented Dec 27, 2018

arianvp commented Apr 23, 2019

nixos-discourse commented Feb 24, 2020

flokli commented May 1, 2020 •

edited

Loading

flokli commented May 23, 2020

nixos-discourse commented May 23, 2020

flokli commented May 23, 2020

stale bot commented Nov 20, 2020

aanderse commented Nov 20, 2020

symphorien commented Dec 17, 2020

symphorien commented Dec 20, 2020

stale bot commented Jun 18, 2021

mohe2015 commented Jun 18, 2021

aanderse commented Jun 23, 2021

symphorien commented Jun 26, 2021

stale bot commented Jan 9, 2022

dasJ commented Feb 6, 2022

flokli commented Feb 7, 2022

dasJ commented Feb 7, 2022

RaitoBezarius commented Apr 3, 2023

nh2 commented Apr 28, 2024

systemd: Interactions of restartIfChanged, reloadIfChanged, stopIfChanged is fragile #49528

systemd: Interactions of restartIfChanged, reloadIfChanged, stopIfChanged is fragile #49528

Comments

arianvp commented Oct 31, 2018 • edited Loading

Issue description

Solution:

Gerschtli commented Dec 27, 2018

arianvp commented Apr 23, 2019

nixos-discourse commented Feb 24, 2020

flokli commented May 1, 2020 • edited Loading

flokli commented May 23, 2020

nixos-discourse commented May 23, 2020

flokli commented May 23, 2020

stale bot commented Nov 20, 2020

aanderse commented Nov 20, 2020

symphorien commented Dec 17, 2020

symphorien commented Dec 20, 2020

stale bot commented Jun 18, 2021

mohe2015 commented Jun 18, 2021

aanderse commented Jun 23, 2021

symphorien commented Jun 26, 2021

stale bot commented Jan 9, 2022

dasJ commented Feb 6, 2022

flokli commented Feb 7, 2022

dasJ commented Feb 7, 2022

RaitoBezarius commented Apr 3, 2023

nh2 commented Apr 28, 2024

systemd: Interactions of `restartIfChanged, reloadIfChanged, stopIfChanged` is fragile #49528

systemd: Interactions of `restartIfChanged, reloadIfChanged, stopIfChanged` is fragile #49528

arianvp commented Oct 31, 2018 •

edited

Loading

flokli commented May 1, 2020 •

edited

Loading