dhcpcd service: order before network target #44524

Merged
merged 1 commit into from
Aug 13, 2018

Conversation

vincentbernat
Member

Motivation for this change

This reverts a change applied in PR #18491. When interfaces are
configured by DHCP (typical in a cloud environment), ordering after
network.target causes trouble for applications expecting some network to
be present on boot (for example, cloud-init is quite brittle when the
network hasn't been configured for cloud-init.service) and on
shutdown (for example, collectd needs to flush metrics on shutdown).

When ordering after network.target, we ensure applications relying on
network.target won't have any network reachability on boot and
potentially on shutdown.

Therefore, I think ordering before network.target is better.
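
A minimal sketch of the proposed ordering, written as a NixOS module fragment. It is trimmed to the ordering attributes for illustration and is not the actual diff; the real unit in nixos/modules/services/networking/dhcpcd.nix carries more settings.

```nix
{
  # Sketch only: the ordering-related attributes of the dhcpcd unit.
  systemd.services.dhcpcd = {
    # Pull network.target in alongside dhcpcd.
    wants = [ "network.target" ];
    # Start dhcpcd before network.target is reached. systemd reverses
    # the order on shutdown, so dhcpcd is stopped only after units that
    # are ordered After=network.target have been stopped.
    before = [ "network.target" ];
  };
}
```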

cc @groxxda @fpletz

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Fits CONTRIBUTING.md.

@GrahamcOfBorg GrahamcOfBorg added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Aug 5, 2018
@Mic92
Member

Mic92 commented Aug 6, 2018

The semantics of network.target is that all physical interfaces are available. We should not change that. There is a dedicated target called network-online.target for services that need a working network. (Also, services would be more reliable if they retried on failure instead of hoping that systemd can magically fix all network problems.) Both systemd-networkd and network-manager use network-online.target. Maybe dhcpcd's hook mechanism can be used to start/stop this target based on leases.

@globin
Member

globin commented Aug 6, 2018

ping @fpletz who had been working on dhcpcd and network-online.target

@fpletz
Member

fpletz commented Aug 6, 2018

@Mic92 The dhcpcd unit already activates network-online.target once a lease has been obtained, but unfortunately it does so for either IPv6 or IPv4. So network-online.target may be activated without an IPv4 default route if IPv6 succeeds first.

Dhcpcd can be configured to wait for either IPv4 or IPv6, but this will break IPv4-only or IPv6-only systems. Not sure what we should do here without introducing new options and rethinking the NixOS networking autoconfiguration logic.
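
For illustration, waiting for one specific address family could look like the fragment below. It assumes the `waitip` directive from dhcpcd.conf(5) and the `networking.dhcpcd.extraConfig` option; hard-coding `4` here is exactly what would stall an IPv6-only host.

```nix
{
  # Assumption: `waitip 4` makes dhcpcd wait for an IPv4 address
  # before it forks to the background. Use `waitip 6` for IPv6.
  networking.dhcpcd.extraConfig = ''
    waitip 4
  '';
}
```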

For instance, we might move away from automatic DHCP on all interfaces and let nixos-generate-config detect the physical networking interfaces and add a small example block to configuration.nix. Or we could add wildcard matching for interfaces (which is possible with both dhcpcd and networkd) and add a default block for en* or similar. In that case you would have to specify whether you have IPv4 or IPv6 connectivity, and if you got that wrong, well, network-online.target might wait for a while.
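
The wildcard idea could be sketched with dhcpcd's interface patterns. The fragment assumes the `networking.dhcpcd.allowInterfaces` option and an `en*` naming scheme; it is a hypothetical default, not something nixos-generate-config is known to emit.

```nix
{
  # Only run the DHCP client on interfaces whose name matches en*
  # (shell glob pattern), instead of on every interface.
  networking.dhcpcd.allowInterfaces = [ "en*" ];
}
```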

@vincentbernat The preferred solution for now is to add a wantedBy for network-online.target in the cloud-init.service. Is this a problem for you?

@vincentbernat
Member Author

It would solve half the problem. The documentation states that if you use After=network.target, you will be stopped before the network. For something like dhcpcd, this means the network will disappear while other services (also ordered after network.target) are being stopped, breaching the contract. Such services could be a syslog daemon flushing its logs, a data collector flushing its statistics, etc.
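
As a sketch of the contract described above, here is a daemon that needs the network while it stops. It assumes collectd is enabled through its NixOS module; the point is only the ordering attribute.

```nix
{
  # Started after network.target, therefore stopped before it on
  # shutdown. This only helps if whatever provides the network
  # (here dhcpcd) is itself ordered before network.target, so that
  # it is stopped later than collectd.
  systemd.services.collectd.after = [ "network.target" ];
}
```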

Moreover, I still think a system should do its best to ensure network is available when network.target is reached. The documentation may say this means nothing on boot, but many daemons use this ordering to get network in the easy cases (for servers where configuration can be done early in boot). For example, nginx is after network.target. If it listens on 0.0.0.0, no problem. If it listens on a specific IP address, it will crash (but hopefully be restarted, so it's not the perfect example). The fact that on other distributions (Debian, Redhat), DHCP is run before network.target for interfaces configured statically (in /etc/network/interfaces or /etc/sysconfig) may contribute to the fact that most upstreams rely on network.target for servers.

Also, I don't understand why static IP configuration is before network.target while dhcpcd is after network.target (well, I understand that DHCP may block boot, but on servers, DHCP is usually configured on purpose).

@Mic92
Member

Mic92 commented Aug 6, 2018

Mhm. Redhat does this. Archlinux puts dhcpcd after network. Ubuntu does not apply any network target to dhcpcd. Ubuntu's networking target is bound to network-online.target. Our nginx service is mainly bound to network.target because binding nginx to an IP only makes sense if the IP is fixed, in which case it can be set statically.

@vincentbernat
Member Author

On Redhat and Ubuntu, the default DHCP client is ISC DHCP client, so maybe nobody cares about the ordering sequence during boot for dhcpcd. For Redhat, the service handling the network setup (both static IP and DHCP) is network.service and is run before network.target. In Ubuntu (Xenial, didn't check for Bionic which is using systemd-networkd), networking.service is also before network.target and also handles both static IP and DHCP. On Debian, this is the same. In the three cases, for a server relying on these systems, unless there are network issues, you can expect the network to be correctly configured after network.target.

What was the problem when dhcpcd.service was before network.target? Just the ability to get a clean shutdown should favor going back to the initial situation.

@fpletz
Member

fpletz commented Aug 6, 2018

I've checked all mentioned distributions here and indeed all of their configured static and dynamic networking setup is ordered before network.target. All do ensure that network-online.target is also activated however.

If network-manager is used, the corresponding online-wait service is wanted by network-online.target. The network-manager service itself is started before network.target. Note that systemd-networkd works exactly like that. Because these daemons are ordered before network.target, the mentioned shutdown problem doesn't exist as you've explained. Also note that the systemd documentation mentions that network.target is only to be used for ordering services at shutdown.

So in my opinion these distributions are only using this "hack" because they haven't implemented a proper wait-for-online service for their networking setup scripts and have to solve shutdown, too.

If upstream or packages from those distros are distributing unit files using network.target for ordering at startup, this is clearly a bug and should be fixed. I know that lots of bundled unit files from upstream are wrong but haven't had a chance to look at the actual unit files from the mentioned distros for typical services.

Back to the topic of this PR: You clearly have a point here. We should fix the shutdown problem and your PR fixes this like in other distros because we currently have no way to wait for dhcpcd out of band. I think we should merge this PR and continue to ensure we only use network-online.target in services because we also support network-manager and systemd-networkd after all and eventually may move to networkd by default.

@fpletz
Member

fpletz commented Aug 6, 2018

So regarding cloud-init.service: Even if this PR may fix things inadvertently, the unit should still be ordered after network-online.target instead of network.target because it breaks if networking is not fully working yet and will break with networkd.
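
A minimal sketch of that ordering as a NixOS module fragment. It is an illustration of the idea, not the change that was eventually committed for cloud-init.

```nix
{
  systemd.services.cloud-init = {
    # Pull network-online.target into the boot transaction...
    wants = [ "network-online.target" ];
    # ...and do not start cloud-init until it has been reached.
    after = [ "network-online.target" ];
  };
}
```

`after` alone only orders the two units; `wants` is what makes sure network-online.target is actually requested.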

vincentbernat added a commit to vincentbernat/nixpkgs that referenced this pull request Aug 7, 2018
Some modules of cloud-init can cope with a network not immediately
available (notably, the EC2 module), but some others won't retry if
network is not available (notably, the Cloudstack module).
network.target doesn't give much guarantee about the network
availability. Applications not able to start without a fully
configured network should be ordered after network-online.target.

Also see NixOS#44573 and NixOS#44524.
xeji pushed a commit that referenced this pull request Aug 7, 2018
@fpletz fpletz merged commit 0371570 into NixOS:master Aug 13, 2018
@fpletz fpletz added this to the 18.09 milestone Aug 13, 2018
@lheckemann
Member

This change seems to delay lightdm startup by about 10 seconds on my system, making time-to-usable desktop much worse. I'm not sure why — systemd-analyze reports display-manager as running after 3 seconds, but it doesn't actually display the greeter until dhcpcd has completed. This patch makes it start up much more rapidly:

diff --git a/nixos/modules/services/networking/dhcpcd.nix b/nixos/modules/services/networking/dhcpcd.nix
index efdbca5d52e..86c5adc8c2b 100644
--- a/nixos/modules/services/networking/dhcpcd.nix
+++ b/nixos/modules/services/networking/dhcpcd.nix
@@ -162,7 +162,7 @@ in

         wantedBy = [ "multi-user.target" ] ++ optional (!hasDefaultGatewaySet) "network-online.target";
         wants = [ "network.target" "systemd-udev-settle.service" ];
-        before = [ "network.target" ];
+        before = optional (!hasDefaultGatewaySet) "network-online.target";
         after = [ "systemd-udev-settle.service" ];

         # Stopping dhcpcd during a reconfiguration is undesirable

@lheckemann
Member

I've tested this with a number of display managers — gdm and sddm are the same as lightdm wrt being delayed by dhcpcd; slim starts up fast both with and without this change.

@Mic92
Member

Mic92 commented Oct 21, 2018

This problem sounds similar to systemd-udev-settle.service. If we allowed something like http://linux-ip.net/html/adv-nonlocal-bind.html then we would not need network connectivity at all for services that try to bind to IPs. And cloud-init should just be restarted more often instead.
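
A sketch of what enabling non-local binds could look like on NixOS, assuming the `boot.kernel.sysctl` option and the kernel's `ip_nonlocal_bind` sysctls. This is a system-wide setting, so it is shown as an illustration rather than a recommended default.

```nix
{
  boot.kernel.sysctl = {
    # Let processes bind to addresses that are not (yet) configured
    # on any local interface.
    "net.ipv4.ip_nonlocal_bind" = 1;
    "net.ipv6.ip_nonlocal_bind" = 1;
  };
}
```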

@lheckemann
Member

This also affects nixos tests, adding 10s to each test run (!!)

@cprussin

FWIW, Arch Linux does order dhcpcd.service before network.target, but it also uses the -b flag for dhcpcd, which causes it to fork to the background immediately and not wait for a lease (NixOS uses -w, which forks the process after the lease is acquired). So while Arch has that unit ordering, it does not enforce that the network is up before reaching network.target (or for that matter, I would expect, network-online.target).

I don't think -b is the right fix: using it means we essentially can't reliably ensure that network-online.target actually means the network is online. I think that ordering dhcpcd.service before network-online.target and after network.target would be the correct solution.
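
As a rough sketch, that ordering could be tried locally with an override like the one below. It assumes the module system's `lib.mkForce` is needed to replace the `before` list set by the dhcpcd module, and it is untested against the rest of the unit.

```nix
{ lib, ... }:
{
  systemd.services.dhcpcd = {
    # Replace the module's `before` list: dhcpcd now only has to
    # finish before network-online.target is reached.
    before = lib.mkForce [ "network-online.target" ];
    # Added to the unit's existing `after` list by normal list merging.
    after = [ "network.target" ];
  };
}
```

Note that this brings back the shutdown behaviour discussed earlier in the thread, since dhcpcd would again be stopped before network.target is.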

@iclanzan
Contributor

iclanzan commented Jan 4, 2019

This little change has given me a lot of headaches since upgrading to NixOS 18.09. It increased the boot time to a usable desktop by 10s, effectively doubling my laptop's boot time. But the biggest headaches came from the fact that with no internet connection present, the boot time increases by 30+ seconds and all sorts of services malfunction or fail to start, such as displayManager.sessionCommands and compton, while commands such as poweroff and reboot just throw errors instead of doing what they are supposed to.
