Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow boot due to dhcpcd (tracking issue) #60900

Closed
gloaming opened this issue May 3, 2019 · 20 comments · Fixed by #64621
Closed

Slow boot due to dhcpcd (tracking issue) #60900

gloaming opened this issue May 3, 2019 · 20 comments · Fixed by #64621

Comments

@gloaming
Copy link
Contributor

gloaming commented May 3, 2019

Issue description

At boot, the login window must wait for /run/nologin to be removed (see pam_nologin(8)) .
This happens when systemd launches systemd-user-sessions.service, which removes /run/nologin when it is started, and creates it when it is stopped.

Under the default NixOS configuration, there is a significant delay in this process due to systemd waiting for dhcpcd.service to fork. The root cause is a gross hack in upstream systemd, which seeks to provide clean termination of inbound SSH sessions by ordering systemd-user-sessions.service after network.target:
systemd/systemd@8c85680 fixes systemd/systemd#2390
This was done so that at shutdown, systemd will kill the SSH sessions (by stopping systemd-user-sessions.service) before bringing down the network.

The inevitable problem is that now, local logins must wait at boot for network.target, which is silly. It wouldn't matter much, except that NixOS has dhcpcd.service configured to block network.target instead of just network-online.target: #44524
It appears that this issue has gone largely unnoticed because other distros either don't use dhcpcd, use it in background mode, ignore the /run/nologin mechanism (which appears to be something of a relic - I can't find much discussion about it), or have dhcpcd.service blocking only network-online.target, in which case login will be blocked only by a few service initializations that don't take very long.

Workarounds:

Ordered best to worst, in my opinion. Any of these should be OK for home use.

  1. systemd.services.systemd-user-sessions.enable = false; (after rebuild do rm /run/nologin)
  2. Use systemd-networkd instead:
    networking.dhcpcd.enable = false;
    systemd.network.enable = true;
    systemd.network.networks = { internet0 = { matchConfig = { Name = "<device>"; }; networkConfig = { DHCP = "ipv4"; }; }; }; (where <device> is eth0 or whatever)
  3. Make dhcpcd time out fast, and unplug seatbelts:
    networking.dhcpcd.extraConfig = "timeout 1\nnoarp\nnodelay";
    Seems to work when multiple interfaces are available. Man page says If dhcpcd is working on a single interface then dhcpcd will exit when a timeout occurs. I don't know what will happen in that case.
  4. Make dhcpcd fork to background immediately
    systemd.services.dhcpcd.serviceConfig.ExecStart = pkgs.stdenv.lib.mkForce "@${pkgs.dhcpcd}/sbin/dhcpcd dhcpcd -b --quiet --config /nix/store/<...>-dhcpcd.conf";
    The nixos module doesn't export the generated conf path, so it has to be found manually (ps ax | grep dhcp) and kept updated. Or replace it with a separate one.
  5. Disable pam_nologin.so (how?) - sounds like a Bad Idea
  6. Not recommended: attempting to reconfigure the ordering. It "works", but renders network.target inactive, presumably due systemd barfing on a cyclic dependency.

Suggested solution for NixOS:

  • Override the ordering dependency somehow

See also #50930, #58526

Steps to reproduce

Boot a NixOS machine with networking and dhcp(cd) enabled, but no network connection available.

Expected behavior

The machine should boot and present a login window just as quickly as with the network available.

Actual behavior

User is presented with a black screen for up to 30 seconds before login window appears.

@FRidh FRidh added this to the 19.09 milestone May 4, 2019
@gloaming
Copy link
Contributor Author

gloaming commented May 6, 2019

Actually, it seems this has been reported by Arch users before:
systemd/systemd#4010

gloaming added a commit to gloaming/systemd that referenced this issue May 6, 2019
This reverts commit 8c85680.

This ordering caused issues with certain downstream configurations,
see NixOS/nixpkgs#60900
gloaming added a commit to gloaming/systemd that referenced this issue May 6, 2019
This reverts commit 8c85680.

This ordering caused issues with certain downstream configurations,
see NixOS/nixpkgs#60900
gloaming added a commit to gloaming/systemd that referenced this issue Jun 19, 2019
This reverts commit 8c85680.

This ordering caused issues with certain downstream configurations,
see NixOS/nixpkgs#60900
@Mic92
Copy link
Member

Mic92 commented Jun 19, 2019

cc @fpletz @andir

gloaming added a commit to gloaming/nixpkgs that referenced this issue Jul 11, 2019
Instead of network.target. Fixes NixOS#60900 (delayed boot).
coreyoconnor pushed a commit to coreyoconnor/nixpkgs that referenced this issue Jul 12, 2019
Instead of network.target. Fixes NixOS#60900 (delayed boot).
eadwu added a commit to eadwu/nixos-configuration that referenced this issue Apr 10, 2020
@NilsIrl
Copy link
Member

NilsIrl commented Jul 24, 2020

I don't think this has been fixed. Many people seem to describe slow boot times due to dhcpcd in #80871 and I personally still have this issue.

@Mic92
Copy link
Member

Mic92 commented Jul 24, 2020

It should no longer because of network.target, what is your biggest time consumer in systemd-analyze plot > /tmp/services.svg?

@NilsIrl
Copy link
Member

NilsIrl commented Jul 24, 2020

If I run systemd-analyze blame, dhcpcd.service, is the top service. Same with the plot.

#64621 (which closed this issue), just changed dhcpcd to be before network-online instead of network which in the end still means that we have to wait for dhcpcd to start.

@Mic92
Copy link
Member

Mic92 commented Jul 24, 2020

@NilsIrl Ok. But does your multi-user.target depend on it? dhcpcd beeing slow should not be a problem if nothing else depends on it.

@NilsIrl
Copy link
Member

NilsIrl commented Jul 24, 2020

But does your multi-user.target depend on it?

Yes multi-user.target depends on network-online.target. I am not quite sure why that is the case though.

$ systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

multi-user.target @11.485s
└─network-online.target @11.485s
  └─dhcpcd.service @3.216s +8.268s
      └─basic.target @3.211s
...

@Mic92
Copy link
Member

Mic92 commented Jul 24, 2020

I tracked this down to 15d761a from @flokli

However wantedBy should only trigger the target and not wait for it, no?

@Mic92 Mic92 reopened this Jul 24, 2020
@NilsIrl
Copy link
Member

NilsIrl commented Jul 24, 2020

I tracked this down to 15d761a from @flokli

However wantedBy should only trigger the target and not wait for it, no?

Yes (AFAIK). Before my last comment, I also looked around for the source of this but because all I could find where wantedBy "things", I thought it must be related to something else.

But the commit 15d761a reverts, 764c820, mentions decrease in boot time, so I guess it does wait.

@Mic92
Copy link
Member

Mic92 commented Jul 24, 2020

@NilsIrl
Copy link
Member

NilsIrl commented Jul 24, 2020

#86484 should fix this (when it is finished).

@Mic92
Copy link
Member

Mic92 commented Jul 24, 2020

I suppose multi-user.target is only marked as done when the online is and the graphical.target requires the multi-user.target.

@Mic92
Copy link
Member

Mic92 commented Jul 29, 2020

@NilsIrl the PR was closed but you can re-open it and do #86484 (comment)

@gloaming
Copy link
Contributor Author

gloaming commented Aug 6, 2020

@NilsIrl @Mic92

If multi-user.target does not pull in network-online.target, then you're telling systemd not to bother bringing up the network. I doubt that's what you want.

However wantedBy should only trigger the target and not wait for it, no?
I suppose multi-user.target is only marked as done when the online is and the graphical.target requires the multi-user.target.

Correct on both counts :)

The output of systemd-analyze blame isn't diagnostic here. It shows the complete time taken to reach default.target, but the system can be up and usable minutes before that. Reaching default.target means that all desired services are running, not just those that are required to use the machine.

You could try looking at the output of systemd-analyze critical-chain display-manager.service and systemd-analyze critical-chain systemd-user-sessions.service. But I note that #80871 points the finger at rngd. Have you already ruled that out?

@Mic92
Copy link
Member

Mic92 commented Aug 7, 2020

Thanks for clarification.

@jbaum98
Copy link
Contributor

jbaum98 commented Aug 26, 2020

I'm experiencing an issue like this, but I'm having trouble following the history a little bit. My understanding is that dhcpd used to delay network.target when it should have been delaying network-online.target, but that this was fixed in #64621.

I don't use systemd to launch my window manager, so I'm investigateding systemd-user-sessions.service. Based on systemd-analyze critical-chain systemd-user-sessions.service, I think the issue is that network-addresses-wlan0.service is blocking network.target when, like dhcpd, it should be blocking network-online.target.

systemd-user-sessions.service +30ms
└─network.target @1min 31.201s
  └─network-local-commands.service @1min 31.165s +22ms
    └─network-setup.service @1min 30.895s +238ms
      └─network-addresses-wlan0.service @6.823s +94ms
        └─network-pre.target @6.799s
          └─resolvconf.service @6.529s +256ms
            └─basic.target @6.394s
              └─sockets.target @6.378s
                └─nix-daemon.socket @6.376s
                  └─sysinit.target @6.304s
                    └─systemd-timesyncd.service @5.992s +309ms
                      └─systemd-tmpfiles-setup.service @5.882s +84ms
                        └─local-fs.target @5.859s
                          └─home.mount @5.430s +412ms
                            └─local-fs-pre.target @5.403s
                              └─systemd-tmpfiles-setup-dev.service @955ms +75ms
                                └─kmod-static-nodes.service @821ms +101ms
                                  └─systemd-journald.socket @758ms
                                    └─system.slice @558ms
                                      └─-.slice @558ms

So the actual issue is that network-setup.service has to run before network.target, but this should be changed to network-online.target?

@Mic92
Copy link
Member

Mic92 commented Aug 26, 2020

Those services come from: modules/tasks/network-interfaces-scripted.nix.
I think they just setup static network interfaces, which is usually fine for network.target and should just wait for local network interfaces to be present rather than waiting for external dhcp requests. Also see: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

@jbaum98
Copy link
Contributor

jbaum98 commented Aug 26, 2020

Sorry for the noise, I realized the issue was I had a line interfaces.eth0.useDHCP = true; even though this is a laptop and only has eth0 when ethernet is plugged in. So it waits for eth0 to appear and times out on every boot. Removing that fixed the problem.

@stale
Copy link

stale bot commented Feb 23, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Feb 23, 2021
@veprbl veprbl removed this from the 19.09 milestone May 31, 2021
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label May 31, 2021
curiousleo added a commit to curiousleo/setup that referenced this issue Oct 18, 2021
@lf-
Copy link
Member

lf- commented Jan 20, 2024

Another attempt to get rid of the patch is here: #258680, and I expect this one will not be reverted.

I think we can close this, even though dhcpcd is itself still slow due to conflict detection (there might be an option to turn that off, but you could also set useNetworkd which also fixes this).

@lf- lf- closed this as completed Jan 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants