ACME renewals fail due to DNS being unavailable during switch #85794

Closed
arianvp opened this issue Apr 22, 2020 · 16 comments
Labels
0.kind: bug 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: nixos

Comments

@arianvp
Member

arianvp commented Apr 22, 2020

Describe the bug
I nixos-rebuild switch'd on my server with a bunch of ACME certs. The renewals failed because DNS was not available.

To Reproduce

nixos-rebuild switch
updating GRUB 2 menu...
installing the GRUB 2 boot loader on /dev/vda...
Installing for i386-pc platform.
Installation finished. No error reported.
stopping the following units: acme-arianvp.me.timer, acme-techstock.photos.timer, audit.service, bitwarden_rs.service, grafana.service, kmod-static-nodes.service, network-local-commands.service, network-setup.service, nscd.service, prometheus-node-exporter.service, prometheus.service, systemd-binfmt.service, systemd-machined.service, systemd-modules-load.service, systemd-networkd-wait-online.service, systemd-networkd.service, systemd-nspawn@test1.service, systemd-nspawn@test2.service, systemd-resolved.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service, weechat.service
NOT restarting the following changed units: getty@tty1.service, serial-getty@ttyS0.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-tmpfiles-setup.service, systemd-udev-settle.service, systemd-update-utmp.service, systemd-user-sessions.service, user-runtime-dir@0.service, user@0.service
activating the configuration...
setting up /etc...
removing obsolete symlink ‘/etc/pam.d/cups’...
removing obsolete symlink ‘/etc/pam.d/ftp’...
restarting systemd...
reloading user units for root...
setting up tmpfiles
reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-kernel-debug.mount
restarting the following units: nginx.service, sshd.service, systemd-journald.service
starting the following units: acme-arianvp.me.timer, acme-techstock.photos.timer, audit.service, bitwarden_rs.service, grafana.service, kmod-static-nodes.service, network-local-commands.service, network-setup.service, nscd.service, prometheus-node-exporter.service, prometheus.service, systemd-binfmt.service, systemd-machined.service, systemd-modules-load.service, systemd-networkd-wait-online.service, systemd-networkd.service, systemd-nspawn@test1.service, systemd-nspawn@test2.service, systemd-resolved.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, weechat.service
warning: the following units failed: acme-arianvp.me.service, acme-techstock.photos.service

● acme-techstock.photos.service - Renew ACME Certificate for techstock.photos
   Loaded: loaded (/nix/store/5jy70nlaasq8kza150ksnmwpagml2cwp-unit-acme-techstock.photos.service/acme-techstock.photos.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2020-04-22 19:04:17 CEST; 1s ago
  Process: 23832 ExecStart=/nix/store/5idg7qnmvh9xiklfwml7d65sdwnxp78i-acme-start (code=exited, status=1/FAILURE)
 Main PID: 23832 (code=exited, status=1/FAILURE)
       IP: 0B in, 0B out
      CPU: 50ms

Apr 22 19:04:17 arianvp.me systemd[1]: Starting Renew ACME Certificate for techstock.photos...
Apr 22 19:04:17 arianvp.me 5idg7qnmvh9xiklfwml7d65sdwnxp78i-acme-start[23832]: 2020/04/22 19:04:17 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: no such host
Apr 22 19:04:17 arianvp.me 5idg7qnmvh9xiklfwml7d65sdwnxp78i-acme-start[23832]: 2020/04/22 19:04:17 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: no such host
Apr 22 19:04:17 arianvp.me systemd[1]: acme-techstock.photos.service: Main process exited, code=exited, status=1/FAILURE
Apr 22 19:04:17 arianvp.me systemd[1]: acme-techstock.photos.service: Failed with result 'exit-code'.
Apr 22 19:04:17 arianvp.me systemd[1]: Failed to start Renew ACME Certificate for techstock.photos.

● acme-arianvp.me.service - Renew ACME Certificate for arianvp.me
   Loaded: loaded (/nix/store/1qsrigif18i8m5nqppjynya3zdkdd3xm-unit-acme-arianvp.me.service/acme-arianvp.me.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2020-04-22 19:04:14 CEST; 4s ago
  Process: 23740 ExecStart=/nix/store/kkbx0l7z1fzsfy1sivygjzplhm93gzsh-acme-start (code=exited, status=1/FAILURE)
 Main PID: 23740 (code=exited, status=1/FAILURE)
       IP: 0B in, 1.3K out
      CPU: 52ms

Apr 22 19:04:14 arianvp.me systemd[1]: Starting Renew ACME Certificate for arianvp.me...
Apr 22 19:04:14 arianvp.me kkbx0l7z1fzsfy1sivygjzplhm93gzsh-acme-start[23740]: 2020/04/22 19:04:14 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: device or resource busy
Apr 22 19:04:14 arianvp.me kkbx0l7z1fzsfy1sivygjzplhm93gzsh-acme-start[23740]: 2020/04/22 19:04:14 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: device or resource busy
Apr 22 19:04:14 arianvp.me systemd[1]: acme-arianvp.me.service: Main process exited, code=exited, status=1/FAILURE
Apr 22 19:04:14 arianvp.me systemd[1]: acme-arianvp.me.service: Failed with result 'exit-code'.
Apr 22 19:04:14 arianvp.me systemd[1]: Failed to start Renew ACME Certificate for arianvp.me.
Apr 22 19:04:14 arianvp.me systemd[1]: acme-arianvp.me.service: Consumed 52ms CPU time, received 0B IP traffic, sent 1.3K IP traffic.

[arian@t490s:~/Projects/nixos-stuff]$ 


Expected behavior
ACME Certs renew as expected

@arianvp
Member Author

arianvp commented Apr 22, 2020

My gut feeling is that this is because I'm using systemd-resolved and dbus.service was reloading.

@arianvp changed the title from "Switching from 19.09 to 20.03 fails ACME renewals due to network being restarted" to "Switching from 19.09 to 20.03 fails ACME renewals due to DNS being unavailable during switch" on Apr 22, 2020
@worldofpeace
Contributor

I don't think dbus.service should reload?

@arianvp
Member Author

arianvp commented Apr 22, 2020

Relatedly, it tried to restart sshd.service while I was connected, and it printed:

Apr 22 19:04:13 arianvp.me systemd[1]: Stopping SSH Daemon...
Apr 22 19:04:13 arianvp.me systemd[1]: Stopping Flush Journal to Persistent Storage...
Apr 22 19:04:13 arianvp.me systemd[1]: sshd.service: Succeeded.
Apr 22 19:04:13 arianvp.me systemd[1]: Stopped SSH Daemon.
Apr 22 19:04:13 arianvp.me systemd[1]: sshd.service: Consumed 57min 50.309s CPU time, received 2.8G IP traffic, sent 782.9M IP traffic.
Apr 22 19:04:13 arianvp.me systemd[1]: sshd.service: Found left-over process 23396 (sshd) in control group while starting unit. Ignoring.
Apr 22 19:04:13 arianvp.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 22 19:04:13 arianvp.me systemd[1]: sshd.service: Found left-over process 23397 (sshd) in control group while starting unit. Ignoring.
Apr 22 19:04:13 arianvp.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 22 19:04:13 arianvp.me systemd[1]: sshd.service: Found left-over process 23398 (sshd) in control group while starting unit. Ignoring.
Apr 22 19:04:13 arianvp.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 22 19:04:13 arianvp.me systemd[1]: Starting SSH Daemon...

That is useful, because it didn't disconnect me while I was applying remotely, but it also sounds like a bug in our service file?

@arianvp
Member Author

arianvp commented Apr 22, 2020

Ah no, the problem is that it also stopped and started systemd-resolved.service. That explains it, haha.

Usually you'd solve this ordering problem by ordering units After=network.target or After=nss-lookup.target (which we already have for our ACME cert services!). But that only orders things correctly during a reboot. How to handle this kind of thing in a nixos-rebuild switch, I'm not sure. It sounds complicated to pull off correctly.
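
To see which ordering dependencies a unit actually carries, `systemctl show -p After UNIT` prints its `After=` property. A minimal sketch of checking that line for a given target follows; the helper name `unit_ordered_after` is made up for illustration, and the unit name in the usage comment is taken from the log above. It only inspects ordering and says nothing about whether the target stays satisfied during a switch.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given `After=` property line
# (as printed by `systemctl show -p After UNIT`) lists TARGET.
unit_ordered_after() {
    after_line=$1   # e.g. "After=network.target nss-lookup.target"
    target=$2
    # Strip the "After=" prefix, then compare each space-separated unit name.
    for unit in ${after_line#After=}; do
        if [ "$unit" = "$target" ]; then
            return 0
        fi
    done
    return 1
}

# Usage on a live system:
#   unit_ordered_after "$(systemctl show -p After acme-arianvp.me.service)" nss-lookup.target
```
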

When nixos-rebuild switch'ing in place from one NixOS version to another, I think we cannot guarantee the absence of this kind of bug. Given the problem fixed itself after rebooting, perhaps this shouldn't be marked as a bug and I should just close it? ACME isn't the only thing experiencing this kind of issue. It's inherent to how our activation logic works...

What do you think @worldofpeace ?

@arianvp
Member Author

arianvp commented Apr 22, 2020

I am rather confused. AFAIK systemd-resolved should be DBUS-activated, so once it is stop'd requests should be buffered to it and it should be started on demand... but that isn't happening:

[root@arianvp:~]# systemctl stop systemd-resolved

[root@arianvp:~]# ping google.com
ping: google.com: Name or service not known

[root@arianvp:~]# systemctl start systemd-resolved

[root@arianvp:~]# ping google.com
PING google.com (74.125.128.101) 56(84) bytes of data.
64 bytes from ec-in-f101.1e100.net (74.125.128.101): icmp_seq=1 ttl=45 time=4.42 ms


@flokli
Contributor

flokli commented Apr 22, 2020

As per discussion, this isn't a regression, but so far we don't ship the dbus-activated units that systemd ships, causing resolved to not auto-activate. We'll create a follow-up issue to enable it on unstable.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/make-acme-renew-systemd-service-depend-on-dns-nss-lookup/7412/6

@nh2
Contributor

nh2 commented Oct 20, 2020

This seems to be fixed on 20.09.

On 20.03, when using simple-nixos-mailserver (which installs kresd, the knot-resolver), I got the

Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: device or resource busy

and I couldn't curl https://acme-v02.api.letsencrypt.org/directory there either.

According to https://logs.nix.samueldr.com/nixos/2020-02-23#3099629, @infinisil had a very similar problem with a server that ran its own DNS server.

For me the upgrade to 20.09 on the same machine fixes it.

But I'm not sure which concrete nixpkgs change fixed it.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/make-acme-renew-systemd-service-depend-on-dns-nss-lookup/7412/7

@infinisil
Member

@nh2 Probably #99901

@andir
Member

andir commented Nov 3, 2020

I'm reopening this issue as it wasn't entirely solved by #99901. I've been looking into it further and the only real solution to verify that DNS is working during startup would be to push it into a preStart script or to proceed with something like this: go-acme/lego#1280
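
As a rough illustration of the preStart idea, the sketch below polls until a host name resolves or a bounded number of attempts is used up. The function name `wait_for_dns`, the `RESOLVE_CMD` override, and the attempt and delay defaults are all assumptions made for this example; it is not what the ACME module or lego does.

```shell
#!/bin/sh
# Hypothetical preStart-style helper: poll until HOST resolves or give up.
# RESOLVE_CMD is an assumption added for testability; it defaults to
# `getent hosts`, which resolves via the system's NSS configuration.
wait_for_dns() {
    host=$1
    attempts=${2:-10}   # how many lookups to try
    delay=${3:-1}       # seconds to sleep between lookups
    i=1
    while [ "$i" -le "$attempts" ]; do
        if ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "DNS lookup for $host still failing after $attempts attempts" >&2
    return 1
}

# Usage:
#   wait_for_dns acme-v02.api.letsencrypt.org 10 1 || exit 1
```

The attempt limit keeps a switch from hanging indefinitely when DNS really is down. The lookups themselves are local and do not touch the ACME API.
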

@andir andir reopened this Nov 3, 2020
@roberth
Member

roberth commented Feb 11, 2021

Switching to a socket-activated DNS helps but does not eliminate errors in the acme units entirely.
Another failure mode is

compute2.......................> Feb 11 09:08:24 compute2 acme-example.com-start[26893]: 2021/02/11 09:08:24 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": net/http: timeout awaiting response headers

which makes me wonder whether this unit should retry itself. Perhaps not, because of letsencrypt's rate limits. It runs on a systemd timer anyway.

So perhaps the solution is to avoid running the acme client during switch whenever possible.

One way we might do that is to duplicate the service unit, run the otherwise unmodified unit on a timer only, and change the new unit to run during activation only and short-circuit when a certificate is already present.

@roberth changed the title from "Switching from 19.09 to 20.03 fails ACME renewals due to DNS being unavailable during switch" to "Switching fails ACME renewals due to DNS being unavailable during switch" on Feb 11, 2021
@roberth changed the title from "Switching fails ACME renewals due to DNS being unavailable during switch" to "ACME renewals fail due to DNS being unavailable during switch" on Feb 11, 2021
@roberth
Member

roberth commented Mar 1, 2021

#114752 runs the expiration check offline, so we usually don't need network during switch.
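
For illustration only, an offline expiration check can be sketched with `openssl x509 -checkend`, which needs no network access. This is not necessarily how #114752 implements it; the function name `needs_renewal`, the 30-day default, and the certificate path in the usage comment are assumptions.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit 0) when the certificate at CERT should be
# renewed, i.e. it is missing, empty, unreadable, or expires within DAYS days.
needs_renewal() {
    cert=$1
    days=${2:-30}
    # No certificate yet: a first issuance is required.
    if [ ! -s "$cert" ]; then
        return 0
    fi
    # `openssl x509 -checkend N` exits 0 if the certificate is still valid
    # N seconds from now, and non-zero otherwise.
    if openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Usage (path is an assumption, adjust to the actual certificate location):
#   needs_renewal /var/lib/acme/example.com/cert.pem 30 && echo "renewal needed"
```

With a check like this, the ACME client only has to run during a switch when a certificate is missing or close to expiry, which is when a DNS outage would still matter.
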

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/configure-acme-to-retry-challenge-multiple-times/12118/4

@stale

stale bot commented Sep 21, 2021

I marked this as stale due to inactivity.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Sep 21, 2021
mweinelt pushed a commit that referenced this issue Dec 27, 2021
Closes #129838

It is possible for the CA to revoke a cert that has not yet
expired. We must run lego to validate this before expiration,
but we must still ignore failures on unexpired certs to retain
compatibility with #85794

Also changed domainHash logic such that a renewal will only
be attempted at all if domains are unchanged, and do a full
run otherwise. Resolves #147540 but will be partially
reverted when go-acme/lego#1532 is resolved + available.
@rnhmjoj
Contributor

rnhmjoj commented Mar 16, 2022

Should have been fixed by #147784.

@rnhmjoj rnhmjoj closed this as completed Mar 16, 2022
10 participants