ACME renew fails if there is a local DNS server configured as the system resolver #106862

Open
m1cr0man opened this issue Dec 14, 2020 · 16 comments

Comments

@m1cr0man (Contributor)

Describe the bug
During a nixos-rebuild, ACME renewal can fail because the enabled local DNS server (confirmed with bind and dnsmasq) is not ready to serve requests.

To Reproduce
Steps to reproduce the behavior:

  1. Enable bind and an ACME cert on your server (a minimal config sketch follows this list)
  2. Perform a nixos-rebuild
  3. The acme renew service usually fails
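
For concreteness, a minimal configuration that reproduces this could look roughly like the following. This is an untested sketch: example.org, the email address, and the nginx vhost are placeholders, and the option names match the 21.03-era modules discussed in this issue.

{ ... }:
{
  # Local DNS server that is also used as the system resolver.
  services.bind.enable = true;
  networking.nameservers = [ "127.0.0.1" ];

  # Any ACME cert will do; here one is requested via an nginx virtual host.
  security.acme.acceptTerms = true;
  security.acme.email = "admin@example.org";
  services.nginx = {
    enable = true;
    virtualHosts."example.org" = {
      enableACME = true;
      forceSSL = true;
    };
  };
}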

Expected behavior
The acme renew service should not fail.

The solution here would require a generalised way to reliably wait for the DNS server to be online. Changes need to be made in the DNS server modules themselves rather than in the ACME module, even though the symptom appears as an ACME failure.

Screenshots
See this renew service output

Notify maintainers
@nixos/acme

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 5.4.83, NixOS, 21.03.git.16f80f8221cM (Okapi)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.9`
 - channels(root): `"nixos-21.03pre257780.e9158eca70a"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute: lego
# a list of nixos modules affected by the problem
module: security.acme
@m1cr0man (Contributor, Author)

A quick but hacky solution, for whoever may arrive here, is to simply add a dependency on your DNS server to acme-fixperms.service, like so:

systemd.services."acme-fixperms".wants = [ "bind.service" ];
systemd.services."acme-fixperms".after = [ "bind.service" ];

This works because all cert renewals already depend on acme-fixperms.service.

@arianvp (Member) commented Dec 14, 2020

This is another case where our nixos-rebuild switch abstraction breaks. cc @NixOS/systemd

We really need to start taking indirect dependencies into account, I think. These bugs keep popping up; this is just another variant of #105354 and #106336.

We already fixed this during bootup with #99901, but nixos-rebuild switch doesn't take these dependencies into account at all, which I consider a bug in the activation logic, to be honest.

Switching to a socket-activated DNS server can fix this problem. E.g. #101218 was introduced to fix the issue mentioned here, because for socket-activated units the dependencies are automagic.

I would really like us to come up with a more generic solution.
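
For illustration, a socket-activated setup would look roughly like this: the socket unit is bound very early (at sockets.target), so queries simply queue on it until the DNS service has started. In this sketch, mydns is a hypothetical daemon that supports LISTEN_FDS socket activation; none of this reflects how the current nixpkgs DNS modules are wired up.

{ pkgs, ... }:
{
  systemd.sockets.mydns = {
    wantedBy = [ "sockets.target" ];
    socketConfig = {
      # Bound at boot; queries queue here until mydns.service is up.
      ListenDatagram = "127.0.0.1:53";
      ListenStream = "127.0.0.1:53";
    };
  };

  systemd.services.mydns = {
    # Started on demand when the first query arrives on the socket;
    # the (hypothetical) daemon accepts the socket via LISTEN_FDS.
    serviceConfig.ExecStart = "${pkgs.mydns}/bin/mydns --use-systemd-socket";
  };
}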

@spacekookie (Member)

@m1cr0man The fix you suggested doesn't work on my system :(

@mweinelt (Member) commented Dec 22, 2020

A quick but hacky solution, for whoever may arrive here, is to simply add a dependency on your DNS server to acme-fixperms.service, like so:

systemd.services."acme-fixperms".wants = [ "bind.service" ];
systemd.services."acme-fixperms".after = [ "bind.service" ];

This works because all cert renewals already depend on acme-fixperms.service.

I don't believe a fix like that can actually work, since when you do systemctl restart acme-fqdn.service unbound.service, AFAIK no dependency resolution happens.

(Which basically repeats what Arian said in #106862 (comment))

@nh2 (Contributor) commented Jun 6, 2021

I can confirm this issue. Here's what the failure looks like when the NixOS config for Matrix's Element uses ACME to update TLS certs and services.bind.enable = true; is newly enabled:

starting the following units: bind.service, network-setup.service
A dependency job for acme-finished-element.nh2.me.target failed. See 'journalctl -xe' for details.
A dependency job for acme-account-69ee2edeb33d810e73af.target failed. See 'journalctl -xe' for details.
A dependency job for acme-finished-webmail.nh2.me.target failed. See 'journalctl -xe' for details.
A dependency job for acme-finished-mail.nh2.me.target failed. See 'journalctl -xe' for details.
A dependency job for acme-finished-nh2.me.target failed. See 'journalctl -xe' for details.
warning: the following units failed: acme-element.nh2.me.service

● acme-element.nh2.me.service - Renew ACME certificate for element.nh2.me
     Loaded: loaded (/nix/store/jrwccfdkx7ci0q652vskbhnm9ks435fv-unit-acme-element.nh2.me.service/acme-element.nh2.me.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2021-06-06 18:46:54 CEST; 206ms ago
TriggeredBy: ● acme-element.nh2.me.timer
    Process: 27968 ExecStart=/nix/store/cqq25wmmsadr63xhl5hhswpa4xm95ymc-unit-script-acme-element.nh2.me-start/bin/acme-element.nh2.me-start (code=exited, status=1/FAILURE)
   Main PID: 27968 (code=exited, status=1/FAILURE)
         IP: 0B in, 0B out
        CPU: 53ms

Jun 06 18:46:54 nh2me acme-element.nh2.me-start[27968]: + echo ffffffffffffffffffff
Jun 06 18:46:54 nh2me acme-element.nh2.me-start[27972]: ++ ls -1 accounts
Jun 06 18:46:54 nh2me acme-element.nh2.me-start[27968]: + '[' -e certificates/element.nh2.me.key -a -e certificates/element.nh2.me.crt -a -n acme-v02.api.letsencrypt.org ']'
Jun 06 18:46:54 nh2me acme-element.nh2.me-start[27968]: + '[' -e certificates/domainhash.txt ']'
Jun 06 18:46:54 nh2me acme-element.nh2.me-start[27968]: + cmp -s domainhash.txt certificates/domainhash.txt
Jun 06 18:46:54 nh2me acme-element.nh2.me-start[27968]: + lego --accept-tos --path . -d element.nh2.me --email mail@nh2.me --key-type ec256 --http --http.webroot /var/lib/acme/acme-challenge renew --reuse-key --days 30
Jun 06 18:46:54 nh2me acme-element.nh2.me-start[27976]: 2021/06/06 18:46:54 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: no such host
Jun 06 18:46:54 nh2me systemd[1]: acme-element.nh2.me.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 18:46:54 nh2me systemd[1]: acme-element.nh2.me.service: Failed with result 'exit-code'.
Jun 06 18:46:54 nh2me systemd[1]: Failed to start Renew ACME certificate for element.nh2.me.
warning: error(s) occurred while switching to the new configuration

The system recovers shortly afterwards, when the systemd .timer fires again to renew the cert, but it should really work the first time.

@nh2 (Contributor) commented Jun 6, 2021

This works because all cert renewals already depend on acme-fixperms.service.

@m1cr0man Is that still the case? I just got this problem and see:

# systemctl status acme-fixperms
● acme-fixperms.service - Fix owner and group of all ACME certificates
     Loaded: loaded (/nix/store/qsr142pp87mxjprx8xh767dlz6c36di5-unit-acme-fixperms.service/acme-fixperms.service; linked; vendor preset: enabled)
     Active: active (exited) since Thu 2021-05-20 04:41:58 CEST; 2 weeks 3 days ago

2 weeks 3 days ago.

So acme-fixperms.service doesn't seem to be a dependency of this activation.

@m1cr0man (Contributor, Author) commented Jun 6, 2021

This works because all cert renewals already depend on acme-fixperms.service.

@m1cr0man Is that still the case? I just got this problem and see:

It is still a dependency of all renewals, as shown by systemctl list-dependencies:

acme-m1cr0man.com.service
....
● ├─acme-fixperms.service

However, @mweinelt was right that what I suggested doesn't work. acme-fixperms.service is a oneshot with RemainAfterExit set to true. If you make a config change that restarts your DNS server and triggers a renewal, like in your example, it won't re-trigger acme-fixperms, so it won't wait for the DNS service. You would need to add the dependencies to each renewal service instead.
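
For a single certificate, that looks roughly like this (acme-example.com and bind.service are placeholders for your cert's renewal unit and your DNS server's unit):

# Placeholder names; adjust to your cert and DNS server.
systemd.services."acme-example.com" = {
  wants = [ "bind.service" ];
  after = [ "bind.service" ];
};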

@nh2 (Contributor) commented Jun 6, 2021

You would need to add the dependencies to each renewal service instead.

Do you know what's the best way (nix expression) to do that programmatically?

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/make-acme-renew-systemd-service-depend-on-dns-nss-lookup/7412/12

@m1cr0man (Contributor, Author)

You would need to add the dependencies to each renewal service instead.

Do you know what's the best way (nix expression) to do that programmatically?

Sorry - I thought I had replied already!

Yeah, you can do this quite easily with a bit of mapping. Put this in a file, e.g. acme-dns-fix.nix, and add it to the imports in configuration.nix:

{ config, lib, ... }:
{
  # Make every ACME renewal service wait for the DNS server.
  systemd.services = let
    dependency = [ "bind.service" ];
  in lib.mapAttrs' (name: _: lib.nameValuePair "acme-${name}" {
    requires = dependency;
    after = dependency;
  }) config.security.acme.certs;
}
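
For reference, wiring the file in would then look roughly like this (assuming it sits next to configuration.nix):

{
  imports = [ ./acme-dns-fix.nix ];
}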

@nh2 (Contributor) commented Jun 14, 2021

Seems to work!

I added mkIf config.services.bind.enable to make it conditional on whether bind is enabled:

{
  # Module that fixes LetsEncrypt renewals failing in a startup race with `bind`.
  # From: https://github.com/NixOS/nixpkgs/issues/106862#issuecomment-860192745
  acmeBindRaceFixModule = { config, lib, ... }: lib.mkIf config.services.bind.enable {
    systemd.services =
      let
        dependency = [ "bind.service" ];
      in
        lib.mapAttrs'
          (name: _: lib.nameValuePair "acme-${name}" {
            requires = dependency;
            after = dependency;
          })
          config.security.acme.certs;
          config.security.acme.certs;
  };
}

and then using it:

{
  imports = [
    acmeBindRaceFixModule
  ];
}

The race and errors are gone that way.

@m1cr0man Should we just do that by default in the ACME modules, for all name server modules in nixpkgs?

@sbourdeauducq (Contributor) commented Jun 14, 2021

Isn't there the same problem with unbound, dnsmasq, and other DNS servers?

@nh2 (Contributor) commented Jun 14, 2021

@sbourdeauducq Most likely yes, with

for all name server modules in nixpkgs

I meant that we'd write a mkIf for all of them, or introduce an option that defines which nameserver units to wait on, which each of the nameserver modules would set.
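
Roughly sketched, such an option could look like this (the name security.acme.dnsResolverUnits and the wiring are purely illustrative, not an existing interface):

{ config, lib, ... }:
{
  # Hypothetical option that local DNS server modules would append to.
  options.security.acme.dnsResolverUnits = lib.mkOption {
    type = lib.types.listOf lib.types.str;
    default = [ ];
    description = "systemd units of local resolvers that ACME renewals should wait for.";
  };

  config = {
    # A DNS module would contribute, e.g.:
    #   security.acme.dnsResolverUnits = [ "bind.service" ];

    # The ACME module would consume it for every renewal service:
    systemd.services = lib.mapAttrs'
      (name: _: lib.nameValuePair "acme-${name}" {
        wants = config.security.acme.dnsResolverUnits;
        after = config.security.acme.dnsResolverUnits;
      })
      config.security.acme.certs;
  };
}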

@sbourdeauducq (Contributor)

Also, a DNS server may not be the system resolver, or there can be several DNS servers. For example, a wifi/LAN internet gateway can have both unbound and dnsmasq running, with dnsmasq using unbound as the upstream resolver. Maybe it is better to let the user define which DNS server to wait on.

@m1cr0man (Contributor, Author)

@m1cr0man Should we just do that by default in the ACME modules, for all name server modules in nixpkgs?

The intention of this ticket was to find some way to do so. Adding the dependency to the renewal services isn't the problem; the problem is knowing what the system resolver is, as @sbourdeauducq points out.

I meant that we'd write a mkIf for all of them

I don't want the burden of maintaining and tracking that to fall on the ACME module, nor do I want to transparently create a dependency that DNS server module maintainers might not be aware of. We also aren't the only thing that suffers from this issue. Ideally, I want to put the solution in the DNS modules.

The simplest solution I've been able to think of is a local-resolver.target declared by all DNS server modules, which could be used in situations like this. It would avoid the need to maintain a list in ACME, and it could be used by other things.
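
Roughly sketched, that would mirror the network-online.target pattern (the wiring below is illustrative only):

{
  # Reached once the local resolver is up.
  systemd.targets.local-resolver = {
    description = "Local DNS resolver is ready";
  };

  # Each DNS server module would hook its own service in, e.g. bind:
  systemd.services.bind = {
    wantedBy = [ "local-resolver.target" ];
    before = [ "local-resolver.target" ];
  };

  # Consumers (here a renewal service; example.com is a placeholder)
  # would then only need to know about the target:
  systemd.services."acme-example.com" = {
    wants = [ "local-resolver.target" ];
    after = [ "local-resolver.target" ];
  };
}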

I don't know how @arianvp's suggestion of a socket-activated DNS server would work - mostly because I don't know enough about systemd sockets to understand how they could be used to wait on a local resolver, nor how much work it would be to implement per DNS server. It does sound like the best solution in principle.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/acme-fails-to-get-certificate-temporary-failure-in-name-resolution/25573/1
