Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acme service fails at step NAMESPACE spawning #103121

Closed
spacekookie opened this issue Nov 8, 2020 · 18 comments · Fixed by #106857
Closed

acme service fails at step NAMESPACE spawning #103121

spacekookie opened this issue Nov 8, 2020 · 18 comments · Fixed by #106857

Comments

@spacekookie
Copy link
Member

Describe the bug

When trying to run my acme service, I get this systemd error. My nixpkgs in currently following the unstable channel, on this commit: 34ad166a830d3ac1541dcce571c52231f2f0865a

security.acme.acceptTerms = true;
security.acme.certs."alarei.kookie.space" = {
  email = "letsencrypt@spacekookie.de";
  webroot = "/var/lib/acme/acme-challenge";
  extraDomainNames = [ 
    # .... more here
  ];
  group = "nginx";
};

A cat of the unit that gets generated

[Unit]
After=network.target network-online.target acme-fixperms.service nss-lookup.target acme-selfsigned-alarei.kookie.space.service
Description=Renew ACME certificate for alarei.kookie.space
Wants=network-online.target acme-fixperms.service acme-selfsigned-alarei.kookie.space.service

[Service]
Environment="LOCALE_ARCHIVE=/nix/store/kn1c5cbjnv7rmz6myrkf4iiifwsqgwsk-glibc-locales-2.32/lib/locale/locale-archive"
Environment="PATH=/nix/store/6mxv8qg3p1i40zyh2wg7hxyxc9wmmf8n-lego-4.0.1/bin:/nix/store/7fdm1ij6f7xwjmhpici1bh3xh08n0sp7-coreutils-8.31/bin:/nix/store/1avv8qnga55v6kyh9bvv72pf1nwhivv7-diffutils-3.7/bin:/nix/store/7fdm1ij6f7xwjmhpici1bh3xh08n0sp7-coreutils-8.31/bin:/nix/store/ny7vshs45d5k193ql3b51dak0hjwwqc2-findutils-4.7.0/bin:/nix/store/4crjhnvd8yhk063r7pwxpw6k>
Environment="TZDIR=/nix/store/bmrzi43cbz912y3g9q3iapyy7h5iz6qr-tzdata-2019c/share/zoneinfo"



BindPaths=/var/lib/acme/.lego/accounts/a08753dac6798f4adeab:/tmp/accounts /var/lib/acme/alarei.kookie.space:/tmp/out /var/lib/acme/.lego/alarei.kookie.space/549839e4ae8616de7ad5:/tmp/certificates 
ExecStart=/nix/store/v4yr3llvhm4cjlsmyyplhz5drm4q5s91-unit-script-acme-alarei.kookie.space-start/bin/acme-alarei.kookie.space-start 
ExecStartPost=+/nix/store/89zaac695xcpq76njkns5hp9sg616fmd-acme-postrun
Group=nginx
PrivateTmp=true
ProtectSystem=full
StateDirectory=acme/alarei.kookie.space acme/.lego/alarei.kookie.space acme/.lego/alarei.kookie.space/549839e4ae8616de7ad5
StateDirectoryMode=750
Type=oneshot
UMask=27
User=acme
WorkingDirectory=/tmp

The full log for the unit:

Nov 08 11:10:35 gaia systemd[1]: Starting Renew ACME certificate for alarei.kookie.space...
Nov 08 11:10:35 gaia systemd[3648]: acme-alarei.kookie.space.service: Failed to set up mount namespacing: /run/systemd/unit-root/tmp/accounts: No such file or directory
Nov 08 11:10:35 gaia systemd[3648]: acme-alarei.kookie.space.service: Failed at step NAMESPACE spawning /nix/store/v4yr3llvhm4cjlsmyyplhz5drm4q5s91-unit-script-acme-alarei.kookie.space-start/bin/acme-alarei.kookie.space-start: No such file or directory
Nov 08 11:10:35 gaia systemd[1]: acme-alarei.kookie.space.service: Main process exited, code=exited, status=226/NAMESPACE

Additional context

I don't know if this is an acme module issue, or if it's a general systemd issue that only manifests itself in this module. I'm also not sure if the accounts file it fails on has actually ever existed.

@spacekookie
Copy link
Member Author

spacekookie commented Nov 8, 2020

ping @NixOS/systemd (edit: actually @NixOS/acme might be more appropriate maybe)

@andir
Copy link
Member

andir commented Nov 8, 2020

Is this the first acme run (on this box / for this domain) or simply a new run on the new commit for already existing certificates?

@spacekookie
Copy link
Member Author

spacekookie commented Nov 8, 2020

The certificates already existed. But I think this might have been the first run since switching to the new module.

Edit: I have since gotten working certs again by running the script manually (with some hacks). But the unit is still broken and will fail to renew my certs in the future.

@ghost
Copy link

ghost commented Nov 8, 2020

If you still have state from the old module in /var/lib/acme you could try to move it somwhere else, re-run the system activation script and see if that unbreaks the services.

@arianvp
Copy link
Member

arianvp commented Nov 8, 2020

If you can reliably reproduce; could you type:

# systemd-analyze log-level debug

to enable debug logging

And see if the systemd logs on activation tell you more why it fails to set up the namespace? Please share the resulting logs here.

For completions' sake; are you running any custom kernel, and/or custom kernel commandline arguments that could influence systemd? e.g. cgroupsv2 enabled or not

@symphorien
Copy link
Member

In my experience, this error message means that one of the directories mentioned in the systemd unit does not exist. Did you run systemd-tmpfiles --create ?

@poelzi
Copy link
Member

poelzi commented Nov 29, 2020

@symphorien systemd-tmpfiles --create fixed acme for me. Why is this command not run properly ?

@flokli
Copy link
Contributor

flokli commented Nov 29, 2020

systemd-tmfiles only runs at NixOS activation and on bootup, and the folders it created might be shadowed by mounts happening later.

Also see #97702 (comment) for a slighly more verbose rant ;-)

@gtorrico
Copy link

gtorrico commented Dec 12, 2020

I am getting the same error and running the systemd-tmpfiles --create command does not fix the issue.

Here is a log. The root cause seems to be trying to read the oom_kill field error on the first line:

Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Child 11115 belongs to acme-example.org.service.
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Main process exited, code=exited, status=226/NAMESPACE
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Failed with result 'exit-code'.
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Service will not restart (restart setting)
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Changed start -> failed
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Job 13372 acme-example.org.service/start finished, result=failed
Dec 12 17:38:19 localhost systemd[1]: Failed to start Renew ACME certificate for example.org.
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Unit entered failed state.
Dec 12 17:38:19 localhost systemd[1]: acme-example.org.service: Consumed 787us CPU time, no IP traffic.
Dec 12 17:42:49 localhost systemd[1]: acme-example.org.service: Changed failed -> dead
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Installed new job acme-example.org.service/start as 14205
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: cgroup-compat: Applying [Startup]IOWeight=100 as [Startup]BlockIOWeight=500
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Failed to set 'blkio.weight' attribute on '/system.slice/acme-example.org.service' to '500': No such file or directory
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Passing 0 fds to service
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: About to execute: /nix/store/6gc4smw6raih3wy6si07aqd0q9cpls2m-unit-script-acme-example.org-start/bin/acme-example.org-start
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Forked /nix/store/6gc4smw6raih3wy6si07aqd0q9cpls2m-unit-script-acme-example.org-start/bin/acme-example.org-start as 11630
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Changed dead -> start
Dec 12 17:42:50 localhost systemd[1]: Starting Renew ACME certificate for example.org...
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: User lookup succeeded: uid=998 gid=60
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Child 11630 belongs to acme-example.org.service.
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Main process exited, code=exited, status=226/NAMESPACE
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Failed with result 'exit-code'.
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Service will not restart (restart setting)
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Changed start -> failed
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Job 14205 acme-example.org.service/start finished, result=failed
Dec 12 17:42:50 localhost systemd[1]: Failed to start Renew ACME certificate for example.org.
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Unit entered failed state.
Dec 12 17:42:50 localhost systemd[1]: acme-example.org.service: Consumed 915us CPU time, no IP traffic.

BTW here is the relevant ACME configuration:

security.acme = {
  acceptTerms = true;
  email = "adminstrator@example.org";
  certs."example.org" = {
    domain="*.example.org";
    dnsProvider = "namecheap";
    credentialsFile = "/etc/namecheap-creds";
    group = "nginx";
  };
};

@m1cr0man
Copy link
Contributor

I am getting the same error and running the systemd-tmpfiles --create command does not fix the issue.

Here is a log. The root cause seems to be trying to read the oom_kill field error on the first line:

I can't see anything here that relates this particular failure directly to the acme module. My only assumption right now would be that something in systemd is failing to apply the ProtectSystem=full attribute in the service. Can you try putting this in your config and reporting back?:

systemd.services."example.org".serviceConfig.ProtectSystem = lib.mkForce false;

@gtorrico
Copy link

Can you try putting this in your config and reporting back?

Thanks, I tried it:

diff --git a/machines/localhost/nixos/configuration.nix b/machines/localhost/nixos/configuration.nix
index f872df8..9d65010 100644
--- a/machines/localhost/nixos/configuration.nix
+++ b/machines/localhost/nixos/configuration.nix
@@ -1,4 +1,4 @@
-{ config, pkgs, ... }:
+{ config, pkgs, lib, ... }:
 {
   imports =
     [
@@ -207,6 +207,7 @@
   };

   systemd.services.nginx.serviceConfig.ProtectHome = "read-only";
+  systemd.services."acme-example.org".serviceConfig.ProtectSystem = (lib.mkForce false);

   time.timeZone = "America/New_York";

but the error persists:

reloading the following units: dbus.service
restarting the following units: nginx.service, polkit.service
the following new units were started: acme-fixperms.service, acme-example.org.timer
warning: the following units failed: acme-example.org.service

● acme-example.org.service - Renew ACME certificate for example.org
     Loaded: loaded (/nix/store/4lx0w41wz0rzybfwkik7ybqbm45ff1d5-unit-acme-example.org.service/acme-example.org.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2020-12-13 12:35:08 EST; 238ms ago
TriggeredBy: ● acme-example.org.timer
    Process: 21728 ExecStart=/nix/store/6gc4smw6raih3wy6si07aqd0q9cpls2m-unit-script-acme-example.org-start/bin/acme-example.org-start (code=exited, status=226/NAMESPACE)
   Main PID: 21728 (code=exited, status=226/NAMESPACE)
         IP: 0B in, 0B out
        CPU: 760us

Dec 13 12:35:08 localhost systemd[1]: Starting Renew ACME certificate for example.org...
Dec 13 12:35:08 localhost systemd[21728]: acme-example.org.service: Failed to set up mount namespacing: /run/systemd/unit-root/tmp/accounts: No such file or directory
Dec 13 12:35:08 localhost systemd[21728]: acme-example.org.service: Failed at step NAMESPACE spawning /nix/store/6gc4smw6raih3wy6si07aqd0q9cpls2m-unit-script-acme-example.org-start/bin/        acme-example.org-start: No such file or directory
Dec 13 12:35:08 localhost systemd[1]: acme-example.org.service: Main process exited, code=exited, status=226/NAMESPACE
Dec 13 12:35:08 localhost systemd[1]: acme-example.org.service: Failed with result 'exit-code'.
Dec 13 12:35:08 localhost systemd[1]: Failed to start Renew ACME certificate for example.org.
warning: error(s) occurred while switching to the new configuration

@m1cr0man
Copy link
Contributor

Can you try running systemd-tmpfiles --clean, systemd-tmpfiles --create and try again? Failing that, can you set serviceConfig.PrivateTmp = false; and see if that works at all? I still believe the issue here is outside of the acme module. It must have something to do with /run/systemd/unit-root/tmp/ not existing for some reason.

@gtorrico
Copy link

Thank you @m1cr0man, I did try running the commands as well as setting PrivateTmp = false, but still got the same error:

[15:39] gtorrico@localhost ~/work/nixos-confs git[enable-acme:6b4f10]
% sudo systemd-tmpfiles --clean && echo "\ncleaned\n" && sudo systemd-tmpfiles --clean && echo "\ncreated\n" && sudo systemctl start acme-example.org.service
[sudo] password for gtorrico:

cleaned


created
Job for acme-example.org.service failed because the control process exited with error code.
See "systemctl status acme-example.org.service" and "journalctl -xe" for details.

[15:39] gtorrico@localhost ~/work/nixos-confs git[enable-acme:6b4f10]
%
Dec 13 15:39:44 localhost systemd[1]: Starting Renew ACME certificate for example.org...                                    ~/work/nixos-confs
Dec 13 15:39:44 localhost systemd[6642]: acme-example.org.service: Failed to set up mount namespacing: /run/systemd/unit-root/tmp/accounts: No such file or directory
Dec 13 15:39:44 localhost systemd[6642]: acme-example.org.service: Failed at step NAMESPACE spawning /nix/store/6gc4smw6raih3wy6si07aqd0q9cpls2m-unit-script-acme-torling.us-start/bin/acme-torling.us-start: No such file or directory
Dec 13 15:39:44 localhost systemd[1]: acme-example.org.service: Main process exited, code=exited, status=226/NAMESPACE
Dec 13 15:39:44 localhost systemd[1]: acme-example.org.service: Failed with result 'exit-code'.
Dec 13 15:39:44 localhost systemd[1]: Failed to start Renew ACME certificate for example.org.

[15:39] gtorrico@localhost ~/work/nixos-confs git[enable-acme:6b4f10]
% sudo systemctl --version
systemd 246 (246)
+PAM +AUDIT -SELINUX +IMA +APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ +LZ4 -ZSTD +SECCOMP +BLKID -ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified

I pasted the version of systemd in case it helps at all. For now I'll just continue running certbot to manually update my certificates.

@m1cr0man
Copy link
Contributor

@gtorrico could you give #106857 a test please and report back here whether it worked for you? If anyone else having this issue could do the same that would be great.

@gtorrico
Copy link

gtorrico commented Dec 14, 2020

@m1cr0man That fixed it for me!

There was an error the first run because LetsEncrypt was busy, but it worked during the second run:

[19:29] gtorrico@localhost ~/work/zother/nixpkgs git[m1cr0man-acme-fixes:5074de]
% sudo nixos-rebuild test -I nixpkgs=.

● acme-example.org.service - Renew ACME certificate for example.org
     Loaded: loaded (/nix/store/5ih39vi6wx790aw5nyhslj8j7nz2smb4-unit-acme-example.org.service/acme-example.org.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2020-12-13 19:28:14 EST; 1s ago
TriggeredBy: ● acme-example.org.timer
    Process: 677 ExecStart=/nix/store/g9c7wwcb5fmkh9k8yd072hdajn1i4pjy-unit-script-acme-example.org-start/bin/acme-example.org-start (code=exited, status=1/
FAILURE)
   Main PID: 677 (code=exited, status=1/FAILURE)
         IP: 724B in, 724B out
        CPU: 30ms

Dec 13 19:28:14 localhost systemd[1]: Starting Renew ACME certificate for example.org...
Dec 13 19:28:14 localhost acme-example.org-start[688]: 2020/12/13 19:28:14 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/dir
ectory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: device or resource busy
Dec 13 19:28:14 localhost systemd[1]: acme-example.org.service: Main process exited, code=exited, status=1/FAILURE
Dec 13 19:28:14 localhost systemd[1]: acme-example.org.service: Failed with result 'exit-code'.
Dec 13 19:28:14 localhost systemd[1]: Failed to start Renew ACME certificate for example.org.
Dec 13 19:28:14 localhost systemd[1]: acme-example.org.service: Consumed 30ms CPU time, received 724B IP traffic, sent 724B IP traffic.

[19:29] gtorrico@localhost ~/work/zother/nixpkgs git[m1cr0man-acme-fixes:5074de]
% sudo systemctl status acme-example.org.service
● acme-example.org.service - Renew ACME certificate for example.org
     Loaded: loaded (/nix/store/5ih39vi6wx790aw5nyhslj8j7nz2smb4-unit-acme-example.org.service/acme-example.org.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sun 2020-12-13 19:29:26 EST; 1min 19s ago
TriggeredBy: ● acme-example.org.timer
    Process: 1671 ExecStart=/nix/store/g9c7wwcb5fmkh9k8yd072hdajn1i4pjy-unit-script-acme-example.org-start/bin/acme-example.org-start (code=exited, status=>
    Process: 1708 ExecStartPost=/nix/store/7grfn0rn97jlc2sjn55ld40hfr754x3s-acme-postrun (code=exited, status=0/SUCCESS)
   Main PID: 1671 (code=exited, status=0/SUCCESS)
         IP: 31.0K in, 12.4K out
        CPU: 96ms
        
Dec 13 19:29:25 localhost acme-example.org-start[1674]: 2020/12/13 19:29:25 [INFO] [*.example.org] acme: Cleaning DNS-01 challenge
Dec 13 19:29:26 localhost acme-example.org-start[1674]: 2020/12/13 19:29:26 [INFO] [*.example.org] acme: Validations succeeded; requesting certificates
Dec 13 19:29:26 localhost acme-example.org-start[1674]: 2020/12/13 19:29:26 [INFO] [*.example.org] Server responded with a certificate.
Dec 13 19:29:26 localhost acme-example.org-start[1671]: Installing new certificate
Dec 13 19:29:26 localhost acme-example.org-start[1703]: 'certificates/_.example.org.crt' -> 'out/fullchain.pem'
Dec 13 19:29:26 localhost acme-example.org-start[1704]: 'certificates/_.example.org.key' -> 'out/key.pem'
Dec 13 19:29:26 localhost acme-example.org-start[1705]: 'certificates/_.example.org.issuer.crt' -> 'out/chain.pem'
Dec 13 19:29:26 localhost systemd[1]: acme-example.org.service: Succeeded.
Dec 13 19:29:26 localhost systemd[1]: Finished Renew ACME certificate for example.org.
Dec 13 19:29:26 localhost systemd[1]: acme-example.org.service: Consumed 96ms CPU time, received 30.9K IP traffic, sent 12.4K IP traffic.

@m1cr0man
Copy link
Contributor

Amazing! Wrt the first failure, are you by chance running a DNS server locally too?

@gtorrico
Copy link

I am, just dnsmasq but just to get static IPs in my local network. I did not setup bind like the instructions mentioned.

Is the reason it failed the first time because the propagation check failed?

@m1cr0man
Copy link
Contributor

Yeah so there's a known issue where if your local resolver is used in your resolv.conf then the acme module may fail on a rebuild because it starts before your DNS server. I get the same error (or a similar one) on my system which runs Bind. The fix theoretically involves some sort of generalised ability for the ACME module to wait for a local DNS server to come online, and we haven't figured that out yet ;) but it will be done soon. I'll open a real ticket for it now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants