Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul service doesn't start in nix-container on boot #80475

Open
jficz opened this issue Feb 18, 2020 · 2 comments
Open

Consul service doesn't start in nix-container on boot #80475

jficz opened this issue Feb 18, 2020 · 2 comments
Labels
0.kind: bug 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md

Comments

@jficz
Copy link
Contributor

jficz commented Feb 18, 2020

Describe the bug
When a consul service is defined in a declarative nix-container, the systemd service is started as soon as the networking.target is reached but the ExecStart consul process fails very quickly without any useful info in log. A default 100ms RestartSec causes the unit to fail within less then a second as 5 failed restarts are done within that time.

When RestartSec is increased to 1sec, the first 4 processes fails but the fifth starts normally:

[root@consulagent:~]# journalctl -eb -u consul
Feb 18 21:47:14 consulagent systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 21:47:14 consulagent systemd[1]: consul.service: Failed with result 'exit-code'.
Feb 18 21:47:15 consulagent systemd[1]: consul.service: Service RestartSec=1s expired, scheduling restart.
Feb 18 21:47:15 consulagent systemd[1]: consul.service: Scheduled restart job, restart counter is at 1.
Feb 18 21:47:15 consulagent systemd[1]: Stopped consul.service.
Feb 18 21:47:15 consulagent systemd[1]: Starting consul.service...
Feb 18 21:47:15 consulagent systemd[1]: Started consul.service.
Feb 18 21:47:15 consulagent systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 21:47:15 consulagent systemd[1]: consul.service: Failed with result 'exit-code'.
Feb 18 21:47:16 consulagent systemd[1]: consul.service: Service RestartSec=1s expired, scheduling restart.
Feb 18 21:47:16 consulagent systemd[1]: consul.service: Scheduled restart job, restart counter is at 2.
Feb 18 21:47:16 consulagent systemd[1]: Stopped consul.service.
Feb 18 21:47:16 consulagent systemd[1]: Starting consul.service...
Feb 18 21:47:16 consulagent systemd[1]: Started consul.service.
Feb 18 21:47:16 consulagent systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 21:47:16 consulagent systemd[1]: consul.service: Failed with result 'exit-code'.
Feb 18 21:47:18 consulagent systemd[1]: consul.service: Service RestartSec=1s expired, scheduling restart.
Feb 18 21:47:18 consulagent systemd[1]: consul.service: Scheduled restart job, restart counter is at 3.
Feb 18 21:47:18 consulagent systemd[1]: Stopped consul.service.
Feb 18 21:47:18 consulagent systemd[1]: Starting consul.service...
Feb 18 21:47:18 consulagent systemd[1]: Started consul.service.
Feb 18 21:47:18 consulagent systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 21:47:18 consulagent systemd[1]: consul.service: Failed with result 'exit-code'.
Feb 18 21:47:19 consulagent systemd[1]: consul.service: Service RestartSec=1s expired, scheduling restart.
Feb 18 21:47:19 consulagent systemd[1]: consul.service: Scheduled restart job, restart counter is at 4.
Feb 18 21:47:19 consulagent systemd[1]: Stopped consul.service.
Feb 18 21:47:19 consulagent systemd[1]: Starting consul.service...
Feb 18 21:47:19 consulagent systemd[1]: Started consul.service.
Feb 18 21:47:19 consulagent consul[524]: ==> Starting Consul agent...
Feb 18 21:47:19 consulagent consul[524]:            Version: 'v1.6.0'
Feb 18 21:47:19 consulagent consul[524]:            Node ID: 'c3f0b9f3-d00a-274c-86fb-75c991d85002'
Feb 18 21:47:19 consulagent consul[524]:          Node name: 'consulagent'
Feb 18 21:47:19 consulagent consul[524]:         Datacenter: 'dc1' (Segment: '')
Feb 18 21:47:19 consulagent consul[524]:             Server: false (Bootstrap: false)
Feb 18 21:47:19 consulagent consul[524]:        Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Feb 18 21:47:19 consulagent consul[524]:       Cluster Addr: 192.168.32.101 (LAN: 8301, WAN: 8302)
Feb 18 21:47:19 consulagent consul[524]:            Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
Feb 18 21:47:19 consulagent consul[524]: ==> Log data will now stream in as it occurs:
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] serf: EventMemberJoin: consulagent 192.168.32.101
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] agent: started state syncer
Feb 18 21:47:19 consulagent consul[524]: ==> Consul agent running!
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [WARN] manager: No servers available
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [ERR] agent: failed to sync remote state: No known Consul servers
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] serf: EventMemberJoin: nhost 192.168.32.1
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [WARN] memberlist: Refuting a suspect message (from: consulagent)
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] consul: adding server nhost (Addr: tcp/192.168.32.1:8300) (DC: dc1)
Feb 18 21:47:19 consulagent consul[524]:     2020/02/18 21:47:19 [INFO] consul: New leader elected: nhost
Feb 18 21:47:21 consulagent consul[524]:     2020/02/18 21:47:21 [INFO] agent: Synced node info
Feb 18 21:47:40 consulagent consul[524]: ==> Newer Consul version available: 1.7.0 (currently running: 1.6.0)
Feb 18 21:48:03 consulagent consul[524]:     2020/02/18 21:48:03 [INFO] agent: Synced service "consulagent"
Feb 18 21:49:19 consulagent consul[524]:     2020/02/18 21:49:19 [INFO] agent: Synced service "consulagent"
Feb 18 21:50:16 consulagent consul[524]:     2020/02/18 21:50:16 [INFO] agent: Synced service "consulagent"
Feb 18 21:51:18 consulagent consul[524]:     2020/02/18 21:51:18 [INFO] agent: Synced service "consulagent"
Feb 18 21:52:01 consulagent consul[524]:     2020/02/18 21:52:01 [INFO] agent: Synced service "consulagent"
Feb 18 21:53:13 consulagent consul[524]:     2020/02/18 21:53:13 [INFO] agent: Synced service "consulagent"
Feb 18 21:54:19 consulagent consul[524]:     2020/02/18 21:54:19 [INFO] agent: Synced service "consulagent"

Apparently there is some kind of circa 5-second race condition but so far I was unable to figure out where.

To Reproduce
Steps to reproduce the behavior:
1.

        config = { config, pkgs, lib, ... }: 
        {
           # systemd.services.consul.serviceConfig.RestartSec = 1;
            services.consul = {
                enable = true;
            };
           ...

  1. nixos-rebuild switch
  2. nix-container root-login consulagent
  3. systemctl status consul

Expected behavior
Service is active and running.

Additional context
I'm actually using nixops to deploy the container on a remote NixOS box but to my knowledge this shouldn't make a difference.

Metadata
Output from the laptop where nixops is run. Note that nixops uses the user's stable-19.09 channel for the deployment.

 - system: `"x86_64-linux"`
 - host os: `Linux 5.5.2, NixOS, 20.03pre212208.8130f3c1c2b (Markhor)`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.3.2`
 - channels(root): `"nixos-20.03pre212208.8130f3c1c2b, stable-19.09.2036.c49da6435f3"`
 - channels(user): `"stable-19.09.2079.8731aaaf8b3, unstable-20.03pre212208.8130f3c1c2b"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute: consul
# a list of nixos modules affected by the problem
module: consul
@jficz
Copy link
Contributor Author

jficz commented Feb 19, 2020

this helps:

    systemd.services.consul = {
        requires        = [ "network-online.target" ];
    };

I'll create a PR later.

@stale
Copy link

stale bot commented Aug 17, 2020

Hello, I'm a bot and I thank you in the name of the community for opening this issue.

To help our human contributors focus on the most-relevant reports, I check up on old issues to see if they're still relevant. This issue has had no activity for 180 days, and so I marked it as stale, but you can rest assured it will never be closed by a non-human.

The community would appreciate your effort in checking if the issue is still valid. If it isn't, please close it.

If the issue persists, and you'd like to remove the stale label, you simply need to leave a comment. Your comment can be as simple as "still important to me". If you'd like it to get more attention, you can ask for help by searching for maintainers and people that previously touched related code and @ mention them in a comment. You can use Git blame or GitHub's web interface on the relevant files to find them.

Lastly, you can always ask for help at our Discourse Forum or at #nixos' IRC channel.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Aug 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: bug 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md
Projects
None yet
Development

No branches or pull requests

1 participant