Wireguard - waiting for DNS before trying to start the interface #30459

sjau · 2017-10-16T06:42:38Z

Issue description

I have WireGuard installed and use a domain name for the connection to the WireGuard server. The problem is that on boot, wg doesn't seem to wait for DNS resolution, so starting the interface fails.

Steps to reproduce

Add wireguard client configuration like:

    # Enable Wireguard
    networking.wireguard.interfaces = {
        wg0 = {
            ips = [ "10.10.0.2/24" ];
            peers = [ {
                allowedIPs = [ "10.10.0.0/24" ];
                endpoint = "wireguard.server.tld:51820";
                publicKey = "yaddayaddayadda";
                persistentKeepalive = 25;
            } ];
            privateKey = "anotheryadda";
        };
    };

Often, but not always, I get journalctl entries like this:

    Okt 16 08:27:14 subi ip[6066]: Cannot find device "wg0"
    Okt 16 08:27:14 subi kernel: wireguard: WireGuard 0.0.20171005 loaded. See www.wireguard.com for information.
    Okt 16 08:27:14 subi kernel: wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
    Okt 16 08:27:14 subi polkitd[6128]: Started polkitd version 0.113
    Okt 16 08:27:14 subi wg[6233]: Name or service not known: `wireguard.server.tld:51820`
    Okt 16 08:27:14 subi systemd[1]: wireguard-wg0.service: Main process exited, code=exited, status=1/FAILURE
    Okt 16 08:27:14 subi systemd[1]: Failed to start WireGuard Tunnel - wg0.
    Okt 16 08:27:14 subi systemd[1]: wireguard-wg0.service: Unit entered failed state.
    Okt 16 08:27:14 subi systemd[1]: wireguard-wg0.service: Failed with result 'exit-code'.

From what I can tell, it can't resolve the DNS name properly.

Technical details

System: (NixOS: nixos-version, Ubuntu/Fedora: lsb_release -a, ...)

18.03pre117886.874a3c033c (Impala)

Nix version: (run nix-env --version)

nix-env (Nix) 1.11.15

Nixpkgs version: (run nix-instantiate --eval "<nixpkgs>" -A lib.nixpkgsVersion)

"18.03pre117939.3ee33f35f8"

Sandboxing enabled: (run grep build-use-sandbox /etc/nix/nix.conf)

build-use-sandbox = false

Mic92 · 2017-10-16T11:32:26Z

This must be solved upstream. In my systemd-networkd pull request I do exponential backoff.

Mic92 · 2017-10-16T11:33:12Z

cc @zx2c4

zx2c4 · 2017-10-16T12:20:02Z

> In my systemd-networkd pull request

Any plans to finish that soon?

Funnily enough, I changed the resolve algorithm a bit last night, actually:

https://git.zx2c4.com/WireGuard/commit/?id=76a9bb3898fa7ce8574a32d587014ae91ab34703

This is based on the discussion here:

https://sourceware.org/glibc/wiki/NameResolver

> From the perspective of the application that calls getaddrinfo() it perhaps doesn't matter that much since EAI_FAIL, EAI_NONAME and EAI_NODATA are all permanent failure codes and the causes are all permanent failures in the sense that there is no point in retrying later.

However, it appears that @sjau is receiving EAI_NONAME, which means the above patch won't help. Why is @sjau receiving a permanent error at this stage? @sjau - could you describe in depth your DNS configuration?
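To make the distinction concrete, here is a minimal Python sketch of a resolver wrapper that retries unless getaddrinfo() reports one of the permanent codes named in the quote. It is not WireGuard's actual code (which is C); the names `resolve_with_retry` and `is_permanent` and the `attempts`/`delay` parameters are assumptions made for illustration.

```python
import socket
import time

# Codes the quoted glibc wiki text calls permanent failures.
# EAI_NODATA is not defined on every platform, so look names up defensively.
PERMANENT = {
    getattr(socket, name)
    for name in ("EAI_FAIL", "EAI_NONAME", "EAI_NODATA")
    if hasattr(socket, name)
}


def is_permanent(err: socket.gaierror) -> bool:
    """True if retrying later is pointless according to the error code."""
    return err.errno in PERMANENT


def resolve_with_retry(host, port, attempts=5, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying only on non-permanent failures.

    `resolver` and `sleep` are injectable so the logic can be exercised
    without touching the network (hypothetical parameters).
    """
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, port)
        except socket.gaierror as err:
            # Give up at once on permanent codes, or when out of attempts.
            if is_permanent(err) or attempt == attempts:
                raise
            sleep(delay)
```

With this logic, a temporary failure such as EAI_AGAIN is retried, while EAI_NONAME ("Name or service not known", as in the log above) aborts on the first attempt, which is why a retry patch alone would not help here.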

sjau · 2017-10-16T12:31:01Z

I have a Turris Omnia router, and on it I run dnsmasq to resolve the server domain name (a dyndns address) locally, so that it resolves properly from both the LAN and the WAN.

The entry for dnsmasq in the /etc/hosts.add file on the Turris Omnia router is:

10.0.0.10 wireguard.server.tld

zx2c4 · 2017-10-16T12:34:36Z

The hardware used isn't relevant.

What is in your /etc/resolv.conf? nameserver 127.0.0.1? What's in your /etc/nsswitch.conf? Does this issue occur because wg is executed before dnsmasq is started? What is the purpose of you using /etc/hosts.add in that way, rather than ordinary DNS?

sjau · 2017-10-16T12:37:02Z

    cat /etc/nsswitch.conf
    passwd:    files mymachines systemd
    group:     files mymachines systemd
    shadow:    files

    hosts:     files mymachines mdns_minimal [!UNAVAIL=return] dns mdns myhostname
    networks:  files

    ethers:    files
    services:  files
    protocols: files
    rpc:       files

    cat /etc/resolv.conf
    # Generated by resolvconf
    nameserver 10.10.10.1
    nameserver 2a02:169:802::1
    nameserver fdfe:f8a3:6ff2::1
    options edns0

Client doesn't use dnsmasq... that's the router.

Mic92 · 2017-10-16T12:37:36Z

So either avahi's mdns or the glibc resolver returns this.

Mic92 · 2017-10-16T12:38:10Z

I also remember that I had to treat every error as a transient error in networkd.

zx2c4 · 2017-10-16T12:46:34Z

> Client doesn't use dnsmasq... that's the router.

In that case, I have no idea what belongs to what and what on earth you're talking about. So let's start over:

Our issue is with the client. I don't want to hear about other computers. Just the client. Would you summarize in one post all of the relevant DNS information about the client?

sjau · 2017-10-16T12:48:14Z

Well, you wanted to know in depth dns configuration. Since I don't run a dns server on the client, I assumed you meant the dns server that I use... so I gave you that.

That's from the client, the information you requested:

#30459 (comment)

zx2c4 · 2017-10-16T12:58:31Z

Okay, thanks for the clarification. So:

    hosts:     files mymachines mdns_minimal [!UNAVAIL=return] dns mdns myhostname

Are any of these tweaked by you? Or is this a standard NixOS situation? For comparison, my line (from a different distro) just looks like:

    hosts:       files dns

So one line in files mymachines mdns_minimal [!UNAVAIL=return] dns mdns myhostname is returning EAI_NONAME instead of EAI_AGAIN like it should. Can you figure out which of these is responsible for that? For example, I could imagine [!UNAVAIL=return] is a bit problematic.

It looks like this is a NixOS particularity:

nixpkgs/nixos/modules/config/nsswitch.nix, line 21 in 72a64ea:

    ++ optionals nssmdns [ "mdns_minimal [!UNAVAIL=return]" ]

It was caused by commit 987aac7. This commit changed the more sensible [NOTFOUND=return] to [!UNAVAIL=return]. Neither the commit message nor the related issue #18183 addresses the reasoning for this particular change. I suspect it was accidental.

This is only a hypothesis. We won't know until @sjau tests it out, by changing [!UNAVAIL=return] to [NOTFOUND=return].
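The difference between the two action specs can be illustrated with a small Python model of how the `hosts:` sources are walked. This is a simplified sketch, not glibc's implementation: the chain is shortened to three sources, and the status assigned to each source in the example is hypothetical, since the thread had not established which source returned what during early boot.

```python
from typing import Dict, List, Tuple

# The four lookup statuses of the Name Service Switch.
STATUSES = ("SUCCESS", "NOTFOUND", "UNAVAIL", "TRYAGAIN")

# Default actions: stop on SUCCESS, otherwise try the next source.
DEFAULT_ACTIONS = {"SUCCESS": "return", "NOTFOUND": "continue",
                   "UNAVAIL": "continue", "TRYAGAIN": "continue"}


def parse_action(spec: str) -> Dict[str, str]:
    """Turn a spec such as '[!UNAVAIL=return]' into a per-status action table."""
    actions = dict(DEFAULT_ACTIONS)
    if not spec:
        return actions
    status, action = spec.strip("[]").split("=")
    if status.startswith("!"):
        # Negation: the action applies to every status except the named one.
        for other in STATUSES:
            if other != status[1:]:
                actions[other] = action
    else:
        actions[status] = action
    return actions


def lookup(chain: List[Tuple[str, str]],
           results: Dict[str, str]) -> Tuple[str, str]:
    """Walk the sources in order; return the (source, status) that ended the lookup."""
    outcome = ("", "UNAVAIL")
    for source, spec in chain:
        status = results[source]
        outcome = (source, status)
        if parse_action(spec)[status] == "return":
            break
    return outcome


def hosts_chain(mdns_spec: str) -> List[Tuple[str, str]]:
    """Shortened hosts line: files, mdns_minimal with an action spec, dns."""
    return [("files", ""), ("mdns_minimal", mdns_spec), ("dns", "")]


# Hypothetical early-boot statuses, assumed purely for illustration.
boot = {"files": "NOTFOUND", "mdns_minimal": "TRYAGAIN", "dns": "SUCCESS"}

print(lookup(hosts_chain("[!UNAVAIL=return]"), boot))
print(lookup(hosts_chain("[NOTFOUND=return]"), boot))
```

Under these assumed statuses, `[!UNAVAIL=return]` ends the lookup at mdns_minimal, while `[NOTFOUND=return]` lets it fall through to dns. Whether mdns_minimal really behaves this way at boot is exactly what the requested test would show.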

sjau · 2017-10-16T13:07:29Z

I did set it up to use Avahi:

    services.avahi = {
        enable = true;
        hostName = "${mySecrets.hostname}";
        nssmdns = true;
    };

but I don't really need it.

Commit 987aac7 and issue #18183 were intended to fix support for other things, but in the process, changed mdns_minimal to use the wrong return setting, resulting in permanent failures in early boot, affecting things like issue #30459. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

fpletz · 2017-10-23T23:15:15Z

Fixed by #30472. Thanks!

sjau · 2017-11-04T17:50:40Z

Actually, this hasn't been solved yet...

It's still failing:

    -- Reboot --
    Nov 04 18:45:18 subi systemd[1]: Starting WireGuard Tunnel - wg_home...
    Nov 04 18:45:18 subi ip[5601]: Cannot find device "wg_home"
    Nov 04 18:45:18 subi systemd[1]: wireguard-wg_home.service: Main process exited, code=exited, status=1/FAILURE
    Nov 04 18:45:18 subi wg[6019]: Name or service not known: `servi.home.sjau.ch:51820'
    Nov 04 18:45:18 subi systemd[1]: Failed to start WireGuard Tunnel - wg_home.
    Nov 04 18:45:18 subi systemd[1]: wireguard-wg_home.service: Unit entered failed state.
    Nov 04 18:45:18 subi systemd[1]: wireguard-wg_home.service: Failed with result 'exit-code'.

Wouldn't it be better for wg to retry automatically?

vcunat · 2017-11-04T18:48:38Z

For some kinds of "error conditions" it's not very meaningful to retry, as described in https://git.zx2c4.com/WireGuard/commit/?id=76a9bb3898fa7ce8574a32d587014ae91ab34703

but from the log it's not clear what exactly the OS got from DNS.

sjau · 2017-11-04T19:27:16Z

How do I get a different log? That's from journalctl...

Once NixOS has booted and I run systemctl restart wireguard-wg_home as root, it works fine. Hence a restart would help in that case.

zx2c4 · 2017-11-05T01:30:15Z

Looks like it's returning EAI_NONAME. Can you verify the contents of /etc/resolv.conf at that point during boot?

sjau · 2017-11-05T06:44:03Z

How do I do that?

    cat /etc/resolv.conf
    # Generated by resolvconf
    nameserver 10.0.0.1
    nameserver 2a02:169:800::1
    nameserver fd15:eec1:83bf::1
    options edns0

That's the resolv.conf once the system has booted up, but I have no idea whether it's the same at that point during boot.

vcunat · 2017-11-05T10:48:46Z

Something is probably different, given that restarting the service later fixes the problem. @sjau: what's your DNS setup? On the client you get all servers from DHCP only? DHCP is served by Turris Omnia router and these three addresses belong to it?

Assuming the name is servi.home.sjau.ch., I suspect there might be a problem if IPv6 gets up sooner than IPv4, because that particular name has no AAAA record...

sjau · 2017-11-05T11:05:00Z

Except for a few entries in the hosts file, I get everything from the TO router.

sjau · 2017-11-05T12:10:37Z

OK, I disabled IPv6 on my notebook and it's still the same:

    networking = {
        # Disable IPv6
        enableIPv6 = false;

    -- Reboot --
    Nov 05 13:04:50 subi systemd[1]: Starting WireGuard Tunnel - wg_home...
    Nov 05 13:04:50 subi ip[5174]: Cannot find device "wg_home"
    Nov 05 13:04:50 subi wg[5575]: Name or service not known: `servi.home.sjau.ch:51820'
    Nov 05 13:04:50 subi systemd[1]: wireguard-wg_home.service: Main process exited, code=exited, status=1/FAILURE
    Nov 05 13:04:50 subi systemd[1]: Failed to start WireGuard Tunnel - wg_home.
    Nov 05 13:04:50 subi systemd[1]: wireguard-wg_home.service: Unit entered failed state.
    Nov 05 13:04:50 subi systemd[1]: wireguard-wg_home.service: Failed with result 'exit-code'.

zx2c4 · 2017-11-05T19:10:17Z

Can you post a more substantial log to see the various other things happening?

tpanum · 2019-01-08T07:27:38Z

This is still happening for me :-(

sjau · 2019-01-08T07:50:25Z

My workaround is this:

https://github.com/sjau/nix-expressions/blob/master/wgStartFix.nix

timokau · 2019-01-13T08:57:23Z

I can confirm that this is still an issue. A nicer workaround without requiring a separate service to watch wireguard may be:

    networking.wireguard.interfaces.wg0 = {
      preSetup = ''
        # Try to resolve and reach the domain for up to 300s
        for i in {1..300}; do
          ${pkgs.iputils}/bin/ping -c1 '<insert domain to resolve here>' && break
          echo "Attempt $i: DNS still not available"
          sleep 1s
        done
      '';
      ...
    };

On my first try, that succeeded in starting the wireguard service after 11 failed attempts (i.e. 11 seconds after network-online).

sjau · 2019-01-13T10:15:01Z

The question is why this happens.

Yours works also but what if you have multiple wg up?

timokau · 2019-01-13T16:21:47Z

> The question is why this happens.

Yes, that would be good to know. @Mic92 do you have any ideas?

> Yours works also but what if you have multiple wg up?

Why would that make a difference? You'd need to add a similar preSetup to each that uses DNS.

timokau · 2019-01-16T19:22:58Z

After some research I thought the solution would be to add nss-lookup.target to the After and Requires sections of the service. Unfortunately, that did not improve the situation.

Mic92 · 2019-01-16T20:22:06Z

@timokau between our glibc resolver and the application is nscd. Maybe that alters the errno returned to the application. Apart from that anything in our /etc/nsswitch.conf can change the results.

timokau · 2019-01-16T21:48:18Z

I'm not sure I understand that. Why do you assume an error is altered?

Mic92 · 2019-01-16T22:15:10Z

@timokau because the error returned by our setup seems not to match what other distributions return, when the resolver cannot be reached: #30459 (comment)

timokau · 2019-01-16T22:29:09Z

Looks like I don't know enough about the subject to contribute :/

sjau · 2019-02-11T11:22:16Z

Hmmm, I just discovered that systemd.network now supports WireGuard as well:

https://manpages.debian.org/testing/systemd/systemd.netdev.5.en.html#%5BWIREGUARDPEER%5D_SECTION_OPTIONS

Maybe it's easier to just set it up there? According to the NixOS options it's not yet supported, though.

sjau · 2019-05-23T06:21:05Z

Coming back to this ancient problem:

Last night gchristensen and I were talking a bit about this problem. In the end he suggested adding

    systemd.services.wireguard-wg[...].serviceConfig.Restart = "on-failure";
    systemd.services.wireguard-wg[...].serviceConfig.RestartSec = "5s";

to configuration.nix so that it will restart. However, during rebuilding it complained that restart on failure is not compatible with the "oneshot" systemd service type. (There is also an issue on the systemd tracker that asks for retry on failure to be added to the oneshot type: systemd/systemd#2582.)

So I tried changing the service type from oneshot to simple by changing

          Type = "oneshot";

to

          Type = "simple";

in the https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/networking/wireguard.nix#L248 file.

Since I didn't know the proper syntax to add RestartSec = "5s"; in the nix file there, I just added

    systemd.services.wireguard-wg[...].serviceConfig.RestartSec = "5s";

to the configuration.nix

After that I rebuilt and it works fine. It will now retry connecting to the wg VPNs even if the first attempt isn't successful.

Currently my notebook connects to 3 wg vpns with this and I've had several reboots (for testing) and they always came back up again.

So the question is: what is the benefit of "oneshot" compared to "simple", given that "oneshot" seems to prevent wg from coming up properly when a domain name is used as the server target instead of an IP address?

sjau · 2019-05-26T11:04:19Z

This got fixed:

#61971

sjau · 2019-06-20T19:03:13Z

While #61971 did fix the issue, later changes introduced the same problem again for the wg peers. They will not get started properly because of oneshot type and no dns available at bootup.

stale · 2020-06-02T17:10:42Z

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

Search for maintainers and people that previously touched the related code and @ mention them in a comment.
Ask on the NixOS Discourse.
Ask on the #nixos channel on irc.freenode.net.

sjau · 2020-06-04T04:56:08Z

Still important to me

llewxam-kache · 2020-07-09T21:08:32Z

Relevant to my interests as well, and I'm also a little confused as to why the systemd service runs as a 'oneshot' instead of continually retrying (maybe at exponential intervals up to 30 seconds).
If you're running this service, presumably you always want it to come up, or at least want to define how many times it tries to restart, right?

Edit:
I think a minor fix for my Ubuntu-based install of WG was to change wg-quick@.service to include StartLimitBurst=5 (maybe not needed) under [Unit], change the Type under [Service] to simple, and add Restart=on-failure and RestartSec=5s.
Then systemctl daemon-reload, reboot, and it looks happy.
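A capped exponential retry schedule like the one suggested above could look as follows. This is only a sketch of the arithmetic in Python; the function name, the 1-second base, and the number of attempts are assumptions, and systemd itself is configured through unit options rather than code like this.

```python
from typing import List


def backoff_delays(base: float = 1.0, cap: float = 30.0,
                   attempts: int = 8) -> List[float]:
    """Delay before each retry: doubles every time, never exceeding `cap`."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


print(backoff_delays())
# prints [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0, 30.0]
```

Capping the delay keeps a long DNS outage from pushing the next attempt arbitrarily far into the future.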

stale · 2021-01-06T04:03:21Z

I marked this as stale due to inactivity.

tgunnoe · 2021-01-08T17:27:37Z

still important to me as well, an easier vpn setup

flokli · 2021-01-09T18:38:11Z

Our whole networking.wireguard setup is somewhat scripted, so it being handled in oneshot services seems appropriate - there's just no long-running daemon here, but just a bunch of iproute2/wg command invocations.

The oneshot services setting up the interface do wait for network-online.target (which means DNS should be reachable), and since 1de35c7, all the per-peer scripts depend on that script. They also set WG_ENDPOINT_RESOLUTION_RETRIES = "infinity", which should help with some intermittent DNS blips.

@sjau the original issue seems to be solved. If you still experience problems, please open a new issue with a reproducer and up-to-date logs.

zx2c4 mentioned this issue Oct 16, 2017

nsswitch: use [NOTFOUND=return] for mdns #30472

Merged

fpletz closed this as completed Oct 23, 2017

timokau reopened this Jan 13, 2019

sjau mentioned this issue May 23, 2019

wireguard: restart on failure / As a oneshot service, if the startup f… #61971

Merged

sjau closed this as completed May 26, 2019

grahamc mentioned this issue May 31, 2019

wireguard: 0.0.20190406 -> 0.0.20190531 and Change peers without tearing down the interface, handle DNS failures better #62325

Merged

sjau reopened this Jun 20, 2019

Shados mentioned this issue Jul 24, 2019

Wireguard doesn't bring up peers #63869

Open

stale bot added the "2.status: stale" label Jun 2, 2020

stale bot removed the "2.status: stale" label Jun 4, 2020

stale bot added the "2.status: stale" label Jan 6, 2021

stale bot removed the "2.status: stale" label Jan 8, 2021

flokli closed this as completed Jan 9, 2021
