Wireguard - waiting for DNS before trying to start the interface #30459

Closed
sjau opened this issue Oct 16, 2017 · 42 comments

@sjau

sjau commented Oct 16, 2017

Issue description

I have WireGuard installed and use a domain name to connect to the WireGuard server. The problem is that at boot, wg doesn't seem to wait for DNS resolution, so starting the interface fails.

Steps to reproduce

Add wireguard client configuration like:

    # Enable Wireguard
    networking.wireguard.interfaces = {
        wg0 = {
            ips = [ "10.10.0.2/24" ];
            peers = [ {
                allowedIPs = [ "10.10.0.0/24" ];
                endpoint = "wireguard.server.tld:51820";
                publicKey = "yaddayaddayadda";
                persistentKeepalive = 25;
            } ];
            privateKey = "anotheryadda";
        };
    };

Often, but not always, I get journalctl entries like this:

Okt 16 08:27:14 subi ip[6066]: Cannot find device "wg0"
Okt 16 08:27:14 subi kernel: wireguard: WireGuard 0.0.20171005 loaded. See www.wireguard.com for information.
Okt 16 08:27:14 subi kernel: wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
Okt 16 08:27:14 subi polkitd[6128]: Started polkitd version 0.113
Okt 16 08:27:14 subi wg[6233]: Name or service not known: `wireguard.server.tld:51820`
Okt 16 08:27:14 subi systemd[1]: wireguard-wg0.service: Main process exited, code=exited, status=1/FAILURE
Okt 16 08:27:14 subi systemd[1]: Failed to start WireGuard Tunnel - wg0.
Okt 16 08:27:14 subi systemd[1]: wireguard-wg0.service: Unit entered failed state.
Okt 16 08:27:14 subi systemd[1]: wireguard-wg0.service: Failed with result 'exit-code'.

From what I read, it can't resolve the DNS name properly.

Technical details

  • System: (NixOS: nixos-version, Ubuntu/Fedora: lsb_release -a, ...)

18.03pre117886.874a3c033c (Impala)

  • Nix version: (run nix-env --version)

nix-env (Nix) 1.11.15

  • Nixpkgs version: (run nix-instantiate --eval "<nixpkgs>" -A lib.nixpkgsVersion)

"18.03pre117939.3ee33f35f8"

  • Sandboxing enabled: (run grep build-use-sandbox /etc/nix/nix.conf)

build-use-sandbox = false

@Mic92
Member

Mic92 commented Oct 16, 2017

This must be solved upstream. In my systemd-networkd pull request I do exponential backoff.
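
As a rough illustration of what exponential backoff means here (this is not the code from the networkd pull request, which is not shown in this thread), the following Python sketch doubles the wait after each failed attempt, up to a cap. The names backoff_delays and retry, the 1 s base and the 30 s cap are all assumptions made for the example.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield base, base*factor, base*factor**2, ... never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def retry(action: Callable[[], T], attempts: int,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action until it succeeds or `attempts` calls have failed.

    Only OSError (which includes socket.gaierror) is retried; the last
    failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delays = backoff_delays()
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise
            sleep(next(delays))
    raise AssertionError("unreachable")
```

Passing the sleep function in as a parameter keeps the helper testable without real waiting; a caller would wrap the actual name lookup in action.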

@Mic92
Member

Mic92 commented Oct 16, 2017

cc @zx2c4

@zx2c4
Contributor

zx2c4 commented Oct 16, 2017

In my systemd-networkd pull request

Any plans to finish that soon?

Funny enough I changed the resolv algorithm a bit last night, actually:

https://git.zx2c4.com/WireGuard/commit/?id=76a9bb3898fa7ce8574a32d587014ae91ab34703

This is based on the discussion here:

https://sourceware.org/glibc/wiki/NameResolver

From the perspective of the application that calls getaddrinfo() it perhaps doesn't matter that much since EAI_FAIL, EAI_NONAME and EAI_NODATA are all permanent failure codes and the causes are all permanent failures in the sense that there is no point in retrying later.

However, it appears that @sjau is receiving EAI_NONAME, which means the above patch won't help. Why is @sjau receiving a permanent error at this stage? @sjau - could you describe in depth your DNS configuration?
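
The distinction between retryable and permanent getaddrinfo() failures can be made concrete with a short sketch. It is written in Python rather than the C used by wg, and the function name classify is made up for illustration. Grouping EAI_FAIL, EAI_NONAME and EAI_NODATA as permanent follows the glibc wiki text quoted above; treating EAI_AGAIN as the retryable code matches the expectation stated later in this thread.

```python
import socket

# The code that signals "try again later".
TRANSIENT = {socket.EAI_AGAIN}

# Codes the quoted glibc wiki text calls permanent. EAI_NODATA is not
# defined on every platform, so it is looked up defensively.
PERMANENT = {
    socket.EAI_NONAME,
    socket.EAI_FAIL,
    getattr(socket, "EAI_NODATA", None),
} - {None}


def classify(exc: socket.gaierror) -> str:
    """Return 'transient', 'permanent' or 'unknown' for a resolver error."""
    code = exc.args[0]  # gaierror carries the EAI_* code as its first argument
    if code in TRANSIENT:
        return "transient"
    if code in PERMANENT:
        return "permanent"
    return "unknown"
```

Under this classification, the "Name or service not known" message in the log above corresponds to EAI_NONAME and would therefore not be retried.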

@sjau
Author

sjau commented Oct 16, 2017

I have a Turris Omnia router. On it I run dnsmasq to resolve the server's domain name (a dyndns address) locally, so that it resolves properly from both the LAN and the WAN.

The entry for dnsmasq in the /etc/hosts.add file in the TO router is:

10.0.0.10 wireguard.server.tld

@zx2c4
Contributor

zx2c4 commented Oct 16, 2017

The hardware used isn't relevant.

What is in your /etc/resolv.conf? nameserver 127.0.0.1? What's in your /etc/nsswitch.conf? Does this issue occur because wg is executed before dnsmasq is started? What is the purpose of you using /etc/hosts.add in that way, rather than ordinary DNS?

@sjau
Author

sjau commented Oct 16, 2017

cat /etc/nsswitch.conf 
passwd:    files mymachines systemd
group:     files mymachines systemd
shadow:    files

hosts:     files mymachines mdns_minimal [!UNAVAIL=return] dns mdns myhostname
networks:  files

ethers:    files
services:  files
protocols: files
rpc:       files

cat /etc/resolv.conf
# Generated by resolvconf
nameserver 10.10.10.1
nameserver 2a02:169:802::1
nameserver fdfe:f8a3:6ff2::1
options edns0

Client doesn't use dnsmasq... that's the router.

@Mic92
Member

Mic92 commented Oct 16, 2017

So either avahi's mdns or the glibc resolver return this.

@Mic92
Member

Mic92 commented Oct 16, 2017

I also remember that I had to treat every error as transient error in networkd.

@zx2c4
Contributor

zx2c4 commented Oct 16, 2017

Client doesn't use dnsmasq... that's the router.

In that case, I have no idea what belongs to what and what on earth you're talking about. So let's start over:

Our issue is with the client. I don't want to hear about other computers. Just the client. Would you summarize in one post all of the relevant DNS information about the client?

@sjau
Author

sjau commented Oct 16, 2017

Well, you wanted to know in depth dns configuration. Since I don't run a dns server on the client, I assumed you meant the dns server that I use... so I gave you that.

That's from the client, the information you requested:

#30459 (comment)

@zx2c4
Contributor

zx2c4 commented Oct 16, 2017

Okay, thanks for the clarification. So:

hosts:     files mymachines mdns_minimal [!UNAVAIL=return] dns mdns myhostname

Are any of these tweaked by you? Or is this a standard NixOS situation? For comparison, my line (from a different distro) just looks like:

hosts:       files dns

So one of the entries in files mymachines mdns_minimal [!UNAVAIL=return] dns mdns myhostname is returning EAI_NONAME instead of EAI_AGAIN like it should. Can you figure out which of these is responsible for that? For example, I could imagine [!UNAVAIL=return] is a bit problematic.

It looks like this is a NixOS-particularity:

++ optionals nssmdns [ "mdns_minimal [!UNAVAIL=return]" ]

It was caused by commit 987aac7 . This commit changed the more sensible [NOTFOUND=return] to [!UNAVAIL=return]. Neither the commit message nor the related issue #18183 address the reasoning for this particular change. I suspect it was accidental.

This is only a hypothesis. We won't know until @sjau tests it out, by changing [!UNAVAIL=return] to [NOTFOUND=return].
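
To see why the two bracketed actions can behave differently, here is a small Python model of how an nsswitch.conf hosts line is walked. It follows the documented glibc rules only (default actions: SUCCESS returns, every other status continues; a leading ! applies the action to every status except the named one). The function names and the statuses fed in are hypothetical; what mdns_minimal actually returns during early boot is not established in this thread.

```python
from typing import Dict, List, Optional, Tuple

# glibc defaults: stop on SUCCESS, keep going on everything else.
DEFAULT_ACTIONS = {
    "SUCCESS": "return",
    "NOTFOUND": "continue",
    "UNAVAIL": "continue",
    "TRYAGAIN": "continue",
}


def parse_action(spec: str) -> Dict[str, str]:
    """Turn e.g. '[!UNAVAIL=return]' into a full status -> action table."""
    actions = dict(DEFAULT_ACTIONS)
    for item in spec.strip("[]").split():
        status, action = item.split("=")
        if status.startswith("!"):
            # Negated form: apply to every status except the named one.
            for other in DEFAULT_ACTIONS:
                if other != status[1:].upper():
                    actions[other] = action.lower()
        else:
            actions[status.upper()] = action.lower()
    return actions


def walk(chain: List[Tuple[str, Optional[str]]],
         results: Dict[str, str]) -> Tuple[str, str]:
    """Return the (service, status) pair at which the lookup stops.

    chain   -- services in order, each with an optional action spec
    results -- the status each service is assumed to report
    """
    last = ("", "UNAVAIL")
    for service, spec in chain:
        status = results[service]
        last = (service, status)
        actions = parse_action(spec) if spec else DEFAULT_ACTIONS
        if actions[status] == "return":
            break
    return last
```

For example, if mdns_minimal were to report TRYAGAIN (an assumption), a chain of files, mdns_minimal [!UNAVAIL=return], dns would stop at mdns_minimal and never consult dns, whereas with [NOTFOUND=return] the lookup would continue to dns.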

@sjau
Author

sjau commented Oct 16, 2017

I did set to use Avahi:

    services.avahi = {
        enable = true;
        hostName = "${mySecrets.hostname}";
        nssmdns = true;
    };

but I don't really need it.

Mic92 pushed a commit that referenced this issue Oct 16, 2017
Commit 987aac7 and issue #18183 were intended to fix support for other
things, but in the process, changed mdns_minimal to use the wrong return
setting, resulting in permanent failures in early boot, affecting things
like issue #30459.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
@fpletz
Member

fpletz commented Oct 23, 2017

Fixed by #30472. Thanks!

@fpletz fpletz closed this as completed Oct 23, 2017
@sjau
Author

sjau commented Nov 4, 2017

actually, this hasn't been solved yet...

it's still failing

-- Reboot --
Nov 04 18:45:18 subi systemd[1]: Starting WireGuard Tunnel - wg_home...
Nov 04 18:45:18 subi ip[5601]: Cannot find device "wg_home"
Nov 04 18:45:18 subi systemd[1]: wireguard-wg_home.service: Main process exited, code=exited, status=1/FAILURE
Nov 04 18:45:18 subi wg[6019]: Name or service not known: `servi.home.sjau.ch:51820'
Nov 04 18:45:18 subi systemd[1]: Failed to start WireGuard Tunnel - wg_home.
Nov 04 18:45:18 subi systemd[1]: wireguard-wg_home.service: Unit entered failed state.
Nov 04 18:45:18 subi systemd[1]: wireguard-wg_home.service: Failed with result 'exit-code'.

Wouldn't it be better for wg to retry automatically?

@vcunat
Member

vcunat commented Nov 4, 2017

For some kinds of "error conditions" it's not very meaningful to retry, as described in https://git.zx2c4.com/WireGuard/commit/?id=76a9bb3898fa7ce8574a32d587014ae91ab34703

but from the log it's not clear what exactly the OS got from DNS.

@sjau
Author

sjau commented Nov 4, 2017

How do I get a different log? That's from journalctl...

And once NixOS has booted and I run systemctl restart wireguard-wg_home again as root, it works fine. Hence a restart would help in that case.

@zx2c4
Contributor

zx2c4 commented Nov 5, 2017

Looks like it's returning EAI_NONAME. Can you verify the contents of /etc/resolv.conf at that point during boot?

@sjau
Author

sjau commented Nov 5, 2017

How to do that?

cat /etc/resolv.conf 
# Generated by resolvconf
nameserver 10.0.0.1
nameserver 2a02:169:800::1
nameserver fd15:eec1:83bf::1
options edns0

That's the resolv.conf once the system has booted up, but I have no idea whether it's the same at that point during boot.

@vcunat
Member

vcunat commented Nov 5, 2017

Something is probably different, given that restarting the service later fixes the problem. @sjau: what's your DNS setup? On the client you get all servers from DHCP only? DHCP is served by Turris Omnia router and these three addresses belong to it?

Assuming the name is servi.home.sjau.ch., I suspect there might be a problem if IPv6 gets up sooner than IPv4, because that particular name has no AAAA record...
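
One way to check this hypothesis is to resolve the name separately per address family and compare the outcomes. The sketch below is a diagnostic aid in Python, not what wg itself does; the function name probe and the injectable resolver parameter are assumptions made so the logic can be exercised without a network.

```python
import socket
from typing import Callable, Dict

FAMILIES = {"A (IPv4)": socket.AF_INET, "AAAA (IPv6)": socket.AF_INET6}


def probe(host: str, port: int,
          resolver: Callable = socket.getaddrinfo) -> Dict[str, str]:
    """Resolve host once per address family and report each outcome."""
    report = {}
    for label, family in FAMILIES.items():
        try:
            infos = resolver(host, port, family=family,
                             type=socket.SOCK_DGRAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the textual address.
            addresses = sorted({info[4][0] for info in infos})
            report[label] = ", ".join(addresses)
        except socket.gaierror as exc:
            report[label] = f"error: {exc}"
    return report
```

Running probe("servi.home.sjau.ch", 51820) early in boot and again after login would show whether only one family fails, or both.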

@sjau
Author

sjau commented Nov 5, 2017

Except for a few entries in the hosts file, I get everything from the TO router.

@sjau
Author

sjau commented Nov 5, 2017

Ok, I disabled IPv6 on my notebook and still the same:

    networking = {
        # Disable IPv6
        enableIPv6 = false;

-- Reboot --
Nov 05 13:04:50 subi systemd[1]: Starting WireGuard Tunnel - wg_home...
Nov 05 13:04:50 subi ip[5174]: Cannot find device "wg_home"
Nov 05 13:04:50 subi wg[5575]: Name or service not known: `servi.home.sjau.ch:51820'
Nov 05 13:04:50 subi systemd[1]: wireguard-wg_home.service: Main process exited, code=exited, status=1/FAILURE
Nov 05 13:04:50 subi systemd[1]: Failed to start WireGuard Tunnel - wg_home.
Nov 05 13:04:50 subi systemd[1]: wireguard-wg_home.service: Unit entered failed state.
Nov 05 13:04:50 subi systemd[1]: wireguard-wg_home.service: Failed with result 'exit-code'.

@zx2c4
Contributor

zx2c4 commented Nov 5, 2017

Can you post a more substantial log to see the various other things happening?

@tpanum
Contributor

tpanum commented Jan 8, 2019

This is still happening for me :-(

@sjau
Author

sjau commented Jan 8, 2019

@timokau
Member

timokau commented Jan 13, 2019

I can confirm that this is still an issue. A nicer workaround without requiring a separate service to watch wireguard may be:

networking.wireguard.interfaces.wg0 = {
  preSetup = ''
    # Try to access the DNS for up to 300s
    for i in {1..300}; do
      ${pkgs.iputils}/bin/ping -c1 '<insert domain to resolve here>' && break
      echo "Attempt $i: DNS still not available"
      sleep 1s
    done
  '';
  ...
}

On my first try, that succeeded in starting the wireguard service after 11 failed attempts (i.e. 11 seconds after network-online).

@timokau timokau reopened this Jan 13, 2019
@sjau
Author

sjau commented Jan 13, 2019

The question is why this happens.

Yours works also but what if you have multiple wg up?

@timokau
Member

timokau commented Jan 13, 2019

The question is why this happens.

Yes, that would be good to know. @Mic92 do you have any ideas?

Yours works also but what if you have multiple wg up?

Why would that make a difference? You'd need to add a similar preSetup to each that uses DNS.

@timokau
Member

timokau commented Jan 16, 2019

After some research I thought the solution would be to add nss-lookup.target to the After and Requires sections of the service. Unfortunately that did not improve the situation.

@Mic92
Member

Mic92 commented Jan 16, 2019

@timokau between our glibc resolver and the application is nscd. Maybe that alters the errno returned to the application. Apart from that anything in our /etc/nsswitch.conf can change the results.

@timokau
Member

timokau commented Jan 16, 2019

I'm not sure I understand that. Why do you assume an error is altered?

@Mic92
Member

Mic92 commented Jan 16, 2019

@timokau because the error returned by our setup seems not to match what other distributions return, when the resolver cannot be reached: #30459 (comment)

@timokau
Member

timokau commented Jan 16, 2019

Looks like I don't know enough about the subject to contribute :/

@sjau
Author

sjau commented Feb 11, 2019

Hmmm, I just discovered that systemd.network now supports WireGuard as well:

https://manpages.debian.org/testing/systemd/systemd.netdev.5.en.html#%5BWIREGUARDPEER%5D_SECTION_OPTIONS

Maybe it's easier to just set it up there? According to the NixOS options it's not supported yet, though.

@sjau
Author

sjau commented May 23, 2019

Coming back to this ancient problem:

Last night gchristensen and I talked a bit about this problem. In the end he suggested adding

    systemd.services.wireguard-wg[...].serviceConfig.Restart = "on-failure";
    systemd.services.wireguard-wg[...].serviceConfig.RestartSec = "5s";

to configuration.nix so that the service restarts. However, the rebuild complained that restart on failure is not compatible with the "oneshot" systemd service type. (There's also an issue on the systemd tracker asking for retry on failure to be added to the oneshot type: systemd/systemd#2582.)

So I tried changing the type from oneshot to simple by changing

          Type = "oneshot";

to

          Type = "simple";

in the https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/networking/wireguard.nix#L248 file.

Since I didn't know the proper syntax to add RestartSec = "5s"; in the nix file there, I just added

    systemd.services.wireguard-wg[...].serviceConfig.RestartSec = "5s";

to the configuration.nix

After that I rebuilt and it works fine. It will now retry connecting to the wg VPNs even if the first attempt isn't successful.

Currently my notebook connects to 3 wg vpns with this and I've had several reboots (for testing) and they always came back up again.

So the question is: what is the benefit of "oneshot" compared to "simple", given that "oneshot" seems to prevent wg from coming up properly when a domain name is used as the server target instead of an IP address?

@sjau
Author

sjau commented May 26, 2019

This got fixed:

#61971

@sjau
Author

sjau commented Jun 20, 2019

While #61971 did fix the issue, later changes introduced the same problem again for the wg peers. They will not get started properly because of oneshot type and no dns available at bootup.

@stale

stale bot commented Jun 2, 2020

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 2, 2020
@sjau
Author

sjau commented Jun 4, 2020

Still important to me

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 4, 2020
@llewxam-kache

llewxam-kache commented Jul 9, 2020

Relevant to my interests as well, and I'm also a little confused as to why the systemd service runs as a 'oneshot' instead of continually retrying (maybe at exponential intervals up to 30 seconds)
If you're running this service, presumably you always want it to come up, or at least want to define how many times it tries to restart, right?

Edit:
I think a minor fix for my Ubuntu based install of WG was to change the wg-quick@.service to include StartLimitBurst=5 (maybe not needed) under [Unit]; change the Type under [Service] to simple, and add Restart=on-failure ; RestartSec=5s
systemctl daemon-reload, etc., reboot, looks happy

@stale

stale bot commented Jan 6, 2021

I marked this as stale due to inactivity.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 6, 2021
@tgunnoe
Contributor

tgunnoe commented Jan 8, 2021

still important to me as well, an easier vpn setup

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 8, 2021
@flokli
Contributor

flokli commented Jan 9, 2021

Our whole networking.wireguard setup is somewhat scripted, so it being handled in oneshot services seems appropriate - there's just no long-running daemon here, but just a bunch of iproute2/wg command invocations.

The oneshot services setting up the interface do wait for network-online.target (which means DNS should be reachable), and since 1de35c7, all the per-peer scripts depend on that script. They also set WG_ENDPOINT_RESOLUTION_RETRIES = "infinity", which should help with some intermittent DNS blips.

@sjau the original issue seems to be solved. If you still experience problems, please open a new issue with a reproducer and up-to-date logs.

@flokli flokli closed this as completed Jan 9, 2021