Unbound fails to resolve certain domains when going from 1.11 to 1.12 #360
The issue seems to be this line. Could it be that TCP Fast Open is enabled in the new version and this causes all TCP connections to fail? Domains whose DNSSEC information is larger than one packet need TCP, and they would then fail because those lookups are not possible. Or maybe it is something else related to TCP, for example a firewall rejecting TCP traffic; I am guessing one closer to you, not at the fedoraproject.org site. If I attempt to resolve the fedoraproject.org name with the latest version (1.13.0rc4) it works just fine, but then TCP succeeds for me. From the output I can see the site is in a DNSSEC algorithm rollover, so there is a double set of keys, which causes a larger response than usual. It does not fit in one packet, and thus the query falls back to TCP. This is for the fedoraproject.org query of type DNSKEY.
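To make the fallback described above concrete, here is a minimal sketch of how a client can tell that a UDP answer did not fit: the server sets the TC (truncated) bit in the DNS header, and the client is expected to retry over TCP. The helper names `build_query` and `is_truncated` are made up for this illustration and are not part of Unbound. The sketch also leaves out EDNS, which a validating resolver would normally send.

```python
import struct

DNSKEY = 48        # RR type code for DNSKEY
TC_BIT = 0x0200    # "truncated" flag in the DNS header flags word


def build_query(name: str, qtype: int = DNSKEY, qid: int = 0x1234) -> bytes:
    """Build a minimal DNS query (RD bit set, one question, no EDNS)."""
    # Header: id, flags, qdcount, ancount, nscount, arcount
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    ) + b"\x00"
    # QTYPE followed by QCLASS (1 = IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def is_truncated(response: bytes) -> bool:
    """Return True if the TC bit is set, meaning the query should be retried over TCP."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & TC_BIT)
```

Sending `build_query("fedoraproject.org")` over UDP to an authoritative server and passing the reply to `is_truncated` would show whether the TCP retry is required. If that retry is then refused or blocked, validation cannot get the keys, which matches the failure discussed here.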
Unbound 1.13.0rc4 built on Linux with --enable-tfo-client & --enable-tfo-server, error:
Without tfo, no more errors.
Thank you very much for your time and for the hint with the TCP error. As you can see I had Running the above config on unbound 1.12 and
Thanks for the do-tcp issue fix, that closes the issue. For the tfo problem, there is nothing I can really do for you other than give hints to disable tfo in that case. I'll see if I can add an error printout text; this is committed and may help others figure out their TCP Fast Open settings too.
* nlnet/master:
- Fix missing prototypes in the code. Changelog note for NLnetLabs#373
- Merge PR NLnetLabs#373 from fobser: Warning: arithmetic on a pointer to void is a GNU extension. Changelog note for NLnetLabs#335
- Merge PR NLnetLabs#335 from fobser: Sprinkle in some static to prevent missing prototype warnings. Warning: arithmetic on a pointer to void is a GNU extension.
- Fix to squelch permission denied and other errors from remote host, they are logged at higher verbosity but not on low verbosity.
- Fix NLnetLabs#371: unbound-control timeout when Unbound is not running.
- iana portlist updated.
- make depend. Code repo continues for 1.13.1 in development.
- Fix update, with write event check with streamreuse and fastopen.
- Fix for NLnetLabs#283: fix stream reuse and tcp fast open.
- Fix on windows to ignore connection failure on UDP, unless verbose.
- Fix unbound-dnstap-socket to not use log routine from interrupt handler and not print so frequently when invoked in sequence.
- Fix NLnetLabs#356: deadlock when listening tcp.
- Fix NLnetLabs#360: for the additionally reported TCP Fast Open makes TCP connections fail, in that case we print a hint that this is happening with the error in the logs. Sprinkle in some static to prevent missing prototype warnings.
unbound upgraded to 1.13 last night on our Fedora 33 server and DNS stopped resolving, many instances of the following with the different IPs associated with unbound:
Using the default configuration.
Perhaps this is a firewall issue? The message means that unbound made a TCP connection to that IP address at port 53 and the result was connection refused. It seems it is not possible to perform TCP traffic to that place.
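A quick way to check this outside of unbound is to attempt the same TCP connection by hand. The following is a small sketch only: `probe_tcp` and `classify_error` are hypothetical helper names, and the three-second timeout is an arbitrary choice. A "refused" result means something on the path actively rejected the connection, while "timeout" suggests packets are being silently dropped.

```python
import errno
import socket


def classify_error(exc: OSError) -> str:
    """Map a socket error to a coarse label."""
    if isinstance(exc, ConnectionRefusedError) or exc.errno == errno.ECONNREFUSED:
        return "refused"
    if isinstance(exc, (socket.timeout, TimeoutError)) or exc.errno == errno.ETIMEDOUT:
        return "timeout"
    return "error"


def probe_tcp(host: str, port: int = 53, timeout: float = 3.0) -> str:
    """Try a plain TCP connect and report 'connected', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return classify_error(exc)
```

Calling `probe_tcp("192.0.2.1")` (a placeholder documentation address; substitute one of the addresses from the log) from the affected machine and from a machine on another network would show whether the refusal is specific to this network path.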
No changes were made to firewalld. We do use Fail2ban. I increased the logging
So it looks like things are working? IPv6 is down, so it logs the attempts for IPv6, but perhaps you simply do not have that. The IPv4 failures at the start seem to go away quickly, perhaps caused by another device that is a firewall. Apart from the IPv4 failures at the start, there seems to be some work going on resolving queries. I am not sure what causes the IPv4 failures, and I do not see a hint about the cause in these logs; unbound is logging that the TCP connection was not successful. Likely also dig +tcp
Mostly yes, email is flowing again. We use Unbound for SpamAssassin since some DNSBlockLists have limits on free usage
Right, is there a way to disable this? I see:
As I mentioned only firewalld is running.
dig appears to resolve. FWIW this started happening after the upgrade to
Is it perhaps TCP Fast Open? It could be something related to the kernel version and complicated TCP setups. Unbound has a configure-time option for it, which may or may not be enabled depending on the package build options. If that causes hiccups for TCP, it could explain this. There may be kernel settings to disable it, for example with sysctl.
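On Linux the kernel-side setting is the `net.ipv4.tcp_fastopen` sysctl, a bitmask in which bit 0x1 enables the client side and bit 0x2 the server side. The sketch below reads and decodes it; `decode_tfo` and `read_tfo` are names invented for this example, and it assumes a Linux host where the value is exposed under /proc.

```python
from pathlib import Path

# Location of the sysctl on Linux; it is absent on other systems.
TFO_SYSCTL = Path("/proc/sys/net/ipv4/tcp_fastopen")


def decode_tfo(value: int) -> dict:
    """Decode the two basic bits of the net.ipv4.tcp_fastopen bitmask."""
    return {
        "client": bool(value & 0x1),  # outgoing connections may use Fast Open
        "server": bool(value & 0x2),  # listening sockets may accept Fast Open
    }


def read_tfo(path: Path = TFO_SYSCTL):
    """Return the decoded setting, or None if it cannot be read."""
    try:
        return decode_tfo(int(path.read_text().strip()))
    except (OSError, ValueError):
        return None
```

If `read_tfo()` reports the client side as enabled and unbound was built with the tfo client option, setting the sysctl to 0 is one way to rule Fast Open out while testing.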
Looks like it is:
I ran The log continues to be peppered with these errors. Also seeing:
From
I'm also seeing a different issue as
There was a discussion on the Fedora users mailing list in Jan 2021:
Could these issues be related?
Disabling TCP fastopen did not solve things? Then I guess it is not the problem, or not the only one. I see from your strace that connect(192.0.32.132) is fine at first and then returns the error ECONNREFUSED. This must be coming from the network stack, which must have got it from config, an error, or an upstream refusal from some middlebox (because the ordinary DNS servers are not denying access to others). So checking for middlebox interference is still the thing to do, as I noted right at the start: find things along the network path from this machine to that IP address, and check how this TCP refusal is made. Typically it is routers, firewall config (perhaps it is wrong?), or something ISP related. The dnf failure means your DNS is not working, which seems reasonable. As for the Fedora discussion, I do not see what that has to do with it. The top. failure could be related to packet size for UDP: the top DNSKEY is a big response, so it may fall back to TCP, and this is not working. You may also have UDP size problems, e.g. large UDP responses are not allowed.
Well, a simple workaround was to insert an IP address of either our University's DNS or something like 1.1.1.1. So I don't think it is a middlebox firewall. Our IT team did disable ping to addresses outside the university, if that means anything.
I guess maybe they blocked other DNS servers from TCP access? Your command above
I'm starting to think this might be the issue. I'll have to confirm with our IT group.
Here are the results of
Alas the logs keep getting filled with the following. Is there a way to disable or suppress these?
The dig command prints the same error that unbound has, but from the command line. The error is not with unbound or unbound's settings; there is an issue with your network connectivity. The dig without TCP shows you are not connecting to the actual root server, but instead get a reply from another server. There is no way to suppress the logging of this message. It would also not be a good feature to add, because the message shows that the network is malfunctioning.
So what changed from version 1.12 to 1.13 that caused this, as it didn't happen in version 1.12?
The unbound version does not matter. The error happens when you run the dig command from the command line, so the unbound program is not involved.
I'm experiencing failure to resolve certain domains after upgrading from unbound 1.11 to 1.12. I have confirmed that switching between these versions has a direct impact on the issue. Also, disabling DNSSEC validation by commenting out the trust-anchor-file will resolve the problem (at the cost of disabling DNSSEC, of course), which makes me believe that this is a DNSSEC issue. I'm running unbound 1.12 on Arch Linux using the exact config below for both tests on 1.11 and 1.12. The build script used by Arch Linux is available here: https://github.com/archlinux/svntogit-community/blob/packages/unbound/trunk/PKGBUILD
Multiple domains are affected by this issue:
Below is the log output of unbound 1.12 while trying to resolve the A record of fedoraproject.org (the log output seems to repeat a few times as unbound reattempts validation before failing).
Ignore the fact that unbound is running behind dnsmasq. All testing was done directly on unbound with dnsmasq stopped. I've cross-checked my configuration with multiple sources and have no idea what I am doing wrong, which makes me believe that maybe there is an issue with unbound, even though I couldn't find any related issue.
Please tell me if I need to provide additional information.