Skip to content

Unbound fails to resolve certain domains when going from 1.11 to 1.12 #360

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
chrisliebaer opened this issue Dec 1, 2020 · 18 comments
Closed

Comments

@chrisliebaer
Copy link

chrisliebaer commented Dec 1, 2020

I'm experiencing failure to resolve certain domains after upgrading from unbound 1.11 to 1.12. I have confirmed that switching between these versions has direct impact on the issue. Also disabling DNSSEC validation, by commenting out the trust-anchor-file will resolve the problem (at the cost of disabling DNSSEC of course) which makes be believe that this is an DNSSEC issue.

I'm running unbound 1.12 on Arch Linux using the exact config below for both tests on 1.11 and 1.12. The build script used by Arch Linux is available here: https://github.com/archlinux/svntogit-community/blob/packages/unbound/trunk/PKGBUILD

server:
        do-daemonize: no

        # logging
        verbosity: 9
        use-syslog: yes
        log-queries: no

        # logs why dnssec validation failed
        val-log-level: 2

        # process chroots itself in /etc/unbound
        #logfile: dnslog.txt

        # root dns config
        root-hints: "/etc/unbound/named.cache"

        # this file will be copied from /etc/trusted-key by the unbound.server
        trust-anchor-file: /etc/unbound/trusted-key.key
        #auto-trust-anchor-file: /etc/unbound/trusted-key.key

        # bind to localhost for dnsmasq to redirect
        do-tcp: no
        do-ip6: no
        access-control: 127.0.0.1/8 allow
        interface: 127.0.0.1
        port: 10053

        # caching
        msg-cache-size: 4m
        msg-cache-slabs: 4

        # security
        hide-identity: no
        hide-version: no
        use-caps-for-id: no
        unwanted-reply-threshold: 10000000

        # misc
        do-not-query-localhost: yes
        prefetch: yes

        # disables mozilla (and google?) dns over https
        local-zone: "use-application-dns.net" static

Multiple domains are affected by this issue:

  • fedoraproject.org
  • unitymedia.de
  • a few other that I have no record of (took me a while to realize it's an dns issue)

Below is the log output of unbound 1.12 while trying to resolve the A record of fedoraproject.org

info: sending query: fedoraproject.org. DNSKEY IN
debug: sending to target: <fedoraproject.org.> 38.145.60.14#53
debug: dnssec status: expected
debug: EDNS lookup known=1 vs=0
debug: serviced query UDP timeout=477 msec
debug: inserted new pending reply id=3ab3
debug: opened UDP if=0 port=21121
debug: comm point start listening 10 (-1 msec)
debug: mesh_run: iterator module exit state is module_wait_reply
info: mesh_run: end 10 recursion states (9 with reply, 0 detached), 9 waiting replies, 0 recursion replies sent, 0 replies dropped, 0 states jostled out
info: 0vRDCD mod2  fedoraproject.org. DNSKEY IN
info: 1RDdc mod1 rep fedoraproject.org. A IN
info: 2RDdc mod1 rep fedoraproject.org. A IN
info: 3RDdc mod1 rep fedoraproject.org. A IN
info: 4RDdc mod1 rep fedoraproject.org. A IN
info: 5RDdc mod1 rep fedoraproject.org. A IN
info: 6RDdc mod1 rep fedoraproject.org. A IN
info: 7RDdc mod1 rep fedoraproject.org. A IN
info: 8RDdc mod1 rep fedoraproject.org. A IN
info: 9RDdc mod1 rep fedoraproject.org. A IN
debug: cache memory msg=67956 rrset=84388 infra=10260 val=68606 subnet=74504
debug: svcd callbacks end
debug: answer cb
debug: Incoming reply id = 3ab3
debug: Incoming reply addr = ip4 38.145.60.14 port 53 (len 16)
debug: lookup size is 1 entries
debug: received udp reply.
debug: udp message[46:0] 3AB3861000010000000000010D6665646F726170726F6A656374036F726700003000010000291000000080000000
debug: outnet handle udp reply
debug: measured roundtrip at 134 msec
debug: initiate TCP query EDNS
debug: close of port 21121
debug: close fd 10
debug: tcp error for address ip4 38.145.60.14 port 53 (len 16)
debug: svcd callbacks start
debug: worker svcd callback for qstate 0x55c28f1745a0
debug: mesh_run: start
debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_noreply
info: iterator operate: query fedoraproject.org. DNSKEY IN
debug: process_response: new external response event
debug: iter_handle processing q with state QUERY RESPONSE STATE
debug: query response was timeout
debug: iter_handle processing q with state QUERY TARGETS STATE
info: processQueryTargets: fedoraproject.org. DNSKEY IN
debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 7
info: DelegationPoint<fedoraproject.org.>: 4 names (0 missing), 6 addrs (6 result, 0 avail) cacheNS
info:   ns05.fedoraproject.org. * A AAAA
info:   ns-iad01.fedoraproject.org. * A
info:   ns02.fedoraproject.org. * A AAAA
info:   ns-iad02.fedoraproject.org. * A
debug:    ip4 38.145.60.14 port 53 (len 16)
debug:    ip6 2610:28:3090:3001:dead:beef:cafe:fed5 port 53 (len 28)
debug:    ip4 152.19.134.139 port 53 (len 16)
debug:    ip4 38.145.60.13 port 53 (len 16)
debug:    ip6 2001:4178:2:1269:dead:beef:cafe:fed5 port 53 (len 28)
debug:    ip4 85.236.55.10 port 53 (len 16)
debug: servselect ip4 38.145.60.14 port 53 (len 16)
debug:    rtt=481
debug: servselect ip4 152.19.134.139 port 53 (len 16)
debug:    rtt=436
debug: servselect ip4 38.145.60.13 port 53 (len 16)
debug:    rtt=459
debug: servselect ip4 85.236.55.10 port 53 (len 16)
debug:    rtt=306
debug: selrtt 306
info: sending query: fedoraproject.org. DNSKEY IN
debug: sending to target: <fedoraproject.org.> 85.236.55.10#53
debug: dnssec status: expected
debug: EDNS lookup known=1 vs=0
debug: serviced query UDP timeout=306 msec
debug: inserted new pending reply id=f118
debug: opened UDP if=0 port=35132
debug: comm point start listening 10 (-1 msec)
debug: mesh_run: iterator module exit state is module_wait_reply
info: mesh_run: end 10 recursion states (9 with reply, 0 detached), 9 waiting replies, 0 recursion replies sent, 0 replies dropped, 0 states jostled out
info: 0vRDCD mod2  fedoraproject.org. DNSKEY IN
info: 1RDdc mod1 rep fedoraproject.org. A IN
info: 2RDdc mod1 rep fedoraproject.org. A IN
info: 3RDdc mod1 rep fedoraproject.org. A IN
info: 4RDdc mod1 rep fedoraproject.org. A IN
info: 5RDdc mod1 rep fedoraproject.org. A IN
info: 6RDdc mod1 rep fedoraproject.org. A IN
info: 7RDdc mod1 rep fedoraproject.org. A IN
info: 8RDdc mod1 rep fedoraproject.org. A IN
info: 9RDdc mod1 rep fedoraproject.org. A IN
debug: cache memory msg=67956 rrset=84388 infra=10260 val=68606 subnet=74504
debug: svcd callbacks end
debug: answer cb
debug: Incoming reply id = f118
debug: Incoming reply addr = ip4 85.236.55.10 port 53 (len 16)
debug: lookup size is 1 entries
debug: received udp reply.
debug: udp message[46:0] F118861000010000000000010D6665646F726170726F6A656374036F726700003000010000291000000080000000
debug: outnet handle udp reply
debug: measured roundtrip at 22 msec
debug: initiate TCP query EDNS
debug: close of port 35132
debug: close fd 10
debug: tcp error for address ip4 85.236.55.10 port 53 (len 16)
debug: svcd callbacks start
debug: worker svcd callback for qstate 0x55c28f1745a0
debug: mesh_run: start
debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_noreply
info: iterator operate: query fedoraproject.org. DNSKEY IN
debug: process_response: new external response event
debug: iter_handle processing q with state QUERY RESPONSE STATE
debug: query response was timeout
debug: iter_handle processing q with state QUERY TARGETS STATE
info: processQueryTargets: fedoraproject.org. DNSKEY IN
debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 8
info: DelegationPoint<fedoraproject.org.>: 4 names (0 missing), 6 addrs (6 result, 0 avail) cacheNS
info:   ns05.fedoraproject.org. * A AAAA
info:   ns-iad01.fedoraproject.org. * A
info:   ns02.fedoraproject.org. * A AAAA
info:   ns-iad02.fedoraproject.org. * A
debug:    ip4 38.145.60.14 port 53 (len 16)
debug:    ip6 2610:28:3090:3001:dead:beef:cafe:fed5 port 53 (len 28)
debug:    ip4 152.19.134.139 port 53 (len 16)
debug:    ip4 38.145.60.13 port 53 (len 16)
debug:    ip6 2001:4178:2:1269:dead:beef:cafe:fed5 port 53 (len 28)
debug:    ip4 85.236.55.10 port 53 (len 16)
debug: servselect ip4 85.236.55.10 port 53 (len 16)
debug:    rtt=252
debug: servselect ip4 38.145.60.13 port 53 (len 16)
debug:    rtt=459
debug: servselect ip4 152.19.134.139 port 53 (len 16)
debug:    rtt=436
debug: servselect ip4 38.145.60.14 port 53 (len 16)
debug:    rtt=481
debug: selrtt 252
info: sending query: fedoraproject.org. DNSKEY IN
debug: sending to target: <fedoraproject.org.> 85.236.55.10#53
debug: dnssec status: expected
debug: EDNS lookup known=1 vs=0
debug: serviced query UDP timeout=252 msec
debug: inserted new pending reply id=cc8f
debug: opened UDP if=0 port=14758
debug: comm point start listening 10 (-1 msec)
debug: mesh_run: iterator module exit state is module_wait_reply
info: mesh_run: end 10 recursion states (9 with reply, 0 detached), 9 waiting replies, 0 recursion replies sent, 0 replies dropped, 0 states jostled out
info: 0vRDCD mod2  fedoraproject.org. DNSKEY IN
info: 1RDdc mod1 rep fedoraproject.org. A IN
info: 2RDdc mod1 rep fedoraproject.org. A IN
info: 3RDdc mod1 rep fedoraproject.org. A IN
info: 4RDdc mod1 rep fedoraproject.org. A IN
info: 5RDdc mod1 rep fedoraproject.org. A IN
info: 6RDdc mod1 rep fedoraproject.org. A IN
info: 7RDdc mod1 rep fedoraproject.org. A IN
info: 8RDdc mod1 rep fedoraproject.org. A IN
info: 9RDdc mod1 rep fedoraproject.org. A IN
debug: cache memory msg=67956 rrset=84388 infra=10260 val=68606 subnet=74504
debug: svcd callbacks end
debug: answer cb
debug: Incoming reply id = cc8f
debug: Incoming reply addr = ip4 85.236.55.10 port 53 (len 16)
debug: lookup size is 1 entries
debug: received udp reply.
debug: udp message[46:0] CC8F861000010000000000010D6665646F726170726F6A656374036F726700003000010000291000000080000000
debug: outnet handle udp reply
debug: measured roundtrip at 22 msec
debug: initiate TCP query EDNS
debug: close of port 14758
debug: close fd 10

(The log output seems to repeat a few times in an attempt by unbound to reattempt validation before failing)

Ignore the fact that unbound is running behind an dnsmasq. All testing was done directly on unbound with dnsmasq stopped. I've cross-checked my configuration with multiple sources and have no idea what I am doing wrong, which makes me believe that maybe there is an issue with unbound. Even though I couldn't find any related issue.

Please tell me if I need to provide additional information.

@wcawijngaards
Copy link
Member

The issue seems to be this line debug: tcp error for address ip4 85.236.55.10 port 53 (len 16). It is happening because DNSSEC is enabled, and thus larger answers need to be retrieved, just before that the UDP query returns +TC for tcp backoff. But then the tcp connection fails.

Could it be that in the new version tcp fast open is enabled and this causes all tcp connections to fail? And then domains where the DNSSEC information is a bit larger then one packet need TCP but they fail because lookups are not possible. Or maybe something else related to TCP, for example a firewall, I am guessing closer to you not the fedoraproject.org site, is rejecting TCP traffic.

If I attempt to resolve the fedoraproject.org name with the latest version (1.13.0rc4) it works just fine, but then the TCP succeeds for me. From the output I can see the site is in DNSSEC algorithm rollover, and because of that there is a double set of keys, and this causes a larger output than usual. That makes it not fit in one packet, and thus it falls back to TCP. This is for the fedoraproject.org query of type DNSKEY.

@pengelana
Copy link

pengelana commented Dec 1, 2020

Unbound 1.13.0rc4 build on Linux with --enable-tfo-client & --enable-tfo-server, error:

unbound: message repeated 95 times: [ [6702:1] error: read (in tcp s): Transport endpoint is not connected for xxx.xxx.xxx.xxx port 53

without tfo, no more errors.

@chrisliebaer
Copy link
Author

chrisliebaer commented Dec 1, 2020

Thank you very much for your time and for the hint with the TCP error. As you can see I had do-tcp: no set. I tried changing it to yes and can confirm that this is causing the TCP error. No other external factors involed.

Running the above config on unbound 1.12 and do-tcp: yes works without problems. I can't really tell if this is just me shooting myself in the foot by disabling tcp or if this is something that should be caught by unbound itself, so I'm leaving the field to the experts at this point.

@wcawijngaards
Copy link
Member

Thanks for the do-tcp issue fix, that closes the issue. For the tfo fix, there is nothing I can really do for you, but give hints to disable tfo in that case. I'll see if I can add a error printout text, and this is committed and may help others figure out their TCP Fast Open settings too.

jedisct1 added a commit to jedisct1/unbound that referenced this issue Dec 11, 2020
* nlnet/master:
  - Fix missing prototypes in the code.
  Changelog note for NLnetLabs#373 - Merge PR NLnetLabs#373 from fobser: Warning: arithmetic on a pointer to void   is a GNU extension.
  Changelog note for NLnetLabs#335 - Merge PR NLnetLabs#335 from fobser: Sprinkle in some static to prevent   missing prototype warnings.
  Warning: arithmetic on a pointer to void is a GNU extension.
  - Fix to squelch permission denied and other errors from remote host,   they are logged at higher verbosity but not on low verbosity.
  - Fix NLnetLabs#371: unbound-control timeout when Unbound is not running.
  - iana portlist updated.
  - make depend.
  Code repo continues for 1.13.1 in development.
  - Fix update, with write event check with streamreuse and fastopen.
  - Fix for NLnetLabs#283: fix stream reuse and tcp fast open.
  - Fix on windows to ignore connection failure on UDP, unless verbose.
  - Fix unbound-dnstap-socket to not use log routine from interrupt   handler and not print so frequently when invoked in sequence.
  - Fix NLnetLabs#356: deadlock when listening tcp.
  - Fix NLnetLabs#360: for the additionally reported TCP Fast Open makes TCP   connections fail, in that case we print a hint that this is   happening with the error in the logs.
  Sprinkle in some static to prevent missing prototype warnings.
@LinuxPersonEC
Copy link

unbound upgraded to 1.13 last night on our Fedora 33 server and DNS stopped resolving, many instances of the following with the different IPs associated with unbound:

unbound[3169826]: [3169826:0] error: read (in tcp s): Connection refused for 199.253.250.68 port 53

Using the default configuration.

@wcawijngaards
Copy link
Member

Perhaps this is a firewall issue? It means unbound makes a tcp connection to that IP address at port 53 but the tcp result was connection refused. It is not possible to perform tcp traffic to that place, it seems.

@LinuxPersonEC
Copy link

Perhaps this is a firewall issue? It means unbound makes a tcp connection to that IP address at port 53 but the tcp result was connection refused. It is not possible to perform tcp traffic to that place, it seems.

No changes were made to firewalld. We do use Fail2ban. I increased the logging verbosity to 2 and set val-log-level: 1

Feb 22 10:30:02 storm unbound[3180577]: [3180577:0] error: tcp connect: Connection refused for 199.7.91.13 port 53
Feb 22 10:30:02 storm unbound[3180577]: [3180577:0] error: tcp connect: Connection refused for 192.33.4.12 port 53
Feb 22 10:30:02 storm unbound[3180577]: [3180577:0] error: tcp connect: Connection refused for 199.9.14.201 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving 251.132.160.203.in-addr.arpa. PTR IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving 251.132.160.203.in-addr.arpa. PTR IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: priming . IN NS
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving . DNSKEY IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving . DNSKEY IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: priming . IN NS
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: response for . NS IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: reply from <.> 198.41.0.4#53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: query response was ANSWER
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: priming successful for . NS IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: priming successful for . NS IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving . NS IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: prime trust anchor
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:2f::f port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:7fe::53 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving a.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving a.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving b.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving b.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:503:ba3e::2:30 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:7fd::1 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving c.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving c.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:9f::42 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:503:ba3e::2:30 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:a8::e port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:2f::f port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:200::b port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:200::b port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:200::b port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving d.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving d.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:12::d0d port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:503:c27::2:30 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:a8::e port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:2d::d port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving e.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving e.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:503:c27::2:30 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:9f::42 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:503:ba3e::2:30 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving f.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving f.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving g.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving g.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:503:ba3e::2:30 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:200::b port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:7fe::53 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:12::d0d port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:2d::d port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:dc3::35 port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: error sending query to auth server 2001:500:200::b port 53
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving h.root-servers.net. AAAA IN
Feb 22 10:30:10 storm unbound[3180577]: [3180577:0] info: resolving h.root-servers.net. AAAA IN

@wcawijngaards
Copy link
Member

So it looks like things are working? Ipv6 is down, so it logs the attempts for IPv6 but perhaps you simply do not have that. The IPv4 failures at the start seem to go away quickly, perhaps like another device that is a firewall. Apart from the IPv4 failures at the start, there seems to be some work going on resolving queries.

Not sure what causes the IPv4 failures, but I do not see a hint what causes it from these logs; unbound is logging that the TCP connection was not successful. Likely also dig +tcp @ipaddress to those addresses fail. The IPv4 is like the one for b.root-servers.net, so I think nothing is wrong to connect to there at the server, since it works for others, so there must be some sort of network device interfering with connectivity close, or closer, to you. Like the firewall disallowing connections.

@LinuxPersonEC
Copy link

So it looks like things are working?

Mostly yes, email is flowing again. We use Unbound for SpamAssassin since some DNSBlockLists have limits on free usage

Ipv6 is down, so it logs the attempts for IPv6 but perhaps you simply do not have that.

Right is there a way to disable this? I see:

        # Enable IPv6, "yes" or "no".
        do-ip6: no

The IPv4 failures at the start seem to go away quickly, perhaps like another device that is a firewall. Apart from the IPv4 failures at the start, there seems to be some work going on resolving queries.

As I mentioned only firewalld is running.

Not sure what causes the IPv4 failures, but I do not see a hint what causes it from these logs; unbound is logging that the TCP connection was not successful. Likely also dig +tcp @ipaddress to those addresses fail. The IPv4 is like the one for b.root-servers.net, so I think nothing is wrong to connect to there at the server, since it works for others, so there must be some sort of network device interfering with connectivity close, or closer, to you. Like the firewall disallowing connections.

dig appears to resolve. FWIW this started happening after the upgrade to unbound-1.13.1-1.fc33.x86_64. Downgrading fixes this as well.

Feb 22 02:33:19 dsm unbound[2324728]: [2324728:2] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Feb 22 02:33:19 dsm unbound[2324728]: [2324728:2] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Feb 22 02:33:19 dsm unbound[2324728]: [2324728:2] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Feb 22 02:33:19 dsm unbound[2324728]: [2324728:2] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
dig +tcp 200.10.60.53

; <<>> DiG 9.11.27-RedHat-9.11.27-1.fc33 <<>> +tcp 200.10.60.53
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 16844
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;200.10.60.53.                  IN      A

;; AUTHORITY SECTION:
.                       3600    IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2021022200 1800 900 604800 86400

;; Query time: 95 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mon Feb 22 11:48:49 EST 2021
;; MSG SIZE  rcvd: 116

@wcawijngaards
Copy link
Member

Is it perhaps TCP fast open? Something related to the kernel version, and complicated TCP setups. Unbound has a configure time option that would be different depending on the package options, and that could be enabled. If that has hiccups for TCP that could explain it. There may be kernel settings to disable it, with like sysctl.

@LinuxPersonEC
Copy link

Is it perhaps TCP fast open?

Looks like it is:

cat /proc/sys/net/ipv4/tcp_fastopen
1
sysctl net.ipv4.tcp_fastopen
net.ipv4.tcp_fastopen = 1

Something related to the kernel version, and complicated TCP setups.

5.9.16-200.fc33.x86_64

Unbound has a configure time option that would be different depending on the package options, and that could be enabled. If that has hiccups for TCP that could explain it. There may be kernel settings to disable it, with like sysctl.

I ran echo 0 > /proc/sys/net/ipv4/tcp_fastopen

The log continues to be peppered with these errors. Also seeing:

Feb 23 09:35:02 storm unbound[3229864]: [3229864:3] error: read (in tcp s): Connection refused for 203.99.25.1 port 53
Feb 23 09:35:02 storm unbound[3229864]: [3229864:3] info: Missing DNSKEY RRset in response to DNSKEY query.
Feb 23 09:35:02 storm unbound[3229864]: [3229864:3] info: Missing DNSKEY RRset in response to DNSKEY query.
Feb 23 09:35:02 storm unbound[3229864]: [3229864:3] info: resolving top. DNSKEY IN

From strace:

socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 33
setsockopt(33, SOL_IP, IP_MTU_DISCOVER, [5], 4) = 0
bind(33, {sa_family=AF_INET, sin_port=htons(44924), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
fcntl(33, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(33, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
epoll_ctl(21, EPOLL_CTL_ADD, 33, {EPOLLIN, {u32=33, u64=33}}) = 0
sendto(33, "\375\10\0\0\0\1\0\0\0\0\0\0\0\0\6\0\1", 17, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.0.32.132")}, 16) = 17
epoll_wait(21, [{EPOLLIN, {u32=33, u64=33}}], 32, 100) = 1
recvfrom(33, "\375\10\204\0\0\1\0\1\0\r\0\r\0\0\6\0\1\0\0\6\0\1\0\1Q\200\0@\1a\froot-servers\3net\0\5nstld\fverisign-grs\3com\0xvZ\\\0\0\7\10\0\0\3\204\0\t:\200\0\1Q\200\0\0\2\0\1\0\7\351\0\0\2\300\34\0\0\2\0\1\0\7\351\0\0\4\1b\300\36\0\0\2\0\1\0\7\351\0\0\4\1c\300\36\0\0\2\0\1\0\7\351\0\0\4\1d\300\36\0\0\2\0\1\0\7\351\0\0\4\1e\300\36\0\0\2\0\1\0\7\351\0\0\4\1f\300\36\0\0\2\0\1\0\7\351\0\0\4\1g\300\36\0\0\2\0\1\0\7\351\0\0\4\1h\300\36\0\0\2\0\1\0\7\351\0\0\4\1i\300\36\0\0\2\0\1\0\7\351\0\0\4\1j\300\36\0\0\2\0\1\0\7\351\0\0\4\1k\300\36\0\0\2\0\1\0\7\351\0\0\4\1l\300\36\0\0\2\0\1\0\7\351\0\0\4\1m\300\36\300\34\0\1\0\1\0\7\351\0\0\4\306)\0\4\300t\0\1\0\1\0\7\351\0\0\4\307\t\16\311\300\203\0\1\0\1\0\7\351\0\0\4\300!\4\f\300\222\0\1\0\1\0\7\351\0\0\4\307\7[\r\300\241\0\1\0\1\0\7\351\0\0\4\300\313\346\n\300\260\0\1\0\1\0\7\351\0\0\4\300\5\5\361\300\277\0\1\0\1\0\7\351\0\0\4\300p$\4\300\316\0\1\0\1\0\7\351\0\0\4\306a\2765\300\335\0\1\0\1\0\7\351\0\0\4\300$\224\21\300\354\0\1\0\1\0\7\351\0\0\4\300:\200\36\300\373\0\1\0\1\0\7\351\0\0\4\301\0\16\201\301\n\0\1\0\1\0\7\351\0\0\4\307\7S*\301\31\0\1\0\1\0\7\351\0\0\4\312\f\33!", 65552, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.0.32.132")}, [128->16]) = 493
epoll_ctl(21, EPOLL_CTL_DEL, 33, 0x7fff16c6a8dc) = 0
close(33)                               = 0
getpid()                                = 3570006
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 33
setsockopt(33, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl(33, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(33, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
connect(33, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.0.32.132")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(21, EPOLL_CTL_ADD, 33, {EPOLLOUT, {u32=33, u64=33}}) = 0
epoll_wait(21, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=33, u64=33}}], 32, 10000) = 1
getsockopt(33, SOL_SOCKET, SO_ERROR, [ECONNREFUSED], [4]) = 0
getpid()                                = 3570006
sendto(12, "<27>Feb 23 09:49:52 unbound: [3570006:0] error: tcp connect: Connection refused for 192.0.32.132 port 53", 104, MSG_NOSIGNAL, NULL, 0) = 104
epoll_ctl(21, EPOLL_CTL_DEL, 33, 0x7fff16c6ab4c) = 0
close(33)                               = 0
getpid()                                = 3570006
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 33
setsockopt(33, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl(33, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(33, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
connect(33, {sa_family=AF_INET6, sin6_port=htons(53), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2620:0:2d0:202::132", &sin6_addr), sin6_scope_id=0}, 28) = -1 ENETUNREACH (Network is unreachable)
close(33)                               = 0

I'm also seeing a different issue as dnf update is failing with:

dnf update
Fedora Modular 33 - x86_64 - Updates                                                                     0.0  B/s |   0  B     00:00
Errors during downloading metadata for repository 'updates-modular':
  - Curl error (6): Couldn't resolve host name for https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f33&arch=x86_64 [Could not resolve host: mirrors.fedoraproject.org]

There was a discussion on the Fedora users mailing list in Jan 2021,:

I don't know if this will have any effect, but since you're not
running/using systemd-resolved then it may be
a good idea to remove "resolve" from nsswitch.conf.

Noted. Many thanks. Next thing to test. Good to know.

Could these issues be related?

@wcawijngaards
Copy link
Member

Disabling tcp fastopen did not solve things? I guess it is not a problem, or not the only one. I see from your strace that connect(192.0.32.132) is fine, and then returns an error ECONNREFUSED. This must be coming from the network stack. It must have got that from config, an error, or upstream refusal from some middlebox (because the ordinary DNS server are not denying access to others). So, checking middlebox interference is still the thing, that I already noted straight at the start, find things along the network path from this machine to that IP address, and check how this TCP refusal is made. Typically routers, firewall config (perhaps it is wrong?), perhaps ISP related.

The dnf failure means your DNS is not working, that seems reasonable. The fedora discussion, do not see what that has to do with it. The top. failure, that could be related to packet size for UDP, top dnskey is a big response, and it may fall back to TCP, and this is not working. You may also have UDP size problems, eg. it does not allow large size UDP responses.

@LinuxPersonEC
Copy link

Disabling tcp fastopen did not solve things? I guess it is not a problem, or not the only one. I see from your strace that connect(192.0.32.132) is fine, and then returns an error ECONNREFUSED. This must be coming from the network stack. It must have got that from config, an error, or upstream refusal from some middlebox (because the ordinary DNS server are not denying access to others). So, checking middlebox interference is still the thing, that I already noted straight at the start, find things along the network path from this machine to that IP address, and check how this TCP refusal is made. Typically routers, firewall config (perhaps it is wrong?), perhaps ISP related.

Well a simple work around was to insert an IP address of either our University's DNS or something like 1.1.1.1. So I don't think it'd be a middlebox firewall. Our IT team did disable ping to outside university addresses, if that means anything.

@wcawijngaards
Copy link
Member

I guess maybe they blocked other DNS servers from TCP access?

Your command above dig +tcp 200.10.60.53 gave the wrong output you need dig +tcp @200.10.60.53 because now you looked up the name 200... instead of a tcp connection to that IP address of a root server.

@LinuxPersonEC
Copy link

I guess maybe they blocked other DNS servers from TCP access?

I'm starting to think this might be the issue. I'll have to confirm with out IT group.

Your command above dig +tcp 200.10.60.53 gave the wrong output you need dig +tcp @200.10.60.53 because now you looked up the name 200... instead of a tcp connection to that IP address of a root server.

Here are the results of dig with and without the +tcp:

dig +tcp @200.10.60.53
;; Connection to 200.10.60.53#53(200.10.60.53) for . failed: connection refused.
dig  @200.10.60.53

; <<>> DiG 9.11.27-RedHat-9.11.27-1.fc33 <<>> @200.10.60.53
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 9707
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: ebeaf577972d7e3f010000006039038ac0a47209adc3203c (good)
;; QUESTION SECTION:
;.                              IN      NS

;; Query time: 157 msec
;; SERVER: 200.10.60.53#53(200.10.60.53)
;; WHEN: Fri Feb 26 09:19:54 EST 2021
;; MSG SIZE  rcvd: 56

Alas the logs keep getting filled with the following. Is there a way to disable or suppress these?

Feb 26 09:20:06 storm unbound[3570006]: [3570006:1] reply: 127.0.0.1 mxb-0019cd01.gslb.pphosted.com. A IN NOERROR 0.068721 0 64
Feb 26 09:20:06 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Feb 26 09:20:06 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Feb 26 09:20:06 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:06 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:06 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Feb 26 09:20:06 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 199.253.183.183 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 199.253.183.183 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 199.253.183.183 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 199.253.183.183 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 199.253.183.183 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 199.253.183.183 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 199.180.182.53 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 199.180.182.53 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 199.180.182.53 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:3] error: read (in tcp s): Connection refused for 199.180.182.53 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:0] error: read (in tcp s): Connection refused for 203.119.86.101 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 196.216.169.10 port 53
Feb 26 09:20:07 storm unbound[3570006]: [3570006:1] error: read (in tcp s): Connection refused for 196.216.169.10 port 53

@wcawijngaards
Copy link
Member

The dig command prints the same error that unbound has, but from the commandline. The error is not with unbound, or unbound's settings, but because there is an issue with your network connectivity.

The dig without tcp shows you are not connecting to the actual root server, but instead get a reply from another server.

There is no way to suppress the error log of this message. Also, that would not be a good feature to add it; it shows that the network is malfunctioning.

@LinuxPersonEC
Copy link

The dig command prints the same error that unbound has, but from the commandline. The error is not with unbound, or unbound's settings, but because there is an issue with your network connectivity.

So what changed from version 1.12 to 1.13 that caused this, as it didn't happen in version 1.12?

@wcawijngaards
Copy link
Member

The unbound version does not matter. The error happens when you do the dig command from the commandline, so the unbound program is not involved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants