Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SERVFAIL problems with unbound 1.15.0 running on OpenBSD 7.1 #670

Closed
renaudallard opened this issue Apr 27, 2022 · 18 comments
Closed

SERVFAIL problems with unbound 1.15.0 running on OpenBSD 7.1 #670

renaudallard opened this issue Apr 27, 2022 · 18 comments

Comments

@renaudallard
Copy link

renaudallard commented Apr 27, 2022

Describe the bug
Since I upgraded my DNS servers to OpenBSD 7.1 with unbound 1.15.0, I have a lot of issues with DNS resolution (without changing anything in the config). I randomly get SERVFAIL for a lot of names, or something even stranger like some addresses and SERVFAIL for others (see dashlane example). The exact same config was working just fine with unbound 1.13.0 on OpenBSD 7.0. Note that it does not happen every time, it's quite sporadic.

Examples:
host dashlane.com
dashlane.com has address 65.9.82.43
dashlane.com has address 65.9.82.13
dashlane.com has address 65.9.82.36
dashlane.com has address 65.9.82.97
Host dashlane.com not found: 2(SERVFAIL)
Host dashlane.com not found: 2(SERVFAIL)

host forum.opnsense.org
Host forum.opnsense.org not found: 2(SERVFAIL)

To reproduce
Here is my partial config
server:
log-replies: yes
interface: 0.0.0.0@853
tls-port: 853
tls-service-pem: *********
tls-service-key: *********
outgoing-range: 8192
outgoing-num-tcp: 256
incoming-num-tcp: 256
serve-expired: yes
outbound-msg-retry: 5
cache-max-negative-ttl: 1
msg-cache-size: 64m
msg-cache-slabs: 4
num-queries-per-thread: 32
rrset-cache-size: 128m
rrset-cache-slabs: 4
infra-cache-slabs: 4
key-cache-slabs: 4
access-control: 0.0.0.0/0 allow
access-control: ::0/0 allow
hide-identity: yes
hide-version: yes
harden-short-bufsize: yes
harden-large-queries: yes
harden-glue: yes
harden-dnssec-stripped: yes
harden-below-nxdomain: yes
harden-referral-path: yes
use-caps-for-id: yes
qname-minimisation: yes
aggressive-nsec: yes
edns-tcp-keepalive: yes
so-reuseport: no
deny-any: yes
prefetch: yes
prefetch-key: yes
rrset-roundrobin: yes
minimal-responses: yes

Expected behavior
Names should resolve properly

System:

  • Unbound version: 1.15.0
  • OS: OpenBSD 7.1
  • unbound -V output: Version 1.15.0
    Configure line: --enable-allsymbols --with-ssl=/usr --with-libevent=/usr --with-libexpat=/usr --without-pythonmodule --with-chroot-dir=/var/unbound --with-pidfile= --with-rootkey-file=/var/unbound/db/root.key --with-conf-file=/var/unbound/etc/unbound.conf --with-username=_unbound --disable-shared --disable-explicit-port-randomisation --without-pthreads
    Linked libs: pluggable-libevent 1.4.15-stable (it uses kqueue), LibreSSL 3.5.2
    Linked modules: dns64 respip validator iterator

Additional information
I tried removing every "hardening" parameter one by one to get nearer to the stock config. And it seems that disabling use-caps-for-id solves the issue.

@wcawijngaards
Copy link
Member

For this query name I cannot reproduce issues with 0x20, the logs say the 0x20 replies are good. Looking at your config, I can make a guess, but it is pretty vague. Your num-queries-per-thread: 32 setting is very low. That should be adjusted to like 1024, or use the default. The setting could cause servfail, when the num queries are full, and another query comes in, the not fitting query can be rejected with SERVFAIL. How that could happen, you have prefetch, you query for the name, this gives you an answer, and starts the prefetch. Somehow with 0x20 enabled this takes longer than without 0x20, because some names have 0x20 fallback and then unbound has to query a number of hosts, like 5 queries or so, and that is more work. This makes the entry exist for a little longer. Then your second query comes in, but the list of queries is full now, with the previous one. And the new query is rejected. That would only happen when the list is full, so there is a bunch of lookups. And perhaps also in combination with prefetch and a name for which 0x20 backoff has to be performed.

If this happens, that should be shown in the logs. Maybe at higher verbosity settings. That makes the logs show what happens and why servfail is happening. You could try to fix the num queries per thread setting, or capture the event with verbosity in logs, or if logs show output now. There is also log-servfail: yes that prints summary information about servfail replies to log, that you could enable to get more information about these servfails.

@renaudallard
Copy link
Author

renaudallard commented Apr 28, 2022

I have been able to get the error, even without the num-queries-per-thread limit.
I have enabled the log-servfail, and I get this:
Apr 28 10:14:08 elendil unbound: [14984:0] error: SERVFAIL <f.root-servers.net. AAAA IN>: could not fetch nameservers for 0x20 fallback
Apr 28 10:14:08 elendil unbound: [14984:0] info: 127.0.0.1 a.root-servers.net. A IN NOERROR 0.000000 1 52
Apr 28 10:14:09 elendil unbound: [14984:0] error: SERVFAIL <www.hubo.be. A IN>: could not fetch nameservers for 0x20 fallback
Apr 28 10:14:09 elendil unbound: [14984:0] info: 127.0.0.1 www.hubo.be. A IN SERVFAIL 0.185887 0 29

@wcawijngaards
Copy link
Member

The error you have can happen if. Out of memory when making the structure. The added query makes a dependency cycle, eg. this lookup is a recursion for that nameserver lookup already. Or if the added query exceeds boundaries on number of allowed sub-requests, eg. no more than X allowed queries per incoming query, those types of limits. The failure that it is specifically is also logged, perhaps at a higher verbosity.

@wcawijngaards
Copy link
Member

It could be the combination of 0x20 fallback and the harden-referral-path: yes option you have. Both of those create new subrequests in the state machine, and those increase those counters that fail to add more subqueries once the limit is reached, and then the 0x20 stops because it cannot perform the fallback any more.

@wcawijngaards
Copy link
Member

wcawijngaards commented Apr 28, 2022

Also prefetch-key creates additional sub requests. Perhaps the max-negative-ttl of 1 second is also causing refetches of like nameserver information, in particular non existing ipv6 AAAA information, that causes extra subqueries to be made.

@renaudallard
Copy link
Author

The max-negative-ttl=1 was added afterwards to check if this would solve the issue, but it occured independently of that parameter.
Removing harden-referral-path also didn't change the behaviour.
I am trying to replicate with verbosity=4, but, as this is sporadic, it's not easy.

@wcawijngaards
Copy link
Member

Another possibility is that it is out of memory. I would expect log entries from that. But I have seen OpenBSD set a heap size ulimit of 256m for processes, and then unbound, with this configuration could run out of memory. It has 128m and 64m cache sizes, and those are the sizes calculated by the data stored, not the overhead of the malloc allocator, and other stuff in memory, that could slowly fill up to the max and then out of memory could happen.

@renaudallard
Copy link
Author

renaudallard commented Apr 28, 2022

Here is a log of a failure with verbosity=4
unbound.log

This is taken from a freshly fully restarted unbound not being busy.

@wcawijngaards
Copy link
Member

It looks like it fails because the query depth exceeds the target fetch policy max depth. Have you set the target-fetch-policy to something non default? That parameter needs to have a number of integers in it, otherwise this fetch would exceed the target fetch depth and fail. And not print anything in logs for it. I see no other log printout to indicate what caused it.

Otherwise I see it try a lot of ipv6 addresses, and also during 0x20 fallback. do-ip6: no could limit that.

@renaudallard
Copy link
Author

target-fetch-policy is not defined, so it's using the default.
The machine has indeed no ipv6, so putting do-ip6: no would indeed be a good idea.

@wcawijngaards
Copy link
Member

wcawijngaards commented Apr 28, 2022

Is your syslog omitting printout from the logs because of log volume? I also miss printout of lines like target fetch policy for level at startup. Perhaps a logfile: <file> directive would not omit the log lines.

@renaudallard
Copy link
Author

Log volume on this machine is very low, so I would not expect syslog to drop the lines, but I can retry with a logfile: <file>

@renaudallard
Copy link
Author

Interesting, now that I did put do-ip6: no, I am unable to reproduce the issue.

@renaudallard
Copy link
Author

Could it be somehow related to 956d7d4 ?

@wcawijngaards
Copy link
Member

No I do not think that commit is related.

I think it may be because the query_for_targets function checks for depth before a 0-return path. I would like to see the complete logs, though. That no ip6 stops it, fits with it being the issue.

@renaudallard
Copy link
Author

Here is a log (not with syslog) with both a success and a failure for www.hubo.be
unbound.log

@wcawijngaards
Copy link
Member

The commit is an attempt to fix this problem. The logs from file are very complete, thank you. The fix attempts to check for target limits after it has determined that there is work to do. So that when ipv6 fails, and then it retries the 0x20 backoff procedure, the query target limits are not exceeded, because they have already been done, allowing the 0x20 backoff procedure to continue.

jedisct1 added a commit to jedisct1/unbound that referenced this issue May 25, 2022
* nlnet/master:
  - Fix some lint type warnings.
  - Fix ede test to not use default pidfile, and use local interface.
  - Fix to silence test for ede error output to the console from the   test setup script.
  - Fix typos in config_set_option for the 'num-threads' and   'ede-serve-expired' options.
  - Fix NLnetLabs#678: [FR] modify behaviour of unbound-control rpz_enable zone,   by updating unbound-control's documentation.
  - For NLnetLabs#677: Added tls-system-cert to config parser and documentation. - Changelog note for NLnetLabs#677.
  Allow using system certificates not only on Windows
  - Fix NLnetLabs#417: prefetch and ECS causing cache corruption when used   together.
  - Fix NLnetLabs#673: DNS over TLS: error: SSL_handshake syscall: No route to   host.
  - Fix Python build in non-source directory; based on patch by   Michael Tokarev.
  Changelog entry for NLnetLabs#604: Add the basic EDE (RFC8914) cases
  Add the basic EDE (RFC8914) cases (NLnetLabs#604)
  - Fix NLnetLabs#670: SERVFAIL problems with unbound 1.15.0 running on   OpenBSD 7.1.
@lohphat
Copy link

lohphat commented Aug 12, 2022

FYI, this issue has cropped-up in a pfSense community support when several installations were reporting of slow/broken DNS lookups since it's using unbound 1.15.0 as part of the base FreeBSD install for the pfSense 22.05 release. So for us it's an upstream component problem. If you'd like to review the findings, logs, and triage process, here's the discussion thread link: https://forum.netgate.com/topic/173148/slow-dns-after-22-05

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants