Forwarding to mixed block of IPv4 and IPv6 name server addresses effectively doubles the query response time #899
Comments
There seem to be a bunch of things going on, so some comments on each part could be useful. I'll put them down one by one. For Unbound, normally, queries are processed in parallel. The server selection algorithm treats IPv4 and IPv6 equally, so that should be fine. When a query gets an answer from the upstream, it is immediately provided to the client; it does not wait for other client queries. In some cases it may wait for internally generated queries to validate DNSSEC security, but it does not wait until another client has been answered.
The pcap seems to show, from what I can tell on that screenshot, that there is first traffic over IPv4 and later traffic over IPv6. This would be a normal sequence of events if, for example, the following happens. The client asks a query; randomly the IPv4 upstream is chosen, roughly a one in four chance, and that provides an answer. The answer is returned to the client. The client then asks another query. Randomly the IPv6 upstream is chosen to answer it, and this then provides an answer, which is also returned to the client. The reason it is not in parallel is that the client, the querier, is making queries in sequence, for example asking for an A record and then later for a AAAA record, or for other query types than IPv4 and IPv6 addresses. This is the standard behaviour of some queriers, notably the stub resolvers behind resolv.conf and systemd-resolved, but there may also be options to have them make parallel queries.

Unbound has options to prefer the fastest server from the set. But normally this is not needed, and I think it is also not needed here. The default has a mix of randomness and preferring faster targets over unresponsive targets, and also filters out targets with unusual response times, and is probably fine.

Perhaps the response time issue that is seen is not the ping-time responsiveness, but in fact the DNS resolution time. When a query is not in the cache, the upstream server has to look up the data for the destination and this takes time, much longer than the fast ping time that the upstream has. If this is considered an issue, what you could do is not use forwarding, but instead have unbound run as a full resolver and make these lookups itself. Then unbound is the one that is slow and takes that time to look up the data, but that behaviour is then visible in the logs of Unbound, instead of hidden behind the upstream forwarder. That separates the concern of upstream forwarder speed from the resolution speed, or at least makes it visible in the logs. That said, there are options to make unbound prefer the fastest ping time, such as fast-server-permil and fast-server-num.
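A minimal sketch of what that preference could look like in unbound.conf (the values here are only illustrative, not a recommendation):

    server:
        # send this many queries out of 1000 to the set of fastest servers
        fast-server-permil: 900
        # how many of the fastest servers to pick from (the documented default is 3)
        fast-server-num: 3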
It is possible to get information from unbound that could be easier to read, or contain more information, than the pcap, for example by raising the log verbosity. It could also be useful to have more control over the incoming queries, in some way. Not sure what is used now; perhaps use a commandline lookup tool, and then control the sequential or parallel lookups by typing them one after the other, or at the same time in a different commandline terminal? Also the unbound logs would show what is happening inside the TLS channel, but that content does not seem to be a problem right now.
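For example, a sketch of turning up logging (verbosity 5 is what ends up being used later in this thread):

    server:
        verbosity: 5

    # or changed at runtime, without restarting the server:
    unbound-control verbosity 5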
Thanks to all.
In the pcap example above it was a single client asking a single query, with the pcap taken on the WAN side of the router/firewall. The domain name was kia.com, as I knew this would not be in the cache:
Regarding the configuration:

Click: Actual unbound.conf file (redacted)

    ##########################
    # Unbound Configuration
    ##########################

    # Server configuration
    server:
    chroot: /var/unbound
    num-threads: 4
    prefetch: yes

    # Statistics
    # Unbound Statistics
    statistics-interval: 0

    # TLS Configuration
    tls-cert-bundle: "/etc/ssl/cert.pem"

    # Interface IP addresses to bind to
    interface: 172.16.1.1

    # Outgoing interfaces to be used
    outgoing-interface: 93.redacted

    # DNS Rebinding
    # For DNS Rebinding prevention
    private-address: 127.0.0.0/8

    # Access lists
    include: /var/unbound/access_lists.conf

    # Static host entries
    include: /var/unbound/host_entries.conf

    # dhcp lease entries
    include: /var/unbound/dhcpleases_entries.conf

    # Domain overrides
    include: /var/unbound/domainoverrides.conf

    # Forwarding
    forward-zone:

    # Unbound custom options
    server:

    # Remote Control Config
    include: /var/unbound/remotecontrol.conf

[23.05-RELEASE][admin@Router-8.redacted.me]/root:
Yes, that should be possible at the verbosity level requested.
I have been using dig from either a LAN client or from the pfSense/BSD CLI. I have a further option to perform a look-up via the pfSense GUI directly, but I presume that offers nothing different under the hood. Is dig sufficient for testing, and is there an easy way to disable the cache temporarily (i.e. without killing the 'warm' cache contents) so I don't have to dream up unlikely domain names?
How surprising that it makes two queries to resolve the domain name. If it gets one query, I would expect only one upstream query. The logs could tell what is going on. Or maybe it is a CNAME, and then it resolves the target of the CNAME, in which case it looks very normal. The config looks okay; nothing I would note.

The unbound-control utility has a command to flush the cache for a name; the name is then resolved again the next time it is asked for. It is not making two concurrent queries then, there is only one query coming in, and this is getting answered. dig should be great for testing. It is also possible to use dig or another commandline tool to make queries directly towards the upstream servers, at their IPv4 and IPv6 addresses, and then it shows what they answer.

Is it just this one time, or is IPv6 a lot slower than IPv4? I would imagine that the upstream resolver is located in roughly the same place for both, so the time increase should only be a bit. If the IPv6 connection has a lot of lag on it, that is something that can be attributed to tunnels, perhaps.
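For example, roughly (using the domain name and the Quad9 addresses from this thread):

    # drop one name from the cache so the next query resolves it afresh
    unbound-control flush kia.com

    # query the forwarders directly over IPv4 and IPv6 and compare the timings
    dig @9.9.9.9 kia.com
    dig @2620:fe::fe kia.com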
The screenshot of timers shows that the ping times are fine. But the IPv6 addresses, both of them, have an RTO that is much larger than the RTT; this is an increase because of recent timeouts, it has exponentially backed off. The IPv4 addresses are fine and do not seem to have a lot of timeouts, even though a 1 is listed in the timeout-other column. The ping is fine when it connects, but the IPv6 addresses have timeouts; perhaps this is causing the slowdown.
It could be exactly that, but I am not sure how these counters work. If they are cumulative, the errors over the ~12 days the router has been 'up' look inconsequential. Another thing I don't understand in the table is the reported ping time, which varies between 20 and 28 ms. If I ping any of those servers directly they return an average of 7.522 ms. I have run PingPlotter against them and the trace is reassuringly flat. Pinging from the router itself to 9.9.9.9 directly provides these figures:
The values are those expected of my connection but not representative of the table extracted from unbound. Pinging from a wired client behind the router is no different (aside from the TTL & added latency of the extra hop):
Of course, the unbound data requested of me earlier can only help (waiting for a suitable network opportunity), but any clarity on how to correctly interpret the data in the unbound table above would be helpful. ☕️
The higher value of the ping is likely because the TLS handshake is counted in it. If forward-tls-upstream was turned off, unbound would use UDP and would then see those low ping values too. Yes, the logs could be useful. The timeouts look to be a problem to me, because the high value of 600+ msec compared to the roundtrip time of 70-80 msec is about 8x, or 3 timeouts that happened in sequence: the RTO doubles on every timeout, so roughly 76 -> 152 -> 304 -> 608 msec. So both of them have had 3 connections fail. Now the RTO is so large that unbound is likely no longer choosing them and is using only the IPv4 addresses by preference.
@wcawijngaards our responses crossed each other, but your comments make my table above look even weirder. I'll let you ponder them, but I suspect that you probably don't have enough data from me just yet to fully explain them. ☕️
The low value is very likely due to lack of information; there are almost no observations of a roundtrip, so the estimate is mostly variance and the average is just uncertain. That is not really a problem. The RTT value is the ping value and the variance combined, and the RTO value has timeout backoff applied to it. And no, the timeouts are only kept track of temporarily, and this table shows timeouts that have happened recently.
@wcawijngaards Thanks for the subtle prod over TCP, I could have added +tcp to the ping command but, well, I didn't - Doh! I'm now fighting the
This makes simple & repeatable testing somewhat challenging. The pfSense config of unbound has this at the very end:
The file referenced contains this:
So it does not reflect the example in the unbound documentation and, in this case, it does not work. Any ideas on what needs fixing? ☕️ [I seem to be dragging you around with my unfamiliarity with unbound & pfSense; I do apologise and appreciate the help.]
The unbound-control tool seems to be reading from a different config file. The -c option can be used to point it at the config file that the unbound server is actually using.
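For example (a sketch, using the config path that appears later in this thread):

    unbound-control -c /var/unbound/unbound.conf status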
I tried the -c command to point at the actual
I also checked the
I'm not sure if this is relevant or not but the
I'm either lost in unbound syntax or failing to understand how pfSense arranges the config files. ☕️
The first command is failing because there is no command on the commandline. The second has a status command. |
Aargh, I think my brain has failed and your patience is appreciated. I guess I should have been typing:
unbound set at verbosity 5, using just 2 Quad9 IPv4 server addresses, so I am presuming no major issues here:

Click: unbound log v5 for approx 1 second (redacted)

    Jun 27 16:15:32 Router-8 unbound[96737]: [96737:0] debug: chdir to /var/unbound

I'll re-run with the Quad9 IPv6 name servers added to the forwarding list when network traffic allows. ☕️
As requested:

Click: unbound log verbosity 5 for IPv4 and IPv6 forwarders [Quad9] (redacted)

    [23.05-RELEASE][admin@Router-8.redacted.me]/root: cat /var/log/resolver.log
    [23.05-RELEASE][admin@Router-8.redacted.me]/root:

Click: unbound-control stats_noreset: (redacted)

    [23.05-RELEASE][admin@Router-8.redacted.me]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset

The DNS Resolver Infrastructure Cache Speed summaries can look quite wild:

I hope these help you shed some light as to why the query response time can become so protracted when using both IPv4 and IPv6 forwarders. ☕️
In the logs there is one oddity; it seems this happens:
That looks like unbound process 71149 ceases to exist, and then about two minutes and 14 seconds later it starts again. That, if intentional, is harmless, but there would normally be logs about the shutdown sequence.

From the infra cache stats, it looks like there are a lot of timeouts happening; I also spot one for an IPv4 address, and the other timeouts are for IPv6 addresses. These timeouts cause unbound to pause and wait, and, as I already mentioned earlier, I think they are the cause of the wait times. Something must be wrong with the connection, and I would think that fixing the upstream connectivity would likely fix the issue. Somehow this does not happen if only IPv4 is used? The presence of IPv6 traffic causes packet drops?

From the logs, there is no mention of dropped connections, but because they are fairly short, I guess they did not capture them. It would probably not look like anything in particular if this is some sort of loss of network connectivity. Normally, network connectivity does not have this kind of packet drop; essentially 0 drops would be expected, and certainly towards these forwarders I would not expect packet drops.

Perhaps prefer-ip4 can help, if IPv6 network connectivity just does not work right, e.g. has packet drops. Or fix the IPv6 connection. But then there are also some packet drops for IPv4. The TLS connections lag a bunch when that happens, and this is the TLS stack that does that. If the forwarders are configured to use UDP, unbound chooses a timeout itself, and that would likely be fairly short, 93 msec for the IPv4 addresses, and the retry is quick and cheap. That may make it easier to cope with this packet-lossy network connection, as a packet drop over UDP for IPv4 is then only a couple hundred msec of delay, once in a while. Unbound can also configure the number of retries; the default is 5 retries to a server, with the outbound-msg-retry option.
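A sketch of those two knobs as unbound.conf options (5 is the documented default for the retry count; prefer-ip4 is off by default):

    server:
        # prefer the IPv4 transport for queries sent to upstream servers
        prefer-ip4: yes
        # retries per upstream server (default 5)
        outbound-msg-retry: 5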
Click: Oddity expanded (redacted):
I presume it was caused by this line:
The upstream connection to the forwarders is perfect with hardly a ripple on PingPlotter and steady at 7.2ms. It is a 1 Gbit fibre service via a 2.5 GbE ONT. Clearly I am blind to what happens behind Quad9's door but I tried the same tests with the Cloudflare equivalents and there was no change in the symptoms or performance.
The issue appears when you add IPv6 addresses to the forwarder list. When I added Cloudflare IPv6 addresses to the 4 current forwarders the problem increased further - more IPv6 addresses = more issues. I'm not seeing anything that suggests dropped packets on the WAN side and the PCAPs support this; only that the timings get longer and are cumulative, tripping a timer somewhere.
If it comes to it I will have to remove the IPv6 addresses, but this is not ideal. I'd rather stick with DoT, so the TLS / TCP handshakes are just one of those things. If UDP is the answer it kind of defeats the reason I moved away from Dnsmasq as my caching DNS service.
Where should I look for these retries (if present), as all I see in the pcaps is the sequential use of IPv6 after the IPv4, and the total time of these queries driving up the answer time?

You should read the next observation with care, as I am far from sure of what I am seeing myself, but some of the expanded / delinquent timings seem to be associated with additional probing after the main request (prefetch activity?), with the client only receiving its actual answer once all the other queries are complete. Again, I am not sure of what is going on under the hood, but there is a lot more activity going on when the query times go sideways. Timings get summed and multiplied whilst the pcaps show nothing but protracted but otherwise normal activity. I included a

Apologies for the clipped data provided previously, I was limited by the character limit. I do have larger logs so feel free to point me at things to look at or grep from. Thanks again for looking at this. ☕️
So, if the upstream is working fine, the issue must be close to the server, if not actually the server itself. This happens when IPv6 is used, and more IPv6 causes more issues. The issues are packet drops for IPv6, but a packet drop for IPv4 is also visible in the infra stats. This then causes the slowdown.

The first-IPv4-then-IPv6 behaviour is caused by unbound selecting the best servers, and those are the IPv4 servers because they do not drop packets; unbound then retries and attempts the IPv6 servers after that, which means the IPv4 attempt failed somehow. It could also be random selection, and that should be evenly weighted, because that is what the unbound server selection code does. The statistics output did not look problematic to me; there are the long resolution times when timeouts must be happening.

If the problem is close to the server, something must be wrong. Since unbound is just creating a socket, the system, network card, cable, network router or other network equipment up to the working WAN link is then the likely cause, and drops packets once IPv6 gets enabled. The failure where the process ceases to exist is not explained by the stats_noreset command; that should not end the server process. The process seems to have been killed, and then it is restarted two minutes later. If that is caused by a failure in hardware, like the mainboard or overheating, and the router then restarts, that could explain it, and may also explain the packet drop behaviour. Or the problem could be in software: if the machine is out of memory, unbound should log out of memory, but the operating system's out-of-memory killer can kill the process without further logs from Unbound. And the machine could run out of memory because the extra IPv6 sockets use buffer memory. Perhaps it drops packets because of lack of buffer space, causing the connection failures?

Unbound does not actually perform probing; there is a root key sentinel lookup in recent versions, but that happens only once. Other queries only happen in line with client queries. Sometimes unbound sends another lookup because of a failure, or for CNAME chasing.
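If one wanted to check the memory / buffer-space theory on the pfSense (FreeBSD) shell, something like this could be a starting point (a sketch; the exact kernel messages vary by platform):

    # kernel messages about killed processes or exhausted buffers
    dmesg | grep -i -E 'swap|killed|mbuf'
    # current network buffer (mbuf) usage, including requests denied
    netstat -m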
Looking at the IPv6 address that the packets are being dropped at, it corresponds to this place:

LONAP is a "not for profit" Layer 2 Internet Exchange Point (IXP) based in London. Our data-centres host a network of interconnected switches providing free-flowing peering to help minimise interconnection costs. We provide exclusive connectivity between members, who are effectively LONAP stakeholders. This ensures that LONAP members enjoy excellent value and maximum benefits:

☕️
It is nice that LONAP delivers not-for-profit IXP connectivity. The 25% packet loss indication looks like the WAN-side issue we were looking for. The 0.2% for IPv4 is also important to note, because that means degraded performance. The cutout of the process is also worrying, in that the server process disappeared. But the 25% packet loss for IPv6, apparently some of the time, is certainly something that grinds connectivity to a halt. I do not think that TCP or TLS is going to cope with that sort of number, and it seems to not be doing so in this issue.

So, one solution is to not list the IPv6 addresses. That still leaves the 0.2% IPv4 trouble and the process cutout issue, but it avoids the 25% packet loss. Another is to use UDP instead of TLS; in that case Unbound performs the retries itself and they are much faster and lighter weight, comparatively, so that would be able to work. But since the IPv6 host is the same host as the IPv4 address, it is simply another way to contact that upstream service, so perhaps this is not as useful as just using IPv4. Also, it is possible to remove the forward altogether and have unbound run as a full resolver, contacting the authority servers directly. Because that traffic is likely not going over that packet-loss hop towards this particular upstream forwarder service, for most lookups it would likely work; it is then not using TLS, though.

Unbound is configured to not use fragments, if possible, something that is advocated for DNS servers. So the fragment failure is not really an issue at all.
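For illustration, the two forwarding alternatives could look roughly like this in unbound.conf, based on the forward-zone already shown in this issue (a sketch, not a recommendation to give up DoT):

    # Alternative 1: keep forwarding to Quad9, but over plain port 53,
    # so unbound's own quick UDP retries apply
    forward-zone:
        name: "."
        forward-tls-upstream: no
        forward-addr: 9.9.9.9
        forward-addr: 149.112.112.112

    # Alternative 2: remove the forward-zone entirely, so unbound resolves
    # queries itself against the authoritative servers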
@wcawijngaards - Thanks again and I have raised a support ticket (#32073) with Quad9 and I will report back here with any details they provide. ☕️ |
Quad9 are on the case as they can:
☕️
Forwarding to a mixed block of IPv4 and IPv6 name server addresses effectively doubles the query response time, as the IPv6 server address query does not start until the IPv4 query has completed (i.e. it runs sequentially, not in parallel). Additionally, both IPv4 and IPv6 queries have to be fully resolved before any answer is provided to clients.
unbound Version 1.17.1
as bundled with pfSense Plus Version 23.05-Release
Desired behaviour:
With a list of forwarding name servers containing both IPv4 and IPv6 addresses (example below), the lookups should run in parallel, with the option of selecting the fastest response from the two servers chosen by unbound in the normal manner. This would also provide an element of fallback should either the IPv4 or IPv6 address fail to provide a response, as well as a faster 'first-past-the-post' response.
It is accepted that the number of queries sent will still be doubled (as it is now), but by running in parallel it would avoid a faster IPv4 response being hidden from the client until the IPv6 query has started and run to completion (or vice versa). As a stretch target, it would be ideal if the normal unbound forwarder selection behaviour were IPv4/IPv6 agnostic, allowing either address protocol to be utilised by the selection algorithm, as this would halve the traffic and mimic the current behaviour when only IPv4 or only IPv6 addressed forwarders are in use.
Attachments
PCAP overview showing sequential IPv4 query-response + IPv6 query-response-answer 6 & answer 4:
Forwarding addresses used in the above example:
forward-zone:
name: "."
forward-tls-upstream: yes
forward-addr: 9.9.9.9@853#dns.quad9.net
forward-addr: 149.112.112.112@853#dns.quad9.net
forward-addr: 2620:fe::fe@853#dns.quad9.net
forward-addr: 2620:fe::9@853#dns.quad9.net
☕️