prefetch and ECS causing cache corruption when used together #417
Comments
As a new point of reference, I reproduced this on unbound 1.11 and the 5s stall is still there:
and the unbound logs show the SERVFAIL:
The timestamps in those logs match the stall.
Sounds a bit similar to #388. Could it be the same issue?
I see two issues mentioned here: For a., nothing loads when I visit the posted pcap, so I can't comment on that. It may also be caused by b., but I can't say for certain. Could you check whether at least b. is resolved?
Yeah, I suppose the website does not keep content forever. I can try to upload a new trace for illustration purposes.
I will give it a go soon.
Uploaded a new trace: ecs-before.pcap.gz

I have tested https://github.com/NLnetLabs/unbound/tree/prefetch_when_ecs_enabled and it seems to work. I did not dive too deep into it, but at least the symptom no longer occurs when unbound finds that "all the configured stub or forward servers failed, at zone example.com". There is some backoff, but the client no longer detects the issue it saw previously.

Gist with client and unbound logs: https://gist.github.com/chantra/c1ea4221ce60f635047701d7a680da1f

I have a tcpdump of that run, but it is quite big. I can share a Dropbox link privately if needed.
Hi @chantra, my focus was diverted from this issue and quite some time has passed already, but am I correct to assume that the issue is solved with the https://github.com/NLnetLabs/unbound/tree/prefetch_when_ecs_enabled branch?
@gthess I don't remember the details of the test as it was a while ago, but based on my comment it does look like it solved the issue.
Thanks, I remember the same, but I was confused by the pcap, which seems to have been uploaded in response to the expired first pcap file.
Fix #417: prefetch and ECS causing cache corruption when used
* nlnet/master:
- Fix some lint type warnings.
- Fix ede test to not use default pidfile, and use local interface.
- Fix to silence test for ede error output to the console from the test setup script.
- Fix typos in config_set_option for the 'num-threads' and 'ede-serve-expired' options.
- Fix NLnetLabs#678: [FR] modify behaviour of unbound-control rpz_enable zone, by updating unbound-control's documentation.
- For NLnetLabs#677: Added tls-system-cert to config parser and documentation.
- Changelog note for NLnetLabs#677. Allow using system certificates not only on Windows
- Fix NLnetLabs#417: prefetch and ECS causing cache corruption when used together.
- Fix NLnetLabs#673: DNS over TLS: error: SSL_handshake syscall: No route to host.
- Fix Python build in non-source directory; based on patch by Michael Tokarev.
Changelog entry for NLnetLabs#604: Add the basic EDE (RFC8914) cases
Add the basic EDE (RFC8914) cases (NLnetLabs#604)
- Fix NLnetLabs#670: SERVFAIL problems with unbound 1.15.0 running on OpenBSD 7.1.
This used to be on the unbound Bugzilla, but that is not accessible anymore. For completeness, I copied what was in the web archive into https://gist.github.com/chantra/fd333f62539e6ec2f9c5f94f690fb41c
The original bug report narrowed the problem down to the interaction between ECS and prefetch: with prefetch disabled in the repro config, the issue no longer reproduces.
I provide both an unbound config and a Go client to reproduce the issue. I used our in-house auth server for the repro; any OSS auth server supporting ECS could be used too, I just don't have a config handy.
The current theory is that when a record is prefetched, it is being prefetched without the ECS context and ends up filling up the global cache.
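To make the theory concrete, here is a toy model in Python. It is only a sketch of the suspected behaviour, not unbound's actual data structures or code paths: the `ToyResolverCache` class, the `authoritative` stand-in, and in particular the lookup order (global cache consulted first) are all assumptions made for illustration.

```python
class ToyResolverCache:
    """Toy model only; NOT how unbound stores data internally."""

    def __init__(self):
        self.global_cache = {}   # qname -> answer (no ECS context)
        self.subnet_cache = {}   # (qname, subnet) -> answer

    def store(self, qname, answer, ecs_subnet=None):
        if ecs_subnet is None:
            self.global_cache[qname] = answer
        else:
            self.subnet_cache[(qname, ecs_subnet)] = answer

    def lookup(self, qname, ecs_subnet):
        # Assumption for illustration: a global entry, once present,
        # is served to every client regardless of its subnet.
        if qname in self.global_cache:
            return self.global_cache[qname]
        return self.subnet_cache.get((qname, ecs_subnet))


def authoritative(qname, ecs_subnet=None):
    # Hypothetical stand-in for the auth server: only the internal
    # subnet gets the internal answer.
    if ecs_subnet == "10.0.0.0/8":
        return "intcode.example.com"
    return "extcode.example.com"


def resolve(cache, qname, client_subnet):
    hit = cache.lookup(qname, client_subnet)
    if hit is not None:
        return hit
    answer = authoritative(qname, client_subnet)
    cache.store(qname, answer, client_subnet)
    return answer


def prefetch_without_ecs(cache, qname):
    # The suspected bug: the refresh query carries no ECS option,
    # so the answer lands in the global cache.
    cache.store(qname, authoritative(qname), ecs_subnet=None)


cache = ToyResolverCache()
first = resolve(cache, "bar.example.com", "10.0.0.0/8")
prefetch_without_ecs(cache, "bar.example.com")
second = resolve(cache, "bar.example.com", "10.0.0.0/8")
print(first, second)
```

In this model the same internal client gets the internal answer before the prefetch and the external one afterwards, which is the kind of flip the repro client watches for.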
Repro
Using client
https://gist.github.com/chantra/d7b49d5b38b07c20d9fc7772b624e794
with the following authoritative responses:
foo.example.com CNAME bar.example.com
bar.example.com CNAME extcode.example.com if ECS is not set or if ECS is set and subnet is not in RFC1918
otherwise
bar.example.com CNAME intcode.example.com
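For anyone setting up a different auth server, the decision rule above could be sketched roughly as follows. This is a Python illustration only, not the in-house server's code; the function name `bar_target` is made up here, and IPv6 subnets are assumed to count as "not in RFC1918".

```python
import ipaddress

# RFC 1918 ranges listed explicitly; ip_network.is_private would also
# match loopback, link-local and other non-RFC1918 ranges.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def bar_target(ecs_subnet=None):
    """Return the CNAME target for bar.example.com.

    ecs_subnet is the ECS source subnet as a string (e.g. "10.1.2.0/24"),
    or None when the query carries no ECS option.
    """
    if ecs_subnet is None:
        return "extcode.example.com"
    net = ipaddress.ip_network(ecs_subnet, strict=False)
    # subnet_of() raises TypeError across IP versions, so check v4 first.
    if net.version == 4 and any(net.subnet_of(p) for p in RFC1918):
        return "intcode.example.com"
    return "extcode.example.com"


print(bar_target())                   # no ECS
print(bar_target("192.168.1.0/24"))   # ECS inside RFC 1918
print(bar_target("8.8.8.0/24"))       # ECS outside RFC 1918
```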
The first thing that stands out is that even if I do a simple
The servers are marked as edns lame?
I have uploaded a pcap to https://www.packettotal.com/app/analysis?id=ab657e5aed3af196db710bf60da850e2 (mind that the server is running on 8853) (it was too big to upload to this bug report)
One interesting part is, when using:
See the 5 sec pause in the trace, after which all of a sudden there is no ECS data anymore.
Mind that unbound is configured with:
Complete config is available at https://gist.github.com/chantra/3cd7d629b16fdd113c3b83239059e74e
Current thinking
IIRC, the auth becoming unresponsive for a short period of time was a trigger, but I may be wrong here.
cc @gthess , @ralphdolmans