Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prefetch and ECS causing cache corruption when used together #417

Closed
chantra opened this issue Feb 2, 2021 · 8 comments
Closed

prefetch and ECS causing cache corruption when used together #417

chantra opened this issue Feb 2, 2021 · 8 comments
Assignees

Comments

@chantra
Copy link

chantra commented Feb 2, 2021

This used to be on unbound bugzilla, but it is not accessible anymore. for completeness, I copied/pasted what was in web archive in https://gist.github.com/chantra/fd333f62539e6ec2f9c5f94f690fb41c

The original bug report narrowed down the problem to the interaction between ECS and prefetch, e.g disabling prefetch in the config used for the repro would not reproduce the issue anymore.

I provide both an unbound config to repro and a go client to reproduce the issue. I used our in-house auth server to repro this, any OSS auth server supporting ECS could be used too, I just don't have any config handy.

The current theory is that when a record is prefetched, it is being prefetched without the ECS context and ends up filling up the global cache.


Repro

Using client
https://gist.github.com/chantra/d7b49d5b38b07c20d9fc7772b624e794

with the following authoritative responses:

foo.example.com CNAME bar.example.com
bar.example.com CNAME extcode.example.com if ECS is not set or if ECS is set and subnet is not in RFC1918
otherwise
bar.example.com CNAME intcode.example.com

First thing that stand out is that even if I do a simple

 dig foo.example.com -p 1153 @::1  +subnet=10.0.0.0/24

The servers are marked as edns lame?

127.0.0.1@8853 example.com. ttl 897 ping 0 var 54 rtt 216 rto 216 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
::1@8853 example.com. ttl 897 ping 0 var 71 rtt 284 rto 284 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0

I have uploaded a pcap to https://www.packettotal.com/app/analysis?id=ab657e5aed3af196db710bf60da850e2 (mind that the server is running on 8853) (it was too big to upload to this bug report)

One of the interesting part is, when using:

tshark -r /tmp/dns_example.com.pcap -V -Y 'dns.flags.response == 0 ' -T fields -e "frame.time" -e "dns.opt.data" | less
Jan 30, 2019 16:35:14.618349000 PST     00:01:20:00:21:b6:3f:23
Jan 30, 2019 16:35:14.619272000 PST     00:01:20:00:b7:5e:fe:5b
Jan 30, 2019 16:35:14.619630000 PST     00:01:20:00:b7:5e:fe:5b
Jan 30, 2019 16:35:14.620376000 PST     00:01:20:00:4f:2b:1c:04
Jan 30, 2019 16:35:14.620439000 PST     00:01:20:00:22:f1:38:3f
Jan 30, 2019 16:35:14.621797000 PST     00:01:20:00:74:99:a8:ed
Jan 30, 2019 16:35:14.621881000 PST     00:01:20:00:1c:fe:8c:a1
Jan 30, 2019 16:35:14.622982000 PST     00:01:20:00:74:99:a8:ed
Jan 30, 2019 16:35:14.623875000 PST     00:01:20:00:74:99:a8:ed
Jan 30, 2019 16:35:19.000153000 PST
Jan 30, 2019 16:35:19.000200000 PST
Jan 30, 2019 16:35:19.000307000 PST
Jan 30, 2019 16:35:19.000347000 PST
Jan 30, 2019 16:35:19.000629000 PST
Jan 30, 2019 16:35:19.000655000 PST
Jan 30, 2019 16:35:19.000780000 PST
Jan 30, 2019 16:35:19.000808000 PST
Jan 30, 2019 16:35:19.000831000 PST
Jan 30, 2019 16:35:19.000901000 PST
Jan 30, 2019 16:35:19.000955000 PST
Jan 30, 2019 16:35:19.001006000 PST
Jan 30, 2019 16:35:19.001097000 PST
Jan 30, 2019 16:35:19.001142000 PST
Jan 30, 2019 16:35:19.001194000 PST
Jan 30, 2019 16:35:19.001203000 PST
Jan 30, 2019 16:35:19.001237000 PST
Jan 30, 2019 16:35:19.001302000 PST
Jan 30, 2019 16:35:19.001350000 PST
Jan 30, 2019 16:35:19.001370000 PST
Jan 30, 2019 16:35:19.001464000 PST
Jan 30, 2019 16:35:19.001498000 PST

See the 5 sec pause in the trace and then all a sudden there is no ECS data anymore.

Mind that unbound is config with:

send-client-subnet: 127.0.0.1
send-client-subnet: ::1

Complete config is available at https://gist.github.com/chantra/3cd7d629b16fdd113c3b83239059e74e

Current thinking

No ECS on prefetch queries makes sense, these queries do not have a reply_list on the mesh, since no replies are needed. The ECS module uses the reply_list to get the address to put in the options.

So, what needs to be done is making the prefetch mesh states unique (or, preferably, unique per subnet), and somehow passes the client information to the module. Let me see what I can do here.

IIRC, the auth becoming unresponsive for a short period of time was a trigger, but I may be wrong here.

cc @gthess , @ralphdolmans

@chantra
Copy link
Author

chantra commented Feb 3, 2021

As a new point of reference, I reproed on unbound 1.11

there is still this 5s stall:

2021/02/02 16:13:52 OK
2021/02/02 16:13:52 OK
2021/02/02 16:13:57 1612311237 NOK
2021/02/02 16:13:57 1612311237 NOK

and unbound logs showing the servfail:

[1612310746] unbound[922235:18] error: SERVFAIL <example.com. AAAA IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311231] unbound[922235:e] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311231] unbound[922235:3] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311231] unbound[922235:7] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311231] unbound[922235:1a] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311232] unbound[922235:8] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311232] unbound[922235:15] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311232] unbound[922235:10] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311232] unbound[922235:10] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311232] unbound[922235:b] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.
[1612311232] unbound[922235:1] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.

the timestamp of those logs matchers the stall

$ date -d @1612311231
Tue Feb  2 16:13:51 PST 2021

@mzealey
Copy link

mzealey commented Feb 13, 2021

Sounds a bit similar to #388 could it be the same issue?

@gthess
Copy link
Member

gthess commented Feb 17, 2021

I see two issues mentioned here:
a. Something going wrong with EDNS
b. Prefetch logic not playing well with ECS.

For a. nothing loads when I visit the posted pcap so I can't comment on that. It may also be because of b. but I can't say for certain.
For b. I have a branch(https://github.com/NLnetLabs/unbound/tree/prefetch_when_ecs_enabled) that fixes the issue. Namely if a prefetch needs to happen the client's info will be passed along to the subnet module to make sure to use the information if it needs to.

Could you check if at least b. is resolved?

@gthess gthess self-assigned this Feb 17, 2021
@chantra
Copy link
Author

chantra commented Feb 17, 2021

For a. nothing loads when I visit the posted pcap so I can't comment on that. It may also be because of b. but I can't say for certain.

yeah, I suppose the website does not keep content for ever. I can try to upload a new trace for illustration purpose.

Could you check if at least b. is resolved?

I will give it a go soon.

@chantra
Copy link
Author

chantra commented Feb 17, 2021

Uploaded a new trace: ecs-before.pcap.gz
. You can see that frame number 156058 has no ECS and also comes 5s after frame number 156044, which is the last query frame with ECS.

I have tested https://github.com/NLnetLabs/unbound/tree/prefetch_when_ecs_enabled and it seems to work. I did not dive too much into it, but it seems that at least the symptom is not happening anymore when unbound find that "all the configured stub or forward servers failed, at zone example.com". There is some backoff, but no issue as was detected previously by client.

Gist with client and unbound logs: https://gist.github.com/chantra/c1ea4221ce60f635047701d7a680da1f

I have a tcpdump of that run, but it is quite big. I can share a dropbox link privately if needed.

@gthess
Copy link
Member

gthess commented Apr 25, 2022

Hi @chantra, my focus was diverted from the issue and quite some time has passed already but am I correct to assume that the issue is solved with the https://github.com/NLnetLabs/unbound/tree/prefetch_when_ecs_enabled branch?

@chantra
Copy link
Author

chantra commented Apr 25, 2022

@gthess I don't remember the details of the test as it was a while ago, but it does look like it solved the issue based on my comment.

@gthess
Copy link
Member

gthess commented Apr 26, 2022

Thanks, I remember the same but was confused with the pcap which seems to be in response to the expired first pcap file.
I'll try to merge the fix and close this issue.

@gthess gthess closed this as completed in daf316e May 11, 2022
gthess added a commit that referenced this issue May 11, 2022
Fix #417: prefetch and ECS causing cache corruption when used
jedisct1 added a commit to jedisct1/unbound that referenced this issue May 25, 2022
* nlnet/master:
  - Fix some lint type warnings.
  - Fix ede test to not use default pidfile, and use local interface.
  - Fix to silence test for ede error output to the console from the   test setup script.
  - Fix typos in config_set_option for the 'num-threads' and   'ede-serve-expired' options.
  - Fix NLnetLabs#678: [FR] modify behaviour of unbound-control rpz_enable zone,   by updating unbound-control's documentation.
  - For NLnetLabs#677: Added tls-system-cert to config parser and documentation. - Changelog note for NLnetLabs#677.
  Allow using system certificates not only on Windows
  - Fix NLnetLabs#417: prefetch and ECS causing cache corruption when used   together.
  - Fix NLnetLabs#673: DNS over TLS: error: SSL_handshake syscall: No route to   host.
  - Fix Python build in non-source directory; based on patch by   Michael Tokarev.
  Changelog entry for NLnetLabs#604: Add the basic EDE (RFC8914) cases
  Add the basic EDE (RFC8914) cases (NLnetLabs#604)
  - Fix NLnetLabs#670: SERVFAIL problems with unbound 1.15.0 running on   OpenBSD 7.1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants