dnsdist - inconsistent responses from cache #4983

Closed
rygl opened this Issue Feb 8, 2017 · 7 comments

rygl commented Feb 8, 2017

Hi,
I would like to report the following issue. I am running dnsdist 1.1.0 with packet caches enabled:
PrimaryCache = newPacketCache(20000000, 86400, 0)
getPool(""):setCache(PrimaryCache)
getPool("ipv6_pool"):setCache(PrimaryCache)

SecondaryCache = newPacketCache(2000000, 86400, 0)
getPool("quarantine"):setCache(SecondaryCache)
getPool("ipv6_quarantine"):setCache(SecondaryCache)

setStaleCacheEntriesTTL(600)

I have complaints from clients querying the A record of skye.exxonmobil.com, which is 150.127.230.13 with a TTL of 300 seconds. For some reason dnsdist starts responding with NXDOMAIN to some clients: at the same moment it returns the correct answer to some clients and NXDOMAIN to others. I have a pcapng capture of this behavior.

Expunging the record from the cache solves the issue for a couple of hours:
getPool(""):getCache():expungeByName(newDNSName("skye.exxonmobil.com"), dnsdist.A)

I had to add a rule that skips the cache for this domain. I am afraid that other domains could be affected as well.

Thanks for your help.
Cheers
Ales

Member

rgacogne commented Feb 8, 2017

Hi Aleš!

Looking at your capture, it looks like we have two different answers in the cache:

  • one for an EDNS0-enabled query, with the correct answer content (150.127.230.13)
  • one for the same query but without EDNS0 support, with an NXDOMAIN answer

From the dnsdist point of view, the two queries and their matching answers are treated as two different ones, since we can't serve the same answer to an EDNS-aware client and to an EDNS-ignorant one, so it's perfectly normal to have both entries in the cache.
The real question is how did we end up with an NXDOMAIN answer for this query? The cached answer has no RRs except for the query, which looks more like the type of NXDOMAIN dnsdist currently generates than a backend one. Do you have some rules that could generate such an NXDOMAIN? If so, perhaps you need some rules to prevent caching of those?
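
To illustrate why both entries can coexist, here is a minimal Python sketch of a cache whose key includes whether the query carried EDNS0. It is only a toy model: `CacheKey`, `ToyPacketCache`, and the string answers are hypothetical names made up for this example, and dnsdist's real packet cache is implemented differently.

```python
from typing import Dict, NamedTuple, Optional


class CacheKey(NamedTuple):
    qname: str      # lower-cased query name
    qtype: str      # e.g. "A"
    has_edns: bool  # whether the query carried an EDNS0 OPT record


class ToyPacketCache:
    """Illustrative model only; not dnsdist's actual data structure."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, str] = {}

    def insert(self, qname: str, qtype: str, has_edns: bool, answer: str) -> None:
        self._entries[CacheKey(qname.lower(), qtype, has_edns)] = answer

    def lookup(self, qname: str, qtype: str, has_edns: bool) -> Optional[str]:
        return self._entries.get(CacheKey(qname.lower(), qtype, has_edns))


cache = ToyPacketCache()
# Same name and type, but two independent entries because the EDNS flag differs.
cache.insert("skye.exxonmobil.com.", "A", True, "NOERROR 150.127.230.13")
cache.insert("skye.exxonmobil.com.", "A", False, "NXDOMAIN")

edns_answer = cache.lookup("skye.exxonmobil.com.", "A", True)
plain_answer = cache.lookup("skye.exxonmobil.com.", "A", False)
```

In this model an EDNS-aware client and an EDNS-ignorant client asking the same question at the same moment get different cached answers, which matches what the capture shows.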

rygl commented Feb 8, 2017

Hi Remi.

Thanks for your answer. Regarding EDNS0 and the clients: the queries in the trace above were generated manually from two different Linux boxes with different dig versions. One of them uses EDNS0 by default; the other (older) one does not.

The real clients are not using EDNS0. They seem to be iPhones with the MS ActiveSync client. You can see it here. I think I see your point about caching two different answers. Unfortunately, there are only a few NXDOMAIN rules in my config, and none of them seems to interfere with this particular domain at all:

0 293212 qname==168.192.in-addr.arpa. set rcode 3
1 19010 qname==16.172.in-addr.arpa. set rcode 3
2 420065 qname==10.in-addr.arpa. set rcode 3
3 216794887 All Lua script
4 8403 qname==exxonmobil.com. skip cache
5 348047 !(qclass==1) set rcode 3
6 0 Regex: dns-spider.*\.cn$ drop
7 0 Regex: dns-spider.*\.net$ drop
8 0 Regex: dns-spider.*\.org$ drop
9 0 Regex: dns-spider.*\.com$ drop
10 700936 qname==home. set rcode 3
11 506901 qname==local. set rcode 3
12 931386 qname==cpe. set rcode 3
13 61428 qname==wpad. set rcode 3
14 516973 qname==teredo.ipv6.microsoft.com. set rcode 3
15 215817 qname==t-mobile.cz. allow
16 22541 qname==tmo.cz. allow
17 0 qname==openresolver.paegas.cz. allow
18 51388 Regex: .*\.symantec\..* allow
19 1040342 Regex: .*\.avast\..* allow
20 194322 Regex: .*\.avg\..* allow
21 248269 Regex: .*\.eset\..* allow
22 164106 Regex: .*\.mcafee\..* allow
23 3 Src: 62.141.2.10/32 allow

After that, only some quarantine rules and the like follow. Nothing that could send NXDOMAIN.

A.

Member

rgacogne commented Feb 8, 2017

Ok, thanks for clarifying! If the NXDOMAIN answer is not generated locally, either it's sent by a backend or something really weird is happening. Perhaps you could try capturing the traffic related to that domain, or to export it using protobuf?

rygl commented Feb 8, 2017

Hi Remi. One thing I have not mentioned yet: when this occurs, all the backends respond correctly when queried directly. Only dnsdist returns NXDOMAIN. I have two installations and both behave the same way. I will try to capture some traffic.

rygl commented Feb 8, 2017

Hi Remi. I think I have found it. There is an issue with the authoritative name servers of exxonmobil.com: one of the three does not respond correctly and does not even know this domain. What confused me was that the backends always returned the correct record when queried directly. So my theory is that the negative answer was cached in dnsdist and persisted there after the backends had already refreshed their records... The TTL is just 300 seconds here. How does dnsdist handle negative caching?

I should have checked it first. Sorry for that.

Regards
Ales

Member

rgacogne commented Feb 8, 2017

Ok, so indeed ns01.exxonmobil.com replies NXDOMAIN (with no RRs except the question, no SOA) to queries for exxonmobil.com. It also replies FORMERR to EDNS-enabled queries.
Since the answer has no RRs, dnsdist's cache has nothing to compute the TTL from and falls back to the maximum (86400 by default). I think it's a bug; we should probably follow the RFC 2308 recommendation and not cache this kind of answer:

Negative responses without SOA records SHOULD NOT be cached as there is no way to prevent the negative responses looping forever between a pair of servers even with a short TTL.

I'll open a PR to prevent the caching of answers with no RRs soon.
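
As a rough illustration of the difference, the Python sketch below contrasts the fallback described above with the proposed behaviour. It is a simplified model, not dnsdist's code: the function names are hypothetical, and it assumes the entry lifetime is the smallest record TTL in the answer, capped at the configured maximum.

```python
from typing import List, Optional

MAX_TTL = 86400  # maximum entry lifetime, matching the value used in this thread


def ttl_with_fallback(record_ttls: List[int], max_ttl: int = MAX_TTL) -> int:
    """Model of the problematic behaviour: no records means max_ttl."""
    if not record_ttls:
        return max_ttl
    return min(min(record_ttls), max_ttl)


def ttl_or_skip(record_ttls: List[int], max_ttl: int = MAX_TTL) -> Optional[int]:
    """Model of the proposed behaviour: None means 'do not cache'."""
    if not record_ttls:
        return None
    return min(min(record_ttls), max_ttl)


# An NXDOMAIN with no SOA and no other RRs carries no TTL at all.
bare_nxdomain_ttls: List[int] = []
# The correct A answer carries one record with a TTL of 300 seconds.
good_answer_ttls = [300]

stuck_for = ttl_with_fallback(bare_nxdomain_ttls)  # cached for a full day
skipped = ttl_or_skip(bare_nxdomain_ttls)          # not cached
normal = ttl_or_skip(good_answer_ttls)             # cached for 300 seconds
```

Under this model the bare NXDOMAIN outlives the correct 300-second answer by a wide margin, which would explain why expunging the entry only helped until the broken name server was hit again.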

rgacogne referenced this issue on Feb 8, 2017 in a merged pull request:

dnsdist: Don't cache answers without any TTL #4987

rygl commented Feb 9, 2017

Thanks, Remi! This is interesting. Now I know why avoiding the cache (partially) solved the problem.
