Recursor: server-down-max-fails is amplified by chained queries. #6462
We recently ran into a problem with
Steps to reproduce
Since the 200 queries for the same name only trigger 1 outgoing query (per NS) due to query chaining (located within
I passed a
Although I have not tested this extensively yet, I believe it would limit failure counting to on-wire (non-chained) timeouts.
I don't believe this situation is limited to broken auth servers (although that was the cause in this case). Consider a domain with only 2 auth servers where one might be down. In that situation, popular names may trigger this through unlucky packet loss. We often see 'micro bursts' of lookups like this when local CPE caches expire.
Looking for guidance, but I think failures should only be counted for queries that were actually sent on the wire. This may have implications for lots of timeout/outgoing metric counting as well.
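To make the amplification concrete, here is a minimal Python sketch of the counting behaviour as I understand it. It is not the Recursor's code: `FailCounter`, `simulate`, the threshold of 64, and the `>=` comparison are all hypothetical and only model the idea that one on-wire timeout is charged once per chained query.

```python
from dataclasses import dataclass


@dataclass
class FailCounter:
    """Hypothetical per-server failure counter (not the Recursor's real type)."""
    max_fails: int  # illustrative stand-in for server-down-max-fails
    fails: int = 0

    def record_timeout(self) -> None:
        self.fails += 1

    @property
    def server_down(self) -> bool:
        # Assumption: the server is considered down once the counter
        # reaches the threshold.
        return self.fails >= self.max_fails


def simulate(clients: int, max_fails: int, count_chained: bool) -> FailCounter:
    """One outgoing query times out while `clients` queries wait on it.

    Only the first client causes an on-wire query; the rest are chained.
    With count_chained=True every waiter is charged a failure (the behaviour
    described in this issue); with False only the on-wire query is charged
    (the proposed behaviour).
    """
    counter = FailCounter(max_fails=max_fails)
    for i in range(clients):
        on_wire = i == 0
        if on_wire or count_chained:
            counter.record_timeout()
    return counter


amplified = simulate(clients=200, max_fails=64, count_chained=True)
proposed = simulate(clients=200, max_fails=64, count_chained=False)

print(amplified.fails, amplified.server_down)  # 200 True
print(proposed.fails, proposed.server_down)    # 1 False
```

In this model a single lost packet is enough to mark the server down when chained queries are counted, while counting only the on-wire query leaves the counter at 1.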
Thanks for reading.