Recursor: server-down-max-fails is amplified by chained queries. #6462

Closed
phonedph1 opened this Issue Apr 10, 2018 · 0 comments

phonedph1 commented Apr 10, 2018

  • Program: Recursor
  • Issue type: Bug report

Short description

We recently ran into a problem where queries for qname=sip.voip.blackberry.com, qtype=A got no response from either of the zone's auth servers. The situation was compounded by CPE requesting the name at extremely bursty rates (e.g. hundreds of times within the same second). As a result, both auth servers ended up in a throttled state.

Environment

  • Operating system: Linux
  • Software version: 4.1.2
  • Software source: Compiled

Steps to reproduce

  1. Launch 200 queries for a name that upstream will not respond to.
  2. Query for a name that would normally succeed.

Expected behaviour

Because of query chaining (located within asendto), the 200 queries for the same name trigger only 1 outgoing query (per NS). I therefore expected them to count as only 1 increment towards server-down-max-fails, and the following query to succeed.
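
As a rough illustration of why a burst collapses into a single outgoing query, the sketch below models chaining with a table of in-flight queries keyed on (qname, qtype, remote IP). `PendingQueries`, `sendOrChain` and `exampleBurst` are hypothetical names made up for this example. It is not the recursor's asendto code, and 192.0.2.1 is a documentation address.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for the table of in-flight outgoing queries.
class PendingQueries {
public:
  // Returns true when the query was chained onto an identical in-flight
  // query, false when it had to be sent on the wire.
  bool sendOrChain(const std::string& qname, int qtype, const std::string& remoteIP) {
    auto res = d_waiters.emplace(std::make_tuple(qname, qtype, remoteIP), std::size_t(0));
    ++res.first->second;   // one more client query waiting on this key
    if (res.second) {      // key was not in flight yet: a real packet goes out
      ++d_wireSends;
      return false;
    }
    return true;
  }

  std::size_t wireSends() const { return d_wireSends; }

private:
  std::map<std::tuple<std::string, int, std::string>, std::size_t> d_waiters;
  std::size_t d_wireSends{0};
};

// Usage: a burst of 200 identical queries towards one NS.
void exampleBurst() {
  PendingQueries pending;
  std::size_t chained = 0;
  for (int i = 0; i < 200; ++i) {
    if (pending.sendOrChain("sip.voip.blackberry.com", 1 /* qtype A */, "192.0.2.1")) {
      ++chained;
    }
  }
  assert(pending.wireSends() == 1);  // only the first query hits the wire
  assert(chained == 199);            // the rest wait on its answer
}
```

Under this model, a single timeout on the one wire query is what all 200 waiters observe, which is why one increment per NS seemed like the natural expectation.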

Actual behaviour

  1. The initial 200 queries very quickly push the failure count for the remoteIP past server-down-max-fails.
  2. The retry of these queries against the second NS quickly does the same.
  3. The following query fails because both servers have been throttled.

Other information

I passed a bool& chained down from doResolveAtThisIP into asendto and changed the value if the query was going to be chained. I then excluded such queries from the timeout code:

if (s_serverdownmaxfails > 0 && (auth != g_rootdnsname) && !chained && t_sstorage.fails.incr(remoteIP) >= s_serverdownmaxfails) {

Although I have not tested this extensively yet, I believe it would limit the counting to on-wire (non-chained) timeouts.
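
To make the effect of the `!chained` guard concrete, here is a minimal, self-contained sketch that replays one timed-out outgoing query with 200 waiters attached, once without the guard and once with it. `FailCounter`, `timeoutThrottles` and `exampleTimeout` are hypothetical names standing in for t_sstorage.fails and the timeout path. The threshold of 64 is only an illustrative value, and the root-zone check from the real condition is left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-IP failure counter, standing in for t_sstorage.fails.
class FailCounter {
public:
  unsigned int incr(const std::string& ip) { return ++d_fails[ip]; }
  unsigned int value(const std::string& ip) const {
    auto it = d_fails.find(ip);
    return it == d_fails.end() ? 0 : it->second;
  }

private:
  std::map<std::string, unsigned int> d_fails;
};

// Replays the timeout of one outgoing query that `waiters` client queries
// were waiting on (1 sent on the wire, the rest chained). Returns true if
// the failure count reaches maxFails, i.e. the server would be throttled.
bool timeoutThrottles(FailCounter& fails, const std::string& remoteIP,
                      unsigned int waiters, unsigned int maxFails, bool skipChained) {
  bool throttled = false;
  for (unsigned int i = 0; i < waiters; ++i) {
    const bool chained = (i != 0);  // only the first waiter sent a packet
    // Same shape as the patched condition: the short-circuit means incr()
    // is never called for a chained waiter when skipChained is set.
    if (maxFails > 0 && !(skipChained && chained) && fails.incr(remoteIP) >= maxFails) {
      throttled = true;
    }
  }
  return throttled;
}

// Usage: one unanswered query, 200 waiters, illustrative threshold of 64.
void exampleTimeout() {
  FailCounter current;
  assert(timeoutThrottles(current, "192.0.2.1", 200, 64, false));   // throttled
  assert(current.value("192.0.2.1") == 200);                        // 200 increments

  FailCounter patched;
  assert(!timeoutThrottles(patched, "192.0.2.1", 200, 64, true));   // not throttled
  assert(patched.value("192.0.2.1") == 1);                          // 1 increment
}
```

In this model the guard turns 200 increments per lost packet into 1, so reaching the threshold would again require that many separate on-wire timeouts.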

Usecase

I don't believe this situation is limited to broken auth servers (although in this case it was). Consider a zone with only 2 auth servers, one of which is down. In that situation, unlucky packet loss towards the remaining server may trigger this for popular names. We often see 'micro bursts' of lookups like this when local CPE caches expire.

Description

I am looking for guidance, but I think failures should only be counted for queries that were actually sent on the wire. This may have implications for a lot of the timeout and outgoing-query metric counting as well.

Thanks for reading.

@rgacogne rgacogne added this to the rec-4.1.x milestone Apr 10, 2018

@pieterlexis pieterlexis closed this in #6465 Apr 11, 2018
