Dnsdist "stuck" after a few empty IXFRs if config has setMaxTCPClientThreads #12099

Closed
hlindqvist opened this issue Oct 18, 2022 · 6 comments · Fixed by #12100

Comments

@hlindqvist
Contributor

hlindqvist commented Oct 18, 2022

  • Program: dnsdist
  • Issue type: Bug report

Short description

Dnsdist gets "stuck" after completing a number of "empty IXFRs" (i.e. an IXFR for the current SOA.SERIAL) if the config has e.g. setMaxTCPClientThreads(20).
IXFRs that have changes (i.e. an IXFR for a past SOA.SERIAL) or AXFRs do not seem to trigger this problem.

Environment

  • Operating system: Ubuntu 20.04
  • Software version: 1.7.2
  • Software source: PowerDNS repository

Steps to reproduce

  1. Prereq: ensure you have an IXFR-capable nameserver. I validated the problem primarily with ixfrdist, but pdns-auth likely also works for the actual breaking case (IXFR for current SOA.SERIAL), although I think it will "unstick" dnsdist by closing connections after a while.

  2. Install dnsdist, configure it like so:

addLocal("127.0.0.1")

newServer({address='127.0.3.1:53', pool='ixfrdist', checkName="test.example", checkType="SOA" })
addAction(AllRule(), PoolAction('ixfrdist'))

setMaxTCPClientThreads(20)
  3. Run something like this (example for current SOA.SERIAL = 2):
for f in `seq 1 100`; do echo $f ; dig @127.0.0.1 test.example IXFR=2 ; sleep 0.1 ; done
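
For readers who want to see what the `IXFR=2` in the dig command puts on the wire, here is a minimal Python sketch of such a query. It relies on the RFC 1995 convention that the client's current serial travels in an SOA record in the authority section. The function names are made up for illustration, and leaving MNAME/RNAME as the root name with zeroed timers is an assumption, not a capture of dig's output.

```python
import struct

IXFR, SOA, IN = 251, 6, 1  # QTYPE IXFR, RR type SOA, class IN


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def build_ixfr_query(zone: str, serial: int, msg_id: int = 0x1234) -> bytes:
    """Build an IXFR query carrying `serial` in an authority-section SOA."""
    # Header: ID, flags (none set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=1, ARCOUNT=0
    header = struct.pack("!HHHHHH", msg_id, 0, 1, 0, 1, 0)
    qname = encode_name(zone)
    question = qname + struct.pack("!HH", IXFR, IN)
    # SOA RDATA: MNAME and RNAME set to the root name (assumption),
    # then SERIAL followed by four zeroed timer fields.
    rdata = b"\x00" + b"\x00" + struct.pack("!IIIII", serial, 0, 0, 0, 0)
    authority = qname + struct.pack("!HHIH", SOA, IN, 0, len(rdata)) + rdata
    return header + question + authority


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix the message with its 2-byte length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message
```

Sending `frame_for_tcp(build_ixfr_query("test.example", 2))` over a TCP socket to 127.0.0.1:53 should be roughly equivalent to one round of the loop above.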

Expected behaviour

All rounds should complete

Actual behaviour

Every single time the loop gets stuck after some number of rounds as dnsdist stops responding to TCP queries (in my tests it seemed to be after round 10, but idk if that is reliably so).

(Also compare what happens with IXFR=1 (previous serial), AXFR, or without setMaxTCPClientThreads in dnsdist.conf; none of those variations of the above scenario results in the same breakage.)

Other information

At the point when dnsdist is "stuck" it has a bunch of connections remaining open to the backend (as observable in eg netstat -an).

Possibly the issue is something along the lines that dnsdist doesn't close connections after completing the empty IXFR, and then is stuck after having exhausted all TCP connections to the backend.

For me personally, the immediate problem has been solved by simply removing setMaxTCPClientThreads from dnsdist.conf; the meaning of that directive has changed quite radically since the time I added it to the config anyway, and it isn't really of interest to me anymore after having read documentation for current dnsdist.
However, the above behavior seems to indicate that something is wrong in the IXFR handling, which may still manifest in some other way, or at the very least for others who do want to use setMaxTCPClientThreads.

@rgacogne
Member

That sounds bad. I don't immediately understand what is happening because even if dnsdist was expecting more data on the IXFR TCP connections, it should only add the corresponding file descriptors to the watch list and go on with its life, so something weird is happening.
Also, removing setMaxTCPClientThreads(20) should get you fewer TCP worker threads (the default is 10), so I'm not sure how it helps.

@hlindqvist
Contributor Author

> That sounds bad. I don't immediately understand what is happening because even if dnsdist was expecting more data on the IXFR TCP connections, it should only add the corresponding file descriptors to the watch list and go on with its life, so something weird is happening.
> Also, removing setMaxTCPClientThreads(20) should get you fewer TCP worker threads (the default is 10), so I'm not sure how it helps.

My suspicion is that dnsdist indeed does go on with its life in a sense, and that the "stuckness" is a result of it having exhausted all available TCP connections in the backend with these already finished IXFRs (which are seemingly not recognized as such).
At least with ixfrdist no party seems to want to give up on those connections any time soon, either.

@rgacogne
Member

So what is happening is that the server is replying with a single SOA record, leading dnsdist to believe the transfer is not finished, waiting for at least one more SOA record. Thus dnsdist keeps the connection open, waiting for more data. The ixfrdist worker is then blocked on a read, doing nothing else, since it has no more data to send.
Unfortunately ixfrdist has only 10 workers by default, so once 10 incoming connections end up in that state it does indeed stop responding, since it cannot accept more incoming connections.
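
The exhaustion described above can be illustrated with a toy model. This is a hypothetical simulation, not ixfrdist or dnsdist code: it assumes one worker per connection and that a worker stays blocked for as long as the proxy keeps the connection open.

```python
def rounds_until_stuck(workers: int, rounds: int, proxy_closes_connection: bool) -> int:
    """Return how many rounds are answered before no backend worker is free."""
    busy = 0
    completed = 0
    for _ in range(rounds):
        if busy == workers:
            # Every worker is blocked on a read; the new connection is never served.
            break
        busy += 1        # a worker picks up the connection and sends the single SOA
        completed += 1   # the client receives its answer for this round
        if proxy_closes_connection:
            busy -= 1    # the connection is closed, so the worker is freed again
    return completed
```

With 10 workers and a proxy that never closes the connection, `rounds_until_stuck(10, 100, False)` returns 10, which matches the stall after round 10 reported above. If the proxy closes each connection once the transfer is recognized as finished, all 100 rounds complete.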

Now, the question is whether the authoritative server responding with a single SOA is valid. I guess it is, otherwise it would likely have been noticed before, but this is absolutely not obvious to me after re-reading RFC 1995 from scratch. And if it is indeed valid, this is yet another IXFR corner case that we will have to handle in dnsdist to be able to detect the end of a zone transfer.

@rgacogne
Member

It is valid, but why it is described in the "Brief description of the protocol" section of the RFC and not at all in the "Response Format" is a mystery to me. Oh well.

If an IXFR query with the same or newer version number than that of
the server is received, it is replied to with a single SOA record of
the server's current version, just as in AXFR.

@hlindqvist
Contributor Author

> It is valid, but why it is described in the "Brief description of the protocol" section of the RFC and not at all in the "Response Format" is a mystery to me. Oh well.
>
> > If an IXFR query with the same or newer version number than that of
> > the server is received, it is replied to with a single SOA record of
> > the server's current version, just as in AXFR.

Ok, so basically if the IXFR query's requested SOA.SERIAL >= the SOA.SERIAL in the first record in the response, then we're done?
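
The check proposed in this question could be sketched as below. This is only an illustration with hypothetical function names, not dnsdist's actual logic. Comparing serials with RFC 1982 style wraparound arithmetic is an added assumption; the thread itself only says ">=".

```python
from typing import Optional

SERIAL_MOD = 1 << 32   # SOA serials are 32-bit values
SERIAL_HALF = 1 << 31


def serial_gte(a: int, b: int) -> bool:
    """True if serial `a` is the same as or newer than `b`, allowing wraparound."""
    if a == b:
        return True
    # Assumption: "newer" means ahead by less than half the serial space.
    return 0 < ((a - b) % SERIAL_MOD) < SERIAL_HALF


def single_soa_ends_transfer(qtype: str, query_serial: Optional[int],
                             response_serial: int) -> bool:
    """Decide whether a lone SOA in the first response already ends the transfer."""
    if qtype != "IXFR" or query_serial is None:
        # For AXFR the first SOA only opens the transfer; more records follow.
        return False
    return serial_gte(query_serial, response_serial)
```

For the reproduction case, `single_soa_ends_transfer("IXFR", 2, 2)` is true, while the `IXFR=1` variant (`single_soa_ends_transfer("IXFR", 1, 2)`) is false, so the proxy would keep reading until the transfer's closing SOA.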

@rgacogne
Member

Looks that way, yes. But I'm afraid it means we cannot distinguish such a short IXFR response from the start of a normal AXFR response without parsing the SOA in the query.
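
To make "parsing the SOA in the query" concrete, here is a minimal Python sketch that pulls the serial out of a raw IXFR query. The helper names are hypothetical and this is not how dnsdist does it. It assumes a well-formed query with one question, no answer records, and the SOA as the first authority record; malformed input would raise an exception instead of being handled.

```python
import struct
from typing import Optional

IXFR, SOA = 251, 6  # QTYPE IXFR and RR type SOA


def skip_name(msg: bytes, pos: int) -> int:
    """Return the offset just past the (possibly compressed) name at `pos`."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1              # root label terminates the name
        if length & 0xC0 == 0xC0:
            return pos + 2              # a 2-byte compression pointer ends the name
        pos += 1 + length               # skip one ordinary label


def ixfr_query_serial(msg: bytes) -> Optional[int]:
    """Return the serial from an IXFR query's authority SOA, or None."""
    _id, _flags, qdcount, ancount, nscount, _ar = struct.unpack_from("!HHHHHH", msg, 0)
    if qdcount != 1 or ancount != 0 or nscount < 1:
        return None
    pos = skip_name(msg, 12)            # question name starts after the header
    qtype, _qclass = struct.unpack_from("!HH", msg, pos)
    if qtype != IXFR:
        return None
    pos = skip_name(msg, pos + 4)       # owner name of the first authority record
    rtype, _rclass, _ttl, _rdlength = struct.unpack_from("!HHIH", msg, pos)
    if rtype != SOA:
        return None
    pos = skip_name(msg, pos + 10)      # MNAME
    pos = skip_name(msg, pos)           # RNAME
    (serial,) = struct.unpack_from("!I", msg, pos)
    return serial
```

A proxy that remembers this value per query could tell a one-record "already up to date" IXFR response apart from the first SOA of an AXFR-style response.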
