dnsdist: Too many open files when backend authoritative server restarts #3300

Closed
rodecker opened this Issue Jan 25, 2016 · 15 comments

@rodecker

Hello,

We have dnsdist running in front of a single gdnsd instance. We also have carbonServer configured to send data to a Graphite instance. Occasionally when gdnsd restarts (for example because of a configuration update) we see the following:

Jan 25 01:34:26 hostname dnsdist[17920]: Marking downstream 127.0.0.1:5300 as 'down'
Jan 25 01:34:27 hostname dnsdist[17920]: Marking downstream 127.0.0.1:5300 as 'up'
Jan 25 01:49:28 hostname dnsdist[17920]: Marking downstream 127.0.0.1:5300 as 'down'
Jan 25 01:49:58 hostname dnsdist[17920]: Problem sending carbon data: Too many open files
Jan 25 01:50:47 hostname dnsdist[17920]: While reading a TCP question: accepting new connection on socket: Too many open files
Jan 25 01:50:58 hostname dnsdist[17920]: Problem sending carbon data: Too many open files
Jan 25 01:52:58 hostname dnsdist[17920]: message repeated 2 times: [ Problem sending carbon data: Too many open files]
Jan 25 01:53:57 hostname dnsdist[17920]: While reading a TCP question: accepting new connection on socket: Too many open files

Even though gdnsd is available again, dnsdist no longer sees it as up. A restart of dnsdist is required.

$ dnsdist -V
dnsdist 1.0.0-alpha1

$ cat /etc/dnsdist/config
controlSocket("0.0.0.0")
webserver("0.0.0.0:8080", "")
setKey("")
setACL({"0.0.0.0/0", "::/0"})
carbonServer('', '', 60)
truncateTC(true) -- fix up possibly badly truncated answers from pdns 2.9.22

warnlog(string.format("Script starting %s", "up!"))

-- define the good servers
newServer {address="127.0.0.1:5300", useClientSubnet=true}

$ cat /proc/sys/fs/file-max
900000

@ahupowerdns
Member

Do you have the fd-usage metric available somewhere?

@rgacogne rgacogne added this to the dnsdist-1.0.0 milestone Jan 25, 2016
@rgacogne
Member

Based on a PCAP file kindly provided by @rodecker, dnsdist accepts a new TCP connection which is closed right away by the remote end. dnsdist acknowledges the FIN, but sometimes does not close the connection (no FIN or RST sent). I am still trying to figure that out.

@rgacogne rgacogne added a commit to rgacogne/pdns that referenced this issue Jan 26, 2016
@rgacogne rgacogne dnsdist: Fix TCP clients threads vector and counters initialization
By tracking the FD leak reported in #3300, I observed that:
* we could create up to g_maxTCPClientThreads TCP threads,
but the corresponding vector size was hardcoded at 1024
(which is the default for g_maxTCPClientThreads)
* the counters were not explicitly initialized

This commit fixes that and adds some additional checks to make
sure we don't add more TCP client threads, as that could lead to
a race if the vector is resized.
a9bf3ec
@rgacogne
Member

So apparently the TCP acceptor thread accepts a new connection, then gets a file descriptor in order to notify the selected TCP client thread. For a reason yet unknown, the value of this FD is 0, which is mapped to /dev/null, causing the connection to be leaked:

[pid 25209] <... accept resumed> {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("192.0.2.1")}, [16]) = 85
[pid 25209] write(0, "\240\23\0,\222\177\0\0", 8) = 8
lrwx------ 1 root root 64 Jan 26 14:32 0 -> /dev/null
lrwx------ 1 root root 64 Jan 26 14:32 1 -> /dev/null

The above-mentioned commit adds several checks and minor fixes to try to fix this bug, but I am not 100% convinced it will be enough, as I am not able to reproduce it.
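
As a general illustration of this failure mode (not dnsdist's actual code, which is C++), the sketch below hands an accepted connection to a worker through a notification pipe. The names `UNSET_FD` and `hand_off` are hypothetical. The point is that 0 is a valid descriptor, so a slot that was never assigned should hold a sentinel such as -1. A write to descriptor 0 can succeed silently, as in the strace output above, and the accepted connection is then never closed.

```python
import os
import struct

# Hypothetical sentinel: 0 is a valid descriptor (stdin), so "not assigned"
# must be a value that can never be a real descriptor.
UNSET_FD = -1


def hand_off(conn_fd: int, notify_fd: int) -> bool:
    """Pass conn_fd to a worker via notify_fd; close conn_fd if that fails."""
    if notify_fd < 0:
        os.close(conn_fd)  # nobody will ever pick it up, so do not leak it
        return False
    try:
        # Send the descriptor number to the worker as a native int.
        os.write(notify_fd, struct.pack("i", conn_fd))
    except OSError:
        os.close(conn_fd)
        return False
    return True
```

With a sentinel, a failed hand-off is visible to the caller, which can log and count it, rather than the descriptor count creeping up until accept() fails with "Too many open files".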

@rodecker

Thanks! This appears to have solved it. I have not seen any more occurrences of this issue as of yet.

@rgacogne
Member

@rodecker Thanks for the update!

@rgacogne rgacogne closed this Feb 25, 2016
@dimsua
dimsua commented Apr 27, 2016

dnsdist-1.0.0-1pdns.el6.x86_64

Apr 27 13:00:06 client5545 dnsdist[6911]: Read configuration from '/etc/dnsdist/dnsdist.conf'
Apr 27 13:00:06 client5545 dnsdist[25110]: Got control connection from 127.0.0.1:49168
Apr 27 13:00:06 client5545 dnsdist[25110]: Closed control connection from 127.0.0.1:49168
Apr 27 13:00:07 client5545 dnsdist[25110]: While reading a TCP question: accepting new connection on socket: Too many open files
Apr 27 13:00:07 client5545 dnsdist[25110]: While reading a TCP question: accepting new connection on socket: Too many open files
Apr 27 13:00:20 client5545 dnsdist[6929]: Read configuration from '/etc/dnsdist/dnsdist.conf'
Apr 27 13:00:20 client5545 dnsdist[6929]: Fatal error: connecting socket to 127.0.0.1:5199: Connection refused

@rgacogne
Member

Hi dimsua,

Thank you for reporting this. Do you have any additional information you could provide? Do you export metrics to a graphite/metronome server?

@dimsua
dimsua commented Apr 27, 2016

We do not use graphite/metronome, only Zabbix. We wrote a simple Zabbix plugin
to collect queries, for example:
echo "dumpStats()"|dnsdist --client|grep queries|awk '{print $3,$4}'|grep "^queries "|awk '{print $2}'
Do we need to monitor fd-usage?

@rgacogne
Member

Yes, it would be nice if you could monitor that and report how it increases. Could you also check the maximum number of open files dnsdist has? If you are on Linux, you can do something like:
grep -E '^Max open files' /proc/$(pidof dnsdist)/limits
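
The same check can be scripted, for instance to feed a monitoring system. The sketch below is Linux-only and assumes the standard /proc layout; the function names are hypothetical. It parses the soft and hard limits from the text of /proc/<pid>/limits and counts the entries under /proc/<pid>/fd.

```python
import os
import re
from typing import Tuple


def parse_max_open_files(limits_text: str) -> Tuple[int, int]:
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    match = re.search(r"^Max open files\s+(\d+)\s+(\d+)", limits_text, re.MULTILINE)
    if match is None:
        raise ValueError("no numeric 'Max open files' row found")
    return int(match.group(1)), int(match.group(2))


def count_open_fds(pid: int) -> int:
    """Count the descriptors currently open in a process (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom(pid: int) -> Tuple[int, int]:
    """Return (open descriptors, soft limit) for the given process."""
    with open(f"/proc/{pid}/limits") as handle:
        soft, _hard = parse_max_open_files(handle.read())
    return count_open_fds(pid), soft
```

Calling `fd_headroom()` with the dnsdist PID at regular intervals shows how quickly usage approaches the soft limit. Reading another process's fd directory usually requires running as the same user or as root.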

Could you indicate an estimate of the number of UDP and TCP binds you have, as well as the number of backends? Did you change the value of setMaxTCPClientThreads()?

Off the top of my head, I think that the maximum number of file descriptors used by dnsdist is something like "number of UDP binds" + "number of TCP binds" + "number of backends (for UDP listeners)" + 1 for the control server + "number of console connections" + 1 for the web server + "number of web connections" + 1 for carbon export + setMaxTCPClientThreads() * "number of backends".
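
The estimate above can be turned into a small calculation. This is only a sketch of the formula as stated in this comment, not an exact accounting of dnsdist internals; the function name, parameter names, and example values are illustrative assumptions.

```python
def estimate_max_fds(udp_binds, tcp_binds, backends, console_connections,
                     web_connections, max_tcp_client_threads):
    """Upper-bound estimate of dnsdist file descriptors, per the formula above."""
    return (
        udp_binds
        + tcp_binds
        + backends                      # one per backend, for the UDP listeners
        + 1                             # control server
        + console_connections
        + 1                             # web server
        + web_connections
        + 1                             # carbon export
        + max_tcp_client_threads * backends
    )


# Hypothetical setup: 4 UDP and 4 TCP binds, 3 backends, 1 console connection,
# no web connections, and 10 TCP client threads.
small = estimate_max_fds(4, 4, 3, 1, 0, 10)

# Same hypothetical setup with 250 backends: the product term dominates.
large = estimate_max_fds(4, 4, 250, 1, 0, 10)
```

Here `small` is 45 and `large` is 2762. The last term grows with the product of TCP client threads and backends, so a soft limit of 1024 can be exceeded with many backends even without any leak.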

@dimsua
dimsua commented Apr 27, 2016 edited
  1. OK, we will monitor fd-usage
  2. Max open files 1024 4096 files
  3. Our settings:
controlSocket("127.0.0.1")
setKey("********")
newServer({address="*.*.*.*", name="***", qps=2000, order=1, weight=1, checkName="domain.name.", checkType="A", maxCheckFailures=1, retries=5})
newServer({address="*.*.*.*", name="***", qps=2000, order=2, weight=2, checkName="domain.name.", checkType="A", maxCheckFailures=1, retries=5})
newServer({address="*.*.*.*", name="***", qps=2000, order=3, weight=2, checkName="domain.name.", checkType="A", maxCheckFailures=1, retries=5})
setServerPolicy(firstAvailable)
addLocal("0.0.0.0:53", true, true)
addLocal("0.0.0.0:53", true, true)
addLocal("0.0.0.0:53", true, true)
addLocal("0.0.0.0:53", true, true)
setACL("0.0.0.0/0")
pc = newPacketCache(10000, 86400, 0, 60, 60)
getPool(""):setCache(pc)
addAction(AndRule({makeRule("0.0.0.0/0"), NotRule(MaxQPSRule(1000)), TCPRule(false)}), TCAction())
setVerboseHealthChecks(true)
setMaxUDPOutstanding(35535)
@dimsua
dimsua commented Apr 27, 2016

Off the top of my head, I think that the maximum number of file descriptors used by dnsdist is something like "number of UDP binds" + "number of TCP binds" + "number of backends (for UDP listeners)" + 1 for the control server + "number of console connections" + 1 for the web server + "number of web connections" + 1 for carbon export + setMaxTCPClientThreads() * "number of backends".

Thanks, we will try changing the open files limit.

@pr0vieh
pr0vieh commented May 18, 2016

Hi,

dnsdist[14983]: Had an error accepting new webserver connection: accepting new connection on socket: Too many open files
after an uptime of 2-3 hours :(

@rgacogne
Member

Hi @pr0vieh,

Did you have a look at the above comments? There are several legitimate reasons for dnsdist running out of file descriptors, in which case you might simply want to increase the maximum number of open files. Could you please post your configuration somewhere, and monitor file descriptor usage either via graphite/metronome or simply via dumpStats()?

@pr0vieh
pr0vieh commented May 18, 2016 edited

Yay, I adapted PR #3866 for dnsdist and everything works fine 👍
Now it handles ~4.5k QPS split over 250 backend servers on a VPS with 1 core and 1 GB RAM 😍
