Upstream DNS connection pooling broken. #659

Closed
saltama opened this issue Mar 23, 2019 · 30 comments

@saltama

saltama commented Mar 23, 2019

Hi, sorry for ignoring the issue template, but let me just describe what has been broken over the last few releases.
I really want to use AdGuard as my main DNS server given all the features it has, but while 0.92-hotfixX was barely usable for me, I can't use the last two releases without restarting AdGuard every hour or so (CPU at 100%, stops responding), and sometimes multiple restarts are needed before it resolves anything.

In short, with 0.94 I'm seeing a huge number of i/o timeout errors. They sometimes start appearing at startup, other times after a while (sooner with parallel queries enabled), and on rare occasions (roughly 10% of runs) they don't manifest at all.
Describing and debugging this kind of issue is a bit hard, so bear with me if the description below sometimes seems a bit incoherent, but I think the root cause lies in how the connection pool has been implemented.

This should be quite easy to replicate even on a laptop or in a Docker container; enabling parallel queries reaches the breaking point sooner. Testing on a Raspberry Pi with 512 MB of RAM probably made these issues easier to see.

Setup: I'm running it on a Raspberry Pi Zero (Debian Stretch, 512 MB RAM), using the binaries you provide. AdGuard Home is configured as a systemd service that starts automatically at boot; no other significant services are active.

Let me recap what I was seeing in the latest releases:

0.92:

Issues with some types of upstream servers, fixed by the hotfixes.
When started by systemd, AdGuard doesn't resolve anything for 10 minutes or so, then starts working. Most of the time restarting it fixes the issue. Looking at the verbose log, I noticed that the first few upstream connections were failing with high RTTs (60-120 seconds).

0.93

Still unresponsive for the first 10 minutes. After half an hour the process takes 100% of the CPU and stops responding. Restarting helps, but only for another 30 minutes. I didn't test this release much; I was just waiting for the next one.

0.94

Did you introduce connection pooling in addition to the parallel query mechanism?
I can now see messages hinting that an upstream DNS connection has been reused, but the pooling mechanism appears to be broken.
Under various circumstances connections stop being reused and a new upstream connection is created every time AdGuard Home handles a new incoming query, which makes AdGuard unresponsive after a while.

This is what I'm seeing in a typical session, with parallel queries enabled to reach the point where the errors appear faster; as said above, even with parallel off, the i/o timeout errors below sometimes start right away:

  1. It looks like there is some sort of pooling mechanism for upstream DNS server connections, since sometimes I see these messages:

     2019/03/23 08:12:03 5629#372 [debug] github.com/AdguardTeam/dnsproxy/upstream.(*TLSPool).Get(): Returning existing connection to 9.9.9.9:853 with updated deadLine
    

    BUT after a short while the log is full of ordinary successful connection attempts without any reuse, as if the pool no longer had any reusable connections available:

     2019/03/23 07:56:03 3787#4781 [debug] github.com/AdguardTeam/dnsproxy/upstream.createDialContext.func1(): Dialing to 1.0.0.1:853
     2019/03/23 07:56:04 3787#4722 [debug] github.com/AdguardTeam/dnsproxy/upstream.createDialContext.func1(): dialer successfully initialize connection to 1.0.0.1:853 in 553 milliseconds
    
  2. After a while AdGuard stops resolving entirely, RTTs increase to 60-100+ seconds, I no longer see those "Returning existing connection" or "successfully initialize connection" messages at all, and every query fails:

     2019/03/23 08:24:44 5629#636 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).Resolve(): RTT: 62833 ms
     2019/03/23 08:24:44 5629#636 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).logDNSMessage(): OUT: ;; opcode: QUERY, status: SERVFAIL, id: 28741
     ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
     
     ;; QUESTION SECTION:
     ;blog.ycombinator.com.	IN	 A
     
     2019/03/23 08:24:44 5629#636 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).handleUDPPacket(): error handling DNS (udp) request: talking to dnsUpstream failed, cause: all upstreams failed to exchange, cause: Failed to create a new connection from TLSPool to 1.1.1.1:853, cause: 
     Failed to connect to 1.1.1.1, cause: write tcp 10.0.0.165:41006->1.1.1.1:853: i/o timeout (hidden: Failed to create a new connection from TLSPool to 9.9.9.9:853, cause: Failed to connect to 9.9.9.9, cause: write tcp 10.0.0.165:46310->9.9.9.9:853: i/o timeout, Failed to create a new connection from TLSPool to 1.0.0.1:853, cause: 
     Failed to connect to 1.0.0.1, cause: write tcp 10.0.0.165:33780->1.0.0.1:853: i/o timeout, Failed to create a new connection from TLSPool to 149.112.112.112:853, cause: 
     Failed to connect to 149.112.112.112, cause: write tcp 10.0.0.165:58572->149.112.112.112:853: i/o timeout)
    
  3. After a while, the i/o timeout errors turn into this:

     2019/03/23 08:48:33 5629#6221 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).handleUDPPacket(): error handling DNS (udp) request: talking to dnsUpstream failed, cause: all upstreams failed to exchange, cause: 
     Failed to get a connection from TLSPool to 1.0.0.1:853, cause: Failed to connect to 1.0.0.1, cause: all dialers failed to initialize connection: , cause: dial tcp 1.0.0.1:853: socket: too many open files (hidden: Failed to get a connection from TLSPool to 9.9.9.9:853, cause: Failed to connect to 9.9.9.9, cause: all dialers failed to initialize connection: , cause: dial tcp 9.9.9.9:853: socket: too many open files, 
     Failed to get a connection from TLSPool to 1.1.1.1:853, cause: Failed to connect to 1.1.1.1, cause: all dialers failed to initialize connection: , cause: dial tcp 1.1.1.1:853: socket: too many open files, Failed to get a connection from TLSPool to 149.112.112.112:853, cause: 
     Failed to connect to 149.112.112.112, 
     cause: all dialers failed to initialize connection: , 
     cause: dial tcp 149.112.112.112:853: socket: too many open files)
    

My personal and probably wrong diagnosis without even looking at the code

The pooling mechanism, likely introduced to keep the number of upstream connections in check, is broken.

Possible reasons:

  • The pool gets filled with closed/stalled/way-past-deadline connections that render it useless and lead to an explosion of open connections toward the upstream DNS servers (will they react by blacklisting us?).
  • The pool works correctly but is being filled with closed/broken connections that get removed right away (why?), and we are not disposing of those connections correctly, since the number of open files reaches the allowed limit.

I would add a mechanism that keeps the pool fresh by periodically removing closed/stalled/past-deadline/otherwise-dead connections, and I would cap the maximum number of connections that can be created.
Ideally, the pool should hold a number of connections proportional to the number of upstream servers (1-3 connections each?) and NOT allow the creation of additional ones. If the pool is full, you wait; creating additional connections when the pool is full does not make sense.

Clients of the pool should only be able to perform requests sequentially (request order must be preserved). The freshness mechanism should flush stuck pools (pools whose connections have all been leased out for too long), either completely or per IP, when no client has been able to get a connection for a reasonable amount of time (a few seconds?).
Also, loss of overall connectivity should be handled properly (stop the pool from creating connections and flush the old ones it has? reply with failures to every DNS request AdGuard receives?).
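
To make the idea concrete, here is a minimal sketch of the kind of bounded pool I have in mind (just an illustration, not the dnsproxy code; the names, the fixed capacity, and the nil-slot trick are all made up):

package pool

import (
	"context"
	"net"
)

// boundedPool holds at most cap(slots) connections per upstream.
// A nil value in the channel marks a free slot: the caller may dial a new connection.
type boundedPool struct {
	slots chan net.Conn
	dial  func(ctx context.Context) (net.Conn, error)
}

func newBoundedPool(max int, dial func(ctx context.Context) (net.Conn, error)) *boundedPool {
	p := &boundedPool{slots: make(chan net.Conn, max), dial: dial}
	for i := 0; i < max; i++ {
		p.slots <- nil
	}
	return p
}

// Get blocks until a slot is available instead of dialing without limit.
func (p *boundedPool) Get(ctx context.Context) (net.Conn, error) {
	select {
	case c := <-p.slots:
		if c != nil {
			return c, nil // reuse an idle connection
		}
		conn, err := p.dial(ctx)
		if err != nil {
			p.slots <- nil // give the slot back on failure
			return nil, err
		}
		return conn, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// Put returns a healthy connection to the pool; a broken one only frees its slot.
func (p *boundedPool) Put(c net.Conn, broken bool) {
	if broken {
		if c != nil {
			c.Close()
		}
		p.slots <- nil
		return
	}
	p.slots <- c
}

With something like this, "pool is full" means callers wait in Get instead of opening yet another socket, and broken connections free their slot instead of leaking a file descriptor.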

@ameshkov
Member

@saltama hi! Quick question: do you have any issues when you use DoH upstreams and not DoT? They rely on the native Go HTTP implementation with its own connection keep-alive mechanism.
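
For context, the keep-alive mechanism referred to here is the standard net/http Transport connection reuse; a minimal sketch of a client configured that way (the limits and timeouts below are illustrative, not AdGuard Home's actual values):

package doh

import (
	"net/http"
	"time"
)

// newDoHClient returns an HTTP client whose Transport keeps idle HTTPS
// connections to the DoH server alive and reuses them across queries.
// The limits and timeouts are examples, not AdGuard Home's real settings.
func newDoHClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        20,
			MaxIdleConnsPerHost: 2,
			IdleConnTimeout:     30 * time.Second,
			TLSHandshakeTimeout: 10 * time.Second,
		},
	}
}

Every DoH POST made through such a client transparently reuses an idle HTTPS connection to the upstream when one is available.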

@saltama
Author

saltama commented Mar 23, 2019

With only DoH it either works (50/50; I'm not sure for how long, but it seems more stable) or doesn't work at all from the beginning, with this result for every query:

    2019/03/23 12:19:20 7481#245 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).handleUDPPacket(): 
    error handling DNS (udp) request: talking to dnsUpstream failed, 
    cause: all upstreams failed to exchange, 
    cause: couldn't do a POST request to 'https://dns.quad9.net:443/dns-query', 
    cause: Post https://dns.quad9.net:443/dns-query: net/http: request canceled while waiting for connection 
    (Client.Timeout exceeded while awaiting headers) (hidden: couldn't do a POST request to 'https://cloudflare-dns.com:443/dns-query', 
    cause: Post https://cloudflare-dns.com:443/dns-query: net/http: request canceled while waiting for connection
    (Client.Timeout exceeded while awaiting headers))

Also, in this case I don't see the familiar "Returning existing connection" message anymore.
If connections are handled differently for DoH, this could be a separate, additional issue.

@ameshkov ameshkov added the "needs investigation" (Needs to be reproduced reliably) label Mar 25, 2019
@ameshkov
Member

Adguard started by systemd, doesn't resolve anything for 10 minutes or so, then starts working

Btw, this sounds like a different issue, and it is Raspbian-related. Please check the last comment in this thread:
#633

@ameshkov ameshkov added this to the v0.95 milestone Mar 25, 2019
@ameshkov
Member

write tcp 10.0.0.165:58572->149.112.112.112:853: i/o timeout -- this one means that AGH failed to initialize a new connection for the pool because of a timeout error.

Tbh, I suppose the issue is not the pooling, but something related to #633. We just need to figure out what exactly is causing this behavior.

Could you please tell us what device you are running, the OS version, etc.? We'll try to reproduce it on our side.

@saltama
Author

saltama commented Mar 26, 2019

Raspberry Pi Zero W, latest Raspbian Stretch.
Yep, the second issue is the one described in #633: unresponsive for 5 minutes regardless of DoH/DoT, then it starts working.
Don't know about the other one, since it seems newer.
I will post some comments on the other thread and try to see if I can discover something else.

@saltama
Author

saltama commented Mar 27, 2019

I had it configured with Quad9/Cloudflare DoH, and after a day or so I noticed that one of the AdGuard threads (of the 8 listed by top) was using 100% CPU as usual.
Attached is the result of sending a manual SIGSEGV to that thread; it could be useful to spot something unusual (I know, a bit crude ;D).
Please note that it had been running since yesterday, but 15 minutes or so before sending the SIGSEGV I changed the configuration from parallel to non-parallel (this is reflected in the age of some of the goroutines).

adguardhome.log.txt

Edit:
In that file there are 238 active goroutines, and most of them are either waiting on a mutex or doing TLS or HTTP work. I have only 3 clients on that network, if it matters.

In addition to the errors I've already listed, this morning I also saw an error related to the source of randomness being empty (something like "urandom has no other randoms, waiting 60s") that I had never seen before (omitted from the log; it happened a few hours before what's shown there).

Regarding the "too many open files" error, the Pi Zero has some limits that are half of what the bigger Raspberry Pis have.

On a Pi Zero:

raspberrypi:~ $ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3780
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3780
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
raspberrypi:~ $ ulimit -Hn
1048576
raspberrypi:~ $ ulimit -Sn
1024
raspberrypi:~ $ cat /proc/sys/fs/file-max
48258

On a Pi3:

pi3:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7345
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 2
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 7345
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
pi3:~$ ulimit -Hn
1048576
pi3:~$ ulimit -Sn
1024
pi3:~$ cat /proc/sys/fs/file-max
93882

@szolin
Contributor

szolin commented Mar 27, 2019

In that file there are 238 active goroutines, and most of those are either waiting on a mutex or doing tls or http stuff. I've only 3 clients in that network, if it matters.

There are several lines like goroutine 27503 [chan receive, 20 minutes], which means those goroutines have been waiting for a TLS handshake to complete for about 20 minutes - and that's way too long.
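
As an aside, a handshake can usually only hang like that when no deadline is set on the underlying connection before it starts. A hypothetical sketch of a DoT dial bounded by a deadline (not the actual dnsproxy code; the timeout values are arbitrary):

package dot

import (
	"crypto/tls"
	"net"
	"time"
)

// dialDoT bounds both the TCP dial and the TLS handshake so a goroutine
// can never sit on either for 20 minutes.
func dialDoT(addr, serverName string) (*tls.Conn, error) {
	raw, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	// The deadline set here also bounds the Handshake call below.
	_ = raw.SetDeadline(time.Now().Add(10 * time.Second))
	conn := tls.Client(raw, &tls.Config{ServerName: serverName})
	if err := conn.Handshake(); err != nil {
		raw.Close()
		return nil, err
	}
	// Clear the deadline so normal reads/writes use their own timeouts.
	_ = raw.SetDeadline(time.Time{})
	return conn, nil
}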

Anyway, I've made a patch that allows setting a user-defined per-process limit in the config file. Would it be convenient for you if I send you a binary for your platform so you can check whether it solves your problems?

@saltama
Author

saltama commented Mar 27, 2019

Sure! Put it somewhere and I'll try it (arm, not arm64).

@szolin
Contributor

szolin commented Mar 27, 2019

OK, so you should edit the config file and add an rlimit_nofile: <NUMBER> setting, like this:

...
language: en
rlimit_nofile: 8096
dns:
...

On both your systems the default value is 1024, so 8096 should be more than enough.

AdGuardHome_v0.94-4-g0c86-dirty_linux_arm.tar.gz
SHA-1: 8e519c4d80397dbb4de1a266877b0399d6845730
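
For the record, raising the limit from inside a Go process boils down to a syscall.Setrlimit call; presumably the rlimit_nofile option does something along these lines (this is only a guess, and the function name here is made up):

package limits

import "syscall"

// setMaxOpenFiles raises the soft RLIMIT_NOFILE limit for the current process.
func setMaxOpenFiles(n uint64) error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return err
	}
	lim.Cur = n
	if lim.Max < n {
		lim.Max = n // raising the hard limit may require extra privileges
	}
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
}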

@saltama
Author

saltama commented Mar 27, 2019

Testing it now. Have you added a 30s timeout somewhere too? The CPU usage drops back to normal levels after 30s when it stalls at 100% (if not, this is something new I noticed now that I'm looking at it closely).

@szolin
Contributor

szolin commented Mar 27, 2019

have you added a 30s timeout somewhere too?

No, it's definitely something else.

@ameshkov
Member

ameshkov commented Mar 28, 2019

Well, I must say that there were a lot of "too many open files" errors in your log, and it's hard to predict what kind of consequences they might cause. So that might be the real cause.

@saltama
Author

saltama commented Mar 28, 2019

For now I'm not seeing improvements. I could try tweaking the limit value a bit (e.g. 1000) while running with verbose logging and see what happens.
Yeah, I agree that this kind of error could cause a lot of different weird side effects...

@szolin
Contributor

szolin commented Mar 28, 2019

For now i'm not seeing improvements

What's your current rlimit_nofile setting value?

eg. 1000

This is way too low. You were experiencing problems with 1024 (the default); setting this value to 1000 will make things worse.

@saltama
Author

saltama commented Mar 28, 2019

What's your current rlimit_nofile setting value?

Your 8096, but I didn't check the log for specific errors.

I'm redoing the tests (starting with 8k and then increasing, even though increasing the limit is just a workaround) with this script, which tries to resolve random domains:

#!/usr/bin/env python3
from time import sleep
import socket
import random
import string
import argparse
import functools


if __name__ == "__main__":
    # Flush after every print so progress dots show up immediately.
    print = functools.partial(print, flush=True)

    parser = argparse.ArgumentParser(description='Simple stress test for a DNS server')
    parser.add_argument('delay', metavar='N', type=float,
                        help='delay between attempts in seconds')  # even 0.5 is enough

    args = parser.parse_args()
    delay = args.delay
    random.seed()
    i = 0
    print("Resolving random domains:")
    while True:
        i += 1
        # Random prefix so every query misses the cache and reaches the upstream.
        addr = ''.join(random.sample(string.ascii_letters + string.digits, 16)) + "dnstest.com"
        try:
            socket.gethostbyname(addr)
        except socket.gaierror:
            pass  # NXDOMAIN and similar failures are expected
        print(".", end="")
        if i % 20 == 0:
            print("")
        sleep(delay)
The fact is that it's hard to replicate... at some point it stops resolving, which could be after a few minutes or after hours, something that makes me think of concurrency/resource/things-not-being-closed issues.

@ameshkov
Member

Your 8096, but i didn't check the log for specific errors.

It'd be interesting to check if there are any "too many open files" errors now.

The fact is that it's hard to replicate... at some point it stops resolving, it could be after a few minutes or after hours, something that makes me think about concurrency/resource/stuff_not_closed issues.

I'll try running this script on my VPS with AGH; maybe I'll see something.

@ameshkov
Member

So it has been running for 11 hours and I have no issues so far:
https://uploads.adguard.com/up04_2fx7x_AdGuard_Home.png

I am guessing this must be something with your Raspbian, and to reproduce it we should have the very same configuration.

Questions:

  1. RPi version?
  2. What packages are installed? What are their versions?
  3. How is your RPi connected to the network? Wi-Fi, ethernet?

Btw, could it be that your RPi is losing connectivity at some point (a networking restart, for instance)? This might explain the timeout errors.

@saltama
Author

saltama commented Mar 29, 2019

So it has been running for 11 hours and I have no issues so far

If you are using a VPS, I have no problem believing that :)

Raspberry Pi Zero W, Raspbian Stretch (the only OS readily available), package list.
The Raspberry Pi is connected via the onboard Wi-Fi.

Btw, could it be that your RPi may be losing connectivity at some point (networking restart for instance)? This might explain timeout errors.

Possible, and what if AdGuard Home isn't able to recover correctly from those errors, or from too many errors while communicating with the upstream DNS?

Btw, I booted up the Pi 2 hours ago (my 4 usual TLS upstreams + parallel) and checked just now: still blocked at 100% CPU with hundreds of threads.

@ameshkov
Member

Btw, booted up the pi 2 hours ago (my 4 usual tls upstreams+parallel) and checked now, still blocked at 100% cpu with hundred of threads.

Well, it just means that queries to upstreams are still getting stuck on i/o timeout errors.

It sounds as if you have some kind of a rate limiter that starts dropping outgoing packets at some point.

Could you please check iptables just in case?

  • iptables --list
  • iptables -t nat --list

@szolin am I missing anything here?

@ameshkov
Member

ameshkov commented Mar 29, 2019

@saltama and one more thing to try.

Here I've slightly modified your script; now it just makes requests to Cloudflare's DoH server directly:
https://gist.github.com/ameshkov/d03f0e4f3014cb642d8766169a6855a6

Could you please try running it for some time and see whether there are any timeout errors?
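
The gist isn't reproduced here, but the idea is simply to query Cloudflare's DoH endpoint directly in a loop and watch for timeouts. A rough standalone sketch of the same idea (it uses Cloudflare's JSON API for brevity, which is not what the gist does):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for {
		req, err := http.NewRequest("GET",
			"https://cloudflare-dns.com/dns-query?name=example.com&type=A", nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("Accept", "application/dns-json") // Cloudflare's JSON DoH API
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("\nerror:", err) // any i/o timeout would show up here
		} else {
			resp.Body.Close()
			fmt.Print(".")
		}
		time.Sleep(500 * time.Millisecond)
	}
}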

@saltama
Author

saltama commented Mar 29, 2019

pi@raspberrypi:~ $ sudo iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
pi@raspberrypi:~ $ sudo iptables -t nat --list
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

I'm trying to record a video of it, but as expected, when you look at it the issue doesn't happen...

@ameshkov
Member

ameshkov commented Mar 29, 2019

In addition to #659 (comment)

There's one more place where rate limiting might be happening -- your router. Try checking the iptables rules there as well.

However, if my script works okay, then the problem is not rate limiting, and it means something is wrong with AGH (hopefully it's not some Golang issue).

@saltama
Author

saltama commented Mar 29, 2019

Sure, I will keep that running to verify that I still have connectivity toward Cloudflare.
The router is a DSL modem with a minimal CLI/OS; I don't think it supports complex rules like rate limiting.

@ameshkov
Member

ameshkov commented Apr 1, 2019

Sure, will keep that running to verify that i still have connectivity toward clouflare.

How is it going? Are there any issues with my Python script? (Check stdout for errors.)

@saltama
Author

saltama commented Apr 1, 2019

I won't be able to test anything for a few days, but don't despair, I'll report back soon.

@ameshkov
Member

ameshkov commented Apr 2, 2019

Thanks, we'll be waiting!

@ameshkov ameshkov removed this from the v0.95 milestone Apr 23, 2019
@ameshkov ameshkov added this to the v0.96 milestone Apr 23, 2019
@szolin
Contributor

szolin commented Jun 3, 2019

@saltama Regarding the i/o timeout errors you are experiencing - please take a look at what tcpdump shows for a connection that is hanging, or send me your tcpdump capture and I can try to extract the relevant information from it. We need to link the AGH logs to the TCP connections and see what is really taking so long there.

@szolin szolin modified the milestones: v0.96, v0.97 Jun 4, 2019
@ameshkov ameshkov removed this from the v0.97 milestone Jul 2, 2019
@ameshkov
Member

ameshkov commented Jul 2, 2019

See #633

@ameshkov ameshkov closed this as completed Jul 2, 2019
@saltama
Author

saltama commented Jul 2, 2019

Sorry for not reporting back... I'm verifying whether this issue has disappeared on Buster as well.
For now it seems that may be the case.

@ameshkov
Member

ameshkov commented Jul 4, 2019

Awesome, thank you!
