Upstream DNS connection pooling broken. #659

Closed
saltama opened this issue Mar 23, 2019 · 30 comments

@saltama

saltama commented Mar 23, 2019

Hi, sorry for ignoring the issue template, but let me just describe what has been broken over the last few releases.
I really want to use AdGuard as my main DNS server given all the features it has, but while 0.92-hotfixX was barely usable for me, I can't use the last two releases without restarting AdGuard every hour or so (CPU at 100%, stops responding), and sometimes multiple restarts are needed before it resolves anything.

In short, with 0.94 I'm seeing a huge number of i/o timeout errors. They sometimes start appearing at startup, other times after a while (sooner with parallel queries enabled), and on rare occasions (roughly 10% of runs) they don't manifest at all.
Describing and debugging this kind of issue is a bit hard, so bear with me if the description below sometimes seems a bit incoherent, but I think the root cause lies in how the connection pool has been implemented.

This should be quite easy to replicate even on a laptop or in a Docker container; enabling parallel queries reaches the breaking point sooner. Testing on a Raspberry Pi with 512 MB of RAM probably made these issues easier to see.

Setup: I'm running it on a Raspberry Pi Zero (Debian Stretch, 512 MB RAM), using the binaries you provide. AdGuard Home is configured as a systemd service that starts automatically at boot; no other significant services are active.

Let me recap what I was seeing in the latest releases:

0.92:

Issues with some types of upstream servers, fixed by the hotfixes.
When started by systemd, AdGuard doesn't resolve anything for 10 minutes or so, then starts working. Most of the time restarting it fixes the issue. Looking at the verbose log, I noticed that the first few upstream connections were failing with high RTTs (60-120 seconds).

0.93

Still unresponsive for the first 10 minutes. After half an hour the process takes 100% of the CPU and stops responding. Restarting helps, but only for another 30 minutes. I didn't test this release much; I was just waiting for the next one.

0.94

Did you introduce connection pooling in addition to the parallel query mechanism?
I can now see messages hinting that an upstream DNS connection has been reused, but the pooling mechanism appears to be broken.
Under various circumstances connections stop being reused and a new upstream connection is created every time AdGuard Home handles a new incoming query, which makes AdGuard unresponsive after a while.

This is what I'm seeing in a typical session, with parallel queries enabled to reach the point where the errors appear faster; as said above, even with parallel off, the i/o timeout errors below sometimes start right away:

  1. It looks like there is some sort of pooling mechanism for upstream DNS server connections, since sometimes I see these messages:

     2019/03/23 08:12:03 5629#372 [debug] github.com/AdguardTeam/dnsproxy/upstream.(*TLSPool).Get(): Returning existing connection to 9.9.9.9:853 with updated deadLine
    

    BUT after a short while the log is full of ordinary successful connection attempts without any reuse, as if the pool no longer had any reusable connections available:

     2019/03/23 07:56:03 3787#4781 [debug] github.com/AdguardTeam/dnsproxy/upstream.createDialContext.func1(): Dialing to 1.0.0.1:853
     2019/03/23 07:56:04 3787#4722 [debug] github.com/AdguardTeam/dnsproxy/upstream.createDialContext.func1(): dialer successfully initialize connection to 1.0.0.1:853 in 553 milliseconds
    
  2. After a while AdGuard stops resolving entirely, RTTs increase to 60-100+ seconds, I no longer see those "Returning existing connection" or "successfully initialize connection" messages at all, and every query fails:

     2019/03/23 08:24:44 5629#636 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).Resolve(): RTT: 62833 ms
     2019/03/23 08:24:44 5629#636 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).logDNSMessage(): OUT: ;; opcode: QUERY, status: SERVFAIL, id: 28741
     ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
     
     ;; QUESTION SECTION:
     ;blog.ycombinator.com.	IN	 A
     
     2019/03/23 08:24:44 5629#636 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).handleUDPPacket(): error handling DNS (udp) request: talking to dnsUpstream failed, cause: all upstreams failed to exchange, cause: Failed to create a new connection from TLSPool to 1.1.1.1:853, cause: 
     Failed to connect to 1.1.1.1, cause: write tcp 10.0.0.165:41006->1.1.1.1:853: i/o timeout (hidden: Failed to create a new connection from TLSPool to 9.9.9.9:853, cause: Failed to connect to 9.9.9.9, cause: write tcp 10.0.0.165:46310->9.9.9.9:853: i/o timeout, Failed to create a new connection from TLSPool to 1.0.0.1:853, cause: 
     Failed to connect to 1.0.0.1, cause: write tcp 10.0.0.165:33780->1.0.0.1:853: i/o timeout, Failed to create a new connection from TLSPool to 149.112.112.112:853, cause: 
     Failed to connect to 149.112.112.112, cause: write tcp 10.0.0.165:58572->149.112.112.112:853: i/o timeout)
    
  3. After a while, the i/o timeout errors turn into this:

     2019/03/23 08:48:33 5629#6221 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).handleUDPPacket(): error handling DNS (udp) request: talking to dnsUpstream failed, cause: all upstreams failed to exchange, cause: 
     Failed to get a connection from TLSPool to 1.0.0.1:853, cause: Failed to connect to 1.0.0.1, cause: all dialers failed to initialize connection: , cause: dial tcp 1.0.0.1:853: socket: too many open files (hidden: Failed to get a connection from TLSPool to 9.9.9.9:853, cause: Failed to connect to 9.9.9.9, cause: all dialers failed to initialize connection: , cause: dial tcp 9.9.9.9:853: socket: too many open files, 
     Failed to get a connection from TLSPool to 1.1.1.1:853, cause: Failed to connect to 1.1.1.1, cause: all dialers failed to initialize connection: , cause: dial tcp 1.1.1.1:853: socket: too many open files, Failed to get a connection from TLSPool to 149.112.112.112:853, cause: 
     Failed to connect to 149.112.112.112, 
     cause: all dialers failed to initialize connection: , 
     cause: dial tcp 149.112.112.112:853: socket: too many open files)
    

My personal and probably wrong diagnosis without even looking at the code

The pooling mechanism, likely introduced to keep the number of upstream connections in check, is broken.

Possible reasons:

  • The pool gets filled with closed/stalled/way-past-deadline connections that render it useless and lead to an explosion of open connections toward the upstream DNS servers (will they react by blacklisting us?).
  • The pool works correctly but is being filled with closed/broken connections that get removed right away (why?), and we are not disposing of those connections correctly, since the number of open files reaches the allowed limit.

I would add a mechanism that keeps the pool fresh by periodically removing closed/stalled/past-deadline/otherwise-dead connections, and I would cap the maximum number of connections that can be created.
Ideally, the pool should hold a number of connections proportional to the number of upstream servers (1-3 connections each?) and NOT allow the creation of additional ones. If the pool is full, you wait; creating additional connections when the pool is full does not make sense.

Clients of the pool should only be able to perform requests sequentially (request order must be preserved). The freshness mechanism should flush stuck pools (pools whose connections have all been leased out for too long), either completely or per IP, when no client has been able to get a connection for a reasonable amount of time (a few seconds?).
Also, loss of overall connectivity should be handled properly (stop the pool from creating connections and flush the old ones it has? reply with failures to every DNS request AdGuard receives?).
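
To make the idea concrete, here is a minimal sketch of the kind of bounded pool I have in mind (just an illustration, not the dnsproxy code; the names, the fixed capacity, and the nil-slot trick are all made up):

package pool

import (
	"context"
	"net"
)

// boundedPool holds at most cap(slots) connections per upstream.
// A nil value in the channel marks a free slot: the caller may dial a new connection.
type boundedPool struct {
	slots chan net.Conn
	dial  func(ctx context.Context) (net.Conn, error)
}

func newBoundedPool(max int, dial func(ctx context.Context) (net.Conn, error)) *boundedPool {
	p := &boundedPool{slots: make(chan net.Conn, max), dial: dial}
	for i := 0; i < max; i++ {
		p.slots <- nil
	}
	return p
}

// Get blocks until a slot is available instead of dialing without limit.
func (p *boundedPool) Get(ctx context.Context) (net.Conn, error) {
	select {
	case c := <-p.slots:
		if c != nil {
			return c, nil // reuse an idle connection
		}
		conn, err := p.dial(ctx)
		if err != nil {
			p.slots <- nil // give the slot back on failure
			return nil, err
		}
		return conn, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// Put returns a healthy connection to the pool; a broken one only frees its slot.
func (p *boundedPool) Put(c net.Conn, broken bool) {
	if broken {
		if c != nil {
			c.Close()
		}
		p.slots <- nil
		return
	}
	p.slots <- c
}

With something like this, "pool is full" means callers wait in Get instead of opening yet another socket, and broken connections free their slot instead of leaking a file descriptor.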

@ameshkov
Member

@saltama hi! Quick question: do you have any issues when you use DoH upstreams and not DoT? They rely on the native Go HTTP implementation with its own connection keep-alive mechanism.
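
For context, the keep-alive mechanism referred to here is the standard net/http Transport connection reuse; a minimal sketch of a client configured that way (the limits and timeouts below are illustrative, not AdGuard Home's actual values):

package doh

import (
	"net/http"
	"time"
)

// newDoHClient returns an HTTP client whose Transport keeps idle HTTPS
// connections to the DoH server alive and reuses them across queries.
// The limits and timeouts are examples, not AdGuard Home's real settings.
func newDoHClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        20,
			MaxIdleConnsPerHost: 2,
			IdleConnTimeout:     30 * time.Second,
			TLSHandshakeTimeout: 10 * time.Second,
		},
	}
}

Every DoH POST made through such a client transparently reuses an idle HTTPS connection to the upstream when one is available.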

@saltama
Author

saltama commented Mar 23, 2019

With only DoH it either works (50/50; I'm not sure for how long, but it seems more stable) or doesn't work at all from the beginning, with this result for every query:

    2019/03/23 12:19:20 7481#245 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).handleUDPPacket(): 
    error handling DNS (udp) request: talking to dnsUpstream failed, 
    cause: all upstreams failed to exchange, 
    cause: couldn't do a POST request to 'https://dns.quad9.net:443/dns-query', 
    cause: Post https://dns.quad9.net:443/dns-query: net/http: request canceled while waiting for connection 
    (Client.Timeout exceeded while awaiting headers) (hidden: couldn't do a POST request to 'https://cloudflare-dns.com:443/dns-query', 
    cause: Post https://cloudflare-dns.com:443/dns-query: net/http: request canceled while waiting for connection
    (Client.Timeout exceeded while awaiting headers))

Also, in this case I don't see the familiar "Returning existing connection" message anymore.
If connections are handled differently for DoH, this could be a separate, additional issue.

@ameshkov ameshkov added the "needs investigation" (Needs to be reproduced reliably) label Mar 25, 2019
@ameshkov
Member

Adguard started by systemd, doesn't resolve anything for 10 minutes or so, then starts working

Btw, this sounds like a different issue, and it is Raspbian-related. Please check the last comment in this thread:
#633

@ameshkov ameshkov added this to the v0.95 milestone Mar 25, 2019
@ameshkov
Member

write tcp 10.0.0.165:58572->149.112.112.112:853: i/o timeout -- this one means that AGH failed to initialize a new connection for the pool because of a timeout error.

Tbh, I suppose the issue is not the pooling, but something related to #633. We just need to figure out what exactly is causing this behavior.

Could you please tell us what device you are running, the OS version, etc.? We'll try to reproduce it on our side.

@saltama
Author

saltama commented Mar 26, 2019

Raspberry Pi Zero W, latest Raspbian Stretch.
Yep, the second issue is the one described in #633: unresponsive for 5 minutes regardless of DoH/DoT, then it starts working.
Don't know about the other one, since it seems newer.
I will post some comments on the other thread and try to see if I can discover something else.

@saltama
Author

saltama commented Mar 27, 2019

I had it configured with Quad9/Cloudflare DoH, and after a day or so I noticed that one of the AdGuard threads (of the 8 listed by top) was using 100% CPU as usual.
Attached is the result of sending a manual SIGSEGV to that thread; it could be useful to spot something unusual (I know, a bit crude ;D).
Please note that it had been running since yesterday, but 15 minutes or so before sending the SIGSEGV I changed the configuration from parallel to non-parallel (this is reflected in the age of some of the goroutines).

adguardhome.log.txt

Edit:
In that file there are 238 active goroutines, and most of them are either waiting on a mutex or doing TLS or HTTP work. I have only 3 clients on that network, if it matters.

In addition to the errors I've already listed, this morning I also saw an error related to the source of randomness being empty (something like "urandom has no other randoms, waiting 60s") that I had never seen before (omitted from the log; it happened a few hours before what's shown there).

Regarding the "too many open files" error, the Pi Zero has some limits that are half of what the bigger Raspberry Pis have.

On a Pi Zero:

raspberrypi:~ $ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3780
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3780
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
raspberrypi:~ $ ulimit -Hn
1048576
raspberrypi:~ $ ulimit -Sn
1024
raspberrypi:~ $ cat /proc/sys/fs/file-max
48258

On a Pi3:

pi3:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7345
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 2
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 7345
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
pi3:~$ ulimit -Hn
1048576
pi3:~$ ulimit -Sn
1024
pi3:~$ cat /proc/sys/fs/file-max
93882

@szolin
Contributor

szolin commented Mar 27, 2019

In that file there are 238 active goroutines, and most of those are either waiting on a mutex or doing tls or http stuff. I've only 3 clients in that network, if it matters.

There are several lines like goroutine 27503 [chan receive, 20 minutes], which means those goroutines have been waiting for a TLS handshake to complete for about 20 minutes - and that's way too long.
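
As an aside, a handshake can usually only hang like that when no deadline is set on the underlying connection before it starts. A hypothetical sketch of a DoT dial bounded by a deadline (not the actual dnsproxy code; the timeout values are arbitrary):

package dot

import (
	"crypto/tls"
	"net"
	"time"
)

// dialDoT bounds both the TCP dial and the TLS handshake so a goroutine
// can never sit on either for 20 minutes.
func dialDoT(addr, serverName string) (*tls.Conn, error) {
	raw, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	// The deadline set here also bounds the Handshake call below.
	_ = raw.SetDeadline(time.Now().Add(10 * time.Second))
	conn := tls.Client(raw, &tls.Config{ServerName: serverName})
	if err := conn.Handshake(); err != nil {
		raw.Close()
		return nil, err
	}
	// Clear the deadline so normal reads/writes use their own timeouts.
	_ = raw.SetDeadline(time.Time{})
	return conn, nil
}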

Anyway, I've made a patch that allows setting a user-defined per-process limit in the config file. Would it be convenient for you if I send you a binary for your platform so you can check whether it solves your problems?

@saltama
Author

saltama commented Mar 27, 2019

Sure! Put it somewhere and I'll try it (arm, not arm64).

@szolin
Contributor

szolin commented Mar 27, 2019

OK, so you should edit the config file and add an rlimit_nofile: <NUMBER> setting, like this:

...
language: en
rlimit_nofile: 8096
dns:
...

On both your systems the default value is 1024, so 8096 should be more than enough.

AdGuardHome_v0.94-4-g0c86-dirty_linux_arm.tar.gz
SHA-1: 8e519c4d80397dbb4de1a266877b0399d6845730
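
For the record, raising the limit from inside a Go process boils down to a syscall.Setrlimit call; presumably the rlimit_nofile option does something along these lines (this is only a guess, and the function name here is made up):

package limits

import "syscall"

// setMaxOpenFiles raises the soft RLIMIT_NOFILE limit for the current process.
func setMaxOpenFiles(n uint64) error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return err
	}
	lim.Cur = n
	if lim.Max < n {
		lim.Max = n // raising the hard limit may require extra privileges
	}
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
}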

@saltama
Author

saltama commented Mar 27, 2019

Testing it now. Have you added a 30s timeout somewhere too? The CPU usage drops back to normal levels after 30s when it stalls at 100% (if not, this is something new I noticed now that I'm looking at it closely).

@szolin
Contributor

szolin commented Mar 27, 2019

have you added a 30s timeout somewhere too?

No, it's definitely something else.

@ameshkov
Member

ameshkov commented Mar 28, 2019

Well, I must say that there were a lot of "too many open files" errors in your log, and it's hard to predict what kind of consequences they might cause. So that might be the real cause.

@saltama
Author

saltama commented Mar 28, 2019

For now I'm not seeing improvements. I could try tweaking the limit value a bit (e.g. 1000) while running with verbose logging and see what happens.
Yeah, I agree that this kind of error could cause a lot of different weird side effects...

@szolin
Contributor

szolin commented Mar 28, 2019

For now i'm not seeing improvements

What's your current rlimit_nofile setting value?

eg. 1000

This is way too low. You were experiencing problems with 1024 (the default); setting this value to 1000 will make things worse.

@saltama
Author

saltama commented Mar 28, 2019

What's your current rlimit_nofile setting value?

Your 8096, but I didn't check the log for specific errors.

I'm redoing the tests (starting with 8k and then increasing, even though increasing the limit is just a workaround) with this script, which tries to resolve random domains:

#!/usr/bin/env python3
from time import sleep
import socket
import random
import string
import argparse
import functools


if __name__ == "__main__":
    # Flush after every print so progress dots show up immediately.
    print = functools.partial(print, flush=True)

    parser = argparse.ArgumentParser(description='Simple stress test for a DNS server')
    parser.add_argument('delay', metavar='N', type=float,
                        help='delay between attempts in seconds')  # even 0.5 is enough

    args = parser.parse_args()
    delay = args.delay
    random.seed()
    i = 0
    print("Resolving random domains:")
    while True:
        i += 1
        # Random prefix so every query misses the cache and reaches the upstream.
        addr = ''.join(random.sample(string.ascii_letters + string.digits, 16)) + "dnstest.com"
        try:
            socket.gethostbyname(addr)
        except socket.gaierror:
            pass  # NXDOMAIN and similar failures are expected
        print(".", end="")
        if i % 20 == 0:
            print("")
        sleep(delay)
The fact is that it's hard to replicate... at some point it stops resolving, which could be after a few minutes or after hours, something that makes me think of concurrency/resource/things-not-being-closed issues.

@ameshkov
Member

Your 8096, but i didn't check the log for specific errors.

It'd be interesting to check if there are any "too many open files" errors now.

The fact is that it's hard to replicate... at some point it stops resolving, it could be after a few minutes or after hours, something that makes me think about concurrency/resource/stuff_not_closed issues.

I'll try running this script on my VPS with AGH; maybe I'll see something.

@ameshkov
Member

So it has been running for 11 hours and I have no issues so far:
https://uploads.adguard.com/up04_2fx7x_AdGuard_Home.png

I am guessing this must be something with your Raspbian, and to reproduce it we should have the very same configuration.

Questions:

  1. RPi version?
  2. What packages are installed? What are their versions?
  3. How is your RPi connected to the network? Wi-Fi, ethernet?

Btw, could it be that your RPi is losing connectivity at some point (a networking restart, for instance)? This might explain the timeout errors.

@saltama
Author

saltama commented Mar 29, 2019

So it has been running for 11 hours and I have no issues so far

If you are using a VPS, I have no problem believing that :)

Raspberry Pi Zero W, Raspbian Stretch (the only OS readily available), package list.
The Raspberry Pi is connected via the onboard Wi-Fi.

Btw, could it be that your RPi may be losing connectivity at some point (networking restart for instance)? This might explain timeout errors.

Possible, and what if AdGuard Home isn't able to recover correctly from those errors, or from too many errors while communicating with the upstream DNS?

Btw, I booted up the Pi 2 hours ago (my 4 usual TLS upstreams + parallel) and checked just now: still blocked at 100% CPU with hundreds of threads.

@ameshkov
Member

Btw, booted up the pi 2 hours ago (my 4 usual tls upstreams+parallel) and checked now, still blocked at 100% cpu with hundred of threads.

Well, it just means that queries to upstreams are still getting stuck on i/o timeout errors.

It sounds as if you have some kind of a rate limiter that starts dropping outgoing packets at some point.

Could you please check iptables just in case?

  • iptables --list
  • iptables -t nat --list

@szolin am I missing anything here?

@ameshkov
Member

ameshkov commented Mar 29, 2019

@saltama and one more thing to try.

Here I've slightly modified your script; now it just makes requests to Cloudflare's DoH server directly:
https://gist.github.com/ameshkov/d03f0e4f3014cb642d8766169a6855a6

Could you please try running it for some time and see whether there are any timeout errors?
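
The gist isn't reproduced here, but the idea is simply to query Cloudflare's DoH endpoint directly in a loop and watch for timeouts. A rough standalone sketch of the same idea (it uses Cloudflare's JSON API for brevity, which is not what the gist does):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for {
		req, err := http.NewRequest("GET",
			"https://cloudflare-dns.com/dns-query?name=example.com&type=A", nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("Accept", "application/dns-json") // Cloudflare's JSON DoH API
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("\nerror:", err) // any i/o timeout would show up here
		} else {
			resp.Body.Close()
			fmt.Print(".")
		}
		time.Sleep(500 * time.Millisecond)
	}
}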

@saltama
Author

saltama commented Mar 29, 2019

pi@raspberrypi:~ $ sudo iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
pi@raspberrypi:~ $ sudo iptables -t nat --list
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

I'm trying to record a video of it, but as expected, when you look at it the issue doesn't happen...

@ameshkov
Member

ameshkov commented Mar 29, 2019

In addition to #659 (comment)

There's one more place where rate limiting might be happening -- your router. Try checking the iptables rules there as well.

However, if my script works okay, then the problem is not rate limiting, and it means something is wrong with AGH (hopefully it's not some Golang issue).

@saltama
Author

saltama commented Mar 29, 2019

Sure, I will keep that running to verify that I still have connectivity toward Cloudflare.
The router is a DSL modem with a minimal CLI/OS; I don't think it supports complex rules like rate limiting.

@ameshkov
Member

ameshkov commented Apr 1, 2019

Sure, will keep that running to verify that i still have connectivity toward clouflare.

How is it going? Are there any issues with my Python script? (Check stdout for errors.)

@saltama
Author

saltama commented Apr 1, 2019

I won't be able to test anything for a few days, but don't despair, I'll report back soon.

@ameshkov
Member

ameshkov commented Apr 2, 2019

Thanks, we'll be waiting!

@ameshkov ameshkov removed this from the v0.95 milestone Apr 23, 2019
@ameshkov ameshkov added this to the v0.96 milestone Apr 23, 2019
@szolin
Contributor

szolin commented Jun 3, 2019

@saltama Regarding the i/o timeout errors you are experiencing - please take a look at what tcpdump shows for a connection that is hanging, or send me your tcpdump capture and I can try to extract the relevant information from it. We need to link the AGH logs to the TCP connections and see what is really taking so long there.

@szolin szolin modified the milestones: v0.96, v0.97 Jun 4, 2019
@ameshkov ameshkov removed this from the v0.97 milestone Jul 2, 2019
@ameshkov
Member

ameshkov commented Jul 2, 2019

See #633

@ameshkov ameshkov closed this as completed Jul 2, 2019
@saltama
Author

saltama commented Jul 2, 2019

Sorry for not reporting back... I'm verifying whether this issue has disappeared on Buster as well.
For now it seems that may be the case.

@ameshkov
Member

ameshkov commented Jul 4, 2019

Awesome, thank you!
