Upstream DNS connection pooling broken. #659
@saltama hi! Quick question: do you have any issues when you use DoH upstreams and not DoT? They rely on the native Go HTTP implementation with its own connection keep-alive mechanism.
With only DoH it either works (50/50; I'm not sure for how long, but it seems more stable) or doesn't work at all from the beginning, with this result for every query:
Also, in this case I don't see the familiar "Returning existing connection" message anymore.
Btw, this sounds like another issue, and it is Raspbian-related. Please check the last comment in this thread:
Tbh, I suppose the issue is not the pooling, but something related to #633. We just need to figure out what exactly is causing this behavior. Could you please tell us what device you are running, the OS version, etc.? We'll try to reproduce it on our side.
Raspberry Pi Zero W, latest Raspbian Stretch.
I had it configured with quad9/cloudflare DoH, and after a day or so I noticed that one of the adguard threads (of 8 listed by top) was using 100% CPU as usual. Edit: in addition to the errors I've already listed, this morning I also saw an error related to the source of randomness being empty (something like "urandom has no other randoms, waiting 60s") that I had never seen before (omitted from the log; it happened a few hours before what's shown there). Regarding the "too many open files" error, the Pi Zero has some limits that are half of what bigger Raspberry Pis have. On a Pi Zero:
On a Pi3:
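The actual limit values from the two boards aren't included above. As a quick, AdGuardHome-independent way to check the per-process open-file limit behind the "too many open files" errors, the Python standard library can read it directly:

```python
import resource

# Read this process's open-file-descriptor limit (RLIMIT_NOFILE), the limit
# behind "too many open files". The soft limit is what actually applies;
# the hard limit is the ceiling an unprivileged process may raise it to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
```

The same numbers are what `ulimit -n` reports in a shell.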
There are several lines like that. Anyway, I've made a patch that allows setting a user-defined per-process limit in the config file. Is it convenient for you if I send you a binary for your platform so you can check whether it solves your problems?
Sure! Put it somewhere and I'll try it (arm, not arm64).
OK, so you should edit the config file and add
On both your systems the default value is 1024, so 8096 should be more than enough. AdGuardHome_v0.94-4-g0c86-dirty_linux_arm.tar.gz
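The exact key was not quoted in the thread; assuming the patched setting is a top-level key named `rlimit_nofile` in AdGuardHome's YAML config, the change would look something like:

```yaml
# Assumed setting name (the thread does not quote the exact key):
# per-process limit on open file descriptors.
rlimit_nofile: 8096
```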
Testing it now. Have you added a 30s timeout somewhere too? The CPU usage drops back to normal levels after 30s when it stalls at 100% (if not, this is something new I noticed now that I look at it closely).
No, it's definitely something else.
Well, I must say that in your log there were a lot of "too many open files" errors, and it's hard to predict what kind of consequences they might be causing. So that might be the real cause.
For now I'm not seeing improvements. I could try tweaking the limit value a bit (e.g. 1000) while running with verbose logging and see what happens.
What's your current
This is way too low. You were experiencing problems with 1024 (the default), so setting this value to 1000 will make it worse.
The 8096 you suggested, but I didn't check the log for specific errors. I'm redoing the tests (starting with 8k and then increasing, even though raising the limit is just a workaround) with this script that tries to resolve random domains:
The fact is that it's hard to replicate... at some point it stops resolving, it could be after a few minutes or after hours, something that makes me think about concurrency/resource/stuff_not_closed issues.
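The script itself wasn't preserved in the thread; a minimal reconstruction of the idea (resolve random, mostly nonexistent domains in a loop through the system resolver, which is assumed to point at AdGuardHome) might look like this, with all names and parameters being illustrative:

```python
import random
import socket
import string
import time

def random_domain(length=10):
    """Return a random .com hostname made of lowercase letters."""
    label = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return label + ".com"

def resolve_loop(iterations=100, delay=0.1):
    """Hammer the configured resolver with random lookups.

    Returns the number of lookups that failed. Most random domains will be
    NXDOMAIN, which is fine: the point is to generate steady upstream load,
    not to get answers.
    """
    failures = 0
    for _ in range(iterations):
        name = random_domain()
        try:
            socket.gethostbyname(name)
        except socket.gaierror:
            # NXDOMAIN or resolver failure; both count as "no answer" here
            failures += 1
        time.sleep(delay)
    return failures
```

Run `resolve_loop()` on a machine whose DNS points at the AdGuardHome instance and watch the AGH log for "too many open files" or i/o timeout errors.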
It'd be interesting to check if there are any "too many open files" errors now.
I'll try running this script on my VPS with AGH, maybe I'll see something.
So it has been running for 11 hours and I have no issues so far: I am guessing this must be something with your Raspbian, and to reproduce it we should have the very same configuration. Questions:
Btw, could it be that your RPi is losing connectivity at some point (a networking restart, for instance)? This might explain the timeout errors.
Raspberry Pi Zero W, Raspbian Stretch (the only OS readily available), package list.
Btw, I booted up the Pi 2 hours ago (my 4 usual TLS upstreams + parallel) and checked now; it's still blocked at 100% CPU with hundreds of threads.
Well, it just means that queries to upstreams are still getting stuck on i/o timeout errors. It sounds as if you have some kind of rate limiter that starts dropping outgoing packets at some point. Could you please check iptables just in case?
@szolin am I missing anything here?
@saltama and one more thing to try. Here I slightly modified your script; now it just makes requests to Cloudflare's DoH server directly: Could you please try running it for some time and see if there are any issues with timeout errors?
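The modified script wasn't preserved either; a sketch of "query Cloudflare's DoH server directly, bypassing AGH" using Cloudflare's JSON API could look like the following. The endpoint and `application/dns-json` header are Cloudflare's documented API; the loop parameters are illustrative assumptions:

```python
import json
import time
import urllib.parse
import urllib.request

DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"

def build_query_url(name, rtype="A"):
    """Build a DoH JSON-API URL for the given domain name."""
    params = urllib.parse.urlencode({"name": name, "type": rtype})
    return f"{DOH_ENDPOINT}?{params}"

def query(name, timeout=10):
    """Resolve `name` over DoH directly; returns the parsed JSON response."""
    req = urllib.request.Request(
        build_query_url(name),
        headers={"Accept": "application/dns-json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def run(iterations=1000, delay=1.0):
    """Query in a loop and print any timeout/network errors."""
    for _ in range(iterations):
        try:
            print("status:", query("example.com").get("Status"))
        except OSError as exc:  # socket timeouts, URLError, etc.
            print("error:", exc)
        time.sleep(delay)
```

If `run()` produces timeout errors here too, the problem is below AGH (network, rate limiting); if it runs clean while AGH stalls, the problem is inside AGH.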
I'm trying to record a video of it, but as expected, when you look at it the issue doesn't happen... |
In addition to #659 (comment), there's one more place where rate limiting might be: your router. Try checking the iptables rules there as well. However, if my script works okay, then the problem is not rate limiting, and it means something is wrong with AGH (hopefully it's not some golang issue).
Sure, I will keep that running to verify that I still have connectivity toward Cloudflare.
How's it going, are there any issues with my Python script? (check the stdout for errors)
I won't be able to test anything for a few days, but don't despair, I'll report back soon.
Thanks, we'll be waiting!
@saltama Regarding i/o timeout errors you are experiencing - please take a look at what tcpdump shows for a connection that is hanging. Or send your tcpdump data and I can try to get the related information from it. We need to link the AGH logs with TCP connection and see what is really taking so long there. |
See #633 |
Sorry for not reporting back... I'm verifying if this issue disappeared in Buster too. |
Awesome, thank you!
Hi, sorry for ignoring the issue template but let me just describe what has been broken with the last few releases.
I really want to use adguard as my main DNS considering all the features it has, but while 0.92-hotfixX was barely usable for me, I can't use the last two releases without restarting adguard every hour or so (CPU at 100%, stops responding), and sometimes multiple restarts are needed to make it resolve anything.
To make it short: with 0.94 I'm seeing a huge number of i/o timeout errors that start appearing sometimes at startup, other times after a while (sooner with parallel enabled), and on rare (~10%) occasions don't manifest at all.
Describing/debugging this kind of issue is a bit hard, so bear with me if the description below sometimes seems a bit incoherent, but I think the root cause lies in how the connection pool has been implemented.
This should be quite easy to replicate even on a laptop or in a Docker container, enabling parallel to reach the breaking point sooner. Testing this on a Raspberry Pi with 512MB of RAM made these issues easier to see, I guess.
Setup: I'm running it on a raspberry pi zero (debian stretch, 512mb ram), using the binaries you provide. AdGuardHome configured as a systemd service to start automatically at boot, no other significant services are active.
Let me recap what I was seeing in the latest releases:
0.92:
Issues with some types of upstream servers, fixed with the hotfixes.
Adguard is started by systemd but doesn't resolve anything for 10 minutes or so, then starts working. Most of the time restarting it fixes the issue. Looking at the verbose log I noticed that the first few upstream connections were failing with high RTT (60-120 seconds).
0.93:
Still not responding for 10 minutes. After half an hour the process takes 100% of the CPU and stops responding. Restarting helps, but only for another 30 minutes. I didn't test much in this case, I was just waiting for the next release.
0.94:
Did you introduce connection pooling in addition to the parallel query mechanism?
Because now I can see messages hinting at the fact that an upstream DNS connection has been reused.
But the pooling mechanism appears to be broken.
Under various circumstances I can't see connections being reused anymore, and a new connection is created every time AdGuardHome receives a new query, something that makes adguard unresponsive after a while.
This is what I'm seeing in a typical session, with parallel enabled to make it easier to reach the point where the errors appear; but as said above, even with parallel off, sometimes the i/o timeout error you see below starts right away:
It looks like there is some sort of pooling mechanism for upstream DNS server connections, since sometimes I see these messages:
BUT after a short while the log is full of normal successful connection attempts without reuse, as if the pool just didn't have any reusable connections available anymore:
After a while adguard stops resolving at all, RTTs increase to 60-100+ seconds, I can't see those "Returning existing connection" or "successfully initialize connection" messages anymore, and every query fails:
After a while, the i/o timeout errors turn into this:
My personal and probably wrong diagnosis, without even looking at the code:
The pooling mechanism, likely introduced to keep the number of upstream connections in check, is broken.
Possible reasons:
I would add a mechanism that keeps the pool fresh by periodically removing connections that are closed, stalled, way over the deadline you set, or otherwise dead, and that limits the maximum number of connections that can be created.
Ideally, the pool should contain a number of connections proportional to the number of upstream servers (1-3 connections each?) and NOT allow the creation of additional connections. If the pool is full, you wait. Creating additional connections when the pool is full does not make sense.
Clients of the pool should be able to perform requests only sequentially (request order must be preserved). The freshness mechanism should completely flush (or flush by IP) stuck pools (pools that have had all their connections leased out for too long) when no client has been able to get a connection for a reasonable amount of time (e.g. a few seconds?).
Also, loss of overall connectivity should be handled properly (stop the pool from creating connections and flush the old ones it has? reply with failures to every DNS request adguard receives?)
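The bounded-pool behavior suggested above (a hard cap per upstream, reuse when possible, wait instead of dialing when exhausted, free the slot when a connection breaks) can be sketched in a few lines. This is an illustration of the idea, not AdGuardHome's actual implementation, and all names are made up:

```python
import queue
import socket

class BoundedPool:
    """Fixed-size connection pool for one upstream.

    Each slot is either a live connection or None ("free, may dial").
    The total number of connections can never exceed `size`.
    """

    def __init__(self, host, port, size=2, dial_timeout=5.0):
        self.host, self.port = host, port
        self.dial_timeout = dial_timeout
        self._slots = queue.Queue(maxsize=size)
        for _ in range(size):
            self._slots.put(None)  # all slots start free

    def get(self, wait=3.0):
        """Lease a connection; block up to `wait` seconds if all are out."""
        try:
            conn = self._slots.get(timeout=wait)
        except queue.Empty:
            raise TimeoutError("pool exhausted: no connection available")
        if conn is not None:
            return conn  # reuse an existing connection
        return socket.create_connection((self.host, self.port),
                                        timeout=self.dial_timeout)

    def put(self, conn, broken=False):
        """Return a leased connection; close broken ones and free the slot."""
        if broken and conn is not None:
            conn.close()
            conn = None  # the next get() will dial a fresh connection
        self._slots.put(conn)
```

Because `get()` blocks on an exhausted pool instead of dialing, a stall shows up as a clean timeout error rather than an ever-growing pile of half-open upstream connections and "too many open files".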