Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOH requests hang after a network reconnection #3217

Closed
spiralshapeturtle opened this issue Jun 1, 2021 · 52 comments
Closed

DOH requests hang after a network reconnection #3217

spiralshapeturtle opened this issue Jun 1, 2021 · 52 comments
Assignees
Milestone

Comments

@spiralshapeturtle
Copy link

spiralshapeturtle commented Jun 1, 2021

  • Version of AdGuard Home server:
    0.106.3
  • How did you install AdGuard Home:
    docker
  • How did you setup DNS configuration:
    Adguard -> router -> ISP
  • If it's a router or IoT, please write device model:
    OPNsense
  • CPU architecture:
    Intel
  • Operating system and version:
    Ubuntu running with Docker containers

Expected Behavior

adguard should always keeps processing DNS requests.

Actual Behavior

Since the latest updates, AdGuard stops processing DNS requests. Even in the same vlan (excluding the router) the software stops listing to DNS requests. Logs keep empty, webgui and ping still works (no network issue)

Screenshots

PRTG polling my DNS setup, also pihole is deployed in the same setup an never show an alarm. It seems for now only related tot adguard.
image

Additional Information

sudo docker run -d --name adguard --hostname adguard --net=hostnetwork --ip=172.16.1.253 --cap-add=NET_ADMIN --cap-add=CAP_NET_RAW --cap-add=CAP_NET_BIND_SERVICE -e TZ=Europe/Amsterdam -v /home/rodeho/docker/adguard/workdir:/opt/adguardhome/work -v /home/rodeho/docker/adguard/confdir:/opt/adguardhome/ --restart unless-stopped adguard/adguardhome

@ainar-g
Copy link
Contributor

ainar-g commented Jun 1, 2021

Hi. Is this really the full command that you use to run the container? Because it doesn't have any volumes attached, so any restart probably removes all previous configuration. See the quick start example on our Docker Hub page.

@ainar-g ainar-g added docker Docker-related issues waiting for data Waiting for users to provide more data. labels Jun 1, 2021
@spiralshapeturtle
Copy link
Author

#2657

could be related?

No thats not the full docker command, I removed the volumes due to privacy reasons:
-v /home/foobar/docker/adguard/workdir:/opt/adguardhome/work -v /home/foobar/docker/adguard/confdir:/opt/adguardhome/

@spiralshapeturtle
Copy link
Author

I might thing that when the internet upload is fully congested, the problem occurs faster. Fully utilized like 100% I mean.

@ainar-g
Copy link
Contributor

ainar-g commented Jun 1, 2021

I see. It will be hard to understand what is going on without the logs. Can you configure AGH to make verbose logs and also configure it to write logs to a permanent location using log_file and either post the log here as an attachment or send it to devteam@adguard.com with the subject line “AdGuard Home Issue 3217”? Thanks.

@spiralshapeturtle
Copy link
Author

sure will do

@spiralshapeturtle
Copy link
Author

spiralshapeturtle commented Jun 1, 2021


;; QUESTION SECTION:
;www.waze.com.	IN	 A

2021/06/01 22:56:31.435547 1#20385 [debug] github.com/AdguardTeam/dnsproxy/proxy.(*Proxy).udpHandlePacket(): error handling DNS (udp) request: talking to dnsUpstream failed, cause: all upstreams failed to respond, cause: couldn't do a GET request to 'https://149.112.112.11:443/dns-query', cause: Get "https://149.112.112.11:443/dns-query?dns=i9ABAAABAAAAAAABA3d3dwR3YXplA2NvbQAAAQABAAApCAAAAIAAAAA": context deadline exceeded (hidden: couldn't do a GET request to 'https://146.112.41.2:443/dns-query', cause: Get "https://146.112.41.2:443/dns-query?dns=i9ABAAABAAAAAAABA3d3dwR3YXplA2NvbQAAAQABAAApCAAAAIAAAAA": net/http: request canceled (Client.Timeout exceeded while awaiting headers), couldn't do a GET request to 'https://9.9.9.11:443/dns-query', cause: Get "https://9.9.9.11:443/dns-query?dns=i9ABAAABAAAAAAABA3d3dwR3YXplA2NvbQAAAQABAAApCAAAAIAAAAA": net/http: request canceled (Client.Timeout exceeded while awaiting headers))

This is the issue, it's a dual ISP network pihole and windows DNS does not report this issue. It seems almost impossible there is no internet. Let me show you.

user@MacBook-Pro-van-user ~ % dig @172.16.1.253 github.com

...timeout...

Switching to the docker container itself, Adguard to be specific.

user@dockerstable:~$ sudo docker exec -it adguard ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=55 time=10.323 ms
64 bytes from 1.1.1.1: seq=1 ttl=55 time=8.896 ms
64 bytes from 1.1.1.1: seq=2 ttl=55 time=10.350 ms
user@dockerstable:~$ sudo docker exec -it adguard ping github.com
PING github.com (140.82.121.4): 56 data bytes
64 bytes from 140.82.121.4: seq=0 ttl=52 time=17.865 ms
64 bytes from 140.82.121.4: seq=1 ttl=52 time=17.533 ms
64 bytes from 140.82.121.4: seq=2 ttl=52 time=17.141 ms
^C

Seems like a software issue regarding https DNS upstreams?

upstream config

https://146.112.41.2/dns-query
https://9.9.9.11/dns-query
https://149.112.112.11/dns-query
[/my.domain/]10.0.0.1

@ameshkov
Copy link
Member

ameshkov commented Jun 1, 2021

Client.Timeout exceeded while awaiting headers means that the TCP connection timed out.

Does the issue go away with time or is it necessary to restart the container?

When this happens, could you please record a tcpdump?
tcpdump dst 9.9.9.11 -w out.pcap

Also, it'd be interesting to see if you have any issues when you use a plain DNS upstream.

@spiralshapeturtle
Copy link
Author

spiralshapeturtle commented Jun 2, 2021

Client.Timeout exceeded while awaiting headers means that the TCP connection timed out.

Does the issue go away with time or is it necessary to restart the container?

When this happens, could you please record a tcpdump?
tcpdump dst 9.9.9.11 -w out.pcap

Also, it'd be interesting to see if you have any issues when you use a plain DNS upstream.

OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"tcpdump\": executable file not found in $PATH": unknown

user@dockerstable:~$ sudo docker exec -it adguard find / -name tcpdump
user@dockerstable:~$

Could you guide me where you hide tcpdump in the docker? Any way to run bash from the docker host, i.e. like you have SSH-ed into the docker?

@ameshkov
Copy link
Member

ameshkov commented Jun 2, 2021

@rodeho you don't need to run it inside the container, try running it on your host machine should be just okay.

@hplee0120
Copy link

I installed it on the OpenWrt router and usually ran normally. But every time Wan is reconnected, ADH does not resolve DNS, the device can't open some web pages, and then check the logs to confirm that no device requests DNS have been received in the ADH background. This phenomenon returns to normal work after turning off the ADH and reopening the operation.

@hplee0120
Copy link

hplee0120 commented Jun 3, 2021

What might be about the logs? Looks like it's a connectivity issue for DoH? But you actually need to close the program and restart it to get back to normal.
#3138

2021/06/03 05:37:06.267815 [info] Couldn't get version check json from https://static.adguard.com/adguardhome/release/version.json: *fmt.wrapError updater: HTTP GET https://static.adguard.com/adguardhome/release/version.json: Get "https://static.adguard.com/adguardhome/release/version.json": all upstreams failed to exchange request, cause: couldn't do a GET request to 'https://doh.pub:443/dns-query', cause: Get "https://doh.pub:443/dns-query?dns=M0EBAAABAAAAAAAABnN0YXRpYwdhZGd1YXJkA2NvbQAAAQAB": net/http: request canceled (Client.Timeout exceeded while awaiting headers) (hidden: couldn't do a GET request to 'https://dns.alidns.com:443/dns-query', cause: Get "https://dns.alidns.com:443/dns-query?dns=M0EBAAABAAAAAAAABnN0YXRpYwdhZGd1YXJkA2NvbQAAAQAB": net/http: request canceled (Client.Timeout exceeded while awaiting headers))

@spiralshapeturtle
Copy link
Author

@ameshkov there are 3 tickets complaining about ADH stop processing requests when it detects an internet down situation. All my other DNS servers is run keeps running normal, no issue. Only ADH with DOH is having issue, whit a client I can browse normal during the "internet down alarms" from ADH.

It really has something to do with the TCP implementation of ADH.

@ameshkov
Copy link
Member

ameshkov commented Jun 3, 2021

Okay, I have one idea. Our DOH client has a limit of 1 active connection to the remote host. Maybe, in your case when the network change occurs, the connection is not closed and hangs in some semi-open state. This would block further requests.

Let's try lifting that connections limit and see if it helps.

adguard pushed a commit to AdguardTeam/dnsproxy that referenced this issue Jun 3, 2021
See here for details: AdguardTeam/AdGuardHome#3217
Also, some minor cosmetic changes.
adguard pushed a commit to AdguardTeam/dnsproxy that referenced this issue Jun 3, 2021
Merge in DNS/dnsproxy from lift_doh_limit to master

Squashed commit of the following:

commit a87df1e
Author: Andrey Meshkov <am@adguard.com>
Date:   Thu Jun 3 12:33:24 2021 +0300

    unnecessary nolint

commit 29cc79d
Author: Andrey Meshkov <am@adguard.com>
Date:   Thu Jun 3 12:26:59 2021 +0300

    added typecheck for dnscrypt

commit 97faa73
Author: Andrey Meshkov <am@adguard.com>
Date:   Thu Jun 3 12:25:34 2021 +0300

    Lift DOH 1 active connection limit

    See here for details: AdguardTeam/AdGuardHome#3217
    Also, some minor cosmetic changes.
adguard pushed a commit that referenced this issue Jun 3, 2021
Merge in DNS/adguard-home from 3217-fix-doh to master

Updates #3217.

Squashed commit of the following:

commit 29a81e0
Author: Eugene Burkov <e.burkov@adguard.com>
Date:   Thu Jun 3 12:58:01 2021 +0300

    all: upd dnsproxy
@EugeneOne1
Copy link
Member

@rodeho, the possible fix is implemented in the latest edge build (since 084564e). Could you please check, if it solves the problem for you?

@spiralshapeturtle
Copy link
Author

@rodeho, the possible fix is implemented in the latest edge build (since 084564e). Could you please check, if it solves the problem for you?

Cool let me check. Could you guide me how to run this edge build into docker?

@EugeneOne1
Copy link
Member

Running development builds for Docker is described here.

@spiralshapeturtle
Copy link
Author

Running development builds for Docker is described here.

And up and running:v0.107.0-a.62+084564e6

@hplee0120
Copy link

hplee0120 commented Jun 4, 2021

After testing, it seems that the problem only exists in DoH. The above problem does not occur with DoT or ordinary port dns.
How to use the repaired version, waiting for 107.b2 release?

@EugeneOne1
Copy link
Member

@hplee0120, we've got no information about how you installed your AdGuard Home. Could you please provide this info?

@hplee0120
Copy link

@hplee0120, we've got no information about how you installed your AdGuard Home. Could you please provide this info?

I added ADH luci when compiling OpenWrt. Use binary files at https://github.com/AdguardTeam/AdGuardHome/releases.

@spiralshapeturtle
Copy link
Author

spiralshapeturtle commented Jun 5, 2021

Client.Timeout exceeded while awaiting headers means that the TCP connection timed out.

Does the issue go away with time or is it necessary to restart the container?

When this happens, could you please record a tcpdump?
tcpdump dst 9.9.9.11 -w out.pcap

Also, it'd be interesting to see if you have any issues when you use a plain DNS upstream.

I don't think its fully solved yet. Its very fragile, when the TCP connection drops is running into issues. Could the auto recovery of the TCP session be optimized?

@spiralshapeturtle
Copy link
Author

Hi there,

Did some further testing, when the internet line is OK everything seems to be working well. When I change a FW rule, assume it will hit the current TCP table of the router (OPNsense) the issue kicks in.

The session or the re-connect implementation seems not be able to reconnect in a proper way, and the outage remains for minutes <10. After a minute or what, it resolves by itself.

@spiralshapeturtle
Copy link
Author

spiralshapeturtle commented Jun 7, 2021

Did some testing with 106.3 the stable one, with normal DNS resolving to unbound within the same network. It also drops when there was a minimal glitch in the internet connectivity.

The interesting part is that I monitor 4 DNS servers:
Adguard edge
Adguard stable
Windows
Pihole

And only the AdGuard servers does have issues, all the other DNS servers are not metioned once in the logs...in the past week. Always working fine, it has to be something in the AGH IP stack...

@ainar-g
Copy link
Contributor

ainar-g commented Jun 7, 2021

@rodeho, thank you for the information! We're still investigating the issues. Just as a side note, what is the value of dns.max_goroutines in your configuration file? Does the situation improve if you set it to 0?

The error you see on the stable version are caused by the fact that edge migrates its configuration version, and the stable releases don't know about the new versions. In your case, you should be able to simply replace schema_version: 11 with schema_version: 10 in your configuration file when you switch to the release version.

@spiralshapeturtle
Copy link
Author

spiralshapeturtle commented Jun 7, 2021

@rodeho, thank you for the information! We're still investigating the issues. Just as a side note, what is the value of dns.max_goroutines in your configuration file? Does the situation improve if you set it to 0?

The error you see on the stable version are caused by the fact that edge migrates its configuration version, and the stable releases don't know about the new versions. In your case, you should be able to simply replace schema_version: 11 with schema_version: 10 in your configuration file when you switch to the release version.

Hi, the dns.max_goroutines is not configured on the stable one, lets configure it on the edge, running a load of servers in the homelab. So there should be a higher load of DNS than regulair.

@spiralshapeturtle
Copy link
Author

spiralshapeturtle commented Jun 16, 2021

This seems to be an adguard code issue, the OS is unable to call the DOH servers, but on CLI I'm able to?

@rodeho I see you're getting wget: error getting response: Connection reset by peer on dockerstable, so there was still a networking issue.

No no netwerk issue, connection reset by peer is the other side. There is a connection but SSL seems to be failing, could that lead to the root cause? Will show you some other dockers in the same network segment and their output.


root@dockerstable:~$ sudo docker exec -it pihole  wget https://dns-family.adguard.com
--2021-06-16 21:23:00--  https://dns-family.adguard.com/
Resolving dns-family.adguard.com (dns-family.adguard.com)... 94.140.14.15, 94.140.15.16
Connecting to dns-family.adguard.com (dns-family.adguard.com)|94.140.14.15|:443... connected.
GnuTLS: The TLS connection was non-properly terminated.
Unable to establish SSL connection.

root@dockerstable:~$ sudo docker exec -it adguard  wget https://dns-family.adguard.com
Connecting to dns-family.adguard.com (94.140.14.15:443)
ssl_client: dns-family.adguard.com: handshake failed: No error information
wget: error getting response: Connection reset by peer

root@dockerstable:~$ sudo docker exec -it adguard2  wget https://dns-family.adguard.com
Connecting to dns-family.adguard.com (94.140.14.15:443)
ssl_client: dns-family.adguard.com: handshake failed: No error information
wget: error getting response: Connection reset by peer

root@dockerstable:~$ sudo docker exec -it smokeping  wget https://dns-family.adguard.com
Connecting to dns-family.adguard.com (94.140.14.15:443)
Connecting to adguard.com (172.67.3.157:443)
saving to 'index.html'
index.html           100% |*********************************************************************************************************************************************************************************************|  423k  0:00:00 ETA
'index.html' saved

@ameshkov ameshkov changed the title 0.106.3 adguard stops processing DNS reqests DOH requests hang after a network reconnection Jun 21, 2021
@ameshkov
Copy link
Member

Renaming the issue and merging #3270 here.

I tend to think the problem is specific to DOH and may be caused by a bug in golang http2 implementation. The problem is that we still cannot reproduce this.

We probably can add some workaround (re-create the HTTP client on every timeout error for instance), but I'd like to figure out what's really causing this first.

@spiralshapeturtle
Copy link
Author

@ameshkov thanks, is there anyway we can validate if its specific to the DOH servers of adguard cloud, or specific to DOH upstream servers?

When trying to simulate in your lab, try to add some packet loss. Seems to be more frequent when the upload is heavy loaded.

adguard pushed a commit to AdguardTeam/dnsproxy that referenced this issue Jun 21, 2021
@ameshkov
Copy link
Member

@rodeho IMO, it does not seem to be specific to particular DOH servers, rather to golang's http client in general (which we use internally). If we're not able to reproduce it internally, we'll push another build to the edge build that should in theory solve this by re-creating the http client on timeout errors.

@fizsef
Copy link

fizsef commented Jun 21, 2021

I have observed this problem twice, but when I specifically wanted to reproduce it in order to post the log here, this time there were no problems and everything worked fine. Although the settings were the same. I use a DoH server https://family.cloudflare-dns.com/dns-query

@spiralshapeturtle
Copy link
Author

@rodeho IMO, it does not seem to be specific to particular DOH servers, rather to golang's http client in general (which we use internally). If we're not able to reproduce it internally, we'll push another build to the edge build that should in theory solve this by re-creating the http client on timeout errors.

Just to inform you, I moved to a local Unbound forwarder. Therefore 95% of my network using port 53 with UDP. Only some clients 5% where specifically configure with extra checking for parental control, and they failed an hour ago. Analyzed the logs only the logs DOH checks failed, which also means that the rest of the IP stack remains functional for at least UDP packets.

request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

adguard pushed a commit to AdguardTeam/dnsproxy that referenced this issue Jun 22, 2021
Updates AdguardTeam/AdGuardHome#3217.

Squashed commit of the following:

commit e0bdfd9
Merge: b34be01 5bf778a
Author: Ainar Garipov <A.Garipov@AdGuard.COM>
Date:   Tue Jun 22 13:09:57 2021 +0300

    Merge branch 'master' into fix/doh_timeout

commit b34be01
Author: Andrey Meshkov <am@adguard.com>
Date:   Mon Jun 21 20:07:44 2021 +0300

    added TODO

commit 18e1ad9
Author: Andrey Meshkov <am@adguard.com>
Date:   Mon Jun 21 20:01:46 2021 +0300

    fix review comments

commit a7725e8
Author: Andrey Meshkov <am@adguard.com>
Date:   Mon Jun 21 17:23:50 2021 +0300

    Recreate HTTP client on timeout error
    Trying to solve AdguardTeam/AdGuardHome#3217
adguard pushed a commit that referenced this issue Jun 22, 2021
Updates #3217.

Squashed commit of the following:

commit c0208c8
Author: Ainar Garipov <A.Garipov@AdGuard.COM>
Date:   Tue Jun 22 13:23:17 2021 +0300

    all: upd dnsproxy
@ainar-g
Copy link
Contributor

ainar-g commented Jun 22, 2021

@rodeho, @hplee0120, @fizsef, we've pushed some fixes that may improve the situation to the edge channel in revision 90a8536. Could any of you check if the situation is better now? Thanks!

@hplee0120
Copy link

@rodeho, @hplee0120, @fizsef, we've pushed some fixes that may improve the situation to the edge channel in revision 90a8536. Could any of you check if the situation is better now? Thanks!

From my personal use, after downloading the latest version of edge, the problem seems to have been solved.!

@spiralshapeturtle
Copy link
Author

@rodeho, @hplee0120, @fizsef, we've pushed some fixes that may improve the situation to the edge channel in revision 90a8536. Could any of you check if the situation is better now? Thanks!

I upgraded to this edge build within my test setup, my primary DNS moved to PiHole during this debugging period AGH was causing to much outages.

@fizsef
Copy link

fizsef commented Jun 27, 2021

and how to install, where to download, this corrected version?

@ainar-g
Copy link
Contributor

ainar-g commented Jun 28, 2021

@fizsef, there are instructions here in our readme.

@spiralshapeturtle
Copy link
Author

@rodeho, @hplee0120, @fizsef, we've pushed some fixes that may improve the situation to the edge channel in revision 90a8536. Could any of you check if the situation is better now? Thanks!

Last night my stable build went down, and the edge build remains up. Seems to be fixed! Thanks

@ainar-g ainar-g closed this as completed Jun 30, 2021
@ainar-g ainar-g added bug P3: Medium and removed docker Docker-related issues waiting for data Waiting for users to provide more data. labels Jun 30, 2021
@ainar-g ainar-g self-assigned this Jun 30, 2021
@ainar-g ainar-g added this to the v0.107.0 milestone Jun 30, 2021
heyxkhoa pushed a commit to heyxkhoa/AdGuardHome that referenced this issue Mar 20, 2023
Merge in DNS/adguard-home from 3217-fix-doh to master

Updates AdguardTeam#3217.

Squashed commit of the following:

commit 29a81e0
Author: Eugene Burkov <e.burkov@adguard.com>
Date:   Thu Jun 3 12:58:01 2021 +0300

    all: upd dnsproxy
heyxkhoa pushed a commit to heyxkhoa/AdGuardHome that referenced this issue Mar 20, 2023
Updates AdguardTeam#3217.

Squashed commit of the following:

commit c0208c8
Author: Ainar Garipov <A.Garipov@AdGuard.COM>
Date:   Tue Jun 22 13:23:17 2021 +0300

    all: upd dnsproxy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants