Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.107.31 stops responding to DNS requests, does not terminate #5896

Closed
3 tasks done
timkgh opened this issue Jun 12, 2023 · 65 comments
Closed
3 tasks done

v0.107.31 stops responding to DNS requests, does not terminate #5896

timkgh opened this issue Jun 12, 2023 · 65 comments
Assignees
Milestone

Comments

@timkgh
Copy link

timkgh commented Jun 12, 2023

Prerequisites

  • I have checked the Wiki and Discussions and found no answer

  • I have searched other issues and found no duplicates

  • I want to report a bug and not ask a question

Operating system type

Other (please mention in the description)

CPU architecture

64-bit MIPS

Installation

Other (please mention in the description)

Setup

On a router, DHCP is handled by the router

AdGuard Home version

v0.107.31

Description

v0.107.31 running on EdgeRouter 12, mips64_softfloat

Sometimes it just hangs, it doesn't respond to DNS requests anymore, 1 CPU is at 100% and it also does not gracefully restart via systemctl restart, it needs to be force killed.

I know this is not a lot to go on, these random hangs are hard to troubleshoot. It was stable for a while, I believe this started with the updates to .30 and .31

Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705395 [info] Received signal "terminated"
Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705503 [info] stopping AdGuard Home
Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705544 [info] stopping http server...
Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705984 [info] stopped http server
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: State 'stop-sigterm' timed out. Killing.
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Killing process 15089 (AdGuardHome) with signal SIGKILL.
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Main process exited, code=killed, status=9/KILL
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Unit entered failed state.
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Failed with result 'timeout'.
@ainar-g
Copy link
Contributor

ainar-g commented Jun 13, 2023

Could you enable the verbose logging and see if there are any clues there? Enabling debug profiling (debug_pprof: true in the configuration file), getting the profile from http://127.0.0.1:6060/debug/pprof/goroutine?debug=1, and sending them to devteam@adguard.com could also help us see if there are any connection leaks.

@ainar-g ainar-g added the waiting for data Waiting for users to provide more data. label Jun 13, 2023
@timkgh timkgh mentioned this issue Jun 13, 2023
3 tasks
@timkgh
Copy link
Author

timkgh commented Jun 15, 2023

Unfortunately I don't have enough space on my router for verbose logging and I'm also worried about flash wear.

@poolski
Copy link

poolski commented Jun 15, 2023

Have you checked that it's not a memory issue? If you're on a ER12, that only has 1GB of RAM. It's conceivable that once RAM is used up, it goes to swap.

Would you be able to show the load values from uptime to see whether there's something going on there? When the system starts swapping heavily, it will slow right to a crawl, and services become unresponsive.

As another aside, is the router still behaving as expected? Or is the CPU bottleneck causing router issues too?

@timkgh
Copy link
Author

timkgh commented Jun 15, 2023

It is not a memory problem, the memory use is normally < 300MB (right now 185MB). The router is fine when this happens, except 1 CPU core is at 100% taken by AGH. Under normal use the CPU cores are at < 5%. The AGH web interface works, AGH just stops responding to DNS requests. I don't need to reboot the router to recover, I just need to kill AGH.

# free -h
              total        used        free      shared  buff/cache   available
Mem:           973M        173M        549M         11M        250M        773M
Swap:            0B          0B          0B

uptime: 07:01:30 up 142 days, 19:26, 1 user, load average: 0.25, 0.25, 0.16

I've reverted to .29 for now because that's been stable for me. Problems started with the recent .30

@copacetic82
Copy link

I am having a simiular random issue and am running the latest v0.107.32 on centos-release-7-9.2009.1.el7.centos.x86_64. AGH stops responding to DNS requests, and still able to access the GUI. Restarting the AdGuardHome service does not work i.e. have to force a reboot. Was not able to check CPU/Memory usage yet, but never had issues before the last 1-2 releases. I did observe the logs below, and have increaed the UDP receiver buffer as described here (https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size). Am waiting to see if the issue stops...

[root@dns1 ~]# systemctl status AdGuardHome
● AdGuardHome.service - AdGuard Home: Network-level blocker
Loaded: loaded (/etc/systemd/system/AdGuardHome.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2023-06-16 06:40:36 CEST; 13h ago
Main PID: 734 (AdGuardHome)
CGroup: /system.slice/AdGuardHome.service
└─734 /root/AdGuardHome/AdGuardHome -s run

Jun 16 17:18:11 dns1 AdGuardHome[734]: 2023/06/16 17:18:11.470190 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 17:18:14 dns1 AdGuardHome[734]: 2023/06/16 17:18:14.472600 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 17:21:11 dns1 AdGuardHome[734]: 2023/06/16 17:21:11.454448 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 17:21:14 dns1 AdGuardHome[734]: 2023/06/16 17:21:14.464525 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 17:21:40 dns1 AdGuardHome[734]: 2023/06/16 17:21:40.538860 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 17:21:43 dns1 AdGuardHome[734]: 2023/06/16 17:21:43.802327 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 17:40:34 dns1 AdGuardHome[734]: 2023/06/16 17:40:34.338152 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 17:40:37 dns1 AdGuardHome[734]: 2023/06/16 17:40:37.342884 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 18:00:19 dns1 AdGuardHome[734]: 2023/06/16 18:00:19.746670 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
Jun 16 18:00:22 dns1 AdGuardHome[734]: 2023/06/16 18:00:22.771022 [error] dnsproxy: unpacking udp packet: dns: buffer size too small
[root@dns1 ~]#

@copacetic82
Copy link

I am having a simiular random issue and am running the latest v0.107.32 on centos-release-7-9.2009.1.el7.centos.x86_64. AGH stops responding to DNS requests, and still able to access the GUI. Restarting the AdGuardHome service does not work i.e. have to force a reboot. Was not able to check CPU/Memory usage yet, but never had issues before the last 1-2 releases. I did observe the logs below, and have increaed the UDP receiver buffer as described here (https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size). Am waiting to see if the issue stops...

[root@dns1 ~]# systemctl status AdGuardHome ● AdGuardHome.service - AdGuard Home: Network-level blocker Loaded: loaded (/etc/systemd/system/AdGuardHome.service; enabled; vendor preset: disabled) Active: active (running) since Fri 2023-06-16 06:40:36 CEST; 13h ago Main PID: 734 (AdGuardHome) CGroup: /system.slice/AdGuardHome.service └─734 /root/AdGuardHome/AdGuardHome -s run

Jun 16 17:18:11 dns1 AdGuardHome[734]: 2023/06/16 17:18:11.470190 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:18:14 dns1 AdGuardHome[734]: 2023/06/16 17:18:14.472600 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:11 dns1 AdGuardHome[734]: 2023/06/16 17:21:11.454448 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:14 dns1 AdGuardHome[734]: 2023/06/16 17:21:14.464525 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:40 dns1 AdGuardHome[734]: 2023/06/16 17:21:40.538860 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:43 dns1 AdGuardHome[734]: 2023/06/16 17:21:43.802327 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:40:34 dns1 AdGuardHome[734]: 2023/06/16 17:40:34.338152 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:40:37 dns1 AdGuardHome[734]: 2023/06/16 17:40:37.342884 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 18:00:19 dns1 AdGuardHome[734]: 2023/06/16 18:00:19.746670 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 18:00:22 dns1 AdGuardHome[734]: 2023/06/16 18:00:22.771022 [error] dnsproxy: unpacking udp packet: dns: buffer size too small [root@dns1 ~]#

Did not solve the issue...CPU is at 100%, enough free memory. Reverted back to v0.107.29.

@ainar-g
Copy link
Contributor

ainar-g commented Jun 19, 2023

@copacetic82, those errors are likely signalling that some plain-UDP clients are sending invalid DNS packets. We'd appreciate if you could reproduce the situation with verbose logging and profile collecting, as described in #5896 (comment), if you decide to give it another go.

@copacetic82
Copy link

copacetic82 commented Jun 19, 2023

@copacetic82, those errors are likely signalling that some plain-UDP clients are sending invalid DNS packets. We'd appreciate if you could reproduce the situation with verbose logging and profile collecting, as described in #5896 (comment), if you decide to give it another go.

Hi...thanks for replying...unfortunately i cant reproduce it at the moment as I am away on vacation. But a revert to v0.107.29 did solve the issue for me...no problems since, and using the same clients. Maybe I have time to debug once I get back.

@Porygon31
Copy link

It's happening to me since the last update.
I will try the verbose logging

@TheWoodsy
Copy link

Same issue.
AGH on Ubuntu 22.04 snap v0.107.32
Headless 8700k 32gb
Pegged a core at 100%
Nothing in logs around time of death in syslog/journal/AdGuardHome.err
WebUI works, DNS doesn't.
Yikes...

@simonregn
Copy link

I face the same problem. Never had issues with AdguardHome until a few weeks. I agree this issue must have started with the last releases.

@ainar-g
Copy link
Contributor

ainar-g commented Jun 29, 2023

We are still not able to reproduce this properly, and the data we've received so far unfortunately doesn't shed any light on what is happening and why. Anyone still having issues, please help us collect the data by doing the following:

  1. Update to v0.108.0-b.38, which has more debug logging, or a later Edge release.

  2. Enable the verbose logging.

  3. Enable debug profiling (debug_pprof: true in the configuration file).

  4. Wait for the issue to occur.

  5. Save the profile from http://127.0.0.1:6060/debug/pprof/goroutine?debug=1, the verbose log, and send both devteam@adguard.com with “AdGuard Home Issue 5896” in the subject line.

@bartounet16
Copy link

Hello same problem this morning . whith my adguard docker version: v0.108.0-a.586+2069eddf
Stop to listen and respond DNS request.

Web UI is working

This problem is during since 1 month !!!

@bigwookie
Copy link

#5962 Had the same problem here on 0.107.33 have reverted back to 0.107.29 as suggested.

@bigwookie
Copy link

We are still not able to reproduce this properly, and the data we've received so far unfortunately doesn't shed any light on what is happening and why. Anyone still having issues, please help us collect the data by doing the following:

  1. Update to v0.108.0-b.38, which has more debug logging, or a later Edge release.
  2. Enable the verbose logging.
  3. Enable debug profiling (debug_pprof: true in the configuration file).
  4. Wait for the issue to occur.
  5. Save the profile from http://127.0.0.1:6060/debug/pprof/goroutine?debug=1, the verbose log, and send both devteam@adguard.com with “AdGuard Home Issue 5896” in the subject line.

Have set this up and left it running.

adguard pushed a commit that referenced this issue Jul 4, 2023
Updates #5896.

Squashed commit of the following:

commit 81ac59e
Author: Ainar Garipov <A.Garipov@AdGuard.COM>
Date:   Tue Jul 4 16:07:33 2023 +0300

    hashprefix: fix cache coding
@ainar-g
Copy link
Contributor

ainar-g commented Jul 4, 2023

Thanks to the logs sent by @bigwookie in #5962, we were able to pinpoint a bug in the code which could cause such issues. We've fixed it in version v0.108.0-a.600+c4651647 on the Edge channel. Could you please check if the fix improves the situation?

@bigwookie
Copy link

Thanks to the logs sent by @bigwookie in #5962, we were able to pinpoint a bug in the code which could cause such issues. We've fixed it in version v0.108.0-a.600+c4651647 on the Edge channel. Could you please check if the fix improves the situation?

@ainar-g I have downloaded this and left it running.

@bigwookie
Copy link

@ainar-g I can confirm it's running better.

@erdmanbr
Copy link

erdmanbr commented Jul 5, 2023

I've been having the same issue I believe and managed to sort out how to update to the latest Edge release over the weekend. It stalled again last night (running v0.108.0-a.600+c4651647) and I was able to grab and send a log of the event. Will follow up if it happens again since I neglected to enable the pprof option this time.

@bigwookie
Copy link

@ainar-g Just failed again will get the logs and send on.

@up2you1
Copy link

up2you1 commented Jul 5, 2023

just to confirm, i am also getting the same on v0.107.33

  1. LXC Container in Proxmox - the CPU randomly spikes to 100% which causes it to malfunction and not pass any dns through
  2. Docker Container in Unraid - CPU also spikes here too and stops it from working

A reboot of both fixes is temporarily

LXC ADH set to Primary DNS and Docker is Secondary in DHCP

Ive had to switch the LXC Container off as ive no idea how to downgrade it or even use the Edge channel, but ive set the Docker to use Edge and its pulled v0.108.0-a.600+c4651647.

il monitor it

@ainar-g
Copy link
Contributor

ainar-g commented Jul 10, 2023

Yes and yes.

@ainar-g ainar-g self-assigned this Jul 10, 2023
@ainar-g ainar-g added bug P2: High and removed waiting for data Waiting for users to provide more data. labels Jul 10, 2023
@ainar-g ainar-g added this to the v0.107.34 milestone Jul 10, 2023
adguard pushed a commit that referenced this issue Jul 10, 2023
Updates #5896.

Squashed commit of the following:

commit 4934054
Author: Ainar Garipov <A.Garipov@AdGuard.COM>
Date:   Mon Jul 10 17:59:45 2023 +0300

    hashprefix: fix loop pointer
@ainar-g
Copy link
Contributor

ainar-g commented Jul 10, 2023

Version v0.108.0-a.605+c02a1411 on the Edge channel contains the new fix. @bigwookie, could you please try that one out as well?

@bigwookie
Copy link

Version v0.108.0-a.605+c02a1411 on the Edge channel contains the new fix. @bigwookie, could you please try that one out as well?

I am now running Version:
v0.108.0-a.605+c02a1411

@ainar-g
Copy link
Contributor

ainar-g commented Jul 11, 2023

For those who follow the Beta channel, v0.108.0-b.40 contains the same fixes.

@Freebase394
Copy link

Nice Work @ainar-g

@monsm
Copy link

monsm commented Jul 12, 2023

Version v0.108.0-a.605+c02a1411 on the Edge channel contains the new fix. @bigwookie, could you please try that one out as well?

I am now running Version: v0.108.0-a.605+c02a1411

@bigwookie How does the new version work

@bigwookie
Copy link

Version v0.108.0-a.605+c02a1411 on the Edge channel contains the new fix. @bigwookie, could you please try that one out as well?

I am now running Version: v0.108.0-a.605+c02a1411

@bigwookie How does the new version work

So far it is running smoothly with no crashes or errors as before. @ainar-g thank you for this, it's looking good.

@ainar-g ainar-g unpinned this issue Jul 12, 2023
@ainar-g ainar-g modified the milestones: v0.107.35, v0.107.34 Jul 12, 2023
@ainar-g
Copy link
Contributor

ainar-g commented Jul 12, 2023

The fixes have now been released to v0.107.34. Once again, thanks to @bigwookie, @TheCableGuy99, and everyone else who tried Edge and Beta builds!

@Baxxy13
Copy link

Baxxy13 commented Jul 13, 2023

Because of the constantly increasing memory-usage with the last versions i rebooted the Pi3b+ almost once a week.
Let's see if this is now history...
AdGuard-Home_phys_Memory_usage

@s1lviu
Copy link

s1lviu commented Jul 15, 2023

image I had the issue too, hope is fixed.

@TheCableGuy99
Copy link

That looks to be HomeAssistant?

@OlliZabal
Copy link

I'm having some error messages frequently in the docker image log using v0.107.34:
2023/07/17 18:51:59.297889 [error] dnsproxy: unpacking udp packet: dns: buffer size too small

@TheCableGuy99
Copy link

I'm having some error messages frequently in the docker image log using v0.107.34:

2023/07/17 18:51:59.297889 [error] dnsproxy: unpacking udp packet: dns: buffer size too small

Unrelated! Open your own thread!

@bartounet16
Copy link

hello the problem is it resolved.
1 week without problem with version: v0.108.0-a.616+2adc8624

@timkgh
Copy link
Author

timkgh commented Jul 17, 2023

v0.107.34 looks good 👍

@bigwookie
Copy link

Yes v0.107.34 looks good from my end, it has been running efficiently and without error since the release.

@ainar-g ainar-g closed this as completed Jul 18, 2023
@bartounet16
Copy link

WE Can use no edge version now ?

@bartounet16
Copy link

The fixes have now been released to v0.107.34. Once again, thanks to @bigwookie, @TheCableGuy99, and everyone else who tried Edge and Beta builds!

Could toi explain basically what was the problem ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests