
Stats get corrupted after flood of queries #4358

Closed
cadusilva opened this issue Mar 4, 2022 · 53 comments
@cadusilva

Issue Details

  • Version of AdGuard Home server: v0.107.4
  • How did you install AdGuard Home: GitHub releases
  • How did you setup DNS configuration: System
  • CPU architecture: ARM64/aarch64
  • Operating system and version: Debian 11

Expected Behavior

The stats page works normally, showing all the information it usually shows.

Actual Behavior

From time to time the stats page shows zeroes everywhere and empty domain lists (most blocked, most queried, etc.). When accessing the page while it happens, an error appears at the bottom-right corner saying:

Error: control/clients/find?ip0=127.0.0.1&ip1=206.42.33.166&ip2=10.0.0.3&ip3=45.164.223.2&ip4=motog&ip5=joaouerj&ip6=131.159.24.242&ip7=179.190.172.16&ip8=220.180.241.26&ip9=177.25.155.98&ip10=177.25.153.160&ip11=177.25.150.57&ip12=154.202.55.217&ip13=177.25.153.142&ip14=45.205.48.180&ip15=185.249.221.238&ip16=183.90.186.240&ip17=177.25.156.157&ip18=177.25.144.182&ip19=154.198.205.174&ip20=45.205.35.196&ip21=156.229.9.224&ip22=154.198.219.21&ip23=177.25.153.187&ip24=177.25.144.89&ip25=177.25.150.41&ip26=10.0.0.1&ip27=177.25.153.233&ip28=177.25.159.130&ip29=177.25.159.162&ip30=191.47.16.69&ip31=177.25.153.174&ip32=177.25.155.148&ip33=177.25.144.221&ip34=177.25.153.216&ip35=177.25.156.1&ip36=177.25.159.132&ip37=177.25.155.166&ip38=177.25.145.82&ip39=177.25.151.167&ip40=177.25.150.32&ip41=88.80.186.137&ip42=177.25.156.159&ip43=177.25.153.172&ip44=177.25.155.73&ip45=177.25.145.50&ip46=177.25.157.85&ip47=14.1.112.177&ip48=170.106.176.49&ip49=177.25.150.119&ip50=131.159.25.7&ip51=146.88.240.4&ip52=103.203.59.3&ip53=209.141.45.192&ip54=39.129.8.129&ip55=45.83.67.25&ip56=177.25.156.143&ip57=129.250.206.86&ip58=174.138.40.30&ip59=162.142.125.133&ip60=71.6.232.7&ip61=141.22.28.227&ip62=27.98.224.20&ip63=184.105.139.117&ip64=64.62.197.37&ip65=45.79.15.228&ip66=37.44.239.30&ip67=159.89.194.175&ip68=202.112.238.56&ip69=177.25.156.77&ip70=185.180.143.142&ip71=185.180.143.73&ip72=162.142.125.212&ip73=54.173.29.204&ip74=185.180.143.76&ip75=146.88.240.12&ip76=141.212.123.193 | Network Error

AdGuard apparently continues to work normally and keeps resolving names, and the other pages also work, but the stats page breaks. My guess is that after a flood of queries from a bunch of CN/HK/SG addresses, the stats file somehow gets corrupted.

The last batch was composed of repeated queries like this one:

{"T":"2022-03-04T02:39:59.960278275-03:00","QH":"microsoft.com","QT":"TXT","QC":"IN","CP":"","Answer":"APyBgAABABAAAAABCW1pY3Jvc29mdANjb20AABAAAcAMABAAAQAADR8AKyphcHBsZS1kb21haW4tdmVyaWZpY2F0aW9uPTBnTWVhWXlZeTZHTFZpR2/ADAAQAAEAAA0fAEVEZ29vZ2xlLXNpdGUtdmVyaWZpY2F0aW9uPXBqUE9hdVNQY3JmWE9aUzlqblBQYTVheG93Y0hHQ0RBbDFfODZkQ3FGcGvADAAQAAEAAA0fABsaZmcydDBnb3Y5NDI0cDJ0ZGN1bzk0Z29lOWrADAAQAAEAAA0fABsadDdzZWJlZTUxanJqN3ZtOTMyazUzMWhpcGHADAAQAAEAAA0fAEVEZ29vZ2xlLXNpdGUtdmVyaWZpY2F0aW9uPU0tLUNWZm5fWXdzVi0yRkdiQ3BfSEZhRWoyM0JtVDBjVEY0bDhoWGdwdk3ADAAQAAEAAA0fACEgcGJjcGN3ODRzZms3dzRuaG03ZHd5ZzJrM2d4MHQ0eHLADAAQAAEAAA0fAC4tZG9jdXNpZ249ZDVhMzczN2MtYzIzYy00YmQwLTkwOTUtZDJmZjYyMWYyODQwwAwAEAABAAANHwC+vXY9c3BmMSBpbmNsdWRlOl9zcGYtYS5taWNyb3NvZnQuY29tIGluY2x1ZGU6X3NwZi1iLm1pY3Jvc29mdC5jb20gaW5jbHVkZTpfc3BmLWMubWljcm9zb2Z0LmNvbSBpbmNsdWRlOl9zcGYtc3NnLWEubWljcm9zb2Z0LmNvbSBpbmNsdWRlOnNwZi1hLmhvdG1haWwuY29tIGluY2x1ZGU6X3NwZjEtbWVvLm1pY3Jvc29mdC5jb20gLWFsbMAMABAAAQAADR8AODdhZG9iZS1zaWduLXZlcmlmaWNhdGlvbj1jMWZlYTliNGNkZDRkZjBkNTc3ODUxN2YyOWUwOTM0wAwAEAABAAANHwAuLWRvY3VzaWduPTUyOTk4NDgyLTM5M2QtNDZmNy05NWQ0LTE1YWM2NTA5YmZkZMAMABAAAQAADR8AXVxhZG9iZS1pZHAtc2l0ZS12ZXJpZmljYXRpb249OGFhMzVjNTI4YWY1ZDcyYmViMTliMWJkM2VkOWI4NmQ4N2VhN2YyNGIyYmEzYzk5ZmZjZDAwYzI3ZTlkODA5Y8AMABAAAQAADR8AJSRkMzY1bWt0a2V5PTRkOGJueWN4NDBmeTM1ODFwZXR0YTRnc2bADAAQAAEAAA0fAFlYOFJQRFhqQnpCUzl0dTdQYnlzdTdxQ0FDcndYUG9EVjhadExmdGhUbkM0eTlWSkZMZDg0aXQ1c1FsRUlUZ1NMSjRLT0lBOHBCWnhteXZQdWp1VXZoT2c9PcAMABAAAQAADR8ARURnb29nbGUtc2l0ZS12ZXJpZmljYXRpb249MVRlSzhxME96aUZsNFQxdEYtUVI2NUprekhaMXJjZGdOY2NERnA3OGlUa8AMABAAAQAADR8AJSRkMzY1bWt0a2V5PTN1YzFjZjgyY3B2NzUwbHprNzB2OWJ2ZjLADAAQAAEAAA0fADw7ZmFjZWJvb2stZG9tYWluLXZlcmlmaWNhdGlvbj1md3p3aGJiendtZzVmemdvdGMyZ281MW9sYzM1NjYAACkQAAAAAAAAAA==","Result":{},"Upstream":"127.0.0.1:5300","IP":"156.229.9.224","Elapsed":1031547,"Cached":true}

Additional Information

The upstream is Unbound 1.13.1. If possible, please provide an e-mail address where I can forward the more sensitive info, like the query log and stats file, for analysis if needed.

@EugeneOne1 (Member)

EugeneOne1 commented Apr 5, 2022

@cadusilva, hello and apologies for the late response. To troubleshoot the issue, we'd like to check your verbose log. Could you please reproduce the issue and collect it? You can send it to devteam@adguard.com with something like "Issue 4358" in the subject. It'd also be really helpful if you attached the query log file and the corrupted stats.db file as well.

@EugeneOne1 EugeneOne1 added the "waiting for data" label Apr 5, 2022
@cadusilva (Author)

hello @EugeneOne1, currently my installation is set to answer queries only from a limited list of CIDR ranges (from my country). Since then, I haven't seen the problem surface again. I'll clear the list of CIDRs and wait to see if the problem comes back. Also, I already sent an e-mail some weeks ago that contains part of the info requested in your reply. As soon as the problem appears again, I'll send another e-mail. Thank you!

@cadusilva (Author)

Hello again, here's a follow-up: as soon as the CIDR list was cleared, the issue surfaced again, and the e-mail with the files is on its way. Today I saw that the stats were corrupted again. The only catch is that the query log only contains data from the last 6 hours, as I forgot to expand the retention window. I hope it helps anyway. If you need any extra steps or additional info, just say the word.

@EugeneOne1 (Member)

@cadusilva, we've received it and are investigating, thanks.

@EugeneOne1 (Member)

EugeneOne1 commented Apr 25, 2022

@cadusilva, hello again. Unfortunately, we can't reproduce the issue. Could you please answer a couple of questions to shed some light on the problem:

  1. What kind of machine is running AGH? Is it a router?
  2. What do you mean by "gets corrupt somehow"? Did the problem occur strictly after the flood of queries?

Also, could you please collect the browser's logs next time you catch it? This would really help us. Thanks.

@cadusilva (Author)

Hello @EugeneOne1,

  1. AGH is currently running on a Raspberry Pi 4B with 8 GB of RAM, on the official Raspberry Pi OS x64 based on Debian 11.3.

  2. This is because I'm not quite sure how the stats get corrupted, hence the "somehow". Initially, when this corruption first occurred, there was a flood of TXT queries to microsoft.com from APAC IPs, but since then I'm not so sure how it keeps getting corrupted after some time and who is to blame.

Currently, even limiting the CIDRs that can query my server to accept only IPs from Brazil isn't preventing the stats from being corrupted, as some foreign queries show up in the stats.

All I know is that, from time to time, when I access the AGH web UI, everything is zeroed and there's an error message in the corner. At the moment there's no CIDR filtering in place, so I'm just waiting for the next time the stats get corrupted. Then I'll send the browser console logs to you guys.

This may be confusing and I'm not sure how it all happens, but it does happen. Whatever logs I can send, I will send.

@cadusilva (Author)

Hello @EugeneOne1, I just sent a new e-mail with a new set of files, including the console log from Chrome. I noticed a few minutes ago that the problem had happened again, so I immediately gathered the files and sent the new e-mail. Hope this helps. Thank you.

[screenshot]

@EugeneOne1 (Member)

@cadusilva, we've received the data, many thanks. The issue seems kind of nontrivial; we'll dig further.

@EugeneOne1 EugeneOne1 added the "needs investigation" label and removed the "waiting for data" label May 30, 2022
@EugeneOne1 (Member)

@cadusilva, hello again and apologies for the long wait. Were the logs you sent recorded while the issue was occurring? And does flushing the statistics fix the issue for some noticeable period of time?

@cadusilva (Author)

Hello there! Yes, I turned verbose logging on and waited for the problem to happen. I don't know if flushing the statistics from the web UI fixes it, as I just delete stats.db when the problem happens. It has occurred once or twice since the last post.

@cadusilva (Author)

Update: flushing the statistics from the web UI's General Settings also works after the problem occurs, which is what just happened.

@ainar-g ainar-g added this to the v0.107.9 milestone Jul 25, 2022
@ainar-g ainar-g modified the milestones: v0.107.9, v0.107.10 Aug 3, 2022
adguard pushed a commit that referenced this issue Aug 4, 2022
Merge in DNS/adguard-home from 4358-fix-stats to master

Updates #4358.
Updates #4342.

Squashed commit of the following:

commit 5683cb3
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Aug 4 18:20:54 2022 +0300

    stats: rm races test

commit 63dd676
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Aug 4 17:13:36 2022 +0300

    stats: try to imp test

commit 59a0f24
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Aug 4 16:38:57 2022 +0300

    stats: fix nil ptr deref

commit 7fc3ff1
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Apr 7 16:02:51 2022 +0300

    stats: fix races finally, imp tests

commit c63f5f4
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Aug 4 00:56:49 2022 +0300

    aghhttp: add register func

commit 61adc7f
Merge: edbdb2d 9b3adac
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Aug 4 00:36:01 2022 +0300

    Merge branch 'master' into 4358-fix-stats

commit edbdb2d
Merge: a91e4d7 a481ff4
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 3 21:00:42 2022 +0300

    Merge branch 'master' into 4358-fix-stats

commit a91e4d7
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 3 18:46:19 2022 +0300

    stats: imp code, docs

commit c5f3814
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 3 18:16:13 2022 +0300

    all: log changes

commit 5e6caaf
Merge: 091ba75 eb8e816
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 3 18:09:10 2022 +0300

    Merge branch 'master' into 4358-fix-stats

commit 091ba75
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 3 18:07:39 2022 +0300

    stats: imp docs, code

commit f2b2de7
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Tue Aug 2 17:09:30 2022 +0300

    all: refactor stats & add mutexes

commit b3f11c4
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Apr 27 15:30:09 2022 +0300

    WIP
@EugeneOne1 (Member)

@cadusilva, hello again and apologies for the delayed response. We've finally improved the concurrency logic in the statistics module to make it work with shared memory more carefully. Could you please check the latest build in the edge channel and tell us whether it works properly and doesn't spoil the database file? Thanks.

FYI, our tests didn't show any significant performance loss relative to the old version, but we'd like to get your feedback as well.

@EugeneOne1 EugeneOne1 added the "bug" and "P3: Medium" labels and removed the "needs investigation" label Aug 4, 2022
@cadusilva (Author)

Hello @EugeneOne1, no problem. I just installed version v0.108.0-a.189+4293cf59 and will report back if the issue happens again. I'll keep an eye on the performance as well. Thank you!

@cadusilva (Author)

Bad news @EugeneOne1: it happened again.

[screenshot]

Yesterday everything was fine; then, a few minutes ago, I accessed the web UI and the stats were corrupted again. Unfortunately, verbose logging wasn't enabled this time.

@EugeneOne1 (Member)

@cadusilva, thanks for the time you're contributing. We're going to investigate it further.

@cadusilva (Author)

Thank you, Eugene, for looking into it. I have now enabled verbose logging, so when it happens again I'll have more details for you guys to debug the issue.

@cadusilva (Author)

Hello @EugeneOne1, I've just sent another e-mail with a tarball containing everything, including logs, so you guys can take a look at the latest occurrence. Thank you!

@EugeneOne1 (Member)

@cadusilva, we've received it and are looking into it, thanks.

@EugeneOne1 (Member)

@cadusilva, could you please also try to access the web UI directly (bypassing Nginx) while using the same "bad" file? Does the issue reproduce? If yes, does it reproduce in other browsers?

@cadusilva (Author)

Sure, tonight I'll run the tests, and when I get the results I'll post them here, @EugeneOne1.

@cadusilva (Author)

cadusilva commented Sep 6, 2022

Hello @EugeneOne1, I just did some tests and found interesting results. By the way, the issue had just happened, so I downloaded the "corrupt" stats.db file and cleared the zeroed statistics via the dashboard so it would start counting queries again.

Next, I renamed stats.db to stats.db.bak and uploaded the "corrupt" stats.db file I had downloaded previously.

Then I stopped the nginx and AdGuardHome services and changed the AGH ports so it would listen on 80 and 443, besides the other default ports for DNS-over-TLS and DNS-over-QUIC.

Next, I restarted AdGuardHome to apply the new port settings and use the "corrupt" stats.db file. To my surprise, all the stats were back! Then I undid the port settings in the YAML file and started nginx again.

Then everything was zeroed. To be sure, I stopped nginx one more time and restarted AGH listening on ports 80 and 443 directly instead of being reverse-proxied by nginx. And there were all the stats from the "corrupt" stats.db file.

To wrap up, I went through these steps a third time, and yes: the "corrupt" stats.db file works just fine when the AGH dashboard is accessed directly (AGH on ports 80 and 443) and shows everything as zeroed when nginx is the middleman.

Here are some pictures:

AdGuard Home directly:

[screenshot]

AdGuard Home via nginx:

[screenshot]

@EugeneOne1 (Member)

@cadusilva, it seems we've finally found the source of the issue. I'd say you may want to revisit your Nginx configuration. I'm not really familiar with it, but we've received a couple of related issues (e.g. #4727).

Perhaps you could configure Nginx to collect some logs and let us see them, so that we could enhance the documentation about using a reverse proxy.

@cadusilva (Author)

cadusilva commented Sep 6, 2022


Hello @EugeneOne1, what kind of logs do you guys want? There's already an error log going. Would it suffice? Next time the issue happens, I'll send both the AGH and nginx logs so you guys can investigate. I'm also not familiar enough with nginx configs to pinpoint the problem and set things right, but maybe with these logs a direction can be found.

I'll also do a little research on how I can improve the communication between nginx and AGH and see if things get better.

Thank you.

@cadusilva (Author)

cadusilva commented Sep 6, 2022

The first thing I did was change my configuration.

The line proxy_pass https://dns_doh_servers; became proxy_pass http://127.0.0.1:4430;. Also, in the AGH YAML file, I changed allow_unencrypted_doh to true, so nginx handles the HTTPS side on its own.

To test, I set Firefox to use encrypted DNS, and everything works fine and faster so far. As a result, the AGH log no longer sees the query as encrypted and shows it as "Plain DNS". But I think this is expected, as nginx is now the one handling encryption and it talks to AGH unencrypted on localhost. Let's see what happens.

Updates:

  • Google Chrome didn't like the new setup as much as Firefox did, and it can't resolve any address when DoH is enabled.
  • There are a lot of errors like this one in the nginx log: recv() failed (104: Connection reset by peer) while reading upstream, client: xxx.xxx.xxx.xxx, server: dns.alto.win, request: "POST /dns-query HTTP/2.0", upstream: "http://127.0.0.1:4430/dns-query", host: "dns.alto.win"
  • The dnslookup tool says: Cannot make the DNS request: got status code 400 from https://dns.alto.win:443/dns-query. I don't know how Firefox doesn't care and just works, but some clients care a lot and do not work.
  • Android's Intra DNS app also can't communicate via DoH with the changed settings.

@cadusilva (Author)

@EugeneOne1 so these are the initial findings: it seems that most clients, and the server itself, don't get along with the configuration changes, with AGH apparently resetting the connection. Firefox, for some reason, is one of a kind and doesn't complain.

@cadusilva (Author)

cadusilva commented Sep 6, 2022

Based on this blog post by Nginx staff, I made some other changes.

dns.conf:
The new entry for /dns-query is as follows:

	location /dns-query {
		proxy_http_version		1.1;
		proxy_set_header		Connection "";
		proxy_set_header		Host			$http_host;
		proxy_set_header		X-Real-IP		$realip_remote_addr;
		proxy_set_header		X-Forwarded-For		$proxy_add_x_forwarded_for;
		proxy_cache			doh_cache;
		proxy_cache_key			$scheme$proxy_host$uri$is_args$args$request_body;
		proxy_cache_methods		GET POST;
		proxy_pass			https://dohloop;
	}

nginx.conf:
The upstream directive is now as follows:

	upstream dohloop {
		zone dohloop			64k;
		server				127.0.0.1:4430;
		keepalive_timeout		60s;
		keepalive_requests		100;
		keepalive			10;
	}

There's also this cache setup, but the specified folder remains empty:

	proxy_cache_path /mnt/ram/nginx/doh_cache levels=1:2 keys_zone=doh_cache:10m;

I'll keep an eye on things now to see how everything works. The proxy_pass uses https:// again, so there's currently no problem with clients.

@cadusilva (Author)

cadusilva commented Sep 10, 2022

Hello @EugeneOne1, unfortunately the changes didn't solve the problem. I just became aware that it happened again. It always happens not far beyond 100k queries. I'll check the nginx log and send everything to the dev team's e-mail. Thank you.

@EugeneOne1 (Member)

@cadusilva, hello. Just to be sure, did it happen without the Nginx proxy? We'll dig further then.

@cadusilva (Author)

It happened with nginx as the middleman; I cannot bypass nginx, or the other sites will go offline. I was testing the new nginx settings to see if they'd solve the problem without taking nginx out of the equation.

@EugeneOne1 (Member)

@cadusilva, I've looked through the Nginx docs and came up with a few suggestions:

  • large_client_header_buffers defaults to 4 buffers of 8k each. Perhaps setting it to something like 4 16k may help;
  • enabling proxy_buffering may also help, but the buffer sizes should be chosen properly.
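Concretely, those two suggestions would land in the nginx config roughly like this. This is only an illustrative sketch: the sizes are not tested values, and the 127.0.0.1:8081 upstream address is borrowed from the configs shared elsewhere in this thread:

```nginx
# In the http {} or server {} context: raise the request-header buffers
# from the default "4 8k" so long /control/... request URLs always fit.
large_client_header_buffers 4 16k;

# In the location proxying the AGH web UI: buffer upstream responses.
location / {
    proxy_pass        http://127.0.0.1:8081;
    proxy_buffering   on;
    proxy_buffers     8 16k;    # count and size are illustrative, tune as needed
    proxy_buffer_size 16k;
}
```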

Also, I've just noticed you mentioned the Nginx log. Is it possible to get the part of it around the error occurrence?

@cadusilva (Author)

cadusilva commented Sep 12, 2022

Hello @EugeneOne1, I've just applied your suggestions. Here are the relevant bits:

nginx.conf

	proxy_buffering				on;
	proxy_request_buffering			on;
	proxy_buffers				8 4k;
	proxy_buffer_size			4k; 
	proxy_busy_buffers_size			16k;

In this file there's also proxy_set_header Early-Data $ssl_early_data;. Do you think it could be playing a part in breaking the stats when viewing them with nginx as the middleman?

dns.conf

	location / {
		proxy_pass			http://127.0.0.1:8081;
		proxy_set_header		X-Forwarded-For		$proxy_add_x_forwarded_for;
#		proxy_buffering			off;
#		proxy_redirect			off;
	}

	location /dns-query {
		proxy_http_version		1.1;
		proxy_set_header		Connection "";
		proxy_set_header		Host			$http_host;
		proxy_set_header		X-Real-IP		$realip_remote_addr;
		proxy_set_header		X-Forwarded-For		$proxy_add_x_forwarded_for;
		proxy_cache			doh_cache;
		proxy_cache_key			$scheme$proxy_host$uri$is_args$args$request_body;
		proxy_cache_methods		GET POST;
		proxy_pass			https://dohloop;
	}

About the end of your last message: I sent an e-mail two days ago with a few files, including the nginx log, but I couldn't find any bit of it relevant to the issue we're digging into. I guess I'll check the log level and watch what happens now with the edits to the files.

Thank you.

@EugeneOne1 (Member)

> In this file there's also proxy_set_header Early-Data $ssl_early_data;. Do you think it could be playing a part in breaking the stats when viewing them with nginx as the middleman?

I'm afraid I can't tell for sure. Actually, the main suspect at the moment is the GET /control/stats endpoint's API. It involves a huge number of parameters, which significantly increases the URL length.

@cadusilva (Author)

@EugeneOne1 I'm still monitoring to see if the issue happens again. But I'm not sure it will, as I am now running AdGuardHome on a machine way more powerful than the previous Raspberry Pi (now sold to someone else). It's a Ryzen 5 3550H, soon to have 24 GB of DDR4-2400 RAM.

If the problem doesn't come back, maybe the issue has to do with the RPi not being powerful enough to deal with this GET /control/stats thing, together with a big list of clients (as was, and is, my case).

I'll keep watching and will comment here if something new happens.

@EugeneOne1 (Member)

EugeneOne1 commented Sep 15, 2022

@cadusilva, I'd be surprised if this were the actual cause, since out-of-resources problems usually cause issues in all parts of the system. Still, extra usage data never hurts, so you're always welcome to share your findings.

Besides, there is a quick way to check it: simply replace the stats and query log data files in the current setup, creating backups of the existing data beforehand.

@cadusilva (Author)

@EugeneOne1 I mean, there's now a lot more processing power than before, and that is where the RPi struggles. It is a very competent piece of hardware and I hosted all kinds of stuff on it, but it's not that powerful, and it's not meant to outperform a gaming rig or anything.

My SBC was the 4B version with 8 GB of RAM, so it had plenty of resources and used less than 2 of its 8 GB (at least speaking of RAM usage). Processing power is the Achilles' heel of this little computer, and it sometimes struggled with things like refreshing all the blocklists I use, checking for updates and then processing them, for example.

And I also have a lot of clients with a lot of CIDRs, so when you mentioned the GET part, I thought (as a layman, of course) that maybe it is about processing power.

But here's another piece of information: I replaced the current stats.db file with the "corrupted" one I last sent you via e-mail, restarted the AdGuard service, and then... it worked. The old stats were right there in the dashboard. Just to be sure, I opened an older e-mail with a previous "corrupted" stats.db file and tried that one too.

Again, all the stats were there. Currently, I'm running the edge release v0.108.0-a.287+fc62796e Linux amd64. I just updated to this version, but the previous one showed the same behaviour. The older stats.db file worked. Maybe there's something going on with the arm64 release? Maybe it's something with the RPi's processing power? Maybe it's something else.

But these are my latest findings for you guys to analyse. So far, there are no new occurrences. They used to happen shortly after hitting 100k queries; I couldn't get to 200k before the dashboard zeroed out. I'll keep watching.

Thank you.

@cadusilva (Author)

Hello @EugeneOne1, so far there are no new occurrences. I'm almost at 300,000 queries on record, but zero new issues. Maybe the number of clients and their CIDRs, the blocklists, and the weight of the dashboard are too much for the Raspberry Pi to handle when these three factors come together. It's just a theory.

@EugeneOne1 (Member)

@cadusilva, I don't recall if you've said already, but does Nginx run on the same machine as AGH?

@cadusilva (Author)

Yes, everything is on the same machine and nginx acts as a reverse proxy.

@ainar-g (Contributor)

ainar-g commented Sep 29, 2022

If there have been no new occurrences, I'd say it was an HTTP proxy issue. We'll close this issue, if you don't mind.

@cadusilva (Author)

It's okay, there really have been no new occurrences so far. They went away after I moved from a Raspberry Pi (8 GB, SSD) to a Ryzen 5 3550H with 16 GB of RAM and NVMe storage (as of now). The setup is the same, with the HTTP proxy and everything in between, but the hardware changed to a much more powerful one and the problem is now gone.

Thank you guys for looking into it, if it ever happens again I'll let you know.

heyxkhoa pushed a commit to heyxkhoa/AdGuardHome that referenced this issue Mar 20, 2023
Merge in DNS/adguard-home from 4358-fix-stats to master

Updates AdguardTeam#4358.
Updates AdguardTeam#4342.

heyxkhoa pushed a commit to heyxkhoa/AdGuardHome that referenced this issue Mar 20, 2023
Merge in DNS/adguard-home from 4358-stats-races to master

Updates AdguardTeam#4358

Squashed commit of the following:

commit 162d17b
Merge: 17732cf d4c3a43
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 17 14:04:20 2022 +0300

    Merge branch 'master' into 4358-stats-races

commit 17732cf
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 17 13:53:42 2022 +0300

    stats: imp docs, locking

commit 4ee0908
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Tue Aug 16 20:26:19 2022 +0300

    stats: revert const

commit a7681a1
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Tue Aug 16 20:23:00 2022 +0300

    stats: imp concurrency

commit a6c6c1a
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Tue Aug 16 19:51:30 2022 +0300

    stats: imp code, tests, docs

commit 954196b
Merge: 281e00d 6e63757
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Tue Aug 16 13:07:32 2022 +0300

    Merge branch 'master' into 4358-stats-races

commit 281e00d
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Fri Aug 12 16:22:18 2022 +0300

    stats: imp closing

commit ed036d9
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Fri Aug 12 16:11:12 2022 +0300

    stats: imp tests more

commit f848a12
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Fri Aug 12 13:54:19 2022 +0300

    stats: imp tests, code

commit 60e11f0
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Aug 11 16:36:07 2022 +0300

    stats: fix test

commit 6d97f1d
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Thu Aug 11 14:53:21 2022 +0300

    stats: imp code, docs

commit 20c70c2
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 10 20:53:36 2022 +0300

    stats: imp shared memory safety

commit 8b39456
Author: Eugene Burkov <E.Burkov@AdGuard.COM>
Date:   Wed Aug 10 17:22:55 2022 +0300

    stats: imp code