Scalability limits? #104

Closed
dchidell opened this issue Apr 15, 2021 · 13 comments

@dchidell (Contributor) commented Apr 15, 2021

I'm monitoring around 50-55 services with Gatus; most are HTTP, and 29 of them are using the pat keyword (with wildcards, so about as expensive a query as it can get). All are using the default poll interval of 60s.
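
For reference, each service entry follows roughly this shape (the name and URL below are placeholders, not my real targets):

```yaml
services:
  - name: example-service             # placeholder
    url: "https://example.org/health" # placeholder
    interval: 60s                     # default poll interval
    conditions:
      - "[STATUS] == 200"
      - "[BODY] == pat(*healthy*)"    # wildcard body match
```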

I am starting to see some responses of context deadline exceeded (Client.Timeout exceeded while awaiting headers) in the body check. I have manually checked the services in question and they're healthy. Restarting Gatus, or just waiting a few minutes, seems to resolve this. It does not occur continuously, but I have seen it twice in the space of a few hours.

I can only assume that this is due to a concurrency issue, as it's quite possible that the combined response time of all the services exceeds 60s. I do not know enough about the Gatus architecture to know whether this is a problem or not.

I am running v2.3.0 with #100 changed locally (as I've not updated since I tested it). I will repeat the test with 2.4.0 and report the results here.

@TwiN (Owner) commented Apr 15, 2021

In fact, unless you're using disable-monitoring-lock, Gatus should never run into concurrency issues.

By default, there's a lock that prevents two services from being evaluated at the same time. This is the default behavior so that the measured response time is as accurate as possible.
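
Conceptually, you can picture it as a single mutex shared by every service's monitoring loop; here's a simplified Go sketch of the idea (not the actual Gatus code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// One mutex shared by all services: only one check runs at a time,
// so the measured response time isn't skewed by concurrent checks.
var monitoringLock sync.Mutex

// monitor evaluates a single service every interval, holding the lock
// for the duration of the check unless the lock is disabled.
func monitor(name string, interval time.Duration, disableLock bool) {
	for {
		if !disableLock {
			monitoringLock.Lock()
		}
		start := time.Now()
		time.Sleep(50 * time.Millisecond) // stand-in for the real HTTP/DNS check
		fmt.Printf("%s evaluated in %v\n", name, time.Since(start))
		if !disableLock {
			monitoringLock.Unlock()
		}
		time.Sleep(interval)
	}
}

func main() {
	go monitor("service-a", time.Second, false)
	go monitor("service-b", time.Second, false)
	time.Sleep(3 * time.Second) // let a few evaluations run, then exit
}
```

In a sketch like this, if the combined duration of all checks exceeds the interval, evaluations simply queue up behind the lock rather than overlapping.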

The context deadline exceeded (Client.Timeout exceeded while awaiting headers) you're seeing is most likely caused by the default client timeout, which is 10s.

I occasionally use Gatus for stress testing, and on average, 50 services with an interval of 1s each usually end up consuming roughly 0.1 vCPU (100m). Given that your interval is 60s rather than 1s, I doubt CPU is to blame, even if you're using pat. That leaves us with networking.
When the issue happens, are all services affected, or just one? If multiple services are affected, are they all targeting the same application?

It's a bit hard to help without being able to reproduce it, but it has never happened to me, and if I had to guess, I'd say either the service is unreachable or the service being tested really is timing out.

@dchidell (Contributor, Author)

I'm not using disable-monitoring-lock, so that can be ruled out.

The networking is convoluted: it tests the service as a user would, and requests get proxied via Cloudflare. You're right about it hitting the 10s timeout, but when I manually check the service, it's fine, and restarting Gatus solves it (whilst restarting the service does not).

I've been running on 2.4.0 all day now without issue, so it does seem tricky to reproduce. I opened this after seeing it twice in an hour, about 20 minutes or so after adding a new service. I've not seen it since.

Is there any additional verbosity I can turn on to collect useful logging output? I may run a test and scale this up to see if I can more reliably reproduce this behaviour.

@TwiN (Owner) commented Apr 15, 2021

The thing is, if it's timing out, there's no extra verbosity that could be provided, since it's actually timing out.
An option I can think of would be to just completely disable the client timeout, which might give us some insight into the issue.

It's not currently possible to modify the timeout, but it's not really complicated to implement.

If the issue happens again, let me know and I'll implement it so that you can test it on latest.

@dchidell (Contributor, Author)

I've not seen this since! Closing until/unless I have more evidence to support my initial observations.

@dchidell (Contributor, Author)

I've added another 4 services (DNS checks), and I've had a number of new instances of this popping up on other HTTP services.

If there's an option to increase the timeout, that would be great to test. What would happen in that case if the response time exceeded the poll interval? (i.e. how would I know it's still failing?)

@TwiN (Owner) commented May 1, 2021

@dchidell Done.

Once the build is done, it'll be available on latest.

To use it, all you need to do is set the HTTP_CLIENT_TIMEOUT_IN_SECONDS environment variable to a numerical value (e.g. 60).
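
For example, assuming a Docker Compose deployment (the image tag, port, and config mount below are illustrative; adjust them to your setup):

```yaml
services:
  gatus:
    image: twinproduction/gatus:latest
    environment:
      HTTP_CLIENT_TIMEOUT_IN_SECONDS: "60"  # client timeout in seconds
    volumes:
      - ./config.yaml:/config/config.yaml   # your existing Gatus configuration
    ports:
      - "8080:8080"
```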

@Meldiron

I can see this issue has shifted a little bit, so: is it possible to scale Gatus? Let's say I needed to test 1,000 or even 10,000 websites and have accurate response times; is that possible?

I can see that Gatus uses a file to persist data when the server stops, and everything is in RAM while the server is running. If we had two types of Gatus instances (master and worker), where the master simply manages workers and stores data while the workers do the work (pinging), we could use websockets or Redis to sync data between them.

With these changes, we should be able to start one master and 100 workers, and they should be able to ping an "unlimited" number of websites.

Does anyone think this would be useful?

@TwiN (Owner) commented Jun 15, 2021

@Meldiron I've got a few large instances of Gatus running with ~100 services, and thanks to the monitoring lock, by default no two services are ever monitored at the exact same time, so the response time should remain accurate.

Furthermore, memory usage should remain fairly low even if you were monitoring 1000 services, though granted, the UI might be a bit hard to navigate with that many services 🤔

A distributed approach is being discussed in #64 (and #124 may further contribute to making that easier); I just haven't had the time to work on that issue yet.


@dchidell Do you have any news? Has the issue happened again?

@dchidell (Contributor, Author) commented Jun 16, 2021

Nope - I've seen absolutely no problems. I ran the new build with the HTTP client timeout set to 120 seconds and saw no problems, so I reverted back and still have not seen anything of consequence.

@TwiN (Owner) commented Jun 17, 2021

@dchidell How long ago did you revert back?

@dchidell (Contributor, Author)

I was running that build for around a week, so I reverted ~3 weeks ago.

@TwiN (Owner) commented Jun 18, 2021

Alright, sweet.

I'll get rid of HTTP_CLIENT_TIMEOUT_IN_SECONDS. I just released v2.7.0, and that version will be the last version with that hidden feature.

@TwiN (Owner) commented Jun 18, 2021

Glad to know that the issue no longer happens!

TwiN closed this as completed Jun 18, 2021