Scalability limits? #104

Closed
dchidell opened this issue Apr 15, 2021 · 13 comments

@dchidell (Contributor) commented Apr 15, 2021

I'm monitoring around 50-55 services with Gatus; most are HTTP, and 29 of them are using the pat keyword (with wildcards, so about as expensive a query as it can get). All are using the default poll interval of 60s.
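
For reference, each service entry follows roughly this shape (the name and URL below are placeholders, not my real targets):

```yaml
services:
  - name: example-service             # placeholder
    url: "https://example.org/health" # placeholder
    interval: 60s                     # default poll interval
    conditions:
      - "[STATUS] == 200"
      - "[BODY] == pat(*healthy*)"    # wildcard body match
```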

I am starting to see some responses of context deadline exceeded (Client.Timeout exceeded while awaiting headers) in the body check. I have manually checked the services in question and they're healthy. Restarting Gatus, or just waiting a few minutes, seems to resolve this. It does not occur continuously, but I have seen it twice in the space of a few hours.

I can only assume that this is due to a concurrency issue, as it's quite possible that the combined response time of all the services exceeds 60s. I do not know enough about the Gatus architecture to know whether this is a problem or not.

I am running v2.3.0 with #100 changed locally (as I've not updated since I tested it). I will repeat the test with 2.4.0 and report the results here.

@TwiN (Owner) commented Apr 15, 2021

In fact, unless you're using disable-monitoring-lock, Gatus should never run into concurrency issues.

By default, there's a lock that prevents two services from being evaluated at the same time. This is the default behavior so that the measured response time is as accurate as possible.
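
Conceptually, you can picture it as a single mutex shared by every service's monitoring loop; here's a simplified Go sketch of the idea (not the actual Gatus code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// One mutex shared by all services: only one check runs at a time,
// so the measured response time isn't skewed by concurrent checks.
var monitoringLock sync.Mutex

// monitor evaluates a single service every interval, holding the lock
// for the duration of the check unless the lock is disabled.
func monitor(name string, interval time.Duration, disableLock bool) {
	for {
		if !disableLock {
			monitoringLock.Lock()
		}
		start := time.Now()
		time.Sleep(50 * time.Millisecond) // stand-in for the real HTTP/DNS check
		fmt.Printf("%s evaluated in %v\n", name, time.Since(start))
		if !disableLock {
			monitoringLock.Unlock()
		}
		time.Sleep(interval)
	}
}

func main() {
	go monitor("service-a", time.Second, false)
	go monitor("service-b", time.Second, false)
	time.Sleep(3 * time.Second) // let a few evaluations run, then exit
}
```

In a sketch like this, if the combined duration of all checks exceeds the interval, evaluations simply queue up behind the lock rather than overlapping.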

The context deadline exceeded (Client.Timeout exceeded while awaiting headers) you're seeing is most likely caused by the default client timeout, which is 10s.

I occasionally use Gatus for stress testing, and on average, 50 services with an interval of 1s each usually end up consuming roughly 0.1 vCPU (100m). Given that your interval is 60s rather than 1s, I doubt CPU is to blame, even if you're using pat. That leaves us with networking.
When the issue happens, are all services affected, or just one? If multiple services are affected, are they all targeting the same application?

It's a bit hard to help without being able to reproduce it, but it has never happened to me, and if I had to guess, I'd say either the service is unreachable or the service being tested really is timing out.

@dchidell (Contributor, Author)

I'm not using disable-monitoring-lock, so that can be ruled out.

The networking is convoluted: it tests the service as a user would, and requests get proxied via Cloudflare. You're right about it hitting the 10s timeout, but when I manually check the service, it's fine, and restarting Gatus solves it (whilst restarting the service does not).

I've been running on 2.4.0 all day now without issue, so it does seem tricky to reproduce. I opened this after seeing it twice in an hour, about 20 minutes or so after adding a new service. I've not seen it since.

Is there any additional verbosity I can turn on to collect useful logging output? I may run a test and scale this up to see if I can more reliably reproduce this behaviour.

@TwiN (Owner) commented Apr 15, 2021

The thing is, if it's timing out, there's no extra verbosity that could be provided, since it's actually timing out.
An option I can think of would be to just completely disable the client timeout, which might give us some insight into the issue.

It's not currently possible to modify the timeout, but it's not really complicated to implement.

If the issue happens again, let me know and I'll implement it so that you can test it on latest.

@dchidell (Contributor, Author)

I've not seen this since! Closing until/unless I have more evidence to support my initial observations.

@dchidell (Contributor, Author)

I've added another 4 services (DNS checks), and I've had a number of new instances of this popping up on other HTTP services.

If there's an option to increase the timeout, that would be great to test. What would happen in that case if the response time exceeded the poll interval? (i.e. how would I know it's still failing?)

@TwiN (Owner) commented May 1, 2021

@dchidell Done.

Once the build is done, it'll be available on latest.

To use it, all you need to do is set the HTTP_CLIENT_TIMEOUT_IN_SECONDS environment variable to a numerical value (e.g. 60).
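
For example, assuming a Docker Compose deployment (the image tag, port, and config mount below are illustrative; adjust them to your setup):

```yaml
services:
  gatus:
    image: twinproduction/gatus:latest
    environment:
      HTTP_CLIENT_TIMEOUT_IN_SECONDS: "60"  # client timeout in seconds
    volumes:
      - ./config.yaml:/config/config.yaml   # your existing Gatus configuration
    ports:
      - "8080:8080"
```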

@Meldiron

I can see this issue has shifted a little bit, so: is it possible to scale Gatus? Let's say I needed to test 1,000 or even 10,000 websites and have accurate response times; is that possible?

I can see that Gatus uses a file to persist data when the server stops, and everything is in RAM while the server is running. If we had two types of Gatus instances (master and worker), where the master simply manages workers and stores data while the workers do the work (pinging), we could use websockets or Redis to sync data between them.

With these changes, we should be able to start one master and 100 workers, and they should be able to ping an "unlimited" number of websites.

Does anyone think this would be useful?

@TwiN (Owner) commented Jun 15, 2021

@Meldiron I've got a few large instances of Gatus running with ~100 services, and thanks to the monitoring lock, by default no two services are ever monitored at the exact same time, so the response time should remain accurate.

Furthermore, memory usage should remain fairly low even if you were monitoring 1000 services, though granted, the UI might be a bit hard to navigate with that many services 🤔

A distributed approach is being discussed in #64 (and #124 may further contribute to making that easier); I just haven't had the time to work on that issue yet.


@dchidell Do you have any news? Has the issue happened again?

@dchidell (Contributor, Author) commented Jun 16, 2021

Nope - I've seen absolutely no problems. I ran the new build with the HTTP client timeout set to 120 seconds and saw no problems, so I reverted back and still have not seen anything of consequence.

@TwiN (Owner) commented Jun 17, 2021

@dchidell How long ago did you revert back?

@dchidell (Contributor, Author)

I was running that build for around a week, so I reverted ~3 weeks ago.

@TwiN (Owner) commented Jun 18, 2021

Alright, sweet.

I'll get rid of HTTP_CLIENT_TIMEOUT_IN_SECONDS. I just released v2.7.0, and that version will be the last version with that hidden feature.

@TwiN (Owner) commented Jun 18, 2021

Glad to know that the issue no longer happens!

TwiN closed this as completed Jun 18, 2021