Today our monitoring systems rang the alarms for the server the availability WS is running on (in Docker, on 1.0.3 at the time): RAM and swap were completely full, to the extent that I couldn't SSH into the machine for a good while, until the OS in the Docker container realized how bad things were and started handing out SIGKILL red cards to processes.
Some of the issues seem to have improved with 1.0.5 (which I updated to just now, to re-check the issue before reporting), but I don't think the situation is exactly ideal yet, so here goes.
After digging through the Apache logs of our main outward-facing server for a while, the culprit turned out to be a user sending a very problematic request and then repeating the same request three times in short succession, probably because no response arrived immediately (which probably made things three times worse).
Basically the request was: fdsnws/availability/1/extent?network=BW, which matches an insane number of rows in the DB.
On 1.0.3 this actually runs for about 20 minutes, filling up ~6 GB of RAM (up to the point of zero RAM/swap left on the host), and it looks like it was matching some 4 million rows in the DB.
On 1.0.5 it seems to blow up much faster before eventually being stopped; the browser then shows either something like "Proxy Error" (I didn't save the exact content of the reply on the first try) or the following (what I got on the second debug attempt):
Error 413: Request too large.
The request exceeds the limit of 2_500_000 rows.
Usage details are available from http://www.fdsn.org/webservices/fdsnws-availability-1.0.pdf
Request:
/extent?network=BW
Request Submitted:
2026-Apr-22 14:03:54 UTC
Service version:
Service: fdsnws-availability version:1.0.5
Nevertheless, before it errors out, it fills up almost all of the RAM and swap.
I think the limit of 2.5 million rows (which seems to be in place now) should be lowered further, since I don't think it's acceptable for a single request to be able to take up several GB of RAM. Users should be forced to make sane requests that can be handled within a reasonably short amount of time.
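To illustrate the idea of a cheaper, earlier limit: a sketch of rejecting oversized requests before any rows are materialized, by letting the database count matches first. The table name, column name, and limit value below are hypothetical, not the service's actual schema or configuration.

```python
import sqlite3

# Hypothetical limit, far below the current 2.5 million rows.
MAX_ROWS = 100_000

def check_request_size(conn, network):
    """Raise before fetching if the request would match too many rows.

    COUNT(*) lets the DB engine do the counting; no result rows are
    pulled into the service's memory, so the check itself stays cheap.
    """
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM availability WHERE network = ?", (network,)
    ).fetchone()
    if n > MAX_ROWS:
        # In the real service this would map to an HTTP 413 response.
        raise ValueError(
            f"Request matches {n} rows, exceeding the limit of {MAX_ROWS}."
        )
    return n
```

With a guard like this, the memory cost of a runaway request is bounded by the count query rather than by the full result set.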
Also: maybe the service should explicitly require "starttime", "endtime", "network", "station", "location" and "channel" to be specified by the user. I find it very weird and unfortunate that the official FDSN WS specs declare all of these "mandatory" while at the same time saying the default should be "any" (how does a mandatory parameter need a default?). (edit: see comment below) Explicitly asking users to spell out wildcards if they want everything might at least give them a chance to think about what they are asking for.
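The "require everything, even wildcards" idea could be a simple validation step before the query is built. A minimal sketch, assuming the parameters arrive as a dict; the function name and error wording are made up for illustration.

```python
# The six selection parameters the post suggests making truly mandatory.
REQUIRED = ("starttime", "endtime", "network", "station", "location", "channel")

def validate_params(params):
    """Reject requests that omit any selection parameter.

    Users who really want everything must pass an explicit "*" wildcard
    instead of relying on a silent match-everything default.
    """
    missing = [k for k in REQUIRED if k not in params]
    if missing:
        # In the real service this would map to an HTTP 400 response.
        raise ValueError(
            "Missing required parameter(s): " + ", ".join(missing)
            + " (pass '*' explicitly to request everything)"
        )
```

A request like ?network=BW would then fail fast with a clear message, long before it can touch the database.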
I'm not really sure how to fix this, but a graceful error should be triggered at a much lower bar than the current one, since the measures only kick in once a single malign request has already used up ~8 GB of RAM+swap.