Server message processing blocked in some cases when index optimization occurs #6965
Message processing should not be blocked while indices are being optimized; if processing is temporarily blocked, it should recover on its own once the cause of the block is resolved.
When index optimization starts, it sometimes causes message processing to stop, and currently the only known way to resolve this is to restart the graylog service.
When this bug is triggered, Graylog appears unable to open network connections to Elasticsearch. Normally there should always be some number of established connections to Elasticsearch, but while the issue is active there are zero established connections from Graylog to Elasticsearch.
As you can see from the screenshot, there is only one active system job, which is running on the Master node, yet it is a different node that can't output any messages.
While the bug is active, the following command returns 0.
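The exact command from the original report is not preserved here; a check along these lines (assuming Elasticsearch listens on its default HTTP port 9200) shows the count of established connections:

```shell
# Count established TCP connections to Elasticsearch (port 9200 assumed).
# Prints 0 while the bug is active, a positive number after a service restart.
ss -tan | grep ':9200' | grep -c ESTAB || true
```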
After a service restart there are always some established connections, even in our smallest test environment.
The outgoing message metrics from the affected node:
At about the same time, the following event appears in the logs:
Steps to Reproduce (for bugs)
Currently I don't have steps to reproduce this. For example, today it happened twice in four hours in our largest cluster, each time on a different node. It does not happen every time, and it is not tied to the Graylog Master node or to multiple concurrent index optimization tasks as in #4637. We have also raised the connection limits, but that did not help; the issue cannot be caused by the limit, because Graylog doesn't open any connections at all.
Currently Graylog is unreliable for us because message processing gets jammed multiple times per day and requires a service restart. We have implemented monitoring for this so that we can notice and react to it immediately.
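Our actual monitoring setup is not shown in this report; a minimal sketch of such a check (the function name and Nagios-style return codes are assumptions, as is the default port 9200) could look like this:

```shell
# check_es_connections: hypothetical monitoring helper that alerts when
# Graylog holds no established connections to Elasticsearch, mirroring
# the symptom described above.
check_es_connections() {
    port="${1:-9200}"  # assumed default Elasticsearch HTTP port
    # grep -c prints 0 when there are no matches, so count is always numeric
    count=$(ss -tan 2>/dev/null | grep ":${port}" | grep -c ESTAB)
    if [ "${count:-0}" -eq 0 ]; then
        echo "CRITICAL: 0 established connections to Elasticsearch (port ${port})"
        return 2   # Nagios-style "critical" status
    fi
    echo "OK: ${count} established connections to Elasticsearch (port ${port})"
    return 0
}
```

Wired into Nagios, Icinga, or a cron job, this turns the "zero established connections" symptom into an immediate alert instead of being noticed only when messages stop flowing.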
As you can see from the screenshot, our incoming volume is quite large: constantly over 10,000 msg/sec and more than a terabyte of logs per day.
This is definitely related to Elasticsearch responding with 413 Request Entity Too Large.
While the issue was active, I took the following tcpdump.
Request (Content-Length header over 100 MB, but Elasticsearch responded after only ~100 kB of the data stream):
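For context: Elasticsearch rejects request bodies larger than `http.max_content_length` with a 413, and that setting defaults to 100mb, which lines up with the over-100 MB Content-Length seen here. Assuming that is the trigger, a sketch of raising the limit in `elasticsearch.yml` (a workaround only; it does not explain the connections getting stuck):

```yaml
# elasticsearch.yml -- raise the HTTP request body limit (default: 100mb).
# Workaround only; Graylog should also keep bulk requests below the limit.
http.max_content_length: 200mb
```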
During the issue, the following events appear in the logs:
At 14:46:10 the output still worked, but at 14:46:20 the output was zero.
This bug was reported over a year ago and has still not been fixed, even though it concerns very basic HTTP functionality. Please fix it as soon as possible!