New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Server message processing blocked in some cases when index optimization occurs #6965
Comments
This might be related to #5091. I'll try to get thread dump and tcpdump next time when this occurs. |
@hulkk Thank you for the update! |
This is definitely related to Elasticsearch responding 413 Request Entity Too Large. When this issue is active I took following tcpdump Request (Content length header over 100MB, but Elasticsearch responded after ~100kB data stream)
Response
Request Response
During the issue, there are following events in the logs
At 14:46:10 the output still worked but at 14:46:20 the output was zero. This bug has been reported over a year ago and this has not yet been fixed and this is related to very basic functionality of http. Please get this fixed asap! |
Any updates? |
We are currently working on a pull-request to fix this: #7071 Once it's merged it will be in the upcoming 3.2 release and we will also backport it to 3.1. |
hulkk commentedDec 10, 2019
•
edited
Expected Behavior
Message processing should not be blocked when indices are optimized or if processing is temporarily blocked processing should recover when the reason for block is resolved.
Current Behavior
When index optimization starts in some cases it causes message processing to stop and currently only known way to resolve this is to restart graylog service.
When this bug is triggered it seems that Graylog has issues opening network connections to Elasticsearch. Normally there should always be some number of established connections to Elasticsearch but when this issue is active there is zero established connections from Graylog to Elasticsearch.
As you can see from the screenshot, there is only one active system job which is started on the Master node, but it is different node which can't output any messages.
When bug is active following command returns 0.
sudo netstat -antpu | grep 9200 | grep ESTABLISHED | wc -l
After service restart there are always some number of established connections even in our smallest test envinronment.
The outgoing message metrics from the applicable node.
About at the same time there is following event in the logs
2019-12-10T14:13:21.729+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #88).
Steps to Reproduce (for bugs)
Currently I don't have steps for reproducing this. For example today this has happened in our largest cluster two times in four hours, both at different nodes. This does not happen every time and this is not related to the Graylog Master node or multiple index optimization tasks as per #4637 . We have also updated the connection limits, but it did not help, because this issue is not caused by the limit because Graylog doesn't open any connections.
Context
Currently Graylog is unreliable because message processing gets jammed multiple times per day and requires service restart. We have implemented monitoring for this so we notice can react to it immediately.
As you can see from the screenshot, we have quite large volumes incoming - constantly over 10.000 msg/sec and over 1 terabyte of logs per day.
Your Environment
The text was updated successfully, but these errors were encountered: