
Server message processing blocked in some cases when index optimization occurs #6965

Closed
hulkk opened this issue Dec 10, 2019 · 5 comments · Fixed by #7071

@hulkk commented Dec 10, 2019

Expected Behavior

Message processing should not be blocked while indices are being optimized, or, if processing is temporarily blocked, it should recover once the cause of the block is resolved.

Current Behavior

When index optimization starts, it sometimes causes message processing to stop, and currently the only known way to resolve this is to restart the Graylog service.

When this bug is triggered, Graylog seems to have trouble opening network connections to Elasticsearch. Normally there are always some established connections to Elasticsearch, but while this issue is active there are zero established connections from Graylog to Elasticsearch.

[Screenshot 2019-12-10 at 14:42:01]

As you can see from the screenshot, there is only one active system job, started on the master node, but it is a different node that cannot output any messages.

While the bug is active, the following command returns 0:
sudo netstat -antpu | grep 9200 | grep ESTABLISHED | wc -l

After a service restart there are always some established connections, even in our smallest test environment.

Outgoing message metrics from the affected node:

[Screenshot 2019-12-10 at 15:30:17]

At about the same time, the following event appears in the logs:
2019-12-10T14:13:21.729+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #88).

Steps to Reproduce (for bugs)

Currently I don't have steps to reproduce this. Today, for example, it has happened twice within four hours in our largest cluster, each time on a different node. It does not happen every time, and it is not related to the Graylog master node or to multiple concurrent index optimization tasks as in #4637. We have also raised the connection limits (the settings in question are shown below), but that did not help; the issue cannot be caused by the limit, because Graylog does not open any connections at all.
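
The connection limits in question are the Elasticsearch client pool settings in Graylog's server.conf. A quick way to check what a node is actually running with (the path assumes a standard CentOS package install):

# Matches both elasticsearch_max_total_connections and
# elasticsearch_max_total_connections_per_route
sudo grep -E '^elasticsearch_max_total_connections' /etc/graylog/server/server.conf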

Context

Currently Graylog is unreliable for us because message processing gets jammed multiple times per day and requires a service restart. We have implemented monitoring for this so we notice it and can react immediately; a rough sketch of the check is below.
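
Our monitoring is essentially a scheduled version of the netstat check shown above. A minimal sketch of the idea, assuming Elasticsearch on port 9200 and using syslog as a placeholder alert channel:

#!/bin/bash
# Alert when this Graylog node has zero established connections to Elasticsearch.
# Replace the logger call with whatever alerting mechanism you actually use.
ESTABLISHED=$(sudo netstat -antpu | grep 9200 | grep ESTABLISHED | wc -l)
if [ "$ESTABLISHED" -eq 0 ]; then
    echo "$(hostname): no established connections to Elasticsearch on port 9200" | logger -t graylog-es-check
fi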

As you can see from the screenshot, we have quite large incoming volumes: constantly over 10,000 msg/sec and over 1 terabyte of logs per day.

Your Environment

  • Graylog Version: 3.0.2
  • Elasticsearch Version: 6.8.4
  • MongoDB Version: 4.0.13
  • Operating System: CentOS 7.7
@hulkk added the bug label Dec 10, 2019

@hulkk commented Dec 11, 2019

This might be related to #5091. I'll try to get a thread dump and a tcpdump the next time this occurs.

@bernd commented Dec 11, 2019

@hulkk Thank you for the update!

@hulkk commented Dec 11, 2019

This is definitely related to Elasticsearch responding with 413 Request Entity Too Large.

While the issue was active I took a capture with tcpdump -i ens192 -s0 port 9200 -w issue.pcap and inspected it using Wireshark.

Request (Content-Length header over 100 MB, but Elasticsearch responded after only about 100 kB of the data stream had been sent):

POST /_bulk HTTP/1.1
Content-Length: 110621997
Content-Type: application/json; charset=UTF-8
Host: <redacted>:9200
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.4 (Java/1.8.0_232)
Accept-Encoding: gzip,deflate

{"index":{"_id":"<redacted>","_index":"graylog_deflector","_type":"message"}}
<redacted>
...
{"index":{"_id":"<redacted>","_index":"graylog_deflector","_type":"message"}}
<redacted>
{"index":{"_id":"<redacted>","_index":"graylog_deflector","_type":"message"}}
<redacted, truncated message>

Response

HTTP/1.1 413 Request Entity Too Large
content-length: 0

Request
<redacted, truncated message>

Response

HTTP/1.0 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 203

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"}],"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"},"status":400}

During the issue, the following events appear in the logs:

2019-12-11T14:45:41.398+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #1).
2019-12-11T14:45:44.345+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #2).
2019-12-11T14:45:47.220+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #3).
2019-12-11T14:45:49.781+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #4).
2019-12-11T14:45:52.364+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #5).
2019-12-11T14:45:55.276+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #6).
2019-12-11T14:45:57.757+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #7).
2019-12-11T14:46:00.839+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #8).
2019-12-11T14:46:03.912+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #9).
2019-12-11T14:46:07.740+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #10).
2019-12-11T14:46:12.396+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #11).
2019-12-11T14:46:16.789+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #12).
2019-12-11T14:46:23.392+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #13).
2019-12-11T14:46:34.023+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #14).
2019-12-11T14:46:52.709+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #15).
2019-12-11T14:47:25.077+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #16).
2019-12-11T14:47:57.485+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #17).
2019-12-11T14:48:29.790+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #18).
2019-12-11T14:49:02.264+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #19).
2019-12-11T14:49:34.694+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #20).
2019-12-11T14:50:07.026+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #21).
2019-12-11T14:50:39.508+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #22).
2019-12-11T14:51:11.841+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #23).
2019-12-11T14:51:44.201+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #24).
2019-12-11T14:52:16.591+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #25).

At 14:46:10 the output still worked, but by 14:46:20 the output had dropped to zero.

[Screenshot 2019-12-11 at 16:57:43]

This bug was reported over a year ago and has still not been fixed, even though it concerns very basic HTTP functionality. Please get this fixed as soon as possible!

@hulkk commented Jan 9, 2020

Any updates?

@bernd commented Jan 9, 2020

> Any updates?

We are currently working on a pull request to fix this: #7071

Once it's merged it will be in the upcoming 3.2 release and we will also backport it to 3.1.
