Graylog doesn't handle Elasticsearch "413 Request Entity Too Large" reply #5091

Closed
macko003 opened this issue Sep 10, 2018 · 1 comment · Fixed by #7071
macko003 commented Sep 10, 2018

One Graylog node lost its connection to the Elasticsearch cluster.
It seems that Graylog sometimes builds a bigger batch (based on output_batch_size) than Elasticsearch allows (ES bulk_max_body_size, default 100 MB), and Graylog does not recognize the error it receives back from ES.
Only after a Graylog service restart can it send logs to Elasticsearch again.

Over time Graylog loses the connection to all ES nodes and can no longer store logs; the journal starts to grow.
Issue details: https://community.graylog.org/t/elastic-cluster-connection-lost/6189/8

Expected Behavior

Graylog should create batches smaller than the ES limit, or handle the error and re-fragment the batch when it receives a "413 Request Entity Too Large" response.

Current Behavior

Graylog builds a batch and tries to send it to ES; ES rejects it with a 413 and sends a TCP reset; Graylog then retries the same batch against another ES node.

Graylog log:

2018-09-10T11:49:04.401+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Connection reset, retrying (attempt #28).

Excerpt from a tcpdump capture:

12:05:45.204959 IP SUBNET.41.57818 > SUBNET.36.9200: Flags [P.], seq 18272:18503, ack 489, win 269, options [nop,nop,TS val 1985608753 ecr 1984697849], length 231
.......PV..d....E...p1@.@...dN.)dN.$..#.g.9..x@'...........
vY.1vL..POST /_bulk HTTP/1.1
Content-Length: 230363373
Content-Type: application/json; charset=UTF-8
Host: SUBNET.36:9200
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.4 (Java/1.8.0_181)
Accept-Encoding: gzip,deflate

12:05:45.205301 IP SUBNET.36.9200 > SUBNET.41.57818: Flags [P.], seq 489:549, ack 31535, win 752, options [nop,nop,TS val 1984716154 ecr 1985608753], length 60
.......PV.......E..p..@.@...dN.$dN.)#....x@'g.m.....md.....
vL]zvY.1HTTP/1.1 413 Request Entity Too Large
content-length: 0

12:05:45.205709 IP SUBNET.36.9200 > SUBNET.41.57818: Flags [R.], seq 848, ack 90903, win 1272, options [nop,nop,TS val 0 ecr 1985608754], length 0
.......PV.......E..4..@.@...dN.$dN.)#....xA.g.U{....bK.....
....vY.2{"index":{"_id":

12:05:45.967516 IP SUBNET.41.34322 > SUBNET.38.9200: Flags [P.], seq 95405:95636, ack 1034, win 245, options [nop,nop,TS val 1985609516 ecr 1983368835], length 231
.......PV..d....E.....@.@..TdN.)dN.&..#.....OW`............
vY.,v7..POST /_bulk HTTP/1.1
Content-Length: 230363373
Content-Type: application/json; charset=UTF-8
Host: SUBNET.38:9200
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.4 (Java/1.8.0_181)
Accept-Encoding: gzip,deflate

12:05:45.967917 IP SUBNET.38.9200 > SUBNET.41.34322: Flags [P.], seq 1034:1094, ack 108668, win 1432, options [nop,nop,TS val 1983383888 ecr 1985609516], length 60
.......PV.BS....E..p..@.@...dN.&dN.)#...OW`..........&.....
v8      PvY.,HTTP/1.1 413 Request Entity Too Large
content-length: 0

12:05:45.968792 IP SUBNET.38.9200 > SUBNET.41.34322: Flags [R], seq 1331126799, win 0, length 0
.......PV.BS....E..(..@.@.V3dN.&dN.)#...OWb.....P...{..........,v7............

12:05:46.708613 IP SUBNET.41.47776 > SUBNET.78.9200: Flags [P.], seq 33578:33809, ack 505, win 458, options [nop,nop,TS val 1985610257 ecr 1984247522], length 231
.......PV..d....E...AU@.@..udN.)dN.N..#..V.`ubS...... .....
vZ..vE6.POST /_bulk HTTP/1.1
Content-Length: 230363373
Content-Type: application/json; charset=UTF-8
Host: SUBNET.78:9200
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.4 (Java/1.8.0_181)
Accept-Encoding: gzip,deflate

12:05:46.709076 IP SUBNET.78.9200 > SUBNET.41.47776: Flags [P.], seq 505:565, ack 46841, win 6634, options [nop,nop,TS val 1984261327 ecr 1985610257], length 60
.......PV.3{....E..p..@.@...dN.NdN.)#...ubS..W./....eH.....
vEl.vZ..HTTP/1.1 413 Request Entity Too Large
content-length: 0

// the dump contains the uncommitted Graylog log entries, sampled every 5 minutes

Possible Solution

Re-create the batch if ES rejects it with a 413 Request Entity Too Large error (see the sketch below)
Check bulk_max_body_size before sending and don't create a batch bigger than bulk_max_body_size
Or add a new variable in server.conf to set the maximum batch size (in KB or MB)
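
For illustration only, a minimal sketch in Java of the re-fragmentation idea: on a 413 response, split the batch in half and retry each half separately. BulkSender here is a hypothetical stand-in, not one of Graylog's real classes.

import java.util.List;

// Hypothetical sketch of "re-fragment the batch on 413"; BulkSender is an
// illustrative stand-in for the real bulk indexing client.
class ResplittingBulkIndexer {
    interface BulkSender { int send(List<String> batch); }   // returns the HTTP status code

    private final BulkSender sender;

    ResplittingBulkIndexer(BulkSender sender) {
        this.sender = sender;
    }

    void index(List<String> batch) {
        if (batch.isEmpty()) {
            return;
        }
        if (sender.send(batch) == 413) {                 // 413 Request Entity Too Large
            if (batch.size() == 1) {
                // A single message already exceeds the server limit; splitting cannot help.
                throw new IllegalStateException("single message exceeds the bulk size limit");
            }
            final int mid = batch.size() / 2;            // retry both halves separately
            index(batch.subList(0, mid));
            index(batch.subList(mid, batch.size()));
        }
    }
}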

Steps to Reproduce (for bugs)

  1. Set the Graylog output_batch_size to a higher value (e.g. 10000; default 1000)
  2. Send a lot of log messages within one second, so that
     log_number * log_size > ES bulk_max_body_size && log_number < GL output_batch_size
     e.g. 1000 messages of 1 MB each (see the sketch below)
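
As an illustration of step 2, a rough load generator along these lines could produce such a burst. The GELF TCP input address, the message count, and the ~1 MB padding are assumptions for this sketch, not values taken from the report.

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Rough reproduction sketch: push ~1000 messages of ~1 MB each into a GELF TCP
// input in one burst, so that message_count * message_size exceeds the 100 MB
// Elasticsearch limit while message_count stays below output_batch_size.
// Host, port, and sizes are assumptions.
public class OversizedBatchGenerator {
    public static void main(String[] args) throws Exception {
        final char[] pad = new char[1_000_000];
        Arrays.fill(pad, 'x');
        final String padding = new String(pad);          // ~1 MB of payload per message

        try (Socket socket = new Socket("127.0.0.1", 12201);
             OutputStream out = socket.getOutputStream()) {
            for (int i = 0; i < 1000; i++) {
                final String gelf = "{\"version\":\"1.1\",\"host\":\"loadgen\","
                        + "\"short_message\":\"oversized batch test " + i + "\","
                        + "\"full_message\":\"" + padding + "\"}";
                out.write(gelf.getBytes(StandardCharsets.UTF_8));
                out.write(0);                            // GELF TCP frames are delimited by a null byte
            }
            out.flush();
        }
    }
}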

Context

Your Environment

  • Graylog Version: v2.4.6+ceaa7e4
  • Elasticsearch Version: 5.6.10
  • MongoDB Version: NA
  • Operating System: CentOS 7, 3.10.0-862.11.6.el7.x86_64
  • Browser version: NA
mpfz0r assigned mpfz0r and unassigned kroepke on Dec 12, 2019
mpfz0r added a commit that referenced this issue Jan 2, 2020
mpfz0r added a commit that referenced this issue Jan 2, 2020
mpfz0r added a commit that referenced this issue Jan 3, 2020
mpfz0r added a commit that referenced this issue Jan 10, 2020
bernd closed this in #7071 on Jan 10, 2020
bernd added a commit that referenced this issue Jan 10, 2020
If we try to bulk index a batch of messages that exceeds the
Elasticsearch `http.max_content_length` setting (default 100MB),
Elasticsearch will respond with an HTTP 413 Entity Too Large error.

In this case we retry the request by splitting the message batch
in half.

When responding with an HTTP 413 error, the server is allowed to close the connection
immediately. This means that our HTTP client (Jest) will simply report
an IOException (Broken pipe) instead of the actual error.
This can be avoided by sending the request with an Expect-Continue
header, which also avoids sending data that will be discarded later on.

Fixes #5091

* Move JestClient execution with RequestConfig into JestUtils
* Please forbiddenapi checker
* Correctly handle batches with unevenly sized messages
  If we have a batch where only the messages at the end will
  exceed the Entity Too Large limit, we could end up duplicating
  messages.
  Thus keep track of the already indexed offset and report it within the
  EntityTooLargeException.
* Make use of Expect: 100-continue header configurable
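
To make the Expect-Continue part of the fix concrete, here is a minimal sketch using the Apache HttpClient API that Jest builds on; the endpoint and payload are placeholders, and this is not the actual code from #7071. With expect-continue enabled, the client sends the headers first and waits for the server's verdict, so a 413 comes back as a readable status line instead of a connection reset halfway through a 230 MB upload.

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Minimal sketch of sending a bulk request with "Expect: 100-continue" enabled.
// Endpoint and body are placeholders; error handling is reduced to the 413 case.
public class ExpectContinueBulkRequest {
    public static void main(String[] args) throws Exception {
        final RequestConfig config = RequestConfig.custom()
                .setExpectContinueEnabled(true)          // wait for the server's verdict before streaming the body
                .build();

        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build()) {
            final HttpPost post = new HttpPost("http://127.0.0.1:9200/_bulk");
            post.setEntity(new StringEntity(
                    "{\"index\":{}}\n{\"message\":\"test\"}\n",
                    ContentType.APPLICATION_JSON));

            try (CloseableHttpResponse response = client.execute(post)) {
                if (response.getStatusLine().getStatusCode() == 413) {
                    // Request Entity Too Large: the caller can now split the batch and retry.
                    System.out.println("Bulk request too large, split the batch and retry");
                }
            }
        }
    }
}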
bernd (Member) commented Jan 10, 2020

Thank you for the report. This has been fixed for the upcoming 3.2 version and will also be backported into 3.1.

bernd added a commit that referenced this issue Jan 10, 2020
(cherry picked from commit 085930a)
mpfz0r added a commit that referenced this issue Jan 13, 2020
…#7148)

* Handle Request Entity Too Large errors in ElasticSearchOutput (#7071)

* Adopt MessagesIT to old IT framework

Also change test to run with a specific index

* Skip memory intensive MessagesIT tests on Travis

* use getenv not getProperty

Co-authored-by: Marco Pfatschbacher <marco@graylog.com>