
Provide a way to debug indexer failures #4578

Closed
dploeger opened this issue Feb 14, 2018 · 19 comments

@dploeger

Expected Behavior

Graylog reports indexer failures with an index name and a letter ID, but doesn't store the failed message anywhere. This makes it hard to debug and fix those failures.

Current Behavior

Currently there is no way of debugging indexer failures. It's more of a guessing game.

Possible Solution

One way would be to use MongoDB to also store the raw, failed message alongside the indexer failure. This would obviously increase the MongoDB load. Another way would be a ring buffer on disk.

Context

Every time you switch a new application's logging over to Graylog, you're essentially guessing what might get logged; those guesses may be wrong and lead to indexer failures. If you onboard a lot of new applications in one go, you have a hard time finding the actual failed message.

Your Environment

  • Graylog Version: 2.2.3
  • Elasticsearch Version: 2.3.5
  • MongoDB Version: 3.4.0
  • Operating System: Ubuntu 16.04
  • Browser version: Chrome 64.0.3282.140 on macOS 10.13.3
@joschi
Contributor

joschi commented Feb 14, 2018

@dploeger Indexing failures are stored in MongoDB (in the index_failures collection) and can be viewed in the Graylog web interface on the System/Overview page.

Storing the complete failed message might lead to problems with Graylog clusters having not enough resources, so we removed that feature in Graylog 2.0.0 (see #1498).

As an example, think of a 10 MiB message which cannot be indexed and is sent to Graylog multiple times (e.g. 1000 times). This would make MongoDB explode pretty quickly.

FWIW, we could probably add some more details about the failed messages, such as the input and the remote IP address of the sender.
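
For anyone who wants to inspect those records directly, here is a minimal mongo shell sketch (assuming the default graylog database name); it lists the five most recent entries of the index_failures collection, which typically contain the index name, letter ID and error reason, but not the original message:

# sketch: list the five most recent index failures from MongoDB
mongo graylog --quiet --eval 'db.index_failures.find().sort({timestamp: -1}).limit(5).forEach(printjson)'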

@dploeger
Author

dploeger commented Feb 14, 2018

@joschi Yes, the input and the source would be great as a start. Another possibility would be a UI to look inside the message processor; that way you could perhaps observe failed messages while they are processed. I understand you're using Kafka there?

@joschi
Contributor

joschi commented Feb 14, 2018

@dploeger No, Graylog internally uses the journal implementation from Apache Kafka (for persisting messages between the input and the processing buffers), but not the complete package.

@dploeger
Author

@joschi Hm, okay. Then the UI will be a bit more complicated. 😄

So:

  • good: source and input added to the index failures collection
  • best: a view into message processing

Thanks!

@twall

twall commented Mar 21, 2018

Even saved messages with a tunable length cap would be preferable to no information at all. With more than a trivial number of inputs and messages, it's nearly impossible to address those indexing errors. You also wouldn't need to keep more than a single copy for a given error message key (I have nearly 1000 index failures with the exact same error message).

@apjm

apjm commented Oct 12, 2018

I think I'm coming up against this issue too. I have a large number of these failures:

{"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field="entity" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 105, 115, 116, 40, 79, 102, 102, 101, 114, 101, 100, 65, 112, 112, 111, 105, 110, 116, 109, 101, 110, 116, 40, 50, 48, 49, 56, 45, 49]...', original message: bytes can be at most 32766 in length; got 34590","caused_by":{"type":"max_bytes_length_exceeded_exception","reason":"bytes can be at most 32766 in length; got 34590"}}`

In the last five minutes I have 20 different sources logging messages with an 'entity' field. It would be great to be able to stop these failures.

@dploeger
Author

@apjm I guess you can only truncate the "entity" field in your installation. That's what I did with a "stacktrace" field that got too big. Just trim it down to a maximum of 32766 characters.

@Jan-Zimmermann

Is it not possible to store these messages? Even just the last 5, shortened to 32k? Or to write fields defined in a configuration to the log? My logs come from a web farm, and each "source" has multiple web apps on board, so a source and input field in the log would not be helpful in my case (different field types).

@megamorf

I'm also affected by this, and having no means whatsoever to even get an idea of what the offending event is makes this way harder than it needs to be. Either being able to view real-time processing of inputs, or at least logging the first n characters of an event that caused an indexing failure, would help tremendously.

@juemue

juemue commented Oct 28, 2019

Is there anything new about this issue?

@Nklya

Nklya commented Nov 20, 2019

I have a related question.
Is it possible to get an indexer failures metric?

I tried org_graylog2_rest_resources_system_indexer_FailuresResource_count from the Prometheus plugin, but it looks like it's not the same as the indexer failures count shown in System->Overview.

I use this plugin https://github.com/graylog-labs/graylog-plugin-metrics-reporter

P.S. For now, I can only get this metric from Mongo.
This query returns the number of index failures for the last 5 minutes:

mongo graylog --quiet --eval 'db.index_failures.count() - db.index_failures.find({timestamp: {$lt: new Date((new Date())-1000*60*5)}}).count()'
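
An equivalent sketch that needs only one query (assuming timestamp is stored as a BSON date, which the comparison above already relies on):

# count only the index failures from the last 5 minutes
mongo graylog --quiet --eval 'db.index_failures.find({timestamp: {$gte: new Date(Date.now() - 5*60*1000)}}).count()'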

@hulkk

hulkk commented Apr 29, 2020

Any updates regarding this?

In our scenario we have hundreds of log sources, and dozens of those send logs in JSON format where fields are extracted automatically. Currently it is nearly impossible to troubleshoot indexing failures.

Storing the complete failed message might lead to problems with Graylog clusters having not enough resources...

This is not an acceptable explanation.

  • The number of failed messages stored could be limited/configurable.
  • As a bare minimum, the metadata of the failed message (source, input) should be stored.

@alitoufighi

I believe this is a really important feature that lots of users need in order to find out what causes their logs to be incomplete.

In my case I'm seeing lots of max_bytes_length_exceeded_exception errors from the ES side, and there is absolutely no way to figure out what that message was and where it came from. I'm just losing my logs, which are probably important ones since they had lots of data!

As @joschi said two years ago:

FWIW, we could probably add some more details about the failed messages, such as the input and the remote IP address of the sender.

I also agree that these would be a good and probably easy starting point to let users debug their setups.

@hulkk

hulkk commented Jul 2, 2020

Currently it is impossible to troubleshoot log-payload-related errors.

I'm struggling with lots of the following events:

2020-07-02T16:56:21.210+03:00 WARN [GelfCodec] GELF message <redacted> (received from <loadbalancer-ip:26444>) is missing mandatory "host" field.

2020-07-02T16:56:20.549+03:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=redacted, journalOffset=41919444374, codec=gelf, payloadSize=637, timestamp=2020-07-02T13:56:20.549Z, remoteAddress=/loadbalancer-ip:28843}
java.lang.IllegalArgumentException: GELF message <redacted> (received from <loadbalancer-ip:28843>) has empty mandatory "short_message" field.
	at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:258) ~[graylog.jar:?]
	at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:140) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:150) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:91) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47) [graylog.jar:?]
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

I don't understand why the message GUID is logged, because you can't use the GUID anywhere. One should be able to configure what content gets logged when these errors occur.
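
For reference, a minimal GELF 1.1 payload that passes this validation looks roughly like the sketch below; host and short_message are exactly the mandatory fields the errors above complain about, and all values here are placeholders:

{
  "version": "1.1",
  "host": "example.org",
  "short_message": "request failed",
  "timestamp": 1593698180.0,
  "level": 3,
  "_app_name": "example-app"
}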

@jasonrberk

@apjm I guess you can only truncate the "entity" field in your installation. That's what I did with a "stacktrace" field that got too big. Just trim it down to a maximum of 32766 characters.

Do you have an example I could follow? Still having this issue in mid-2022.

@jasonrberk

I figured this out for anyone still fighting this issue in 2022 on GL4

[Screenshot: Screen Shot 2022-06-23 at 11 29 29 AM, showing the truncation workaround]

@alitoufighi

Hey @jasonrberk that's great. Your workaround for truncating the log reminded me of an alternative way:
You can use pipelines, especially the substring() function, to keep only the first 32766 characters. However, I had unsatisfactory results with this when my logs contained UTF-8 characters (the Elasticsearch limit is in bytes, not characters). This way, you can also debug() the complete message for further inspection, since such a large log can indicate that something is wrong.
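
A minimal sketch of that substring() approach (the field name "message" and the 32000 cutoff are only placeholders here):

rule "truncate-long-message"
when
    has_field("message")
then
    // keep roughly the first 32000 characters; the Elasticsearch limit (32766) is in bytes,
    // so multi-byte UTF-8 text may need an even smaller cutoff
    set_field("message", substring(to_string($message.message), 0, 32000));
    // optionally log the original for inspection, as mentioned above
    debug(to_string($message.message));
end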

@jasonrberk

@alitoufighi

here's a pipeline rule for "trimming" enormous stack traces:

rule "truncate-stacks"
when    
    has_field("StackTrace")
    AND regex(
        pattern: "^.{32000,}$", 
        value: replace(
            value: to_string($message.StackTrace), 
            search: "\n",
            replacement: "|"
        )
    ).matches == true
then
    let trimmed = abbreviate(value: to_string($message.StackTrace), width: 32000);
    set_field("StackTrace", trimmed);
    set_field("StackTraceTruncated", true);
    // added debug message to be notified about the trimmed stacktrace
    debug( concat("truncated oversized StackTrace from ", to_string($message.source)));
end

@boosty
Contributor

boosty commented Aug 4, 2022

We added a related feature in Graylog 4.2 (available in Graylog Operations and Graylog Cloud):

https://docs.graylog.org/docs/indexer-and-processing-failures

It allows for storing indexing and processing failure notifications in a dedicated Elasticsearch/OpenSearch failure index. With a dedicated stream for such failure notifications, these messages can be logged and aggregated meaningfully into a dashboard, used to set up alert notifications, and more. It is also possible to keep the full original message that failed.

@boosty boosty closed this as completed Aug 18, 2022