
Provide a way to debug indexer failures #4578

Closed
dploeger opened this issue Feb 14, 2018 · 19 comments

@dploeger

Expected Behavior

Graylog reports indexer failures with an index name and a letter ID, but doesn't store the failed message anywhere. This makes it hard to debug and fix those failures.

Current Behavior

Currently there is no way of debugging indexer failures. It's more of a guessing game.

Possible Solution

One way would be to use MongoDB to also store the raw, failed message alongside the indexer failure. This would obviously increase the MongoDB load. Another way would be a ring buffer on disk.

Context

Every time you switch a new application's logging over to Graylog, you're essentially guessing what might get logged; those guesses may be wrong and lead to indexer failures. If you onboard a lot of new applications in one go, you have a hard time finding the actual failed message.

Your Environment

  • Graylog Version: 2.2.3
  • Elasticsearch Version: 2.3.5
  • MongoDB Version: 3.4.0
  • Operating System: Ubuntu 16.04
  • Browser version: Chrome 64.0.3282.140 on macOS 10.13.3
@joschi
Contributor

joschi commented Feb 14, 2018

@dploeger Indexing failures are stored in MongoDB (in the index_failures collection) and can be viewed in the Graylog web interface on the System/Overview page.

Storing the complete failed message might lead to problems with Graylog clusters having not enough resources, so we removed that feature in Graylog 2.0.0 (see #1498).

As an example, think of a 10 MiB message which cannot be indexed and is sent to Graylog multiple times (e.g. 1000 times). This would make MongoDB explode pretty quickly.

FWIW, we could probably add some more details about the failed messages, such as the input and the remote IP address of the sender.
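
For anyone who wants to inspect those records directly, here is a minimal mongo shell sketch (assuming the default graylog database name); it lists the five most recent entries of the index_failures collection, which typically contain the index name, letter ID and error reason, but not the original message:

# sketch: list the five most recent index failures from MongoDB
mongo graylog --quiet --eval 'db.index_failures.find().sort({timestamp: -1}).limit(5).forEach(printjson)'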

@dploeger
Author

dploeger commented Feb 14, 2018

@joschi Yes, the input and the source would be great as a start. Another possibility would be a UI to look inside the message processor; that way you could perhaps observe failed messages while they are processed. I understand you're using Kafka there?

@joschi
Contributor

joschi commented Feb 14, 2018

@dploeger No, Graylog internally uses the journal implementation from Apache Kafka (for persisting messages between the input and the processing buffers), but not the complete package.

@dploeger
Author

@joschi Hm, okay. Then the UI will be a bit more complicated. 😄

So:

  • good: source and input added to the index failures collection
  • best: a view into message processing

Thanks!

@twall

twall commented Mar 21, 2018

Even saved messages with a tunable length cap would be preferable to no information at all. With more than a trivial number of inputs and messages, it's nearly impossible to address those indexing errors. You also wouldn't need to keep more than a single copy for a given error message key (I have nearly 1000 index failures with the exact same error message).

@apjm

apjm commented Oct 12, 2018

I think I'm coming up against this issue too. I have a large number of these failures:

{"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field="entity" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 105, 115, 116, 40, 79, 102, 102, 101, 114, 101, 100, 65, 112, 112, 111, 105, 110, 116, 109, 101, 110, 116, 40, 50, 48, 49, 56, 45, 49]...', original message: bytes can be at most 32766 in length; got 34590","caused_by":{"type":"max_bytes_length_exceeded_exception","reason":"bytes can be at most 32766 in length; got 34590"}}`

In the last five minutes I have 20 different sources logging messages with an 'entity' field. It would be great to be able to stop these failures.

@dploeger
Author

@apjm I guess you can only truncate the "entity" field in your installation. That's what I did with a "stacktrace" field that got too big. Just trim it down to a maximum of 32766 characters.

@Jan-Zimmermann

Is it not possible to store these messages? Even just the last 5, shortened to 32k? Or to write fields defined in a configuration to the log? My logs come from a web farm, and each "source" has multiple web apps on board, so a source and input field in the log would not be helpful in my case (different field types).

@megamorf

I'm also affected by this, and having no means whatsoever to even get an idea of what the offending event is makes this way harder than it needs to be. Either being able to view real-time processing of inputs, or at least logging the first n characters of an event that caused an indexing failure, would help tremendously.

@juemue

juemue commented Oct 28, 2019

Is there anything new about this issue?

@Nklya

Nklya commented Nov 20, 2019

I have a related question.
Is it possible to get an indexer failures metric?

I tried org_graylog2_rest_resources_system_indexer_FailuresResource_count from the Prometheus plugin, but it looks like it's not the same as the indexer failures count shown in System->Overview.

I use this plugin https://github.com/graylog-labs/graylog-plugin-metrics-reporter

P.S. For now, I can only get this metric from Mongo.
This query returns the number of index failures for the last 5 minutes:

mongo graylog --quiet --eval 'db.index_failures.count() - db.index_failures.find({timestamp: {$lt: new Date((new Date())-1000*60*5)}}).count()'
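
An equivalent sketch that needs only one query (assuming timestamp is stored as a BSON date, which the comparison above already relies on):

# count only the index failures from the last 5 minutes
mongo graylog --quiet --eval 'db.index_failures.find({timestamp: {$gte: new Date(Date.now() - 5*60*1000)}}).count()'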

@hulkk

hulkk commented Apr 29, 2020

Any updates regarding this?

In our scenario we have hundreds of log sources, and dozens of those send logs in JSON format where fields are extracted automatically. Currently it is nearly impossible to troubleshoot indexing failures.

Storing the complete failed message might lead to problems with Graylog clusters having not enough resources...

This is not an acceptable explanation.

  • The number of failed messages stored could be limited/configurable.
  • As a bare minimum, the metadata of the failed message (source, input) should be stored.

@alitoufighi

I believe this is a really important feature that lots of users need in order to find out what causes their logs to be incomplete.

In my case I'm seeing lots of max_bytes_length_exceeded_exception errors from the ES side, and there is absolutely no way to figure out what that message was and where it came from. I'm just losing my logs, which are probably important ones since they had lots of data!

As @joschi said two years ago:

FWIW, we could probably add some more details about the failed messages, such as the input and the remote IP address of the sender.

I also agree that these would be a good and probably easy starting point to let users debug their setups.

@hulkk

hulkk commented Jul 2, 2020

Currently it is impossible to troubleshoot log-payload-related errors.

I'm struggling with lots of the following events:

2020-07-02T16:56:21.210+03:00 WARN [GelfCodec] GELF message <redacted> (received from <loadbalancer-ip:26444>) is missing mandatory "host" field.

2020-07-02T16:56:20.549+03:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=redacted, journalOffset=41919444374, codec=gelf, payloadSize=637, timestamp=2020-07-02T13:56:20.549Z, remoteAddress=/loadbalancer-ip:28843}
java.lang.IllegalArgumentException: GELF message <redacted> (received from <loadbalancer-ip:28843>) has empty mandatory "short_message" field.
	at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:258) ~[graylog.jar:?]
	at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:140) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:150) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:91) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47) [graylog.jar:?]
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

I don't understand why the message GUID is logged, because you can't use the GUID anywhere. One should be able to configure what content gets logged when these errors occur.
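
For reference, a minimal GELF 1.1 payload that passes this validation looks roughly like the sketch below; host and short_message are exactly the mandatory fields the errors above complain about, and all values here are placeholders:

{
  "version": "1.1",
  "host": "example.org",
  "short_message": "request failed",
  "timestamp": 1593698180.0,
  "level": 3,
  "_app_name": "example-app"
}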

@jasonrberk

@apjm I guess you can only truncate the "entity" field in your installation. That's what I did with a "stacktrace" field that got too big. Just trim it down to a maximum of 32766 characters.

Do you have an example I could follow? Still having this issue in mid-2022.

@jasonrberk

I figured this out for anyone still fighting this issue in 2022 on GL4

[Screenshot: Screen Shot 2022-06-23 at 11 29 29 AM, showing the truncation workaround]

@alitoufighi

Hey @jasonrberk that's great. Your workaround for truncating the log reminded me of an alternative way:
You can use pipelines, especially the substring() function, to keep only the first 32766 characters. However, I had unsatisfactory results with this when my logs contained UTF-8 characters (the Elasticsearch limit is in bytes, not characters). This way, you can also debug() the complete message for further inspection, since such a large log can indicate that something is wrong.
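
A minimal sketch of that substring() approach (the field name "message" and the 32000 cutoff are only placeholders here):

rule "truncate-long-message"
when
    has_field("message")
then
    // keep roughly the first 32000 characters; the Elasticsearch limit (32766) is in bytes,
    // so multi-byte UTF-8 text may need an even smaller cutoff
    set_field("message", substring(to_string($message.message), 0, 32000));
    // optionally log the original for inspection, as mentioned above
    debug(to_string($message.message));
end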

@jasonrberk

@alitoufighi

here's a pipeline rule for "trimming" enormous stack traces:

rule "truncate-stacks"
when    
    has_field("StackTrace")
    AND regex(
        pattern: "^.{32000,}$", 
        value: replace(
            value: to_string($message.StackTrace), 
            search: "\n",
            replacement: "|"
        )
    ).matches == true
then
    let trimmed = abbreviate(value: to_string($message.StackTrace), width: 32000);
    set_field("StackTrace", trimmed);
    set_field("StackTraceTruncated", true);
    // added debug message to be notified about the trimmed stacktrace
    debug( concat("truncated oversized StackTrace from ", to_string($message.source)));
end

@boosty
Contributor

boosty commented Aug 4, 2022

We added a related feature in Graylog 4.2 (available in Graylog Operations and Graylog Cloud):

https://docs.graylog.org/docs/indexer-and-processing-failures

It allows for storing indexing and processing failure notifications in a dedicated Elasticsearch/OpenSearch failure index. With a dedicated stream for such failure notifications, these messages can be logged and aggregated meaningfully into a dashboard, used to set up alert notifications, and more. It is also possible to keep the full original message that failed.

@boosty boosty closed this as completed Aug 18, 2022