Indexing failures not reported #2633

Closed
JulioQc opened this Issue Aug 8, 2016 · 6 comments

@JulioQc

JulioQc commented Aug 8, 2016

Expected Behavior

Indexing errors are reported and somehow handled.

Current Behavior

Even though the current Graylog logs are filled with MapperParsingException errors, the "Indexer failures" dashboard under "Overview" isn't reporting any failures.
Logs: http://imgur.com/N1jIeKc
No error: http://imgur.com/Ikpyko9
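
For what it's worth, a minimal sketch of how the recorded failures could be cross-checked through the REST API instead of the dashboard; the host, credentials and the /system/indexer/failures endpoints below are assumptions on my side and may differ on other installs:

    # Cross-check recorded indexer failures via the Graylog REST API.
    # Host, credentials and endpoint paths are assumptions; adjust for your setup.
    import base64
    import json
    import urllib.request

    GRAYLOG_API = "http://graylog.example.org:12900"  # hypothetical API address
    USER, PASSWORD = "admin", "admin"                 # hypothetical credentials

    def api_get(path):
        req = urllib.request.Request(GRAYLOG_API + path)
        token = base64.b64encode("{0}:{1}".format(USER, PASSWORD).encode()).decode()
        req.add_header("Authorization", "Basic " + token)
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode())

    # Assumed endpoints: total failure count since a date, and the most recent entries.
    print(api_get("/system/indexer/failures/count?since=2016-08-01T00:00:00.000Z"))
    print(api_get("/system/indexer/failures?limit=10&offset=0"))

If these return nothing while the Elasticsearch log is full of MapperParsingException entries, the failures are not being recorded at all, which matches what the dashboard shows.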

Possible Solution

No idea.

Steps to Reproduce (for bugs)

Can't really say; I just noticed something was wrong on Monday morning.
The root cause was a poorly made input extractor, and things got out of control really quickly.
I had to remove the extractor, cycle the deflector, manually clear the disk journal and some internal logs, and reboot the box.
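
For anyone who ends up in the same state: I removed the extractor and cycled the deflector through the web UI; the rest was done on the box itself. A rough sketch of that clean-up follows, where the journal path and the graylog-ctl commands are assumptions from my OVA setup, so double-check them before running anything:

    # Rough clean-up sketch for the OVA box; journal path and graylog-ctl commands
    # are assumptions from my setup. Stop the server before touching the journal.
    import shutil
    import subprocess

    JOURNAL_DIR = "/var/opt/graylog/data/journal"   # assumed journal location on the OVA

    subprocess.check_call(["graylog-ctl", "stop"])  # assumed OVA control command
    shutil.rmtree(JOURNAL_DIR)                      # drop the filled-up disk journal
    subprocess.check_call(["graylog-ctl", "start"])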

Context

  • Message processing stopped
  • Disk journal filled to max
  • Input/output buffers filled to max
  • Some logs (/var/log/graylog/elasticsearch/graylog.log.*) grew exponentially (8 GB+ each although configured for rotation at 200 MB...), eating up disk space and making the problem even worse by filling the disk to max
  • No warning or display of indexing failures
  • The only warning came when the disk was close to full

Your Environment

  • Graylog Version: 2.0.3 (OVA image)
  • Elasticsearch Version:
  • MongoDB Version:
  • Operating System:
  • Browser version:
@tommymonk

tommymonk commented Aug 8, 2016

I believe that we encountered the same issue last week.
Our symptom was pipeline processing coming to a halt; a thread dump showed the Elasticsearch bulk index thread blocked.

It turned out that one of the Elasticsearch nodes was misconfigured and had thrown an OOM exception in response to a heavy query. This caused the bulk index request to fail.

The failure of the bulk index seems to have permanently blocked processing.

@JulioQc

JulioQc commented Aug 8, 2016

Where did you get this dump? Maybe I can check on my side.

@tommymonk

tommymonk commented Aug 8, 2016

Whilst Graylog was 'stuck' processing messages, I used the "Actions > Get Thread Dump" feature on the "System > System/Nodes > Details" page.

The snippet I linked to was the interesting thread from the full dump that I retained.

@JulioQc

JulioQc commented Aug 8, 2016

OK, thanks, I got it :)
It's filled with info, so I'm not sure what to make of it... I will likely need to read up on the matter.

@joschi joschi added the bug label Aug 9, 2016

@joschi joschi added this to the 2.1.0 milestone Aug 9, 2016

@joschi joschi self-assigned this Aug 9, 2016

joschi added a commit that referenced this issue Aug 9, 2016

@bernd bernd closed this in #2644 Aug 10, 2016

bernd added a commit that referenced this issue Aug 10, 2016

@bernd

Member

bernd commented Aug 10, 2016

@JulioQc This will be fixed in the upcoming 2.1 release. Thank you for the report!

@JulioQc

JulioQc commented Aug 10, 2016

Always a pleasure to help :)
