Elasticsearch Red cluster state triggered by index rotation under some conditions. #2429

Closed
jehuty0shift opened this Issue Jun 28, 2016 · 2 comments


jehuty0shift commented Jun 28, 2016

Problem description

When a node of the Elasticsearch cluster is down, a simple index rotation can put the cluster into a red state.

  • One Elasticsearch node is down (for scheduled maintenance, for example), so the cluster turns yellow. Shard reallocation is delayed (with index.unassigned.node_left.delayed_timeout) since the node is expected to return.
  • The index is rotated, which deletes an older index.
  • The node comes back and imports the deleted index, which is now a dangling index.
  • The cluster turns red because it cannot recover the shards of that index that have been deleted.
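
Why a single re-imported index is enough to turn the whole cluster red can be shown with a simplified model of how Elasticsearch derives health: an index is red if any primary shard is unassigned, yellow if only replicas are missing, and the cluster takes the worst status of its indices. The sketch below is illustrative only. The `Shard` class, the helper functions and the index names are hypothetical and are not part of Elasticsearch or Graylog.

```python
from dataclasses import dataclass

# Ordering used to pick the worst status.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


@dataclass
class Shard:
    primary_assigned: bool   # is a primary copy allocated somewhere?
    replicas_assigned: bool  # are all replica copies allocated?


def index_health(shards):
    """Red if any primary is missing, yellow if only replicas are missing."""
    if any(not s.primary_assigned for s in shards):
        return "red"
    if any(not s.replicas_assigned for s in shards):
        return "yellow"
    return "green"


def cluster_health(indices):
    """The cluster status is the worst status among its indices."""
    statuses = (index_health(shards) for shards in indices.values())
    return max(statuses, key=SEVERITY.__getitem__, default="green")


# Hypothetical state after the node returns: the current write index is fully
# allocated, but the dangling index only has the shard copies that the
# returning node still held on disk.
indices = {
    "graylog_5": [Shard(True, True) for _ in range(4)],
    "graylog_0": [
        Shard(True, False),
        Shard(True, False),
        Shard(False, False),  # copy deleted during rotation, nothing to recover
        Shard(False, False),
    ],
}

print(index_health(indices["graylog_5"]))
print(index_health(indices["graylog_0"]))
print(cluster_health(indices))
```

In this model the write index stays green while the cluster as a whole reports red, which is the situation described above.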

Steps to reproduce the problem

To reproduce this problem, do the following:

  • Delay reallocation of shards by applying this index setting to the oldest index:

{ "settings" : {"index.unassigned.node_left.delayed_timeout" : "15m" } }

  • Stop one ES node that holds some shards of this index.
  • Cycle the index. This deletes the oldest index if the maximum number of indices has been reached.
  • Restart the ES node. It imports its copy of the deleted index, turning the cluster red.
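
The first step above can be scripted against the Elasticsearch index settings endpoint (`PUT /<index>/_settings`). The following is a minimal sketch that only builds the request so it can be inspected before sending. The host `http://localhost:9200`, the index name `graylog_0` and the function name are assumptions for illustration and should be adapted to the cluster under test.

```python
import json
import urllib.request


def build_delayed_timeout_request(base_url, index, timeout="15m"):
    """Build (but do not send) a PUT <index>/_settings request."""
    body = {
        "settings": {"index.unassigned.node_left.delayed_timeout": timeout}
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical host and index name; adjust both for a real cluster.
    req = build_delayed_timeout_request("http://localhost:9200", "graylog_0")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the setting: urllib.request.urlopen(req)
```

Keeping the request construction separate from sending it makes it easy to check the URL and payload first, since the setting changes allocation behaviour on a live cluster.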

Environment

  • Graylog Version: 2.0.3
  • Elasticsearch Version: 2.3.3
  • MongoDB Version: 3.2.7
  • Operating System: Ubuntu 14.04
  • Browser version: Chromium Version 50.0.2661.102 Ubuntu 15.10 (64-bit)

Additional Information

  • The setting index.unassigned.node_left.delayed_timeout is very useful when you want to do maintenance on some nodes or in case of a temporary failure (which will happen when you have a big ES cluster). It prevents unnecessary reallocation of replicas when you know that your node will come back. If I recall correctly, the default value in ES 2.x is now 1 minute.
  • The real problem is that Elastic removed one setting in ES 2.0:
    gateway.local.auto_import_dangled
    Now all dangling indices are forcibly imported (see elastic/elasticsearch#10016).
  • The issue could be mitigated by implementing the feature requested in #2371 and by alerting the administrators to the real state of the cluster.

@kroepke kroepke self-assigned this Jul 1, 2016

@jalogisch jalogisch added S2 P3 bug labels Jul 4, 2016

@kroepke kroepke added this to the 2.1.0 milestone Jul 4, 2016

kroepke commented Jul 5, 2016

This sounds very reasonable to me, and at first glance it looks relatively easy to implement.
Scheduling for 2.1.

@bernd bernd self-assigned this Jul 5, 2016

bernd added a commit that referenced this issue Jul 13, 2016

Allow indexing when cluster state is red but write index state green/yellow

The cluster state can turn red even though the current write index is
green. This might happen when running rotation/retention during an ES
maintenance window where not all nodes are present.

For index related tasks we are now checking the state of the current
write index (deflector) instead of the state for all Graylog managed
indices.

The logs and UI are still showing the red cluster state to make sure the
admin will be notified.

Refs #2429
Fixes #2371
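
The decision described in this commit message can be sketched roughly as follows. This is not Graylog's actual implementation (Graylog is written in Java); the function names and the returned dictionary are hypothetical and only illustrate the idea of gating index-related tasks on the write index while still surfacing the cluster state.

```python
def indexing_allowed(write_index_health):
    """Index-related tasks depend only on the current write index (deflector)."""
    return write_index_health in ("green", "yellow")


def health_report(cluster_health, write_index_health):
    """Combine both views: keep indexing if possible, still flag a red cluster."""
    return {
        "cluster_health": cluster_health,
        "write_index_health": write_index_health,
        "indexing_allowed": indexing_allowed(write_index_health),
        # A red cluster is still reported so the admin is notified.
        "notify_admin": cluster_health == "red",
    }


# The scenario from this issue: the cluster is red because of the re-imported
# old index, but the current write index is healthy.
report = health_report(cluster_health="red", write_index_health="green")
print(report)
```

With these inputs the sketch keeps indexing enabled while still marking the cluster state for notification.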

joschi added a commit that referenced this issue Jul 14, 2016

Allow indexing when cluster health state is RED but write-active index is healthy (#2477)

The cluster state can turn red even though the current write index is
green. This might happen when running rotation/retention during an ES
maintenance window where not all nodes are present.

For index related tasks we are now checking the state of the current
write index (deflector) instead of the state for all Graylog managed
indices.

The logs and UI are still showing the red cluster state to make sure the
admin will be notified.

Refs #2429
Fixes #2371

bernd commented Jul 14, 2016

This has been implemented in #2477 and will be part of the upcoming Graylog 2.1. Thank you for the report!

@bernd bernd closed this Jul 14, 2016

@kroepke kroepke added the triaged label Sep 21, 2016
