New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Every odd index rolled too early #2194

Closed
hc4 opened this Issue May 6, 2016 · 11 comments

Comments

Projects
None yet
2 participants
@hc4
Contributor

hc4 commented May 6, 2016

Problem description

Every second index stays unfilled to configured condition.
image
image

Environment

  • Graylog Version: 2.0 GA
  • Elasticsearch Version: 2.3.1
@kroepke

This comment has been minimized.

Member

kroepke commented May 9, 2016

Could you append your Graylog server log as well as the corresponding elasticsearch log at the time when an index is wrongly cycled, please?

@kroepke kroepke self-assigned this May 9, 2016

@hc4

This comment has been minimized.

Contributor

hc4 commented May 10, 2016

81 indx is wrongly cycled:
image

There is roll from 80 to 81:

2016-05-10T08:23:12.458+03:00 INFO  [AbstractRotationStrategy] Deflector index <graylog2_80> should be rotated, Pointing deflector to new index now!
2016-05-10T08:23:12.458+03:00 INFO  [Deflector] Cycling deflector to next index now.
2016-05-10T08:23:12.499+03:00 INFO  [Deflector] Cycling from <graylog2_80> to <graylog2_81>
2016-05-10T08:23:12.499+03:00 INFO  [Deflector] Creating index target <graylog2_81>...
2016-05-10T08:23:12.531+03:00 INFO  [Indices] Created Graylog index template "graylog-internal" in Elasticsearch.
2016-05-10T08:23:12.758+03:00 INFO  [Deflector] Waiting for index allocation of <graylog2_81>
2016-05-10T08:23:12.899+03:00 INFO  [Deflector] Done!
2016-05-10T08:23:12.899+03:00 INFO  [Deflector] Pointing deflector to new target index....
2016-05-10T08:23:12.946+03:00 INFO  [SystemJobManager] Submitted SystemJob <4a8f1215-166f-11e6-848a-005056b37533> [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob]
2016-05-10T08:23:12.946+03:00 INFO  [SystemJobManager] Submitted SystemJob <4a8f3922-166f-11e6-848a-005056b37533> [org.graylog2.indexer.SetIndexReadOnlyJob]
2016-05-10T08:23:12.946+03:00 INFO  [SystemJobManager] Submitted SystemJob <4a8f3923-166f-11e6-848a-005056b37533> [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob]
2016-05-10T08:23:12.946+03:00 INFO  [Deflector] Done!
2016-05-10T08:23:12.946+03:00 INFO  [CreateNewSingleIndexRangeJob] Calculating ranges for index graylog2_80.
2016-05-10T08:23:12.946+03:00 INFO  [CreateNewSingleIndexRangeJob] Calculating ranges for index graylog2_81.
2016-05-10T08:23:12.952+03:00 INFO  [MongoIndexRangeService] Calculated range of [graylog2_81] in [2ms].
2016-05-10T08:23:12.952+03:00 INFO  [CreateNewSingleIndexRangeJob] Created ranges for index graylog2_81.

And roll from 81 to 82 (again checks condition for 80).

2016-05-10T09:35:02.797+03:00 INFO  [AbstractRotationStrategy] Deflector index <graylog2_80> should be rotated, Pointing deflector to new index now!
2016-05-10T09:35:02.797+03:00 INFO  [Deflector] Cycling deflector to next index now.
2016-05-10T09:35:02.968+03:00 INFO  [Deflector] Cycling from <graylog2_81> to <graylog2_82>
2016-05-10T09:35:02.968+03:00 INFO  [Deflector] Creating index target <graylog2_82>...
2016-05-10T09:35:03.142+03:00 INFO  [Indices] Created Graylog index template "graylog-internal" in Elasticsearch.
2016-05-10T09:35:03.306+03:00 INFO  [Deflector] Waiting for index allocation of <graylog2_82>
2016-05-10T09:35:03.389+03:00 INFO  [Deflector] Done!
2016-05-10T09:35:03.389+03:00 INFO  [Deflector] Pointing deflector to new target index....
2016-05-10T09:35:03.537+03:00 INFO  [SystemJobManager] Submitted SystemJob <53df3610-1679-11e6-848a-005056b37533> [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob]
2016-05-10T09:35:03.537+03:00 INFO  [CreateNewSingleIndexRangeJob] Calculating ranges for index graylog2_81.
2016-05-10T09:35:03.537+03:00 INFO  [SystemJobManager] Submitted SystemJob <53df3611-1679-11e6-848a-005056b37533> [org.graylog2.indexer.SetIndexReadOnlyJob]
2016-05-10T09:35:03.537+03:00 INFO  [SystemJobManager] Submitted SystemJob <53df3612-1679-11e6-848a-005056b37533> [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob]
2016-05-10T09:35:03.537+03:00 INFO  [Deflector] Done!
2016-05-10T09:35:03.538+03:00 INFO  [CreateNewSingleIndexRangeJob] Calculating ranges for index graylog2_82.
2016-05-10T09:35:03.547+03:00 INFO  [MongoIndexRangeService] Calculated range of [graylog2_82] in [3ms].
2016-05-10T09:35:03.548+03:00 INFO  [CreateNewSingleIndexRangeJob] Created ranges for index graylog2_82.
2016-05-10T09:35:03.548+03:00 INFO  [SystemJobManager] SystemJob <53df3612-1679-11e6-848a-005056b37533> [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob] finished in 10ms.
2016-05-10T09:35:09.707+03:00 INFO  [MongoIndexRangeService] Calculated range of [graylog2_81] in [6164ms].
2016-05-10T09:35:09.708+03:00 INFO  [CreateNewSingleIndexRangeJob] Created ranges for index graylog2_81.
2016-05-10T09:35:09.708+03:00 INFO  [SystemJobManager] SystemJob <53df3610-1679-11e6-848a-005056b37533> [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob] finished in 6170ms.
2016-05-10T09:35:33.538+03:00 INFO  [SetIndexReadOnlyJob] Flushing old index <graylog2_81>.
2016-05-10T09:35:35.302+03:00 INFO  [SetIndexReadOnlyJob] Setting old index <graylog2_81> to read-only.
2016-05-10T09:35:35.444+03:00 INFO  [SystemJobManager] Submitted SystemJob <66e3d543-1679-11e6-848a-005056b37533> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob]
2016-05-10T09:35:35.444+03:00 INFO  [SystemJobManager] SystemJob <53df3611-1679-11e6-848a-005056b37533> [org.graylog2.indexer.SetIndexReadOnlyJob] finished in 1906ms.
2016-05-10T09:35:35.444+03:00 INFO  [OptimizeIndexJob] Optimizing index <graylog2_81>.
@kroepke

This comment has been minimized.

Member

kroepke commented May 10, 2016

Thanks!

It looks like the code wrongly sees an older index as the current write index and thus wrongly thinks it should rotate. From re-reading the code there's nothing that's obviously wrong, though.
The only guess we currently have is that somehow an incomplete list of indices comes back. Does your elasticsearch cluster show very high load during those times?
Are there any messages in the elasticsearch logs?

Thanks!

@hc4

This comment has been minimized.

Contributor

hc4 commented May 10, 2016

Sometimes I have such errors:

2016-05-10T09:46:42.348+03:00 ERROR [MessageCountRotationStrategy] Unknown index, cannot perform rotation
org.graylog2.indexer.IndexNotFoundException
    at org.graylog2.indexer.indices.Indices.numberOfMessages(Indices.java:170) ~[graylog.jar:?]
    at org.graylog2.indexer.rotation.strategies.MessageCountRotationStrategy.shouldRotate(MessageCountRotationStrategy.java:67) [graylog.jar:?]
    at org.graylog2.indexer.rotation.strategies.MessageCountRotationStrategy.shouldRotate(MessageCountRotationStrategy.java:33) [graylog.jar:?]
    at org.graylog2.indexer.rotation.strategies.AbstractRotationStrategy.rotate(AbstractRotationStrategy.java:55) [graylog.jar:?]
    at org.graylog2.periodical.IndexRotationThread.checkForRotation(IndexRotationThread.java:109) [graylog.jar:?]
    at org.graylog2.periodical.IndexRotationThread.doRun(IndexRotationThread.java:73) [graylog.jar:?]
    at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:77) [graylog.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_74]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_74]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_74]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_74]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_74]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_74]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_74]
2016-05-10T09:46:42.348+03:00 ERROR [AbstractRotationStrategy] Cannot perform rotation at this moment.
@hc4

This comment has been minimized.

Contributor

hc4 commented May 10, 2016

And yes, my ES node sometimes could be overloaded.

@kroepke

This comment has been minimized.

Member

kroepke commented May 10, 2016

Ok, then it seems like we are missing a check if some shards failed to report back.
We are looking into making this more robust for the next bugfix release.

Thanks

@kroepke kroepke added bug S3 P3 and removed cannot reproduce labels May 10, 2016

@kroepke kroepke added this to the 2.0.1 milestone May 10, 2016

kroepke added a commit that referenced this issue May 10, 2016

Getting index statistics could contain failed shards
During rotation and retention we requested all index statistics multiple times, which, in a overloaded cluster, could lead to shard failures due to timeouts.
This failure wasn't logged and could lead to using the wrong (older) index to base rotation decisions on, effectively rotating indices too early.

This change makes Graylog use a more lightweight API to determine all index names including their aliases, reducing the usage of the expensive Index Statistics to the indices page only.
The current rotation and retention strategies do not need to know all index statistics which require to touch every single shard in the cluster.

fixes #2194
@hc4

This comment has been minimized.

Contributor

hc4 commented May 10, 2016

Also maybe connected to this issue. I found such message in logs:
2016-05-10T14:16:32.011+03:00 WARN [IndexRotationThread] Deflector is pointing to [graylog2_82], not the newest one: [graylog2_81]. Re-pointing.

It seems it incorrectly repointing, beause index 82 actualy exists.

@kroepke

This comment has been minimized.

Member

kroepke commented May 10, 2016

Yup, that appears to be the same symptom. Getting the index stats is expensive because it needs to look at every shard in the cluster, but we don't actually need that level of detail.
The patch above changes the code to a more lightweight call that returns the relevant information only.

@hc4

This comment has been minimized.

Contributor

hc4 commented May 10, 2016

But what if new lightweight call will also fail? Will problem remain?

@kroepke

This comment has been minimized.

Member

kroepke commented May 10, 2016

The problem was that the index stats call could return incomplete information and didn't actually fail completely. The new call will either fail completely or return the entire list of indices, so it should fix this problem.

The underlying problem of course is a cluster that is too slow or otherwise broken (not accessible shards).
The change also logs a warning if not all index stats are returned from elasticsearch, but that is only used to display the System/Indices page now.

@hc4

This comment has been minimized.

Contributor

hc4 commented May 10, 2016

Thanks, I see now.

bernd added a commit that referenced this issue May 11, 2016

Use a more lightweight API to get all index names and aliases (#2210)
During rotation and retention we requested all index statistics multiple times, which,
in a overloaded cluster, could lead to shard failures due to timeouts.
This failure wasn't logged and could lead to using the wrong (older) index to base
rotation decisions on, effectively rotating indices too early.

This change makes Graylog use a more lightweight API to determine all index
names including their aliases, reducing the usage of the expensive Index Statistics
to the indices page only.
The current rotation and retention strategies do not need to know all index statistics
which require to touch every single shard in the cluster.

Fixes #2194

@bernd bernd closed this in d617434 May 11, 2016

@kroepke kroepke added the triaged label Sep 21, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment