Cluster ends up in unstable state after heavy indexing activity but does not recover #12318

Closed
Srinathc opened this issue Jul 17, 2015 · 8 comments
Labels: :Core/Infra/Core, discuss

@Srinathc

Setup:
3 nodes of i2.xlarge on AWS (4 cores, 30 GB RAM with 15 GB for the Java heap, 800 GB SSD instance store)
Elasticsearch version 1.6.0
Indexing is done from 3 Storm worker nodes, each doing bulk imports into Elasticsearch every 2 seconds (around 2,000 documents per bulk call on average; a minimal sketch of this pattern is included at the end of this description)

State:
8 indices with 3 shards each and a replication factor of 2 (async replication)
400 million documents in the store, totaling ~80 GB of data

Tests:
Run 1: indexing documents at a consistent rate of 1k documents per second for over 8 hours
Observations:
- able to query aggregations without much latency
- cluster status remained green throughout
Run 2: indexing documents at a consistent rate of 2k documents per second for over 6 hours
Observations:
- able to query aggregations without much latency
- cluster status remained green throughout
Run 3: indexing documents at a consistent rate of 3.5k documents per second
Observations:
- after ~90 minutes, the first incident of cluster health moving to yellow
- traffic continued at varying indexing rates for the next 5 hours; the cluster state remained yellow, with nodes occasionally leaving and rejoining the cluster
- indexing was stopped 5 hours after the first yellow state
- the cluster throws IndexShardCreationException caused by LockObtainFailedException: Can't lock shard [qwer3483987][0], timed out after 5000ms

The cluster remained in this state for hours afterwards, sometimes showing 3 nodes and sometimes only 2, with the status fluctuating between yellow and red.

[2015-07-17 02:06:32,482][WARN ][indices.cluster          ] [met-ds-1-es-tune] [[qwer3483987][0]] marking and sending shard failed due to [failed to create shard]
org.elasticsearch.index.shard.IndexShardCreationException: [qwer3483987][0] failed to create shard
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:357)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [qwer3483987][0], timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:576)
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:504)
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:310)
        ... 9 more
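
For reference, a minimal standalone sketch of the indexing pattern described in the setup above, using the Python elasticsearch client. The real load is generated by Storm workers; the host, index name, and document contents below are placeholders, not the actual code.

```python
# Approximates the described load: one bulk request of ~2,000 documents every
# 2 seconds per worker. Host, index name, and document fields are placeholders.
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://node1:9200"])

def make_batch(size=2000):
    return [{"_index": "events", "_type": "event",
             "_source": {"value": i, "ts": time.time()}}
            for i in range(size)]

while True:
    helpers.bulk(es, make_batch())  # one bulk call of ~2,000 documents
    time.sleep(2)                   # issued roughly every 2 seconds
```
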
@clintongormley added the :Core/Infra/Core and discuss labels on Jul 17, 2015
@Srinathc

Update: the cluster recovered to a green state, with each node reporting a consistent state (_cluster/state), after around 12 hours. Will resume bulk indexing at a lower rate to understand this better.

Is there any other info that will be helpful in looking into and resolving this issue?
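
For reference, a minimal sketch of this kind of per-node check, using the Python elasticsearch client; the host names below are placeholders:

```python
# Query cluster health and the full cluster state from each node and compare,
# to confirm that all nodes have converged on the same state and master.
# Host names below are placeholders.
from elasticsearch import Elasticsearch

NODES = ["http://node1:9200", "http://node2:9200", "http://node3:9200"]

for host in NODES:
    es = Elasticsearch([host])
    health = es.cluster.health()
    state = es.cluster.state()
    print(host, health["status"], "nodes:", health["number_of_nodes"],
          "master:", state["master_node"], "state version:", state["version"])
```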

@clintongormley

Possibly related to #12011

@clintongormley

Hi @Srinathc

Can you tell us the reason that the cluster went yellow in the first place? What caused the shard failure?

Then could you also try to provide the info requested here please: #12011 (comment)

thanks

@ClauditaO

Hi,

Any ideas on why this is happening? I have a similar problem: after indexing 500M docs, my cluster will not recover any index.

recovered [0] indices into cluster_state

It was showing all the indices before the last indexing process.

Should I leave Elasticsearch running to see if it manages to recover the indices after a while?

Thank you

@Srinathc

@clintongormley not sure why it went yellow, but there was a good amount of GC activity happening. Will run the tests again and share the hot_threads output if this happens again.
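
For anyone collecting the same data, a minimal sketch of capturing hot_threads periodically with the Python elasticsearch client; the host and output file names below are placeholders:

```python
# Capture /_nodes/hot_threads at intervals while the indexing load runs, so
# the busiest threads (GC, merges, bulk indexing) are on record if the
# cluster degrades again. The host and output file names are placeholders.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node1:9200"])

for i in range(10):
    text = es.nodes.hot_threads(threads=5, interval="1s")  # returns plain text
    with open("hot_threads_%02d.txt" % i, "w") as f:
        f.write(text)
    time.sleep(60)
```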

@Srinathc

@clintongormley we encountered this issue again during our overnight test.
The hot_threads output is available here: https://gist.github.com/Srinathc/4726dfaa0e9ca314e666#file-hot_threads-txt.

Documents of around 2k were being indexed at a rate of 2,500 per second and things seemed really calm, with CPU at ~15%. But we increased the indexing rate to around 3,500-4,000 per second and found that the cluster was in this state in the morning.

Hope this helps to resolve the issue. Let me know if you need anything else.

@clintongormley

@martijnvg could you take a look at this please?

@clintongormley

This looks like it is related to this bug in Lucene: https://issues.apache.org/jira/browse/LUCENE-6670

Basically, an OOM exception is thrown on a merge thread, which means the shard lock is never released. This bug is fixed in Lucene 5.3 and backported to 2.0. For 1.x, there is not much we can do except advise you to keep heap usage as low as possible.
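
For anyone stuck on 1.x, a minimal sketch of watching per-node heap pressure with the Python elasticsearch client (the host below is a placeholder); the idea is simply to catch nodes approaching heap exhaustion before merges start failing:

```python
# Poll per-node JVM heap usage; sustained readings near the heap limit make
# the OOM-on-merge / never-released-shard-lock scenario described above much
# more likely on 1.x. The host below is a placeholder.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node1:9200"])

while True:
    stats = es.nodes.stats(metric="jvm")
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print(node["name"], "heap used: %d%%" % heap_pct)
    time.sleep(30)
```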
