Cluster ends up in unstable state after heavy indexing activity but does not recover #12318
Comments
Update: the cluster recovered to green state, with each node reporting consistent state (`_cluster/state`), after around 12 hours. Will resume bulk indexing at a lower rate to understand this better. Is there any other info that would be helpful in looking into and resolving this issue?
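For reference, here is a minimal sketch of how "bulk indexing at a lower rate" could be enforced on the client side. This is not from the thread; `send_bulk` is a stand-in for whatever function performs the actual `_bulk` request, and the numbers are illustrative:

```python
import time

def bulk_interval(target_docs_per_sec, docs_per_bulk):
    """Seconds to wait between bulk calls so the average
    indexing rate stays at target_docs_per_sec."""
    return docs_per_bulk / float(target_docs_per_sec)

def paced_bulks(send_bulk, batches, target_docs_per_sec):
    """Send each batch, sleeping between calls to hold the target rate.
    send_bulk is a placeholder for the real _bulk request."""
    for batch in batches:
        start = time.time()
        send_bulk(batch)
        wait = bulk_interval(target_docs_per_sec, len(batch)) - (time.time() - start)
        if wait > 0:
            time.sleep(wait)
```

For example, a target of 2500 docs/sec with 2000-doc bulks works out to one bulk call every 0.8 seconds.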
Possibly related to #12011
Hi @Srinathc, can you tell us the reason the cluster went yellow in the first place? What caused the shard failure? Could you also try to provide the info requested here: #12011 (comment). Thanks
Hi, any ideas on why this is happening? I have a similar problem: after indexing 500M docs, my cluster will not recover any index (`recovered [0] indices into cluster_state`). It was showing all the indexes before the last indexing run. Should I leave Elasticsearch running to see if it manages to recover the indexes after a while? Thank you
@clintongormley not sure why it went yellow, but there was a good amount of GC activity happening. Will run the tests again and share the hot_threads output if this happens again.
@clintongormley we encountered this issue again during our overnight test. Documents of around 2k were being indexed at a rate of 2500 per second and things seemed really calm, with CPU at ~15%. But when we increased the indexing rate to around 3500-4000 per second, we found the cluster in this state in the morning. Hope this helps to resolve the issue. Let me know if you need anything else.
@martijnvg could you take a look at this please? |
This looks like it is related to this bug in Lucene: https://issues.apache.org/jira/browse/LUCENE-6670 Basically, an OOM exception is thrown on a merge thread, which means the lock is never released. This bug is fixed in Lucene 5.3 and backported to 2.0. For 1.x, there is not much we can do except advising you to keep heap usage as low as possible.
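Since the 1.x advice boils down to watching heap pressure, a small sketch of how that could be monitored from the `_nodes/stats` response (the `jvm.mem.heap_used_percent` field). The 75% threshold here is illustrative, not a value from this thread:

```python
def nodes_over_heap_threshold(stats, threshold_pct=75):
    """Given a parsed _nodes/stats JSON response, return the names of
    nodes whose JVM heap usage exceeds threshold_pct."""
    hot = []
    for node in stats.get("nodes", {}).values():
        pct = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if pct is not None and pct > threshold_pct:
            hot.append(node.get("name"))
    return hot
```

Polling this during bulk indexing and backing off the indexing rate when nodes cross the threshold may help avoid the OOM-on-merge situation entirely.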
Setup:
3 nodes of i2.xlarge on AWS (4 cores, 30 GB RAM with 15 GB for the Java heap, 800 GB SSD instance store)
Elasticsearch version 1.6.0
Indexing is done from 3 Storm worker nodes, each doing bulk imports into Elasticsearch every 2 seconds (averaging around 2000 documents per bulk call)
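To make the bulk import pattern above concrete, here is a minimal sketch of building the NDJSON body the `_bulk` API expects: one action line plus one source line per document, with a mandatory trailing newline. The index name and document shape are made up for illustration (`_type` is included because ES 1.x bulk actions require it):

```python
import json

def bulk_payload(index, doc_type, docs):
    """Build an NDJSON body for the Elasticsearch _bulk API:
    an {"index": ...} action line followed by the document source,
    repeated per document, ending with a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

A 2000-document batch therefore produces a 4000-line body, which is POSTed to `/_bulk` in a single request.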
State:
8 indices of 3 shards each, with an async replication factor of 2
400 million records in the store, totaling ~80 GB of data
Tests:
Run 1: indexing documents with a consistent rate of 1k documents per second for over 8 hours
Observations:
- able to query aggregations without much latencies
- cluster status remained green all through
Run 2: indexing documents with a consistent rate of 2k documents per second for over 6 hours
Observations:
- able to query aggregations without much latencies
- cluster status remained green all through
Run 3: indexing documents with a consistent rate of 3.5k documents per second
Observations:
- after ~90 minutes, the first incident of cluster health moving to yellow
- traffic continued at varying indexing rates for the next 5 hours; the cluster state remained yellow, with nodes occasionally leaving and rejoining the cluster
- indexing was stopped 5 hours after the first yellow state
- the cluster throws IndexShardCreationException, caused by LockObtainFailedException: Can't lock shard [qwer3483987][0], timed out after 5000ms
The cluster remained in this state for hours afterwards, sometimes showing 3 nodes and sometimes only 2, with the state fluctuating between yellow and red.