Cluster ends up in unstable state after heavy indexing activity but does not recover #12318

Closed
Srinathc opened this issue Jul 17, 2015 · 8 comments
Labels: :Core/Infra/Core, discuss

@Srinathc

Setup:
3 nodes of i2.xlarge on AWS (4 cores, 30 GB RAM with 15 GB for the Java heap, 800 GB SSD instance store)
Elasticsearch version 1.6.0
Indexing is done from 3 Storm worker nodes, each doing bulk imports into Elasticsearch every 2 seconds (around 2,000 documents per bulk call on average; a minimal sketch of this pattern is included at the end of this description)

State:
8 indices with 3 shards each and a replication factor of 2 (async replication)
400 million documents in the store, totaling ~80 GB of data

Tests:
Run 1: indexing documents at a consistent rate of 1k documents per second for over 8 hours
Observations:
- able to query aggregations without much latency
- cluster status remained green throughout
Run 2: indexing documents at a consistent rate of 2k documents per second for over 6 hours
Observations:
- able to query aggregations without much latency
- cluster status remained green throughout
Run 3: indexing documents at a consistent rate of 3.5k documents per second
Observations:
- after ~90 minutes, the first incident of cluster health moving to yellow
- traffic continued at varying indexing rates for the next 5 hours; the cluster state remained yellow, with nodes occasionally leaving and rejoining the cluster
- indexing was stopped 5 hours after the first yellow state
- the cluster throws IndexShardCreationException caused by LockObtainFailedException: Can't lock shard [qwer3483987][0], timed out after 5000ms

The cluster remained in this state for hours afterwards, sometimes showing 3 nodes and sometimes only 2, with the status fluctuating between yellow and red.

[2015-07-17 02:06:32,482][WARN ][indices.cluster          ] [met-ds-1-es-tune] [[qwer3483987][0]] marking and sending shard failed due to [failed to create shard]
org.elasticsearch.index.shard.IndexShardCreationException: [qwer3483987][0] failed to create shard
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:357)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [qwer3483987][0], timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:576)
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:504)
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:310)
        ... 9 more
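
For reference, a minimal standalone sketch of the indexing pattern described in the setup above, using the Python elasticsearch client. The real load is generated by Storm workers; the host, index name, and document contents below are placeholders, not the actual code.

```python
# Approximates the described load: one bulk request of ~2,000 documents every
# 2 seconds per worker. Host, index name, and document fields are placeholders.
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://node1:9200"])

def make_batch(size=2000):
    return [{"_index": "events", "_type": "event",
             "_source": {"value": i, "ts": time.time()}}
            for i in range(size)]

while True:
    helpers.bulk(es, make_batch())  # one bulk call of ~2,000 documents
    time.sleep(2)                   # issued roughly every 2 seconds
```
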
@clintongormley added the :Core/Infra/Core and discuss labels on Jul 17, 2015
@Srinathc

Update: the cluster recovered to a green state, with each node reporting a consistent state (_cluster/state), after around 12 hours. Will resume bulk indexing at a lower rate to understand this better.

Is there any other info that will be helpful in looking into and resolving this issue?
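
For reference, a minimal sketch of this kind of per-node check, using the Python elasticsearch client; the host names below are placeholders:

```python
# Query cluster health and the full cluster state from each node and compare,
# to confirm that all nodes have converged on the same state and master.
# Host names below are placeholders.
from elasticsearch import Elasticsearch

NODES = ["http://node1:9200", "http://node2:9200", "http://node3:9200"]

for host in NODES:
    es = Elasticsearch([host])
    health = es.cluster.health()
    state = es.cluster.state()
    print(host, health["status"], "nodes:", health["number_of_nodes"],
          "master:", state["master_node"], "state version:", state["version"])
```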

@clintongormley

Possibly related to #12011

@clintongormley

Hi @Srinathc

Can you tell us the reason that the cluster went yellow in the first place? What caused the shard failure?

Then could you also try to provide the info requested here please: #12011 (comment)

thanks

@ClauditaO

Hi,

Any ideas on why this is happening? I have a similar problem: after indexing 500M docs, my cluster will not recover any index.

recovered [0] indices into cluster_state

It was showing all the indices before the last indexing process.

Should I leave Elasticsearch running to see if it manages to recover the indices after a while?

Thank you

@Srinathc

@clintongormley not sure why it went yellow, but there was a good amount of GC activity happening. Will run the tests again and share the hot_threads output if this happens again.
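
For anyone collecting the same data, a minimal sketch of capturing hot_threads periodically with the Python elasticsearch client; the host and output file names below are placeholders:

```python
# Capture /_nodes/hot_threads at intervals while the indexing load runs, so
# the busiest threads (GC, merges, bulk indexing) are on record if the
# cluster degrades again. The host and output file names are placeholders.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node1:9200"])

for i in range(10):
    text = es.nodes.hot_threads(threads=5, interval="1s")  # returns plain text
    with open("hot_threads_%02d.txt" % i, "w") as f:
        f.write(text)
    time.sleep(60)
```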

@Srinathc

@clintongormley we encountered this issue again during our overnight test.
The hot_threads output is available here: https://gist.github.com/Srinathc/4726dfaa0e9ca314e666#file-hot_threads-txt.

Documents of around 2k were being indexed at a rate of 2,500 per second and things seemed really calm, with CPU at ~15%. But we increased the indexing rate to around 3,500-4,000 per second and found that the cluster was in this state in the morning.

Hope this helps to resolve the issue. Let me know if you need anything else.

@clintongormley

@martijnvg could you take a look at this please?

@clintongormley

This looks like it is related to this bug in Lucene: https://issues.apache.org/jira/browse/LUCENE-6670

Basically, an OOM exception is thrown on a merge thread, which means the shard lock is never released. This bug is fixed in Lucene 5.3 and backported to 2.0. For 1.x, there is not much we can do except advise you to keep heap usage as low as possible.
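
For anyone stuck on 1.x, a minimal sketch of watching per-node heap pressure with the Python elasticsearch client (the host below is a placeholder); the idea is simply to catch nodes approaching heap exhaustion before merges start failing:

```python
# Poll per-node JVM heap usage; sustained readings near the heap limit make
# the OOM-on-merge / never-released-shard-lock scenario described above much
# more likely on 1.x. The host below is a placeholder.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node1:9200"])

while True:
    stats = es.nodes.stats(metric="jvm")
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print(node["name"], "heap used: %d%%" % heap_pct)
    time.sleep(30)
```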
