Description
Setup:
3 i2.xlarge nodes on AWS (4 cores, 30 GB RAM with 15 GB for the Java heap, 800 GB SSD instance store)
Elasticsearch version 1.6.0
Indexing is done from 3 Storm worker nodes, each sending a bulk request to Elasticsearch every 2 seconds (around 2,000 documents per bulk call on average); see the sketch below
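For context, each Storm worker issues its bulk requests roughly along these lines. This is a minimal sketch against the Elasticsearch 1.x Java TransportClient; the cluster name, host, document body, and class name are illustrative placeholders, and the index name is taken from the log below.

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.support.replication.ReplicationType;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class BulkIndexerSketch {
    public static void main(String[] args) {
        // Connect to the cluster (cluster name and host are placeholders)
        Client client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "es-tune").build())
                .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300));

        // One bulk call of ~2,000 documents, sent every 2 seconds by each worker
        BulkRequestBuilder bulk = client.prepareBulk()
                .setReplicationType(ReplicationType.ASYNC); // async replication is set per request in ES 1.x
        for (int i = 0; i < 2000; i++) {
            bulk.add(client.prepareIndex("qwer3483987", "doc")
                    .setSource("{\"field\":\"value " + i + "\"}"));
        }

        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
        client.close();
    }
}
```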
State:
8 indices of 3 shards each, with an async replication factor of 2 (index settings sketched below)
400 million documents in the store, totaling ~80 GB of data
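For reference, each of the 8 indices carries 3 primary shards and 2 replicas; a minimal sketch of equivalent index creation via the ES 1.x Java client follows. The helper and class names are hypothetical, and the Client is assumed to be the one from the bulk-indexing sketch above.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class IndexSetupSketch {
    // Creates one index with 3 primary shards and 2 replicas (ES 1.x API)
    static void createIndex(Client client, String indexName) {
        client.admin().indices().prepareCreate(indexName)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_shards", 3)
                        .put("index.number_of_replicas", 2)
                        .build())
                .execute().actionGet();
    }
}
```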
Tests
Run 1: indexing documents at a consistent rate of 1k documents per second for over 8 hours
Observations:
- able to run aggregation queries without noticeable latency
- cluster status remained green throughout
Run 2: indexing documents at a consistent rate of 2k documents per second for over 6 hours
Observations:
- able to run aggregation queries without noticeable latency
- cluster status remained green throughout
Run 3: indexing documents at a consistent rate of 3.5k documents per second
Observations:
- after ~90 minutes, the first incident of cluster health moving to yellow
- traffic continued at varying indexing rates for the next 5 hours; the cluster state remained yellow, with nodes occasionally leaving and rejoining the cluster
- indexing was stopped 5 hours after the first yellow state
- the cluster throws IndexShardCreationException caused by LockObtainFailedException: Can't lock shard [qwer3483987][0], timed out after 5000ms
The cluster remained in this state for hours afterwards, sometimes showing 3 nodes and sometimes only 2, and its status kept fluctuating between yellow and red.
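The fluctuating status and node count can be reproduced by polling cluster health; a minimal sketch of such a check with the ES 1.x Java client is below, with the Client and the helper/class names assumed as in the sketches above.

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;

public class HealthCheckSketch {
    // Polls cluster status, node count, and unassigned shard count (ES 1.x API)
    static void printHealth(Client client) {
        ClusterHealthResponse health = client.admin().cluster().prepareHealth()
                .execute().actionGet();
        System.out.println("status=" + health.getStatus()
                + " nodes=" + health.getNumberOfNodes()
                + " unassignedShards=" + health.getUnassignedShards());
    }
}
```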
[2015-07-17 02:06:32,482][WARN ][indices.cluster ] [met-ds-1-es-tune] [[qwer3483987][0]] marking and sending shard failed due to [failed to create shard]
org.elasticsearch.index.shard.IndexShardCreationException: [qwer3483987][0] failed to create shard
at org.elasticsearch.index.IndexService.createShard(IndexService.java:357)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [qwer3483987][0], timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:576)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:504)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:310)
... 9 more