Failure to recover shards after disk is full #15333
Comments
Maybe you can tell us what prevented you from starting up again?
There were 15 indices on the node. Of those 15, 9 indices had issues with their shards and the ES status was red. Issuing curl -XGET http://localhost:9200/_cat/shards listed 52 shards as UNASSIGNED and 4 shards as INITIALIZING. I issued a reroute command (localhost:9200/_cluster/reroute) to force allocation of the UNASSIGNED shards. However, the shards that were in INITIALIZING status stayed there. CPU usage was 100% (all 8 cores busy) for more than 4 hours before I gave up and started deleting all the indices that were causing the problem. The data was about 50 GB and 6 million records. Even issuing systemctl stop elasticsearch.service took forever. Is this what you were looking for? If not, let me know what you need and I will reply ASAP.
There are lots of open questions: do you have some logs telling why the shards were unassigned? Did you just upgrade? Why did you force them to allocate? Did you run into any disk space issues?
Yes, the disk got full, and after that the issue started happening.
Here's the log around that time.
OK, so here the disk is full. What happened in the logs after you cleared out space on the disk?
Here's the log when I tried to start the ES after shutting it down:
In your last log I can't see exceptions that indicate that recovery failed. Does the cluster come back up, or does it get stuck in recoveries?
Here's the log where there's an unhandled exception.
Also, the cluster was stuck in recovery for more than 4 hours before I gave up and started removing the indices causing problems. Basically, some shards were in UNASSIGNED state and some were in INITIALIZING state. I can upload the whole log file, which is about 42 MB (the entire day), if that would help.
Please do. I can't see why this is happening, so the logs would be awesome.
dc20151208 is from when the disk was not yet full but about to get full.
The interesting exceptions are here:
I still need to investigate what's going on, but can you tell me what system you are running this on? Is this a local machine or a cloud machine? I am also curious what filesystem you are using.
It's a standalone node.
Alright, I think I found the issue here. @kpcool, your logfiles brought the conclusion — thank you very much. This is actually a serious issue with our transaction log, which basically corrupts itself when you hit a disk-full exception. I will keep you posted on this issue. Thanks for bearing with me and helping to figure this out.
What happens here is that when we hit a disk-full exception while we are flushing the transaction log, we might be able to write a portion of the data, but we will try to flush the entire data block over and over again. Yet, in most of the scenarios the disk-full happens during a merge, and that merge will fail and release disk space. Once that is done we might be able to flush the translog again, but we already wrote big chunks of data to disk which are now 1. corrupted and 2. treated as non-existent, since our internal offsets haven't advanced.
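The failure mode is easier to see in a toy model. The sketch below is purely illustrative Python, not Elasticsearch code: the `NaiveTranslogWriter` class, the `BytesIO` "disk", and the capacity numbers are all invented. It shows how retrying the entire in-memory buffer after a partial write leaves stray bytes in the middle of the log once disk space is freed:

```python
import io

class NaiveTranslogWriter:
    """Toy model of the buggy pattern: on a failed flush, the whole
    in-memory buffer is retried later, even though part of it may
    already have reached disk."""

    def __init__(self, disk: io.BytesIO, disk_capacity: int):
        self.disk = disk
        self.disk_capacity = disk_capacity  # simulates a finite disk
        self.buffer = b""

    def add(self, op: bytes):
        self.buffer += op

    def flush(self):
        free = self.disk_capacity - len(self.disk.getvalue())
        if free < len(self.buffer):
            # The OS may persist a partial write before failing...
            self.disk.write(self.buffer[:free])
            # ...but the writer keeps the WHOLE buffer for the retry.
            raise IOError("disk full")
        self.disk.write(self.buffer)
        self.buffer = b""

disk = io.BytesIO()
w = NaiveTranslogWriter(disk, disk_capacity=12)
w.add(b"op1;op2;")
w.flush()                    # fits: disk now holds b"op1;op2;"
w.add(b"op3;op4;")
try:
    w.flush()                # only b"op3;" fits before the failure
except IOError:
    pass
w.disk_capacity = 100        # a failed merge frees disk space
w.flush()                    # retries the ENTIRE buffer

# The partial write left a stray b"op3;" in the middle of the log:
assert disk.getvalue() == b"op1;op2;op3;op3;op4;"
```

In the model, the reader's offsets account only for the logical operations, so the duplicated bytes show up as garbage — the same shape of corruption described above.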
Today we are super lenient (how could I have missed that, for f**k's sake) with failing / closing the translog writer when we hit an exception. It's actually worse: we allow further writes to it and don't care what has already been written to disk and what hasn't. We keep the buffer in memory and try to write it again on the next operation. When we hit a disk-full exception due to, for instance, a big merge, we are likely adding documents to the translog but failing to write them to disk. Once the merge has failed and freed up its disk space (note this is a small window when concurrently indexing and failing the shard due to out-of-space exceptions), we will allow in-flight operations to add to the translog and then, once we fail the shard, fsync it. These operations are written to disk and fsynced, which is fine, but the previous buffer flush might have written some bytes to disk which are now corrupting the translog. That wouldn't be an issue if we prevented the fsync. Closes elastic#15333
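The remedy described above — fail and close the writer on the first I/O error instead of tolerating it — can be sketched as follows. This is a hypothetical Python illustration (the class and method names are invented, not Elasticsearch's actual Java `TranslogWriter`): after the first failure the writer is poisoned, so no later write or fsync can make partially written bytes durable:

```python
class FailFastTranslogWriter:
    """Sketch of the fix: after the first I/O failure the writer
    refuses all further writes and fsyncs, so a partial write can
    never be followed by an fsync that persists corrupt bytes."""

    def __init__(self, disk):
        self.disk = disk
        self.failure = None  # first exception seen, if any

    def _ensure_healthy(self):
        if self.failure is not None:
            raise IOError("translog writer already failed") from self.failure

    def write(self, data: bytes):
        self._ensure_healthy()
        try:
            self.disk.write(data)
        except IOError as e:
            self.failure = e  # poison the writer on the first failure
            raise

    def sync(self):
        self._ensure_healthy()  # never fsync after a failed write
        self.disk.flush()

class FullDisk:
    """Stand-in device whose writes always fail with disk-full."""
    def write(self, data): raise IOError("disk full")
    def flush(self): pass

w = FailFastTranslogWriter(FullDisk())
try:
    w.write(b"op1;")
except IOError:
    pass
# Any later fsync is rejected instead of silently persisting garbage:
sync_allowed = True
try:
    w.sync()
except IOError:
    sync_allowed = False
assert sync_allowed is False
```

The design point is simply that the first exception is sticky; recovery then has to reopen or truncate the translog deliberately rather than continuing on a writer in an unknown state.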
Encountered this issue on one of the older clusters after a disk-full issue. Is there any way to recover the index? Losing something from the translog is not a big issue for me. Here is the exception while starting the node:
These are the contents of the index translog directory:
@s1monw Is there any theoretical chance to fix this? Maybe removing part of the translog and persuading Elasticsearch it is complete?
FYI, I saw this same error in an Elasticsearch 2.3.3 environment that ran out of disk space. I was surprised I had to restart Elasticsearch in order to recover.
Today, the disk got full and Elasticsearch is not able to come back up again. Isn't there a built-in system that prevents such failures? I agree that we should be monitoring disk space and not let this happen in the first place, but sometimes things happen.
My setup is a single node at present, using ES 2.1.0, which was supposed to have this fix.
I don't see a clear way to recover the node. A post at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html seemed to help, but still a few indices got corrupted and I have no way of recovering them.
In the end, I ended up deleting the indices, but that's not how it should be. Such failures must ultimately be handled gracefully. This is clearly a bug in ES 2.1.0.
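For what it's worth, Elasticsearch does ship disk-based shard allocation thresholds that can keep new shards off a nearly full node (they do not stop translog or merge writes on shards already allocated there, which is why this bug can still bite). A sketch of the relevant `elasticsearch.yml` settings — the percentage values below are the 2.x defaults, shown for illustration:

```
# Disk-based shard allocation decider (enabled by default)
cluster.routing.allocation.disk.threshold_enabled: true

# Stop allocating new shards to a node above this disk usage:
cluster.routing.allocation.disk.watermark.low: 85%

# Try to relocate shards away from a node above this disk usage:
cluster.routing.allocation.disk.watermark.high: 90%
```

Monitoring disk usage externally remains necessary, since existing shards on a node can still fill the disk through indexing and merges.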