
Failure to recover shards after disk is full #15333

Closed
kpcool opened this issue Dec 9, 2015 · 18 comments
Labels
blocker >bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v2.0.2 v2.1.1 v2.2.0 v5.0.0-alpha1

Comments

@kpcool commented Dec 9, 2015

Today the disk got full and Elasticsearch is not able to come back up again. Isn't there a built-in system that prevents such failures? I agree that we should be monitoring the disk space and not let this happen in the first place, but sometimes things happen.

My setup is a single node at present, using ES 2.1.0, which was supposed to have a fix for this.

I don't see a clear way to recover the node. A post at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html seemed to help, but a few indices still got corrupted and I have no way of recovering them.

In the end I ended up deleting the indices, but that's not the way it should be. Such failures must ultimately be handled by Elasticsearch itself; this is clearly a bug in ES 2.1.0.
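For context on the "built-in system" question: 2.x does ship a disk-based allocation decider, the same mechanism that later shows up in the logs below as "high disk watermark [90%] exceeded". A minimal sketch of the relevant cluster settings, assuming the 2.x defaults of 85%/90% for the low/high watermarks, would look roughly like this:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.info.update.interval": "30s"
  }
}'

On a single-node cluster, however, the high watermark has no other node to relocate shards to, and these watermarks only block shard allocation and relocation; they do not stop indexing into shards that already live on the node, so the disk can still fill up completely.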

@s1monw (Contributor) commented Dec 9, 2015

maybe you can tell us what prevented you from starting up again?

@kpcool (Author) commented Dec 9, 2015

There were 15 indices on the node. Of those 15 indices, 9 had issues with their shards and the ES status was red.

Issuing the curl -XGET http://localhost:9200/_cat/shards command listed 52 shards as UNASSIGNED and 4 shards as INITIALIZING.

I issued the reroute command (localhost:9200/_cluster/reroute) to force allocation of the UNASSIGNED shards.

However, the shards that were in INITIALIZING status stayed there. CPU usage was 100% (all 8 cores busy) for more than 4 hours before I gave up and started deleting the indices that were causing the problem. The data was about 50 GB and 6 million records.

Even issuing systemctl stop elasticsearch.service took forever.

Is this what you were looking for? If not, let me know what you need and I will reply ASAP.
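For reference, the forced-allocation reroute described above would, on 2.1, have looked roughly like the following; the index, shard number, and node name are only illustrative values taken from the logs later in this issue:

curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "topbeat-2015.12.09",
        "shard": 3,
        "node": "Mister Jip",
        "allow_primary": true
      }
    }
  ]
}'

Note that forcing primary allocation with allow_primary: true can end up creating an empty primary and losing whatever data was in that shard copy, which is why it is normally a last resort.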

@s1monw (Contributor) commented Dec 9, 2015

There are lots of open questions: do you have any logs showing why the shards were unassigned? Did you just upgrade? Why did you force them to allocate? Did you run into any disk space issues?

@kpcool (Author) commented Dec 9, 2015

Yes, the disk got full and after that the issue started happening, as ES stopped responding.

Regards,
Ketan


@clintongormley

@kpcool Please could you provide the logs and also answer the questions asked by @s1monw. The information you have provided up until now gives no clues as to why the shards were not reassigned, etc.

@clintongormley clintongormley added feedback_needed :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Dec 10, 2015
@kpcool (Author) commented Dec 10, 2015

Here's the log around that time.

[2015-12-09 00:00:18,560][ERROR][index.engine             ] [Mister Jip] [topbeat-2015.12.09][3] failed to merge
java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
        at java.nio.channels.Channels.writeFully(Channels.java:101)
        at java.nio.channels.Channels.access$000(Channels.java:61)
        at java.nio.channels.Channels$1.write(Channels.java:174)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:271)
        at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
        at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
        at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
        at org.apache.lucene.util.packed.DirectWriter.flush(DirectWriter.java:86)
        at org.apache.lucene.util.packed.DirectWriter.add(DirectWriter.java:78)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:218)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:80)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addSortedNumericField(Lucene50DocValuesConsumer.java:470)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addSortedNumericField(PerFieldDocValuesFormat.java:126)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:417)
        at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:236)
        at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:150)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4089)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3664)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
        at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:94)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
[2015-12-09 00:00:18,997][WARN ][index.engine             ] [Mister Jip] [topbeat-2015.12.09][3] failed engine [already closed by tragic event]
java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
        at java.nio.channels.Channels.writeFully(Channels.java:101)
        at java.nio.channels.Channels.access$000(Channels.java:61)
        at java.nio.channels.Channels$1.write(Channels.java:174)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:271)
        at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
        at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
        at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
        at org.apache.lucene.util.packed.DirectWriter.flush(DirectWriter.java:86)
        at org.apache.lucene.util.packed.DirectWriter.add(DirectWriter.java:78)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:218)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:80)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addSortedNumericField(Lucene50DocValuesConsumer.java:470)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addSortedNumericField(PerFieldDocValuesFormat.java:126)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:417)
        at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:236)
        at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:150)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4089)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3664)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
        at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:94)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
[2015-12-09 00:00:19,015][WARN ][indices.cluster          ] [Mister Jip] [[topbeat-2015.12.09][3]] marking and sending shard failed due to [engine failure, reason [already closed by tragic event]]
java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
        at java.nio.channels.Channels.writeFully(Channels.java:101)
        at java.nio.channels.Channels.access$000(Channels.java:61)
        at java.nio.channels.Channels$1.write(Channels.java:174)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:271)
        at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
        at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
        at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
        at org.apache.lucene.util.packed.DirectWriter.flush(DirectWriter.java:86)
        at org.apache.lucene.util.packed.DirectWriter.add(DirectWriter.java:78)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:218)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:80)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addSortedNumericField(Lucene50DocValuesConsumer.java:470)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addSortedNumericField(PerFieldDocValuesFormat.java:126)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:417)
        at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:236)
        at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:150)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4089)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3664)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
        at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:94)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
[2015-12-09 00:00:19,015][WARN ][cluster.action.shard     ] [Mister Jip] [topbeat-2015.12.09][3] received shard failed for [topbeat-2015.12.09][3], node[HmS7B_CdRFqPFT1UeUZEfA], [P], v[5], s[INITIALIZING], a[id=3Yn-3bO6QtClvHUDwYnClw], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-12-09T04:58:37.185Z], details[engine failure, reason [merge failed], failure MergeException[java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ]], indexUUID [rvUixkXqTty2osh3-PMubw], message [engine failure, reason [already closed by tragic event]], failure [IOException[No space left on device]]
java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
        at java.nio.channels.Channels.writeFully(Channels.java:101)
        at java.nio.channels.Channels.access$000(Channels.java:61)
        at java.nio.channels.Channels$1.write(Channels.java:174)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:271)
        at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
        at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
        at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
        at org.apache.lucene.util.packed.DirectWriter.flush(DirectWriter.java:86)
        at org.apache.lucene.util.packed.DirectWriter.add(DirectWriter.java:78)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:218)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:80)
        at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addSortedNumericField(Lucene50DocValuesConsumer.java:470)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addSortedNumericField(PerFieldDocValuesFormat.java:126)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:417)
        at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:236)
        at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:150)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4089)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3664)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
        at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:94)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)

[2015-12-09 00:00:19,887][WARN ][index.translog           ] [Mister Jip] [topbeat-2015.12.09][0] failed to delete temp file /var/lib/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/0/translog/translog-6857015315422195400.tlog
java.nio.file.NoSuchFileException: /var/lib/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/0/translog/translog-6857015315422195400.tlog
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
        at java.nio.file.Files.delete(Files.java:1126)
        at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:324)
        at org.elasticsearch.index.translog.Translog.<init>(Translog.java:166)
        at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:209)
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:152)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-12-09 00:00:24,760][WARN ][cluster.routing.allocation.decider] [Mister Jip] high disk watermark [90%] exceeded on [HmS7B_CdRFqPFT1UeUZEfA][Mister Jip][/var/lib/elasticsearch/DC_Reports/nodes/0] free: 1.3mb[0%], shards will be relocated away from this node
[2015-12-09 00:00:24,760][INFO ][cluster.routing.allocation.decider] [Mister Jip] rerouting shards: [high disk watermark exceeded on one or more nodes]

[2015-12-09 00:00:24,851][INFO ][rest.suppressed          ] /dealscornerin-50 Params: {index=dealscornerin-50}
[dealscornerin-50] IndexAlreadyExistsException[already exists]
        at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:168)
        at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:520)
        at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.access$200(MetaDataCreateIndexService.java:97)
        at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:241)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:388)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-12-09 00:00:54,241][ERROR][marvel.agent             ] [Mister Jip] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
        at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
        at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
        at java.lang.Thread.run(Thread.java:745)
        Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[0]: index [.marvel-es-2015.12.09], type [index_recovery], id [AVGFHCn_dr-UG15JaoIa], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[1]: index [.marvel-es-2015.12.09], type [indices_stats], id [AVGFHCn_dr-UG15JaoIb], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[2]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.04:3:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[3]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.04:3:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[4]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.04:1:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[5]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.04:1:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[6]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.04:2:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[7]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.04:2:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[8]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.04:4:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[9]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.04:4:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[10]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.04:0:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[11]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.04:0:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[12]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.03:3:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[13]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.03:3:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[14]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.03:1:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[15]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.03:1:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[16]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.03:2:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[17]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.03:2:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[18]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:topbeat-2015.12.03:4:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[19]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:_na:topbeat-2015.12.03:4:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
--------------Similar logs--------------
[98]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:dealscornerin-49:0:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]
[99]: index [.marvel-es-2015.12.09], type [shards], id [nbwEWrIlSBWjVgm32O2hAA:HmS7B_CdRFqPFT1UeUZEfA:packetbeat-2015.12.03:3:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3e3d36d4]]]
                at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
                at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
                ... 3 more
[2015-12-09 00:00:55,213][WARN ][cluster.routing.allocation.decider] [Mister Jip] high disk watermark [90%] exceeded on [HmS7B_CdRFqPFT1UeUZEfA][Mister Jip][/var/lib/elasticsearch/DC_Reports/nodes/0] free: 20kb[3.8E-5%], shards will be relocated away from this node
[2015-12-09 00:01:04,257][DEBUG][action.admin.indices.stats] [Mister Jip] [indices:monitor/stats] failed to execute operation for shard [[topbeat-2015.12.09][4], node[HmS7B_CdRFqPFT1UeUZEfA], [P], v[5], s[INITIALIZING], a[id=n5bBcfxdS7ey8IpgEyxwzA], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-12-09T04:58:37.227Z], details[engine failure, reason [merge failed], failure MergeException[java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ]]]
[topbeat-2015.12.09][[topbeat-2015.12.09][4]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [topbeat-2015.12.09][[topbeat-2015.12.09][4]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
        at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
        at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
        at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
        ... 7 more
[2015-12-09 00:01:04,257][DEBUG][action.admin.indices.stats] [Mister Jip] [indices:monitor/stats] failed to execute operation for shard [[topbeat-2015.12.09][2], node[HmS7B_CdRFqPFT1UeUZEfA], [P], v[33], s[INITIALIZING], a[id=hsMorSXnRYCQa28IlkksYQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-12-09T04:58:37.227Z], details[engine failure, reason [already closed by tragic event], failure IOException[No space left on device]]]]
[topbeat-2015.12.09][[topbeat-2015.12.09][2]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [topbeat-2015.12.09][[topbeat-2015.12.09][2]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
        at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
        at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
        at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
        ... 7 more

@clintongormley

OK, so here the disk is full. What happened in the logs after you cleared out space on the disk?

@kpcool (Author) commented Dec 11, 2015

Here's the log from when I tried to start ES after shutting it down:

[2015-12-09 01:53:35,386][WARN ][bootstrap                ] If you are logged in interactively, you will have to re-login for the new limits to take effect.
[2015-12-09 01:53:35,632][INFO ][node                     ] [Maxam] version[2.1.0], pid[5742], build[72cd1f1/2015-11-18T22:40:03Z]
[2015-12-09 01:53:35,632][INFO ][node                     ] [Maxam] initializing ...
[2015-12-09 01:53:36,167][INFO ][plugins                  ] [Maxam] loaded [license, marvel-agent], sites [kopf]
[2015-12-09 01:53:36,219][INFO ][env                      ] [Maxam] using [1] data paths, mounts [[/home (/dev/mapper/centos-home)]], net usable_space [826.2gb], net total_space [872.6gb], spins? [possibly], types [xfs]
[2015-12-09 01:53:38,666][INFO ][node                     ] [Maxam] initialized
[2015-12-09 01:53:38,666][INFO ][node                     ] [Maxam] starting ...
[2015-12-09 01:53:38,877][INFO ][transport                ] [Maxam] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
[2015-12-09 01:53:38,897][INFO ][discovery                ] [Maxam] DC_Reports/ywHqZlB2Ty6FKboZPgRoZQ
[2015-12-09 01:53:41,926][INFO ][cluster.service          ] [Maxam] new_master {Maxam}{ywHqZlB2Ty6FKboZPgRoZQ}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2015-12-09 01:53:41,939][INFO ][http                     ] [Maxam] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2015-12-09 01:53:41,940][INFO ][node                     ] [Maxam] started
[2015-12-09 01:53:46,499][INFO ][license.plugin.core      ] [Maxam] license [07b70bf8-cc41-45d7-900c-67a16d05b960] - valid
[2015-12-09 01:53:46,500][ERROR][license.plugin.core      ] [Maxam]
#
# License will expire on [Thursday, December 31, 2015]. If you have a new license, please update it.
# Otherwise, please reach out to your support contact.
#
# Commercial plugins operate with reduced functionality on license expiration:
# - marvel
#  - The agent will stop collecting cluster and indices metrics
[2015-12-09 01:53:47,706][INFO ][gateway                  ] [Maxam] recovered [27] indices into cluster_state
[2015-12-09 01:53:48,164][WARN ][index.translog           ] [Maxam] [topbeat-2015.12.09][3] failed to delete temp file /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/3/translog/translog-4819800625171304865.tlog
java.nio.file.NoSuchFileException: /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/3/translog/translog-4819800625171304865.tlog
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
        at java.nio.file.Files.delete(Files.java:1126)
        at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:324)
        at org.elasticsearch.index.translog.Translog.<init>(Translog.java:166)
        at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:209)
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:152)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-12-09 01:53:48,172][WARN ][index.translog           ] [Maxam] [topbeat-2015.12.09][2] failed to delete temp file /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/2/translog/translog-4092628000966967177.tlog
java.nio.file.NoSuchFileException: /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/2/translog/translog-4092628000966967177.tlog
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
        at java.nio.file.Files.delete(Files.java:1126)
        at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:324)
        at org.elasticsearch.index.translog.Translog.<init>(Translog.java:166)
        at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:209)
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:152)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-12-09 01:53:48,172][WARN ][index.translog           ] [Maxam] [topbeat-2015.12.09][1] failed to delete temp file /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/1/translog/translog-1515358772559515929.tlog
java.nio.file.NoSuchFileException: /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/1/translog/translog-1515358772559515929.tlog
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
        at java.nio.file.Files.delete(Files.java:1126)
        at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:324)
        at org.elasticsearch.index.translog.Translog.<init>(Translog.java:166)
        at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:209)
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:152)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-12-09 01:53:48,164][WARN ][index.translog           ] [Maxam] [topbeat-2015.12.09][4] failed to delete temp file /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/4/translog/translog-7914277937547324566.tlog
java.nio.file.NoSuchFileException: /home/elkuser/elasticsearch/DC_Reports/nodes/0/indices/topbeat-2015.12.09/4/translog/translog-7914277937547324566.tlog
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
        at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
        at java.nio.file.Files.delete(Files.java:1126)
        at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:324)
        at org.elasticsearch.index.translog.Translog.<init>(Translog.java:166)
        at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:209)
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:152)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-12-09 01:53:48,819][DEBUG][action.admin.indices.stats] [Maxam] [indices:monitor/stats] failed to execute operation for shard [[topbeat-2015.12.09][1], node[ywHqZlB2Ty6FKboZPgRoZQ], [P], v[3], s[INITIALIZING], a[id=beI3ZtSZRLSjmp382hpjGA], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-12-09T06:53:42.007Z]]]
[topbeat-2015.12.09][[topbeat-2015.12.09][1]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [topbeat-2015.12.09][[topbeat-2015.12.09][1]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
        at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
        at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
        at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
        ... 7 more
[2015-12-09 01:53:48,827][DEBUG][action.admin.indices.stats] [Maxam] [indices:monitor/stats] failed to execute operation for shard [[topbeat-2015.12.09][4], node[ywHqZlB2Ty6FKboZPgRoZQ], [P], v[3], s[INITIALIZING], a[id=ibydvhMyTG-8uFU_y2Gx1g], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-12-09T06:53:42.007Z]]]
[topbeat-2015.12.09][[topbeat-2015.12.09][4]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [topbeat-2015.12.09][[topbeat-2015.12.09][4]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
        at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
        at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
        at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
        ... 7 more
[2015-12-09 01:53:48,835][DEBUG][action.admin.indices.stats] [Maxam] [indices:monitor/stats] failed to execute operation for shard [[topbeat-2015.12.09][3], node[ywHqZlB2Ty6FKboZPgRoZQ], [P], v[3], s[INITIALIZING], a[id=bOgSHC15TWakUYPIMhEz7A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-12-09T06:53:42.007Z]]]
[topbeat-2015.12.09][[topbeat-2015.12.09][3]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [topbeat-2015.12.09][[topbeat-2015.12.09][3]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
        at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
        at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
        at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
        ... 7 more
[2015-12-09 01:53:48,839][DEBUG][action.admin.indices.stats] [Maxam] [indices:monitor/stats] failed to execute operation for shard [[topbeat-2015.12.09][2], node[ywHqZlB2Ty6FKboZPgRoZQ], [P], v[3], s[INITIALIZING], a[id=QqLsO7WMRn-P24LXF1LuiQ], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-12-09T06:53:42.007Z]]]
[topbeat-2015.12.09][[topbeat-2015.12.09][2]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [topbeat-2015.12.09][[topbeat-2015.12.09][2]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
        at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
        at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
        at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
        ... 7 more
[2015-12-09 01:53:49,020][DEBUG][action.admin.indices.stats] [Maxam] [indices:monitor/stats] failed to execute operation for shard [[topbeat-2015.12.09][2], node[ywHqZlB2Ty6FKboZPgRoZQ], [P], v[3], s[INITIALIZING], a[id=QqLsO7WMRn-P24LXF1LuiQ], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-12-09T06:53:42.007Z]]]
[topbeat-2015.12.09][[topbeat-2015.12.09][2]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [topbeat-2015.12.09][[topbeat-2015.12.09][2]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
        at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
        at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
        at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
        at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
        ... 7 more
[2015-12-09 01:54:49,046][ERROR][marvel.agent             ] [Maxam] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
        at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
        at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
        at java.lang.Thread.run(Thread.java:745)
        Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[0]: index [.marvel-es-2015.12.09], type [index_recovery], id [AVGFhHRp_6d18XbNgIBf], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[1]: index [.marvel-es-2015.12.09], type [indices_stats], id [AVGFhHRp_6d18XbNgIBg], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[2]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:1:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[3]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:1:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[4]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:4:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[5]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:4:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[6]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:3:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[7]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:3:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[8]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:2:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[9]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:2:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[10]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:0:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[11]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.04:0:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[12]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:1:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[13]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:1:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[14]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:4:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[15]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:4:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[16]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:3:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[17]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:3:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[18]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:2:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[19]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:2:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[20]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:0:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[21]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.03:0:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[22]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.08:1:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[23]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.08:1:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[24]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.08:4:p], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]
[25]: index [.marvel-es-2015.12.09], type [shards], id [KZr_5qQeRbiay0_pdQRUjw:_na:topbeat-2015.12.08:4:r], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]

@s1monw
Contributor

s1monw commented Dec 11, 2015

With your last log I can't see any exceptions that indicate that recovery failed. Does the cluster come back up, or does it get stuck in recoveries? The "failed to delete temp file /home/elkuser/elasticsearch/..." warn logs are annoying but harmless, and already fixed.

@kpcool
Author

kpcool commented Dec 11, 2015

Here's the log entry with the unhandled exception.

[2015-12-09 01:54:49,046][ERROR][marvel.agent             ] [Maxam] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
        at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
        at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
        at java.lang.Thread.run(Thread.java:745)
        Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[0]: index [.marvel-es-2015.12.09], type [index_recovery], id [AVGFhHRp_6d18XbNgIBf], message [UnavailableShardsException[[.marvel-es-2015.12.09][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1a525917]]

Also, the cluster was stuck in recovery for more than 4 hours before I gave up and started removing the indices that were causing problems. Basically, some shards were in UNASSIGNED state and some were in INITIALIZING state.
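
For reference, the shard and recovery state of a stuck single node like this can be inspected with the cluster health and cat APIs (a minimal sketch; the default http://localhost:9200 is assumed):

# overall cluster status (stays red while primaries are missing)
curl -XGET 'http://localhost:9200/_cluster/health?pretty'

# per-shard state, e.g. UNASSIGNED / INITIALIZING
curl -XGET 'http://localhost:9200/_cat/shards?v'

# progress of ongoing recoveries
curl -XGET 'http://localhost:9200/_cat/recovery?v'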

I can upload the whole log file, which is about 42 MB (for the entire day), if that would help.

@s1monw
Contributor

s1monw commented Dec 11, 2015

I can upload the whole log file, which is about 42 MB (for the entire day), if that would help.

please

Also, the cluster was stuck in recovery for more than 4 hours before I gave up and started removing the indices that were causing problems. Basically, some shards were in UNASSIGNED state and some were in INITIALIZING state.

I can't see why this is happening, so the logs would be awesome.

@kpcool
Author

kpcool commented Dec 12, 2015

dc20151208.tar.gz
dc.tar.gz

dc20151208.tar.gz - from when the disk was not yet full but about to get full
dc.tar.gz - from when the disk was full and ES couldn't initialize all shards.

@s1monw
Contributor

s1monw commented Dec 12, 2015

the interesting exceptions are here:

Caused by: [packetbeat-2015.12.09][[packetbeat-2015.12.09][1]] EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x203a77f9, got: 0x2c22706f];
    at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:254)
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:175)
    ... 11 more
Caused by: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x203a77f9, got: 0x2c22706f];
    at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1636)
    at org.elasticsearch.index.translog.TranslogReader.read(TranslogReader.java:132)
    at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.readOperation(TranslogReader.java:299)
    at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.next(TranslogReader.java:290)
    at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70)
    at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:240)
    ... 12 more
Caused by: TranslogCorruptedException[translog stream is corrupted, expected: 0x203a77f9, got: 0x2c22706f]
    at org.elasticsearch.index.translog.Translog.verifyChecksum(Translog.java:1593)
    at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1626)
    ... 17 more

I still need to investigate what's going on, but can you tell me what system you are running this on? Is this a local machine or a cloud machine? I'm also curious which filesystem you are using.

@kpcool
Author

kpcool commented Dec 14, 2015

It's a standalone node.
Machine Info: 16GB DDR3, 1TB HDD , Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz.
FileSystem: xfs
CentOS: 7.0 64Bit
Java : "1.8.0_65"

@s1monw
Contributor

s1monw commented Dec 14, 2015

Alright, I think I found the issue here, @kpcool; your log files led me to the conclusion, thank you very much. This is actually a serious issue with our transaction log, which basically corrupts itself when you hit a disk-full exception. I will keep you posted on this issue. Thanks for bearing with me and helping to figure this out.

@s1monw
Contributor

s1monw commented Dec 14, 2015

What happens here is that when we hit a disk-full exception while we are flushing the transaction log, we might manage to write a portion of the data, but we will then try to flush the entire data block over and over again. Yet in most scenarios the disk-full happens during a merge, and that merge will fail and release disk space. Once that is done we might be able to flush the translog again, but we already wrote big chunks of data to disk which are now 1. corrupted and 2. treated as non-existent, since our internal offsets haven't advanced.
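
To make that failure mode concrete, here is a toy sketch in plain shell (this is not Elasticsearch code, just an illustration of why re-flushing a whole buffer after a partial write corrupts an append-only log):

# the in-memory buffer holds four operations
printf 'op1|op2|op3|op4|' > buffer.tmp

# first flush attempt: the "disk fills up" after 8 bytes, leaving a partial write
head -c 8 buffer.tmp >> translog.toy      # translog.toy now ends with "op1|op2|"

# space is freed (e.g. the failed merge is cleaned up) and the writer retries the
# entire buffer, because its internal offset never advanced past the partial write
cat buffer.tmp >> translog.toy            # appends "op1|op2|op3|op4|" again

# the log now reads "op1|op2|op1|op2|op3|op4|": operation boundaries and checksums
# no longer line up, which is what the recovery code trips over at read time
cat translog.toy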

s1monw added a commit to s1monw/elasticsearch that referenced this issue Dec 14, 2015
Today we are super lenient (how could I have missed that, for f**k's sake) with failing
/ closing the translog writer when we hit an exception. It's actually worse: we allow
further writes to it and don't care what has already been written to disk and what hasn't.
We keep the buffer in memory and try to write it again on the next operation.

When we hit a disk-full exception due to, for instance, a big merge, we are likely adding documents to the
translog but failing to write them to disk. Once the merge has failed and freed up its disk space (note this is
a small window when concurrently indexing and failing the shard due to out-of-space exceptions) we will
allow in-flight operations to add to the translog and then, once we fail the shard, fsync it. These operations
are written to disk and fsynced, which is fine, but the previous buffer flush might have written some bytes
to disk which are now corrupting the translog. That wouldn't be an issue if we prevented the fsync.

Closes elastic#15333
@robcza

robcza commented Jul 29, 2016

I encountered this issue on one of our older clusters after a disk-full event. Is there any way to recover the index? Losing something from the translog is not a big issue for me.

Here is the exception while starting the node:

[2016-07-29 07:48:15,265][WARN ][indices.cluster          ] [Recorder] [[myindex][0]] marking and sending shard failed due to [failed recovery]
[myindex][[myindex][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x88b7b1d6, got: 0x2c202266];
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
    at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
    at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [myindex][[myindex][0]] EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x88b7b1d6, got: 0x2c202266];
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:177)
    at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
    at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
    at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
    at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
    at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
    ... 5 more
Caused by: [myindex][[myindex][0]] EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x88b7b1d6, got: 0x2c202266];
    at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:240)
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:174)
    ... 11 more
Caused by: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x88b7b1d6, got: 0x2c202266];
    at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1717)
    at org.elasticsearch.index.translog.TranslogReader.read(TranslogReader.java:132)
    at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.readOperation(TranslogReader.java:296)
    at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.next(TranslogReader.java:287)
    at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70)
    at org.elasticsearch.index.shard.TranslogRecoveryPerformer.recoveryFromSnapshot(TranslogRecoveryPerformer.java:105)
    at org.elasticsearch.index.shard.IndexShard$1.recoveryFromSnapshot(IndexShard.java:1578)
    at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:238)
    ... 12 more
Caused by: TranslogCorruptedException[translog stream is corrupted, expected: 0x88b7b1d6, got: 0x2c202266]
    at org.elasticsearch.index.translog.Translog.verifyChecksum(Translog.java:1675)
    at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1707)
    ... 19 more

These are the contents of the index translog directory:

elasticsearch/nodes/0/indices/myindex/0/translog# ls
translog-1445516620591.ckp  translog-1445516623424.ckp   translog-1445516631010.tlog  translog-1445516631019.tlog  translog-1445516631028.tlog  translog-1445516631037.tlog  translog-1445516631046.tlog
translog-1445516620592.ckp  translog-1445516624605.ckp   translog-1445516631011.ckp   translog-1445516631020.ckp   translog-1445516631029.ckp   translog-1445516631038.ckp   translog-1445516631047.ckp
translog-1445516620593.ckp  translog-1445516624606.ckp   translog-1445516631011.tlog  translog-1445516631020.tlog  translog-1445516631029.tlog  translog-1445516631038.tlog  translog-1445516631047.tlog
translog-1445516620594.ckp  translog-1445516624607.ckp   translog-1445516631012.ckp   translog-1445516631021.ckp   translog-1445516631030.ckp   translog-1445516631039.ckp   translog-1445516631048.ckp
translog-1445516620595.ckp  translog-1445516624608.ckp   translog-1445516631012.tlog  translog-1445516631021.tlog  translog-1445516631030.tlog  translog-1445516631039.tlog  translog-1445516631048.tlog
translog-1445516620596.ckp  translog-1445516624609.ckp   translog-1445516631013.ckp   translog-1445516631022.ckp   translog-1445516631031.ckp   translog-1445516631040.ckp   translog-1445516631049.ckp
translog-1445516620790.ckp  translog-1445516624610.ckp   translog-1445516631013.tlog  translog-1445516631022.tlog  translog-1445516631031.tlog  translog-1445516631040.tlog  translog-1445516631049.tlog
translog-1445516621043.ckp  translog-1445516624611.ckp   translog-1445516631014.ckp   translog-1445516631023.ckp   translog-1445516631032.ckp   translog-1445516631041.ckp   translog-1445516631050.ckp
translog-1445516621044.ckp  translog-1445516624612.ckp   translog-1445516631014.tlog  translog-1445516631023.tlog  translog-1445516631032.tlog  translog-1445516631041.tlog  translog-1445516631050.tlog
translog-1445516621237.ckp  translog-1445516625986.ckp   translog-1445516631015.ckp   translog-1445516631024.ckp   translog-1445516631033.ckp   translog-1445516631042.ckp   translog-1445516631051.ckp
translog-1445516621238.ckp  translog-1445516628096.ckp   translog-1445516631015.tlog  translog-1445516631024.tlog  translog-1445516631033.tlog  translog-1445516631042.tlog  translog-1445516631051.tlog
translog-1445516621239.ckp  translog-1445516628097.ckp   translog-1445516631016.ckp   translog-1445516631025.ckp   translog-1445516631034.ckp   translog-1445516631043.ckp   translog-1445516631052.ckp
translog-1445516621240.ckp  translog-1445516628624.ckp   translog-1445516631016.tlog  translog-1445516631025.tlog  translog-1445516631034.tlog  translog-1445516631043.tlog  translog-1445516631052.tlog
translog-1445516621380.ckp  translog-1445516628625.ckp   translog-1445516631017.ckp   translog-1445516631026.ckp   translog-1445516631035.ckp   translog-1445516631044.ckp   translog-1445516631053.tlog
translog-1445516623082.ckp  translog-1445516629747.ckp   translog-1445516631017.tlog  translog-1445516631026.tlog  translog-1445516631035.tlog  translog-1445516631044.tlog  translog.ckp
translog-1445516623417.ckp  translog-1445516631009.ckp   translog-1445516631018.ckp   translog-1445516631027.ckp   translog-1445516631036.ckp   translog-1445516631045.ckp
translog-1445516623418.ckp  translog-1445516631009.tlog  translog-1445516631018.tlog  translog-1445516631027.tlog  translog-1445516631036.tlog  translog-1445516631045.tlog
translog-1445516623423.ckp  translog-1445516631010.ckp   translog-1445516631019.ckp   translog-1445516631028.ckp   translog-1445516631037.ckp   translog-1445516631046.ckp

@s1monw Is there any theoretical chance to fix this? Maybe by removing part of the translog and persuading elasticsearch it is complete?
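
For anyone else who ends up in this state: Elasticsearch 5.0 and later ship a small CLI that can truncate a corrupt translog so the shard starts up with an empty one, at the cost of losing whatever operations were only in the translog. A minimal sketch (the tool does not exist on 2.x, the path below is only illustrative, and the node must be stopped first):

# run against the affected shard's translog directory while the node is down
bin/elasticsearch-translog truncate -d /path/to/data/nodes/0/indices/myindex/0/translog/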

@jamshid

jamshid commented Dec 12, 2017

FYI, I saw this same error in an Elasticsearch 2.3.3 environment that ran out of disk space. I was surprised I had to restart Elasticsearch in order to recover.
I was hoping this bug was fixed by #15420, but it looks like that fix is already in 2.3.3, so there must be additional bugs.
Hopefully the latest Elasticsearch automatically recovers from full-disk problems.
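
On the prevention side, the disk-based allocation thresholds already exist in 2.x and are worth double-checking, although they only stop new shard allocation rather than indexing itself (the flood-stage read-only block that actually stops writes only arrived in later releases). A minimal sketch of inspecting and setting them via the cluster settings API, using the documented defaults:

# show any non-default cluster settings
curl -XGET 'http://localhost:9200/_cluster/settings?pretty'

# defaults: stop allocating new shards to a node above 85% disk usage,
# start relocating shards away above 90%
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'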

@clintongormley added the :Distributed/Distributed and :Distributed/Engine labels and removed the :Translog and :Distributed/Distributed labels Feb 13, 2018