Failure to recover shards after the disk was full #12055

Closed
WellingR opened this issue Jul 6, 2015 · 22 comments

@WellingR

WellingR commented Jul 6, 2015

On one of our servers running Elasticsearch, another process wrote so many log files that the disk ran out of space. After deleting these log files and rebooting the system, Elasticsearch did not recover.

We are running on a single server with Elasticsearch 1.5.2.

I believe we managed to recover by deleting some of the *.recovering files in the Elasticsearch data directories, but it would be great if Elasticsearch could recover as much as possible by itself.

[2015-07-03 14:09:37,196][WARN ][cluster.action.shard     ] [mxserver] [abds-historic-snapshots-2015-07-03][1] received shard failed for [abds-historic-snapshots-2015-07-03][1], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [6unWDyfbQ_yF9XTYAlMz4g], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-historic-snapshots-2015-07-03][1] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [116]]; ]]
[2015-07-03 14:09:37,205][WARN ][index.engine             ] [mxserver] [abds-instance][0] failed to sync translog
[2015-07-03 14:09:37,206][WARN ][indices.cluster          ] [mxserver] [[abds-instance][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-instance][0] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
    ... 4 more
Caused by: org.elasticsearch.ElasticsearchException: failed to read [abdstrack][AdsbTrack-7668367]
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:522)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
    at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:519)
    ... 6 more
[2015-07-03 14:09:37,206][WARN ][cluster.action.shard     ] [mxserver] [abds-instance][0] received shard failed for [abds-instance][0], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [H8FyNbqATmWQ6p8RYSGncw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-instance][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchException[failed to read [abdstrack][AdsbTrack-7668367]]; nested: ElasticsearchIllegalArgumentException[No version type match [48]]; ]]
[2015-07-03 14:09:37,216][WARN ][index.engine             ] [mxserver] [abds-historic-snapshots-2015-07-03][1] failed to sync translog
[2015-07-03 14:09:37,217][WARN ][indices.cluster          ] [mxserver] [[abds-historic-snapshots-2015-07-03][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-historic-snapshots-2015-07-03][1] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
    ... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [116]
    at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
    at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:376)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more

Note: This issue seems very similar to #10606 which I have reported before.
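
For readers in the same situation, a minimal sketch of how the leftover *.recovering translog files described above might be located for review before anything is deleted or moved; the data path here is an assumption and will differ per installation:

```python
# Hypothetical helper: list leftover *.recovering translog files so they can
# be reviewed before anything is deleted or moved. DATA_DIR is an assumed
# default path and must be adjusted to the actual installation.
from pathlib import Path

DATA_DIR = Path("/var/lib/elasticsearch")  # assumption: default data path

for recovering in sorted(DATA_DIR.rglob("*.recovering")):
    # A size of 0 (or an unusually small size) often indicates a partially
    # written file left behind when the disk filled up.
    print(f"{recovering}  ({recovering.stat().st_size} bytes)")
```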

@287400117

Did you solve the problem?

@s1monw
Contributor

s1monw commented Jul 8, 2015

This will be fixed in Elasticsearch 2.0. It is unlikely to make it into the 1.x series, since it depends on a large number of changes that are only in 2.0.

@s1monw s1monw closed this as completed Jul 8, 2015
@autrejacoupa

When is Elasticsearch 2.0 scheduled for release?

@balaji006

Delete .recovery file inside the translog folder

Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/

@tony-bye

I also had this kind of error after a partition filled up, and deleting the .recovery files as balaji006 suggested worked fine. I had a lot of affected index/shard directories, but after deleting each .recovery file Elasticsearch came back up fine.

Update: Oops, spoke too soon. Now all queries give me "All shards failed for phase: [query]".

@sa-shukla

I am running Elasticsearch 2.0, but am still seeing IndexShard recovery failures:

[2015-11-23 18:03:32,670][WARN ][cluster.action.shard ] [The Russian] [logstash-2015.10.24][4] received shard failed for [logstash-2015.10.24][4], node[omb9PXHUTXqpKeesvkCbPw], [P], v[742647], s[INITIALIZING], a[id=XUctUOPUQLiHXyK2J9gdlg], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-11-23T18:03:32.486Z], details[failed recovery, failure IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]], indexUUID [jf5m3aXaQLyH9gMhwMBuDQ], message [failed recovery], failure [IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]
[logstash-2015.10.24][[logstash-2015.10.24][4]] IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1];
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1
at org.elasticsearch.index.translog.Translog.upgradeLegacyTranslog(Translog.java:253)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:185)
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:131)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
... 3 more

There has been no disk-full issue since my upgrade to 2.0, so the possibility of the recovery file getting corrupted is very low.

Any fixes / workaround would be very much appreciated.

Regards,
Sagar

@kpcool

kpcool commented Dec 9, 2015

Today the disk got full and Elasticsearch is not able to come back up. Isn't there a built-in system that prevents such failures? I agree that we should be monitoring the disk space and not let this happen in the first place, but sometimes things happen.

My setup is a single node at present.

I don't see a clear way to recover the node. A post at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html seemed to help, but a few indices still ended up corrupted and I have no way of recovering them.

In the end I deleted the indices, but that's not the way it should be. Such things must ultimately be taken care of.
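
One way to see which shards are actually stuck after an incident like this is the cat shards API; a small sketch, assuming a node reachable on localhost:9200:

```python
# Sketch: list shards that are not STARTED, to see which indices are stuck
# in INITIALIZING or UNASSIGNED after a disk-full incident.
# Assumes a node reachable on localhost:9200.
import urllib.request

URL = "http://localhost:9200/_cat/shards?h=index,shard,prirep,state"

with urllib.request.urlopen(URL) as resp:
    for line in resp.read().decode().splitlines():
        index, shard, prirep, state = line.split()
        if state != "STARTED":
            print(index, shard, prirep, state)
```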

@CorbMax

CorbMax commented Jan 7, 2016

Same issue here. I applied the tips at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html, but it's not solved.

This is really really disappointing!

@bleskes
Contributor

bleskes commented Jan 8, 2016

@CorbMax (and @kpcool ) which ES versions are you on?

@CorbMax

CorbMax commented Jan 8, 2016

I'm on 2.0, but going to upgrade to 2.1.
Unfortunately I was obliged to delete indexes to unlock the system...

@starkers

starkers commented Feb 2, 2016

I think I've experienced the same just after updating from 2.1.0 to 2.2.0 (from the official stable PPA).

It's only a few devel indexes, but the recovery (after stopping Elasticsearch, growing the disk, and starting Elasticsearch again) seems to have filled up the disk very quickly with translog "stuff".

I'm just going to delete it all, but this shouldn't be difficult to replicate.

@bleskes
Contributor

bleskes commented Feb 3, 2016

@starkers can you please capture the files and logs before deleting, and share them somewhere? These things are typically not easy to reproduce :(

@systeminsightsbuild

Should this not be reopened? I just had a disk fill up and am now getting

[2016-02-06 06:01:59,643][WARN ][cluster.action.shard     ] [ops-elk-1] [logstash-2016.02.05][2] received shard failed for ...

This is for 2.1.1

@bleskes
Contributor

bleskes commented Feb 6, 2016

@systeminsightsbuild sadly there can be many reasons for this kind of failure. This specific issue is about translog corruption due to a failure to fully write an operation, which is fixed in 2.0. There might be other issues as well. It's hard to tell from the log line you sent, as it misses the part that tells why the shard failed. If you can post that (and feel free to open a new issue), we can see what's going on.

@likaiguo

it works!!!

balaji006 commented on 1 Sep 2015
Delete .recovery file inside the translog folder

Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/

Thanks @balaji006

@amir-rahnama

@s1monw This is still happening on 2.2.3.

@balaji006's workaround fixed the issue, but I think this needs to be addressed.

@bleskes
Contributor

bleskes commented Apr 7, 2016

@ambodi can you open a new issue with the details of what you saw? This can come in many flavors. I'm also curious how you had a .recovering translog file, which is not used in 2.x.

@amir-rahnama

@bleskes here is what I see:

2016-04-14T10:02:45.691973552Z Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
2016-04-14T10:02:45.691977952Z at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
2016-04-14T10:02:45.691982252Z at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
2016-04-14T10:02:45.691992452Z ... 4 more

@bleskes we upgraded from 1.5 to 2.2.3

@bleskes
Contributor

bleskes commented Apr 14, 2016

@ambodi thx. That exception stack trace refers to a class that has been removed in the 2.x series. The code that generated this exception is therefore from your 1.5 version. This makes me think something went wrong with your upgrade and that that node is still on 1.5.

PS. I take it you mean an upgrade to 2.2.3 (as you wrote before) and not 2.8.
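
A quick way to verify which version each node is actually running is the cat nodes API; a small sketch, assuming a node reachable on localhost:9200:

```python
# Sketch: print the name and version of every node in the cluster, to verify
# that no node is still running the old 1.5 binaries after an upgrade.
# Assumes a node reachable on localhost:9200.
import urllib.request

URL = "http://localhost:9200/_cat/nodes?h=name,version"

with urllib.request.urlopen(URL) as resp:
    print(resp.read().decode())
```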

@tamsky

tamsky commented Jun 12, 2016

For reference, an original thread with a complete set of instructions for this error is at:

And to correct mistakes found above:

  • Every comment above that mentions files with the suffix ".recovery" is mistaken. The correct suffix is .recovering.

For us, after stopping ES, moving these .recovering files to another filesystem, and then starting ES, our cluster was able to recover. (ES version 1.6.2)
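
A sketch of that workaround (run only while Elasticsearch is stopped; both paths are assumptions and must match the actual installation), moving the *.recovering files aside rather than deleting them so they can be inspected or restored later:

```python
# Sketch: move *.recovering translog files out of the Elasticsearch data
# directory onto another filesystem, preserving the directory layout so the
# files can be moved back or inspected later. Run only while Elasticsearch
# is stopped. Both paths are assumptions.
import shutil
from pathlib import Path

DATA_DIR = Path("/var/lib/elasticsearch")       # assumption: ES data path
BACKUP_DIR = Path("/mnt/backup/es-recovering")  # assumption: target on another filesystem

for recovering in DATA_DIR.rglob("*.recovering"):
    dest = BACKUP_DIR / recovering.relative_to(DATA_DIR)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(recovering), str(dest))
    print(f"moved {recovering} -> {dest}")
```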

@jamshid

jamshid commented Jan 9, 2017

@tamsky the link doesn't work, maybe the elasticsearch group was deleted/moved?
FWIW I found this issue because I had a problem with ES 2.3.3 running out of disk space and then not recovering properly. But I guess it's not related to this issue since the .recovering file is no longer used? Sorry, I don't have logs from the ES 2.3.3 problem.

@tamsky

tamsky commented Jan 10, 2017

Thanks for pointing out the group is gone.

I'm disappointed the ES team invalidated (and made unsearchable by old URL) all those group links after their bulk import and announcement. I've learned my lesson: at a minimum, quote the thread subject.

A bit of spelunking later, I found a citation containing both the thread URL and subject:
[ ES failed to recover after crash ]

Here's the migrated thread:
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195

I guess the message I had linked to was this
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195/5
but my comment giving corrections seems out of place or already corrected.
