Failure to recover shards after the disk was full #12055

Closed
WellingR opened this issue Jul 6, 2015 · 22 comments

@WellingR

WellingR commented Jul 6, 2015

On one of our servers running Elasticsearch, another process wrote so many log files that the disk ran out of space. After deleting these log files and rebooting the system, Elasticsearch did not recover.

We are running on a single server with Elasticsearch 1.5.2.

I believe we managed to recover by deleting some of the *.recovering files in the Elasticsearch data directories, but it would be great if Elasticsearch could recover as much as possible by itself.

[2015-07-03 14:09:37,196][WARN ][cluster.action.shard     ] [mxserver] [abds-historic-snapshots-2015-07-03][1] received shard failed for [abds-historic-snapshots-2015-07-03][1], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [6unWDyfbQ_yF9XTYAlMz4g], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-historic-snapshots-2015-07-03][1] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [116]]; ]]
[2015-07-03 14:09:37,205][WARN ][index.engine             ] [mxserver] [abds-instance][0] failed to sync translog
[2015-07-03 14:09:37,206][WARN ][indices.cluster          ] [mxserver] [[abds-instance][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-instance][0] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
    ... 4 more
Caused by: org.elasticsearch.ElasticsearchException: failed to read [abdstrack][AdsbTrack-7668367]
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:522)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
    at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
    at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:519)
    ... 6 more
[2015-07-03 14:09:37,206][WARN ][cluster.action.shard     ] [mxserver] [abds-instance][0] received shard failed for [abds-instance][0], node[9HgooclMS6W9m-1lqKxV8Q], [P], s[INITIALIZING], indexUUID [H8FyNbqATmWQ6p8RYSGncw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[abds-instance][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchException[failed to read [abdstrack][AdsbTrack-7668367]]; nested: ElasticsearchIllegalArgumentException[No version type match [48]]; ]]
[2015-07-03 14:09:37,216][WARN ][index.engine             ] [mxserver] [abds-historic-snapshots-2015-07-03][1] failed to sync translog
[2015-07-03 14:09:37,217][WARN ][indices.cluster          ] [mxserver] [[abds-historic-snapshots-2015-07-03][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [abds-historic-snapshots-2015-07-03][1] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
    ... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [116]
    at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
    at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:376)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more

Note: This issue seems very similar to #10606 which I have reported before.
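
For readers in the same situation, a minimal sketch of how the leftover *.recovering translog files described above might be located for review before anything is deleted or moved; the data path here is an assumption and will differ per installation:

```python
# Hypothetical helper: list leftover *.recovering translog files so they can
# be reviewed before anything is deleted or moved. DATA_DIR is an assumed
# default path and must be adjusted to the actual installation.
from pathlib import Path

DATA_DIR = Path("/var/lib/elasticsearch")  # assumption: default data path

for recovering in sorted(DATA_DIR.rglob("*.recovering")):
    # A size of 0 (or an unusually small size) often indicates a partially
    # written file left behind when the disk filled up.
    print(f"{recovering}  ({recovering.stat().st_size} bytes)")
```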

@287400117

Did you solve the problem?

@s1monw
Contributor

s1monw commented Jul 8, 2015

This will be fixed in Elasticsearch 2.0. It is unlikely to make it into the 1.x series, since it depends on a large number of changes that are only in 2.0.

@s1monw s1monw closed this as completed Jul 8, 2015
@autrejacoupa

When is Elasticsearch 2.0 scheduled for release?

@balaji006

Delete .recovery file inside the translog folder

Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/

@tony-bye

I also had this kind of error after a partition filled up, and deleting the .recovery files as balaji006 suggested worked fine. I had a lot of affected index/shard directories, but after deleting each .recovery file Elasticsearch came back up fine.

Update: Oops, spoke too soon. Now all queries give me "All shards failed for phase: [query]".

@sa-shukla

I am running Elasticsearch 2.0, but am still seeing IndexShard recovery failures:

[2015-11-23 18:03:32,670][WARN ][cluster.action.shard ] [The Russian] [logstash-2015.10.24][4] received shard failed for [logstash-2015.10.24][4], node[omb9PXHUTXqpKeesvkCbPw], [P], v[742647], s[INITIALIZING], a[id=XUctUOPUQLiHXyK2J9gdlg], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-11-23T18:03:32.486Z], details[failed recovery, failure IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]], indexUUID [jf5m3aXaQLyH9gMhwMBuDQ], message [failed recovery], failure [IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1]; ]
[logstash-2015.10.24][[logstash-2015.10.24][4]] IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1];
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: latest found translog has a lower generation that the excepcted uncommitted 1421133423283 > -1
at org.elasticsearch.index.translog.Translog.upgradeLegacyTranslog(Translog.java:253)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:185)
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:131)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
... 3 more

There has been no disk-full issue since my upgrade to 2.0, so the possibility of the recovery file getting corrupted is very low.

Any fixes / workaround would be very much appreciated.

Regards,
Sagar

@kpcool

kpcool commented Dec 9, 2015

Today the disk got full and Elasticsearch is not able to come back up. Isn't there a built-in system that prevents such failures? I agree that we should be monitoring the disk space and not let this happen in the first place, but sometimes things happen.

My setup is a single node at present.

I don't see a clear way to recover the node. A post at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html seemed to help, but a few indices still ended up corrupted and I have no way of recovering them.

In the end I deleted the indices, but that's not the way it should be. Such things must ultimately be taken care of.
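
One way to see which shards are actually stuck after an incident like this is the cat shards API; a small sketch, assuming a node reachable on localhost:9200:

```python
# Sketch: list shards that are not STARTED, to see which indices are stuck
# in INITIALIZING or UNASSIGNED after a disk-full incident.
# Assumes a node reachable on localhost:9200.
import urllib.request

URL = "http://localhost:9200/_cat/shards?h=index,shard,prirep,state"

with urllib.request.urlopen(URL) as resp:
    for line in resp.read().decode().splitlines():
        index, shard, prirep, state = line.split()
        if state != "STARTED":
            print(index, shard, prirep, state)
```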

@CorbMax

CorbMax commented Jan 7, 2016

Same issue here. I applied the tips at https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html, but it's not solved.

This is really really disappointing!

@bleskes
Contributor

bleskes commented Jan 8, 2016

@CorbMax (and @kpcool ) which ES versions are you on?

@CorbMax

CorbMax commented Jan 8, 2016

I'm on 2.0, but going to upgrade to 2.1.
Unfortunately I was obliged to delete indexes to unlock the system...

@starkers

starkers commented Feb 2, 2016

I think I've experienced the same just after updating from 2.1.0 to 2.2.0 (from the official stable PPA).

It's only a few devel indexes, but the recovery (after stopping Elasticsearch, growing the disk, and starting Elasticsearch again) seems to have filled up the disk very quickly with translog "stuff".

I'm just going to delete it all, but this shouldn't be difficult to replicate.

@bleskes
Contributor

bleskes commented Feb 3, 2016

@starkers can you please capture the files and logs before deleting, and share them somewhere? These things are typically not easy to reproduce :(

@systeminsightsbuild

Should this not be reopened? I just had a disk fill up and am now getting

[2016-02-06 06:01:59,643][WARN ][cluster.action.shard     ] [ops-elk-1] [logstash-2016.02.05][2] received shard failed for ...

This is for 2.1.1

@bleskes
Contributor

bleskes commented Feb 6, 2016

@systeminsightsbuild sadly there can be many reasons for this kind of failure. This specific issue is about translog corruption due to a failure to fully write an operation, which is fixed in 2.0. There might be other issues as well. It's hard to tell from the log line you sent, as it misses the part that tells why the shard failed. If you can post that (and feel free to open a new issue), we can see what's going on.

@likaiguo

it works!!!

balaji006 commented on 1 Sep 2015
Delete .recovery file inside the translog folder

Eg:/es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/

Thanks @balaji006

@amir-rahnama

@s1monw This is still happening on 2.2.3.

@balaji006's workaround fixed the issue, but I think this needs to be addressed.

@bleskes
Contributor

bleskes commented Apr 7, 2016

@ambodi can you open a new issue with the details of what you saw? This can come in many flavors. I'm also curious how you had a .recovering translog file, which is not used in 2.x.

@amir-rahnama

@bleskes here is what I see:

2016-04-14T10:02:45.691973552Z Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
2016-04-14T10:02:45.691977952Z at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
2016-04-14T10:02:45.691982252Z at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
2016-04-14T10:02:45.691992452Z ... 4 more

@bleskes we upgraded from 1.5 to 2.2.3

@bleskes
Contributor

bleskes commented Apr 14, 2016

@ambodi thx. That exception stack trace refers to a class that has been removed in the 2.x series. The code that generated this exception is therefore from your 1.5 version. This makes me think something went wrong with your upgrade and that that node is still on 1.5.

PS. I take it you mean an upgrade to 2.2.3 (as you wrote before) and not 2.8.
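
A quick way to verify which version each node is actually running is the cat nodes API; a small sketch, assuming a node reachable on localhost:9200:

```python
# Sketch: print the name and version of every node in the cluster, to verify
# that no node is still running the old 1.5 binaries after an upgrade.
# Assumes a node reachable on localhost:9200.
import urllib.request

URL = "http://localhost:9200/_cat/nodes?h=name,version"

with urllib.request.urlopen(URL) as resp:
    print(resp.read().decode())
```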

@tamsky

tamsky commented Jun 12, 2016

For reference, an original thread with a complete set of instructions for this error is at:

And to correct mistakes found above:

  • Every comment above that mentions files with the suffix ".recovery" is mistaken. The correct suffix is .recovering.

For us, after stopping ES, moving these .recovering files to another filesystem, and then starting ES, our cluster was able to recover. (ES version 1.6.2)
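
A sketch of that workaround (run only while Elasticsearch is stopped; both paths are assumptions and must match the actual installation), moving the *.recovering files aside rather than deleting them so they can be inspected or restored later:

```python
# Sketch: move *.recovering translog files out of the Elasticsearch data
# directory onto another filesystem, preserving the directory layout so the
# files can be moved back or inspected later. Run only while Elasticsearch
# is stopped. Both paths are assumptions.
import shutil
from pathlib import Path

DATA_DIR = Path("/var/lib/elasticsearch")       # assumption: ES data path
BACKUP_DIR = Path("/mnt/backup/es-recovering")  # assumption: target on another filesystem

for recovering in DATA_DIR.rglob("*.recovering"):
    dest = BACKUP_DIR / recovering.relative_to(DATA_DIR)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(recovering), str(dest))
    print(f"moved {recovering} -> {dest}")
```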

@jamshid

jamshid commented Jan 9, 2017

@tamsky the link doesn't work, maybe the elasticsearch group was deleted/moved?
FWIW I found this issue because I had a problem with ES 2.3.3 running out of disk space and then not recovering properly. But I guess it's not related to this issue since the .recovering file is no longer used? Sorry, I don't have logs from the ES 2.3.3 problem.

@tamsky

tamsky commented Jan 10, 2017

Thanks for pointing out the group is gone.

I'm disappointed the ES team invalidated (and made unsearchable by old URL) all those group links after their bulk import and announcement. I've learned my lesson: at a minimum, quote the thread subject.

A bit of spelunking later, I found a citation containing both the thread URL and subject:
[ ES failed to recover after crash ]

Here's the migrated thread:
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195

I guess the message I had linked to was this
https://discuss.elastic.co/t/es-failed-to-recover-after-crash/8195/5
but my comment giving corrections seems out of place or already corrected.
