
Restore snapshot leaves shards unassigned and with CorruptIndexException #9275

Closed · richtmat opened this issue Jan 13, 2015 · 20 comments
Labels: :Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), feedback_needed

@richtmat

I'm taking snapshots on a regular basis, which works just fine. But when restoring these snapshots, some shards fail:

{
  "snapshot" : {
    "snapshot" : "snapshot-2015-01-13-2",
    "indices" : [ "dev_cloud_asset_data", "dev_cloud_app_data" ],
    "shards" : {
      "total" : 10,
      "failed" : 8,
      "successful" : 2
    }
  }
}

and the log has some CorruptIndexExceptions:

[2015-01-13 15:51:56,678][WARN ][cluster.action.shard     ] [finderbox-dev-1-elasticsearch-dev] [dev_cloud_app_data][4] received shard failed for [dev_cloud_app_data][4], node[LCb28ckARa2IsGnJ7_v-Yg], [P], restoring[backup-dev:snapshot-2015-01-13-2], s[INITIALIZING], indexUUID [kc9un2LMQkeCBopB6MxvQw], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[dev_cloud_app_data][4] failed recovery]; nested: IndexShardRestoreFailedException[[dev_cloud_app_data][4] restore failed]; nested: IndexShardRestoreFailedException[[dev_cloud_app_data][4] failed to restore snapshot [snapshot-2015-01-13-2]]; nested: IndexShardRestoreFailedException[[dev_cloud_app_data][4] Can't restore corrupted shard]; nested: CorruptIndexException[[dev_cloud_app_data][4] Preexisting corrupted index [corrupted_m4IDHs94TFa-8FqzXlfa7A] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=pp1w1u actual=br96ym (resource=name [_1.cfs], length [4165], checksum [pp1w1u], writtenBy [4.10.2])]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=pp1w1u actual=br96ym (resource=name [_1.cfs], length [4165], checksum [pp1w1u], writtenBy [4.10.2])
        at org.elasticsearch.index.store.Store$VerifyingIndexOutput.readAndCompareChecksum(Store.java:882)
        at org.elasticsearch.index.store.Store$VerifyingIndexOutput.writeBytes(Store.java:894)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:834)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
]; ]]
[2015-01-13 15:51:56,679][WARN ][index.snapshots.blobstore] [finderbox-dev-1-elasticsearch-dev] [dev_cloud_asset_data][1] Can't read metadata from store
org.apache.lucene.index.CorruptIndexException: [dev_cloud_asset_data][1] Preexisting corrupted index [corrupted_YJZCTeR0SD6vMEQUGCphpQ] caused by: CorruptIndexException[verification failed (hardware problem?) : expected=lni6fe actual=null writtenLength=106 expectedLength=11473 (resource=name [_2i.cfs], length [11473], checksum [lni6fe], writtenBy [4.10.2])]
org.apache.lucene.index.CorruptIndexException: verification failed (hardware problem?) : expected=lni6fe actual=null writtenLength=106 expectedLength=11473 (resource=name [_2i.cfs], length [11473], checksum [lni6fe], writtenBy [4.10.2])
        at org.elasticsearch.index.store.Store$VerifyingIndexOutput.verify(Store.java:866)
        at org.elasticsearch.index.store.Store.verify(Store.java:345)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:842)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:411)
        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:392)
        at org.elasticsearch.index.store.Store.getMetadata(Store.java:181)
        at org.elasticsearch.index.store.Store.getMetadataOrEmpty(Store.java:147)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:718)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-01-13 15:51:56,679][WARN ][indices.cluster          ] [finderbox-dev-1-elasticsearch-dev] [dev_cloud_asset_data][1] failed to start shard

I have done successful restores before, though. I'm on Elasticsearch 1.4.0.
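
For reference, a hypothetical reconstruction of the calls involved; the repository and snapshot names are taken from the log above, and the exact commands were not part of the original report:

# Hypothetical reconstruction (names taken from the logs, not the
# reporter's actual commands): snapshot into the "backup-dev" repository,
# then restore from it, waiting for completion in both cases.
curl -XPUT 'http://localhost:9200/_snapshot/backup-dev/snapshot-2015-01-13-2?wait_for_completion=true'
curl -XPOST 'http://localhost:9200/_snapshot/backup-dev/snapshot-2015-01-13-2/_restore?wait_for_completion=true'

The JSON summary at the top of this comment is the kind of response the restore call returns when wait_for_completion is set.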

@richtmat (Author)

I double-checked with version 1.4.2 and a cluster of two instances. Those restores fail as well.

@clintongormley

Hi @richtmat

Were these indices created on 1.4.0 originally? Or an earlier version? It looks like the files have been truncated, but I'm unsure whether it is the snapshot that contains the truncated files, or whether they're being truncated during the restore.

@clintongormley added the :Distributed/Snapshot/Restore and feedback_needed labels Jan 16, 2015
@richtmat (Author)

@clintongormley, yes, I started taking snapshots with 1.4.0 and could not restore them with 1.4.0. The same happens when snapshotting and restoring with 1.4.2.

@imotov (Contributor) commented Jan 16, 2015

@richtmat what type of repository are you using and how is it configured?

@richtmat (Author)

@imotov it's Azure and I'm using the config defaults, no special settings set.

@dadoonet (Member)

Hi @richtmat,

We think this is most likely an issue with the Azure plugin, but to make sure, I'd like to reproduce your issue.
Could you share here how you define the repository on the source cluster and on the restore cluster (they could be the same if you back up and restore from/to the same cluster)?

Thanks!

@richtmat (Author)

Hi @dadoonet,

that is easy; it is the simple example from the plugin docs:

$ curl -XPUT 'http://localhost:9200/_snapshot/backup-dev' -d '{
    "type": "azure"
}'

I reproduced this on the same cluster. I had tried a single dev machine before, and that did not work either. If it is of any help, this machine is set up with the Elasticsearch Puppet module, so on my dev machine there is a single instance running.

Thank you.
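
A quick way to sanity-check that the repository was registered as intended is to read back its settings and run a manual verification (a sketch; the _verify API is available from Elasticsearch 1.4 onwards):

# Show the stored repository settings, then ask the nodes to verify that
# they can actually write to the repository.
curl -XGET 'http://localhost:9200/_snapshot/backup-dev?pretty'
curl -XPOST 'http://localhost:9200/_snapshot/backup-dev/_verify?pretty'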

@dadoonet (Member)

Could you tell me more about the amount of data that was backed up? Is it a few MB? A few GB?

@richtmat (Author)

Probably about 5MB of test data.

@dadoonet (Member)

OK, small then. One more question: is this failing for all indices or only a single one?

@richtmat (Author)

Both indices fail.
I cannot check at the moment, but the more I think about it, the stranger this setup seems:
As I said, there is an instance running with the config from /etc/elasticsearch/elasticsearch-dev/elasticsearch.yml.
There is of course another config in /etc/elasticsearch/elasticsearch.yml, but that instance is not running; I stopped it.
This is not our production setup and I am going to clean it up, but I want to make sure it is not part of the problem.
You have probably seen that I created an issue on the plugin project; do you want me to close that one?

@richtmat (Author)

@dadoonet can I assist you any further with this issue?

@imotov assigned @dadoonet and unassigned himself Jan 27, 2015
@dadoonet (Member) commented Feb 3, 2015

@richtmat I wonder if something went wrong when you snapshotted your index.

Any chance you could try to snapshot the same data to Azure again and then restore? If you can reproduce it, would it be possible for you to snapshot your index using the default shared FS, restore it, and see how it goes? (A sketch of the shared-FS setup follows this comment.)
If it works on the shared FS, could you upload your shared-FS snapshot somewhere so I can play around with your data?

And send me an email with the link at david.pilato (at) elasticsearch (dot) com?
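
For reference, registering the shared-FS repository suggested above would look roughly like this; the location path is illustrative and must be reachable by every node:

# Illustrative sketch: "/mnt/es-backup" is a placeholder path, not taken
# from the thread.
curl -XPUT 'http://localhost:9200/_snapshot/fsbackup' -d '{
    "type": "fs",
    "settings": { "location": "/mnt/es-backup" }
}'
curl -XPUT 'http://localhost:9200/_snapshot/fsbackup/snapshot_1?wait_for_completion=true'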

@richtmat (Author) commented Feb 3, 2015

@dadoonet I sent you the link via email.

@dadoonet (Member) commented Feb 3, 2015

Thanks @richtmat

It sounds like you are using https://github.com/yakaz/elasticsearch-analysis-combo, right?

@richtmat (Author) commented Feb 3, 2015

@dadoonet yes I do.

@dadoonet (Member) commented Feb 3, 2015

And https://github.com/elasticsearch/elasticsearch-mapper-attachments as well? Could you give me the full list of plugins you have?

GET /_cat/plugins?v

@richtmat (Author) commented Feb 3, 2015

curl localhost:9200/_cat/plugins?v
name   component          version type url
es-dev mapper-attachments 2.4.1   j
es-dev cloud-azure        2.5.1   j
es-dev analysis-combo     1.5.1   j
es-dev head               NA      s    /_plugin/head/

@dadoonet (Member) commented Feb 3, 2015

That's really interesting. I can reproduce an issue which is not exactly the one you are describing, but I suspect it has the same cause.

Here is what I did:

  • Copy your private backup locally to /tmp/backup
  • Restore it:
PUT /_snapshot/fsbackup
{"type":"fs","settings":{"location":"/tmp/backup"}}
POST /_snapshot/fsbackup/snapshot_1/_restore?wait_for_completion=true
GET dev_cloud_*/_count
  • Create the Azure repo and snapshot:
PUT /_snapshot/backup-dev
{
    "type": "azure"
}
PUT /_snapshot/backup-dev/snapshot_1?wait_for_completion=true
  • Remove local indices and restore:
DELETE _all
POST /_snapshot/backup-dev/snapshot_1/_restore?wait_for_completion=true

Each time I try the same operation, I get this error:

[2015-02-03 15:35:13,511][WARN ][indices.cluster          ] [Cecilia Reyes] [dev_cloud_asset_data][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [dev_cloud_asset_data][2] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [dev_cloud_asset_data][2] restore failed
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [dev_cloud_asset_data][2] failed to restore snapshot [snapshot_1]
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:165)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [dev_cloud_asset_data][2] Failed to recover index
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:787)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    ... 5 more
Caused by: java.io.FileNotFoundException: com.sun.jersey.api.client.UniformInterfaceException: GET http://elasticittests.blob.core.windows.net/elasticsearch-snapshots/indices/dev_cloud_asset_data/2/__d returned a response status of 404 Not Found <?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:90e21b4b-0001-003f-3a03-e5b651000000
Time:2015-02-03T14:35:13.1309819Z</Message></Error>
Response Body: 
    at org.elasticsearch.cloud.azure.blobstore.AzureBlobContainer.openInput(AzureBlobContainer.java:77)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$PartSliceStream.openSlice(BlobStoreIndexShardRepository.java:674)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.nextStream(SlicedInputStream.java:53)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.currentStream(SlicedInputStream.java:67)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:88)
    at java.io.InputStream.read(InputStream.java:101)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:834)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    ... 6 more

When checking in the Azure console, I can see that the blob indices/dev_cloud_asset_data/2/__d was never created, even though the snapshot operation reported that everything was fine.

I'm going to try to reproduce this with the most recent changes in the cloud-azure plugin and will update this thread.
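
One way to cross-check such a discrepancy is to compare the per-shard file statistics recorded by the snapshot against the blobs actually present in the container, e.g. via the snapshot status API (a sketch using the repository and snapshot names above):

# The status API reports what the snapshot metadata claims was written;
# blobs that silently never reached the container will not appear as
# failures here, which is consistent with the behaviour described above.
curl 'http://localhost:9200/_snapshot/backup-dev/snapshot_1/_status?pretty'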

@dadoonet (Member) commented Feb 4, 2015

I tried the same thing this morning with Elasticsearch 1.4.2 and Azure plugin 2.5.2-SNAPSHOT, and everything went well on the same dataset: the snapshot was complete (no missing files) and the restore succeeded.

My suspicion about the initial issue is that something went wrong at snapshot time, but we somehow swallowed the exception thrown by Azure instead of failing the snapshot or retrying.

While debugging this, I found some bugs in the way we currently do snapshot and restore in the Azure plugin, so I'll fix the issues I have found so far. For example, when you remove a container from the Azure console, even if Azure tells you it has been done, the deletion is actually asynchronous, so you can hit an error like can not initialize container [elasticsearch-snapshots]: [The specified container is being deleted. Try operation later.], but this error does not fail the snapshot...

We can close this issue in Elasticsearch core now and follow up the discussion in the Azure plugin repository: elastic/elasticsearch-cloud-azure#51

@dadoonet closed this as completed Feb 4, 2015