
Restore snapshot leaves shards unassigned and with CorruptIndexException #9275

Closed · richtmat opened this issue Jan 13, 2015 · 20 comments
Labels: :Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), feedback_needed

@richtmat

I'm taking snapshots on a regular basis, which works just fine. But when restoring these snapshots, some shards fail:

{
  "snapshot" : {
    "snapshot" : "snapshot-2015-01-13-2",
    "indices" : [ "dev_cloud_asset_data", "dev_cloud_app_data" ],
    "shards" : {
      "total" : 10,
      "failed" : 8,
      "successful" : 2
    }
  }
}

and the log has some CorruptIndexExceptions:

[2015-01-13 15:51:56,678][WARN ][cluster.action.shard     ] [finderbox-dev-1-elasticsearch-dev] [dev_cloud_app_data][4] received shard failed for [dev_cloud_app_data][4], node[LCb28ckARa2IsGnJ7_v-Yg], [P], restoring[backup-dev:snapshot-2015-01-13-2], s[INITIALIZING], indexUUID [kc9un2LMQkeCBopB6MxvQw], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[dev_cloud_app_data][4] failed recovery]; nested: IndexShardRestoreFailedException[[dev_cloud_app_data][4] restore failed]; nested: IndexShardRestoreFailedException[[dev_cloud_app_data][4] failed to restore snapshot [snapshot-2015-01-13-2]]; nested: IndexShardRestoreFailedException[[dev_cloud_app_data][4] Can't restore corrupted shard]; nested: CorruptIndexException[[dev_cloud_app_data][4] Preexisting corrupted index [corrupted_m4IDHs94TFa-8FqzXlfa7A] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=pp1w1u actual=br96ym (resource=name [_1.cfs], length [4165], checksum [pp1w1u], writtenBy [4.10.2])]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=pp1w1u actual=br96ym (resource=name [_1.cfs], length [4165], checksum [pp1w1u], writtenBy [4.10.2])
        at org.elasticsearch.index.store.Store$VerifyingIndexOutput.readAndCompareChecksum(Store.java:882)
        at org.elasticsearch.index.store.Store$VerifyingIndexOutput.writeBytes(Store.java:894)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:834)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
]; ]]
[2015-01-13 15:51:56,679][WARN ][index.snapshots.blobstore] [finderbox-dev-1-elasticsearch-dev] [dev_cloud_asset_data][1] Can't read metadata from store
org.apache.lucene.index.CorruptIndexException: [dev_cloud_asset_data][1] Preexisting corrupted index [corrupted_YJZCTeR0SD6vMEQUGCphpQ] caused by: CorruptIndexException[verification failed (hardware problem?) : expected=lni6fe actual=null writtenLength=106 expectedLength=11473 (resource=name [_2i.cfs], length [11473], checksum [lni6fe], writtenBy [4.10.2])]
org.apache.lucene.index.CorruptIndexException: verification failed (hardware problem?) : expected=lni6fe actual=null writtenLength=106 expectedLength=11473 (resource=name [_2i.cfs], length [11473], checksum [lni6fe], writtenBy [4.10.2])
        at org.elasticsearch.index.store.Store$VerifyingIndexOutput.verify(Store.java:866)
        at org.elasticsearch.index.store.Store.verify(Store.java:345)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:842)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:411)
        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:392)
        at org.elasticsearch.index.store.Store.getMetadata(Store.java:181)
        at org.elasticsearch.index.store.Store.getMetadataOrEmpty(Store.java:147)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:718)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-01-13 15:51:56,679][WARN ][indices.cluster          ] [finderbox-dev-1-elasticsearch-dev] [dev_cloud_asset_data][1] failed to start shard

I have done successful restores before, though. I'm on Elasticsearch 1.4.0.
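
For reference, a hypothetical reconstruction of the calls involved; the repository and snapshot names are taken from the log above, and the exact commands were not part of the original report:

# Hypothetical reconstruction (names taken from the logs, not the
# reporter's actual commands): snapshot into the "backup-dev" repository,
# then restore from it, waiting for completion in both cases.
curl -XPUT 'http://localhost:9200/_snapshot/backup-dev/snapshot-2015-01-13-2?wait_for_completion=true'
curl -XPOST 'http://localhost:9200/_snapshot/backup-dev/snapshot-2015-01-13-2/_restore?wait_for_completion=true'

The JSON summary at the top of this comment is the kind of response the restore call returns when wait_for_completion is set.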

@richtmat (Author)

I double-checked with version 1.4.2 and a cluster of two instances. Those restores fail as well.

@clintongormley

Hi @richtmat

Were these indices created on 1.4.0 originally? Or an earlier version? It looks like the files have been truncated, but I'm unsure whether it is the snapshot that contains the truncated files, or whether they're being truncated during the restore.

@clintongormley added the :Distributed/Snapshot/Restore and feedback_needed labels Jan 16, 2015
@richtmat (Author)

@clintongormley, yes, I started taking snapshots with 1.4.0 and could not restore them with 1.4.0. The same happens when snapshotting and restoring with 1.4.2.

@imotov (Contributor) commented Jan 16, 2015

@richtmat what type of repository are you using and how is it configured?

@richtmat (Author)

@imotov it's Azure and I'm using the config defaults, no special settings set.

@dadoonet (Member)

Hi @richtmat,

We think this is most likely an issue with the Azure plugin, but to make sure, I'd like to reproduce your issue.
Could you share here how you define the repository on the source cluster and on the restore cluster (they could be the same if you back up and restore from/to the same cluster)?

Thanks!

@richtmat (Author)

Hi @dadoonet,

that is easy; it is the simple example from the plugin docs:

$ curl -XPUT 'http://localhost:9200/_snapshot/backup-dev' -d '{
    "type": "azure"
}'

I reproduced this on the same cluster. I had tried a single dev machine before, and that did not work either. If it is of any help, this machine is set up with the Elasticsearch Puppet module, so on my dev machine there is a single instance running.

Thank you.
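
A quick way to sanity-check that the repository was registered as intended is to read back its settings and run a manual verification (a sketch; the _verify API is available from Elasticsearch 1.4 onwards):

# Show the stored repository settings, then ask the nodes to verify that
# they can actually write to the repository.
curl -XGET 'http://localhost:9200/_snapshot/backup-dev?pretty'
curl -XPOST 'http://localhost:9200/_snapshot/backup-dev/_verify?pretty'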

@dadoonet (Member)

Could you tell me more about the amount of data that was backed up? Is it a few MB? A few GB?

@richtmat (Author)

Probably about 5MB of test data.

@dadoonet (Member)

OK, small then. One more question: is this failing for all indices or only a single one?

@richtmat (Author)

Both indices fail.
I cannot check at the moment, but the more I think about it, the stranger this setup seems:
As I said, there is an instance running with the config from /etc/elasticsearch/elasticsearch-dev/elasticsearch.yml.
There is of course another config in /etc/elasticsearch/elasticsearch.yml, but that instance is not running; I stopped it.
This is not our production setup and I am going to clean it up, but I want to make sure it is not part of the problem.
You have probably seen that I created an issue on the plugin project; do you want me to close that one?

@richtmat (Author)

@dadoonet can I assist you any further with this issue?

@imotov assigned @dadoonet and unassigned himself Jan 27, 2015
@dadoonet (Member) commented Feb 3, 2015

@richtmat I wonder if something went wrong when you snapshotted your index.

Any chance you could try to snapshot the same data to Azure again and then restore? If you can reproduce it, would it be possible for you to snapshot your index using the default shared FS, restore it, and see how it goes? (A sketch of the shared-FS setup follows this comment.)
If it works on the shared FS, could you upload your shared-FS snapshot somewhere so I can play around with your data?

And send me an email with the link at david.pilato (at) elasticsearch (dot) com?
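
For reference, registering the shared-FS repository suggested above would look roughly like this; the location path is illustrative and must be reachable by every node:

# Illustrative sketch: "/mnt/es-backup" is a placeholder path, not taken
# from the thread.
curl -XPUT 'http://localhost:9200/_snapshot/fsbackup' -d '{
    "type": "fs",
    "settings": { "location": "/mnt/es-backup" }
}'
curl -XPUT 'http://localhost:9200/_snapshot/fsbackup/snapshot_1?wait_for_completion=true'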

@richtmat (Author) commented Feb 3, 2015

@dadoonet I sent you the link via email.

@dadoonet (Member) commented Feb 3, 2015

Thanks @richtmat

It sounds like you are using https://github.com/yakaz/elasticsearch-analysis-combo, right?

@richtmat (Author) commented Feb 3, 2015

@dadoonet yes I do.

@dadoonet (Member) commented Feb 3, 2015

And https://github.com/elasticsearch/elasticsearch-mapper-attachments as well? Could you give me the full list of plugins you have?

GET /_cat/plugins?v

@richtmat (Author) commented Feb 3, 2015

curl localhost:9200/_cat/plugins?v
name   component          version type url
es-dev mapper-attachments 2.4.1   j
es-dev cloud-azure        2.5.1   j
es-dev analysis-combo     1.5.1   j
es-dev head               NA      s    /_plugin/head/

@dadoonet (Member) commented Feb 3, 2015

That's really interesting. I can reproduce an issue which is not exactly the one you are describing, but I suspect it has the same cause.

Here is what I did:

  • Copy your private backup locally to /tmp/backup
  • Restore it:
PUT /_snapshot/fsbackup
{"type":"fs","settings":{"location":"/tmp/backup"}}
POST /_snapshot/fsbackup/snapshot_1/_restore?wait_for_completion=true
GET dev_cloud_*/_count
  • Create the Azure repo and snapshot:
PUT /_snapshot/backup-dev
{
    "type": "azure"
}
PUT /_snapshot/backup-dev/snapshot_1?wait_for_completion=true
  • Remove local indices and restore:
DELETE _all
POST /_snapshot/backup-dev/snapshot_1/_restore?wait_for_completion=true

Each time I try the same operation, I get this error:

[2015-02-03 15:35:13,511][WARN ][indices.cluster          ] [Cecilia Reyes] [dev_cloud_asset_data][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [dev_cloud_asset_data][2] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [dev_cloud_asset_data][2] restore failed
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [dev_cloud_asset_data][2] failed to restore snapshot [snapshot_1]
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:165)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [dev_cloud_asset_data][2] Failed to recover index
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:787)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    ... 5 more
Caused by: java.io.FileNotFoundException: com.sun.jersey.api.client.UniformInterfaceException: GET http://elasticittests.blob.core.windows.net/elasticsearch-snapshots/indices/dev_cloud_asset_data/2/__d returned a response status of 404 Not Found <?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:90e21b4b-0001-003f-3a03-e5b651000000
Time:2015-02-03T14:35:13.1309819Z</Message></Error>
Response Body: 
    at org.elasticsearch.cloud.azure.blobstore.AzureBlobContainer.openInput(AzureBlobContainer.java:77)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$PartSliceStream.openSlice(BlobStoreIndexShardRepository.java:674)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.nextStream(SlicedInputStream.java:53)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.currentStream(SlicedInputStream.java:67)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:88)
    at java.io.InputStream.read(InputStream.java:101)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:834)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    ... 6 more

When checking in the Azure console, I can see that the blob indices/dev_cloud_asset_data/2/__d was never created, even though the snapshot operation reported that everything was fine.

I'm going to try to reproduce this with the most recent changes in the cloud-azure plugin and will update this thread.
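
One way to cross-check such a discrepancy is to compare the per-shard file statistics recorded by the snapshot against the blobs actually present in the container, e.g. via the snapshot status API (a sketch using the repository and snapshot names above):

# The status API reports what the snapshot metadata claims was written;
# blobs that silently never reached the container will not appear as
# failures here, which is consistent with the behaviour described above.
curl 'http://localhost:9200/_snapshot/backup-dev/snapshot_1/_status?pretty'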

@dadoonet (Member) commented Feb 4, 2015

I tried the same thing this morning with Elasticsearch 1.4.2 and Azure plugin 2.5.2-SNAPSHOT, and everything went well on the same dataset: the snapshot was complete (no missing files) and the restore succeeded.

My suspicion about the initial issue is that something went wrong at snapshot time, but we somehow swallowed the exception thrown by Azure instead of failing the snapshot or retrying.

While debugging this, I found some bugs in the way we currently do snapshot and restore in the Azure plugin, so I'll fix the issues I have found so far. For example, when you remove a container from the Azure console, even if Azure tells you it has been done, the deletion is actually asynchronous, so you can hit an error like can not initialize container [elasticsearch-snapshots]: [The specified container is being deleted. Try operation later.], but this error does not fail the snapshot...

We can close this issue in Elasticsearch core now and follow up the discussion in the Azure plugin repository: elastic/elasticsearch-cloud-azure#51

@dadoonet closed this as completed Feb 4, 2015