Several recoveries cause IndexShardGatewayRecoveryException #8340

Closed
asafc64 opened this issue Nov 4, 2014 · 8 comments · Fixed by #8545
Assignees: imotov
Labels: discuss, :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)

Comments


asafc64 commented Nov 4, 2014

I have a test environment that restores the index from a snapshot before every test.
After a few successful restores, it fails with:

[2014-11-03 16:03:54,957][WARN ][indices.cluster ] [Baron Von Blitzschlag] [qs_rm_3][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [qs_rm_3][0] failed recovery
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [qs_rm_3][0] restore failed
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [qs_rm_3][0] failed to restore snapshot [qs_rm_alias]
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:159)
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [qs_rm_3][0] Failed to recover index
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:840)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:156)
... 5 more
Caused by: java.io.FileNotFoundException: C:\TestResults\QuickSearch\data\elasticsearch\nodes\0\indices\qs_rm_3\0\index_8.si (Access is denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(Unknown Source)
at java.io.FileOutputStream.<init>(Unknown Source)
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:389)
at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:282)
at org.apache.lucene.store.RateLimitedFSDirectory.createOutput(RateLimitedFSDirectory.java:40)
at org.elasticsearch.index.store.DistributorDirectory.createOutput(DistributorDirectory.java:118)
at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:69)
at org.elasticsearch.index.store.Store.createVerifyingOutput(Store.java:298)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:887)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:830)
... 6 more

@clintongormley

It appears you have a permission problem?

C:\TestResults\QuickSearch\data\elasticsearch\nodes\0\indices\qs_rm_3\0\index_8.si (Access is denied)

@imotov or could this be due to Windows locking the file while it is being accessed by another process?

@asafc64 what version are you on?

clintongormley added the discuss and :Distributed/Snapshot/Restore labels Nov 4, 2014

asafc64 commented Nov 4, 2014

Permission problem - I don't think so, since it succeeds most of the time.
Access by another process - I verified that only one instance of Elasticsearch is running.
Version - 1.3.4


imotov commented Nov 4, 2014

@asafc64 could you tell us how your test is structured? What sort of cleanup are you doing before/after the test? In particular, do you close the index or delete it? How do you perform these operations? Do you wait for their completion? What operations are performed in the test preceding the test that fails?


asafc64 commented Nov 5, 2014

When the environment is loaded, it creates a snapshot, and before every test the snapshot is restored.
Every 5-9 successful restores, one fails.
Restore procedure (sketched in code after the list):

  1. Close index.
  2. Restore index.
  3. Get RecoveryStatus until all shards are done.
  4. Open index.
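
For reference, here is a minimal sketch of these four steps against the Elasticsearch 1.x REST API. It assumes the Python `requests` library and a node at `http://localhost:9200`; the index, repository, and snapshot names are taken from the log output in this issue, and the polling loop is only an illustration of step 3, not the reporter's actual test code.

```python
# Illustrative sketch of the restore procedure described above (Elasticsearch 1.x REST API).
# Host URL is a placeholder; index/repository/snapshot names come from the logs in this issue.
import time
import requests

ES = "http://localhost:9200"                  # placeholder host
INDEX = "qs_rm_3"                             # index name seen in the stack trace
REPO, SNAPSHOT = "EsBackups", "qs_rm_alias"   # repository/snapshot names from the logs

# 1. Close the index.
requests.post("{0}/{1}/_close".format(ES, INDEX)).raise_for_status()

# 2. Start the restore; without wait_for_completion the call returns as soon as
#    the restore has been accepted, not when it has finished.
requests.post("{0}/_snapshot/{1}/{2}/_restore".format(ES, REPO, SNAPSHOT)).raise_for_status()

# 3. Poll the recovery API until every shard of the index reports stage DONE.
while True:
    recovery = requests.get("{0}/{1}/_recovery".format(ES, INDEX)).json()
    shards = recovery.get(INDEX, {}).get("shards", [])
    if shards and all(shard["stage"] == "DONE" for shard in shards):
        break
    time.sleep(1)

# 4. Open the index again.
requests.post("{0}/{1}/_open".format(ES, INDEX)).raise_for_status()
```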

By the way,
sometimes I get this error even though I have checked that all shards are done:
[2014-11-05 07:57:21,288][WARN ][snapshots ] [Elektro] [EsBackups:qs_rm_alias] failed to restore snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [EsBackups:qs_rm_alias] Restore process is already running in this cluster
at org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:139)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:328)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)


imotov commented Nov 8, 2014

@asafc64 in step 3, when the shards are reported done, there are still additional operations that have to take place before the restore process is really over, so this is not a very reliable way of checking restore status in tests. Opening the index after the restore is not needed either, because the index is opened automatically once the restore is complete. I would recommend the following approach (sketched below):

  1. Close the index.
  2. Run the restore with wait_for_completion=true.
  3. Run cluster health for the restored index with wait_for_status=green.
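
A minimal sketch of this recommended flow, with the same placeholder host and the index/repository/snapshot names from the logs (an illustration under those assumptions, not code from the issue):

```python
# Illustrative sketch of the recommended flow (Elasticsearch 1.x REST API);
# host URL is a placeholder, index/repository/snapshot names as in the logs above.
import requests

ES = "http://localhost:9200"
INDEX = "qs_rm_3"
REPO, SNAPSHOT = "EsBackups", "qs_rm_alias"

# 1. Close the index.
requests.post("{0}/{1}/_close".format(ES, INDEX)).raise_for_status()

# 2. Restore and block until the restore operation itself reports completion.
requests.post("{0}/_snapshot/{1}/{2}/_restore?wait_for_completion=true"
              .format(ES, REPO, SNAPSHOT)).raise_for_status()

# 3. Wait for the restored index to reach green, i.e. all of its shards are started.
health = requests.get("{0}/_cluster/health/{1}?wait_for_status=green&timeout=60s"
                      .format(ES, INDEX)).json()
assert not health["timed_out"], "index did not reach green before the timeout"
```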

@clintongormley

@imotov is there some additional status that we should report in the recovery status to indicate that the process isn't quite complete?


imotov commented Nov 17, 2014

@clintongormley, I am not sure what we can or should do here, since recoveries are reported at the shard level. I think there are a couple of concurrency-related things going on. First, there is a slight delay in shard status propagation through the cluster between the moment a shard is done recovering and the moment the cluster knows that it has started. There is also a global restore cleanup stage that is performed once all shards are restored, so when all shards are restored the restore process is still technically running. The recovery API is good for monitoring the recovery/restore progress of individual shards, but it is not a good way to check whether the restore as a whole is done.


imotov self-assigned this Nov 19, 2014
imotov added a commit to imotov/elasticsearch that referenced this issue Nov 25, 2014
…or succesfully restored shards to get started

This commit ensures that a restore operation with wait_for_completion=true doesn't return until all successfully restored shards are started. Before, it was returning as soon as the restore operation was over, which caused some shards to be unavailable immediately after restore completion.

Fixes elastic#8340
imotov added a commit that referenced this issue Nov 25, 2014
imotov added a commit that referenced this issue Nov 25, 2014
imotov added a commit that referenced this issue Nov 25, 2014
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015