Snapshot succeeds even when some nodes cannot access shared repository #5657
When the snapshot is finished, could you execute the following command to see if there are any shard failures:
Thanks! I did verify the case where I had a node in a cluster with only one shard with 0 replicas. This node could not connect to the repo and spewed out errors in the Elasticsearch logs. However, the curl command to take a snapshot did not return any errors and reported that all shards were snapshotted successfully.
Could you gist the errors from the log that you have seen?
Sure. Here are the errors from the node which had the data, but could not write to the repository. [2014-03-31 17:33:37,684][WARN ][repositories ] [Hermod] failed to create repository [fs][my_backup]
1 error
I tried to reproduce this issue, but wasn't able to. Could you paste here the output of the GET snapshot command for one of the snapshots that wasn't complete? The GET snapshot command looks like this:
The GET command shows that there are no failures: $ curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_13"
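For reference, the per-shard accounting that this question is about lives under the `shards` object of the GET-snapshot response. The snippet below uses an illustrative response (field names follow the ES 1.x snapshot API; the values are made up for the example) to show where to look:

```shell
# Illustrative GET-snapshot response; in a real check this would come from
# curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_13"
RESPONSE='{"snapshots":[{"snapshot":"snapshot_13","state":"SUCCESS","shards":{"total":3,"failed":0,"successful":3}}]}'
# The "failed" counter is the per-shard failure count; the bug discussed in
# this thread is that it can read 0 even when a node silently skipped its data:
echo "$RESPONSE" | grep -o '"failed":[0-9]*'
```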
Could you also confirm whether it is absolutely necessary for the same shared repository to be 'registered' from each of the ES nodes in a cluster? Currently, I have registered the shared repository only from the node [Node 1] where I ran the backup command. When I run the 'register' command on Node 1, I see that the ES logs on Node 3 (which cannot access the shared repo) show an error. However, the register command on Node 1 reports success. $ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
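The registration body above is truncated; a sketch of a complete registration, assuming the standard "fs" repository type from the ES 1.x snapshot docs, looks like the following. The location path is illustrative. Registration is a cluster-wide setting, so it only needs to be issued once, but the path must be reachable from every data node:

```shell
# Hypothetical fs-repository settings (path is illustrative):
BODY='{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_backup",
    "compress": true
  }
}'
# Validate the JSON payload locally before sending it:
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"
# Then register the repository from any one node:
# curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d "$BODY"
```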
As I mentioned before, as long as this directory is available on all nodes where primary shards are located, the snapshot will succeed. It looks like this is what happened to the snapshot that you provided. However, since you cannot control primary shard allocation, it's good practice to have this directory available on all nodes.
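Since only the nodes holding primary shards write to the repository, the `_cat/shards` API (available in ES 1.x) shows which nodes those are. The snippet below works over a sample of that output (column layout per the `_cat/shards` docs; index and node names are illustrative):

```shell
# Sample _cat/shards output: index shard prirep state docs store ip node
SHARDS='index1 0 p STARTED 1000 1mb 10.0.0.3 Node3
index1 0 r STARTED 1000 1mb 10.0.0.1 Node1'
# Column 3 is "p" for a primary copy; the last column is the node holding it.
# These are the nodes that must be able to reach the repository:
echo "$SHARDS" | awk '$3 == "p" {print $NF}'
```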
Hi Team, I've also seen the same symptom of a snapshot being marked as successful when some nodes cannot write to the repository. I'm using 1.0.1 with the cloud-aws plugin to store snapshots in S3. When performing a snapshot, the snapshot is stored in S3 and marked as successful. It also shows as an available "Successful" snapshot on the other cluster that I'm testing restore from. When attempting a restore, Elasticsearch performs some sort of consistency check on the snapshot, determines it is incomplete and unable to be restored, and returns:
This is for the same reason explained above: some nodes cannot access the storage location, so those nodes never write their Lucene segments to storage, while the other nodes complete their pieces of the snapshot successfully. In my case, this was because one or more nodes in my cluster were too far out of time sync, so S3 rejected their connections. The cloud-aws plugin did throw a nice error in the logs for this from the nodes with the issue:
Which is great, except that even though this error is detected by the plugin and reported in the logs, Elasticsearch still marks the snapshot as successful. I fixed the time sync (NTP) issues on the nodes that were too far out of sync with S3 and tried again. This time, another node had a problem with the snapshot repository: the cluster state somehow hadn't updated this one node with the repository config:
After restarting this node, it was able to write to the repository fine and I got a successful backup that could be restored on my other cluster. So a really nice feature/fix would be to take whatever logic Elasticsearch runs at restore time to verify the consistency of a snapshot, run that same logic after a snapshot is taken, and mark the snapshot as failed if the check does not pass. On top of that, return a message stating which nodes had issues performing their snapshot task, if that can be fed back to the user, e.g.:
The user will then have the clues to go and investigate specific nodes to troubleshoot. Cheers.
Adding a consistency check of snapshots to the API would be a great feature as well, so users can manually check whether a repository contains broken snapshots. :)
Also, if you wanted to reproduce my particular scenario, you could take a cluster of nodes, manually set one node's clock more than, say, 30 minutes out of sync with the rest, and then try to snapshot to S3; the skewed node's requests should be rejected.
The new "PARTIAL" state that was added by #5792 should help to distinguish between snapshots that were completely successful and snapshots that contained some shards that failed to snapshot. Closing. |
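With the PARTIAL state, an incomplete snapshot becomes distinguishable from a successful one directly in the GET-snapshot response. A sketch, using an illustrative response (field names per the ES snapshot API; values made up):

```shell
# Illustrative response for a snapshot where one shard failed; in practice
# this would come from curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_13"
RESPONSE='{"snapshots":[{"snapshot":"snapshot_13","state":"PARTIAL","shards":{"total":3,"failed":1,"successful":2}}]}'
# A monitoring script can now key off the state field instead of trusting
# the HTTP response of the snapshot request:
echo "$RESPONSE" | grep -o '"state":"[A-Z]*"'
```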
Why are primary shards required for an Elasticsearch backup snapshot?
Hi,
I am testing out using the snapshot/restore feature in ES 1.0.0. I have created the following setup.
An ES cluster with three nodes. A shared file system which can only be accessed by Node 1[Master] and Node 2. This shared fs is registered as the repository for taking backups on Node 1.
I noticed that even if I
a) force some of my data to be present only on Node 3, and
b) ensure that Node 3 cannot access the shared repository
taking a snapshot of the entire cluster still reports success. However, when I browse the contents of the snapshot folder, I do not see any of the data from Node 3. I was expecting a "RepositoryMissing" exception to be thrown by Node 3. Have I misunderstood how ES snapshotting works?
Thanks!