Snapshot succeeds even when some nodes cannot access shared repository #5657

Closed

RobbieHer opened this issue Apr 1, 2014 · 13 comments

@RobbieHer

Hi,
I am testing the snapshot/restore feature in ES 1.0.0 and have created the following setup: an ES cluster with three nodes, and a shared file system that can only be accessed by Node 1 (the master) and Node 2. This shared fs is registered on Node 1 as the repository for taking backups.

I noticed that even if I
a) force some of my data to be present only on Node 3, and
b) ensure that Node 3 cannot access the shared repository,

taking a snapshot of the entire cluster still reports success. However, when I browse the contents of the snapshot folder, I do not see any of the data from Node 3. I was expecting a "RepositoryMissing" exception to be thrown by Node 3. Have I misunderstood how ES snapshotting works?

Thanks!

@imotov imotov self-assigned this Apr 2, 2014
@imotov
Contributor

imotov commented Apr 2, 2014

When the snapshot is finished, could you execute the following command to see if there are any shard failures: curl -XGET "localhost:9200/_snapshot/repository_name/snapshot_name"? If you have replicas enabled for these indices, it's also possible that all primary shards are located on Node 1 and Node 2; in that case the snapshot would succeed, since only primary shards are snapshotted.
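
For context, shard-level problems would be expected to show up in the response's failures array and shard counts; a fragment might look roughly like this (index, reason, and node values are hypothetical, and field names approximate):

"failures": [ { "index": "store", "shard_id": 0, "reason": "failed to write to repository", "node_id": "node3" } ],
"shards": { "total": 16, "failed": 1, "successful": 15 }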

@RobbieHer
Author

Thanks! I tested the case where one node in the cluster held the only shard of an index with 0 replicas. This node could not connect to the repo and spewed errors into the Elasticsearch logs. However, the curl command to take a snapshot did not return any errors and reported all shards as snapshotted successfully.

@imotov imotov added the bug label Apr 2, 2014
@imotov
Contributor

imotov commented Apr 2, 2014

Could you gist the errors from the log that you have seen?

@RobbieHer
Author

Sure. Here are the errors from the node which had the data, but could not write to the repository.

[2014-03-31 17:33:37,684][WARN ][repositories ] [Hermod] failed to create repository [fs][my_backup]
org.elasticsearch.common.inject.CreationException: Guice creation errors:

  1. Error injecting constructor, org.elasticsearch.common.blobstore.BlobStoreException: Failed to create directory at [C:/CanAccessOnlyFromNode1]
    at org.elasticsearch.repositories.fs.FsRepository.<init>(Unknown Source)
    while locating org.elasticsearch.repositories.fs.FsRepository
    while locating org.elasticsearch.repositories.Repository

1 error
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:344)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:178)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:131)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
at org.elasticsearch.repositories.RepositoriesService.createRepositoryHolder(RepositoriesService.java:384)
at org.elasticsearch.repositories.RepositoriesService.clusterChanged(RepositoriesService.java:280)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:427)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.common.blobstore.BlobStoreException: Failed to create directory at [C:/CanAccessOnlyFromNode1]
at org.elasticsearch.common.blobstore.fs.FsBlobStore.<init>(FsBlobStore.java:52)
at org.elasticsearch.repositories.fs.FsRepository.<init>(FsRepository.java:83)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:52)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:200)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:830)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
... 10 more
[2014-03-31 17:33:37,699][WARN ][repositories ] [Hermod] failure updating cluster state
org.elasticsearch.repositories.RepositoryException: [my_backup] failed to create repository
at org.elasticsearch.repositories.RepositoriesService.createRepositoryHolder(RepositoriesService.java:394)
at org.elasticsearch.repositories.RepositoriesService.clusterChanged(RepositoriesService.java:280)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:427)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.common.inject.CreationException: Guice creation errors:

  1. Error injecting constructor, org.elasticsearch.common.blobstore.BlobStoreException: Failed to create directory at [C:/CanAccessOnlyFromNode1]
    at org.elasticsearch.repositories.fs.FsRepository.<init>(Unknown Source)
    while locating org.elasticsearch.repositories.fs.FsRepository
    while locating org.elasticsearch.repositories.Repository

1 error
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:344)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:178)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:131)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
at org.elasticsearch.repositories.RepositoriesService.createRepositoryHolder(RepositoriesService.java:384)
... 6 more
Caused by: org.elasticsearch.common.blobstore.BlobStoreException: Failed to create directory at [C:/CanAccessOnlyFromNode1]
at org.elasticsearch.common.blobstore.fs.FsBlobStore.<init>(FsBlobStore.java:52)
at org.elasticsearch.repositories.fs.FsRepository.<init>(FsRepository.java:83)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:52)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:200)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:830)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
... 10 more
[2014-03-31 17:33:37,738][INFO ][gateway ] [Hermod] recovered [4] indices into cluster_state

@imotov
Contributor

imotov commented Apr 2, 2014

I tried to reproduce this issue but wasn't able to. Could you paste here the output of the GET snapshot command for one of the snapshots that wasn't complete? The GET snapshot command looks like this: curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_name"; please replace snapshot_name with the name of your test snapshot.

imotov added a commit to imotov/elasticsearch that referenced this issue Apr 2, 2014
@RobbieHer
Author

The GET command shows that there are no failures:

$ curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_13"
{"snapshots":[{"snapshot":"snapshot_13","indices":["restore","newstore","store","default-encryptor-963"],"state":"SUCCESS","start_time":"2014-04-02T17:07:25.634Z","start_time_in_millis":1396458445634,"end_time":"2014-04-02T17:07:26.069Z","end_time_in_millis":1396458446069,"duration_in_millis":435,"failures":[],"shards":{"total":16,"failed":0,"successful":16}}]}

@RobbieHer
Author

Could you also confirm whether it is absolutely necessary for the same shared repository to be registered from each of the ES nodes in a cluster? Currently I registered the shared repository only from the node (Node 1) where I ran the backup command. When I run the register command on Node 1, the ES logs on Node 3 (which cannot access the shared repo) show an error, yet the register command on Node 1 reports success.

$ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
    "type": "fs",
    "settings": {
        "location": "X:/CanAccessOnlyFromNode1",
        "compress": true
    }
}'
{"acknowledged":true}

@imotov
Contributor

imotov commented Apr 3, 2014

As I mentioned before, as long as you have this directory available on all nodes where primary shards are located, the snapshot will succeed. It looks like this is what happened with the snapshot that you provided. However, since you cannot control primary shard allocation, it's good practice to have this directory available on all nodes.
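
If it helps, the _cat/shards API shows which node each primary (p) and replica (r) shard is allocated to, so you can confirm whether all primaries landed on nodes that can reach the repository. Index and node names below are hypothetical:

$ curl -XGET "localhost:9200/_cat/shards?v"
index shard prirep state   docs store ip       node
store 0     p      STARTED  100 10kb  10.0.0.1 Hermod
store 0     r      STARTED  100 10kb  10.0.0.3 Node3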

@geekpete
Member

geekpete commented Apr 3, 2014

Hi Team,

I've also seen the same symptom of a snapshot being marked as successful when some nodes cannot write to the repository.

I'm using 1.0.1 with the cloud-aws plugin to store snapshots in S3.

When performing a snapshot, the snapshot is stored in S3 and marked as successful. It also shows as an available "Successful" snapshot on the other cluster where I'm testing restores.

When attempting a restore, Elasticsearch performs some sort of consistency check on the snapshot, determines that it is incomplete and cannot be restored, and returns:

{
  "readyState": 4,
  "responseText": "{\"error\":\"SnapshotRestoreException[[test_snapshots_repo:test_snapshot_2014-04-03.1046] index [my_test_index] wasn't fully snapshotted - cannot restore]\",\"status\":500}",
  "responseJSON": {
    "error": "SnapshotRestoreException[[test_snapshots_repo:test_snapshot_2014-04-03.1046] index [my_test_index] wasn't fully snapshotted - cannot restore]",
    "status": 500
  },
  "status": 500,
  "statusText": "Internal Server Error"
}

This is for the same reason explained above: some nodes cannot access the storage location, so those nodes never get to write their Lucene segments to storage, while the other nodes complete their pieces of the snapshot successfully. In my case, this was due to one or more nodes in my cluster being too far out of time sync, so S3 rejected their connections. The cloud-aws plugin did throw a nice error in the logs for this on the affected nodes:

[2014-04-02 20:29:26,251][WARN ][snapshots                ] [elastic_node_1] [[my_test_index][8]] [my_test_repo:test_snapshot_2014-04-03.1046] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: [my_test_index][8] The difference between the request time and the current time is too large.

Which is great, except that although this error is detected by the plugin and reported in the logs, Elasticsearch still marks the snapshot as successful.

I fixed the time sync (ntp) issues on my nodes that were too far out of sync with S3 and tried again.
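
For reference, a quick way to check and correct the offset on each node might look like this, assuming ntpdate is installed (server name illustrative; exact commands vary by distro):

$ ntpdate -q pool.ntp.org      # query only: report this node's clock offset without changing it
$ sudo service ntpd stop       # stop the daemon so a one-off step is allowed
$ sudo ntpdate pool.ntp.org    # step the clock to the NTP server's time
$ sudo service ntpd start      # resume normal NTP discipline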

On trying again, one other node had a problem with the snapshot repository; it seems the cluster state update carrying the repository config never reached this node:

failed to load class with value [s3]; tried [s3, org.elasticsearch.repositories.S3RepositoryModule, org.elasticsearch.repositories.s3.S3RepositoryModule, org.elasticsearch.repositories.s3.S3RepositoryModule]

After restarting this node, it was able to write to the repository fine, and I got a successful backup that could be restored on my other cluster.

So a really nice feature/fix would be to take whatever logic Elasticsearch runs at restore time to verify the consistency of a snapshot, run that same logic after a snapshot is taken, and mark the snapshot as bad if the check fails. On top of that, it could return a message stating which nodes had issues performing their snapshot task, if that information can be fed back to the user, e.g.:

"Snapshot failed: nodes that were unable to write to the snapshot repository: elastic_node_4, elastic_node_7."

The user would then have the clues to go and investigate the specific nodes to troubleshoot.

Cheers.

@geekpete
Member

geekpete commented Apr 3, 2014

Exposing this snapshot consistency check through the API would be a great feature as well, so users can manually check whether a repository contains broken snapshots.

:)

@geekpete
Member

geekpete commented Apr 3, 2014

Also, if you wanted to reproduce my particular scenario, you could take a cluster of nodes, manually set one node to more than, say, 30 minutes out of sync with the rest, and then try to snapshot to S3; it should be rejected.
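
A rough sketch of that reproduction on a Linux node, assuming a repository already registered as my_backup (names hypothetical):

# On one data node: stop NTP and push the clock forward so S3 rejects its requests
$ sudo service ntpd stop
$ sudo date -s "$(date -d '+45 minutes')"

# From any node: take a snapshot and wait for the result
$ curl -XPUT "localhost:9200/_snapshot/my_backup/clock_skew_test?wait_for_completion=true"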

@imotov
Contributor

imotov commented May 31, 2014

The new "PARTIAL" state that was added by #5792 should help to distinguish between snapshots that were completely successful and snapshots that contained some shards that failed to snapshot. Closing.

@imotov imotov closed this as completed May 31, 2014
@Rams20

Rams20 commented Dec 30, 2014

Why are primary shards required for an Elasticsearch backup snapshot?
