Snapshot succeeds even when some nodes cannot access shared repository #5657
When the snapshot is finished, could you execute the following command to see if there are any shard failures:
Thanks! I did verify the case where I had a node in a cluster with only one shard with 0 replicas. This node could not connect to the repo and spewed out errors in the Elasticsearch logs. However, the curl command to take a snapshot did not return any errors and reported that all shards were snapshotted successfully.
Could you gist the errors from the log that you have seen?
Sure. Here are the errors from the node which had the data, but could not write to the repository. [2014-03-31 17:33:37,684][WARN ][repositories ] [Hermod] failed to create repository [fs][my_backup]
1 error
I tried to reproduce this issue, but wasn't able to. Could you paste here the output of the GET snapshot command for one of the snapshots that wasn't complete? The GET snapshot command looks like this:
The GET command shows that there are no failures: $ curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_13"
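For reference, the per-shard accounting that this question is about lives under the `shards` object of the GET-snapshot response. The snippet below uses an illustrative response (field names follow the ES 1.x snapshot API; the values are made up for the example) to show where to look:

```shell
# Illustrative GET-snapshot response; in a real check this would come from
# curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_13"
RESPONSE='{"snapshots":[{"snapshot":"snapshot_13","state":"SUCCESS","shards":{"total":3,"failed":0,"successful":3}}]}'
# The "failed" counter is the per-shard failure count; the bug discussed in
# this thread is that it can read 0 even when a node silently skipped its data:
echo "$RESPONSE" | grep -o '"failed":[0-9]*'
```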
Could you also confirm whether it is absolutely necessary for the same shared repository to be 'registered' from each of the ES nodes in a cluster? Currently, I have registered the shared repository only from the node [Node 1] where I ran the backup command. When I run the 'register' command on Node 1, I see that the ES logs on Node 3 (which cannot access the shared repo) show an error. However, the register command on Node 1 reports success. $ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
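The registration body above is truncated; a sketch of a complete registration, assuming the standard "fs" repository type from the ES 1.x snapshot docs, looks like the following. The location path is illustrative. Registration is a cluster-wide setting, so it only needs to be issued once, but the path must be reachable from every data node:

```shell
# Hypothetical fs-repository settings (path is illustrative):
BODY='{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_backup",
    "compress": true
  }
}'
# Validate the JSON payload locally before sending it:
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"
# Then register the repository from any one node:
# curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d "$BODY"
```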
As I mentioned before, as long as this directory is available on all nodes where primary shards are located, the snapshot will succeed. It looks like this is what happened to the snapshot that you provided. However, since you cannot control primary shard allocation, it's good practice to have this directory available on all nodes.
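Since only the nodes holding primary shards write to the repository, the `_cat/shards` API (available in ES 1.x) shows which nodes those are. The snippet below works over a sample of that output (column layout per the `_cat/shards` docs; index and node names are illustrative):

```shell
# Sample _cat/shards output: index shard prirep state docs store ip node
SHARDS='index1 0 p STARTED 1000 1mb 10.0.0.3 Node3
index1 0 r STARTED 1000 1mb 10.0.0.1 Node1'
# Column 3 is "p" for a primary copy; the last column is the node holding it.
# These are the nodes that must be able to reach the repository:
echo "$SHARDS" | awk '$3 == "p" {print $NF}'
```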
Hi Team, I've also seen the same symptom of a snapshot being marked as successful when some nodes cannot write to the repository. I'm using 1.0.1 with the cloud-aws plugin to store snapshots in S3. When performing a snapshot, the snapshot is stored in S3 and marked as successful. It also shows as an available "Successful" snapshot on the other cluster that I'm testing restore from. When attempting a restore, Elasticsearch performs some sort of consistency check on the snapshot, determines it is incomplete and unable to be restored, and returns:
This is for the same reason explained above: some nodes cannot access the storage location, so those nodes never write their Lucene segments to storage, while the other nodes complete their pieces of the snapshot successfully. In my case, this was because one or more nodes in my cluster were too far out of time sync, so S3 rejected their connections. The cloud-aws plugin did throw a nice error in the logs for this from the nodes with the issue:
Which is great, except that even though this error is detected by the plugin and reported in the logs, Elasticsearch still marks the snapshot as successful. I fixed the time sync (NTP) issues on the nodes that were too far out of sync with S3 and tried again. This time, another node had a problem with the snapshot repository: the cluster state somehow hadn't updated this one node with the repository config:
After restarting this node, it was able to write to the repository fine and I got a successful backup that could be restored on my other cluster. So a really nice feature/fix would be to take whatever logic Elasticsearch runs at restore time to verify the consistency of a snapshot, run that same logic after a snapshot is taken, and mark the snapshot as failed if the check does not pass. On top of that, return a message stating which nodes had issues performing their snapshot task, if that can be fed back to the user, e.g.:
The user will then have the clues to go and investigate specific nodes to troubleshoot. Cheers.
Adding a consistency check of snapshots to the API would be a great feature as well, so users can manually check whether a repository contains broken snapshots. :)
Also, if you wanted to reproduce my particular scenario, you could take a cluster of nodes, manually set one node's clock more than, say, 30 minutes out of sync with the rest, and then try to snapshot to S3; the skewed node's requests should be rejected.
The new "PARTIAL" state that was added by #5792 should help to distinguish between snapshots that were completely successful and snapshots that contained some shards that failed to snapshot. Closing. |
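With the PARTIAL state, an incomplete snapshot becomes distinguishable from a successful one directly in the GET-snapshot response. A sketch, using an illustrative response (field names per the ES snapshot API; values made up):

```shell
# Illustrative response for a snapshot where one shard failed; in practice
# this would come from curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_13"
RESPONSE='{"snapshots":[{"snapshot":"snapshot_13","state":"PARTIAL","shards":{"total":3,"failed":1,"successful":2}}]}'
# A monitoring script can now key off the state field instead of trusting
# the HTTP response of the snapshot request:
echo "$RESPONSE" | grep -o '"state":"[A-Z]*"'
```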
Why are primary shards required for an Elasticsearch backup snapshot?
Hi,
I am testing out using the snapshot/restore feature in ES 1.0.0. I have created the following setup.
An ES cluster with three nodes. A shared file system which can only be accessed by Node 1[Master] and Node 2. This shared fs is registered as the repository for taking backups on Node 1.
I noticed that even if I
a) force some of my data to be present only on Node 3, and
b) ensure that Node 3 cannot access the shared repository
taking a snapshot of the entire cluster still reports success. However, when I browse the contents of the snapshot folder, I do not see any of the data from Node 3. I was expecting a "RepositoryMissing" exception to be thrown by Node 3. Have I misunderstood how ES snapshotting works?
Thanks!