Snapshot/Restore: snapshot with missing metadata file cannot be deleted #7980
Comments
Hi, I just ran into this issue today and was wondering if you had any idea when this patch would be released. Is the target version 1.3.5 or 1.4? Thanks!
We've run into what I believe is this issue. We are running 1.4.4 and there is a snapshot that:
- shows IN_PROGRESS when I check the localhost:9200/_snapshot/backups/snapshot_name endpoint
- causes -XDELETE to hang when attempting to delete it

The reason I believe it is this issue: we upgraded and restarted Elasticsearch around the time this snapshot was running. @imotov I also tried your cleanup script; however, it returns "No snapshots found" and the snapshot is still stuck in the same state as above. Any other ideas on a way to force-delete this snapshot? It is currently blocking us from getting any other snapshots created.
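For reference, the checks described above look roughly like this; the repository name (backups) comes from the comment, and the snapshot name is a placeholder:

```sh
# Check the reported state of the snapshot (names are placeholders).
curl -s 'localhost:9200/_snapshot/backups/snapshot_name?pretty'

# Attempt to delete the stuck snapshot; in the situation described here,
# this call simply hangs.
curl -XDELETE 'localhost:9200/_snapshot/backups/snapshot_name'
```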
@sarkis which type of repository and which version are you using? Could you post the snapshot part of the cluster state here?
@imotov snapshot part of cluster state: https://gist.github.com/sarkis/f46de23dc81b1dba0d1a We're now on ES 1.4.4; the snapshot was started on ES 1.4.2, and we ran into trouble because the snapshot was running while upgrading 1.4.2 -> 1.4.4. We are using an fs repository, more info: {"backups_sea":{"type":"fs","settings":{"compress":"true","location":"/path/to/snapshot_dir"}}}
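For anyone gathering the same information, a rough way to pull the repository settings and the snapshot section of the cluster state (repository name taken from the comment above, default HTTP port assumed):

```sh
# Verify how the repository is registered.
curl -s 'localhost:9200/_snapshot/backups_sea?pretty'

# Dump the cluster state and look for the in-progress snapshot entry,
# which shows up under "snapshots" in 1.x.
curl -s 'localhost:9200/_cluster/state?pretty' | grep -A 20 '"snapshots"'
```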
@sarkis how many nodes are in the cluster right now? Is the node bhwQqwZ2QuCUPjcZGrGpuQ still running?
@imotov 1 gateway and 2 data nodes (1 set to master). I cannot find that node name; I assume it was renamed upon rolling restart or possibly from the upgrade?
@sarkis Are you sure it doesn't appear in the list of nodes?
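For reference, one way to check which nodes are currently in the cluster (default HTTP port assumed):

```sh
# Short, human-readable node listing.
curl -s 'localhost:9200/_cat/nodes?v'

# Full node info; the top-level keys under "nodes" are the node IDs
# (such as bhwQqwZ2QuCUPjcZGrGpuQ).
curl -s 'localhost:9200/_nodes?pretty'
```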
@imotov just double-checked - nothing.
@sarkis did you try running the cleanup script after the upgrade or before? Did you restart the master node during the upgrade, or is it still running 1.4.2? Does the master node have proper access to the shared file system, or do read/write operations with the shared file system still hang?
@imotov I tried running the cleanup script after the upgrade. The master node was restarted and is running 1.4.4 - if I had known about this issue I would have stopped the snapshot before the rolling restarts/upgrades :( The snapshot directory is an NFS mount and the "elasticsearch" user does have proper read/write permissions; I just double-checked this on all nodes in the cluster. Thanks a lot for the help and quick responses.
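A quick way to double-check that on each node (the path comes from the repository settings above; the service user is assumed to be elasticsearch):

```sh
# Confirm the elasticsearch user can create and remove a file on the NFS mount.
sudo -u elasticsearch touch /path/to/snapshot_dir/.perm_check \
  && sudo -u elasticsearch rm /path/to/snapshot_dir/.perm_check \
  && echo "read/write OK"
```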
@sarkis I am completely puzzled about what went wrong and I am still trying to figure out what happened and how to reproduce the issue. With a single master node, the snapshot should have disappeared during the restart. There is simply no place for it to survive, since snapshot information is not persisted on disk. Even if the snapshot somehow survived the restart, the cleanup script should have removed it. So, I feel that I am missing something important about the issue. When you said rolling restart, what did you mean? Could you describe the process in as much detail as possible? Was the snapshot stuck before the upgrade, or was it simply taking a long time? What was the upgrade process? Which nodes did you restart first?
@imotov Sure - so we have 2 data nodes and 1 gateway node (total of 3 nodes). The rolling restart was done following the procedure recommended in the Elasticsearch documentation, roughly as sketched below.
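For context, the rolling-restart procedure documented for 1.x looks roughly like the following; this is a sketch of the general steps, not necessarily the exact commands run here:

```sh
# 1. Disable shard allocation so shards are not reallocated while a node is down.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# 2. Shut down one node, upgrade it, and start it again.

# 3. Re-enable allocation and wait for the cluster to go green before
#    moving on to the next node.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
curl -s 'localhost:9200/_cluster/health?wait_for_status=green&pretty'
```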
I think we have tried everything we could at this point as well. Would you recommend removing and adding back the repo? What's the best way to just get around this? I understand you wanted to reproduce it on your end, but I'd like to get snapshots working ASAP. Update: I know there isn't truly a "master" - I called the above nodes master and non-master based on info from Paramedic at the time of the upgrades.
So they are all master-eligible nodes! That explains the first part - how the snapshot survived the restart. It doesn't explain how it happened in the first place, though. Removing and adding back the repo is not going to help. There are really only two ways to go - I can try to figure out what went wrong and fix the cleanup script to clean up the issue, or you can do a full cluster restart (shut down all master-eligible nodes and then start them back up). By the way, what do you mean by "gateway"?
On the full cluster restart - do you mean to shut them all down at the same time and bring them back up? Does that mean there will be data loss in the time it takes to bring them both back up?
@sarkis not sure I am following. Are you using a non-local gateway on one of your nodes? Which gateway are you using there? How did you configure this node compared to all the other nodes? Could you send me your complete cluster state (you can send it to igor.motov@elasticsearch.com if you don't want to post it here).
@imotov sorry for the confusion - our 3rd non-data, non-master node, which we refer to as a gateway, is the entry point to the cluster. Its one and only purpose is to pass traffic through to the data nodes. Sending you the full cluster state via e-mail.
OK, so the "gateway" node doesn't have anything to do with gateways and it's simply a client node. Got it! I will be waiting for the cluster state to continue the investigation. Thanks! A full cluster restart will make the cluster unavailable for indexing and searching while nodes are restarting and shards are recovering. Depending on how you are indexing data, it might or might not cause loss of data (if your client has retry logic to reindex failed records, it shouldn't lead to any data loss).
@imotov sent the cluster state - let me know if I can do anything else. I am looking for a window in which we can do a full restart to see if this fixes our problem.
In case others come here with the same issue: @imotov's updated cleanup script (https://github.com/imotov/elasticsearch-snapshot-cleanup) for 1.4.4 worked in clearing up the ABORTED snapshots.
Since it seems to be a different problem, I have created a separate issue for it. |
If the snapshot metadata file disappears from a repository, or it wasn't created due to network issues or a master node crash during the snapshot process, such a snapshot cannot be deleted. This was originally reported in #5958 (comment)
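To make the failure mode concrete, here is a rough sketch of what such a repository looks like on disk, following the 1.x fs-repository layout (the path and snapshot name are placeholders):

```sh
ls /path/to/snapshot_dir
# index                    <- list of snapshots in the repository
# snapshot-snapshot_name   <- snapshot info file
# indices/                 <- per-index shard data
# metadata-snapshot_name is absent, which is what prevents the delete
```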