Snapshot/Restore: snapshot with missing metadata file cannot be deleted #7980
Comments
Hi, I just ran into this issue today and was wondering if you had any idea when this patch would be released. Is the target version 1.3.5 or 1.4? Thanks!
We've run into what I believe is this issue. We are running 1.4.4 and there is a snapshot that:
- shows IN_PROGRESS when I check the localhost:9200/_snapshot/backups/snapshot_name endpoint
- causes -XDELETE to hang when attempting to delete it

The reason I believe it is this issue: we upgraded and restarted Elasticsearch around the time this snapshot was running. @imotov I also tried your cleanup script; however, it returns "No snapshots found" and the snapshot is still stuck in the same state as above. Any other ideas on a way to force-delete this snapshot? It is currently blocking us from getting any other snapshots created.
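For reference, the checks described above look roughly like this; the repository name (backups) comes from the comment, and the snapshot name is a placeholder:

```sh
# Check the reported state of the snapshot (names are placeholders).
curl -s 'localhost:9200/_snapshot/backups/snapshot_name?pretty'

# Attempt to delete the stuck snapshot; in the situation described here,
# this call simply hangs.
curl -XDELETE 'localhost:9200/_snapshot/backups/snapshot_name'
```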
@sarkis which type of repository and which version are you using? Could you post the snapshot part of the cluster state here?
@imotov snapshot part of cluster state: https://gist.github.com/sarkis/f46de23dc81b1dba0d1a We're now on ES 1.4.4; the snapshot was started on ES 1.4.2, and we ran into trouble because the snapshot was running while upgrading 1.4.2 -> 1.4.4. We are using an fs repository, more info: {"backups_sea":{"type":"fs","settings":{"compress":"true","location":"/path/to/snapshot_dir"}}}
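For anyone gathering the same information, a rough way to pull the repository settings and the snapshot section of the cluster state (repository name taken from the comment above, default HTTP port assumed):

```sh
# Verify how the repository is registered.
curl -s 'localhost:9200/_snapshot/backups_sea?pretty'

# Dump the cluster state and look for the in-progress snapshot entry,
# which shows up under "snapshots" in 1.x.
curl -s 'localhost:9200/_cluster/state?pretty' | grep -A 20 '"snapshots"'
```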
@sarkis how many nodes are in the cluster right now? Is the node bhwQqwZ2QuCUPjcZGrGpuQ still running?
@imotov 1 gateway and 2 data nodes (1 set to master). I cannot find that node name; I assume it was renamed upon rolling restart or possibly from the upgrade?
@sarkis Are you sure it doesn't appear in the list of nodes?
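For reference, one way to check which nodes are currently in the cluster (default HTTP port assumed):

```sh
# Short, human-readable node listing.
curl -s 'localhost:9200/_cat/nodes?v'

# Full node info; the top-level keys under "nodes" are the node IDs
# (such as bhwQqwZ2QuCUPjcZGrGpuQ).
curl -s 'localhost:9200/_nodes?pretty'
```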
@imotov just double-checked - nothing.
@sarkis did you try running the cleanup script after the upgrade or before? Did you restart the master node during the upgrade, or is it still running 1.4.2? Does the master node have proper access to the shared file system, or do read/write operations with the shared file system still hang?
@imotov I tried running the cleanup script after the upgrade. The master node was restarted and is running 1.4.4 - if I had known about this issue I would have stopped the snapshot before the rolling restarts/upgrades :( The snapshot directory is an NFS mount and the "elasticsearch" user does have proper read/write permissions; I just double-checked this on all nodes in the cluster. Thanks a lot for the help and quick responses.
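A quick way to double-check that on each node (the path comes from the repository settings above; the service user is assumed to be elasticsearch):

```sh
# Confirm the elasticsearch user can create and remove a file on the NFS mount.
sudo -u elasticsearch touch /path/to/snapshot_dir/.perm_check \
  && sudo -u elasticsearch rm /path/to/snapshot_dir/.perm_check \
  && echo "read/write OK"
```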
@sarkis I am completely puzzled about what went wrong and I am still trying to figure out what happened and how to reproduce the issue. With a single master node, the snapshot should have disappeared during the restart. There is simply no place for it to survive, since snapshot information is not persisted on disk. Even if the snapshot somehow survived the restart, the cleanup script should have removed it. So, I feel that I am missing something important about the issue. When you said rolling restart, what did you mean? Could you describe the process in as much detail as possible? Was the snapshot stuck before the upgrade, or was it simply taking a long time? What was the upgrade process? Which nodes did you restart first?
@imotov Sure - so we have 2 data nodes and 1 gateway node (total of 3 nodes). The rolling restart was done following the procedure recommended in the Elasticsearch documentation, roughly as sketched below.
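For context, the rolling-restart procedure documented for 1.x looks roughly like the following; this is a sketch of the general steps, not necessarily the exact commands run here:

```sh
# 1. Disable shard allocation so shards are not reallocated while a node is down.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# 2. Shut down one node, upgrade it, and start it again.

# 3. Re-enable allocation and wait for the cluster to go green before
#    moving on to the next node.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
curl -s 'localhost:9200/_cluster/health?wait_for_status=green&pretty'
```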
I think we have tried everything we could at this point as well. Would you recommend removing and adding back the repo? What's the best way to just get around this? I understand you wanted to reproduce it on your end, but I'd like to get snapshots working ASAP. Update: I know there isn't truly a "master" - I called the above nodes master and non-master based on info from Paramedic at the time of the upgrades.
So they are all master-eligible nodes! That explains the first part - how the snapshot survived the restart. It doesn't explain how it happened in the first place, though. Removing and adding back the repo is not going to help. There are really only two ways to go - I can try to figure out what went wrong and fix the cleanup script to clean up the issue, or you can do a full cluster restart (shut down all master-eligible nodes and then start them back up). By the way, what do you mean by "gateway"?
On the full cluster restart - do you mean to shut them all down at the same time and bring them back up? Does that mean there will be data loss in the time it takes to bring them both back up?
@sarkis not sure I am following. Are you using a non-local gateway on one of your nodes? Which gateway are you using there? How did you configure this node compared to all the other nodes? Could you send me your complete cluster state (you can send it to igor.motov@elasticsearch.com if you don't want to post it here).
@imotov sorry for the confusion - our 3rd non-data, non-master node, which we refer to as a gateway, is the entry point to the cluster. Its one and only purpose is to pass traffic through to the data nodes. Sending you the full cluster state via e-mail.
OK, so the "gateway" node doesn't have anything to do with gateways and it's simply a client node. Got it! I will be waiting for the cluster state to continue the investigation. Thanks! A full cluster restart will make the cluster unavailable for indexing and searching while nodes are restarting and shards are recovering. Depending on how you are indexing data, it might or might not cause loss of data (if your client has retry logic to reindex failed records, it shouldn't lead to any data loss).
@imotov sent the cluster state - let me know if I can do anything else. I am looking for a window in which we can do a full restart to see if this fixes our problem.
In case others come here with the same issue: @imotov's updated cleanup script (https://github.com/imotov/elasticsearch-snapshot-cleanup) for 1.4.4 worked in clearing up the ABORTED snapshots.
Since it seems to be a different problem, I have created a separate issue for it. |
If the snapshot metadata file disappears from a repository, or it wasn't created due to network issues or a master node crash during the snapshot process, such a snapshot cannot be deleted. This was originally reported in #5958 (comment)
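To make the failure mode concrete, here is a rough sketch of what such a repository looks like on disk, following the 1.x fs-repository layout (the path and snapshot name are placeholders):

```sh
ls /path/to/snapshot_dir
# index                    <- list of snapshots in the repository
# snapshot-snapshot_name   <- snapshot info file
# indices/                 <- per-index shard data
# metadata-snapshot_name is absent, which is what prevents the delete
```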