
Unable to DELETE _snapshot even after a rolling restart #31624

Closed
TimHeckel opened this issue Jun 27, 2018 · 10 comments
Assignees
tlrx
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@TimHeckel
Contributor

Elasticsearch version:
6.3.0

Plugins installed:
s3-repository
x-pack

JVM version:
1.8

OS version:
Amazon Linux

Description of the problem including expected versus actual behavior:
curl -XDELETE "http://localhost:9200/_snapshot/s3-6-backup/curator-20180623143002" hangs even after a full restart of the cluster.

I've tried turning trace logging on, but it unfortunately hasn't helped.

Here is the status information on the backup:
curl -XGET "http://localhost:9200/_snapshot/s3-6-backup/_status?pretty"

Relevant:

{
      "snapshot" : "curator-20180623143002",
      "repository" : "s3-6-backup",
      "uuid" : "CbzLTfmbRCaGc5tu90x-9A",
      "state" : "ABORTED",
      "include_global_state" : true,
      "shards_stats" : {
        "initializing" : 0,
        "started" : 0,
        "finalizing" : 0,
        "done" : 187,
        "failed" : 320,
        "total" : 507
      },
colings86 added the :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Jun 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented Jun 28, 2018

@tlrx can you take a look?

tlrx self-assigned this Jun 28, 2018
@tlrx
Member

tlrx commented Jun 28, 2018

@TimHeckel Do you have more information about the snapshot deletion? Was it a currently running / unfinished snapshot that you tried to delete? How long did it hang before you tried to restart the cluster?

@TimHeckel
Contributor Author

@tlrx - apologies for my delay in getting back to you here; the snapshot had been running for many days before I attempted deleting it and/or restarting the cluster. Is there anything I can do to force the removal of this aborted snapshot attempt? Thanks

@TimHeckel
Contributor Author

TimHeckel commented Sep 19, 2018

Hi -- I've since upgraded from 6.3.0 to 6.4.1, but this one hanging snapshot remains. Below are all the responses I've gotten after upgrading. @tlrx - wondering if you could take another look or give me a pointer? I'd prefer not to have to migrate to a whole new cluster, but I simply cannot delete this hanging _snapshot, and that may be my last option.

GET /_snapshot/s3-6-backup/_status
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "repository": "s3-6-backup",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "state": "ABORTED",
      "include_global_state": true,
      "shards_stats": {
        "initializing": 0,
        "started": 0,
        "finalizing": 0,
        "done": 187,
        "failed": 320,
        "total": 507
      },
      "stats": {
        "incremental": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "total": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "start_time_in_millis": 0,
        "time_in_millis": 0,
        "number_of_files": 0,
        "processed_files": 0,
        "total_size_in_bytes": 0,
        "processed_size_in_bytes": 0
      },
      "indices": { ...
GET /_snapshot/s3-6-backup/_current
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "version_id": 6040199,
      "version": "6.4.1",
      "indices": [ ... ],
      "include_global_state": true,
      "state": "IN_PROGRESS",
      "start_time": "2018-06-23T14:30:03.389Z",
      "start_time_in_millis": 1529764203389,
      "end_time": "1970-01-01T00:00:00.000Z",
      "end_time_in_millis": 0,
      "duration_in_millis": -1529764203389,
      "failures": [],
      "shards": {
        "total": 0,
        "failed": 0,
        "successful": 0
      }
    }
  ]
}
DELETE /_snapshot/s3-6-backup/curator-20180623143002

(eventually returns with `curl: (52) Empty reply from server`)
DELETE /_snapshot/s3-6-backup
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[main-2][10.0.1.61:9300][cluster:admin/repository/delete]"
      }
    ],
    "type": "illegal_state_exception",
    "reason": "trying to modify or unregister repository that is currently used "
  },
  "status": 500
}

@ywelsch
Contributor

ywelsch commented Oct 1, 2018

The simplest solution to get rid of the stuck snapshot is to do a full cluster restart, i.e., all nodes down, and only then start them up again. This will clear the snapshot state, but of course also means downtime. The more involved solution consists of the following: look at the cluster state and find the snapshot shard entries that are marked as ABORTED, then check the node id associated with each such entry. For example, let's look at:

{
            "index" : {
              "index_name" : "my_index",
              "index_uuid" : "VhUbEkzIQvggTmOXaRS3gQ"
            },
            "shard" : 0,
            "state" : "ABORTED",
            "node" : "zDggEbNfTj-DrAwJYjagcw"
          },

The node id is zDggEbNfTj-DrAwJYjagcw, so in order for this entry to be properly cleaned up, the node with that id needs to be shut down. Now you might think: but that's what we did when we performed the rolling upgrade? Unfortunately, things are a little more complex. You need to shut down the node with id zDggEbNfTj-DrAwJYjagcw, and then WAIT for the snapshot entries for that node to be cleared up in the cluster state (i.e. marked as FAILED) before starting that node up again.

The reason for this odd behavior is that the clean-up logic is scheduled when the node leaves the cluster, but so are possibly lots of other events (e.g. shard failures). The scheduled clean-up logic (which will appear in the pending tasks as "update snapshot state after node removal") unfortunately runs at quite a low priority compared to some of the other tasks, which means that if the node rejoins the cluster before the task has had a chance to execute, the task will mistake the node for not having left and will not clean up the ABORTED state.
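
A minimal sketch of how those ABORTED entries could be pulled out of the cluster state (assuming `jq` is installed, and assuming the in-progress snapshot entries sit under a top-level "snapshots" key of the cluster state response, with shard entries shaped like the excerpt above):

# Dump the cluster state, then list shard entries still marked ABORTED
# together with the id of the node that owns each one.
curl -s "http://localhost:9200/_cluster/state" > state.json
jq '.snapshots.snapshots[]?.shards[]?
    | select(.state == "ABORTED")
    | {index: .index.index_name, shard: .shard, node: .node}' state.json

# Map a node id back to a node name / IP so you know which host to stop.
curl -s "http://localhost:9200/_cat/nodes?full_id=true&h=id,name,ip"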

@BobBlank12
Contributor

I believe you may also have to look for shards that are in the INIT state, not just the ABORTED state. Please correct me if I am off on this.

@ywelsch
Contributor

ywelsch commented Oct 4, 2018

In the situation outlined by @TimHeckel here, he had already issued a delete snapshot command. This moves the entries from INIT to the ABORTED state (but lets the delete snapshot request hang until the abort is fully completed and confirmed by the nodes). So a prerequisite to the above procedure is to first issue a delete snapshot command.
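
For the waiting step, here is a hypothetical polling loop (same assumptions as the sketch above: `jq` installed, in-progress entries under a "snapshots" key in the cluster state). Run it after stopping the node that owns the ABORTED entries, and only restart that node once the loop exits:

# Keep waiting while any shard entry is still ABORTED; the clean-up task
# should show up in the pending queue as
# "update snapshot state after node removal" and flip the entries to FAILED.
while curl -s "http://localhost:9200/_cluster/state" \
    | jq -e '[.snapshots.snapshots[]?.shards[]? | select(.state == "ABORTED")] | length > 0' > /dev/null
do
  curl -s "http://localhost:9200/_cat/pending_tasks?v"
  sleep 10
done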

@TimHeckel
Contributor Author

TimHeckel commented Oct 5, 2018

@ywelsch - thank you so much for your help. I did attempt the second scenario, where I shut down the node(s) holding ABORTED entries and waited for the snapshot entries to turn into FAILED -- I shut down just one of these nodes in my three-node cluster, and unfortunately my first attempt caused the cluster to report:

{ "error" : { "root_cause" : [ { "type" : "master_not_discovered_exception", "reason" : null } ], "type" : "master_not_discovered_exception", "reason" : null }, "status" : 503 }

I think I will try for the full cluster restart tonight. At any rate, you've given me the first actionable advice, so I really appreciate it.

@TimHeckel
Contributor Author

TimHeckel commented Oct 6, 2018

@ywelsch - just to close this, the FULL restart of the cluster worked. Thanks again.
