
Unable to DELETE _snapshot even after a rolling restart #31624

Closed
TimHeckel opened this issue Jun 27, 2018 · 10 comments
Assignees
tlrx
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@TimHeckel
Contributor

Elasticsearch version:
6.3.0

Plugins installed:
s3-repository
x-pack

JVM version:
1.8

OS version:
Amazon Linux

Description of the problem including expected versus actual behavior:
curl -XDELETE "http://localhost:9200/_snapshot/s3-6-backup/curator-20180623143002" hangs even after a full restart of the cluster.

I've tried turning trace logging on, but it unfortunately hasn't helped.

Here is the status information on the backup:
curl -XGET "http://localhost:9200/_snapshot/s3-6-backup/_status?pretty"

Relevant:

{
      "snapshot" : "curator-20180623143002",
      "repository" : "s3-6-backup",
      "uuid" : "CbzLTfmbRCaGc5tu90x-9A",
      "state" : "ABORTED",
      "include_global_state" : true,
      "shards_stats" : {
        "initializing" : 0,
        "started" : 0,
        "finalizing" : 0,
        "done" : 187,
        "failed" : 320,
        "total" : 507
      },
colings86 added the :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Jun 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented Jun 28, 2018

@tlrx can you take a look?

tlrx self-assigned this Jun 28, 2018
@tlrx
Member

tlrx commented Jun 28, 2018

@TimHeckel Do you have more information about the snapshot deletion? Was it a currently running / unfinished snapshot that you tried to delete? How long did it hang before you tried to restart the cluster?

@TimHeckel
Contributor Author

@tlrx - apologies for my delay in getting back to you here; the snapshot had been running for many days before I attempted deleting it and/or restarting the cluster. Is there anything I can do to force the removal of this aborted snapshot attempt? Thanks

@TimHeckel
Contributor Author

TimHeckel commented Sep 19, 2018

Hi -- I've since upgraded from 6.3.0 to 6.4.1, but this one hanging snapshot remains. Below are all the responses I've gotten after upgrading. @tlrx - wondering if you could take another look or give me a pointer? I'd prefer not to have to migrate to a whole new cluster, but I simply cannot delete this hanging _snapshot, and that may be my last option.

GET /_snapshot/s3-6-backup/_status
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "repository": "s3-6-backup",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "state": "ABORTED",
      "include_global_state": true,
      "shards_stats": {
        "initializing": 0,
        "started": 0,
        "finalizing": 0,
        "done": 187,
        "failed": 320,
        "total": 507
      },
      "stats": {
        "incremental": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "total": {
          "file_count": 0,
          "size_in_bytes": 0
        },
        "start_time_in_millis": 0,
        "time_in_millis": 0,
        "number_of_files": 0,
        "processed_files": 0,
        "total_size_in_bytes": 0,
        "processed_size_in_bytes": 0
      },
      "indices": { ...
GET /_snapshot/s3-6-backup/_current
{
  "snapshots": [
    {
      "snapshot": "curator-20180623143002",
      "uuid": "CbzLTfmbRCaGc5tu90x-9A",
      "version_id": 6040199,
      "version": "6.4.1",
      "indices": [ ... ],
      "include_global_state": true,
      "state": "IN_PROGRESS",
      "start_time": "2018-06-23T14:30:03.389Z",
      "start_time_in_millis": 1529764203389,
      "end_time": "1970-01-01T00:00:00.000Z",
      "end_time_in_millis": 0,
      "duration_in_millis": -1529764203389,
      "failures": [],
      "shards": {
        "total": 0,
        "failed": 0,
        "successful": 0
      }
    }
  ]
}
DELETE /_snapshot/s3-6-backup/curator-20180623143002

(eventually returns with `curl: (52) Empty reply from server`)
DELETE /_snapshot/s3-6-backup
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[main-2][10.0.1.61:9300][cluster:admin/repository/delete]"
      }
    ],
    "type": "illegal_state_exception",
    "reason": "trying to modify or unregister repository that is currently used "
  },
  "status": 500
}

@ywelsch
Contributor

ywelsch commented Oct 1, 2018

The simplest solution to get rid of the stuck snapshot is to do a full cluster restart, i.e., all nodes down, and only then start them up again. This will clear the snapshot state, but of course also means downtime. The more involved solution consists of the following: look at the cluster state and find the snapshot shard entries that are marked as ABORTED, then check the node id associated with each such entry. For example, let's look at:

{
            "index" : {
              "index_name" : "my_index",
              "index_uuid" : "VhUbEkzIQvggTmOXaRS3gQ"
            },
            "shard" : 0,
            "state" : "ABORTED",
            "node" : "zDggEbNfTj-DrAwJYjagcw"
          },

The node id is zDggEbNfTj-DrAwJYjagcw, so in order for this entry to be properly cleaned up, the node with that id needs to be shut down. Now you might think: but that's what we did when we performed the rolling upgrade? Unfortunately, things are a little more complex. You need to shut down the node with id zDggEbNfTj-DrAwJYjagcw, and then WAIT for the snapshot entries for that node to be cleared up in the cluster state (i.e. marked as FAILED) before starting that node up again.

The reason for this odd behavior is that the clean-up logic is scheduled when the node leaves the cluster, but so are possibly lots of other events (e.g. shard failures). The scheduled clean-up logic (which will appear in the pending tasks as "update snapshot state after node removal") unfortunately runs at quite a low priority compared to some of the other tasks, which means that if the node rejoins the cluster before the task has had a chance to execute, the task will mistake the node for not having left and will not clean up the ABORTED state.
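
A minimal sketch of how those ABORTED entries could be pulled out of the cluster state (assuming `jq` is installed, and assuming the in-progress snapshot entries sit under a top-level "snapshots" key of the cluster state response, with shard entries shaped like the excerpt above):

# Dump the cluster state, then list shard entries still marked ABORTED
# together with the id of the node that owns each one.
curl -s "http://localhost:9200/_cluster/state" > state.json
jq '.snapshots.snapshots[]?.shards[]?
    | select(.state == "ABORTED")
    | {index: .index.index_name, shard: .shard, node: .node}' state.json

# Map a node id back to a node name / IP so you know which host to stop.
curl -s "http://localhost:9200/_cat/nodes?full_id=true&h=id,name,ip"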

@BobBlank12
Contributor

I believe you may also have to look for shards that are in the INIT state, not just the ABORTED state. Please correct me if I am off on this.

@ywelsch
Contributor

ywelsch commented Oct 4, 2018

In the situation outlined by @TimHeckel here, he had already issued a delete snapshot command. This moves the entries from INIT to the ABORTED state (but lets the delete snapshot request hang until the abort is fully completed and confirmed by the nodes). So a prerequisite to the above procedure is to first issue a delete snapshot command.
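
For the waiting step, here is a hypothetical polling loop (same assumptions as the sketch above: `jq` installed, in-progress entries under a "snapshots" key in the cluster state). Run it after stopping the node that owns the ABORTED entries, and only restart that node once the loop exits:

# Keep waiting while any shard entry is still ABORTED; the clean-up task
# should show up in the pending queue as
# "update snapshot state after node removal" and flip the entries to FAILED.
while curl -s "http://localhost:9200/_cluster/state" \
    | jq -e '[.snapshots.snapshots[]?.shards[]? | select(.state == "ABORTED")] | length > 0' > /dev/null
do
  curl -s "http://localhost:9200/_cat/pending_tasks?v"
  sleep 10
done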

@TimHeckel
Contributor Author

TimHeckel commented Oct 5, 2018

@ywelsch - thank you so much for your help. I did attempt the second scenario, where I shut down the node(s) holding ABORTED entries and waited for the snapshot entries to turn into FAILED -- I shut down just one of these nodes in my three-node cluster, and unfortunately my first attempt caused the cluster to report:

{ "error" : { "root_cause" : [ { "type" : "master_not_discovered_exception", "reason" : null } ], "type" : "master_not_discovered_exception", "reason" : null }, "status" : 503 }

I think I will try for the full cluster restart tonight. At any rate, you've given me the first actionable advice, so I really appreciate it.

@TimHeckel
Contributor Author

TimHeckel commented Oct 6, 2018

@ywelsch - just to close this, the FULL restart of the cluster worked. Thanks again.
