Snapshot stuck in IN_PROGRESS #29118

Closed
scottsom opened this issue Mar 16, 2018 · 18 comments
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@scottsom

Elasticsearch version (bin/elasticsearch --version): 6.1.1

Plugins installed: [analysis-icu, analysis-kuromoji, analysis-phonetic, analysis-smartcn, analysis-stempel, analysis-ukrainian, mapper-size, repository-s3]

Also a few custom plugins that add monitoring, security, and one that initiates a snapshot periodically.

JVM version (java -version): 1.8.0_144

OS version (uname -a if on a Unix-like system): Amazon Linux

Description of the problem including expected versus actual behavior:

We initiate an S3 snapshot request and wait for it to complete. These are done daily and usually take only about 20 minutes. In this case, the snapshot never returned. Our backups are now effectively useless: the IN_PROGRESS snapshot cannot be deleted, the repository cannot be deleted, and no subsequent snapshots can be created (even in a different repository).
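
For reference, the daily job essentially issues a create-snapshot request and blocks on it; a rough sketch of the equivalent curl calls (the snapshot name and request body here are illustrative, not the exact job):

# Create the daily snapshot and wait for it to finish
curl -X PUT "localhost:9200/_snapshot/repo-name/2018-02-23t00:00:00.000z?wait_for_completion=true" \
     -H 'Content-Type: application/json' \
     -d '{ "include_global_state": true, "partial": false }'

# Check its state afterwards
curl "localhost:9200/_snapshot/repo-name/2018-02-23t00:00:00.000z"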

The snapshot seems to be out of sync with the cluster state: the cluster state says the snapshot was ABORTED, but the snapshot itself still says IN_PROGRESS.

I have restarted all 3 dedicated master nodes and the snapshot is still stuck. I have yet to try a full cluster restart.

Steps to reproduce:

Unfortunately, I haven't been able to identify the root cause or a way to reproduce this.

The cluster was GREEN before and after the snapshot started (17 hours later it went YELLOW). There were 4 shards being relocated leading up to and during the snapshot. The cluster.routing.allocation.disk.watermark.low setting was breached on a number of nodes. 17 hours after the snapshot was initiated I replaced the master node. All indexes were originally created in ES 5.5.0 and the current version was reached through a series of rolling upgrades (5.5.0 -> 5.6.2 -> 6.1.1).
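
For context, the per-node disk usage that tripped the low watermark can be seen with something like this (output omitted here):

# Per-node shard counts and disk usage, to see which nodes are past the low watermark
curl "localhost:9200/_cat/allocation?v"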

Provide logs (if relevant):

GET _cat/snapshots?repository=repo-name

2018-02-16t00:00:00.000z     SUCCESS 1518739205 00:00:05 1518740265 00:17:45 17.6m 17 895 0 895
2018-02-17t00:00:00.000z     SUCCESS 1518825640 00:00:40 1518826775 00:19:35 18.9m 17 895 0 895
2018-02-18t00:00:00.000z     SUCCESS 1518912034 00:00:34 1518912988 00:16:28 15.9m 17 895 0 895
2018-02-19t00:00:00.000z     SUCCESS 1518998435 00:00:35 1518999342 00:15:42 15.1m 17 895 0 895
2018-02-20t00:00:00.000z     SUCCESS 1519084807 00:00:07 1519085803 00:16:43 16.6m 17 895 0 895
2018-02-21t00:00:00.000z     SUCCESS 1519171226 00:00:26 1519172313 00:18:33 18.1m 17 895 0 895
2018-02-22t00:00:00.000z     SUCCESS 1519257620 00:00:20 1519258622 00:17:02 16.7m 17 895 0 895
2018-02-23t00:00:00.000z IN_PROGRESS 1519344013 00:00:13 0          00:00:00 20.6d 17   0 0   0
GET _snapshot/repo-name

{
    "repo-name": {
        "type": "s3",
        "settings": {
            "bucket": "some-bucket",
            "base_path": "repo-name",
            "storage_class": "standard_ia"
        }
    }
}
GET _snapshot/repo-name/_status

{
    "snapshots": [
        {
            "snapshot": "2018-02-23t00:00:00.000z",
            "repository": "repo-name",
            "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
            "state": "ABORTED",
            "shards_stats": {
                "initializing": 0,
                "started": 0,
                "finalizing": 0,
                "done": 0,
                "failed": 2,
                "total": 2
            },
            "stats": {
                "number_of_files": 0,
                "processed_files": 0,
                "total_size_in_bytes": 0,
                "processed_size_in_bytes": 0,
                "start_time_in_millis": 0,
                "time_in_millis": 0
            },
            "indices": {
                "some_index": {
                    "shards_stats": {
                        "initializing": 0,
                        "started": 0,
                        "finalizing": 0,
                        "done": 0,
                        "failed": 2,
                        "total": 2
                    },
                    "stats": {
                        "number_of_files": 0,
                        "processed_files": 0,
                        "total_size_in_bytes": 0,
                        "processed_size_in_bytes": 0,
                        "start_time_in_millis": 0,
                        "time_in_millis": 0
                    },
                    "shards": {
                        "61": {
                            "stage": "FAILURE",
                            "stats": {
                                "number_of_files": 0,
                                "processed_files": 0,
                                "total_size_in_bytes": 0,
                                "processed_size_in_bytes": 0,
                                "start_time_in_millis": 0,
                                "time_in_millis": 0
                            }
                        },
                        "269": {
                            "stage": "FAILURE",
                            "stats": {
                                "number_of_files": 0,
                                "processed_files": 0,
                                "total_size_in_bytes": 0,
                                "processed_size_in_bytes": 0,
                                "start_time_in_millis": 0,
                                "time_in_millis": 0
                            }
                        }
                    }
                }
            }
        }
    ]
}
GET _snapshot/repo-name/_current

{
    "snapshots": [
        {
            "snapshot": "2018-02-23t00:00:00.000z",
            "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
            "version_id": 6010199,
            "version": "6.1.1",
            "indices": [ ... ],
            "state": "IN_PROGRESS",
            "start_time": "2018-02-23T00:00:13.754Z",
            "start_time_in_millis": 1519344013754,
            "end_time": "1970-01-01T00:00:00.000Z",
            "end_time_in_millis": 0,
            "duration_in_millis": -1519344013754,
            "failures": [],
            "shards": {
                "total": 0,
                "failed": 0,
                "successful": 0
            }
        }
    ]
}
GET _cluster/state

{
    "snapshots": {
        "snapshots": [
            {
                "repository": "repo-name",
                "snapshot": "2018-02-23t00:00:00.000z",
                "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
                "include_global_state": true,
                "partial": false,
                "state": "ABORTED",
                "indices": [ ... ],
                "start_time_millis": 1519344013754,
                "repository_state_id": 54,
                "shards": [
                    {
                        "index": {
                            "index_name": "some_index",
                            "index_uuid": "IDRyEj-lTSmr7imXFqdkgQ"
                        },
                        "shard": 61,
                        "state": "FAILED",
                        "node": "0sAovqveTv6q70IqfpYGDQ"
                    },
                    {
                        "index": {
                            "index_name": "some_index",
                            "index_uuid": "IDRyEj-lTSmr7imXFqdkgQ"
                        },
                        "shard": 269,
                        "state": "ABORTED",
                        "node": "arLNOMBtRnSu_dMZMxVxiw"
                    }
                ]
            }
        ]
    }
}

The only log statements I could find about one of the affected shards are:

[2018-02-23T17:43:22,336][WARN ][o.e.c.s.ClusterApplierService] [host_a] cluster state applier task [indices_store ([[some_index][61]] active fully on other nodes)] took [57.8s] above the warn threshold of 30s
[2018-02-23T17:41:13,470][WARN ][o.e.a.b.TransportShardBulkAction] [host_b] [[some_index][61]] failed to perform indices:data/write/bulk[s] on replica [some_index][61], node[86N5Eln9SsOj4f58HTaJzQ], relocating [0sAovqveTv6q70IqfpYGDQ], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=uAJap9BDTTyTUL2iDhx9aA, rId=JNI46MMfQBaTKQkFrb5Rkg], expected_shard_size[74700244115]

The cluster state says the snapshot was ABORTED since two of the shards in one of the indexes were marked as ABORTED.

I have run hot_threads with threads=1000 across all nodes and I cannot find any indication of the snapshot running. I cannot find any references to snapshots in _tasks.
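
For completeness, those checks were roughly:

# Dump hot threads from every node to look for snapshot activity
curl "localhost:9200/_nodes/hot_threads?threads=1000"

# Look for any snapshot-related tasks
curl "localhost:9200/_tasks?detailed=true&actions=*snapshot*"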

@scottsom
Author

The cluster restart got it working again:

[2018-03-16T19:09:27,894][WARN ][o.e.s.SnapshotsService   ] [master_hostname] failing snapshot of shard [[some_index][269]] on closed node [arLNOMBtRnSu_dMZMxVxiw]
[2018-03-16T19:09:28,505][INFO ][o.e.s.SnapshotsService   ] [master_hostname] snapshot [repo-name:2018-02-23t00:00:00.000z/YtXSACL5Tm-kPadJ1I8JOw] completed with state [PARTIAL]

So it probably would have been sufficient to just restart the two data nodes holding the failed shards as indicated by _cluster/state.

@jasontedor jasontedor added the :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Mar 17, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@imotov
Contributor

imotov commented Mar 19, 2018

So it probably would have been sufficient to just restart the two data nodes holding the failed shards as indicated by _cluster/state.

It would have been sufficient to just restart the node where the shard was stuck in the ABORTED state.

                        "state": "ABORTED",
                        "node": "arLNOMBtRnSu_dMZMxVxiw"

Did you have any automatic node restarts or master node disconnects/switches while this snapshot was executing?

@scottsom
Author

@imotov No restarts or disconnects as far as I can tell from the logs & metrics. The master node was replaced 17 hours into the snapshot but I'd assume the problem had already happened at that point since our snapshots don't take that long.

I only see 2 log statements on the data node with the ABORTED shard (arLNOMBtRnSu_dMZMxVxiw) when looking at a few hours of logs leading up to and during the snapshot.

[2018-02-23T00:10:46,488][WARN ][o.e.t.TransportService   ] [arLNOMBtRnSu_dMZMxVxiw] Transport response handler not found of id [1670216582]
[2018-02-23T00:10:46,493][WARN ][o.e.t.TransportService   ] [arLNOMBtRnSu_dMZMxVxiw] Transport response handler not found of id [1670216582]

So this would be ~10 minutes into the snapshot.

@imotov
Contributor

imotov commented Mar 19, 2018

Is there anything in the master logs around the same time?

@scottsom
Author

The logs are almost entirely low disk watermark warnings.

[2018-02-23T00:00:14,596][INFO ][o.e.s.SnapshotsService   ] snapshot [repo-name:2018-02-23t00:00:00.000z/YtXSACL5Tm-kPadJ1I8JOw] started
[2018-02-23T00:00:19,782][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.4gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:00:19,782][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 256.6gb[14.7%], replicas will not be assigned to this node

... low disk watermark warnings repeated ...

[2018-02-23T00:09:54,176][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 234.9gb[13.4%], replicas will not be assigned to this node
[2018-02-23T00:09:54,176][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.5gb[11.8%], replicas will not be assigned to this node
[2018-02-23T00:10:24,369][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 234.1gb[13.4%], replicas will not be assigned to this node
[2018-02-23T00:10:24,369][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.3gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:10:54,620][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 233.1gb[13.3%], replicas will not be assigned to this node
[2018-02-23T00:10:54,620][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.2gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:11:24,866][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:11:24,867][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 232.1gb[13.3%], replicas will not be assigned to this node
[2018-02-23T00:11:55,078][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 230.9gb[13.2%], replicas will not be assigned to this node
[2018-02-23T00:11:55,078][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 204.8gb[11.7%], replicas will not be assigned to this node

... low disk watermark warnings repeated ...

Note: both of these warnings are for the nodes that had shards marked as FAILED & ABORTED.

@macrowh

macrowh commented Mar 22, 2018

I have the same problem, but I can't do a full cluster restart right now. Any ideas? Thanks.

@imotov
Contributor

imotov commented Mar 22, 2018

@macrowh as I mentioned in the comment above, if you have exactly the same issue with the same version of Elasticsearch, it should be sufficient to restart just the node where you see the shard in the ABORTED state.

@macrowh

macrowh commented Mar 23, 2018

@imotov OK, thanks. I will give it a try.

@macrowh

macrowh commented Mar 27, 2018

@imotov Hi, I tried restarting the failed data node and that worked. But when I take a lot of snapshots, I get stuck in IN_PROGRESS again. I am using ES 2.4.0. Any idea whether upgrading ES would fix this problem? Thank you.

@imotov
Contributor

imotov commented Mar 27, 2018

@macrowh if you have exactly the same issue as scottsom described above, an upgrade is unlikely to help. However, we have fixed several other issues since 2.4.0 that had similar symptoms, so I would expect an upgrade to significantly reduce the number of incidents.

@macrowh

macrowh commented Mar 28, 2018

@imotov Thanks for your reply. Regarding the problem scottsom reported, is there any plan to fix it, maybe in the next version of ES? In our production cluster every data node now holds several TB of data, so restarting even one data node is heavy and time-consuming and may affect the business. We hope this problem can be fixed. Thank you.

@ywelsch
Contributor

ywelsch commented Jan 16, 2019

Based on the provided information, I suspect this is the situation described here. We have a quick fix in #36113 (released in ES 6.5.3) and a more comprehensive fix will follow. Please upgrade to 6.5.3 or higher, and get back to us if you're still experiencing this issue.

@ywelsch ywelsch closed this as completed Jan 16, 2019
@smskhra

smskhra commented Jul 17, 2019

@imotov I am running ES 5.5.2 and hit a similar issue: a snapshot got stuck, and when I tried to delete it the delete got stuck as well. Would restarting the node help it recover?

@imotov
Contributor

imotov commented Jul 17, 2019

@smskhra it is hard to say based on the limited information that you have provided. Please, ask this question on the discussion forum and somebody will try to help you there.

@smskhra

smskhra commented Jul 17, 2019

Thanks for the quick response; let me explain, @imotov.

We've been using the snapshot feature for a while. The day before yesterday we hit the endpoint to create a snapshot, but it has been in progress for nearly 24 hours. This has never happened before; snapshots generally finish quite quickly. We can't seem to kill it, having tried to delete it via the snapshot API. Our main concern right now is to kill the stuck snapshot and take a new one so we have a backup. Any help appreciated.

We are using Elasticsearch 5.5.2 running on Ubuntu with Java 8. We currently have 8 data nodes and 2 master-eligible nodes, one of which is the current master.
curl localhost:9200/_snapshot/esDailyBackups/_status?pretty

{
  "snapshots" : [
    {
      "snapshot" : "abc_snapshot_15-07-2019-210310",
      "repository" : "esDailyBackups",
      "uuid" : "8d7phbpYQ_2qytG8Nl2U5g",
      "state" : "ABORTED",
      "shards_stats" : {
        "initializing" : 0,
        "started" : 1,
        "finalizing" : 0,
        "done" : 2,
        "failed" : 0,
        "total" : 3
      },
      "stats" : {
        "number_of_files" : 393,
        "processed_files" : 306,
        "total_size_in_bytes" : 11475784915,
        "processed_size_in_bytes" : 9861940780,
        "start_time_in_millis" : 1563224590743,
        "time_in_millis" : 181308
      },
      "indices" : {
        "abc_new1" : {
          "shards_stats" : {
            "initializing" : 0,
            "started" : 1,
            "finalizing" : 0,
            "done" : 2,
            "failed" : 0,
            "total" : 3
          },
          "stats" : {
            "number_of_files" : 393,
            "processed_files" : 306,
            "total_size_in_bytes" : 11475784915,
            "processed_size_in_bytes" : 9861940780,
            "start_time_in_millis" : 1563224590743,
            "time_in_millis" : 181308
          },
          "shards" : {
            "0" : {
              "stage" : "DONE",
              "stats" : {
                "number_of_files" : 134,
                "processed_files" : 134,
                "total_size_in_bytes" : 3958389437,
                "processed_size_in_bytes" : 3958389437,
                "start_time_in_millis" : 1563224590772,
                "time_in_millis" : 181279
              }
            },
            "1" : {
              "stage" : "STARTED",
              "stats" : {
                "number_of_files" : 115,
                "processed_files" : 28,
                "total_size_in_bytes" : 3867354713,
                "processed_size_in_bytes" : 2253510578,
                "start_time_in_millis" : 1563224590743,
                "time_in_millis" : 0
              },
              "node" : "fywgthIcQsmj-oYWooH6bw"
            },
            "2" : {
              "stage" : "DONE",
              "stats" : {
                "number_of_files" : 144,
                "processed_files" : 144,
                "total_size_in_bytes" : 3650040765,
                "processed_size_in_bytes" : 3650040765,
                "start_time_in_millis" : 1563224590746,
                "time_in_millis" : 170716
              }
            }
          }
        }
      }
    }
  ]
}

Found in the task API:
curl -s -X GET "localhost:9200/_tasks?nodes=fywgthIcQsmj-oYWooH6bw"

{
    "nodes": {
        "fywgthIcQsmj-oYWooH6bw": {
            "host": "1.1.1.1",
            "ip": "1.1.1.1:9300",
            "name": "1.1.1.1",
            "roles": [
                "master",
                "data",
                "ingest"
            ],
            "tasks": {
                "fywgthIcQsmj-oYWooH6bw:172013274": {
                    "action": "cluster:admin/snapshot/create",
                    "cancellable": false,
                    "id": 172013274,
                    "node": "fywgthIcQsmj-oYWooH6bw",
                    "running_time_in_nanos": 145795133107222,
                    "start_time_in_millis": 1563224590339,
                    "type": "transport"
                },
                "fywgthIcQsmj-oYWooH6bw:176830544": {
                    "action": "cluster:admin/snapshot/delete",
                    "cancellable": false,
                    "id": 176830544,
                    "node": "fywgthIcQsmj-oYWooH6bw",
                    "running_time_in_nanos": 98041094656591,
                    "start_time_in_millis": 1563272344378,
                    "type": "transport"
                },

Finally, I tried the delete API for the snapshot repository config esDailyBackups. No luck.

Will restarting node fywgthIcQsmj-oYWooH6bw help, or do I need to restart the full cluster?

@imotov
Contributor

imotov commented Jul 17, 2019

Please, ask this question on the discussion forum and somebody will try to help you there.

@smskhra

smskhra commented Jul 17, 2019

Sure. Thanks.
