Snapshot stuck in IN_PROGRESS #29118

Closed
scottsom opened this issue Mar 16, 2018 · 18 comments
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@scottsom

Elasticsearch version (bin/elasticsearch --version): 6.1.1

Plugins installed: [analysis-icu, analysis-kuromoji, analysis-phonetic, analysis-smartcn, analysis-stempel, analysis-ukrainian, mapper-size, repository-s3]

Also a few custom plugins that add monitoring, security, and one that initiates a snapshot periodically.

JVM version (java -version): 1.8.0_144

OS version (uname -a if on a Unix-like system): Amazon Linux

Description of the problem including expected versus actual behavior:

We initiate an S3 snapshot request and wait for it to complete. These are done daily and usually take only about 20 minutes. In this case, the snapshot never returned. Our backups are now effectively useless: the IN_PROGRESS snapshot cannot be deleted, the repository cannot be deleted, and no subsequent snapshots can be created (even in a different repository).
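
For reference, the daily job essentially issues a create-snapshot request and blocks on it; a rough sketch of the equivalent curl calls (the snapshot name and request body here are illustrative, not the exact job):

# Create the daily snapshot and wait for it to finish
curl -X PUT "localhost:9200/_snapshot/repo-name/2018-02-23t00:00:00.000z?wait_for_completion=true" \
     -H 'Content-Type: application/json' \
     -d '{ "include_global_state": true, "partial": false }'

# Check its state afterwards
curl "localhost:9200/_snapshot/repo-name/2018-02-23t00:00:00.000z"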

The snapshot seems to be out of sync with the cluster state: the cluster state says the snapshot was ABORTED, but the snapshot itself still says IN_PROGRESS.

I have restarted all 3 dedicated master nodes and the snapshot is still stuck. I have yet to try a full cluster restart.

Steps to reproduce:

Unfortunately, I haven't been able to identify the root cause or a way to reproduce this.

The cluster was GREEN before and after the snapshot started (17 hours later it went YELLOW). There were 4 shards being relocated leading up to and during the snapshot. The cluster.routing.allocation.disk.watermark.low setting was breached on a number of nodes. 17 hours after the snapshot was initiated I replaced the master node. All indexes were originally created in ES 5.5.0 and the current version was reached through a series of rolling upgrades (5.5.0 -> 5.6.2 -> 6.1.1).
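
For context, the per-node disk usage that tripped the low watermark can be seen with something like this (output omitted here):

# Per-node shard counts and disk usage, to see which nodes are past the low watermark
curl "localhost:9200/_cat/allocation?v"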

Provide logs (if relevant):

GET _cat/snapshots?repository=repo-name

2018-02-16t00:00:00.000z     SUCCESS 1518739205 00:00:05 1518740265 00:17:45 17.6m 17 895 0 895
2018-02-17t00:00:00.000z     SUCCESS 1518825640 00:00:40 1518826775 00:19:35 18.9m 17 895 0 895
2018-02-18t00:00:00.000z     SUCCESS 1518912034 00:00:34 1518912988 00:16:28 15.9m 17 895 0 895
2018-02-19t00:00:00.000z     SUCCESS 1518998435 00:00:35 1518999342 00:15:42 15.1m 17 895 0 895
2018-02-20t00:00:00.000z     SUCCESS 1519084807 00:00:07 1519085803 00:16:43 16.6m 17 895 0 895
2018-02-21t00:00:00.000z     SUCCESS 1519171226 00:00:26 1519172313 00:18:33 18.1m 17 895 0 895
2018-02-22t00:00:00.000z     SUCCESS 1519257620 00:00:20 1519258622 00:17:02 16.7m 17 895 0 895
2018-02-23t00:00:00.000z IN_PROGRESS 1519344013 00:00:13 0          00:00:00 20.6d 17   0 0   0
GET _snapshot/repo-name

{
    "repo-name": {
        "type": "s3",
        "settings": {
            "bucket": "some-bucket",
            "base_path": "repo-name",
            "storage_class": "standard_ia"
        }
    }
}
GET _snapshot/repo-name/_status

{
    "snapshots": [
        {
            "snapshot": "2018-02-23t00:00:00.000z",
            "repository": "repo-name",
            "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
            "state": "ABORTED",
            "shards_stats": {
                "initializing": 0,
                "started": 0,
                "finalizing": 0,
                "done": 0,
                "failed": 2,
                "total": 2
            },
            "stats": {
                "number_of_files": 0,
                "processed_files": 0,
                "total_size_in_bytes": 0,
                "processed_size_in_bytes": 0,
                "start_time_in_millis": 0,
                "time_in_millis": 0
            },
            "indices": {
                "some_index": {
                    "shards_stats": {
                        "initializing": 0,
                        "started": 0,
                        "finalizing": 0,
                        "done": 0,
                        "failed": 2,
                        "total": 2
                    },
                    "stats": {
                        "number_of_files": 0,
                        "processed_files": 0,
                        "total_size_in_bytes": 0,
                        "processed_size_in_bytes": 0,
                        "start_time_in_millis": 0,
                        "time_in_millis": 0
                    },
                    "shards": {
                        "61": {
                            "stage": "FAILURE",
                            "stats": {
                                "number_of_files": 0,
                                "processed_files": 0,
                                "total_size_in_bytes": 0,
                                "processed_size_in_bytes": 0,
                                "start_time_in_millis": 0,
                                "time_in_millis": 0
                            }
                        },
                        "269": {
                            "stage": "FAILURE",
                            "stats": {
                                "number_of_files": 0,
                                "processed_files": 0,
                                "total_size_in_bytes": 0,
                                "processed_size_in_bytes": 0,
                                "start_time_in_millis": 0,
                                "time_in_millis": 0
                            }
                        }
                    }
                }
            }
        }
    ]
}
GET _snapshot/repo-name/_current

{
    "snapshots": [
        {
            "snapshot": "2018-02-23t00:00:00.000z",
            "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
            "version_id": 6010199,
            "version": "6.1.1",
            "indices": [ ... ],
            "state": "IN_PROGRESS",
            "start_time": "2018-02-23T00:00:13.754Z",
            "start_time_in_millis": 1519344013754,
            "end_time": "1970-01-01T00:00:00.000Z",
            "end_time_in_millis": 0,
            "duration_in_millis": -1519344013754,
            "failures": [],
            "shards": {
                "total": 0,
                "failed": 0,
                "successful": 0
            }
        }
    ]
}
GET _cluster/state

{
    "snapshots": {
        "snapshots": [
            {
                "repository": "repo-name",
                "snapshot": "2018-02-23t00:00:00.000z",
                "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
                "include_global_state": true,
                "partial": false,
                "state": "ABORTED",
                "indices": [ ... ],
                "start_time_millis": 1519344013754,
                "repository_state_id": 54,
                "shards": [
                    {
                        "index": {
                            "index_name": "some_index",
                            "index_uuid": "IDRyEj-lTSmr7imXFqdkgQ"
                        },
                        "shard": 61,
                        "state": "FAILED",
                        "node": "0sAovqveTv6q70IqfpYGDQ"
                    },
                    {
                        "index": {
                            "index_name": "some_index",
                            "index_uuid": "IDRyEj-lTSmr7imXFqdkgQ"
                        },
                        "shard": 269,
                        "state": "ABORTED",
                        "node": "arLNOMBtRnSu_dMZMxVxiw"
                    }
                ]
            }
        ]
    }
}

The only log statements I could find about one of the affected shards are:

[2018-02-23T17:43:22,336][WARN ][o.e.c.s.ClusterApplierService] [host_a] cluster state applier task [indices_store ([[some_index][61]] active fully on other nodes)] took [57.8s] above the warn threshold of 30s
[2018-02-23T17:41:13,470][WARN ][o.e.a.b.TransportShardBulkAction] [host_b] [[some_index][61]] failed to perform indices:data/write/bulk[s] on replica [some_index][61], node[86N5Eln9SsOj4f58HTaJzQ], relocating [0sAovqveTv6q70IqfpYGDQ], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=uAJap9BDTTyTUL2iDhx9aA, rId=JNI46MMfQBaTKQkFrb5Rkg], expected_shard_size[74700244115]

The cluster state says the snapshot was ABORTED since two of the shards in one of the indexes were marked as ABORTED.

I have run hot_threads with threads=1000 across all nodes and I cannot find any indication of the snapshot running. I cannot find any references to snapshots in _tasks.
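
For completeness, those checks were roughly:

# Dump hot threads from every node to look for snapshot activity
curl "localhost:9200/_nodes/hot_threads?threads=1000"

# Look for any snapshot-related tasks
curl "localhost:9200/_tasks?detailed=true&actions=*snapshot*"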

@scottsom
Author

The cluster restart got it working again:

[2018-03-16T19:09:27,894][WARN ][o.e.s.SnapshotsService   ] [master_hostname] failing snapshot of shard [[some_index][269]] on closed node [arLNOMBtRnSu_dMZMxVxiw]
[2018-03-16T19:09:28,505][INFO ][o.e.s.SnapshotsService   ] [master_hostname] snapshot [repo-name:2018-02-23t00:00:00.000z/YtXSACL5Tm-kPadJ1I8JOw] completed with state [PARTIAL]

So it probably would have been sufficient to just restart the two data nodes holding the failed shards as indicated by _cluster/state.

@jasontedor jasontedor added the :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Mar 17, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@imotov
Contributor

imotov commented Mar 19, 2018

So it probably would have been sufficient to just restart the two data nodes holding the failed shards as indicated by _cluster/state.

It would have been sufficient to just restart the node where the shard was stuck in the ABORTED state.

                        "state": "ABORTED",
                        "node": "arLNOMBtRnSu_dMZMxVxiw"

Did you have any automatic node restarts or master node disconnects/switches while this snapshot was executing?

@scottsom
Author

@imotov No restarts or disconnects as far as I can tell from the logs & metrics. The master node was replaced 17 hours into the snapshot but I'd assume the problem had already happened at that point since our snapshots don't take that long.

I only see 2 log statements on the data node with the ABORTED shard (arLNOMBtRnSu_dMZMxVxiw) when looking at a few hours of logs leading up to and during the snapshot.

[2018-02-23T00:10:46,488][WARN ][o.e.t.TransportService   ] [arLNOMBtRnSu_dMZMxVxiw] Transport response handler not found of id [1670216582]
[2018-02-23T00:10:46,493][WARN ][o.e.t.TransportService   ] [arLNOMBtRnSu_dMZMxVxiw] Transport response handler not found of id [1670216582]

So this would be ~10 minutes into the snapshot.

@imotov
Contributor

imotov commented Mar 19, 2018

Is there anything in the master logs around the same time?

@scottsom
Author

The logs are almost entirely low disk watermark warnings.

[2018-02-23T00:00:14,596][INFO ][o.e.s.SnapshotsService   ] snapshot [repo-name:2018-02-23t00:00:00.000z/YtXSACL5Tm-kPadJ1I8JOw] started
[2018-02-23T00:00:19,782][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.4gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:00:19,782][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 256.6gb[14.7%], replicas will not be assigned to this node

... low disk watermark warnings repeated ...

[2018-02-23T00:09:54,176][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 234.9gb[13.4%], replicas will not be assigned to this node
[2018-02-23T00:09:54,176][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.5gb[11.8%], replicas will not be assigned to this node
[2018-02-23T00:10:24,369][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 234.1gb[13.4%], replicas will not be assigned to this node
[2018-02-23T00:10:24,369][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.3gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:10:54,620][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 233.1gb[13.3%], replicas will not be assigned to this node
[2018-02-23T00:10:54,620][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205.2gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:11:24,866][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 205gb[11.7%], replicas will not be assigned to this node
[2018-02-23T00:11:24,867][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 232.1gb[13.3%], replicas will not be assigned to this node
[2018-02-23T00:11:55,078][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [arLNOMBtRnSu_dMZMxVxiw][hostname][data path] free: 230.9gb[13.2%], replicas will not be assigned to this node
[2018-02-23T00:11:55,078][INFO ][o.e.c.r.a.DiskThresholdMonitor] low disk watermark [85%] exceeded on [0sAovqveTv6q70IqfpYGDQ][hostname][data path] free: 204.8gb[11.7%], replicas will not be assigned to this node

... low disk watermark warnings repeated ...

Note: both of these warnings are for the nodes that had shards marked as FAILED & ABORTED.

@macrowh

macrowh commented Mar 22, 2018

I have the same problem, but I can't do a full cluster restart right now. Any ideas? Thanks.

@imotov
Contributor

imotov commented Mar 22, 2018

@macrowh as I mentioned in the comment above, if you have exactly the same issue with the same version of Elasticsearch, it should be sufficient to restart just the node where you see the shard in the ABORTED state.

@macrowh

macrowh commented Mar 23, 2018

@imotov OK, thanks. I will give it a try.

@macrowh

macrowh commented Mar 27, 2018

@imotov Hi, I tried restarting the failed data node and that worked. But when I take a lot of snapshots, I get stuck in IN_PROGRESS again. I am using ES 2.4.0. Any idea whether upgrading ES would fix this problem? Thank you.

@imotov
Contributor

imotov commented Mar 27, 2018

@macrowh if you have exactly the same issue as scottsom described above, an upgrade is unlikely to help. However, we have fixed several other issues since 2.4.0 that had similar symptoms, so I would expect an upgrade to significantly reduce the number of incidents.

@macrowh

macrowh commented Mar 28, 2018

@imotov Thanks for your reply. Regarding the problem scottsom reported, is there any plan to fix it, maybe in the next version of ES? In our production cluster every data node now holds several TB of data, so restarting even one data node is heavy and time-consuming and may affect the business. We hope this problem can be fixed. Thank you.

@ywelsch
Contributor

ywelsch commented Jan 16, 2019

Based on the provided information, I suspect this is the situation described here. We have a quick fix in #36113 (released in ES 6.5.3) and a more comprehensive fix will follow. Please upgrade to 6.5.3 or higher, and get back to us if you're still experiencing this issue.

@ywelsch ywelsch closed this as completed Jan 16, 2019
@smskhra

smskhra commented Jul 17, 2019

@imotov I am running ES 5.5.2 and hit a similar issue: a snapshot got stuck, and when I tried to delete it the delete got stuck as well. Would restarting the node help it recover?

@imotov
Contributor

imotov commented Jul 17, 2019

@smskhra it is hard to say based on the limited information that you have provided. Please, ask this question on the discussion forum and somebody will try to help you there.

@smskhra

smskhra commented Jul 17, 2019

Thanks for the quick response; let me explain, @imotov.

We've been using the snapshot feature for a while. The day before yesterday we hit the endpoint to create a snapshot, but it has been in progress for nearly 24 hours. This has never happened before; snapshots generally finish quite quickly. We can't seem to kill it, having tried to delete it via the snapshot API. Our main concern right now is to kill the stuck snapshot and take a new one so we have a backup. Any help appreciated.

We are using Elasticsearch 5.5.2 running on Ubuntu with Java 8. We currently have 8 data nodes and 2 master-eligible nodes, one of which is the current master.
curl localhost:9200/_snapshot/esDailyBackups/_status?pretty

{
  "snapshots" : [
    {
      "snapshot" : "abc_snapshot_15-07-2019-210310",
      "repository" : "esDailyBackups",
      "uuid" : "8d7phbpYQ_2qytG8Nl2U5g",
      "state" : "ABORTED",
      "shards_stats" : {
        "initializing" : 0,
        "started" : 1,
        "finalizing" : 0,
        "done" : 2,
        "failed" : 0,
        "total" : 3
      },
      "stats" : {
        "number_of_files" : 393,
        "processed_files" : 306,
        "total_size_in_bytes" : 11475784915,
        "processed_size_in_bytes" : 9861940780,
        "start_time_in_millis" : 1563224590743,
        "time_in_millis" : 181308
      },
      "indices" : {
        "abc_new1" : {
          "shards_stats" : {
            "initializing" : 0,
            "started" : 1,
            "finalizing" : 0,
            "done" : 2,
            "failed" : 0,
            "total" : 3
          },
          "stats" : {
            "number_of_files" : 393,
            "processed_files" : 306,
            "total_size_in_bytes" : 11475784915,
            "processed_size_in_bytes" : 9861940780,
            "start_time_in_millis" : 1563224590743,
            "time_in_millis" : 181308
          },
          "shards" : {
            "0" : {
              "stage" : "DONE",
              "stats" : {
                "number_of_files" : 134,
                "processed_files" : 134,
                "total_size_in_bytes" : 3958389437,
                "processed_size_in_bytes" : 3958389437,
                "start_time_in_millis" : 1563224590772,
                "time_in_millis" : 181279
              }
            },
            "1" : {
              "stage" : "STARTED",
              "stats" : {
                "number_of_files" : 115,
                "processed_files" : 28,
                "total_size_in_bytes" : 3867354713,
                "processed_size_in_bytes" : 2253510578,
                "start_time_in_millis" : 1563224590743,
                "time_in_millis" : 0
              },
              "node" : "fywgthIcQsmj-oYWooH6bw"
            },
            "2" : {
              "stage" : "DONE",
              "stats" : {
                "number_of_files" : 144,
                "processed_files" : 144,
                "total_size_in_bytes" : 3650040765,
                "processed_size_in_bytes" : 3650040765,
                "start_time_in_millis" : 1563224590746,
                "time_in_millis" : 170716
              }
            }
          }
        }
      }
    }
  ]
}

Found in the task API:
curl -s -X GET "localhost:9200/_tasks?nodes=fywgthIcQsmj-oYWooH6bw"

{
    "nodes": {
        "fywgthIcQsmj-oYWooH6bw": {
            "host": "1.1.1.1",
            "ip": "1.1.1.1:9300",
            "name": "1.1.1.1",
            "roles": [
                "master",
                "data",
                "ingest"
            ],
            "tasks": {
                "fywgthIcQsmj-oYWooH6bw:172013274": {
                    "action": "cluster:admin/snapshot/create",
                    "cancellable": false,
                    "id": 172013274,
                    "node": "fywgthIcQsmj-oYWooH6bw",
                    "running_time_in_nanos": 145795133107222,
                    "start_time_in_millis": 1563224590339,
                    "type": "transport"
                },
                "fywgthIcQsmj-oYWooH6bw:176830544": {
                    "action": "cluster:admin/snapshot/delete",
                    "cancellable": false,
                    "id": 176830544,
                    "node": "fywgthIcQsmj-oYWooH6bw",
                    "running_time_in_nanos": 98041094656591,
                    "start_time_in_millis": 1563272344378,
                    "type": "transport"
                },

Finally, I tried the delete API for the snapshot repository config esDailyBackups. No luck.

Will restarting node fywgthIcQsmj-oYWooH6bw help, or do I need to restart the full cluster?

@imotov
Contributor

imotov commented Jul 17, 2019

Please, ask this question on the discussion forum and somebody will try to help you there.

@smskhra

smskhra commented Jul 17, 2019

Sure. Thanks.
