Snapshot stuck in IN_PROGRESS #29118
The cluster restart got it working again. So it probably would have been sufficient to just restart the two data nodes holding the failed shards.
Pinging @elastic/es-distributed
It would have been sufficient to just restart the node where the shard was stuck in the ABORTED state.
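For others hitting this: the node worth restarting can be read out of the in-progress snapshot section of the cluster state. A minimal sketch of pulling it out with Python (the JSON here is a hypothetical, trimmed stand-in for the real `GET /_cluster/state` output, whose exact field names vary by version):

```python
import json

# Hypothetical, trimmed cluster-state fragment; the real in-progress
# snapshots section returned by GET /_cluster/state is much larger and
# its exact field names may differ between ES versions.
cluster_state = json.loads("""
{
  "snapshots": [
    {
      "snapshot": "daily-2018-03-20",
      "state": "ABORTED",
      "shards": [
        {"index": "logs-1", "shard": 0, "state": "SUCCESS", "node": "nodeA"},
        {"index": "logs-1", "shard": 3, "state": "ABORTED", "node": "nodeB"},
        {"index": "logs-2", "shard": 1, "state": "ABORTED", "node": "nodeC"}
      ]
    }
  ]
}
""")

# Collect the nodes that still hold shards in the ABORTED state --
# per the comments above, these are the nodes worth restarting.
stuck_nodes = sorted({
    shard["node"]
    for snap in cluster_state["snapshots"]
    for shard in snap["shards"]
    if shard["state"] == "ABORTED"
})
print(stuck_nodes)  # ['nodeB', 'nodeC']
```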
Did you have any automatic node restarts or master node disconnects/switches while this snapshot was executing?
@imotov No restarts or disconnects as far as I can tell from the logs & metrics. The master node was replaced 17 hours into the snapshot, but I'd assume the problem had already happened by that point since our snapshots don't take that long. I only see 2 log statements on the data node with the ABORTED shard, so this would be ~10 minutes into the snapshot.
Is there anything in the master logs around the same time?
The logs are almost entirely low disk watermark warnings.
Note: both of these warnings are for the nodes that had shards marked as FAILED & ABORTED.
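For reference, the watermark that was tripping those warnings is a dynamic cluster setting and can be adjusted without a restart. An illustrative settings call (the values here are made up, not a recommendation):

```json
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
```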
I've got the same problem, but I can't do a full cluster restart right now. Any ideas? Thanks.
@macrowh as I mentioned in the comment above, if you have exactly the same issue with the same version of Elasticsearch, it should be sufficient to restart just the node where you see the shard in the ABORTED state.
@imotov OK, thanks. I will give it a try~
@imotov Hi, I tried restarting the failed data node and that worked. But after taking plenty of snapshots, I got stuck in IN_PROGRESS again. I'm using ES 2.4.0. Any idea whether upgrading ES would fix this problem? Thank you.
@macrowh if you have exactly the same issue as scottsom described above, the upgrade is unlikely to help. However, we fixed several other issues after 2.4.0 that had similar symptoms, so I would expect the upgrade to significantly reduce the number of incidents.
@imotov Thanks for your reply. As for the problem scottsom reported, is there any plan to fix it, maybe in the next version of ES? Every data node in our production cluster now holds several TB of data, so restarting a data node is heavy and time-consuming and may affect the business. Hoping for a fix, thank you.
Based on the provided information, I suspect this is the situation described here. We have a quick fix in #36113 (released in ES 6.5.3) and a more comprehensive fix will follow. Please upgrade to 6.5.3 or higher, and get back to us if you're still experiencing this issue. |
@imotov I am running ES 5.5.2 and hit a similar issue: a snapshot got stuck, and when I tried to delete it the deletion also got stuck. Would restarting a node help it recover?
@smskhra it is hard to say based on the limited information you have provided. Please ask this question on the discussion forum and somebody will try to help you there.
Thanks for the quick response, @imotov. Let me explain: we are using Elasticsearch 5.5.2 running on Ubuntu with Java 8. This is what we currently
found in the task API.
Finally I tried the delete API for the snapshot in our backup config esDailyBackups, with no luck. Would restarting node fywgthIcQsmj-oYWooH6bw help, or do we need to restart the full cluster?
Sure. Thanks.
**Elasticsearch version** (`bin/elasticsearch --version`): 6.1.1

**Plugins installed**: [analysis-icu, analysis-kuromoji, analysis-phonetic, analysis-smartcn, analysis-stempel, analysis-ukrainian, mapper-size, repository-s3]
Also a few custom plugins that add monitoring, security, and one that initiates a snapshot periodically.

**JVM version** (`java -version`): 1.8.0_144

**OS version** (`uname -a` if on a Unix-like system): Amazon Linux

**Description of the problem including expected versus actual behavior**:
We initiate an S3 snapshot request and wait for it complete. These are done daily and usually only take about 20 minutes. In this case, the snapshot never returned. Our backups are now effectively useless since the IN_PROGRESS snapshot cannot be deleted, the repository cannot be deleted, and no subsequent snapshots can be created (even in a different repository).
The snapshot seems to be out of sync with the cluster state which says the snapshot was ABORTED but the snapshot still says IN_PROGRESS.
I have restarted all 3 dedicated master nodes and the snapshot is still stuck. I have yet to try a full cluster restart.
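Our snapshot wrapper waited on the snapshot with no upper bound, which is why the hang went unnoticed. A sketch of the kind of timeout guard that would have surfaced it (`get_snapshot_state` is a hypothetical stub standing in for a real call to `GET _snapshot/<repo>/<snapshot>`; the names are illustrative, not an ES API):

```python
import time

def get_snapshot_state():
    """Stub for a real call to GET _snapshot/<repo>/<snapshot>;
    here it pretends the snapshot completes on the third poll."""
    get_snapshot_state.calls += 1
    return "SUCCESS" if get_snapshot_state.calls >= 3 else "IN_PROGRESS"

get_snapshot_state.calls = 0

def wait_for_snapshot(timeout_s=20 * 60, poll_s=0.01):
    """Poll until the snapshot leaves IN_PROGRESS, but give up after
    timeout_s so a stuck snapshot raises instead of hanging forever."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_snapshot_state()
        if state != "IN_PROGRESS":
            return state
        time.sleep(poll_s)
    raise TimeoutError("snapshot still IN_PROGRESS past the deadline")

print(wait_for_snapshot())  # SUCCESS
```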
Steps to reproduce:
Unfortunately, I haven't been able to identify the root cause or a way to reproduce this.
The cluster was GREEN before and after the snapshot started (17 hours later it went YELLOW). There were 4 shards being relocated leading up to and during the snapshot. The `cluster.routing.allocation.disk.watermark.low` setting was breached on a number of nodes. 17 hours after the snapshot was initiated I replaced the master node. All indexes were originally created in ES 5.5.0 and the current version was reached through a series of rolling upgrades (5.5.0 -> 5.6.2 -> 6.1.1).

**Provide logs (if relevant)**:
The only log statements I could find about one of the affected shards are:
The cluster state says the snapshot was ABORTED since two of the shards in one of the indexes were marked as ABORTED.
I have run hot_threads with threads=1000 across all nodes and I cannot find any indication of the snapshot running. I cannot find any references to snapshots in `_tasks`.