master: "CurrentState[RELOCATED] operation only allowed when started/recovering" (and three stuck shards) #22043
Comments
@blinken how much time is there between the beginning of log spamming and the relocation command to move shards from hot to cold?
The index logstash-na-runit-2016.12.07 would have been created at 2016-12-07 00:01 UTC. Our curator policy is to move indices to cold storage 24 hours after they are created, and the next curator run was at 2016-12-08 00:07:01 UTC - so I'd expect the index to be moved off at that time. The master started spamming the logs at 2016-12-07 08:31:39 UTC, only eight hours after the index was created. Here's a list of curator runs, and I'm not sure I see much of a correlation with the issue starting at 2016-12-07 08:31:39. (last few runs)
Thanks all, I notice this has a fix going in for 5.0.3 / 5.1.2. I'll test when those versions are released and advise if we run into this again. Appreciate the quick response!
@blinken The fix covers the log file spamming. I don't have a good theory though for the three shards which seemed to be stuck in an endless loop trying to relocate. Please report back if anything like this occurs again.
@ywelsch ok, here's a similar case I just noticed. The logstash-na-nginx-2016.12.12 shard 6 replica has been unassigned for at least 24 hours now, and on investigation I find that in Marvel it's looping between INIT and TRANSLOG (shown). I note that there's a second entry in Marvel for the same shard, with the same translog data - which seems to be a UI bug, because I suspect it's trying to show that the primary is also relocating:
Correct me if I'm wrong, but my understanding is that a primary should not be relocating while the secondary shard is in the process of initializing? I don't mind leaving this in this state overnight (UTC) as we still seem to be ingesting data - let me know if you would like me to collect any specific logs etc.
that's actually ok. We allow this to happen concurrently. Once primary relocation completes, the replica fails over to the new primary to resume initialization.
It looks like an issue with primary relocation. At the end of primary relocation there is a (usually) short phase where replica recoveries are delayed (this would explain the switching back and forth of the replica from INIT to TRANSLOG). The primary relocation seems to be stuck. To investigate this further I need:
If you don't want to share this information publicly you can send it to my e-mail address (first_name@elastic.co).
No problem, I'll send via email. Unfortunately it seems dal111 (GFhQ_bRuQgyTxSDih69efw) has generated 208GB of logs since midnight UTC. I'll attach the last 100k lines unless there's specific data you would like.
sounds good. The logs might not reveal much (the interesting bits in this case are only logged at TRACE level, but increasing logging to that level would probably bring the cluster down). The stack traces / task lists on the other hand could reveal some interesting stuff. |
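As a reference for capturing the kind of diagnostics asked for above, the following read-only endpoints cover stack traces, in-flight tasks, and active recoveries. This is a sketch: the host/port and query parameter values are assumptions, not ones given in this thread.

```python
# Hedged sketch: read-only endpoints that capture the "stack traces /
# task lists" discussed above. The host is an assumption; fetch each
# URL with curl or urllib against your own cluster.
BASE = "http://localhost:9200"

ENDPOINTS = {
    "hot_threads": "/_nodes/hot_threads?threads=10000",  # per-node stack traces
    "tasks": "/_tasks?detailed=true",                    # in-flight task list
    "recoveries": "/_cat/recovery?v&active_only=true",   # ongoing recoveries
}

def diagnostic_urls(base=BASE):
    """Return the full URLs to fetch."""
    return [base + path for path in ENDPOINTS.values()]

for url in diagnostic_urls():
    print(url)
```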
I've sent the data through except the stack traces (which I'll follow up with tomorrow morning). |
Recoveries are tracked on the target node using RecoveryTarget objects that are kept in a RecoveriesCollection. Each recovery has a unique id that is communicated from the recovery target to the source so that it can call back to the target and execute actions using the right recovery context.

In case of a network disconnect, recoveries are retried. At the moment, the same recovery id is reused for the restarted recovery. This can lead to confusion though if the disconnect is unilateral and the recovery source continues with the recovery process. If the target reuses the same recovery id while doing a second attempt, there might be two concurrent recoveries running on the source for the same target.

This commit changes the recovery retry process to use a fresh recovery id. It also waits for the first recovery attempt to be fully finished (all resources locally freed) to further prevent concurrent access to the shard. Finally, in case of primary relocation, it also fails a second recovery attempt if the first attempt moved past the finalization step, as the relocation source can then be moved to RELOCATED state and start indexing as primary into the target shard (see TransportReplicationAction). Resetting the target shard in this state could mean that indexing is halted until the recovery retry attempt is completed and could also destroy existing documents indexed and acknowledged before the reset. Relates to #22043
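The retry logic described in the commit message can be sketched in miniature. This is an illustrative Python model of the idea (fresh recovery id per attempt, old attempt fully released first), not the actual Elasticsearch Java implementation; all names here are simplified stand-ins:

```python
import itertools

class RecoveriesCollection:
    """Toy model of per-target recovery tracking (not the real ES class)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._recoveries = {}          # recovery id -> target state

    def start_recovery(self, shard):
        rid = next(self._next_id)      # fresh id for every attempt
        self._recoveries[rid] = {"shard": shard, "stage": "INIT"}
        return rid

    def retry_recovery(self, old_id):
        # Fully release the old attempt before starting the new one,
        # mirroring "waits for the first recovery attempt to be fully
        # finished" in the commit message.
        old = self._recoveries.pop(old_id)
        return self.start_recovery(old["shard"])

    def handle_source_callback(self, rid):
        # A source still calling back with the stale id is rejected
        # instead of acting on the restarted recovery.
        return rid in self._recoveries

collection = RecoveriesCollection()
first = collection.start_recovery("logstash-na-nginx-2016.12.12[6]")
second = collection.retry_recovery(first)   # network disconnect -> retry
assert first != second                      # ids are never reused
assert not collection.handle_source_callback(first)   # stale id rejected
assert collection.handle_source_callback(second)
```

With the pre-fix behavior (reusing `first` for the retry), the stale callback check above would be impossible, which is exactly how two concurrent recoveries for the same target could run on the source.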
As some of the conversation went by e-mail (sharing private logs etc.), I'm going to quickly summarize our findings here. The endless recovery loop can happen when the initial connection between recovery target and source is prematurely closed. A fix has been made in #22325 that detects this situation and correctly initiates a second recovery attempt instead of looping endlessly. The fix will be released as part of ES v5.1.2. We've also investigated why the connection could have been prematurely closed (and suggest a fix, see below). In this specific case we think that it was triggered by an inactivity timeout on the connection. The reason for this can be best explained by a visualization of how the recovery process proceeds at the connection level:
This should illustrate that while the files are being sent from the recovery source to the target there is an idling channel (channel 1) waiting for the recovery to finish. @blinken confirmed that recoveries / relocations take multiple hours (large shards over slow connections). We therefore suggest configuring / enabling the following connection keep-alive options:
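The concrete option list was attached to the original thread and is not reproduced here. Based on the settings confirmed later in this thread, a hedged sketch of the relevant knobs (only tcp_keepalive_time=300 is a value actually reported as deployed; the other values are illustrative examples of the standard Linux companion sysctls):

```shell
# OS-level TCP keep-alive: probe idle connections after 300s
# (the value reported as deployed later in this thread).
sysctl -w net.ipv4.tcp_keepalive_time=300
# Standard companion sysctls; these values are examples, not recommendations
# taken from this thread.
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=5

# Application-level pings on Elasticsearch transport connections,
# set in elasticsearch.yml:
#   transport.ping_schedule: 5s
```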
Just to close the circle on my end, the issue has not reoccurred in the last two weeks after setting the sysctl net.ipv4.tcp_keepalive_time=300. I also changed cluster.routing.allocation.node_concurrent_recoveries from 8 to 4, which would reduce network load. I have not deployed transport.ping_schedule (but I plan to). Thanks again to everyone for the assistance here.
@blinken Thanks for the update. I'm closing the issue now as we have a fix for v5.1.2 and a confirmed workaround. |
Hi,
Overnight our Elasticsearch 5.0.0 cluster stopped ingesting data. There seemed to be two (possibly related) faults.
Firstly, the master had a 47GB log file and was spamming these messages as fast as it could:
Note our master-eligible nodes also hold data; however, the shard referenced is hosted on a different node (dal154). There were no errors in the logs on that node. Disabling allocation, restarting dal154, and re-enabling allocation did not stop the master from printing these error messages. I then performed the same disable/restart/enable cycle on the master (which caused it to fail over to another node), and that seems to have resolved the issue.
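The disable/restart/enable cycle described above uses the standard cluster allocation setting. A hedged sketch of the request bodies involved (the endpoint and host are assumptions; the bodies would be PUT to /_cluster/settings before and after the node restart):

```python
import json

# Toggle bodies for cluster.routing.allocation.enable, as used in the
# disable -> restart node -> enable workaround described above.
DISABLE = {"transient": {"cluster.routing.allocation.enable": "none"}}
ENABLE = {"transient": {"cluster.routing.allocation.enable": "all"}}

def settings_body(action):
    """Serialize the cluster-settings body for 'disable' or 'enable'."""
    return json.dumps(DISABLE if action == "disable" else ENABLE)

print(settings_body("disable"))
# To apply (not run here):
#   PUT http://localhost:9200/_cluster/settings  with the "disable" body
#   ... restart the node ...
#   PUT http://localhost:9200/_cluster/settings  with the "enable" body
```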
At the same time, we had three indices which seemed to be stuck in an endless loop trying to relocate. Here's a screenshot of the Marvel interface (the stage would oscillate between "INIT" and "TRANSLOG" without the bytes/files indicators increasing):
At the time, our cluster routing allocation settings required that these indices move on to {dal276,dal277}-* (cold storage).
Here are the entries for these shards from /_cluster/state/routing_table (I've translated the node IDs):
To fix these shards, I tried running the /_cluster/reroute commands "move" and "cancel", but both the source and the destination nodes denied owning the shard:
Typical response:
^ same error for node=dal276-2 and the other two broken shards
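For reference, a reroute "cancel" attempt like the one described would be built along these lines. The index, shard, and node values below are taken from shards mentioned in this thread, but the exact commands that were run are not shown above, so treat this as an illustrative sketch:

```python
import json

# Build a /_cluster/reroute "cancel" body. Index/shard/node values are
# illustrative, borrowed from shards discussed in this thread.
cancel_command = {
    "commands": [
        {"cancel": {
            "index": "logstash-na-nginx-2016.12.12",
            "shard": 6,
            "node": "dal276-2",
            "allow_primary": False,
        }}
    ]
}
body = json.dumps(cancel_command)
print(body)
# To apply against a live cluster (not run here):
#   POST http://localhost:9200/_cluster/reroute  with the body above
```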
I put the whole cluster into debug mode and captured the following logs on dal276-2:
There were no logs mentioning any of the three indices on dal140 or the master.
In the end I stopped allocation and restarted dal276-2 and dal276-3 and the issue appears to be resolved - the following three full recoveries are taking place as the cluster boots up again (only one recovery is for one of the original broken indices):
Here's the /_cluster/routing/allocation output for the three shards now:
Elasticsearch version:
5.0.0
Plugins installed:
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [aggs-matrix-stats]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [ingest-common]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-expression]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-groovy]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-mustache]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [lang-painless]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [percolator]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [reindex]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [transport-netty3]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded module [transport-netty4]
[2016-12-08T03:11:50,498][INFO ][o.e.p.PluginsService ] [dal276-3] loaded plugin [x-pack]
JVM version:
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
OS version:
Debian GNU/Linux 7.10 (wheezy)
Linux dal276 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
The Elasticsearch cluster should not stop ingesting logs. Relocation should not get into a state requiring a node restart to resolve.
Steps to reproduce:
Unknown.
Provide logs (if relevant):
See above