Unstable ES 6.3.0 cluster due to persistent shard lock acquisition exceptions #39259
Some instability is to be expected during a rolling upgrade, especially with ongoing ingestion and searches whilst upgrading. The cluster goes into a yellow state when a node is stopped, and when a node is started back up the cluster should, after some time, go back to a green state. While this happens, node disconnects and error logs are expected too. As long as you're able to get back into a green state, there shouldn't be anything to worry about.
This statement is not clear to me. So after the 6th or 7th node is restarted, extra unexpected instability occurs? Can you describe how this differs from upgrading the other nodes? Are you not able to get the cluster into a green state after starting a node and re-enabling allocation?
A scroll search is going to fail when it has a scroll on a node that gets stopped. There isn't much that can be done about this during a rolling upgrade. The only thing I can think of is to move all shards away from a particular node before stopping it, but that is going to increase the time it takes to do the rolling upgrade.
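Moving all shards off a node before stopping it can be done with allocation filtering. A hedged sketch, assuming a node reachable on `localhost:9200`; the node name `node-to-stop` is a placeholder:

```sh
# Exclude the node so its shards are relocated elsewhere:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "node-to-stop"
  }
}'

# ...wait for the node to drain, stop/upgrade/start it, then clear the filter:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.exclude._name": null
  }
}'
```

As noted above, this trades a longer per-node maintenance window for fewer in-flight operations hitting a stopping node.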
Thanks @martijnvg - yes, we do see the normal yellow/green behaviour during the first few node reboots. But after updating a number of nodes, we see extra instability, with the cluster ending up in a yellow state for a prolonged period of time (can be several hours), shard rebalancing happening all the time, and the particular exceptions above logged during that time frame. This in turn forces us to delay other update operations significantly. We would be fine if the exceptions were related to the node that got rebooted; however, node disconnect exceptions and shard locking exceptions are raised across a variety of unrelated data nodes, so we are not sure what is going on.
It is difficult to say why the yellow state persists for such a long period of time and why so many shard rebalancings are happening. Have you tried running the allocation explain API while this is happening? That might provide more details on why specific shards are not allocated immediately. Also, are there shards that just take a long time to recover? (You can check this via the recovery API.)
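For reference, the two diagnostic calls suggested here look like this (assuming a node reachable on `localhost:9200`):

```sh
# Why is a specific shard unassigned or still initialising?
# With no body, this explains the first unassigned shard it finds.
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'

# Progress of ongoing shard recoveries; active_only=true hides completed ones.
curl -XGET 'localhost:9200/_cat/recovery?v&active_only=true'
```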
We've just had another occurrence. Running the allocation API didn't produce much:
Recovery API is reporting that a few shards are initialising. At least one of those initialising shards (redacted to
The node the failing shard was on [OYKCoUQOSVCODtzYYGEgiQ] was a different one to the node that was last restarted [GpTaBHPhTDawpGkTIf3lRw].
This is good, which means that all shards have been allocated or are in the process of being allocated.
Errors like that one are likely to happen when nodes are being stopped and there are in-flight write requests. It means the primary was unable to replicate a write to a replica shard because the node holding it is no longer available (when that replica shard reappears on another node, it will have that write).
Other nodes can log warnings / errors because a node is restarted (most likely the elected master node, because it coordinates a lot of things, for example shard allocation). Maybe getting back into a green state sometimes takes longer because shards are being rebalanced. There is a different setting that controls shard rebalancing. Can you set
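The comment above is cut off after "Can you set". For context only: in 6.x the cluster-level setting that controls shard rebalancing is `cluster.routing.rebalance.enable`, though we cannot be certain that is the setting the commenter meant. Temporarily disabling rebalancing would look like:

```sh
# Stop the cluster from rebalancing shards (transient, so it resets on
# full cluster restart); set back to "all" when maintenance is done.
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}'
```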
Pinging @elastic/es-distributed
So, we've tried to change the approach by setting
The requests which fail and in turn cause shards to be marked as failed are all bulk index operations (all of our indexing operations on this cluster use bulk indexing):
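As context for the bulk operations mentioned, here is a minimal sketch of how a `_bulk` payload is assembled. The index name, document id, and document body are placeholders; ES 6.x still requires a mapping type on the action line, shown here as `_doc`:

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON payload for the Elasticsearch _bulk API.

    Each document becomes two lines: an action/metadata line, then the
    document source itself. `docs` is an iterable of (id, source) pairs.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": "_doc", "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The _bulk endpoint requires a trailing newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body("my-index", [("1", {"field": "value"})])
print(body)
```

A failure to replicate any one of these index actions to a replica is what produces the log lines quoted below.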
After yet another occurrence of the issue, we managed to find the exact log lines of the data node that held the primary shard and tried the failing data/write/bulk operation on the node that held the replica (the replica node was one of the nodes on which the failed shard request was flagged at the time):
If we are reading these correctly, looking at the timestamps it seems as if the retry policy is really aggressive (3 tries in less than 100ms?), so if there are any network issues it is quite likely that requests will fail. Is this retry behaviour to be expected? Is there any way we can tweak replica writes to make them more resilient to network issues?
Hi @andrejbl, the messages you are questioning are to be expected when a node shuts down while there is still ongoing indexing. They mean that the primary tried to replicate an indexing request but then discovered the replica was no longer there. This is normal. The three messages you quote are not retries; they result from three different indexing requests. If an indexing request fails to reach a replica then there are no retries. I think this conversation is probably better suited to the discussion forum - we prefer to keep GitHub for confirmed bug reports and feature requests. Could you please open a thread on the discussion forum and link to it from here to continue the conversation?
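Since there are no server-side retries in this situation, clients typically retry failed bulk requests themselves. A minimal client-side sketch, illustrative only: `send_bulk` is a placeholder for whatever callable performs the HTTP request, and is assumed to raise on transient failure:

```python
import time

def bulk_with_retries(send_bulk, payload, max_retries=3, base_delay=0.5):
    """Retry a bulk request with exponential backoff.

    `send_bulk` performs the request and raises ConnectionError on a
    transient network failure; any other exception propagates immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return send_bulk(payload)
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries, surface the failure to the caller
            # back off before retrying: 0.5s, 1s, 2s, ...
            time.sleep(base_delay * (2 ** attempt))
```

Note this only papers over client-visible failures; it does not change how the primary handles a replica that disappears mid-replication.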
@DaveCTurner - thanks for the clarification. Before I go further and open a discussion in the forum, there are two points I would like to get clarity on from you, as I still consider this incorrect behaviour (and hence a bug):
@andrejbl these are both good questions for the forum, and we'll do our best to answer them there.
Hi @andrejbl I've not seen a post from you in the forum but perhaps I've missed it. Could you add a link to your thread here? |
Just created a post there: https://discuss.elastic.co/t/elasticsearch-6-3-0-doesnt-retry-on-index-replica-bulk-write-failure/174305 - appreciate any answers. |
Elasticsearch version (`bin/elasticsearch --version`): 6.3.0

Plugins installed: [repository-s3, discovery-ec2]

JVM version (`java -version`):

OS version (`uname -a` if on a Unix-like system):

Description of the problem including expected versus actual behavior:
Rolling upgrade (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/rolling-upgrades.html) of a large cluster (40 data nodes, 240 shards, approx. 200GB of data per node) is causing cluster instability issues (yellow/green state flip-flopping for several hours) with lots of shard rebalancing operations going on. Shard lock failure exceptions are being logged on the master node during that time:
We are also seeing a large number of node disconnected exceptions across multiple nodes logged by master at the same time:
Steps to reproduce:
We are performing the following sequence of automated operations during each node maintenance:
1. Stop Elasticsearch on the node (`sudo -i service elasticsearch stop`)
2. Start Elasticsearch on the node (`sudo -i service elasticsearch start`)

The issue happens consistently after the sequence above goes through 6-7 nodes. We have ingestion and search production load running against the cluster at all times, as we are not in a position to disable that during the rolling upgrade. Search load contains a significant amount of search scroll requests (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html).
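For reference, the rolling-upgrade guide linked above recommends disabling shard allocation around each node restart. A sketch of the per-node sequence, assuming a node reachable on `localhost:9200`:

```sh
# 1. Disable shard allocation before stopping the node:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient": { "cluster.routing.allocation.enable": "none" } }'

# 2. (Optional) synced flush to speed up shard recovery after restart:
curl -XPOST 'localhost:9200/_flush/synced'

# 3. Stop, upgrade, and restart the node:
sudo -i service elasticsearch stop
sudo -i service elasticsearch start

# 4. Re-enable allocation and wait for green before the next node:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient": { "cluster.routing.allocation.enable": "all" } }'
curl -XGET 'localhost:9200/_cat/health?v'
```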
How can we prevent this from happening? Are there any optimisations we can make to the process or settings we can tweak during the node updates?