
Unstable ES 6.3.0 cluster due to persistent shard lock acquisition exceptions #39259

Closed
andrejbl opened this issue Feb 21, 2019 · 13 comments
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. feedback_needed

Comments

@andrejbl

Elasticsearch version (bin/elasticsearch --version):

curl -s -XGET localhost:9200
{
  "name" : "redacted",
  "cluster_name" : "redacted",
  "cluster_uuid" : "vqMCItjLT5--emXo9fjQlA",
  "version" : {
    "number" : "6.3.0",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "424e937",
    "build_date" : "2018-06-11T23:38:03.357887Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Plugins installed: [repository-s3, discovery-ec2]

JVM version (java -version):

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux 4.14.97-74.72.amzn1.x86_64 #1 SMP Tue Feb 5 20:59:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

A rolling upgrade (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/rolling-upgrades.html) of a large cluster (40 data nodes, 240 shards, approx. 200GB of data per node) is causing cluster instability (the cluster flip-flops between yellow and green state for several hours) with lots of shard rebalancing operations going on. Shard lock failure exceptions are logged on the master node during that time:

java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:392) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:515) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:144) ~[elasticsearch-6.3.0.jar:6.3.0]
...
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [redacted][2]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:678) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:597) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:329) ~[elasticsearch-6.3.0.jar:6.3.0]

We are also seeing a large number of node disconnected exceptions across multiple nodes logged by master at the same time:

[2019-02-21T14:53:16,707][WARN ][o.e.c.r.a.AllocationService] [redacted] failing shard [failed shard, shard [redacted][10], node[eyYZXHsETW-lBHB6y0SfzQ], [R], s[STARTED], a[id=PFHSM67dT2e1QSSfAR_LIg], message [failed to perform indices:data/write/bulk[s] on replica [redacted][10], node[eyYZXHsETW-lBHB6y0SfzQ], [R], s[STARTED], a[id=PFHSM67dT2e1QSSfAR_LIg]], failure [NodeNotConnectedException[[redacted][10.0.64.221:9300] Node not connected]], markAsStale [true]]

Steps to reproduce:

We are performing the following sequence of automated operations during each node maintenance:

  1. Disable shard allocation:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}
'
  2. stop the elasticsearch service (using sudo -i service elasticsearch stop)
  3. reboot the node for maintenance (takes around 1-2 minutes)
  4. start the elasticsearch service (using sudo -i service elasticsearch start)
  5. Enable shard allocation:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}
'
  6. wait for the cluster to get into green state (a scripted check is sketched below)
  7. wait 3 minutes before moving on to the next node
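
A scripted version of step 6 looks roughly like this, using the cluster health API (the 30m timeout is just an example value):

curl -s -XGET "localhost:9200/_cluster/health?wait_for_status=green&timeout=30m&pretty"

The call blocks until the cluster reaches green or the timeout elapses, and the "timed_out" field in the response shows which of the two happened.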

The issue happens consistently after the sequence above goes through 6-7 nodes. We have ingestion and search production load running against the cluster at all times, as we are not in a position to disable that during the rolling upgrade. Search load contains a significant amount of search scroll requests (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html).

How can we prevent this from happening? Are there any optimisations we can make to the process or settings we can tweak during the node updates?

@martijnvg
Member

Some instability is to be expected during a rolling upgrade, especially with ongoing ingestion and searches whilst upgrading. The cluster goes into a yellow state when a node is stopped, and after the node is started back up the cluster should, after some time, return to a green state. While this happens, node disconnects and error logs are expected too. As long as you're able to get back into a green state, there shouldn't be anything to worry about.

The issue happens consistently after the sequence above goes through 6-7 nodes.

This statement is not clear to me. So after the 6th / 7th node is restarted, extra unexpected instability occurs? Can you describe how this differs from the other node upgrades? Are you not able to get the cluster into a green state after starting a node and re-enabling allocation?

Search load contains a significant amount of search scroll requests

A scroll search is going to fail when it has a scroll context on a node that gets stopped. There isn't much that can be done about this during a rolling upgrade. The only thing I can think of is to move all shards away from a particular node before stopping it (for example with an allocation exclusion filter, sketched below), but that is going to increase the time it takes to do the rolling upgrade.
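
For illustration, draining a node before stopping it could look roughly like this, where es-data-4 is a placeholder for the node about to be rebooted:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "es-data-4"
  }
}
'

You would wait until no shards remain on that node, stop it, and set the exclusion back to null once the node has rejoined.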

@andrejbl
Author

Thanks @martijnvg - yes, we do see the normal yellow/green behaviour during the first few node reboots. But after updating a number of nodes, we see extra instability, with the cluster ending up in a yellow state for a prolonged period of time (it can be several hours), with shard rebalancing happening all the time and the particular exceptions above being logged during that time frame. This in turn forces us to delay other update operations significantly.

We would be fine if the exceptions were related to the node that got rebooted; however, node disconnect exceptions and shard locking exceptions are raised across a variety of unrelated data nodes, so we are not sure what is going on.

@martijnvg
Member

It is difficult to say why the yellow state persists for such a long period of time and why so many shard rebalancings are happening. Have you tried running the allocation explain API while this is happening? That may provide more detail on why specific shards are not allocated immediately. Also, are there shards that just take a long time to recover? (This can be checked via the recovery API; see the commands sketched below.)
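
Roughly, both checks can be run like this (active_only limits the recovery output to in-flight recoveries):

curl -s -XGET "localhost:9200/_cluster/allocation/explain?pretty"
curl -s -XGET "localhost:9200/_cat/recovery?v&active_only=true"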

@andrejbl
Author

andrejbl commented Feb 22, 2019

We've just had another occurrence. Running the allocation API didn't produce much:

curl -s -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[es-master-3.localdomain][redacted:9300][cluster:monitor/allocation/explain]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}

The recovery API is reporting that a few shards are initialising. At least one of those initialising shards (redacted to shard_name for PII purposes) was reported as failed in the cluster master logs before the shard initialisation started, with the error failed to perform indices:data/write/bulk[s] on replica:

[2019-02-22T12:32:38,022][WARN ][o.e.c.r.a.AllocationService] [es-master-3.localdomain] failing shard [failed shard, shard [shard_name][17], node[OYKCoUQOSVCODtzYYGEgiQ], [R], s[STARTED], a[id=cr3JRYRyRWG2XEoKE5ygnQ], message [failed to perform indices:data/write/bulk[s] on replica [shard_name][17], node[OYKCoUQOSVCODtzYYGEgiQ], [R], s[STARTED], a[id=cr3JRYRyRWG2XEoKE5ygnQ]], failure [NodeNotConnectedException[[es-data-4.localdomain][redacted:9300] Node not connected]], markAsStale [true]]

The node the failing shard was on [OYKCoUQOSVCODtzYYGEgiQ] was a different one to the node that was last restarted [GpTaBHPhTDawpGkTIf3lRw].

@martijnvg
Member

"reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"

This is good; it means that all shards have been allocated or are in the process of being allocated.

reported as failed in the cluster master logs before the shard initialisation started with the error failed to perform indices:data/write/bulk[s]

Errors like that one are likely to happen when nodes are being stopped and there are in-flight write requests. It means the primary was unable to replicate a write to a replica shard because the node holding it is no longer available (when that replica shard reappears on another node, it will have that write).

The node the failing shard was on [OYKCoUQOSVCODtzYYGEgiQ] was a different one to the node that was last restarted [GpTaBHPhTDawpGkTIf3lRw].

Other nodes can log warnings / errors because a node is restarted (most likely the elected master node, because it coordinates a lot of things, for example shard allocation).

Maybe getting back into a green state sometimes takes longer because shards are being rebalanced. There is a separate setting that controls shard rebalancing. Can you set cluster.routing.rebalance.enable to none (via the cluster update settings API, sketched below) and set it back to null when all nodes have been upgraded?
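
A sketch of that settings change, using the same cluster update settings API as above:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none"
  }
}
'

Setting the value back to null afterwards removes the override and restores the default rebalancing behaviour.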

@colings86 colings86 added the :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. label Feb 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@andrejbl
Author

andrejbl commented Feb 27, 2019

So, we've changed the approach: we set cluster.routing.rebalance.enable to none before the maintenance window and back to null once the rolling upgrade completed on all nodes. That indeed minimised the instability between individual node restarts, but:

  1. we are still seeing the failed shard exceptions causing shard re-initialisation in between the node restarts:
failing shard [failed shard, shard [redacted][5], node[GLG7S_x_Sa25ey3vjmX_PA], [R], s[STARTED], a[id=3jAVAHveT1eAsf75zN4hmw], message [failed to perform indices:data/write/bulk[s] on replica [redacted][5]
  2. once we set cluster.routing.rebalance.enable back to null after all the nodes are upgraded, the cluster gets unstable for a really long period of time (the first orange bar indicates the point in time when we re-enabled rebalancing, red bar spikes are node restarts):

[screenshot: 2019-02-27 at 15:56:26]

  3. after we re-enable rebalancing we see lots of node disconnect and shard lock exceptions causing yet more failed shards and shard initialisations.

The requests which fail and in turn cause shards to be marked as failed are all bulk index operations (all of our indexing operations on this cluster use bulk indexing):

failed to perform indices:data/write/bulk[s]

@andrejbl
Author

andrejbl commented Mar 1, 2019

After yet another occurrence of the issue, we managed to find the exact log lines from the data node that held the primary shard and attempted the failing data/write/bulk operation against the node that held the replica (the replica node was one of the nodes on which the failed shard request was flagged at the time):

[2019-02-28T11:50:27,603][WARN ][o.e.a.b.TransportShardBulkAction] [redacted] [[redacted][5]] failed to perform indices:data/write/bulk[s] on replica [redacted][6], node[uTjDq2o6Qq6Y0J-LRucdnQ], [R], s[STARTED], a[id=HuBuiJ1ORvyF4qDr2CNpug]
[2019-02-28T11:50:27,614][WARN ][o.e.a.b.TransportShardBulkAction] [redacted] [[redacted][5]] failed to perform indices:data/write/bulk[s] on replica [redacted][6], node[uTjDq2o6Qq6Y0J-LRucdnQ], [R], s[STARTED], a[id=HuBuiJ1ORvyF4qDr2CNpug]
[2019-02-28T11:50:27,661][WARN ][o.e.a.b.TransportShardBulkAction] [redacted] [[redacted][5]] failed to perform indices:data/write/bulk[s] on replica [redacted][6], node[uTjDq2o6Qq6Y0J-LRucdnQ], [R], s[STARTED], a[id=HuBuiJ1ORvyF4qDr2CNpug]

If we are reading these correctly, looking at the timestamps it seems as if the retry policy is really aggressive (3 tries in less than 100ms?), so if there are any network issues it is really likely requests would fail. Is this retry behaviour to be expected? Is there any way we can tweak replica writes to make them more resilient to network issues?

@DaveCTurner
Contributor

Hi @andrejbl, the messages you are questioning are to be expected when a node shuts down while there is still ongoing indexing. They mean that the primary tried to replicate an indexing request but then discovered the replica was no longer there. This is normal.

The three messages you quote are not retries, they result from three different indexing requests. If an indexing request fails to reach a replica then there are no retries.

I think this conversation is probably better suited to the discussion forum - we prefer to keep GitHub for confirmed bug reports and feature requests. Could you please open a thread on the discussion forum and link to it from here to continue the conversation?

@andrejbl
Author

@DaveCTurner - thanks for the clarification. Before I go further and open a discussion in the forum, there are two points I would like to get clarity on from you, as I still consider this to be incorrect behaviour (and hence a bug):

  1. The messages I posted were not from the node that shut down. They were actually from a live node that was unrelated to the node shutting down, but the failed indexing operation coincided with a node going offline (the node going offline held neither the primary nor any replicas of the shard that failed).
  2. You say that if an indexing request fails to reach a replica then there are no retries - is there a particular design reason for such a non-hardened behaviour? In the case of intermittent network failures, that would lead to exactly the behaviour we are seeing (shard failures and rebalancing).

@DaveCTurner
Contributor

@andrejbl these are both good questions for the forum, and we'll do our best to answer them there.

@DaveCTurner
Contributor

Hi @andrejbl I've not seen a post from you in the forum but perhaps I've missed it. Could you add a link to your thread here?

@andrejbl
Author

Just created a post there: https://discuss.elastic.co/t/elasticsearch-6-3-0-doesnt-retry-on-index-replica-bulk-write-failure/174305 - appreciate any answers.
