
Unstable ES 6.3.0 cluster due to persistent shard lock acquisition exceptions #39259

Closed
andrejbl opened this issue Feb 21, 2019 · 13 comments
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. feedback_needed

Comments

@andrejbl

Elasticsearch version (bin/elasticsearch --version):

curl -s -XGET localhost:9200
{
  "name" : "redacted",
  "cluster_name" : "redacted",
  "cluster_uuid" : "vqMCItjLT5--emXo9fjQlA",
  "version" : {
    "number" : "6.3.0",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "424e937",
    "build_date" : "2018-06-11T23:38:03.357887Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Plugins installed: [repository-s3, discovery-ec2]

JVM version (java -version):

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux 4.14.97-74.72.amzn1.x86_64 #1 SMP Tue Feb 5 20:59:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

A rolling upgrade (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/rolling-upgrades.html) of a large cluster (40 data nodes, 240 shards, approx. 200GB of data per node) is causing cluster instability (the cluster flip-flops between yellow and green state for several hours) with lots of shard rebalancing operations going on. Shard lock failure exceptions are logged on the master node during that time:

java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:392) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:515) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:144) ~[elasticsearch-6.3.0.jar:6.3.0]
...
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [redacted][2]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:678) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:597) ~[elasticsearch-6.3.0.jar:6.3.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:329) ~[elasticsearch-6.3.0.jar:6.3.0]

We are also seeing a large number of node disconnected exceptions across multiple nodes logged by master at the same time:

[2019-02-21T14:53:16,707][WARN ][o.e.c.r.a.AllocationService] [redacted] failing shard [failed shard, shard [redacted][10], node[eyYZXHsETW-lBHB6y0SfzQ], [R], s[STARTED], a[id=PFHSM67dT2e1QSSfAR_LIg], message [failed to perform indices:data/write/bulk[s] on replica [redacted][10], node[eyYZXHsETW-lBHB6y0SfzQ], [R], s[STARTED], a[id=PFHSM67dT2e1QSSfAR_LIg]], failure [NodeNotConnectedException[[redacted][10.0.64.221:9300] Node not connected]], markAsStale [true]]

Steps to reproduce:

We are performing the following sequence of automated operations during each node maintenance:

  1. Disable shard allocation:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}
'
  2. stop the elasticsearch service (using sudo -i service elasticsearch stop)
  3. reboot the node for maintenance (takes around 1-2 minutes)
  4. start the elasticsearch service (using sudo -i service elasticsearch start)
  5. Enable shard allocation:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}
'
  6. wait for the cluster to get into green state (a scripted check is sketched below)
  7. wait 3 minutes before moving on to the next node
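
A scripted version of step 6 looks roughly like this, using the cluster health API (the 30m timeout is just an example value):

curl -s -XGET "localhost:9200/_cluster/health?wait_for_status=green&timeout=30m&pretty"

The call blocks until the cluster reaches green or the timeout elapses, and the "timed_out" field in the response shows which of the two happened.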

The issue happens consistently after the sequence above goes through 6-7 nodes. We have ingestion and search production load running against the cluster at all times, as we are not in a position to disable that during the rolling upgrade. Search load contains a significant amount of search scroll requests (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html).

How can we prevent this from happening? Are there any optimisations we can make to the process or settings we can tweak during the node updates?

@martijnvg
Member

Some instability is to be expected during a rolling upgrade, especially with ongoing ingestion and searches whilst upgrading. The cluster goes into a yellow state when a node is stopped, and after the node is started back up the cluster should, after some time, return to a green state. While this happens, node disconnects and error logs are expected too. As long as you're able to get back into a green state, there shouldn't be anything to worry about.

The issue happens consistently after the sequence above goes through 6-7 nodes.

This statement is not clear to me. So after the 6th / 7th node is restarted, extra unexpected instability occurs? Can you describe how this differs from the other node upgrades? Are you not able to get the cluster into a green state after starting a node and re-enabling allocation?

Search load contains a significant amount of search scroll requests

A scroll search is going to fail when it has a scroll context on a node that gets stopped. There isn't much that can be done about this during a rolling upgrade. The only thing I can think of is to move all shards away from a particular node before stopping it (for example with an allocation exclusion filter, sketched below), but that is going to increase the time it takes to do the rolling upgrade.
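
For illustration, draining a node before stopping it could look roughly like this, where es-data-4 is a placeholder for the node about to be rebooted:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "es-data-4"
  }
}
'

You would wait until no shards remain on that node, stop it, and set the exclusion back to null once the node has rejoined.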

@andrejbl
Author

Thanks @martijnvg - yes, we do see the normal yellow/green behaviour during the first few node reboots. But after updating a number of nodes, we see extra instability, with the cluster ending up in a yellow state for a prolonged period of time (it can be several hours), with shard rebalancing happening all the time and the particular exceptions above being logged during that time frame. This in turn forces us to delay other update operations significantly.

We would be fine if the exceptions were related to the node that got rebooted; however, node disconnect exceptions and shard locking exceptions are raised across a variety of unrelated data nodes, so we are not sure what is going on.

@martijnvg
Member

It is difficult to say why the yellow state persists for such a long period of time and why so many shard rebalancings are happening. Have you tried running the allocation explain API while this is happening? That may provide more detail on why specific shards are not allocated immediately. Also, are there shards that just take a long time to recover? (This can be checked via the recovery API; see the commands sketched below.)
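
Roughly, both checks can be run like this (active_only limits the recovery output to in-flight recoveries):

curl -s -XGET "localhost:9200/_cluster/allocation/explain?pretty"
curl -s -XGET "localhost:9200/_cat/recovery?v&active_only=true"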

@andrejbl
Author

andrejbl commented Feb 22, 2019

We've just had another occurrence. Running the allocation API didn't produce much:

curl -s -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[es-master-3.localdomain][redacted:9300][cluster:monitor/allocation/explain]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}

The recovery API is reporting that a few shards are initialising. At least one of those initialising shards (redacted to shard_name for PII purposes) was reported as failed in the cluster master logs before the shard initialisation started, with the error failed to perform indices:data/write/bulk[s] on replica:

[2019-02-22T12:32:38,022][WARN ][o.e.c.r.a.AllocationService] [es-master-3.localdomain] failing shard [failed shard, shard [shard_name][17], node[OYKCoUQOSVCODtzYYGEgiQ], [R], s[STARTED], a[id=cr3JRYRyRWG2XEoKE5ygnQ], message [failed to perform indices:data/write/bulk[s] on replica [shard_name][17], node[OYKCoUQOSVCODtzYYGEgiQ], [R], s[STARTED], a[id=cr3JRYRyRWG2XEoKE5ygnQ]], failure [NodeNotConnectedException[[es-data-4.localdomain][redacted:9300] Node not connected]], markAsStale [true]]

The node the failing shard was on [OYKCoUQOSVCODtzYYGEgiQ] was a different one to the node that was last restarted [GpTaBHPhTDawpGkTIf3lRw].

@martijnvg
Member

"reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"

This is good; it means that all shards have been allocated or are in the process of being allocated.

reported as failed in the cluster master logs before the shard initialisation started with the error failed to perform indices:data/write/bulk[s]

Errors like that one are likely to happen when nodes are being stopped and there are in-flight write requests. It means the primary was unable to replicate a write to a replica shard because the node holding it is no longer available (when that replica shard reappears on another node, it will have that write).

The node the failing shard was on [OYKCoUQOSVCODtzYYGEgiQ] was a different one to the node that was last restarted [GpTaBHPhTDawpGkTIf3lRw].

Other nodes can log warnings / errors because a node is restarted (most likely the elected master node, because it coordinates a lot of things, for example shard allocation).

Maybe getting back into a green state sometimes takes longer because shards are being rebalanced. There is a separate setting that controls shard rebalancing. Can you set cluster.routing.rebalance.enable to none (via the cluster update settings API, sketched below) and set it back to null when all nodes have been upgraded?
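
A sketch of that settings change, using the same cluster update settings API as above:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none"
  }
}
'

Setting the value back to null afterwards removes the override and restores the default rebalancing behaviour.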

@colings86 colings86 added the :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. label Feb 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@andrejbl
Author

andrejbl commented Feb 27, 2019

So, we've changed the approach: we set cluster.routing.rebalance.enable to none before the maintenance window and back to null once the rolling upgrade completed on all nodes. That indeed minimised the instability between individual node restarts, but:

  1. we are still seeing the failed shard exceptions causing shard re-initialisation in between the node restarts:
failing shard [failed shard, shard [redacted][5], node[GLG7S_x_Sa25ey3vjmX_PA], [R], s[STARTED], a[id=3jAVAHveT1eAsf75zN4hmw], message [failed to perform indices:data/write/bulk[s] on replica [redacted][5]
  2. once we set cluster.routing.rebalance.enable back to null after all the nodes are upgraded, the cluster gets unstable for a really long period of time (the first orange bar indicates the point in time when we re-enabled rebalancing, red bar spikes are node restarts):

[screenshot: 2019-02-27 at 15:56:26]

  3. after we re-enable rebalancing we see lots of node disconnect and shard lock exceptions causing yet more failed shards and shard initialisations.

The requests which fail and in turn cause shards to be marked as failed are all bulk index operations (all of our indexing operations on this cluster use bulk indexing):

failed to perform indices:data/write/bulk[s]

@andrejbl
Author

andrejbl commented Mar 1, 2019

After yet another occurrence of the issue, we managed to find the exact log lines from the data node that held the primary shard and attempted the failing data/write/bulk operation against the node that held the replica (the replica node was one of the nodes on which the failed shard request was flagged at the time):

[2019-02-28T11:50:27,603][WARN ][o.e.a.b.TransportShardBulkAction] [redacted] [[redacted][5]] failed to perform indices:data/write/bulk[s] on replica [redacted][6], node[uTjDq2o6Qq6Y0J-LRucdnQ], [R], s[STARTED], a[id=HuBuiJ1ORvyF4qDr2CNpug]
[2019-02-28T11:50:27,614][WARN ][o.e.a.b.TransportShardBulkAction] [redacted] [[redacted][5]] failed to perform indices:data/write/bulk[s] on replica [redacted][6], node[uTjDq2o6Qq6Y0J-LRucdnQ], [R], s[STARTED], a[id=HuBuiJ1ORvyF4qDr2CNpug]
[2019-02-28T11:50:27,661][WARN ][o.e.a.b.TransportShardBulkAction] [redacted] [[redacted][5]] failed to perform indices:data/write/bulk[s] on replica [redacted][6], node[uTjDq2o6Qq6Y0J-LRucdnQ], [R], s[STARTED], a[id=HuBuiJ1ORvyF4qDr2CNpug]

If we are reading these correctly, looking at the timestamps it seems as if the retry policy is really aggressive (3 tries in less than 100ms?), so if there are any network issues it is really likely requests would fail. Is this retry behaviour to be expected? Is there any way we can tweak replica writes to make them more resilient to network issues?

@DaveCTurner
Contributor

Hi @andrejbl, the messages you are questioning are to be expected when a node shuts down while there is still ongoing indexing. They mean that the primary tried to replicate an indexing request but then discovered the replica was no longer there. This is normal.

The three messages you quote are not retries, they result from three different indexing requests. If an indexing request fails to reach a replica then there are no retries.

I think this conversation is probably better suited to the discussion forum - we prefer to keep GitHub for confirmed bug reports and feature requests. Could you please open a thread on the discussion forum and link to it from here to continue the conversation?

@andrejbl
Author

@DaveCTurner - thanks for the clarification. Before I go further and open a discussion in the forum, there are two points I would like to get clarity on from you, as I still consider this to be incorrect behaviour (and hence a bug):

  1. The messages I posted were not from the node that shut down. They were actually from a live node that was unrelated to the node shutting down, but the failed indexing operation coincided with a node going offline (the node going offline held neither the primary nor any replicas of the shard that failed).
  2. You say that if an indexing request fails to reach a replica then there are no retries - is there a particular design reason for such a non-hardened behaviour? In the case of intermittent network failures, that would lead to exactly the behaviour we are seeing (shard failures and rebalancing).

@DaveCTurner
Contributor

@andrejbl these are both good questions for the forum, and we'll do our best to answer them there.

@DaveCTurner
Contributor

Hi @andrejbl I've not seen a post from you in the forum but perhaps I've missed it. Could you add a link to your thread here?

@andrejbl
Author

Just created a post there: https://discuss.elastic.co/t/elasticsearch-6-3-0-doesnt-retry-on-index-replica-bulk-write-failure/174305 - appreciate any answers.
