
Cluster hangs and node disconnects due to excessive traffic on transport layer network card stopping pings #19646

Closed
Cardy165 opened this issue Jul 28, 2016 · 4 comments
Labels: discuss, :Distributed/Distributed
Cardy165 commented Jul 28, 2016

Elasticsearch version:
Tested on v2.3.1 and v2.3.4 of Elasticsearch

JVM version:
"version" : "1.8.0_92",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.92-b14",

OS version:
CentOS Linux release 7.2.1511 (Core)

Description of the problem including expected versus actual behavior:

Background information
Our clusters consist of approximately 1,500 indexes (5 shards per index), and we are running a group of aggregated queries across 1,175 of the available indexes. On the test system in question (although the issue affects both our development environment and our much more powerful live environment) there is between 1.5 and 2.5 TB of data (including 1 replica per shard).

Expected
The frontend of our system issues complex queries, often several at once, each containing a number of aggregations. The queries normally run on the ES backend and the frontend code then renders the results to the users.

Cluster state remains Green.

Actual
We start the queries from the frontend with the cluster running normally and only occasionally logging GC entries.

The cluster starts processing as expected. After a short amount of time one or more cluster nodes get disconnected from the cluster. This is seen in the logs on both the data node and the master instance it was attempting to communicate with.

Master log entry

[2016-07-27 15:51:14,198][DEBUG][action.admin.cluster.node.stats] [qa-es-01-master] failed to execute on node [0Ob8bwWzRJ2MmFoteZ-o1g]
ReceiveTimeoutTransportException[[qa-es-02-01][192.168.253.2:9300][cluster:monitor/nodes/stats[n]] request_id [177174] timed out after [15000ms]]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-07-27 15:52:06,645][INFO ][cluster.routing.allocation] [qa-es-01-master] Cluster health status changed from [GREEN] to [RED] (reason: [[{qa-es-02-01}{0Ob8bwWzRJ2MmFoteZ-o1g}{192.168.253.2}{192.168.253.2:9300}{host=qa-es-02, master=false}] failed]).
[2016-07-27 15:52:06,650][INFO ][cluster.service          ] [qa-es-01-master] removed {{qa-es-02-01}{0Ob8bwWzRJ2MmFoteZ-o1g}{192.168.253.2}{192.168.253.2:9300}{host=qa-es-02, master=false},}, reason: zen-disco-node_failed({qa-es-02-01}{0Ob8bwWzRJ2MmFoteZ-o1g}{192.168.253.2}{192.168.253.2:9300}{host=qa-es-02, master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout

Failed node log entry

[2016-07-27 15:52:28,547][WARN ][monitor.jvm              ] [qa-es-02-01] [gc][old][8175][2] duration [1.9m], collections [1]/[1.9m], total [1.9m]/[1.9m], memory [23.3gb]->[17.2gb]/[23.8gb], all_pools {[young] [838.6mb]->[19.2mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [22.4gb]->[17.2gb]/[22.9gb]}
[2016-07-27 15:52:28,618][INFO ][discovery.zen            ] [qa-es-02-01] master_left [{qa-es-01-master}{bt2JqIWeRvaLg1MdExafRQ}{192.168.253.1}{192.168.253.1:9302}{host=qa-es-01, data=false, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2016-07-27 15:52:28,621][WARN ][discovery.zen            ] [qa-es-02-01] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{magnesium}{ucSxWsLbS_CFHzrgL-9bLw}{10.91.119.10}{10.91.119.10:9300}{data=false, master=false},{qa-es-02-master}{N-b9DErbTz2-N7DOYVx3yQ}{192.168.253.2}{192.168.253.2:9302}{host=qa-es-02, data=false, master=true},{qa-es-03-02}{M5sKFMoNSEKQFTFQjwQqyA}{192.168.253.3}{192.168.253.3:9301}{host=qa-es-03, master=false},{qa-es-01-01}{oM7dd8SjTNyTTXPn-65k6g}{192.168.253.1}{192.168.253.1:9300}{host=qa-es-01, master=false},{qa-es-03-master}{fQD12LrVRHWNmdTBHP-tNg}{192.168.253.3}{192.168.253.3:9302}{host=qa-es-03, data=false, master=true},{qa-es-01-02}{pgqJ5ybRQH2RGnxK7VkYrA}{192.168.253.1}{192.168.253.1:9301}{host=qa-es-01, master=false},{qa-es-03-01}{pCwkC8fCQj-B-wIWxCCxww}{192.168.253.3}{192.168.253.3:9300}{host=qa-es-03, master=false},{qa-es-02-01}{0Ob8bwWzRJ2MmFoteZ-o1g}{192.168.253.2}{192.168.253.2:9300}{host=qa-es-02, master=false},{qa-es-02-02}{Vi5LgogOQNehMD5Yze5l5g}{192.168.253.2}{192.168.253.2:9301}{host=qa-es-02, master=false},}
[2016-07-27 15:52:28,625][INFO ][cluster.service          ] [qa-es-02-01] removed {{qa-es-01-master}{bt2JqIWeRvaLg1MdExafRQ}{192.168.253.1}{192.168.253.1:9302}{host=qa-es-01, data=false, master=true},}, reason: zen-disco-master_failed ({qa-es-01-master}{bt2JqIWeRvaLg1MdExafRQ}{192.168.253.1}{192.168.253.1:9302}{host=qa-es-01, data=false, master=true})

Each physical host runs 3 nodes: 1 master node and 2 data nodes. Host awareness is also set for the nodes, as sketched below.
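For context, that host awareness is a custom node attribute plus the allocation awareness setting; a minimal sketch in 2.x syntax, matching the host=qa-es-02 attribute visible in the logs above (values illustrative):

```yaml
# elasticsearch.yml on each node running on physical host qa-es-02 (illustrative)
node.host: qa-es-02                                    # custom attribute, appears as host=qa-es-02 in the logs
cluster.routing.allocation.awareness.attributes: host  # keep primaries and replicas on different hosts
```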

The cluster state changes to red (even though there are replicas available). From this point on the cluster will normally still respond to queries such as /_cluster/health and /_nodes, but the failed nodes will not rejoin the cluster.

If I try to use the OS command to stop nodes that have timed out, the command is ignored and just hangs; I have to kill -9 the process to stop the instance. This needs to be done on both the master and the node that failed. Usually even killing those 2 instances does not help, and the cluster continues to throw errors about being unable to ping.

Even after stopping and restarting the individual instances, they still fail to connect back to the rest of the cluster, reporting timeouts to other instances' IPs, even though I can ping all the IPs of the cluster.

I eventually realised that the problem is the bandwidth available on the network card used for the transport layer. Once this becomes saturated with traffic from the cluster, the pings between nodes become queued at the network interface; by the time they are processed, the other instances have already timed out the expected ping.
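One way to confirm this kind of saturation while a query runs is to watch the interface counters (a sketch, assuming the sysstat package that ships with CentOS 7):

```sh
# Sample the transport interface once per second; if rxkB/s or txkB/s sits
# near the line rate (~125,000 kB/s for 1 Gb/s), the card is saturated and
# ping packets queue behind the bulk search traffic.
sar -n DEV 1 | grep -E 'IFACE|em2'
```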

To confirm this was the issue, I reset our cluster and changed its network layout. Rather than having 2 network cards (em1 for the http traffic and em2 for the transport layer traffic), I set up the machines with:

em1 - http traffic (1 Gb/s)
em2 and em3 as a bonded interface, bond0, in mode 0 (round robin), giving a single 2 Gb/s interface.
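For reference, the bond was configured with the standard CentOS 7 network scripts, roughly as follows (device names and IP are illustrative):

```sh
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=balance-rr miimon=100"   # mode 0, round robin
IPADDR=192.168.253.2
PREFIX=24
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-em2 (ifcfg-em3 is identical apart from DEVICE)
DEVICE=em2
TYPE=Ethernet
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```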

Running the same query as above then completed without issue and there were no errors on the cluster. This would seem like the solution; however, the number of layers of aggregation in our system can change dynamically, and even with the 2 x 1 Gb/s interfaces acting as one, adding another aggregation caused the same original problem.

Steps to reproduce:

  1. On a cluster with a large amount of data and a large number of indexes and shards, start a complex aggregation query. The query needs to create enough traffic to overwhelm the network card's available bandwidth.
  2. Because the network card used for the transport layer is saturated, the cluster pings fail to arrive in a timely manner, so the master removes the node from the cluster and the node thinks the master has gone away (the fault-detection settings involved are sketched below).
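For reference, the "tried [3] times, each with maximum [30s] timeout" in the log lines above corresponds to the zen fault-detection settings; the values below are the 2.x defaults (raising them would only mask the saturation, not fix it):

```yaml
# elasticsearch.yml: zen fault detection defaults in 2.x
discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other
discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping response
discovery.zen.fd.ping_retries: 3     # consecutive failures before the node is dropped
```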


Thoughts on a solution
In my previous experience, clusters would normally have some disk quorum device which arbitrates similar issues. I think with Elasticsearch this isn't required, but being able to define a dedicated LAN (possibly multiple LANs, to allow redundancy), in the same way you can split http and transport traffic, would remove this problem entirely. As an example, a solution such as:

em1 - HTTP Traffic
em2 - Interconnect between nodes traffic (possibly including cluster state traffic)
em3 - transport layer traffic (i.e. results of searches, indexing etc......)

In this sense, bonding or teaming of cards at the OS level would allow a user to provide resiliency, while the segregation of traffic would protect the cluster from saturating the transport layer with network IO.

I also have a post on the discussion forum for the same issue: https://discuss.elastic.co/t/cluster-nodes-get-disconnected-and-out-of-sync-due-to-ping-timeouts-caused-by-transport-load/56505/

Kind Regards

Lee

clintongormley commented Jul 28, 2016
The problem here is the long GCs you're getting, essentially because you're overwhelming the heap with your aggregations (probably because you're generating way too many buckets).

You need to simplify your aggs structure. Imagine if you have 1,000 buckets and each has another 1,000 buckets, and each of those has another 1,000 buckets. That's a billion buckets! It's just not going to work. In fact, we've recently added a circuit breaker which will prevent such aggs from running #19394

Cardy165 (author) commented Jul 28, 2016
That would make sense if our query was generating thousands of buckets.

The first aggregation returns 2 buckets.
The second, as documented in the linked discussion, is fixed at 5 buckets.
The last aggregation could be one of several and in the worst case would be 8,000 buckets (a sketch of the shape is below).
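Purely for illustration, a three-level aggregation of that shape would look something like this (field names are hypothetical, loosely based on the linked discussion):

```
POST /index-*/_search
{
  "size": 0,
  "aggs": {
    "by_sex": {
      "terms": { "field": "sex" },
      "aggs": {
        "by_age": {
          "range": {
            "field": "age",
            "ranges": [
              { "to": 18 }, { "from": 18, "to": 30 }, { "from": 30, "to": 45 },
              { "from": 45, "to": 65 }, { "from": 65 }
            ]
          },
          "aggs": {
            "by_code": { "terms": { "field": "code", "size": 8000 } }
          }
        }
      }
    }
  }
}
```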

There is GC going on on the cluster; however, doubling the capacity of the interface used for transport allows the query to run and return results. If the issue were purely a GC one, then increasing the bandwidth available to the transport layer would not have changed the original behaviour.

clintongormley commented Jul 28, 2016
8,000 buckets x 1,175 indices x 5 shards = 47 million buckets that have to be handled by the coordinating node.

Most of this traffic is across the transport layer, not http, so I very much doubt that separating the interfaces would really help here. That said, I'll reopen this for further discussion.

/cc @bleskes

bleskes commented Jul 28, 2016

> The first aggregation returns 2 buckets.
> The second, as documented in the linked discussion, is fixed at 5 buckets.
> The last aggregation could be one of several and in the worst case would be 8,000 buckets.

This is much more than @clintongormley said, as the 8,000 buckets on the lowest level need to be multiplied by the upper levels as well. Assuming sex (see the linked discussion for how I came up with this) has just two values, that will be 2 x 5 (for age ranges) x the lower-level buckets x 1,175 indices x 5 shards.
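(Worked through: 2 x 5 x 8,000 = 80,000 buckets per shard response in the worst case; across 1,175 indices x 5 shards that is roughly 470 million bucket entries for the coordinating node to reduce.)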

> Most of this traffic is across the transport layer, not http, so I very much doubt that separating the interfaces would really help here. That said, I'll reopen this for further discussion.

Agreed. All the traffic here is very likely on the transport layer, so separating the interfaces won't make much difference. You can also try it out, since you can bind the HTTP host to a different IP than the transport one. See here.
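A minimal sketch of that split, assuming the interfaces described earlier (IPs illustrative; 2.x lets you override the bind host per layer):

```yaml
# elasticsearch.yml: serve HTTP on a different interface than transport
network.host: 192.168.253.2   # transport (node-to-node) traffic stays on em2/bond0
http.bind_host: 10.0.0.2      # REST/HTTP traffic answered on em1
http.publish_host: 10.0.0.2
```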

All in all I think you have two issues:

  1. Long GCs cause nodes to not respond to pings - I see a 1.9m GC cleaning about 6 GB of memory.
  2. Network saturation (presumably by queries) which prevents the cluster from operating normally. For pings and everything else, ES nodes need to be able to communicate freely with each other.

You should either increase the capacity of your cluster to match what you're doing with it, or try reducing the number of shards and see how that helps.
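A quick way to gauge how heavy the current sharding is before deciding (the _cat API is available in 2.x):

```sh
# Count shards per node; with ~1,500 indexes x 5 shards x 2 copies this
# will be several thousand shards spread over the 6 data nodes.
curl -s 'localhost:9200/_cat/shards?h=node' | sort | uniq -c
```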

Closing this again. I suggest you continue the discussion on the thread you already have going with @danielmitterdorfer on discuss.elastic.co : https://discuss.elastic.co/t/cluster-nodes-get-disconnected-and-out-of-sync-due-to-ping-timeouts-caused-by-transport-load/56505

@bleskes bleskes closed this as completed Jul 28, 2016
@clintongormley clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018