
Nodes leaving cluster #17501

Closed
NicoYUE opened this issue Apr 4, 2016 · 8 comments
NicoYUE commented Apr 4, 2016

Hello,

I'm using Elasticsearch 2.3 with JVM 1.7 on CentOS 6.6.

I recently set up an Elasticsearch cluster with 14 nodes, but I've run into a problem I can't explain.

Sometimes nodes leave the cluster for no apparent reason, and they won't come back unless I restart the instance.

It's quite annoying because it disrupts shard allocation and so on.

I don't really understand it, because on every single instance I set

discovery.zen.ping.unicast.hosts

with the hosts of all the instances.
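For reference, a minimal sketch of what that setting looks like in elasticsearch.yml; the host names below are placeholders, not the real (masked) hosts from this cluster:

```yaml
# elasticsearch.yml -- hypothetical excerpt, host names are placeholders
discovery.zen.ping.unicast.hosts: ["hdp1.1.prod2.es.example", "hdp1.2.prod2.es.example", "hdp2.1.prod2.es.example"]
```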

bleskes (Contributor) commented Apr 4, 2016

I think it's important to understand why the nodes left the cluster. Do you see anything in the logs, like long GC pauses? Do you monitor the memory usage of the nodes? Also, when a node leaves, do you see anything of note in that node's logs?

NicoYUE (Author) commented Apr 4, 2016

I just checked my logs and I don't see anything specific about nodes leaving the cluster. I only realize they are missing when I run a query or check KOPF.

I remember that when I first set things up, nodes would leave if they couldn't find an active host in:

discovery.zen.ping.unicast.hosts

About memory, I've set ES_HEAP_SIZE to 32g, but I don't know if that's relevant.

If one of the nodes leaves the cluster again, I'll try to get more info.

NicoYUE (Author) commented Apr 4, 2016

It just happened again; this is what my logs say:

[2016-04-04 14:43:59,950][WARN ][cluster.service          ] [hdp1.1.prod2.es.xxx] failed to disconnect to node [{hdp6.2.prod2.es.xxx}{-fIZEZwvTmmHFKGMgxXCdA}{192.168.10.6}{192.168.10.6:9301}]
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:05,318][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:05,322][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:05,324][DEBUG][action.admin.indices.get ] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:05,325][DEBUG][action.admin.cluster.health] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:09,901][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes@58f4c6e4
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:16,089][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout@778292ad
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:29,442][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:29,442][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:35,320][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-04-04 14:44:35,322][WARN ][rest.suppressed          ] /_cluster/state/master_node,routing_table,blocks/ Params: {metric=master_node,routing_table,blocks}
MasterNotDiscoveredException[null]
[2016-04-04 14:44:35,323][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-04-04 14:44:35,324][WARN ][rest.suppressed          ] /_cluster/settings Params: {}
MasterNotDiscoveredException[null]
[2016-04-04 14:44:35,326][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout@3e9e7840
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:35,325][DEBUG][action.admin.indices.get ] [hdp1.1.prod2.es.xxx] timed out while retrying [indices:admin/get] after failure (timeout [30s])
[2016-04-04 14:44:35,327][WARN ][rest.suppressed          ] /_aliases Params: {index=_aliases}
MasterNotDiscoveredException[null]
[2016-04-04 14:44:35,328][DEBUG][action.admin.indices.get ] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:35,332][DEBUG][action.admin.cluster.health] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:59,443][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-04-04 14:44:59,443][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout@189b1bb5
java.lang.OutOfMemoryError: unable to create new native thread

It reports a memory problem, but I've set 32g, which should be enough, and according to KOPF I'm not using a lot of memory either.

Each of my instances has the zen discovery hosts configured, so I don't know why they can no longer find the master, which never left the cluster.

Also, I noticed that the only nodes that leave every time are the ones where I run three instances on a single machine, with 32g for each instance; those machines should still have at least 100g of unused RAM.
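One hypothetical way to check the native-thread theory (a diagnostic of my own, not something suggested in the thread) is to compare the JVM's thread count against the per-user process limit, since on Linux each thread counts against that limit:

```shell
# Hypothetical diagnostic: count the native threads (NLWP) of a process
# and compare against the per-user process limit from `ulimit -u`.
ES_PID=$$   # substitute the real Elasticsearch PID, e.g. from `pgrep -f org.elasticsearch`
ps -o nlwp= -p "$ES_PID"   # number of native threads in that process
ulimit -u                  # max user processes (threads count toward this)
```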

bleskes (Contributor) commented Apr 4, 2016

This line indicates you are indeed having memory problems:

java.lang.OutOfMemoryError: unable to create new native thread

Once a node runs out of memory it becomes unreliable and indeed needs to be restarted. You mention that KOPF indicates you have enough memory. Can you elaborate?

jasontedor (Member) commented:

java.lang.OutOfMemoryError: unable to create new native thread

When you see the message "unable to create new native thread", it generally means that the elasticsearch user is hitting a limit on the number of processes it can create. The exact resolution varies from system to system, but in general on Linux you would look at /etc/security/limits.conf and ulimit -u.
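As a sketch of that check on Linux (the 4096 value below is purely illustrative, not a recommendation from this thread):

```shell
# Show the current max-user-processes limit for this shell;
# each JVM thread counts against this limit on Linux.
ulimit -u

# Hypothetical /etc/security/limits.conf entries to raise the limit
# for the elasticsearch user (takes effect on the next login/session):
#   elasticsearch  soft  nproc  4096
#   elasticsearch  hard  nproc  4096
```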

jasontedor (Member) commented:

About memory, I've set an ES_HEAP_SIZE of 32g but I don't know if it's relevant.

It's not relevant to the issue that you're seeing in the logs ("unable to create new native thread"). However, you'll actually see better results if you drop the heap slightly below 32g. On the version of Elasticsearch that you're running, when a node starts up you'll see a message that says

[2016-04-04 10:47:27,618][INFO ][env                      ] [Masked Marauder] heap size [31.9gb], compressed ordinary object pointers [false]

but if you drop the heap below 32g you'll be able to take advantage of compressed oops

[2016-04-04 10:48:20,133][INFO ][env                      ] [Brute II] heap size [31.1gb], compressed ordinary object pointers [true]

This actually gives you more usable heap, and the smaller pointers are friendlier to memory bandwidth and CPU caches.
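As a sketch, assuming the stock startup scripts that read ES_HEAP_SIZE, dropping the heap just under the compressed-oops threshold would look like:

```shell
# Keep the heap below ~32g so the JVM can use compressed ordinary
# object pointers (smaller references, more effective heap).
export ES_HEAP_SIZE=31g

# To check whether a given heap size still allows compressed oops
# (requires a JDK on the machine), one could run:
#   java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
```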

NicoYUE (Author) commented Apr 4, 2016

In KOPF I can check the heap usage: my instances usually use around 1GB out of a 31.81GB max, so they're nowhere near the limit.

That's why I find this really weird. I do have some nodes with a low max RAM, but they're not the ones leaving the cluster; they're actually quite stable.

I haven't checked limits.conf yet; will do it ASAP.

clintongormley commented:

@jasontedor has already provided the answer, and given that this topic is better suited to the forums than GitHub, I'm going to close this issue.

4 participants