
Nodes leaving cluster #17501

Closed
NicoYUE opened this issue Apr 4, 2016 · 8 comments
NicoYUE commented Apr 4, 2016

Hello,

I'm using Elasticsearch 2.3 with JVM 1.7 on CentOS 6.6.

I recently set up an Elasticsearch cluster with 14 nodes, but I've run into a problem I can't explain.

Sometimes nodes leave the cluster for no apparent reason, and they won't come back unless I restart the instance.

It's quite annoying because it disrupts shard allocation and so on.

I don't really understand it, because on every single instance I set

discovery.zen.ping.unicast.hosts

with the hosts of all the instances.
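For reference, a minimal sketch of what that setting looks like in elasticsearch.yml; the host names below are placeholders, not the real (masked) hosts from this cluster:

```yaml
# elasticsearch.yml -- hypothetical excerpt, host names are placeholders
discovery.zen.ping.unicast.hosts: ["hdp1.1.prod2.es.example", "hdp1.2.prod2.es.example", "hdp2.1.prod2.es.example"]
```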

bleskes (Contributor) commented Apr 4, 2016

I think it's important to understand why the nodes left the cluster. Do you see anything in the logs, like long GC pauses? Do you monitor the memory usage of the nodes? Also, when a node leaves, do you see anything of note in that node's logs?

NicoYUE (Author) commented Apr 4, 2016

I just checked my logs and I don't see anything specific about nodes leaving the cluster. I only realize they are missing when I run a query or check KOPF.

I remember that when I first set things up, nodes would leave if they couldn't find an active host in:

discovery.zen.ping.unicast.hosts

About memory, I've set ES_HEAP_SIZE to 32g, but I don't know if that's relevant.

If one of the nodes leaves the cluster again, I'll try to get more info.

NicoYUE (Author) commented Apr 4, 2016

It just happened again; this is what my logs say:

[2016-04-04 14:43:59,950][WARN ][cluster.service          ] [hdp1.1.prod2.es.xxx] failed to disconnect to node [{hdp6.2.prod2.es.xxx}{-fIZEZwvTmmHFKGMgxXCdA}{192.168.10.6}{192.168.10.6:9301}]
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:05,318][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:05,322][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:05,324][DEBUG][action.admin.indices.get ] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:05,325][DEBUG][action.admin.cluster.health] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:09,901][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes@58f4c6e4
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:16,089][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout@778292ad
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:29,442][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:29,442][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:35,320][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-04-04 14:44:35,322][WARN ][rest.suppressed          ] /_cluster/state/master_node,routing_table,blocks/ Params: {metric=master_node,routing_table,blocks}
MasterNotDiscoveredException[null]
[2016-04-04 14:44:35,323][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-04-04 14:44:35,324][WARN ][rest.suppressed          ] /_cluster/settings Params: {}
MasterNotDiscoveredException[null]
[2016-04-04 14:44:35,326][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout@3e9e7840
java.lang.OutOfMemoryError: unable to create new native thread
[2016-04-04 14:44:35,325][DEBUG][action.admin.indices.get ] [hdp1.1.prod2.es.xxx] timed out while retrying [indices:admin/get] after failure (timeout [30s])
[2016-04-04 14:44:35,327][WARN ][rest.suppressed          ] /_aliases Params: {index=_aliases}
MasterNotDiscoveredException[null]
[2016-04-04 14:44:35,328][DEBUG][action.admin.indices.get ] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:35,332][DEBUG][action.admin.cluster.health] [hdp1.1.prod2.es.xxx] no known master node, scheduling a retry
[2016-04-04 14:44:59,443][DEBUG][action.admin.cluster.state] [hdp1.1.prod2.es.xxx] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-04-04 14:44:59,443][WARN ][threadpool               ] [hdp1.1.prod2.es.xxx] failed to run [threaded] org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout@189b1bb5
java.lang.OutOfMemoryError: unable to create new native thread

It reports a memory problem, but I've set 32g, which should be enough, and according to KOPF I'm not using a lot of memory either.

Each of my instances has the zen discovery hosts configured, so I don't know why they can no longer find the master, which never left the cluster.

Also, I noticed that the only nodes that leave every time are the ones where I run three instances on a single machine, with 32g for each instance; those machines should still have at least 100g of unused RAM.
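One hypothetical way to check the native-thread theory (a diagnostic of my own, not something suggested in the thread) is to compare the JVM's thread count against the per-user process limit, since on Linux each thread counts against that limit:

```shell
# Hypothetical diagnostic: count the native threads (NLWP) of a process
# and compare against the per-user process limit from `ulimit -u`.
ES_PID=$$   # substitute the real Elasticsearch PID, e.g. from `pgrep -f org.elasticsearch`
ps -o nlwp= -p "$ES_PID"   # number of native threads in that process
ulimit -u                  # max user processes (threads count toward this)
```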

bleskes (Contributor) commented Apr 4, 2016

This line indicates you are indeed having memory problems:

java.lang.OutOfMemoryError: unable to create new native thread

Once a node runs out of memory it becomes unreliable and indeed needs to be restarted. You mention that KOPF indicates you have enough memory. Can you elaborate?

jasontedor (Member) commented:

java.lang.OutOfMemoryError: unable to create new native thread

When you see the message "unable to create new native thread", it generally means that the elasticsearch user is hitting a limit on the number of processes it can create. The exact resolution varies from system to system, but in general on Linux you would look at /etc/security/limits.conf and ulimit -u.
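As a sketch of that check on Linux (the 4096 value below is purely illustrative, not a recommendation from this thread):

```shell
# Show the current max-user-processes limit for this shell;
# each JVM thread counts against this limit on Linux.
ulimit -u

# Hypothetical /etc/security/limits.conf entries to raise the limit
# for the elasticsearch user (takes effect on the next login/session):
#   elasticsearch  soft  nproc  4096
#   elasticsearch  hard  nproc  4096
```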

jasontedor (Member) commented:

About memory, I've set an ES_HEAP_SIZE of 32g but I don't know if it's relevant.

It's not relevant to the issue that you're seeing in the logs ("unable to create new native thread"). However, you'll actually see better results if you drop the heap slightly below 32g. On the version of Elasticsearch that you're running, when a node starts up you'll see a message that says

[2016-04-04 10:47:27,618][INFO ][env                      ] [Masked Marauder] heap size [31.9gb], compressed ordinary object pointers [false]

but if you drop the heap below 32g you'll be able to take advantage of compressed oops

[2016-04-04 10:48:20,133][INFO ][env                      ] [Brute II] heap size [31.1gb], compressed ordinary object pointers [true]

This actually gives you more usable heap, and the smaller pointers are friendlier to memory bandwidth and CPU caches.
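As a sketch, assuming the stock startup scripts that read ES_HEAP_SIZE, dropping the heap just under the compressed-oops threshold would look like:

```shell
# Keep the heap below ~32g so the JVM can use compressed ordinary
# object pointers (smaller references, more effective heap).
export ES_HEAP_SIZE=31g

# To check whether a given heap size still allows compressed oops
# (requires a JDK on the machine), one could run:
#   java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
```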

NicoYUE (Author) commented Apr 4, 2016

In KOPF I can check the heap usage: my instances usually use around 1GB out of a 31.81GB max, so they're nowhere near the limit.

That's why I find this really weird. I do have some nodes with a low max RAM, but they're not the ones leaving the cluster; they're actually quite stable.

I haven't checked limits.conf yet; will do it ASAP.

clintongormley commented:

@jasontedor has already provided the answer, and given that this topic is better suited to the forums than GitHub, I'm going to close this issue.

4 participants