
master node was forced to rejoin #12415

Closed

chenryn opened this issue Jul 23, 2015 · 9 comments

Labels: :Distributed/Distributed, feedback_needed

chenryn commented Jul 23, 2015

Elasticsearch 1.6.0

The master node is 10.19.0.100; its es.log records the following. It discovers itself as "also master but with an older cluster_state", then forces itself to rejoin...

[2015-07-23 15:00:12,976][INFO ][cluster.service          ] [10.19.0.100] new_master [10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-23 15:00:14,356][INFO ][cluster.service          ] [10.19.0.100] added {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-receive(join from node[[10.19.0.69][Kn1ghOA4SFyI12Qo9ZEXEg][esnode069.mweibo.bx.sinanode.com][inet[/10.19.0.69:9300]]{max_local_storage_nodes=1, data=false, master=false}])
[2015-07-23 15:00:44,366][WARN ][discovery.zen.publish    ] [10.19.0.100] timed out waiting for all nodes to process published state [357982] (timeout [30s], pending nodes: [[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}])
[2015-07-23 15:00:44,369][WARN ][cluster.service          ] [10.19.0.100] cluster state update task [zen-disco-receive(join from node[[10.19.0.69][Kn1ghOA4SFyI12Qo9ZEXEg][esnode069.mweibo.bx.sinanode.com][inet[/10.19.0.69:9300]]{max_local_storage_nodes=1, data=false, master=false}])] took 30s above the warn threshold of 30s
[2015-07-23 15:00:44,383][INFO ][cluster.service          ] [10.19.0.100] removed {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-node_failed([10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}), reason failed to ping, tried [3] times, each with maximum [1.6m] timeout
[2015-07-23 15:00:44,464][WARN ][discovery.zen            ] [10.19.0.100] discovered [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] which is also master but with an older cluster_state, telling [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] to rejoin the cluster ([via a new cluster state])
[2015-07-23 15:00:44,465][WARN ][discovery.zen            ] [10.19.0.100] received a request to rejoin the cluster from [144L7QgTSMahE2MVpWDffw], current nodes: {[10.19.0.81][IbIErGUAQja_oh3Pu6CiHQ][esnode081.mweibo.bx.sinanode.com][inet[/10.19.0.81:9300]]{max_local_storage_nodes=1, master=false},[10.19.0.82][HCzM5N_1Rd6_V551rXA4fA][esnode082.mweibo.bx.sinanode.com][inet[/10.19.0.82:9300]]{max_local_storage_nodes=1, master=false},[10.19.0.72][Ww2_E4LQT-K8Au8boXT2Fg][esnode072.mweibo.bx.sinanode.com][inet[/10.19.0.72:9300]]{max_local_storage_nodes=1, master=false},[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true},...(many nodes here)...[10.19.0.99][zcn7tAnoS4SrsalwzmK_Jw][localhost][inet[/10.19.0.99:9300]]{max_local_storage_nodes=1, data=false, master=true},}
[2015-07-23 15:02:24,499][INFO ][cluster.service          ] [10.19.0.100] new_master [10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-23 15:02:24,897][INFO ][cluster.service          ] [10.19.0.100] added {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-receive(join from node[[10.19.0.80][DycB4xRISgSN4xUGe7bhog][esnode080.mweibo.bx.sinanode.com][inet[/10.19.0.80:9300]]{max_local_storage_nodes=1, master=false}])
[2015-07-23 15:02:54,906][WARN ][discovery.zen.publish    ] [10.19.0.100] timed out waiting for all nodes to process published state [357985] (timeout [30s], pending nodes: [[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}])
[2015-07-23 15:02:54,909][WARN ][cluster.service          ] [10.19.0.100] cluster state update task [zen-disco-receive(join from node[[10.19.0.80][DycB4xRISgSN4xUGe7bhog][esnode080.mweibo.bx.sinanode.com][inet[/10.19.0.80:9300]]{max_local_storage_nodes=1, master=false}])] took 30s above the warn threshold of 30s
[2015-07-23 15:02:54,923][INFO ][cluster.service          ] [10.19.0.100] removed {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-node_failed([10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}), reason failed to ping, tried [3] times, each with maximum [1.6m] timeout
[2015-07-23 15:02:55,070][WARN ][discovery.zen            ] [10.19.0.100] discovered [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] which is also master but with an older cluster_state, telling [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] to rejoin the cluster ([via a new cluster state])
[2015-07-23 15:02:55,080][WARN ][discovery.zen            ] [10.19.0.100] received a request to rejoin the cluster from [144L7QgTSMahE2MVpWDffw], current nodes: {[10.19.0.81][IbIErGUAQja_oh3Pu6CiHQ][esnode081.mweibo.bx.sinanode.com][inet[/10.19.0.81:9300]]{max_local_storage_nodes=1, master=false...(many nodes here)

This happened time after time.

I tried restarting 10.19.0.100 but it had no effect. Then I had to stop this master, restart all the other nodes so they would elect another master, and then start this node again. Now the cluster health is green.


bleskes commented Jul 23, 2015

@chenryn can you share your cluster state in a gist? You can get it via GET _cluster/state .

Also, can you post the complete logs? You redacted some things for brevity (...(many nodes here)...), but it is important to get the complete picture...
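
For reference, a minimal sketch of pulling that cluster state with plain JDK classes, assuming the node exposes HTTP on the default port 9200 (the address below is only an example); the output can be redirected to a file and uploaded as a gist:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetches GET _cluster/state from one node and prints the JSON to stdout.
// Host and port are assumptions: 9200 is the default HTTP port.
public class FetchClusterState {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://10.19.0.100:9200/_cluster/state?pretty");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // redirect to a file, e.g. > cluster_state.json
            }
        }
    }
}
```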


chenryn commented Jul 23, 2015

@bleskes I uploaded the cluster state and the logs for one full rejoin cycle to https://gist.github.com/chenryn/0aa3ba4742b3741d1f01


bleskes commented Jul 24, 2015

@chenryn thx. The cluster state in ES is the same on all nodes except for a little flag indicating which of the nodes is the local node. Your cluster state misses that flag, which causes the master to publish a new cluster state to itself (which we shouldn't do). This causes it to think there is another master active, and it responds by telling that other master to step down. The other master (i.e., the same node) receives the command and steps down, only to re-elect itself.

The biggest question here is how the node ended up without the local flag set. Do you have any custom plugins installed? Was anything else out of order before this started happening?
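
To make that loop concrete, here is a self-contained sketch (not the Elasticsearch source; names such as handleOtherMaster and localNodeMarkerPresent are made up) of the behaviour described above: because the local-node marker is missing, the node cannot recognise its own publish, sees "another" master with an older cluster state, and tells it, i.e. itself, to rejoin:

```java
// NOT Elasticsearch code: a simplified, hypothetical model of the rejoin loop
// described above, reusing the node id and state version from the posted logs.
public class RejoinLoopSketch {

    static String  localNodeId = "144L7QgTSMahE2MVpWDffw"; // the affected master's id
    static long    localVersion = 357982;                  // its cluster state version
    static boolean localNodeMarkerPresent = false;         // the missing "local node" flag

    // With the marker missing, the node cannot tell that the sender is itself.
    static boolean isSelf(String senderId) {
        return localNodeMarkerPresent && senderId.equals(localNodeId);
    }

    static void handleOtherMaster(String senderId, long senderVersion) {
        if (!isSelf(senderId) && senderVersion < localVersion) {
            System.out.println("discovered [" + senderId + "] which is also master but "
                    + "with an older cluster_state, telling it to rejoin");
            // The "other" master is this same node: it steps down, re-elects itself,
            // and the cycle repeats, matching the log lines roughly every two minutes.
        }
    }

    public static void main(String[] args) {
        // The master publishes a new cluster state to itself (which it shouldn't do)...
        handleOtherMaster(localNodeId, localVersion - 1);
    }
}
```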


chenryn commented Jul 31, 2015

No plugins installed. One client node, the "10.19.0.96" in the log above, died and rebooted before the first rejoin happened.


chenryn commented Jul 31, 2015

btw: what does the local flag look like? I checked the state of another cluster; it seems no different from this cluster's.


chenryn commented Aug 17, 2015

I got the same problem again:

[2015-08-16 19:05:27,898][WARN ][discovery.zen            ] [10.19.0.100] discovered [[10.19.0.100][DCTdoPzARimCnC3ZAdq2yQ][esnode100.mweibo.bx.sinanode.com][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] which is also master but with an older cluster_state, telling [[10.19.0.100][DCTdoPzARimCnC3ZAdq2yQ][esnode100.mweibo.bx.sinanode.com][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] to rejoin the cluster ([via a new cluster state])


bleskes commented Aug 17, 2015

@chenryn sorry for not getting back to you - I was out for two weeks. The flag is something internal and is not serialized to the REST API; you can see it if you connect via the Java API. You say you don't use any plugins. Is there anything else in your deployment that may be unusual? Do you embed ES? Can you reproduce this by any chance with a small setup you can share?
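
For anyone wanting to look at that flag, a minimal sketch of connecting with the 1.x Java API and printing what the queried node reports as its local and master node; the cluster name ("my-cluster") and the address are placeholders, not values from this deployment:

```java
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Connects to one node via the ES 1.x transport client and prints the local and
// master node from the cluster state it returns. Cluster name and address are
// assumptions; adjust them for the deployment in question.
public class InspectNodes {
    public static void main(String[] args) {
        TransportClient client = new TransportClient(
                ImmutableSettings.settingsBuilder()
                        .put("cluster.name", "my-cluster")
                        .build());
        client.addTransportAddress(new InetSocketTransportAddress("10.19.0.100", 9300));
        try {
            ClusterState state = client.admin().cluster().prepareState().get().getState();
            DiscoveryNodes nodes = state.nodes();
            System.out.println("local node:  " + nodes.localNode());
            System.out.println("master node: " + nodes.masterNode());
        } finally {
            client.close();
        }
    }
}
```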

@clintongormley

No more feedback - closing


chenryn commented Jan 26, 2016

Yes, I couldn't reproduce it either.

@clintongormley added the :Distributed/Distributed label and removed the :Cluster label on Feb 13, 2018