
master node was forced to rejoin #12415

Closed

chenryn opened this issue Jul 23, 2015 · 9 comments

Labels: :Distributed/Distributed, feedback_needed

chenryn commented Jul 23, 2015

Elasticsearch 1.6.0

The master node is 10.19.0.100; its es.log records the following. It discovers itself as "also master but with an older cluster_state", then forces itself to rejoin...

[2015-07-23 15:00:12,976][INFO ][cluster.service          ] [10.19.0.100] new_master [10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-23 15:00:14,356][INFO ][cluster.service          ] [10.19.0.100] added {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-receive(join from node[[10.19.0.69][Kn1ghOA4SFyI12Qo9ZEXEg][esnode069.mweibo.bx.sinanode.com][inet[/10.19.0.69:9300]]{max_local_storage_nodes=1, data=false, master=false}])
[2015-07-23 15:00:44,366][WARN ][discovery.zen.publish    ] [10.19.0.100] timed out waiting for all nodes to process published state [357982] (timeout [30s], pending nodes: [[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}])
[2015-07-23 15:00:44,369][WARN ][cluster.service          ] [10.19.0.100] cluster state update task [zen-disco-receive(join from node[[10.19.0.69][Kn1ghOA4SFyI12Qo9ZEXEg][esnode069.mweibo.bx.sinanode.com][inet[/10.19.0.69:9300]]{max_local_storage_nodes=1, data=false, master=false}])] took 30s above the warn threshold of 30s
[2015-07-23 15:00:44,383][INFO ][cluster.service          ] [10.19.0.100] removed {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-node_failed([10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}), reason failed to ping, tried [3] times, each with maximum [1.6m] timeout
[2015-07-23 15:00:44,464][WARN ][discovery.zen            ] [10.19.0.100] discovered [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] which is also master but with an older cluster_state, telling [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] to rejoin the cluster ([via a new cluster state])
[2015-07-23 15:00:44,465][WARN ][discovery.zen            ] [10.19.0.100] received a request to rejoin the cluster from [144L7QgTSMahE2MVpWDffw], current nodes: {[10.19.0.81][IbIErGUAQja_oh3Pu6CiHQ][esnode081.mweibo.bx.sinanode.com][inet[/10.19.0.81:9300]]{max_local_storage_nodes=1, master=false},[10.19.0.82][HCzM5N_1Rd6_V551rXA4fA][esnode082.mweibo.bx.sinanode.com][inet[/10.19.0.82:9300]]{max_local_storage_nodes=1, master=false},[10.19.0.72][Ww2_E4LQT-K8Au8boXT2Fg][esnode072.mweibo.bx.sinanode.com][inet[/10.19.0.72:9300]]{max_local_storage_nodes=1, master=false},[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true},...(many nodes here)...[10.19.0.99][zcn7tAnoS4SrsalwzmK_Jw][localhost][inet[/10.19.0.99:9300]]{max_local_storage_nodes=1, data=false, master=true},}
[2015-07-23 15:02:24,499][INFO ][cluster.service          ] [10.19.0.100] new_master [10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-23 15:02:24,897][INFO ][cluster.service          ] [10.19.0.100] added {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-receive(join from node[[10.19.0.80][DycB4xRISgSN4xUGe7bhog][esnode080.mweibo.bx.sinanode.com][inet[/10.19.0.80:9300]]{max_local_storage_nodes=1, master=false}])
[2015-07-23 15:02:54,906][WARN ][discovery.zen.publish    ] [10.19.0.100] timed out waiting for all nodes to process published state [357985] (timeout [30s], pending nodes: [[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}])
[2015-07-23 15:02:54,909][WARN ][cluster.service          ] [10.19.0.100] cluster state update task [zen-disco-receive(join from node[[10.19.0.80][DycB4xRISgSN4xUGe7bhog][esnode080.mweibo.bx.sinanode.com][inet[/10.19.0.80:9300]]{max_local_storage_nodes=1, master=false}])] took 30s above the warn threshold of 30s
[2015-07-23 15:02:54,923][INFO ][cluster.service          ] [10.19.0.100] removed {[10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false},}, reason: zen-disco-node_failed([10.19.0.96][KUI5p0MhTFyjQyhIAj-sHA][localhost.localdomain][inet[/127.0.0.1:9300]]{max_local_storage_nodes=1, data=false, master=false}), reason failed to ping, tried [3] times, each with maximum [1.6m] timeout
[2015-07-23 15:02:55,070][WARN ][discovery.zen            ] [10.19.0.100] discovered [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] which is also master but with an older cluster_state, telling [[10.19.0.100][144L7QgTSMahE2MVpWDffw][localhost][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] to rejoin the cluster ([via a new cluster state])
[2015-07-23 15:02:55,080][WARN ][discovery.zen            ] [10.19.0.100] received a request to rejoin the cluster from [144L7QgTSMahE2MVpWDffw], current nodes: {[10.19.0.81][IbIErGUAQja_oh3Pu6CiHQ][esnode081.mweibo.bx.sinanode.com][inet[/10.19.0.81:9300]]{max_local_storage_nodes=1, master=false...(many nodes here)

This happened time after time.

I tried restarting 10.19.0.100 but it had no effect. Then I had to stop this master, restart all the other nodes so they would elect another master, and then start this node again. Now the cluster health is green.


bleskes commented Jul 23, 2015

@chenryn can you share your cluster state in a gist? You can get it via GET _cluster/state .

Also, can you post the complete logs? You redacted some things for brevity (...(many nodes here)...), but it is important to get the complete picture...
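
For reference, a minimal sketch of pulling that cluster state with plain JDK classes, assuming the node exposes HTTP on the default port 9200 (the address below is only an example); the output can be redirected to a file and uploaded as a gist:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetches GET _cluster/state from one node and prints the JSON to stdout.
// Host and port are assumptions: 9200 is the default HTTP port.
public class FetchClusterState {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://10.19.0.100:9200/_cluster/state?pretty");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // redirect to a file, e.g. > cluster_state.json
            }
        }
    }
}
```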


chenryn commented Jul 23, 2015

@bleskes I uploaded the cluster state and the logs for one full rejoin cycle to https://gist.github.com/chenryn/0aa3ba4742b3741d1f01


bleskes commented Jul 24, 2015

@chenryn thx. The cluster state in ES is the same on all nodes except for a little flag indicating which of the nodes is the local node. Your cluster state misses that flag, which causes the master to publish a new cluster state to itself (which we shouldn't do). This causes it to think there is another master active, and it responds by telling that other master to step down. The other master (i.e., the same node) receives the command and steps down, only to re-elect itself.

The biggest question here is how the node ended up without the local flag set. Do you have any custom plugins installed? Was anything else out of order before this started happening?
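
To make that loop concrete, here is a self-contained sketch (not the Elasticsearch source; names such as handleOtherMaster and localNodeMarkerPresent are made up) of the behaviour described above: because the local-node marker is missing, the node cannot recognise its own publish, sees "another" master with an older cluster state, and tells it, i.e. itself, to rejoin:

```java
// NOT Elasticsearch code: a simplified, hypothetical model of the rejoin loop
// described above, reusing the node id and state version from the posted logs.
public class RejoinLoopSketch {

    static String  localNodeId = "144L7QgTSMahE2MVpWDffw"; // the affected master's id
    static long    localVersion = 357982;                  // its cluster state version
    static boolean localNodeMarkerPresent = false;         // the missing "local node" flag

    // With the marker missing, the node cannot tell that the sender is itself.
    static boolean isSelf(String senderId) {
        return localNodeMarkerPresent && senderId.equals(localNodeId);
    }

    static void handleOtherMaster(String senderId, long senderVersion) {
        if (!isSelf(senderId) && senderVersion < localVersion) {
            System.out.println("discovered [" + senderId + "] which is also master but "
                    + "with an older cluster_state, telling it to rejoin");
            // The "other" master is this same node: it steps down, re-elects itself,
            // and the cycle repeats, matching the log lines roughly every two minutes.
        }
    }

    public static void main(String[] args) {
        // The master publishes a new cluster state to itself (which it shouldn't do)...
        handleOtherMaster(localNodeId, localVersion - 1);
    }
}
```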


chenryn commented Jul 31, 2015

No plugins installed. One client node, the "10.19.0.96" in the log above, died and rebooted before the first rejoin happened.


chenryn commented Jul 31, 2015

btw: what does the local flag look like? I checked the state of another cluster; it seems no different from this cluster's.


chenryn commented Aug 17, 2015

I got the same problem again:

[2015-08-16 19:05:27,898][WARN ][discovery.zen            ] [10.19.0.100] discovered [[10.19.0.100][DCTdoPzARimCnC3ZAdq2yQ][esnode100.mweibo.bx.sinanode.com][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] which is also master but with an older cluster_state, telling [[10.19.0.100][DCTdoPzARimCnC3ZAdq2yQ][esnode100.mweibo.bx.sinanode.com][inet[/10.19.0.100:9300]]{max_local_storage_nodes=1, data=false, master=true}] to rejoin the cluster ([via a new cluster state])


bleskes commented Aug 17, 2015

@chenryn sorry for not getting back to you - I was out for two weeks. The flag is something internal and is not serialized to the REST API; you can see it if you connect via the Java API. You say you don't use any plugins. Is there anything else in your deployment that may be unusual? Do you embed ES? Can you reproduce this by any chance with a small setup you can share?
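
For anyone wanting to look at that flag, a minimal sketch of connecting with the 1.x Java API and printing what the queried node reports as its local and master node; the cluster name ("my-cluster") and the address are placeholders, not values from this deployment:

```java
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Connects to one node via the ES 1.x transport client and prints the local and
// master node from the cluster state it returns. Cluster name and address are
// assumptions; adjust them for the deployment in question.
public class InspectNodes {
    public static void main(String[] args) {
        TransportClient client = new TransportClient(
                ImmutableSettings.settingsBuilder()
                        .put("cluster.name", "my-cluster")
                        .build());
        client.addTransportAddress(new InetSocketTransportAddress("10.19.0.100", 9300));
        try {
            ClusterState state = client.admin().cluster().prepareState().get().getState();
            DiscoveryNodes nodes = state.nodes();
            System.out.println("local node:  " + nodes.localNode());
            System.out.println("master node: " + nodes.masterNode());
        } finally {
            client.close();
        }
    }
}
```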

@clintongormley

No more feedback - closing


chenryn commented Jan 26, 2016

Yes, I couldn't reproduce it either.

@clintongormley added the :Distributed/Distributed label and removed the :Cluster label on Feb 13, 2018