
Met split-brain issue when the azure vm connection was lost #13727

Closed

OsmondX opened this issue Sep 23, 2015 · 5 comments

OsmondX commented Sep 23, 2015

Hi all,

Recently we have hit the split-brain issue three times on our Elasticsearch cluster hosted on Azure VMs.

I have three nodes, each of which can be both a data and a master node. I have set discovery.zen.minimum_master_nodes to 2 and use the Azure discovery plugin.
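
For illustration, a minimal sketch of the relevant elasticsearch.yml settings for this setup (the minimum_master_nodes value and the use of the Azure discovery plugin are as described above; the cloud.azure.* credential keys are placeholders and their exact names may differ depending on the cloud-azure plugin version):

node.master: true
node.data: true

# quorum of the 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2

# use the Azure discovery plugin instead of multicast
discovery.type: azure
cloud.azure.keystore: /path/to/azurekeystore.pkcs12    # placeholder
cloud.azure.password: <keystore-password>              # placeholder
cloud.azure.subscription_id: <subscription-id>         # placeholder
cloud.azure.service_name: <cloud-service-name>         # placeholder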

Recently Azure has been doing frequent maintenance on its host machines, which causes network instability between the VMs and has also triggered the split-brain issue in my Elasticsearch cluster.

We found that a node which has already joined one master can be forced to rejoin another master, and then the split-brain issue happens!

From the logs we can see that:
09-22 16:38:27, node1 lost connection.
09-22 16:45:15, node2 lost connection, node3 became master (node1 had recovered).
09-22 16:45:24, node3 lost connection, node2 became master.
09-22 16:45:43, node3 recovered and became master again!

Below are the errors and warnings from these three nodes when the split-brain issue happened:

Node1 (search-prod-wus1):

[2015-09-22 16:45:15,618][INFO ][cluster.service ] [caps-prod-wus1] master {new [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]], previous [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]}, removed {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, reason: zen-disco-receive(from master [[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]])
[2015-09-22 16:45:20,566][WARN ][index.store ] [caps-prod-wus1] [32c3c289eef54e42be5913a63dfd280a][2] Can't open file to read checksums
java.io.FileNotFoundException: No such file [_89i0_es090_0.doc]
[2015-09-22 16:45:21,019][WARN ][index.store ] [caps-prod-wus1] [32c3c289eef54e42be5913a63dfd280a][2] Can't open file to read checksums
java.io.FileNotFoundException: No such file [_89i0_es090_0.doc]
[2015-09-22 16:45:24,364][INFO ][cluster.service ] [caps-prod-wus1] master {new [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]], previous [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]}, removed {[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]],}, added {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, reason: zen-disco-receive(from master [[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]])
[2015-09-22 16:45:43,059][INFO ][cluster.service ] [caps-prod-wus1] master {new [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]], previous [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]}, removed {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, added {[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]],}, reason: zen-disco-receive(from master [[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]])

Node2 (search-prod-wus2):

[2015-09-22 16:38:37,162][WARN ][action.index ] [caps-prod-wus2] Failed to perform index on remote replica [caps-prod-wus1][vcBwbCeJTQ61Mw2aOqaflg][search-prod-wbp][inet[/10.3.0.5:9300]][32c3c289eef54e42be5913a63dfd280a][3]
org.elasticsearch.transport.NodeDisconnectedException: [caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica] disconnected
[2015-09-22 16:38:37,162][WARN ][cluster.action.shard ] [caps-prod-wus2] [32c3c289eef54e42be5913a63dfd280a][3] sending failed shard for [32c3c289eef54e42be5913a63dfd280a][3], node[vcBwbCeJTQ61Mw2aOqaflg], [R], s[STARTED], indexUUID [vLXa7OmETyGPGrI82L4SVg], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica] disconnected]]]
[2015-09-22 16:38:37,162][WARN ][action.index ] [caps-prod-wus2] Failed to perform index on remote replica [caps-prod-wus1][vcBwbCeJTQ61Mw2aOqaflg][search-prod-wbp][inet[/10.3.0.5:9300]][32c3c289eef54e42be5913a63dfd280a][3]
org.elasticsearch.transport.SendRequestTransportException: [caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica]
[2015-09-22 16:38:37,162][WARN ][cluster.action.shard ] [caps-prod-wus2] [32c3c289eef54e42be5913a63dfd280a][3] sending failed shard for [32c3c289eef54e42be5913a63dfd280a][3], node[vcBwbCeJTQ61Mw2aOqaflg], [R], s[STARTED], indexUUID [vLXa7OmETyGPGrI82L4SVg], reason [Failed to perform [index] on replica, message [SendRequestTransportException[[caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica]]; nested: NodeNotConnectedException[[caps-prod-wus1][inet[/10.3.0.5:9300]] Node not connected]; ]]

[2015-09-22 16:45:24,005][INFO ][cluster.service ] [caps-prod-wus2] removed {[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]],}, reason: zen-disco-node_failed([caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]), reason transport disconnected (with verified connect)

Node3 (search-prod-wus3):

[2015-09-22 16:38:38,677][WARN ][action.index ] [caps-prod-wus3] Failed to perform index on remote replica [caps-prod-wus1][vcBwbCeJTQ61Mw2aOqaflg][search-prod-wbp][inet[/10.3.0.5:9300]][32c3c289eef54e42be5913a63dfd280a][2]
org.elasticsearch.transport.SendRequestTransportException: [caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica]
[2015-09-22 16:45:15,156][INFO ][discovery.azure ] [caps-prod-wus3] master_left [[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]], reason [transport disconnected (with verified connect)]
[2015-09-22 16:45:15,156][INFO ][cluster.service ] [caps-prod-wus3] master {new [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]], previous [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]}, removed {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, reason: zen-disco-master_failed ([caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]])

[2015-09-22 16:45:43,126][WARN ][index.store ] [caps-prod-wus3] [0222d67c4146405497a70df65629e634][0] Can't open file to read checksums
java.io.FileNotFoundException: No such file [_3app.fdt]

bleskes (Contributor) commented Sep 23, 2015

Thanks for reporting. Which ES version are you using?

OsmondX (Author) commented Sep 23, 2015

My Elasticsearch version is 1.3.2.

OsmondX (Author) commented Sep 23, 2015

This also happened on my Elasticsearch 1.3.9 cluster.

bleskes (Contributor) commented Sep 23, 2015

I see. So I suspect this is #2488, which was fixed in 1.4. I suggest you upgrade (to 1.7.2) as soon as possible. Many, many things have been fixed since 1.3.2.
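
(As a side note, one quick way to confirm after upgrading that every node is actually running the new version is the _cat API, assuming the default HTTP port 9200:

curl -s 'localhost:9200/_cat/nodes?h=name,version,master'

Each line lists the node name, its Elasticsearch version, and a marker showing which node is the elected master.)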

bleskes (Contributor) commented Sep 23, 2015

I'm closing this now. Please reopen if it happens again after upgrading...

bleskes closed this as completed Sep 23, 2015