
Met split-brain issue when the azure vm connection was lost #13727

Closed

OsmondX opened this issue Sep 23, 2015 · 5 comments

OsmondX commented Sep 23, 2015

Hi all,

Recently we have hit the split-brain issue three times on our Elasticsearch cluster hosted on Azure VMs.

I have three nodes, each of which can be both a data and a master node. I have set discovery.zen.minimum_master_nodes to 2 and use the Azure discovery plugin.
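
For illustration, a minimal sketch of the relevant elasticsearch.yml settings for this setup (the minimum_master_nodes value and the use of the Azure discovery plugin are as described above; the cloud.azure.* credential keys are placeholders and their exact names may differ depending on the cloud-azure plugin version):

node.master: true
node.data: true

# quorum of the 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2

# use the Azure discovery plugin instead of multicast
discovery.type: azure
cloud.azure.keystore: /path/to/azurekeystore.pkcs12    # placeholder
cloud.azure.password: <keystore-password>              # placeholder
cloud.azure.subscription_id: <subscription-id>         # placeholder
cloud.azure.service_name: <cloud-service-name>         # placeholder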

Recently Azure has been doing frequent maintenance on its host machines, which causes network instability between the VMs and has also triggered the split-brain issue in my Elasticsearch cluster.

We found that a node which has already joined one master can be forced to rejoin another master, and then the split-brain issue happens!

From the logs we can see that:
09-22 16:38:27, node1 lost connection.
09-22 16:45:15, node2 lost connection, node3 became master (node1 had recovered).
09-22 16:45:24, node3 lost connection, node2 became master.
09-22 16:45:43, node3 recovered and became master again!

Below are the errors and warnings from these three nodes when the split-brain issue happened:

Node1 (search-prod-wus1):

[2015-09-22 16:45:15,618][INFO ][cluster.service ] [caps-prod-wus1] master {new [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]], previous [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]}, removed {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, reason: zen-disco-receive(from master [[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]])
[2015-09-22 16:45:20,566][WARN ][index.store ] [caps-prod-wus1] [32c3c289eef54e42be5913a63dfd280a][2] Can't open file to read checksums
java.io.FileNotFoundException: No such file [_89i0_es090_0.doc]
[2015-09-22 16:45:21,019][WARN ][index.store ] [caps-prod-wus1] [32c3c289eef54e42be5913a63dfd280a][2] Can't open file to read checksums
java.io.FileNotFoundException: No such file [_89i0_es090_0.doc]
[2015-09-22 16:45:24,364][INFO ][cluster.service ] [caps-prod-wus1] master {new [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]], previous [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]}, removed {[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]],}, added {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, reason: zen-disco-receive(from master [[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]])
[2015-09-22 16:45:43,059][INFO ][cluster.service ] [caps-prod-wus1] master {new [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]], previous [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]}, removed {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, added {[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]],}, reason: zen-disco-receive(from master [[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]])

Node2 (search-prod-wus2):

[2015-09-22 16:38:37,162][WARN ][action.index ] [caps-prod-wus2] Failed to perform index on remote replica [caps-prod-wus1][vcBwbCeJTQ61Mw2aOqaflg][search-prod-wbp][inet[/10.3.0.5:9300]][32c3c289eef54e42be5913a63dfd280a][3]
org.elasticsearch.transport.NodeDisconnectedException: [caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica] disconnected
[2015-09-22 16:38:37,162][WARN ][cluster.action.shard ] [caps-prod-wus2] [32c3c289eef54e42be5913a63dfd280a][3] sending failed shard for [32c3c289eef54e42be5913a63dfd280a][3], node[vcBwbCeJTQ61Mw2aOqaflg], [R], s[STARTED], indexUUID [vLXa7OmETyGPGrI82L4SVg], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica] disconnected]]]
[2015-09-22 16:38:37,162][WARN ][action.index ] [caps-prod-wus2] Failed to perform index on remote replica [caps-prod-wus1][vcBwbCeJTQ61Mw2aOqaflg][search-prod-wbp][inet[/10.3.0.5:9300]][32c3c289eef54e42be5913a63dfd280a][3]
org.elasticsearch.transport.SendRequestTransportException: [caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica]
[2015-09-22 16:38:37,162][WARN ][cluster.action.shard ] [caps-prod-wus2] [32c3c289eef54e42be5913a63dfd280a][3] sending failed shard for [32c3c289eef54e42be5913a63dfd280a][3], node[vcBwbCeJTQ61Mw2aOqaflg], [R], s[STARTED], indexUUID [vLXa7OmETyGPGrI82L4SVg], reason [Failed to perform [index] on replica, message [SendRequestTransportException[[caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica]]; nested: NodeNotConnectedException[[caps-prod-wus1][inet[/10.3.0.5:9300]] Node not connected]; ]]

[2015-09-22 16:45:24,005][INFO ][cluster.service ] [caps-prod-wus2] removed {[caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]],}, reason: zen-disco-node_failed([caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]]), reason transport disconnected (with verified connect)

Node3 (search-prod-wus3):

[2015-09-22 16:38:38,677][WARN ][action.index ] [caps-prod-wus3] Failed to perform index on remote replica [caps-prod-wus1][vcBwbCeJTQ61Mw2aOqaflg][search-prod-wbp][inet[/10.3.0.5:9300]][32c3c289eef54e42be5913a63dfd280a][2]
org.elasticsearch.transport.SendRequestTransportException: [caps-prod-wus1][inet[/10.3.0.5:9300]][index/replica]
[2015-09-22 16:45:15,156][INFO ][discovery.azure ] [caps-prod-wus3] master_left [[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]], reason [transport disconnected (with verified connect)]
[2015-09-22 16:45:15,156][INFO ][cluster.service ] [caps-prod-wus3] master {new [caps-prod-wus3][qQYWMttZScaqbAfBPkV5gw][search-prod-wbu][inet[/10.3.0.6:9300]], previous [caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]]}, removed {[caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]],}, reason: zen-disco-master_failed ([caps-prod-wus2][9v_FW_4AQ7KQ4fA5CLPdTg][search-prod-wus][inet[/10.3.0.4:9300]])

[2015-09-22 16:45:43,126][WARN ][index.store ] [caps-prod-wus3] [0222d67c4146405497a70df65629e634][0] Can't open file to read checksums
java.io.FileNotFoundException: No such file [_3app.fdt]

bleskes (Contributor) commented Sep 23, 2015

Thanks for reporting. Which ES version are you using?

OsmondX (Author) commented Sep 23, 2015

My Elasticsearch version is 1.3.2.

OsmondX (Author) commented Sep 23, 2015

This also happened on my Elasticsearch 1.3.9 cluster.

bleskes (Contributor) commented Sep 23, 2015

I see. So I suspect this is #2488, which was fixed in 1.4. I suggest you upgrade (to 1.7.2) as soon as possible. Many, many things have been fixed since 1.3.2.
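
(As a side note, one quick way to confirm after upgrading that every node is actually running the new version is the _cat API, assuming the default HTTP port 9200:

curl -s 'localhost:9200/_cat/nodes?h=name,version,master'

Each line lists the node name, its Elasticsearch version, and a marker showing which node is the elected master.)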

bleskes (Contributor) commented Sep 23, 2015

I'm closing this now. Please reopen if it happens again after upgrading...

bleskes closed this as completed Sep 23, 2015