Do not use versions to optimize cluster state copying for a first update from a new master #6466

bleskes · 2014-06-11T14:24:55Z

We have an optimization which compares the routing table and metadata versions of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_nodes breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens when the current master updates its state, doesn't manage to broadcast it to other nodes due to the disconnect, and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of its state can be equal to the new master's while the content is different.
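The scenario above can be illustrated with a minimal, self-contained sketch. It is not the Elasticsearch code: `Part`, `State`, `applyWithVersionReuse` and `applyGuarded` are hypothetical names, and the guard (skip the reuse when the incoming state was published by a different master than the local one) is an assumed stand-in for "first update from a new master", not necessarily the condition used in this PR.

```java
import java.util.Objects;

public class VersionReuseSketch {

    /** Hypothetical stand-in for a versioned part of the cluster state (e.g. the routing table). */
    static final class Part {
        final long version;
        final String content;

        Part(long version, String content) {
            this.version = version;
            this.content = content;
        }
    }

    /** Hypothetical cluster state: the master that published it plus one versioned part. */
    static final class State {
        final String masterId;
        final Part routing;

        State(String masterId, Part routing) {
            this.masterId = masterId;
            this.routing = routing;
        }
    }

    /** Version-only optimization: keep the local object whenever the versions match. */
    static State applyWithVersionReuse(State local, State incoming) {
        Part routing = incoming.routing.version == local.routing.version
                ? local.routing      // assumes equal version implies equal content
                : incoming.routing;
        return new State(incoming.masterId, routing);
    }

    /** Guarded variant: never reuse local objects when the state comes from a different master. */
    static State applyGuarded(State local, State incoming) {
        if (!Objects.equals(local.masterId, incoming.masterId)) {
            return incoming; // versions from two different masters are not comparable
        }
        return applyWithVersionReuse(local, incoming);
    }

    public static void main(String[] args) {
        // The old master (node_a) bumped its state to version 5 but never broadcast it.
        State oldMasterLocal = new State("node_a", new Part(5, "shard 0 -> node_a"));
        // The new master (node_b) started from version 4 and independently produced version 5.
        State fromNewMaster = new State("node_b", new Part(5, "shard 0 -> node_b"));

        // Prints the stale content "shard 0 -> node_a".
        System.out.println(applyWithVersionReuse(oldMasterLocal, fromNewMaster).routing.content);
        // Prints the new master's content "shard 0 -> node_b".
        System.out.println(applyGuarded(oldMasterLocal, fromNewMaster).routing.content);
    }
}
```

In this sketch the optimization still applies to later updates from the same master, where equal versions do imply equal content.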

NOTE: this is a PR for the feature/improve_zen branch

martijnvg · 2014-06-11T15:34:57Z

src/test/java/org/elasticsearch/test/ElasticsearchIntegrationTest.java

@@ -560,6 +558,10 @@ public static Client client() {
        return client;
    }

+    public static Client client(@Nullable String node) {
+       return node == null? client() : cluster().client(node);


I think node can't be null here.

See: https://github.com/elasticsearch/elasticsearch/pull/6466/files#diff-7e704e723d968699a880ef44818125efR484

argh... didn't read it properly... nevermind

martijnvg · 2014-06-11T18:27:21Z

LGTM

… a first update from a new master We have an optimization which compares the routing table and metadata versions of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_nodes breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens when the current master updates its state, doesn't manage to broadcast it to other nodes due to the disconnect, and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of its state can be equal to the new master's while the content is different. Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements) Closes #6466

bleskes · 2014-06-12T09:05:10Z

Pushed in with a2ca26e

martijnvg reviewed Jun 11, 2014

added randomization between responsive/unresponsive network disconnects

647dad9

bleskes closed this Jun 12, 2014

bleskes deleted the always_update_cs_from_master branch June 12, 2014 09:05

bleskes added v1.4.0 labels Sep 2, 2014

clintongormley changed the title ~~[Discovery] do not use versions to optimize cluster state copying for a first update from a new master~~ Resiliency: Do not use versions to optimize cluster state copying for a first update from a new master Sep 8, 2014

clintongormley added the >enhancement label Sep 11, 2014

clintongormley added the :Cluster label Jun 7, 2015

clintongormley changed the title ~~Resiliency: Do not use versions to optimize cluster state copying for a first update from a new master~~ Do not use versions to optimize cluster state copying for a first update from a new master Jun 7, 2015

clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018
