Only retry join when other node is not (yet) a master #8972

bleskes · 2014-12-16T12:11:43Z

When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this, the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect its source, looking for ElasticsearchIllegalStateException. However, a local ElasticsearchIllegalStateException can also be thrown when the join process should be cancelled (i.e., on node shutdown). In this case we shouldn't retry.

This PR adds an explicit NotMasterException to indicate that the remote node is not a master. A similarly named exception with a different meaning in the master fault detection code was given a clearer name. It also cleans up some other exceptions along the way.

See http://build-us-00.elasticsearch.org/job/es_g1gc_master_metal/499/testReport/junit/org.elasticsearch.discovery.zen/ZenDiscoveryTests/testNodeFailuresAreProcessedOnce/ for a test that gets confused by the extra join.
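For readers skimming the thread, the retry rule described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual ZenDiscovery code: `JoinRetrySketch`, `RemoteException`, `JoinAction`, `unwrap` and `joinWithRetries` are hypothetical stand-ins, the nested `NotMasterException` only mimics the exception this PR introduces, and treating "up to 3 times" as three attempts in total is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class JoinRetrySketch {

    /** Hypothetical stand-in for the explicit "remote node is not a master" exception. */
    static class NotMasterException extends IllegalStateException {
        NotMasterException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for a transport wrapper such as RemoteTransportException. */
    static class RemoteException extends RuntimeException {
        RemoteException(Throwable cause) { super(cause); }
    }

    /** Hypothetical callback that sends one join request and throws on failure. */
    interface JoinAction {
        void sendJoin();
    }

    /** Strips transport wrappers to reach the underlying cause. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof RemoteException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /** Returns true if the join succeeded, false if the caller should go back to pinging. */
    static boolean joinWithRetries(JoinAction action, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.sendJoin();
                return true;
            } catch (RuntimeException e) {
                if (!(unwrap(e) instanceof NotMasterException)) {
                    // Any other failure (for example a local illegal-state
                    // exception raised during shutdown) is not retried.
                    return false;
                }
                // The remote node is not (yet) a master: fall through and retry.
            }
        }
        return false; // attempts exhausted
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated master that rejects the first two join requests.
        boolean joined = joinWithRetries(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new RemoteException(new NotMasterException("not master yet"));
            }
        }, 3);
        System.out.println(joined + " after " + calls.get() + " attempts");
    }
}
```

The point of the dedicated exception type is that the retry decision no longer depends on a general-purpose illegal-state exception, which can also originate locally.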

bleskes · 2014-12-16T21:58:51Z

@kimchy I updated the PR to use an explicit exception. Also cleaned up some unused exceptions. I'll update the PR description + commit msg once the review is done.

kimchy · 2014-12-16T22:04:01Z

LGTM

When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this, the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect its source, looking for `ElasticsearchIllegalStateException`. However, a local `ElasticsearchIllegalStateException` can also be thrown when the join process should be cancelled (i.e., on node shutdown). In this case we shouldn't retry. Since we can't introduce new exceptions in a BWC manner, we are forced to check the message of the exception. Relates to #8972. Closes #8979.
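The message-based check mentioned in this commit message might look roughly like the sketch below. It is an illustration under stated assumptions only: the marker text `NOT_MASTER_MESSAGE` is invented (the thread does not quote the real message), and `JoinMessageCheckSketch`, `RemoteException` and `shouldRetryJoin` are hypothetical names rather than Elasticsearch APIs.

```java
final class JoinMessageCheckSketch {

    /** Hypothetical marker text; the real exception message is not given in this thread. */
    static final String NOT_MASTER_MESSAGE = "node is not a master";

    /** Hypothetical stand-in for a transport wrapper such as RemoteTransportException. */
    static class RemoteException extends RuntimeException {
        RemoteException(Throwable cause) { super(cause); }
    }

    /** Strips transport wrappers to reach the underlying cause. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof RemoteException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /**
     * Retries only when the unwrapped cause is an illegal-state exception whose
     * message carries the "not a master" marker. The exception type alone is not
     * enough, because the same type can be thrown locally when a join is cancelled.
     */
    static boolean shouldRetryJoin(Throwable t) {
        Throwable cause = unwrap(t);
        return cause instanceof IllegalStateException
                && cause.getMessage() != null
                && cause.getMessage().contains(NOT_MASTER_MESSAGE);
    }

    public static void main(String[] args) {
        Throwable remote = new RemoteException(new IllegalStateException(NOT_MASTER_MESSAGE));
        Throwable local = new IllegalStateException("node is shutting down");
        System.out.println(shouldRetryJoin(remote)); // remote "not a master" failure
        System.out.println(shouldRetryJoin(local));  // local cancellation
    }
}
```

Matching on message text is brittle compared with a dedicated exception type, which is presumably why it is only used where a new exception cannot be introduced compatibly.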

bleskes added the v1.5.0, v2.0.0-beta1, >bug, and review labels Dec 16, 2014

bleskes force-pushed the zen_disco_only_retry_on_remote_exception branch from 4af41e7 to d7fad69 Compare December 16, 2014 21:50

Moved to an exception based solution.

de2bb69

bleskes force-pushed the zen_disco_only_retry_on_remote_exception branch from d7fad69 to de2bb69 Compare December 16, 2014 21:52

bleskes removed the v1.5.0 label Dec 16, 2014

bleskes closed this in 8f146f9 Dec 16, 2014

bleskes deleted the zen_disco_only_retry_on_remote_exception branch December 16, 2014 22:13

bleskes changed the title ~~Discovery: only retry join for remote exceptions~~ Discovery: only retry join when other node is not (yet) a master Dec 16, 2014

bleskes mentioned this pull request Dec 16, 2014

Discovery: only retry join when other node is not (yet) a master #8979

Closed

bleskes added the resiliency label Feb 2, 2015

clintongormley removed the review label Mar 19, 2015

clintongormley added the :Distributed/Discovery-Plugins label Jun 7, 2015

clintongormley changed the title ~~Discovery: only retry join when other node is not (yet) a master~~ Only retry join when other node is not (yet) a master Jun 7, 2015
