Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only retry join when other node is not (yet) a master #8972

Closed

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Dec 16, 2014

When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for ElasticsearchIllegalStateException. However, local ElasticsearchIllegalStateException can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.

The PR adds an explicit NotMasterException to indicate the remote node is not a master. A similarly named exception (but meaning something else) in the master fault detection code was given a better name. Also clean up some other exceptions while at it.

See http://build-us-00.elasticsearch.org/job/es_g1gc_master_metal/499/testReport/junit/org.elasticsearch.discovery.zen/ZenDiscoveryTests/testNodeFailuresAreProcessedOnce/ for a test that gets confused by the extra join

When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detec this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for ElasticsearchIllegalStateException. However, local ElasticsearchIllegalStateException can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.
@bleskes bleskes force-pushed the zen_disco_only_retry_on_remote_exception branch from d7fad69 to de2bb69 Compare December 16, 2014 21:52
@bleskes
Copy link
Contributor Author

bleskes commented Dec 16, 2014

@kimchy I updated the PR to use an explicit exception. Also cleaned up some unused exceptions. I'll update the PR description + commit msg once the review is done.

@kimchy
Copy link
Member

kimchy commented Dec 16, 2014

LGTM

@bleskes bleskes removed the v1.5.0 label Dec 16, 2014
@bleskes bleskes closed this in 8f146f9 Dec 16, 2014
@bleskes bleskes deleted the zen_disco_only_retry_on_remote_exception branch December 16, 2014 22:13
@bleskes bleskes changed the title Discovery: only retry join for remote exceptions Discovery: only retry join when other node is not (yet) a master Dec 16, 2014
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Dec 16, 2014
When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for `ElasticsearchIllegalStateException`. However, local `ElasticsearchIllegalStateException` can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.

Since we can't introduce new exceptions in a BWC manner, we are forced to check the message of the exception.

Relates to elastic#8972
bleskes added a commit that referenced this pull request Dec 16, 2014
When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for `ElasticsearchIllegalStateException`. However, local `ElasticsearchIllegalStateException` can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.

Since we can't introduce new exceptions in a BWC manner, we are forced to check the message of the exception.

Relates to #8972
Closes #8979
@clintongormley clintongormley added the :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure label Jun 7, 2015
@clintongormley clintongormley changed the title Discovery: only retry join when other node is not (yet) a master Only retry join when other node is not (yet) a master Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure resiliency v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants