Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed the node retry mechanism which could fail without trying all the connected nodes #6829

Closed

Conversation

javanna
Copy link
Member

@javanna javanna commented Jul 11, 2014

The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed are not completed yet.

The TransportService already notifies the listener of any exception received from a separate thread through the request holder, no need to notify the retry listener again in any other place (either catch or onFailure method itself).

if (e.unwrapCause() instanceof ConnectTransportException) {
retryListener.onFailure(e);
} else {
if (!(e.unwrapCause() instanceof ConnectTransportException)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment about why we ignore this?

onFailure(e1);
//no need to retry here, the transport service will notify this same listener
//of the failure through the request holder, which will retry
//ConnectTransportException gets thrown as well by TransportService due to throwConnectException option
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment can be clear, maybe:

"ConnectTransportException gets thrown as well by TransportService due to throwConnectException option, which is needed for the correct operation of execute(...). We can ignore it here because it will be passed through the listener interface"

import static org.hamcrest.Matchers.greaterThanOrEqualTo;

@ClusterScope(scope = TEST, numClientNodes = 0)
public class TransportClientRetryTests extends ElasticsearchIntegrationTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to have a unit test of TransportClientNodesService, with a custom tranport service implementation that both throw error and checks that all nodes were tried...

@bleskes
Copy link
Contributor

bleskes commented Jul 11, 2014

The fix look good to me (left some comments regarding comments :) ). I'd love to see a unit test as opposed to an integration test. I think we'd get much more out of it.

@s1monw s1monw added v1.4.0 and removed v1.3.0 labels Jul 14, 2014
@@ -143,7 +143,7 @@

static final boolean DEFAULT_ENABLE_RANDOM_BENCH_NODES = true;

private static final String NODE_MODE = nodeMode();
public static final String NODE_MODE = nodeMode();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make nodeMode() public instead?

@s1monw s1monw removed the review label Jul 17, 2014
…hout trying all the connected nodes

The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed are not completed yet.

The TransportService notifies of any exception received from a separate thread through the request holder, no need to notify the retry listener in any other place (either catch or onFailure method itself).

Closes elastic#6829
…ception, which might get thrown before the transport service send request call

only ConnectTransportException gets thrown by transport service (throwConnectException option), if another exception gets thrown it comes from something that happened before, we need to notify the original listener and stop the retries.
…te variants: with or without listener.

The variant without listener throws exception on the calling thread (throwConnectException option).
…accurate in terms of number of expected nodes
…nsportService won't do it either

do notify the listener in case of throwable, cause they can't come from the transport service and the listener needs to be notified of those, otherwise it bubbles up
@javanna
Copy link
Member Author

javanna commented Jul 25, 2014

Pushed new commits to address comments and a unit test for it as suggested by @bleskes . I also changed a bit how we catch exceptions given how they get thrown by the TransportService. Ready for reviews!

@javanna javanna added the review label Jul 25, 2014
…lientNodesService and one a bit higher level that tests InternalTransportClient
@javanna javanna self-assigned this Jul 28, 2014
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.atomic.AtomicInteger;

public abstract class FailAndRetryMockTransport<Response extends TransportResponse> implements Transport {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this as an internal class to the test that use it? also, throw illegal state exception on implemented methods that should not be called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, it is shared between two different tests, one more low level (TransportClientNodesServiceTests) and one that tests the blocking version as well and involves the TransportActionNodeProxy too (InternalTransportClientTests). I'd make it package private if that's ok and move both tests to the same org.elasticsearch.client.transport package?

@kimchy
Copy link
Member

kimchy commented Jul 28, 2014

LGTM, very clean now indeed!

@javanna javanna closed this in fcf4d5a Jul 28, 2014
javanna added a commit that referenced this pull request Jul 28, 2014
…hout trying all the connected nodes

The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed were not completed yet.

The TransportService used to throw ConnectTransportException due to throwConnectException set to true, and also notify the listener of any exception received from a separate thread through the request holder.

Simplified exception handling by just removing the throwConnectException option from the TransportService, used only in the transport client. The transport client now relies solely on the request holder to notify of failures and eventually retry.

Closes #6829
@javanna
Copy link
Member Author

javanna commented Jul 28, 2014

Side note: as part of the work to fix this issue the throwConnectException option was removed from the TransportService.

@jpountz jpountz removed the review label Jul 29, 2014
javanna added a commit to javanna/elasticsearch that referenced this pull request Aug 27, 2014
…hout trying all the connected nodes

The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed were not completed yet.

The TransportService used to throw ConnectTransportException due to throwConnectException set to true, and also notify the listener of any exception received from a separate thread through the request holder.

Simplified exception handling by just removing the throwConnectException option from the TransportService, used only in the transport client. The transport client now relies solely on the request holder to notify of failures and eventually retry.

Closes elastic#6829
@javanna javanna added the v1.3.3 label Aug 27, 2014
@clintongormley clintongormley added the :Core/Infra/Transport API Transport client API label Jun 7, 2015
@clintongormley clintongormley changed the title Transport Client: fixed the node retry mechanism which could fail without trying all the connected nodes Fixed the node retry mechanism which could fail without trying all the connected nodes Jun 7, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
…hout trying all the connected nodes

The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed were not completed yet.

The TransportService used to throw ConnectTransportException due to throwConnectException set to true, and also notify the listener of any exception received from a separate thread through the request holder.

Simplified exception handling by just removing the throwConnectException option from the TransportService, used only in the transport client. The transport client now relies solely on the request holder to notify of failures and eventually retry.

Closes elastic#6829
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants