Fixed the node retry mechanism which could fail without trying all the connected nodes #6829

javanna · 2014-07-11T10:25:52Z

The RetryListener was notified twice for each failure. This caused some additional retries, but more importantly it made the client reach the maximum number of retries (the number of connected nodes) too quickly, while ongoing retries that could still succeed had not yet completed.

The TransportService already notifies the listener, from a separate thread and through the request holder, of any exception it receives, so there is no need to notify the retry listener again anywhere else (neither in the catch block nor in the onFailure method itself).
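To make the accounting problem concrete, here is a minimal single-threaded model of it. This is a sketch, not the Elasticsearch code: the class `RetryAccountingSketch`, the method `simulate`, and the queue of pending attempts are all assumptions made for illustration. Each failure bumps a counter once per notification, and the request gives up as soon as the counter reaches the number of nodes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, single-threaded model of the retry accounting; NOT Elasticsearch code.
// All names here are hypothetical and exist only for this illustration.
class RetryAccountingSketch {

    /**
     * Simulates one request against nodeCount nodes where only healthyNode answers.
     * Every failed attempt notifies the retry listener notificationsPerFailure times.
     * Returns "success" or "NoNodeAvailable", whichever is reported first.
     */
    static String simulate(int nodeCount, int healthyNode, int notificationsPerFailure) {
        Deque<Integer> pendingAttempts = new ArrayDeque<>();
        int failureCount = 0;      // plays the role of the listener's retry counter
        pendingAttempts.add(0);    // the first attempt goes to node 0

        while (!pendingAttempts.isEmpty()) {
            int node = pendingAttempts.poll();
            if (node == healthyNode) {
                return "success";
            }
            // The attempt failed: notify the listener once (fixed) or twice (buggy).
            for (int n = 0; n < notificationsPerFailure; n++) {
                failureCount++;
                if (failureCount >= nodeCount) {
                    // Retry budget exhausted, even if attempts are still pending.
                    return "NoNodeAvailable";
                }
                pendingAttempts.add(failureCount % nodeCount);
            }
        }
        return "NoNodeAvailable";
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, 3, 1)); // one notification per failure
        System.out.println(simulate(4, 3, 2)); // two notifications per failure
    }
}
```

In this model, with four nodes of which only the last is healthy, a single notification per failure walks through all nodes and succeeds, while two notifications per failure exhaust the budget after the second failed attempt, with the attempt on the healthy node still pending.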

bleskes · 2014-07-11T13:56:22Z

src/main/java/org/elasticsearch/client/transport/TransportClientNodesService.java

-            if (e.unwrapCause() instanceof ConnectTransportException) {
-                retryListener.onFailure(e);
-            } else {
+            if (!(e.unwrapCause() instanceof ConnectTransportException)) {


Add a comment about why we ignore this?

bleskes · 2014-07-11T13:58:44Z

src/main/java/org/elasticsearch/client/transport/TransportClientNodesService.java

-                        onFailure(e1);
+                        //no need to retry here, the transport service will notify this same listener
+                        //of the failure through the request holder, which will retry
+                        //ConnectTransportException gets thrown as well by TransportService due to throwConnectException option


I think the comment can be clearer, maybe:

"ConnectTransportException gets thrown as well by TransportService due to throwConnectException option, which is needed for the correct operation of execute(...). We can ignore it here because it will be passed through the listener interface"

bleskes · 2014-07-11T14:02:29Z

src/test/java/org/elasticsearch/client/transport/TransportClientRetryTests.java

+import static org.hamcrest.Matchers.greaterThanOrEqualTo;
+
+@ClusterScope(scope = TEST, numClientNodes = 0)
+public class TransportClientRetryTests extends ElasticsearchIntegrationTest {


I'd love to have a unit test of TransportClientNodesService, with a custom transport service implementation that both throws errors and checks that all nodes were tried...

bleskes · 2014-07-11T14:03:55Z

The fix looks good to me (left some comments regarding comments :) ). I'd love to see a unit test as opposed to an integration test. I think we'd get much more out of it.
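The kind of unit test suggested here could look roughly like the sketch below. It is only an outline of the idea under stated assumptions: `FakeTransport`, `FailingTransport`, and `executeWithRetry` are hypothetical names, not the Elasticsearch `Transport` interface or the `TransportClientNodesService` API. The fake transport fails every send and records which nodes it was asked to contact, so the check can assert that every node was tried before giving up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: FakeTransport and executeWithRetry are hypothetical names,
// not the Elasticsearch Transport or TransportClientNodesService APIs.
class AllNodesTriedSketch {

    interface FakeTransport {
        void send(String node) throws Exception;
    }

    /** Records every node it is asked to contact, then always fails. */
    static class FailingTransport implements FakeTransport {
        final List<String> triedNodes = new ArrayList<>();

        @Override
        public void send(String node) throws Exception {
            triedNodes.add(node);
            throw new Exception("simulated connect failure: " + node);
        }
    }

    /** Tries each node once, in order; true on the first success, false if all fail. */
    static boolean executeWithRetry(List<String> nodes, FakeTransport transport) {
        for (String node : nodes) {
            try {
                transport.send(node);
                return true;
            } catch (Exception e) {
                // This node failed; move on to the next one.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3");
        FailingTransport transport = new FailingTransport();
        boolean succeeded = executeWithRetry(nodes, transport);
        System.out.println("succeeded=" + succeeded + ", tried=" + transport.triedNodes);
    }
}
```

Because the transport is a plain in-memory fake, such a test needs no running cluster and fails deterministically if a node is skipped.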

s1monw · 2014-07-17T12:45:29Z

src/test/java/org/elasticsearch/test/InternalTestCluster.java

@@ -143,7 +143,7 @@

    static final boolean DEFAULT_ENABLE_RANDOM_BENCH_NODES = true;

-    private static final String NODE_MODE = nodeMode();
+    public static final String NODE_MODE = nodeMode();


can we make nodeMode() public instead?

…hout trying all the connected nodes
The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed are not completed yet.
The TransportService notifies of any exception received from a separate thread through the request holder, no need to notify the retry listener in any other place (either catch or onFailure method itself).
Closes elastic#6829

…ception, which might get thrown before the transport service send request call
Only ConnectTransportException gets thrown by the transport service (throwConnectException option); if another exception gets thrown, it comes from something that happened before, so we need to notify the original listener and stop the retries.


javanna · 2014-07-25T16:58:02Z

Pushed new commits to address the comments, plus a unit test as suggested by @bleskes. I also changed a bit how we catch exceptions, given how they get thrown by the TransportService. Ready for review!


kimchy · 2014-07-28T17:27:20Z

src/test/java/org/elasticsearch/client/transport/FailAndRetryMockTransport.java

+import java.util.concurrent.CopyOnWriteArraySet;
+import java.util.concurrent.atomic.AtomicInteger;
+
+public abstract class FailAndRetryMockTransport<Response extends TransportResponse> implements Transport {


Can we move this into the test that uses it, as an inner class? Also, throw an illegal state exception from implemented methods that should not be called?

Good point. It is shared between two different tests: a lower-level one (TransportClientNodesServiceTests) and one that tests the blocking version as well and involves the TransportActionNodeProxy too (InternalTransportClientTests). I'd make it package private if that's ok and move both tests to the same org.elasticsearch.client.transport package?
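For readers unfamiliar with the pattern discussed in this thread, a mock that fails loudly on methods the test should never reach might be sketched as follows. `MiniTransport` and `SendOnlyMock` are hypothetical, trimmed-down stand-ins rather than the real `Transport` interface or `FailAndRetryMockTransport`; the sketch uses `UnsupportedOperationException`, but an `IllegalStateException` would serve the same purpose.

```java
// Illustrative only: MiniTransport is a hypothetical, trimmed-down interface,
// not the Elasticsearch Transport interface.
class UnsupportedMockSketch {

    interface MiniTransport {
        void sendRequest(String node, String action);
        void connectToNode(String node);
        void disconnectFromNode(String node);
    }

    /** Mock that supports only sendRequest; every other method fails loudly. */
    static class SendOnlyMock implements MiniTransport {
        int requestsSent = 0;

        @Override
        public void sendRequest(String node, String action) {
            requestsSent++;
        }

        @Override
        public void connectToNode(String node) {
            throw new UnsupportedOperationException("connectToNode is not expected in this test");
        }

        @Override
        public void disconnectFromNode(String node) {
            throw new UnsupportedOperationException("disconnectFromNode is not expected in this test");
        }
    }

    public static void main(String[] args) {
        SendOnlyMock mock = new SendOnlyMock();
        mock.sendRequest("node1", "ping");
        System.out.println("requests sent: " + mock.requestsSent);
    }
}
```

Throwing from the unused methods turns an accidental call into an immediate test failure instead of a silent no-op.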

kimchy · 2014-07-28T17:37:55Z

LGTM, very clean now indeed!


…hout trying all the connected nodes
The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed were not completed yet.
The TransportService used to throw ConnectTransportException due to throwConnectException set to true, and also notify the listener of any exception received from a separate thread through the request holder.
Simplified exception handling by just removing the throwConnectException option from the TransportService, used only in the transport client. The transport client now relies solely on the request holder to notify of failures and eventually retry.
Closes #6829

javanna · 2014-07-28T18:59:45Z

Side note: as part of the work to fix this issue the throwConnectException option was removed from the TransportService.


javanna added bug labels Jul 11, 2014

bleskes reviewed Jul 11, 2014

javanna mentioned this pull request Jul 11, 2014

NoNodeAvailableException after 2 hours of bulk indexing #1868

Closed

bleskes reviewed Jul 11, 2014

javanna mentioned this pull request Jul 11, 2014

TransportClient behavior when the server node is not available #5151

Closed

bleskes reviewed Jul 11, 2014

s1monw added v1.4.0 and removed v1.3.0 labels Jul 14, 2014

s1monw reviewed Jul 17, 2014

s1monw removed the review label Jul 17, 2014

javanna added 7 commits July 22, 2014 14:12

beefed up TransportClientRetryTests, make sure that we use both execu…

2b53739

…te variants: with or without listener. The variant without listener throws exception on the calling thread (throwConnectException option).

added check for transport client connected nodes and make check more …

6cbbcda

…accurate in terms of number of expected nodes

made InternalTestCluster.nodeMode() public rathern than NODE_MODE sta…

a2601a6

…tic field

improved comments

4dad463

catch only ConnectTransportException, don't even unwrap since the Tra…

1c98046

…nsportService won't do it either do notify the listener in case of throwable, cause they can't come from the transport service and the listener needs to be notified of those, otherwise it bubbles up

javanna added the review label Jul 25, 2014

added unit tests that simulate transport failures, one for TransportC…

c8d112c

…lientNodesService and one a bit higher level that tests InternalTransportClient

javanna self-assigned this Jul 28, 2014

javanna added 2 commits July 28, 2014 18:57

removed throwConnectException option from TransportService, simplifie…

4614780

…d exception handling in transport client retry mechanism

exceptions and unused code cleanup

1e55d05

kimchy reviewed Jul 28, 2014

used UnsupportedOperationException for all implemented methods that a…

b2d091e

…re not supposed to be used

javanna closed this in fcf4d5a Jul 28, 2014

javanna mentioned this pull request Jul 28, 2014

Transport client: Don't add listed nodes to connected nodes list in sniff mode #7067

Closed

jpountz removed the review label Jul 29, 2014

javanna added the v1.3.3 label Aug 27, 2014

javanna mentioned this pull request May 4, 2015

Centralize admin implementations and action execution #10955

Merged

clintongormley added the :Core/Infra/Transport API Transport client API label Jun 7, 2015

clintongormley changed the title ~~Transport Client: fixed the node retry mechanism which could fail without trying all the connected nodes~~ Fixed the node retry mechanism which could fail without trying all the connected nodes Jun 7, 2015
